
Every time you load a web page you're making a request to a server, and when you're just a human with a browser there's not a lot of damage you can do. A Python script, on the other hand, can execute thousands of requests a second if coded incorrectly, so you could end up costing the website owner a lot of money and possibly bringing down their site (see Denial-of-service attack (DoS)). With this in mind, we want to be very careful with how we program scrapers to avoid crashing sites and causing damage.

Every time we scrape a website we want to attempt to make only one request per page. We don't want to make a new request every time our parsing or other logic doesn't work out, so we should parse only after we've saved the page locally (a short sketch of this appears at the end of this section). If I'm just doing some quick tests, I'll usually start out in a Jupyter notebook because you can request a web page in one cell and have that web page available to every cell below it without making a new request. Since this article is available as a Jupyter notebook, you'll see how that works if you choose that format.

Before scraping, we should also check the site's robots.txt to see what we're allowed to request. The `User-agent` field is the name of a bot, and the rules that follow it are what that bot should follow. Some robots.txt files will have many User-agents with different rules. Common bots are googlebot, bingbot, and applebot, all of which you can probably guess the purpose and origin of. We don't really need to provide a User-agent when scraping, so `User-agent: *` is what we would follow: the `*` means that the rules apply to all bots (that's us).

`Crawl-delay` tells us the number of seconds to wait between requests; in the example below we need to wait 10 seconds before making another request. `Allow` gives us specific URLs we're allowed to request with bots, and vice versa for `Disallow`. In the example below we're allowed to request anything in the /pages/ subfolder, which means anything that starts with /pages/; on the other hand, we're disallowed from scraping anything from the /scripts/ subfolder. Many times you'll see a `*` next to Allow or Disallow, which means you are either allowed or not allowed to scrape everything on the site. Sometimes there will also be a rule that disallows all pages, followed by specific allowed pages.
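Putting those rules together, the kind of robots.txt being described would look something like this (an illustrative file, not copied from any particular site):

```
User-agent: *
Crawl-delay: 10
Allow: /pages/
Disallow: /scripts/
```

And the disallow-everything-then-allow-specific-pages pattern looks like this (the paths are placeholders):

```
User-agent: *
Disallow: /
Allow: /pages/
```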
Once we've requested a page and loaded its HTML into BeautifulSoup, we can start pulling data out of it. To find elements and data inside our HTML we'll be using `select_one`, which returns a single element, and `select`, which returns a list of elements (even if only one item exists). Both of these methods use CSS selectors to find elements, so if you're rusty on how CSS selectors work, here's a quick refresher:

- To get a tag, such as `a` or `body`, use the naked name for the tag. E.g. `select_one('a')` gets an anchor/link element and `select_one('body')` gets the body element.
- `.temp` gets an element with a class of temp.
- `#temp` gets an element with an id of temp.
- `.temp.example` gets an element with both classes temp and example.
- `.temp a` gets an anchor element nested inside of a parent element with class temp.
- `.temp .example` gets an element with class example nested inside of a parent element with class temp. The space tells the selector that the class after the space is a child of the class before the space.
- ids are unique, so you can usually use the id selector by itself to get the right element; there's no need for nested selectors when using ids.

These selectors get us pretty close to everything we would need for now. There are many more selectors for doing various tasks, like selecting certain child elements or specific links, that you can look up when needed.

For the news source names, `views-field` looks to be just a class each row is given for styling and doesn't provide any uniqueness, so it isn't the class we needed to select on; a more specific class on the element holding the name is enough on its own. Bear in mind that using `select` or `select_one` will give you the whole element with the tags included, so we need `.text` to give us the text between the tags. `.text` gets all of the text in that element, and since "ABC News" is the only text, that's all we need to do; notice that we didn't need to worry about selecting the anchor tag `a` that contains the text. `strip()` ensures all the whitespace surrounding the name is removed. Many websites use whitespace as a way to visually pad the text inside elements, so using `strip()` is always a good idea.

You'll notice that we can run BeautifulSoup methods right off one of the rows. That's because the rows become their own BeautifulSoup objects when we make a select from another BeautifulSoup object. On the other hand, our name variable is no longer a BeautifulSoup object, because we called `.text` on it, which returns a plain string.
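To see how those pieces fit together, here's a minimal sketch of the pattern using made-up HTML; the `source-row` and `source-title` class names are placeholders, not the real page's classes:

```python
from bs4 import BeautifulSoup

# Stand-in HTML; in the real scraper this would be the saved page source.
html = """
<table>
  <tr class="views-field source-row">
    <td class="source-title"><a href="/abc-news">  ABC News  </a></td>
  </tr>
  <tr class="views-field source-row">
    <td class="source-title"><a href="/cnn">  CNN  </a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# select() returns a list; each row is its own BeautifulSoup Tag,
# so we can call select_one() right off of it.
rows = soup.select('.source-row')

for row in rows:
    # select_one() returns the whole element, tags included, so we use .text
    # to get the text between the tags and strip() to drop padding whitespace.
    name = row.select_one('.source-title').text.strip()
    print(name)  # "ABC News", "CNN"
```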

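Tying back to the earlier point about making only one request per page, here's a minimal sketch of the fetch-once, parse-from-disk workflow; the URL and filename are placeholders:

```python
import os

import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com/some-page'  # placeholder URL
cache_file = 'page.html'                   # local copy of the page

# Only hit the server if we haven't already saved the page locally.
if not os.path.exists(cache_file):
    response = requests.get(url)
    with open(cache_file, 'w', encoding='utf-8') as f:
        f.write(response.text)

# All parsing runs against the saved copy, so re-running the parsing logic
# (or the cells below this one in a notebook) makes no new requests.
with open(cache_file, encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')
```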