
    Modern Web Scraping With BeautifulSoup and Selenium

    Overview

    HTML is almost intuitive. CSS is a great advancement that cleanly separates the structure of a page from its look and feel. JavaScript adds some pizazz. That's the theory. The real world is a little different.

    In this tutorial, you'll learn how the content you see in the browser actually gets rendered and how to go about scraping it when necessary. In particular, you'll learn how to count Disqus comments. Our tools will be Python and awesome packages like requests, BeautifulSoup, and Selenium.

    When Should You Use Web Scraping?

    Web scraping is the practice of automatically fetching the content of web pages designed for interaction with human users, parsing them, and extracting some information (possibly navigating links to other pages). It is sometimes necessary when there is no other way to extract the information you need. Ideally, the application would provide a dedicated API for accessing its data programmatically. There are several reasons web scraping should be your last resort:

    • It is fragile (the web pages you're scraping might change frequently).
    • It might be forbidden (some web apps have policies against scraping).
    • It might be slow and expensive (if you need to fetch and wade through a lot of noise).

    Understanding Real-World Web Pages

    Let's understand what we are up against by looking at the output of some common web application code. In the article Introduction to Vagrant, there are some Disqus comments at the bottom of the page:

    [Screenshot: the Disqus comments at the bottom of the Introduction to Vagrant article]

    In order to scrape these comments, we need to find them on the page first.

    View Page Source

    Every browser since the dawn of time (the 1990s) has supported the ability to view the HTML of the current page. The view source of Introduction to Vagrant starts with a huge chunk of minified and uglified JavaScript unrelated to the article itself. Here is a small portion of it:

    [Screenshot: page source, mostly minified JavaScript]

    Here is some actual HTML from the page:

    [Screenshot: actual HTML from the page]

    This looks pretty messy, but what is surprising is that you will not find the Disqus comments in the source of the page.

    The Mighty Inline Frame

    It turns out that the page is a mashup, and the Disqus comments are embedded as an iframe (inline frame) element. You can verify this by right-clicking on the comments area: the context menu offers frame information and the frame's source:

    [Screenshot: the frame options in the right-click context menu]

    That makes sense: embedding third-party content is one of the primary reasons to use iframes. Let's find the <iframe> tag in the main page source, then. Foiled again! There is no <iframe> tag in the main page source.

    JavaScript-Generated Markup

    The reason for this omission is that view page source shows you the content that was fetched from the server. But the final DOM (document object model) that gets rendered by the browser may be very different. JavaScript kicks in and can manipulate the DOM at will. The iframe can't be found, because it wasn't there when the page was retrieved from the server. 

    Static Scraping vs. Dynamic Scraping

    Static scraping ignores JavaScript. It fetches web pages from the server without the help of a browser. You get exactly what you see in "view page source", and then you slice and dice it. If the content you're looking for is available, you need to go no further. However, if the content is something like the Disqus comments iframe, you need dynamic scraping. 

    Dynamic scraping uses an actual browser (or a headless browser) and lets JavaScript do its thing. Then, it queries the DOM to extract the content it's looking for. Sometimes you need to automate the browser by simulating a user to get the content you need.

    Static Scraping With Requests and BeautifulSoup

    Let's see how static scraping works using two awesome Python packages: requests for fetching web pages and BeautifulSoup for parsing HTML pages.

    Installing Requests and BeautifulSoup

    Install pipenv first, and then:

        pipenv install requests beautifulsoup4

    This will create a virtual environment for you too. If you're using the code from GitLab, you can just run pipenv install.

    Fetching Pages

    Fetching a page with requests is a one-liner:

        r = requests.get(url)

    The response object has a lot of attributes. The most important ones are ok and content. If the request fails, then r.ok will be False and r.content will contain the error. The content is a stream of bytes. It is usually better to decode it to UTF-8 when dealing with text.
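    Here is a minimal sketch of that flow. The URL and the helper name fetch() are placeholders for illustration, not code from the original article:

        import requests

        def fetch(url):
            r = requests.get(url)
            if not r.ok:
                # r.content holds the error body returned by the server
                return None
            # Decode the raw bytes to text; UTF-8 is a sensible default for HTML
            return r.content.decode('utf-8')

        html = fetch('https://example.com')   # placeholder URL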

    If everything is OK then r.content will contain the requested web page (same as view page source).

    Finding Elements With BeautifulSoup

    The get_page() function below fetches a web page by URL, decodes it to UTF-8, and parses it into a BeautifulSoup object using the HTML parser.
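    The original listing is not reproduced here; based on that description, a plausible implementation (an approximation, not the author's exact code) looks like this, using BeautifulSoup's built-in html.parser backend:

        import requests
        from bs4 import BeautifulSoup

        def get_page(url):
            r = requests.get(url)
            content = r.content.decode('utf-8')
            return BeautifulSoup(content, 'html.parser')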

    Once we have a BeautifulSoup object, we can start extracting information from the page. BeautifulSoup provides many find functions to locate elements inside the page and drill down into deeply nested elements.

    Tuts+ author pages contain multiple tutorials. Here is my author page. On each page, there are up to 12 tutorials. If an author has more than 12 tutorials, you can navigate to the next page. The HTML for each article is enclosed in an <article> tag. The following function finds all the article elements on the page, drills down to their links, and extracts the href attribute to get the URL of the tutorial.
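    A sketch of such a function; the name get_tutorial_links() and the exact drill-down are assumptions based on the description above, while find_all() and find() are standard BeautifulSoup calls:

        def get_tutorial_links(page):
            # Find every <article> element, take its first link, and collect the href
            links = []
            for article in page.find_all('article'):
                a = article.find('a')
                if a is not None and a.get('href'):
                    links.append(a['href'])
            return links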

    The following code gets all the articles from my page and prints them (without the common prefix):
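    A sketch of that code; the author-page URL and the common prefix below are placeholders rather than values from the article:

        page = get_page('https://tutsplus.com/authors/some-author')   # placeholder author page
        links = get_tutorial_links(page)

        prefix = 'https://code.tutsplus.com/tutorials/'               # assumed common prefix
        for link in links:
            print(link[len(prefix):] if link.startswith(prefix) else link)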

    Dynamic Scraping With Selenium

    Static scraping was good enough to get the list of articles, but as we saw earlier, the Disqus comments are embedded as an iframe element by JavaScript. In order to harvest the comments, we will need to automate the browser and interact with the DOM directly. One of the best tools for the job is Selenium.

    Selenium is primarily geared towards automated testing of web applications, but it is great as a general-purpose browser automation tool.

    Installing Selenium

    Type this command to install Selenium:

        pipenv install selenium

    Choose Your Web Driver

    Selenium needs a web driver (the browser it automates). For web scraping, it usually doesn't matter which driver you choose. I prefer the Chrome driver. Follow the instructions in this Selenium guide.

    Chrome vs. PhantomJS

    In some cases you may prefer to use a headless browser, which means no UI is displayed. Theoretically, PhantomJS is just another web driver. But, in practice, people have reported incompatibility issues where Selenium works properly with Chrome or Firefox and sometimes fails with PhantomJS. I prefer to remove this variable from the equation and use an actual browser web driver.
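    If you do want a headless run with Chrome, a sketch with a recent Selenium release might look like this (the headless option name has changed between Chrome versions, so treat it as an example rather than a guarantee):

        from selenium import webdriver
        from selenium.webdriver.chrome.options import Options

        options = Options()
        options.add_argument('--headless=new')   # older Chrome versions use '--headless'
        driver = webdriver.Chrome(options=options)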

    Counting Disqus Comments

    Let's do some dynamic scraping and use Selenium to count Disqus comments on Tuts+ tutorials. Here are the necessary imports.
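    The original import list is not shown here; with a current Selenium release it would plausibly be:

        from selenium import webdriver
        from selenium.webdriver.common.by import By
        from selenium.webdriver.support.ui import WebDriverWait
        from selenium.webdriver.support import expected_conditions as EC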

    The get_comment_count() function accepts a Selenium driver and URL. It uses the get() method of the driver to fetch the URL. This is similar to requests.get(), but the difference is that the driver object manages a live representation of the DOM.

    Then, it gets the title of the tutorial and locates the Disqus iframe using its parent id disqus_thread and then the iframe itself:
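    A sketch of that first part; the id disqus_thread comes from the article, while the rest is an approximation of the described steps (the sketch continues below):

        def get_comment_count(driver, url):
            driver.get(url)
            title = driver.title

            # The Disqus widget is an <iframe> nested inside the element with id 'disqus_thread'
            thread = driver.find_element(By.ID, 'disqus_thread')
            disqus_iframe = thread.find_element(By.TAG_NAME, 'iframe')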

    The next step is to fetch the contents of the iframe itself. Note that we wait for the comment-count element to be present because the comments are loaded dynamically and not necessarily available yet.
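    Continuing the sketch, we switch into the frame and wait for the count to appear; the exact text format of the comment-count element is an assumption:

            # ...inside get_comment_count(): switch into the Disqus iframe
            driver.switch_to.frame(disqus_iframe)

            # The comments load dynamically, so wait until the count element exists
            count_element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CLASS_NAME, 'comment-count'))
            )
            comment_count = int(count_element.text.split()[0])   # assumes text like '7 Comments'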

    The last part is to return the last comment if it wasn't made by me. The idea is to detect comments I haven't responded to yet.
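    And the last part of the sketch; the 'author' class name and the display name being compared are assumptions about Disqus's markup, so verify them in the browser devtools before relying on this:

            # ...inside get_comment_count(): look at the last comment's author
            last_comment = {}
            authors = driver.find_elements(By.CLASS_NAME, 'author')
            if authors and authors[-1].text.strip() != 'My Display Name':   # hypothetical name
                last_comment = dict(author=authors[-1].text.strip())

            driver.switch_to.default_content()   # leave the iframe before fetching the next page
            return title, comment_count, last_comment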

    Conclusion

    Web scraping is a useful practice when the information you need is accessible through a web application that doesn't provide an appropriate API. It takes some non-trivial work to extract data from modern web applications, but mature and well-designed tools like requests, BeautifulSoup, and Selenium make it worthwhile.

    Additionally, don't hesitate to see what we have available for sale and for study in the Envato Market, and feel free to ask any questions and provide your valuable feedback using the feed below.
