Shineyrock

like 2 23
Modern Web Scraping With BeautifulSoup and Selenium
1. March 2018
2. Daily
3. read 389
Overview
HTML is almost intuitive. CSS is a great advancement that cleanly separates the structure of a page from its look and feel. JavaScript adds some pizazz. That's the theory. The real world is a little different.
In this tutorial, you'll learn how the content you see in the browser actually gets rendered and how to go about scraping it when necessary. In particular, you'll learn how to count Disqus comments. Our tools will be Python and awesome packages like requests, BeautifulSoup, and Selenium.
When Should You Use Web Scraping?
Web scraping is the practice of automatically fetching the content of web pages designed for interaction with human users, parsing them, and extracting some information (possibly navigating links to other pages). It is sometimes necessary if there is no other way to extract the necessary information. Ideally, the application provides a dedicated API for accessing its data programmatically. There are several reasons web scraping should be your last resort:
- It is fragile (the web pages you're scraping might change frequently).
- It might be forbidden (some web apps have policies against scraping).
- It might be slow and expansive (if you need to fetch and wade through a lot of noise).
Understanding Real-World Web Pages
Let's understand what we are up against, by looking at the output of some common web application code. In the article Introduction to Vagrant, there are some Disqus comments at the bottom of the page:
In order to scrape these comments, we need to find them on the page first.
View Page Source
Every browser since the dawn of time (the 1990s) has supported the ability to view the HTML of the current page. Here is a snippet from the view source of Introduction to Vagrant that starts with a huge chunk of minified and uglified JavaScript unrelated to the article itself. Here is a small portion of it:
Here is some actual HTML from the page:
This looks pretty messy, but what is surprising is that you will not find the Disqus comments in the source of the page.
The Mighty Inline Frame
It turns out that the page is a mashup, and the Disqus comments are embedded as an iframe (inline frame) element. You can find it out by right-clicking on the comments area, and you'll see that there is frame information and source there:
That makes sense. Embedding third-party content as an iframe is one of the primary reasons to use iframes. Let's find the <iframe> tag then in the main page source. Foiled again! There is no <iframe> tag in the main page source.
JavaScript-Generated Markup
The reason for this omission is that view page source shows you the content that was fetched from the server. But the final DOM (document object model) that gets rendered by the browser may be very different. JavaScript kicks in and can manipulate the DOM at will. The iframe can't be found, because it wasn't there when the page was retrieved from the server.
Static Scraping vs. Dynamic Scraping
Static scraping ignores JavaScript. It fetches web pages from the server without the help of a browser. You get exactly what you see in "view page source", and then you slice and dice it. If the content you're looking for is available, you need to go no further. However, if the content is something like the Disqus comments iframe, you need dynamic scraping.
Dynamic scraping uses an actual browser (or a headless browser) and lets JavaScript do its thing. Then, it queries the DOM to extract the content it's looking for. Sometimes you need to automate the browser by simulating a user to get the content you need.
Static Scraping With Requests and BeautifulSoup
Let's see how static scraping works using two awesome Python packages: requests for fetching web pages and BeautifulSoup for parsing HTML pages.
Installing Requests and BeautifulSoup
Install pipenv first, and then: pipenv install requests beautifulsoup4
This will create a virtual environment for you too. If you're using the code from gitlab, you can just pipenv install.
Fetching Pages
Fetching a page with requests is a one liner: r = requests.get(url)
The response object has a lot of attributes. The most important ones are ok and content. If the request fails then r.ok will be False and r.content will contain the error. The content is a stream of bytes. It is usually better to decode it to utf-8 when dealing with text:
```
>>> r = requests.get('http://www.c2.com/no-such-page')
>>> r.ok
False
>>> print(r.content.decode('utf-8'))
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /ggg was not found on this server.</p>
<hr>
<address>
Apache/2.0.52 (CentOS) Server at www.c2.com Port 80
</address>
</body></html>
```
If everything is OK then r.content will contain the requested web page (same as view page source).
Finding Elements With BeautifulSoup
The get_page() function below fetches a web page by URL, decodes it to UTF-8, and parses it into a BeautifulSoup object using the HTML parser.
```
def get_page(url):
    r = requests.get(url)
    content = r.content.decode('utf-8')
    return BeautifulSoup(content, 'html.parser')
```
Once we have a BeautifulSoup object, we can start extracting information from the page. BeautifulSoup provides many find functions to locate elements inside the page and drill down deep nested elements.
Tuts+ author pages contain multiple tutorials. Here is my author page. On each page, there are up to 12 tutorials. If you have more than 12 tutorials then you can navigate to the next page. The HTML for each article is enclosed in an <article> tag. The following function finds all the article elements on the page, drills down to their links, and extracts the href attribute to get the URL of the tutorial:
```
def get_page_articles(page):
    elements = page.findAll('article')
    articles = [e.a.attrs['href'] for e in elements]
    return articles
```
The following code gets all the articles from my page and prints them (without the common prefix):
```
page = get_page('https://tutsplus.com/authors/gigi-sayfan')
articles = get_page_articles(page)
prefix = 'https://code.tutsplus.com/tutorials'
for a in articles:
    print(a[len(prefix):])

Output:

building-games-with-python-3-and-pygame-part-5--cms-30085
building-games-with-python-3-and-pygame-part-4--cms-30084
building-games-with-python-3-and-pygame-part-3--cms-30083
building-games-with-python-3-and-pygame-part-2--cms-30082
building-games-with-python-3-and-pygame-part-1--cms-30081
mastering-the-react-lifecycle-methods--cms-29849
testing-data-intensive-code-with-go-part-5--cms-29852
testing-data-intensive-code-with-go-part-4--cms-29851
testing-data-intensive-code-with-go-part-3--cms-29850
testing-data-intensive-code-with-go-part-2--cms-29848
testing-data-intensive-code-with-go-part-1--cms-29847
make-your-go-programs-lightning-fast-with-profiling--cms-29809
```
Dynamic Scraping With Selenium
Static scraping was good enough to get the list of articles, but as we saw earlier, the Disqus comments are embedded as an iframe element by JavaScript. In order to harvest the comments, we will need to automate the browser and interact with the DOM interactively. One of the best tools for the job is Selenium.
Selenium is primarily geared towards automated testing of web applications, but it is great as a general-purpose browser automation tool.
Installing Selenium
Type this command to install Selenium: pipenv install selenium
Choose Your Web Driver
Selenium needs a web driver (the browser it automates). For web scraping, it usually doesn't matter which driver you choose. I prefer the Chrome driver. Follow the instructions in this Selenium guide.
Chrome vs. PhantomJS
In some cases you may prefer to use a headless browser, which means no UI is displayed. Theoretically, PhantomJS is just another web driver. But, in practice, people reported incompatibility issues where Selenium works properly with Chrome or Firefox and sometimes fails with PhantomJS. I prefer to remove this variable from the equation and use an actual browser web driver.
Counting Disqus Comments
Let's do some dynamic scraping and use Selenium to count Disqus comments on Tuts+ tutorials. Here are the necessary imports.
```
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import (
    presence_of_element_located)
from selenium.webdriver.support.wait import WebDriverWait
```
The get_comment_count() function accepts a Selenium driver and URL. It uses the get() method of the driver to fetch the URL. This is similar to requests.get(), but the difference is that the driver object manages a live representation of the DOM.
Then, it gets the title of the tutorial and locates the Disqus iframe using its parent id disqus_thread and then the iframe itself:
```
def get_comment_count(driver, url):
    driver.get(url)
    class_name = 'content-banner__title'
    name = driver.find_element_by_class_name(class_name).text
    e = driver.find_element_by_id('disqus_thread')
    disqus_iframe = e.find_element_by_tag_name('iframe')
    iframe_url = disqus_iframe.get_attribute('src')
```
The next step is to fetch the contents of the iframe itself. Note that we wait for the comment-count element to be present because the comments are loaded dynamically and not necessarily available yet.
```
    driver.get(iframe_url)
    wait = WebDriverWait(driver, 5)
    commentCountPresent = presence_of_element_located(
        (By.CLASS_NAME, 'comment-count'))
    wait.until(commentCountPresent)

    comment_count_span = driver.find_element_by_class_name(
        'comment-count')
    comment_count = int(comment_count_span.text.split()[0])
```
The last part is to return the last comment if it wasn't made by me. The idea is to detect comments I haven't responded to yet.
```
    last_comment = {}
    if comment_count > 0:
        e = driver.find_elements_by_class_name('author')[-1]
        last_author = e.find_element_by_tag_name('a')
        last_author = e.get_attribute('data-username')
        if last_author != 'the_gigi':
            e = driver.find_elements_by_class_name('post-meta')
            meta = e[-1].find_element_by_tag_name('a')
            last_comment = dict(
            	author=last_author,
                title=meta.get_attribute('title'),
                when=meta.text)
    return name, comment_count, last_comment
```
Conclusion
Web scraping is a useful practice when the information you need is accessible through a web application that doesn't provide an appropriate API. It takes some non-trivial work to extract data from modern web applications, but mature and well-designed tools like requests, BeautifulSoup, and Selenium make it worthwhile.
Additionally, don’t hesitate to see what we have available for sale and for study in the Envato Market, and don't hesitate to ask any questions and provide your valuable feedback using the feed below.
martijn broeders

founder/ strategic creative at shineyrock web design & consultancy
e-mail: .(JavaScript must be enabled to view this email address)
phone: 434 210 0245

Like
disLike

2

Add a comment...

Comments
Name*
Email*
location
URL
Comment{comment}
Remember my personal information Notify me of follow-up comments?

8 years ago Cathleen Wonderful, what a website it is! This…

8 years ago Sharyn Ѕimply a smiling ѵisitant heгe to…

8 years ago shelby What's up it's me, I am also visiting…

8 years ago Lottie I read this paragraph completely…

6 years ago earlene This text is invaluable. How can I find…

6 years ago amelie I needed to thank you for this great…

6 years ago noe vance Hi, for all time i used to check…

6 years ago brayden Pretty! This was an extremely wonderful…

6 years ago Peden Hello! I'm at work browsing your blog…

6 years ago cameron Very great post. I just stumbled upon…

7 years ago juliane mckibben WOW just what I was looking for. Came…

7 years ago carissa That is a good tip especially to those…

6 years ago carris This piece of writing presents clear…

6 years ago marissa Great write-up, I am normal visitor of…

6 years ago ena As a Newbie, I am always exploring…

6 years ago beth Thank you for another fantastic post.…

blog

Modern Web Scraping With BeautifulSoup and Selenium

Overview

When Should You Use Web Scraping?

Understanding Real-World Web Pages

View Page Source

The Mighty Inline Frame

JavaScript-Generated Markup

Static Scraping vs. Dynamic Scraping

Static Scraping With Requests and BeautifulSoup

Installing Requests and BeautifulSoup

Fetching Pages

Finding Elements With BeautifulSoup

Dynamic Scraping With Selenium

Installing Selenium

Choose Your Web Driver

Chrome vs. PhantomJS

Counting Disqus Comments

Conclusion

martijn broeders

By - category

RSS

By - date

December, 2023

October, 2022

September, 2022

August, 2022

April, 2022

January, 2022

December, 2021

November, 2021

July, 2021

June, 2021

May, 2021

April, 2021

March, 2021

February, 2021

May, 2020

April, 2020

March, 2020

September, 2017

February, 2020

January, 2020

December, 2019

September, 2019

July, 2019

April, 2019

March, 2019

February, 2019

January, 2019

December, 2018

November, 2018

October, 2018

September, 2018

August, 2018

July, 2018

June, 2018

May, 2018

April, 2018

March, 2018

February, 2018

January, 2018

December, 2017

November, 2017

October, 2017

September, 2017

August, 2017

July, 2017

June, 2017