HTML is almost intuitive. CSS is a great advancement that cleanly separates the structure of a page from its look and feel. JavaScript adds some pizazz. That's the theory. The real world is a little different.
In this tutorial, you'll learn how the content you see in the browser actually gets rendered and how to go about scraping it when necessary. In particular, you'll learn how to count Disqus comments. Our tools will be Python and awesome packages like requests, BeautifulSoup, and Selenium.
Web scraping is the practice of automatically fetching the content of web pages designed for interaction with human users, parsing them, and extracting some information (possibly navigating links to other pages). It is sometimes necessary if there is no other way to extract the necessary information. Ideally, the application provides a dedicated API for accessing its data programmatically. There are several reasons web scraping should be your last resort:
Let's understand what we are up against, by looking at the output of some common web application code. In the article Introduction to Vagrant, there are some Disqus comments at the bottom of the page:
In order to scrape these comments, we need to find them on the page first.
Every browser since the dawn of time (the 1990s) has supported the ability to view the HTML of the current page. Here is a snippet from the view source of Introduction to Vagrant that starts with a huge chunk of minified and uglified JavaScript unrelated to the article itself. Here is a small portion of it:
Here is some actual HTML from the page:
This looks pretty messy, but what is surprising is that you will not find the Disqus comments in the source of the page.
It turns out that the page is a mashup, and the Disqus comments are embedded as an iframe (inline frame) element. You can find it out by right-clicking on the comments area, and you'll see that there is frame information and source there:
That makes sense. Embedding third-party content as an iframe is one of the primary reasons to use iframes. Let's find the <iframe>
tag then in the main page source. Foiled again! There is no <iframe>
tag in the main page source.
The reason for this omission is that view page source
shows you the content that was fetched from the server. But the final DOM (document object model) that gets rendered by the browser may be very different. JavaScript kicks in and can manipulate the DOM at will. The iframe can't be found, because it wasn't there when the page was retrieved from the server.
Static scraping ignores JavaScript. It fetches web pages from the server without the help of a browser. You get exactly what you see in "view page source", and then you slice and dice it. If the content you're looking for is available, you need to go no further. However, if the content is something like the Disqus comments iframe, you need dynamic scraping.
Dynamic scraping uses an actual browser (or a headless browser) and lets JavaScript do its thing. Then, it queries the DOM to extract the content it's looking for. Sometimes you need to automate the browser by simulating a user to get the content you need.
Let's see how static scraping works using two awesome Python packages: requests for fetching web pages and BeautifulSoup for parsing HTML pages.
Install pipenv first, and then: pipenv install requests beautifulsoup4
This will create a virtual environment for you too. If you're using the code from gitlab, you can just pipenv install
.
Fetching a page with requests is a one liner: r = requests.get(url)
The response object has a lot of attributes. The most important ones are ok
and content
. If the request fails then r.ok
will be False and r.content
will contain the error. The content is a stream of bytes. It is usually better to decode it to utf-8 when dealing with text:
>>> r = requests.get('http://www.c2.com/no-such-page') >>> r.ok False >>> print(r.content.decode('utf-8')) <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <html><head> <title>404 Not Found</title> </head><body> <h1>Not Found</h1> <p>The requested URL /ggg was not found on this server.</p> <hr> <address> Apache/2.0.52 (CentOS) Server at www.c2.com Port 80 </address> </body></html>
If everything is OK then r.content
will contain the requested web page (same as view page source).
The get_page()
function below fetches a web page by URL, decodes it to UTF-8, and parses it into a BeautifulSoup object using the HTML parser.
def get_page(url): r = requests.get(url) content = r.content.decode('utf-8') return BeautifulSoup(content, 'html.parser')
Once we have a BeautifulSoup object, we can start extracting information from the page. BeautifulSoup provides many find functions to locate elements inside the page and drill down deep nested elements.
Tuts+ author pages contain multiple tutorials. Here is my author page. On each page, there are up to 12 tutorials. If you have more than 12 tutorials then you can navigate to the next page. The HTML for each article is enclosed in an <article>
tag. The following function finds all the article elements on the page, drills down to their links, and extracts the href attribute to get the URL of the tutorial:
def get_page_articles(page): elements = page.findAll('article') articles = [e.a.attrs['href'] for e in elements] return articles
The following code gets all the articles from my page and prints them (without the common prefix):
page = get_page('https://tutsplus.com/authors/gigi-sayfan') articles = get_page_articles(page) prefix = 'https://code.tutsplus.com/tutorials' for a in articles: print(a[len(prefix):]) Output: building-games-with-python-3-and-pygame-part-5--cms-30085 building-games-with-python-3-and-pygame-part-4--cms-30084 building-games-with-python-3-and-pygame-part-3--cms-30083 building-games-with-python-3-and-pygame-part-2--cms-30082 building-games-with-python-3-and-pygame-part-1--cms-30081 mastering-the-react-lifecycle-methods--cms-29849 testing-data-intensive-code-with-go-part-5--cms-29852 testing-data-intensive-code-with-go-part-4--cms-29851 testing-data-intensive-code-with-go-part-3--cms-29850 testing-data-intensive-code-with-go-part-2--cms-29848 testing-data-intensive-code-with-go-part-1--cms-29847 make-your-go-programs-lightning-fast-with-profiling--cms-29809
Static scraping was good enough to get the list of articles, but as we saw earlier, the Disqus comments are embedded as an iframe element by JavaScript. In order to harvest the comments, we will need to automate the browser and interact with the DOM interactively. One of the best tools for the job is Selenium.
Selenium is primarily geared towards automated testing of web applications, but it is great as a general-purpose browser automation tool.
Type this command to install Selenium: pipenv install selenium
Selenium needs a web driver (the browser it automates). For web scraping, it usually doesn't matter which driver you choose. I prefer the Chrome driver. Follow the instructions in this Selenium guide.
In some cases you may prefer to use a headless browser, which means no UI is displayed. Theoretically, PhantomJS is just another web driver. But, in practice, people reported incompatibility issues where Selenium works properly with Chrome or Firefox and sometimes fails with PhantomJS. I prefer to remove this variable from the equation and use an actual browser web driver.
Let's do some dynamic scraping and use Selenium to count Disqus comments on Tuts+ tutorials. Here are the necessary imports.
from selenium import webdriver from selenium.webdriver.common.by import By from selenium.webdriver.support.expected_conditions import ( presence_of_element_located) from selenium.webdriver.support.wait import WebDriverWait
The get_comment_count()
function accepts a Selenium driver and URL. It uses the get()
method of the driver to fetch the URL. This is similar to requests.get()
, but the difference is that the driver object manages a live representation of the DOM.
Then, it gets the title of the tutorial and locates the Disqus iframe using its parent id disqus_thread
and then the iframe itself:
def get_comment_count(driver, url): driver.get(url) class_name = 'content-banner__title' name = driver.find_element_by_class_name(class_name).text e = driver.find_element_by_id('disqus_thread') disqus_iframe = e.find_element_by_tag_name('iframe') iframe_url = disqus_iframe.get_attribute('src')
The next step is to fetch the contents of the iframe itself. Note that we wait for the comment-count
element to be present because the comments are loaded dynamically and not necessarily available yet.
driver.get(iframe_url) wait = WebDriverWait(driver, 5) commentCountPresent = presence_of_element_located( (By.CLASS_NAME, 'comment-count')) wait.until(commentCountPresent) comment_count_span = driver.find_element_by_class_name( 'comment-count') comment_count = int(comment_count_span.text.split()[0])
The last part is to return the last comment if it wasn't made by me. The idea is to detect comments I haven't responded to yet.
last_comment = {} if comment_count > 0: e = driver.find_elements_by_class_name('author')[-1] last_author = e.find_element_by_tag_name('a') last_author = e.get_attribute('data-username') if last_author != 'the_gigi': e = driver.find_elements_by_class_name('post-meta') meta = e[-1].find_element_by_tag_name('a') last_comment = dict( author=last_author, title=meta.get_attribute('title'), when=meta.text) return name, comment_count, last_comment
Web scraping is a useful practice when the information you need is accessible through a web application that doesn't provide an appropriate API. It takes some non-trivial work to extract data from modern web applications, but mature and well-designed tools like requests, BeautifulSoup, and Selenium make it worthwhile.
Additionally, don’t hesitate to see what we have available for sale and for study in the Envato Market, and don't hesitate to ask any questions and provide your valuable feedback using the feed below.
The Best Small Business Web Designs by DesignRush
/Create Modern Vue Apps Using Create-Vue and Vite
/How to Fix the “There Has Been a Critical Error in Your Website” Error in WordPress
How To Fix The “There Has Been A Critical Error in Your Website” Error in WordPress
/How Long Does It Take to Learn JavaScript?
/The Best Way to Deep Copy an Object in JavaScript
/Adding and Removing Elements From Arrays in JavaScript
/Create a JavaScript AJAX Post Request: With and Without jQuery
/5 Real-Life Uses for the JavaScript reduce() Method
/How to Enable or Disable a Button With JavaScript: jQuery vs. Vanilla
/How to Enable or Disable a Button With JavaScript: jQuery vs Vanilla
/Confirm Yes or No With JavaScript
/How to Change the URL in JavaScript: Redirecting
/15+ Best WordPress Twitter Widgets
/27 Best Tab and Accordion Widget Plugins for WordPress (Free & Premium)
/21 Best Tab and Accordion Widget Plugins for WordPress (Free & Premium)
/30 HTML Best Practices for Beginners
/31 Best WordPress Calendar Plugins and Widgets (With 5 Free Plugins)
/25 Ridiculously Impressive HTML5 Canvas Experiments
/How to Implement Email Verification for New Members
/How to Create a Simple Web-Based Chat Application
/30 Popular WordPress User Interface Elements
/Top 18 Best Practices for Writing Super Readable Code
/Best Affiliate WooCommerce Plugins Compared
/18 Best WordPress Star Rating Plugins
/10+ Best WordPress Twitter Widgets
/20+ Best WordPress Booking and Reservation Plugins
/Working With Tables in React: Part Two
/Best CSS Animations and Effects on CodeCanyon
/30 CSS Best Practices for Beginners
/How to Create a Custom WordPress Plugin From Scratch
/10 Best Responsive HTML5 Sliders for Images and Text… and 3 Free Options
/16 Best Tab and Accordion Widget Plugins for WordPress
/18 Best WordPress Membership Plugins and 5 Free Plugins
/25 Best WooCommerce Plugins for Products, Pricing, Payments and More
/10 Best WordPress Twitter Widgets
1 /12 Best Contact Form PHP Scripts for 2020
/20 Popular WordPress User Interface Elements
/10 Best WordPress Star Rating Plugins
/12 Best CSS Animations on CodeCanyon
/12 Best WordPress Booking and Reservation Plugins
/12 Elegant CSS Pricing Tables for Your Latest Web Project
/24 Best WordPress Form Plugins for 2020
/14 Best PHP Event Calendar and Booking Scripts
/Create a Blog for Each Category or Department in Your WooCommerce Store
/8 Best WordPress Booking and Reservation Plugins
/Best Exit Popups for WordPress Compared
/Best Exit Popups for WordPress Compared
/11 Best Tab & Accordion WordPress Widgets & Plugins
/12 Best Tab & Accordion WordPress Widgets & Plugins
1New Course: Practical React Fundamentals
/Preview Our New Course on Angular Material
/Build Your Own CAPTCHA and Contact Form in PHP
/Object-Oriented PHP With Classes and Objects
/Best Practices for ARIA Implementation
/Accessible Apps: Barriers to Access and Getting Started With Accessibility
/Dramatically Speed Up Your React Front-End App Using Lazy Loading
/15 Best Modern JavaScript Admin Templates for React, Angular, and Vue.js
/15 Best Modern JavaScript Admin Templates for React, Angular and Vue.js
/19 Best JavaScript Admin Templates for React, Angular, and Vue.js
/New Course: Build an App With JavaScript and the MEAN Stack
/Hands-on With ARIA: Accessibility Recipes for Web Apps
/10 Best WordPress Facebook Widgets
13 /Hands-on With ARIA: Accessibility for eCommerce
/New eBooks Available for Subscribers
/Hands-on With ARIA: Homepage Elements and Standard Navigation
/Site Accessibility: Getting Started With ARIA
/How Secure Are Your JavaScript Open-Source Dependencies?
/New Course: Secure Your WordPress Site With SSL
/Testing Components in React Using Jest and Enzyme
/Testing Components in React Using Jest: The Basics
/15 Best PHP Event Calendar and Booking Scripts
/Create Interactive Gradient Animations Using Granim.js
/How to Build Complex, Large-Scale Vue.js Apps With Vuex
1 /Examples of Dependency Injection in PHP With Symfony Components
/Set Up Routing in PHP Applications Using the Symfony Routing Component
1 /A Beginner’s Guide to Regular Expressions in JavaScript
/Introduction to Popmotion: Custom Animation Scrubber
/Introduction to Popmotion: Pointers and Physics
/New Course: Connect to a Database With Laravel’s Eloquent ORM
/How to Create a Custom Settings Panel in WooCommerce
/Building the DOM faster: speculative parsing, async, defer and preload
1 /20 Useful PHP Scripts Available on CodeCanyon
3 /How to Find and Fix Poor Page Load Times With Raygun
/Introduction to the Stimulus Framework
/Single-Page React Applications With the React-Router and React-Transition-Group Modules
12 Best Contact Form PHP Scripts
1 /Getting Started With the Mojs Animation Library: The ShapeSwirl and Stagger Modules
/Getting Started With the Mojs Animation Library: The Shape Module
Getting Started With the Mojs Animation Library: The HTML Module
/Project Management Considerations for Your WordPress Project
/8 Things That Make Jest the Best React Testing Framework
/Creating an Image Editor Using CamanJS: Layers, Blend Modes, and Events
/New Short Course: Code a Front-End App With GraphQL and React
/Creating an Image Editor Using CamanJS: Applying Basic Filters
/Creating an Image Editor Using CamanJS: Creating Custom Filters and Blend Modes
/Modern Web Scraping With BeautifulSoup and Selenium
/Challenge: Create a To-Do List in React
1Deploy PHP Web Applications Using Laravel Forge
/Getting Started With the Mojs Animation Library: The Burst Module
/10 Things Men Can Do to Support Women in Tech
/A Gentle Introduction to Higher-Order Components in React: Best Practices
/Challenge: Build a React Component
/A Gentle Introduction to HOC in React: Learn by Example
/A Gentle Introduction to Higher-Order Components in React
/Creating Pretty Popup Messages Using SweetAlert2
/Creating Stylish and Responsive Progress Bars Using ProgressBar.js
/18 Best Contact Form PHP Scripts for 2022
/How to Make a Real-Time Sports Application Using Node.js
/Creating a Blogging App Using Angular & MongoDB: Delete Post
/Set Up an OAuth2 Server Using Passport in Laravel
/Creating a Blogging App Using Angular & MongoDB: Edit Post
/Creating a Blogging App Using Angular & MongoDB: Add Post
/Introduction to Mocking in Python
/Creating a Blogging App Using Angular & MongoDB: Show Post
/Creating a Blogging App Using Angular & MongoDB: Home
/Creating a Blogging App Using Angular & MongoDB: Login
/Creating Your First Angular App: Implement Routing
/Persisted WordPress Admin Notices: Part 4
/Creating Your First Angular App: Components, Part 2
/Persisted WordPress Admin Notices: Part 3
/Creating Your First Angular App: Components, Part 1
/How Laravel Broadcasting Works
/Persisted WordPress Admin Notices: Part 2
/Create Your First Angular App: Storing and Accessing Data
/Persisted WordPress Admin Notices: Part 1
/Error and Performance Monitoring for Web & Mobile Apps Using Raygun
Using Luxon for Date and Time in JavaScript
7 /How to Create an Audio Oscillator With the Web Audio API
/How to Cache Using Redis in Django Applications
/20 Essential WordPress Utilities to Manage Your Site
/Introduction to API Calls With React and Axios
/Beginner’s Guide to Angular 4: HTTP
/Rapid Web Deployment for Laravel With GitHub, Linode, and RunCloud.io
/Beginners Guide to Angular 4: Routing
/Beginner’s Guide to Angular 4: Services
/Beginner’s Guide to Angular 4: Components
/Creating a Drop-Down Menu for Mobile Pages
/Introduction to Forms in Angular 4: Writing Custom Form Validators
/10 Best WordPress Booking & Reservation Plugins
/Getting Started With Redux: Connecting Redux With React
/Getting Started With Redux: Learn by Example
/Getting Started With Redux: Why Redux?
/How to Auto Update WordPress Salts
/How to Download Files in Python
/Eloquent Mutators and Accessors in Laravel
1 /10 Best HTML5 Sliders for Images and Text
/Site Authentication in Node.js: User Signup
/Creating a Task Manager App Using Ionic: Part 2
/Creating a Task Manager App Using Ionic: Part 1
/Introduction to Forms in Angular 4: Reactive Forms
/Introduction to Forms in Angular 4: Template-Driven Forms
/24 Essential WordPress Utilities to Manage Your Site
/25 Essential WordPress Utilities to Manage Your Site
/Get Rid of Bugs Quickly Using BugReplay
1 /Manipulating HTML5 Canvas Using Konva: Part 1, Getting Started
/10 Must-See Easy Digital Downloads Extensions for Your WordPress Site
22 Best WordPress Booking and Reservation Plugins
/Understanding ExpressJS Routing
/15 Best WordPress Star Rating Plugins
/Creating Your First Angular App: Basics
/Inheritance and Extending Objects With JavaScript
/Introduction to the CSS Grid Layout With Examples
1Performant Animations Using KUTE.js: Part 5, Easing Functions and Attributes
Performant Animations Using KUTE.js: Part 4, Animating Text
/Performant Animations Using KUTE.js: Part 3, Animating SVG
/New Course: Code a Quiz App With Vue.js
/Performant Animations Using KUTE.js: Part 2, Animating CSS Properties
Performant Animations Using KUTE.js: Part 1, Getting Started
/10 Best Responsive HTML5 Sliders for Images and Text (Plus 3 Free Options)
/Single-Page Applications With ngRoute and ngAnimate in AngularJS
/Deferring Tasks in Laravel Using Queues
/Site Authentication in Node.js: User Signup and Login
/Working With Tables in React, Part Two
/Working With Tables in React, Part One
/How to Set Up a Scalable, E-Commerce-Ready WordPress Site Using ClusterCS
/New Course on WordPress Conditional Tags
/TypeScript for Beginners, Part 5: Generics
/Building With Vue.js 2 and Firebase
6 /Best Unique Bootstrap JavaScript Plugins
/Essential JavaScript Libraries and Frameworks You Should Know About
/Vue.js Crash Course: Create a Simple Blog Using Vue.js
/Build a React App With a Laravel RESTful Back End: Part 1, Laravel 5.5 API
/API Authentication With Node.js
/Beginner’s Guide to Angular: HTTP
/Beginner’s Guide to Angular: Routing
/Beginners Guide to Angular: Routing
/Beginner’s Guide to Angular: Services
/Beginner’s Guide to Angular: Components
/How to Create a Custom Authentication Guard in Laravel
/Learn Computer Science With JavaScript: Part 3, Loops
/Build Web Applications Using Node.js
/Learn Computer Science With JavaScript: Part 4, Functions
/Learn Computer Science With JavaScript: Part 2, Conditionals
/Create Interactive Charts Using Plotly.js, Part 5: Pie and Gauge Charts
/Create Interactive Charts Using Plotly.js, Part 4: Bubble and Dot Charts
Create Interactive Charts Using Plotly.js, Part 3: Bar Charts
/Awesome JavaScript Libraries and Frameworks You Should Know About
/Create Interactive Charts Using Plotly.js, Part 2: Line Charts
/Bulk Import a CSV File Into MongoDB Using Mongoose With Node.js
/Build a To-Do API With Node, Express, and MongoDB
/Getting Started With End-to-End Testing in Angular Using Protractor
/TypeScript for Beginners, Part 4: Classes
/Object-Oriented Programming With JavaScript
/10 Best Affiliate WooCommerce Plugins Compared
/Stateful vs. Stateless Functional Components in React
/Make Your JavaScript Code Robust With Flow
/Build a To-Do API With Node and Restify
/Testing Components in Angular Using Jasmine: Part 2, Services
/Testing Components in Angular Using Jasmine: Part 1
/Creating a Blogging App Using React, Part 6: Tags
/React Crash Course for Beginners, Part 3
/React Crash Course for Beginners, Part 2
/React Crash Course for Beginners, Part 1
/Set Up a React Environment, Part 4
1 /Set Up a React Environment, Part 3
/New Course: Get Started With Phoenix
/Set Up a React Environment, Part 2
/Set Up a React Environment, Part 1
/Command Line Basics and Useful Tricks With the Terminal
/How to Create a Real-Time Feed Using Phoenix and React
/Build a React App With a Laravel Back End: Part 2, React
/Build a React App With a Laravel RESTful Back End: Part 1, Laravel 9 API
/Creating a Blogging App Using React, Part 5: Profile Page
/Pagination in CodeIgniter: The Complete Guide
/JavaScript-Based Animations Using Anime.js, Part 4: Callbacks, Easings, and SVG
/JavaScript-Based Animations Using Anime.js, Part 3: Values, Timeline, and Playback
/Learn to Code With JavaScript: Part 1, The Basics
/10 Elegant CSS Pricing Tables for Your Latest Web Project
/Getting Started With the Flux Architecture in React
/Getting Started With Matter.js: The Composites and Composite Modules
Getting Started With Matter.js: The Engine and World Modules
/10 More Popular HTML5 Projects for You to Use and Study
/Understand the Basics of Laravel Middleware
/Iterating Fast With Django & Heroku
/Creating a Blogging App Using React, Part 4: Update & Delete Posts
/Creating a jQuery Plugin for Long Shadow Design
/How to Register & Use Laravel Service Providers
2 /Unit Testing in React: Shallow vs. Static Testing
/Creating a Blogging App Using React, Part 3: Add & Display Post
/Creating a Blogging App Using React, Part 2: User Sign-Up
20 /Creating a Blogging App Using React, Part 1: User Sign-In
/Creating a Grocery List Manager Using Angular, Part 2: Managing Items
/9 Elegant CSS Pricing Tables for Your Latest Web Project
/Dynamic Page Templates in WordPress, Part 3
/Angular vs. React: 7 Key Features Compared
/Creating a Grocery List Manager Using Angular, Part 1: Add & Display Items
New eBooks Available for Subscribers in June 2017
/Create Interactive Charts Using Plotly.js, Part 1: Getting Started
/The 5 Best IDEs for WordPress Development (And Why)
/33 Popular WordPress User Interface Elements
/New Course: How to Hack Your Own App
/How to Install Yii on Windows or a Mac
/What Is a JavaScript Operator?
/How to Register and Use Laravel Service Providers
/
waly Good blog post. I absolutely love this…