This post will compare Python’s BeautifulSoup package to R’s rvest package for web scraping. We’ll also look at functionality rvest offers that BeautifulSoup doesn’t, comparing it to a couple of other Python packages (including pandas and RoboBrowser).
BeautifulSoup and rvest both involve creating an object that we can use to parse the HTML from a webpage. However, one immediate difference is that BeautifulSoup is just a web parser, so it doesn’t connect to webpages. rvest, on the other hand, can connect to a webpage and scrape / parse its HTML in a single package.
In BeautifulSoup, our initial setup looks like this:
# load packages
from bs4 import BeautifulSoup
import requests

# connect to webpage
resp = requests.get("https://www.azlyrics.com/b/beatles.html")

# get BeautifulSoup object (specifying the parser explicitly avoids a warning)
soup = BeautifulSoup(resp.content, "html.parser")
In comparison, here’s what using rvest is like:
# load rvest package
library(rvest)

# get HTML object
html_data <- read_html("https://www.azlyrics.com/b/beatles.html")
Next, let’s take our parser objects and find all the links on the page.
If you’re familiar with BeautifulSoup, that would look like:
links = soup.find_all("a")
In rvest, however, we use the %>% pipe operator, giving us syntax similar to dplyr and other tidyverse packages.
links <- html_data %>% html_nodes("a")
urls <- links %>% html_attr("href")
In BeautifulSoup, we use the find_all method to extract a list of all of a specific tag’s objects from a webpage. Thus, in the links example, we specify we want to get all of the anchor tags (or “a” tags), which create HTML links on the page. If we wanted to scrape other types of tags, such as div tags or p tags, we just need to switch out “a” with “div”, “p”, or whatever tag we want.
# get all div tags
soup.find_all("div")

# get all h1 tags
soup.find_all("h1")
With rvest, we can get specific tags from HTML using html_nodes. Thus, if we wanted to scrape different tags, such as the div tags or h1 tags, we could do this:
# scrape all div tags
html_data %>% html_nodes("div")

# scrape header h1 tags
html_data %>% html_nodes("h1")
In BeautifulSoup, we get attributes from HTML tags using the get method. We can use a list comprehension to get the href attribute of each link (the href attribute of a link is its destination URL).
urls = [link.get("href") for link in links]
To get other attributes, we just need to change our input to the get method.
# get the target attribute from each link
[link.get("target") for link in links]

# get the ID attribute of each div tag
[div.get("id") for div in soup.find_all("div")]
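One point worth noting about the get method: if a tag doesn’t have the requested attribute, get returns None rather than raising an error. A minimal, self-contained sketch (parsing an inline HTML snippet here instead of a live page, purely for illustration):

```python
from bs4 import BeautifulSoup

# a small inline snippet stands in for a scraped page (illustrative only)
html = '<a href="/a.html">A</a><a href="/b.html" target="_blank">B</a>'
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

# get returns None when a tag lacks the attribute,
# so missing values are easy to spot and filter out
targets = [link.get("target") for link in links]
print(targets)  # [None, '_blank']
```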
Using rvest, the html_attr function can be used to get attributes from tags. So to get the URL of each link object we scrape, we need to specify that we want to get the href attribute from each link, similarly to BeautifulSoup:
urls <- links %>% html_attr("href")
Likewise, if we want to scrape the IDs from the div tags, we can do this:
html_data %>% html_nodes("div") %>% html_attr("id")
Notice how, like other tidyverse packages, we can chain together multiple operations.
If we want to scrape the text from each of the links, we can use html_text:
links %>% html_text()
BeautifulSoup’s way of accomplishing this is to use the text attribute of each tag object:
[link.text for link in links]
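One difference to be aware of: the text attribute keeps any surrounding whitespace, while BeautifulSoup’s get_text method can trim it, much like the trim = TRUE argument to rvest’s html_text. A small sketch using an inline snippet (illustrative only, not the live page):

```python
from bs4 import BeautifulSoup

# inline snippet with surrounding whitespace (illustrative only)
html = '<a href="/a.html">\n  Back to A  \n</a>'
link = BeautifulSoup(html, "html.parser").find("a")

# .text keeps the whitespace; get_text(strip=True) trims it,
# similar to html_text(trim = TRUE) in rvest
print(repr(link.text))            # '\n  Back to A  \n'
print(link.get_text(strip=True))  # Back to A
```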
Let’s look at another example for scraping HTML tables.
We can scrape HTML tables using rvest’s html_table function, which extracts every table found on the input webpage. The fill = TRUE argument specifies that rows with fewer than the maximum number of columns in a table should be padded with NAs. The tables are returned as a list of data frames.
city_data <- read_html("http://www.city-data.com/city/Florida.html")
city_data %>% html_table(fill = TRUE)
Scraping tables with BeautifulSoup into a data frame object is a bit different. One way to scrape tables with Python is to loop through the tr (row) or td (table data cell) tags. But the closest analog to rvest’s functionality here is pandas:
import pandas as pd

pd.read_html("http://www.city-data.com/city/Florida.html")
Like using html_table, this will return a list of data frames corresponding to the tables found on the webpage.
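To work with an individual table, we index into that list. A self-contained sketch using a tiny inline table rather than the live page (the city names and numbers below are made up for illustration):

```python
import pandas as pd
from io import StringIO

# a tiny inline table stands in for the scraped page (illustrative only)
html = """
<table>
  <tr><th>city</th><th>population</th></tr>
  <tr><td>Miami</td><td>442241</td></tr>
  <tr><td>Orlando</td><td>307573</td></tr>
</table>
"""

# read_html returns a list of DataFrames -- one per table found --
# so we index into the list to get an individual table
tables = pd.read_html(StringIO(html))
first_table = tables[0]
print(first_table.shape)  # (2, 2)
```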
An additional feature of rvest is that it can perform browser simulation. BeautifulSoup cannot do this; however, Python offers several alternatives, including requests_html and RoboBrowser.
With rvest, we can start a browser session using the html_session function:
site <- "https://www.azlyrics.com/b/beatles.html"
session <- html_session(site)
With our session object, we can navigate to different links on the page, just like a real web browser. There are a couple of ways of doing this. One is to pass the index of the link we want to follow. For example, to navigate to the third link on the page, we would write the below code:
session %>% follow_link(3)
Here are a couple more examples:
# navigate to the 5th link on the page
session %>% follow_link(5)

# navigate to the 10th link on the page
session %>% follow_link(10)
You can also simulate clicking on links based on their text. For example, the below code will navigate to the first link containing the text “Sun”. The input is case sensitive.
session %>% follow_link("Sun")
You can use the session object to navigate directly to other webpages using the jump_to function.
session %>% jump_to("https://www.azlyrics.com/a.html")
To roughly replicate the R code above using Python’s RoboBrowser package, we could write this:
from robobrowser import RoboBrowser

# create a RoboBrowser object
browser = RoboBrowser(history=True)

# navigate to webpage
browser.open("https://www.azlyrics.com/b/beatles.html")

# get links
links = browser.get_links()

# follow link
browser.follow_link(links[5])

# follow different link
browser.follow_link(links[10])
The open method used above is analogous to rvest’s jump_to function. RoboBrowser’s follow_link method serves a similar purpose to the rvest function of the same name, but behaves a little differently: it takes a link object as input, rather than the index of a link or text to search for. Thus, we can use the links object above to specify which link we want to “follow”, or click. If we want to click on a link based on its text, as in the rvest example above, we could write the below code.
# filter the list of links to only those containing "Sun" in their text
sun_links = filter(lambda link: "Sun" in link.text, links)

# click on the first link containing the word "Sun"
browser.follow_link(next(sun_links))
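One caveat with this pattern: next() raises StopIteration if no link matches the filter, so passing a default value is a safer habit. A stand-alone sketch using plain strings in place of link objects (no browser needed; the song titles are just illustrative):

```python
# plain strings stand in for link text here (no RoboBrowser required)
link_texts = ["Here Comes the Sun", "Let It Be", "Yesterday"]

# next() raises StopIteration when the filter comes up empty,
# so supplying a default of None avoids an unhandled exception
first_sun = next(filter(lambda t: "Sun" in t, link_texts), None)
first_moon = next(filter(lambda t: "Moon" in t, link_texts), None)

print(first_sun)   # Here Comes the Sun
print(first_moon)  # None
```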
That’s it for this article!