rvest Archives - Open Source Automation

23 Jul 2019 by Andrew Treadway

BeautifulSoup vs. Rvest

This post will compare Python's BeautifulSoup package to R's rvest package for web scraping. We'll also talk about additional functionality in rvest (that doesn't exist in BeautifulSoup) in comparison to a couple of other Python packages (including pandas and RoboBrowser).

Getting started

BeautifulSoup and rvest both involve creating an object that we can use to parse the HTML from a webpage. However, one immediate difference is that BeautifulSoup is just a web parser, so it doesn't connect to webpages. rvest, on the other hand, can connect to a webpage and scrape / parse its HTML in a single package. In BeautifulSoup, our initial setup looks like this:

[code lang="python"]
# load packages
from bs4 import BeautifulSoup
import requests

# connect to webpage
resp = requests.get("https://www.azlyrics.com/b/beatles.html")

# get BeautifulSoup object
soup…

29 Jan 2019 by Andrew Treadway

Creating a word cloud on R-bloggers posts

R, Web Scraping

This post will go through how to create a word cloud of article titles scraped from the awesome R-bloggers. Our goal will be to use R's rvest package to search through 50 successive pages on the site for article titles. The stringr and tm packages will be used for string cleaning, and tm will also be used to create a term-document matrix. We will then create a word cloud based on the words comprising these titles.

First, we'll load the packages we need.

[code lang="R"]
# load packages
library(rvest)
library(stringr)
library(tm)
library(wordcloud)
[/code]

Let's write a function that will take a webpage as input and return all the scraped article titles.

[code lang="R"]
scrape_post_titles <- function(site) {

    # scrape HTML from input site
    source_html <- read_html(site)

    # grab the title attributes…
