Creating a word cloud on R-bloggers posts

This post walks through how to create a word cloud of article titles scraped from the awesome R-bloggers. Our goal is to use R’s rvest package to scrape the article titles from 50 successive pages on the site. The stringr and tm packages will be used for string cleaning, and tm will also be used to create a term document matrix of word frequencies. We will then create a word cloud from the words comprising these titles.

First, we’ll load the packages we need.


# load packages
library(rvest)
library(stringr)
library(tm)
library(wordcloud)

Let’s write a function that will take a webpage as input and return all the scraped article titles.


scrape_post_titles <- function(site)
{
    # scrape HTML from input site
    source_html <- read_html(site)
    
    # grab the title attributes from link (anchor) tags within H2 header tags
    # (each pipe must end its line; a line starting with %>% is a syntax error)
    titles <- source_html %>%
        html_nodes("h2") %>%
        html_nodes("a") %>%
        html_attr("title")
    
    # filter out any titles that are NA (where no title was found)
    titles <- titles[!is.na(titles)]
    
    # parse out just the article title (removing the words "Permalink to ")
    titles <- gsub("Permalink to ", "", titles)

    # return vector of titles
    return(titles)
}

The above function takes one input, site, which is the URL of a specific webpage on R-bloggers. We use read_html (made available by rvest) to fetch and parse the HTML of that page. Next, we pull out the titles: we search through each H2 tag, find the “a” (anchor, or link) tags inside it, and read the title attribute from each of those tags.

The remaining code in the function cleans up the titles we parsed. First, we drop any titles that are NA, i.e. any link tags that did not have a title attribute (these are not post titles). At this point, each remaining title attribute contains the words “Permalink to ”. The gsub line simply removes this text from each title.


    # filter out any titles that are NA (where no title was found)
    titles <- titles[!is.na(titles)]
    
    # parse out just the article title (removing the words "Permalink to ")
    titles <- gsub("Permalink to ", "", titles)
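
To see what these selectors and cleanup steps do without hitting the network, here is a minimal sketch run on a hand-written HTML snippet. The markup below is made up for illustration and only assumes the structure described above (links with “Permalink to ” titles inside H2 tags); the real R-bloggers markup may differ.

```r
library(rvest)

# hypothetical markup, invented for illustration only
sample_html <- '<html><body>
  <h2><a href="/post-1" title="Permalink to First post">First post</a></h2>
  <h2><a href="/post-2" title="Permalink to Second post">Second post</a></h2>
  <h2><a href="/about">About</a></h2>
</body></html>'

# read_html also accepts a string of HTML, not just a URL
page <- read_html(sample_html)

# same selector chain as in scrape_post_titles
raw_titles <- page %>%
    html_nodes("h2") %>%
    html_nodes("a") %>%
    html_attr("title")

# the link without a title attribute comes back as NA and is dropped
titles <- raw_titles[!is.na(raw_titles)]
titles <- gsub("Permalink to ", "", titles)
print(titles)
```

The third link has no title attribute, so html_attr returns NA for it, which is why the NA filter is needed before the gsub step.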

Now, let’s get the vector of webpages we need to scrape. Each successive page containing article links has the following pattern:

“https://www.r-bloggers.com/page/index” where index is some positive integer.

https://www.r-bloggers.com/page/1
https://www.r-bloggers.com/page/2
https://www.r-bloggers.com/page/3
https://www.r-bloggers.com/page/4
…
…
…

Thus, we can just use the paste0 function to generate all the URLs we want.


root <- "https://www.r-bloggers.com/"

# get each webpage URL we need
all_pages <- c(root, paste0(root, "page/", 2:50))
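
Because paste0 is vectorized, the single call above expands 2:50 into 49 URLs, and the root URL stands in for the first page. A quick sanity check of the generated vector, using only base R, might look like this:

```r
root <- "https://www.r-bloggers.com/"

# paste0 recycles the fixed strings across every value in 2:50
all_pages <- c(root, paste0(root, "page/", 2:50))

# number of URLs generated (root plus pages 2 through 50)
print(length(all_pages))

# first few and last URL, to eyeball the pattern
print(head(all_pages, 3))
print(tail(all_pages, 1))
```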

Next, let’s scrape the post titles from each webpage using our scrape_post_titles function. Then, we’ll collapse the titles into a single vector.


# use our function to scrape the title of each post
all_titles <- lapply(all_pages, scrape_post_titles)

# collapse the titles into a vector
all_titles <- unlist(all_titles)
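
With 50 requests in a row, a single failed page would stop the whole lapply call. The sketch below is one possible way to make the loop more forgiving; it is not part of the original approach. The helper name scrape_safely and its delay argument are hypothetical, and the stand-in scraper is fake so that the example runs without network access.

```r
# hypothetical helper: apply a scraper to each page, pausing between
# requests and skipping pages that raise an error
scrape_safely <- function(pages, scraper, delay = 1) {
    results <- lapply(pages, function(page) {
        Sys.sleep(delay)  # pause so we do not hammer the site
        tryCatch(
            scraper(page),
            error = function(e) {
                warning("Skipping ", page, ": ", conditionMessage(e))
                character(0)  # contribute no titles for this page
            }
        )
    })
    unlist(results)
}

# fake scraper for illustration: fails on any URL ending in "2"
fake_scraper <- function(page) {
    if (grepl("2$", page)) stop("could not read page")
    paste("title from", page)
}

demo_pages <- paste0("https://example.com/page/", 1:3)
demo_titles <- suppressWarnings(scrape_safely(demo_pages, fake_scraper, delay = 0))
print(demo_titles)
```

In real use, the scraper argument would be scrape_post_titles and pages would be all_pages.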

After we have the titles scraped, we need to perform some cleaning operations, such as converting each title to lowercase, and getting rid of numbers, punctuation, and stop words.


## Clean up the titles vector
#############################

# convert all titles to lowercase (start from the scraped all_titles vector)
cleaned <- tolower(all_titles)

# remove any numbers from the titles
cleaned <- removeNumbers(cleaned)

# remove English stopwords
cleaned <- removeWords(cleaned, stopwords("en"))

# remove punctuation
cleaned <- removePunctuation(cleaned)

# remove spaces at the beginning and end of each title
cleaned <- str_trim(cleaned)
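
To make the effect of each step concrete, here is a small sketch that runs the same sequence on two made-up titles (the titles and the toy variable names are invented for illustration). It also shows that removing words leaves extra spaces inside a title; stringr’s str_squish is one way to collapse them, although the tokenization done later does not depend on it.

```r
library(tm)
library(stringr)

# made-up titles, for illustration only
toy_titles <- c("10 Tips for the Tidyverse!", "Plotting Models in R: Part 2")

cleaned_toy <- tolower(toy_titles)                         # lowercase
cleaned_toy <- removeNumbers(cleaned_toy)                  # drop digits
cleaned_toy <- removeWords(cleaned_toy, stopwords("en"))   # drop stop words
cleaned_toy <- removePunctuation(cleaned_toy)              # drop punctuation
cleaned_toy <- str_trim(cleaned_toy)                       # trim both ends

# interior gaps left behind by the removed words are still there
print(cleaned_toy)

# optional: collapse repeated interior whitespace
squished <- str_squish(cleaned_toy)
print(squished)
```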

Next, we use the tm package to convert our cleaned vector of titles to a corpus. On the next line, we stem each word in the titles to get the root of each word (e.g. model, models, and modeling will each count as the same word, model).


# convert vector of titles to a corpus
cleaned_corpus <- Corpus(VectorSource(cleaned))

# stem each word in each title (stemDocument needs the SnowballC package installed)
cleaned_corpus <- tm_map(cleaned_corpus, stemDocument)
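
tm’s stemDocument relies on the SnowballC package for the actual stemming, so the example from the text can be checked directly with SnowballC’s wordStem function. This is a small sketch on a hand-picked word list, not on the scraped titles.

```r
library(SnowballC)

# the three variations mentioned in the text
words <- c("model", "models", "modeling")

# stem each word with the English Snowball stemmer
stems <- wordStem(words, language = "english")
print(stems)

# all three variations collapse to a single stem
print(unique(stems))
```

Keep in mind that a stem is not always a dictionary word, so some entries in the final word cloud may look truncated.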

With the cleaned corpus, we can build a term document matrix, which has one row per word and one column per title. Summing each row gives the number of times each word occurs across all titles.


doc_object <- TermDocumentMatrix(cleaned_corpus)
doc_matrix <- as.matrix(doc_object)

# get counts of each word
counts <- sort(rowSums(doc_matrix),decreasing=TRUE)

# filter out any words that contain non-letters
counts <- counts[grepl("^[a-z]+$", names(counts))]

# create data frame from word frequency info
frame_counts <- data.frame(word = names(counts), freq = counts)
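
The counting step is easier to follow on a tiny example. The sketch below builds a term document matrix from three made-up titles (the toy variable names are invented for illustration). One detail worth knowing: by default, TermDocumentMatrix only keeps terms that are at least three characters long, so a one-letter term such as “r” is dropped unless the wordLengths control option is changed.

```r
library(tm)

# made-up, already-cleaned titles
toy <- c("data model", "data plot", "r data")
toy_corpus <- Corpus(VectorSource(toy))

# default settings: terms shorter than 3 characters are dropped
toy_tdm <- TermDocumentMatrix(toy_corpus)
toy_counts <- sort(rowSums(as.matrix(toy_tdm)), decreasing = TRUE)
print(toy_counts)

# keep short terms too by lowering the minimum word length
toy_tdm_all <- TermDocumentMatrix(toy_corpus,
                                  control = list(wordLengths = c(1, Inf)))
print(rownames(as.matrix(toy_tdm_all)))
```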

Lastly, we use the wordcloud package to generate a word cloud based off the words across all the titles.


set.seed(1000)
wordcloud(words = frame_counts$word, freq = frame_counts$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.2, 
          colors=brewer.pal(8, "Dark2"))

In the resulting word cloud, “data” is the most popular word. Variations of “model”, “analysis”, and “package” are also popular.

That’s it for this post!

Andrew Treadway