Creating a word cloud on R-bloggers posts

Creating a word cloud on R-bloggers posts

R, Web Scraping
This post will go through how to create a word cloud of article titles scraped from the awesome R-bloggers. Our goal will be to use R's rvest package to search through 50 successive pages on the site for article titles. The stringr and tm packages will be used for string cleaning and for creating a term document frequency matrix (with tm). We will then create a word cloud based off the words comprising these titles. First, we'll load the packages we need. [code lang="R"] # load packages library(rvest) library(stringr) library(tm) library(wordcloud) [/code] Let's write a function that will take a webpage as input and return all the scraped article titles. [code lang="R"] scrape_post_titles <- function(site) { # scrape HTML from input site source_html <- read_html(site) # grab the title attributes…
Read More

How to change a file’s last modified date with R

File Manipulation, R
This relatively quick post goes through how to change a file's last modified date with base R. How to change a file's modified time with R Let's say we have a file, test.txt. What if we want to change the last modified date of the file (let's suppose the file's not that important)? Let's say, for instance, we want to make a file have a last modified date back in the 1980's. We can do that with one line of code. First, let's use file.info to check the current modified date of some file called test.txt. [code lang="R"] file.info("test.txt") [/code] We can see above by looking at mtime that this file was last modified December 4th, 2018. Now, we can use a function called Sys.setFileTime to change the modified date…
Read More
10 R functions for Linux commands and vice-versa

10 R functions for Linux commands and vice-versa

File Manipulation, R, System Administration
This post will go through 10 different Linux commands and their R alternatives. If you're interested in learning more R functions for working with files like some of those below, also check out this post. How to list all the files in a directory Linux R What does it do? ls list.files() Lists all the files in a directory ls -R list.files(recursive = TRUE) Recursively lists all the files in a directory and all sub-directories ls | grep "something" list.files(pattern = "something") Lists all the files in a directory containing the regex "something" R [code lang="R"] list.files("/path/to/directory") list.files("/path/to/do/directory", recursive = TRUE) # search for files containing "something" in their name list.files("/path/to/do/directory", pattern = "something") # search for all CSV files list.files("/path/to/do/directory", pattern = ".csv") [/code] Linux [code lang="bash"] ls /path/to/directory…
Read More
Those “other” apply functions…

Those “other” apply functions…

R
So you know lapply, sapply, and apply...but...what about rapply, vapply, or eapply? These are generally a little less known as far as the apply family of functions in R go, so this post will explore how they work. rapply Let's start with rapply. This function has a couple of different purposes. One is to recursively apply a function to a list. We'll get to that in a moment. The other use of rapply is to a apply a function to only those elements in a list (or columns in a data frame) that belong to a specified class. For example, let's say we have a data frame with a mix of categorical and numeric variables, but we want to evaluate a function only on the numeric variables. Use rapply to…
Read More
How to run R from the Task Scheduler

How to run R from the Task Scheduler

R, System Administration
In a prior post, we covered how to run Python from the Task Scheduler on Windows. This article is similar, but it'll show how to run R from the Task Scheduler, instead. Similar to before, let's first cover how to R from the command line, as knowing this is useful for running it from the Task Scheduler. Running R from the Command Line To open up the command prompt, just press the windows key and search for cmd. When R is installed, it comes with a utility called Rscript. This allows you to run R commands from the command line. If Rscript is in your PATH, then typing Rscript into the command line, and pressing enter, will not result in an error. Otherwise, you might get a message saying "'Rscript'…
Read More
How to build a logistic regression model from scratch in R

How to build a logistic regression model from scratch in R

R
In a previous post, we showed how using vectorization in R can vastly speed up fuzzy matching. Here, we will show you how to use R's vectorization functionality to efficiently build a logistic regression model. Now we could just use the caret or stats packages to create a model, but building algorithms from scratch is a great way to develop a better understanding of how they work under the hood. In writing the logistic regression algorithm from scratch, we will consider the following definitions and assumptions: x = A dxn matrix of d predictor variables, where each column xi represents the vector of predictors corresponding to one data point (with n such columns i.e. n data points) d = The number of predictor variables (i.e. the number of dimensions) n…
Read More
Getting data from PDFs the easy way with R

Getting data from PDFs the easy way with R

R
Earlier this year, a new package called tabulizer was released in R, which allows you to automatically pull out tables and text from PDFs. Note, this package only works if the PDF's text is highlightable (if it's typed) -- i.e. it won't work for scanned-in PDFs, or image files converted to PDFs. If you don't have tabulizer installed, just run install.packages("tabulizer") to get started. Initial Setup After you have tabulizer installed, we'll load it, and define a variable referencing an example PDF. [code lang="R"] library(tabulizer) site <- "http://www.sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf" [/code] The PDFs you manipulate with this package don't have to be located on your machine -- you can use tabulizer to reference a PDF by a URL. For our first example, we're going to use a sample PDF file found here:…
Read More
R: How to create, delete, move, and more with files

R: How to create, delete, move, and more with files

File Manipulation, R, System Administration
Though Python is usually thought of over R for doing system administration tasks, R is actually quite useful in this regard. In this post we're going to talk about using R to create, delete, move, and obtain information on files. How to get and change the current working directory Before working with files, it's usually a good idea to first know what directory you're working in. The working directory is the folder that any files you create or refer to without explicitly spelling out the full path fall within. In R, you can figure this out with the getwd function. To change this directory, you can use the aptly named setwd function. [code lang="R"] # get current working directory getwd() # set working directory setwd("C:/Users") [/code] Creating Files and Directories…
Read More
Underrated R Functions

Underrated R Functions

R
I wanted to write a post about a couple of handy functions in R that don't always get the recognition they deserve. This article will talk about a few functions that form part of R's core functional programming capabilities. R has thousands of functions, so this is just a short list, and I'll probably write other articles like this in the future to discuss some different R functions. Reduce Let's start with the Reduce function (note the capital "R"). Reduce takes a list or vector as input, and reduces it down to a single element. It works by applying a function to the first two elements of the vector or list, and then applying the same function to that result with the third element. This new result gets passed with…
Read More
Vectorize Fuzzy Matching

Vectorize Fuzzy Matching

R
One of the best things about R is its ability to vectorize code. This allows you to run code much faster than you would if you were using a for or while loop. In this post, we're going to show you how to use vectorization to speed up fuzzy matching. First, a little bit of background will be covered. If you're familiar with vectorization and / or fuzzy matching, feel free to skip further down the post. What is vectorization? Vectorization works by performing operations on entire vectors, or by extension, matrices, rather than iterating through each element in a collection of objects one at a time. A basic example is adding two vectors together. This can be done like this: [code lang="R"] a <- c(3, 4, 5) b <-…
Read More