I wanted to write a post about a couple of handy functions in R that don’t always get the recognition they deserve. This article will talk about a few functions that form part of R’s core functional programming capabilities. R has thousands of functions, so this is just a short list, and I’ll probably write other articles like this in the future to discuss some different R functions.
Let’s start with the Reduce function (note the capital “R”). Reduce takes a list or vector as input, and reduces it down to a single element. It works by applying a function to the first two elements of the vector or list, and then applying the same function to that result with the third element. This new result gets passed with the fourth element into the function and so on until a single object remains. If the input is a vector, the result will be a single number or character. On the other hand, inputting a list can have interesting results. A list of data frames can be reduced down to a single data frame, a list of vectors can be collapsed into a matrix, and so on.
Here is a simple, though not especially useful, example of how this works:
test <- 1:10
result <- Reduce(sum, test)
Here, result will equal 55, the sum of the vector test (that is, the sum of the integers 1 through 10). Reduce computes this by first applying the sum function to 1 and 2 (the first two elements in test). This equals 3, which then gets summed with the next element in the vector, 3. This total of 6 gets added to 4, which equals 10, and so on. The process can be seen below.
1 + 2 = 3
3 + 3 = 6
6 + 4 = 10
10 + 5 = 15
15 + 6 = 21
21 + 7 = 28
28 + 8 = 36
36 + 9 = 45
45 + 10 = 55
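The step-by-step folding above can be sketched in code. The snippet below is only an illustration: fold_left is a hypothetical helper written for this example (it is not part of base R), and it mimics what Reduce does conceptually rather than how it is implemented. The accumulate argument of Reduce, on the other hand, is part of base R and returns every intermediate result.

```r
# Hand-rolled left fold: a sketch of what Reduce(f, x) does conceptually.
# fold_left is a hypothetical helper for illustration, not a base R function.
fold_left <- function(f, x) {
  acc <- x[[1]]                 # start with the first element
  for (element in x[-1]) {      # combine it with each remaining element in turn
    acc <- f(acc, element)
  }
  acc
}

test <- 1:10
manual_total <- fold_left(`+`, test)

# accumulate = TRUE keeps every intermediate result instead of only the last one
running_totals <- Reduce(`+`, test, accumulate = TRUE)
print(running_totals)
# 1 3 6 10 15 21 28 36 45 55
```

The last element of running_totals is the same value Reduce returns without accumulate.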
Now, how about something a little more useful? What if you had a list of vectors and you wanted to combine them into a matrix?
test <- list(1:3, 4:6, 7:9, 10:12, 13:15, 16:18)
matrix_result <- Reduce(rbind, test)
In this case, we have a list of six three-element vectors. Reduce applies rbind to the first two vectors, 1:3 and 4:6 initially. This creates a 2 x 3 matrix, where the first row is 1:3, and the second row is 4:6.
1 2 3
4 5 6
Then, the above result is combined (via rbind) to the next vector in the list, 7:9.
1 2 3
4 5 6
7 8 9
This process continues, as you can see below:
1 2 3
4 5 6
7 8 9
10 11 12
Next:
1 2 3
4 5 6
7 8 9
10 11 12
13 14 15
Finally:
1 2 3
4 5 6
7 8 9
10 11 12
13 14 15
16 17 18
Thus, the final result is a single object: in this case, a 6 x 3 matrix, because rbind stacked all of the vectors in the list, test, into one matrix.
Similarly, you could run this example using cbind instead of rbind and that would collapse the vectors column-wise, rather than row-wise.
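As a quick sketch of the difference, the snippet below builds both versions and checks their dimensions. The do.call line is an alternative I'm adding for comparison: it hands all six vectors to rbind in a single call rather than folding them pairwise, and for a plain list of vectors like this one it should produce the same values.

```r
test <- list(1:3, 4:6, 7:9, 10:12, 13:15, 16:18)

by_row <- Reduce(rbind, test)   # each vector becomes a row
by_col <- Reduce(cbind, test)   # each vector becomes a column

dim(by_row)  # 6 3
dim(by_col)  # 3 6

# Alternative: pass all six vectors to rbind in one call
by_row_once <- do.call(rbind, test)
dim(by_row_once)  # 6 3
```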
Another example where Reduce comes in handy might be if you want to combine a collection of data frames into a single one.
state_data <- list(
  FL = data.frame(state = c("FL","FL","FL"), city = c("Miami","Jacksonville","Saint Augustine")),
  NY = data.frame(state = c("NY","NY","NY"), city = c("NYC","Buffalo","Rochester")),
  MD = data.frame(state = c("MD","MD","MD"), city = c("Baltimore","Annapolis","Ocean City"))
)
# Stack the three data frames row-wise into a single data frame
combined <- data.frame(Reduce(rbind, state_data))
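To check what comes out of this, here is a self-contained sketch that rebuilds the same list and inspects the result. It assumes nothing beyond base R. Note that Reduce(rbind, ...) on data frames already returns a data frame, so the sketch omits the outer data.frame() call.

```r
state_data <- list(
  FL = data.frame(state = c("FL","FL","FL"), city = c("Miami","Jacksonville","Saint Augustine")),
  NY = data.frame(state = c("NY","NY","NY"), city = c("NYC","Buffalo","Rochester")),
  MD = data.frame(state = c("MD","MD","MD"), city = c("Baltimore","Annapolis","Ocean City"))
)

combined <- Reduce(rbind, state_data)

is.data.frame(combined)  # TRUE
nrow(combined)           # 9 (three rows from each of the three states)
names(combined)          # "state" "city"
```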
The Filter function does basically what it sounds like: it applies a filter to a vector, list, or data frame (which is actually a type of list). It takes two main inputs: a function that performs the test, and the object to be filtered.
Here’s a simple example:
test <- 1:10
less_than_5 <- Filter(function(x) x < 5, test)
This, once again, creates a vector of the first 10 positive integers. The Filter function applies function(x) x < 5 to each element, x, in the vector, test. In other words, it checks each element, x, for the Boolean expression, x < 5. If an element is not less than 5, it gets filtered out.
So you might be thinking…can’t this be done like this?
less_than_5 <- test[test < 5]
…and the answer is…yes. It can be done that way. Filter is more useful as a function in cases involving data frames or lists. Suppose, for instance, you want to remove all constant columns from a data frame. This is something that may be done when preprocessing data prior to modeling, as a constant attribute isn’t particularly useful.
This can be done in one line using Filter:
df <- data.frame(a = c(2,2,2), b = c(1,2,3), c = c(1,1,1), d = c(3,4,5))
without_constants <- Filter(function(x) length(unique(x)) > 1, df)
Alternatively, using dplyr’s n_distinct function, which counts the number of distinct elements in a vector, you could do this:
library(dplyr)
df <- data.frame(a = c(2,2,2), b = c(1,2,3), c = c(1,1,1), d = c(3,4,5))
without_constants <- Filter(function(x) n_distinct(x) > 1, df)
In the example, we create a data frame with four columns — two of them are constant. Filter tests whether there is more than one unique value in each column. If there is only one unique value, then we know the column is constant, and it gets filtered out. Each element x is a vector, or column, in the data frame.
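To make the column-by-column test visible, here is a rough equivalent written with vapply and ordinary indexing. This is only a sketch of the idea, not a claim about how Filter is implemented internally. The keep vector holds one TRUE/FALSE per column.

```r
df <- data.frame(a = c(2,2,2), b = c(1,2,3), c = c(1,1,1), d = c(3,4,5))

# One TRUE/FALSE per column: does the column contain more than one distinct value?
keep <- vapply(df, function(x) length(unique(x)) > 1, logical(1))
print(keep)

# Indexing by that logical vector keeps the same columns Filter keeps
without_constants <- df[keep]
names(without_constants)  # "b" "d"
```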
If you wanted to just drop all columns that are all NAs, you could make a minor tweak like this:
df <- data.frame(a = c(2,2,2), b = c(1,2,3), c = c(1,1,1), d = c(NA, NA, NA))
without_nas <- Filter(function(x) !all(is.na(x)), df)
Filter can also be used on a regular list. Suppose you have a list of vectors, where some of the vectors are characters, while others are numeric. If you want to filter out all of the non-numeric vectors, you could call Filter:
sample_list <- list(a = c(1,2,3), b = c("is","a","character"), c = c(4,5,6), d = c("is","another","character"))
only_numeric <- Filter(function(x) is.numeric(x), sample_list)
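As a small variation, the predicate can be passed to Filter directly without wrapping it in an anonymous function, and base R's Negate flips a predicate if you want the opposite selection. The sketch below reuses the same sample list.

```r
sample_list <- list(a = c(1,2,3), b = c("is","a","character"),
                    c = c(4,5,6), d = c("is","another","character"))

# is.numeric already takes one argument, so it can be passed as-is
only_numeric <- Filter(is.numeric, sample_list)
names(only_numeric)    # "a" "c"

# Negate() returns a function that gives the opposite TRUE/FALSE answer
only_character <- Filter(Negate(is.numeric), sample_list)
names(only_character)  # "b" "d"
```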
The rapply function is part of the apply family of functions in R. It has a few different uses, but one of my favorite applications for it is to apply a function to columns of a data frame that belong to a specific class, or have a particular data type.
Let’s say you want to get the sum of all of the numeric columns.
df <- data.frame(a = c(2,2,2), b = c(1,2,3), c = c("r","is","awesome"), d = c(3,4,5), e = c("some","other","character"))
summed_columns <- rapply(df, sum, classes = "numeric")
Similar to sapply or lapply, rapply takes a list / vector / data frame as input, along with a function to be applied. However, it can also take a “classes” parameter, which allows us to specify what class of object we want our function to be used for.
rapply can also be used to recursively apply functions to nested lists (see the examples in its documentation, ?rapply).
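For comparison, the same column sums can be reached by combining Filter with sapply. The sketch below puts the two side by side; it is meant only to show that the class restriction acts like a filter on columns, and it is not the only way to do this.

```r
df <- data.frame(a = c(2,2,2), b = c(1,2,3), c = c("r","is","awesome"),
                 d = c(3,4,5), e = c("some","other","character"))

# Apply sum only to columns whose class is "numeric"
via_rapply <- rapply(df, sum, classes = "numeric")

# Keep the numeric columns first, then sum each one
via_filter <- sapply(Filter(is.numeric, df), sum)

print(via_rapply)
print(via_filter)
```

Both approaches skip the character columns c and e and sum columns a, b, and d.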
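Here is a minimal sketch of the nested-list case. The nested list is made up for illustration. The how argument is part of rapply: "replace" keeps the original structure and swaps in the transformed leaves, while "unlist" flattens the results into a vector.

```r
# A made-up nested list: numeric leaves at two depths, plus one character leaf
nested <- list(a = c(1, 2, 3), b = list(c = c(2.5, 3.5), d = "text"))

# Double every numeric leaf, leaving the structure and other leaves untouched
doubled <- rapply(nested, function(x) x * 2, classes = "numeric", how = "replace")
doubled$a    # 2 4 6
doubled$b$c  # 5 7
doubled$b$d  # "text"

# Sum every numeric leaf and flatten the results into a vector
totals <- rapply(nested, sum, classes = "numeric", how = "unlist")
print(totals)
```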
The last function I want to mention for this post is the rep function. This can be used to repeat a value as many times as you want. So if you want to create a vector of 1000 5’s, it could be done like this:
rep(5, 1000)
Here are a couple of other examples:
rep("a", 500)
rep("repeat this", 100)
If you pass a vector with more than one element to rep, the entire vector gets repeated the number of times you specify.
rep(c(1,2,3), 100)
The above code will create a vector with 300 elements — the number of elements in c(1,2,3) times 100, repeating 1, 2, 3 over and over.
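rep also has a few arguments that change how a vector gets repeated. The sketch below shows each, times, and length.out, all of which are documented arguments of base R's rep. The variable names are just for illustration.

```r
# each: repeat every element in place before moving to the next one
each_twice <- rep(c(1, 2, 3), each = 2)            # 1 1 2 2 3 3

# times as a vector: a separate repeat count for each element
varying <- rep(c(1, 2, 3), times = c(3, 2, 1))     # 1 1 1 2 2 3

# length.out: keep cycling through the vector until the result has this length
cut_off <- rep(c(1, 2, 3), length.out = 7)         # 1 2 3 1 2 3 1
```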
That’s it for now! Check out other R posts of mine here: http://theautomatic.net/category/r/