Running R Code in Parallel

Background

Running R code in parallel can substantially speed up your code. Parallelization lets you run multiple processes simultaneously, rather than iterating over a list one element at a time or running a single process at a time. Thankfully, running R code in parallel is relatively simple using the parallel package. This package provides parallelized versions of sapply, lapply, and apply (parSapply, parLapply, and parApply).

Parallelizing code works best when you need to call a function or perform an operation on each element of a list or vector, and the result for any one element has no impact on the evaluation of any other element. This could be running a large number of models across different elements of a list, scraping data from many webpages, or a host of other activities.
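
As a minimal sketch of what independent elements look like, the example below applies a toy function, square (a hypothetical name used only for illustration), to each element of a short vector, first with sapply and then with parSapply on a cluster of two R processes. Squaring one number never depends on any other number, so the two versions should return the same values.

```r
library(parallel)

# hypothetical example function: the result for one element
# does not depend on any other element
square <- function(x) x^2

inputs <- 1:10

# sequential version
seq_results <- sapply(inputs, square)

# parallel version on a small cluster of 2 R processes
cl <- makeCluster(2)
par_results <- parSapply(cl, inputs, square)
stopCluster(cl)

# compare the two sets of results
all.equal(seq_results, par_results)
```

A computation such as a running total, where each step needs the previous result, would not fit this pattern.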

Testing for Primes in Parallel

In the example below, we’re going to use the parallel package to test whether each of 1 million integers is prime. If you were doing this without the parallel package, you might try to speed up the operation by using sapply (rather than a for loop). This is fine, but the drawback is that sapply can only test the numbers in the set one at a time. Using the parallelized version of sapply, called parSapply in the parallel package, we can test multiple numbers simultaneously for primality.



# load parallel package (library stops with an error if it is missing)
library(parallel)

# define function to test whether a number is prime
is_prime <- function(num)
{
    # numbers less than 2 are not prime
    if(num < 2)
      return(FALSE)
    # if input equals 2 or 3, then we know it's prime
    if(num == 2 || num == 3)
      return(TRUE)

    # else if num is greater than 2
    # and divisible by 2, then it can't be prime
    if(num %% 2 == 0)
      return(FALSE)

    # else use trial division to figure out
    # what factors, if any, input has

    # get square root of num, rounded down
    root <- floor(sqrt(num))

    # if root is below 3 (num is 5 or 7), there are
    # no odd candidate divisors, so num is prime
    if(root < 3)
      return(TRUE)

    # try to divide each odd number from 3 up to root
    # into num; if any leave a remainder of zero,
    # then we know num is not prime
    for(elt in seq(3, root, by = 2))
    {
        if(num %% elt == 0)
          return(FALSE)
    }
    # otherwise, num has no divisors except 1 and itself
    # thus, num must be prime
    return(TRUE)
}

# get random sample of 1 million integers from integers between 1 and 
# 10 million
# set seed so the random sample will be the same every time
set.seed(2)
sample_numbers <- sample(10000000, 1000000)


# do a couple checks of function
is_prime(17) # 17 is prime

is_prime(323) # 323 = 17 * 19; not prime

# create cluster object
cl <- makeCluster(3)

# test each number in sample_numbers for primality
results <- parSapply(cl , sample_numbers , is_prime)

# close cluster object
stopCluster(cl)



The main piece of the code above is this:


# create cluster object
cl <- makeCluster(3)

# test each number in sample_numbers for primality
results <- parSapply(cl , sample_numbers , is_prime)

# close cluster object
stopCluster(cl)

The makeCluster function creates a cluster of R engines to run code in parallel. In other words, calling makeCluster creates multiple instances of R. Passing the number 3 as input to this function means three separate instances of R will be created. If you’re running on Windows, you can see these instances by looking at running processes in the Task Manager.
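
One way to confirm that the workers really are separate R processes, on any operating system, is to ask each one for its process ID. The sketch below is an illustration rather than part of the original example; it uses clusterCall from the parallel package to run Sys.getpid on every worker and compares the result with the process ID of the main session.

```r
library(parallel)

cl <- makeCluster(3)

# ask every worker for its operating system process ID
worker_pids <- unlist(clusterCall(cl, Sys.getpid))

# process ID of the R session that created the cluster
main_pid <- Sys.getpid()

print(worker_pids)
print(main_pid)

stopCluster(cl)
```

The three worker IDs should differ from each other and from the main session's ID, since each worker is its own R process.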

After this cluster is created, we call parSapply, which works almost exactly like sapply, except that instead of looping over each element in the vector, sample_numbers, one at a time, it uses the cluster of R instances to test multiple numbers in the vector for primality simultaneously. As you’ll see a little bit later, this saves a nice chunk of time.

Once our operation is done, we close the cluster object using the stopCluster function. This is important to do each time you use the parallel package; otherwise you could end up with lots of R instances on your machine.
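
If the parallel step lives inside a function, one possible way to make sure the cluster is always closed is to register stopCluster with on.exit. The sketch below assumes a hypothetical helper named par_sapply_safely, with a hypothetical n_workers argument; neither is part of the parallel package.

```r
library(parallel)

# hypothetical helper: run FUN over x on a temporary cluster
par_sapply_safely <- function(x, FUN, n_workers = 2)
{
    cl <- makeCluster(n_workers)
    # stop the cluster when the function exits, even after an error
    on.exit(stopCluster(cl))
    parSapply(cl, x, FUN)
}

doubled <- par_sapply_safely(1:5, function(n) n * 2)
print(doubled)
```

With this pattern, an error inside parSapply still triggers stopCluster, so no stray R instances are left behind.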

How fast is running R code in parallel?

Alright, so let’s test how much time we can save by parallelizing our code. We’ll start by running the same is_prime function on the same vector of 1 million integers using regular sapply, so no parallelization. We will time the run by calling R’s built-in function, proc.time, before and after sapply; this gives us a time stamp at the start and at the end of the run, and subtracting the two shows how long the code took.


start <- proc.time()
results <- sapply(sample_numbers , is_prime)
end <- proc.time()

print(end - start) # 125.34

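So the code takes 125.34 seconds to run. Now let’s run the same test with parSapply on a cluster of two R processes, including the time needed to create and stop the cluster.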


start <- proc.time()
cl <- makeCluster(2)
results <- parSapply(cl , sample_numbers , is_prime)
stopCluster(cl)

end <- proc.time()

print(end - start) # 70.01

As you can see, using just two cores cuts the run time to 70.01 seconds! What if we use three cores, like in our initial example?


start <- proc.time()
cl <- makeCluster(3)
results <- parSapply(cl , sample_numbers , is_prime)
stopCluster(cl)

end <- proc.time()

print(end - start) # 47.81

Using three cores runs our process in 47.81 seconds, which is much faster than the 125.34 seconds regular sapply needed. The exact amount of time you’ll save using parallelization will vary depending on the operations you’re performing and on the processor speed of the machine you’re working on, but in general, parallelization can definitely increase efficiency in your code.

Creating a cluster of R processes, as well as merging together the results from those instances, does take some amount of time. This means that parallelizing code over a small list or vector may not be worth it if the computation involved is not very intensive. However, in a case like the one above, which involves a large vector of numbers, parallelization helps immensely.
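
To get a feel for that overhead, here is a small sketch that times a trivial task on a short vector both ways, using system.time instead of proc.time. The names small_input and add_one are hypothetical. On many machines the parallel version will probably be slower here, since starting the cluster costs more than the work itself, but the timings depend on your setup.

```r
library(parallel)

# a deliberately tiny workload
small_input <- 1:100
add_one <- function(x) x + 1

# time the sequential version
seq_time <- system.time(seq_res <- sapply(small_input, add_one))

# time the parallel version, including cluster startup and shutdown
par_time <- system.time({
    cl <- makeCluster(2)
    par_res <- parSapply(cl, small_input, add_one)
    stopCluster(cl)
})

print(seq_time["elapsed"])
print(par_time["elapsed"])
```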

How many parallelized instances should we use?

Above, we tested using 2 and 3 cores, respectively. But why not some other number? The number of R processes you should create depends on the number of cores on your machine. A good rule of thumb is not to exceed that number. There are exceptions, but creating more processes than cores often slows a computation down rather than speeding it up. This has to do with how an operating system schedules processes: when there are more busy processes than cores, they have to take turns. For a more detailed explanation, see this link.

To figure out how many cores your machine has, you can run the detectCores function from the parallel package:


detectCores()

You may also want to balance the number of cores you use with other computations or applications you have running on your machine simultaneously.
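
One common convention, shown here as a sketch rather than a rule, is to leave one core free for everything else running on the machine. The variable names n_cores and n_workers are hypothetical.

```r
library(parallel)

n_cores <- detectCores()

# detectCores() can return NA if the count cannot be determined;
# fall back to a single worker in that case
if (is.na(n_cores)) n_cores <- 1

# leave one core free, but always use at least one worker
n_workers <- max(1, n_cores - 1)

print(n_workers)
```

The resulting n_workers value could then be passed to makeCluster in place of a hard-coded 2 or 3.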

That’s the end for this post. Have fun coding!

To learn more about useful tips in R, check out Hadley Wickham’s book, Advanced R.

Please check out other articles of mine by looking at the related posts linked below, or by browsing http://theautomatic.net/blog/.

Andrew Treadway
