R

Vectorize Fuzzy Matching

One of the best things about R is its ability to vectorize code. This allows you to run code much faster than you would if you were using a for or while loop. In this post, we’re going to show you how to use vectorization to speed up fuzzy matching. First, a little bit of background will be covered. If you’re familiar with vectorization and / or fuzzy matching, feel free to skip further down the post.

What is vectorization?

Vectorization works by performing operations on entire vectors, or by extension, matrices, rather than iterating through each element in a collection of objects one at a time. A basic example is adding two vectors together. This can be done like this:


a <- c(3, 4, 5)
b <- c(6, 7, 8)

sums <- a + b

In this example, sums now equals c(9, 11, 13). This is because the addition operator, “+”, was applied pairwise to each vector i.e. the first element of a was added to the first element of b, the second element of a was added to the second element of b, and so on. Vectorization is faster than traditional for loops because it uses parallel operations under the hood. For loops in R, on the other hand, are notoriously slow.

This example may seem simple, but vectorization can be used much more powerfully to speed up a process like fuzzy matching, the topic of this article.

What is fuzzy matching?

Fuzzy matching is the process of finding strings that follow similar patterns. For example, suppose you’re trying to join two data sets together on a city field. Data for these cities may be entered into a system manually, allowing for spelling or formatting differences i.e. “Mt. Hood” might be coded as “Mount Hood” etc. For a human, we can see those are clearly the same, but we need an algorithmic way to determine that so we don’t have to manually go through large numbers of possibilities.

The stringdist package

To perform fuzzy matching, we’re going to use a package called stringdist. This contains a function we need called stringsim which gives a measure of similarity between a pair of strings. This function allows several different algorithms to compare the similarity between two strings, and returns a value between 0 (very dissimilar) and 1 (very similar or equal). For instance:


# load stringdist package
require(stringdist)

# compare two sample strings
stringsim("this is a test", "this is the test", method = "jw")

Simplistic Approach

Alright, so to illustrate how much time vectorization saves, let’s do fuzzy matching in a less clever way. We’re going to use the zipcode package to get a sample list of cities across the US. These get stored in the vector, cities. Note, since this data set is actually at a zip code level, we’re going to have some duplicates in our vector, but that’s not a concern for this example because we’re trying to demonstrate performance differences between vectorization versus non-vectorization. Thus, our cities vector contains 44,336 elements.

Also, we create a vector below, called input, containing the misspelled or adjusted spellings of several cities in Florida (“Mt. Dora” vs. “Mount Dora”, “Sun City, FL” vs. “Sun City” etc.).


# load packages
require(stringdist)
require(zipcode)
data("zipcode")

cities <- zipcode$city

# misspell / change spellings from a few Florida cities
input <- c("Centry", "Boca R.", "Mt. Dora", "winterhaven", "Sun City, FL")

Now, one way of doing fuzzy matching here would be to loop through each city in our input vector, and then loop through each city in the cities vector and check the string similarity between each possible pair of strings i.e. check the similarity between the first element in input against every single element in cities. Since the length of input is 5 and the length of cities is 44,336, this requires 5 * 44,336 = 221,680 calls to stringsim.


# Naive approach
start <- proc.time()
results <- c()
for(city in input)
{
  
    best_dist <- -Inf
    for(check in cities)
    {
        dist <- stringsim(city, check, method = "jw")
        if(dist > best_dist)
        {
            closest_match <- check
            best_dist <- dist
          
        }
      
    }
    
    results <- append(results, closest_match)
  
}
end <- proc.time()

If we run end – start, we find that this process takes just under 26 seconds. However, let’s see what happens when we ditch the for loops, and use vectorization instead.

Vectorizing Fuzzy Matching


# define function to search for best string match in cities
get_best_match <- function(city, cities = cities)
{
    
    max_index <- which.max(stringsim(city, cities, method = "jw"))
    
    return(cities[max_index])
  
}


vector_start <- proc.time()
vector_results <- sapply(input, function(city) get_best_match(city, cities))
vector_end <- proc.time()

If we run vector_end – vector_start, we see this only takes 0.03 seconds! The main reason for the speed up can be seen in this line:


max_index <- which.max(stringsim(city, cities, method = "jw"))

The inner function, stringsim, gets the string similarity between the input city (e.g. “Mt. Dora”) versus every element in the cities vector. This is done in one line, rather than using a for or while loop construct. The output of this function call is a vector that shows the similarity measures (between 0 and 1) of the input city versus everything in cities.

We then apply the which.max function to get the index of the element in cities that is the closest match to the input city. Using this index, we can get the actual name of the city with the closest match, like below.


return(cities[max_index])

The other reason for the performance increase is using sapply to actually loop over the elements in our input vector, passing element in turn to the get_best_match function. Since sapply is written in C under the hood, it usually runs much faster than a traditional for loop in R.

That’s it for this post! Check out other posts of mine here: http://theautomatic.net/blog/.

Andrew Treadway

Recent Posts

Software Engineering for Data Scientists (New book!)

Very excited to announce the early-access preview (MEAP) of my upcoming book, Software Engineering for…

2 years ago

How to stop long-running code in Python

Ever had long-running code that you don't know when it's going to finish running? If…

3 years ago

Faster alternatives to pandas

Background If you've done any type of data analysis in Python, chances are you've probably…

3 years ago

Automated EDA with Python

In this post, we will investigate the pandas_profiling and sweetviz packages, which can be used…

4 years ago

How to plot XGBoost trees in R

In this post, we're going to cover how to plot XGBoost trees in R. XGBoost…

4 years ago

Python collections tutorial

In this post, we'll discuss the underrated Python collections package, which is part of the…

4 years ago