
Evaluate your R model with MLmetrics

This post explores how to use R's MLmetrics package to evaluate machine learning models. MLmetrics provides functions for calculating common metrics for ML models, including AUC, precision, recall, and accuracy.

Building an example model

First, we need to build a model to use as an example. For this post, we'll use a dataset on pulsar stars from Kaggle, saved as "pulsar_stars.csv". Each record in the file represents a pulsar star candidate, and the goal is to predict whether a candidate is a pulsar star based on the available attributes.

To get started, let’s load the packages we’ll need and read in our dataset.


library(MLmetrics)
library(dplyr)

stars <- read.csv("pulsar_stars.csv")

Next, let’s split our data into train vs. test. We’ll do a standard 70/30 split here.



set.seed(0)
train_indexes <- sample(1:nrow(stars), floor(0.7 * nrow(stars)))

train_set <- stars[train_indexes,]
test_set <- stars[-train_indexes,]

Now, let’s build a simple logistic regression model.


# move target_class to the first column so that formula(train_set)
# produces target_class ~ . with target_class as the response
train_set <- data.frame(train_set %>% select(target_class), train_set %>% select(-target_class))

# build model
model <- glm(formula(train_set), train_set, family = "binomial")

# get predictions on train and test datasets
train_pred <- predict(model, train_set, type = "response")
test_pred <- predict(model, test_set, type = "response")


AUC / precision / recall / accuracy

Let’s calculate a few metrics. One of the most common metrics for classification is AUC, which can be calculated using MLmetrics’ AUC function. Intuitively, AUC is a score between 0 and 1 that measures how well a model rank-orders its predictions. See here for a more detailed explanation.


# get AUC on test and train set
AUC(test_pred, test_set$target_class) # 0.974172
AUC(train_pred, train_set$target_class) # 0.9773794

As a refresher, here’s a quick overview of precision, recall, and accuracy:

  • Precision: The proportion of predicted positives that are actually positive. If the model predicts there are 10 pulsar stars, and 8 of those 10 actually are pulsars, then the precision would be 8 / 10, or 80%.
  • Recall: The proportion of the actual positives that the model captures (also called the true positive rate). For example, suppose there are 10 pulsar stars in the data and the model correctly identifies 7 of them. That would mean the recall is 7 / 10, or 70%.
  • Accuracy: Generally the most intuitive of the metrics here. Accuracy is simply the number of correct predictions divided by the total number of predictions. A small worked example follows this list.
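To make these definitions concrete, here is a minimal sketch using made-up 0/1 vectors (not the pulsar data) that computes each metric directly from the counts of true positives, false positives, and false negatives:


# toy example: 10 observations with known labels and model predictions
actual    <- c(1, 1, 1, 0, 0, 0, 0, 1, 0, 0)
predicted <- c(1, 1, 0, 0, 0, 1, 0, 1, 0, 0)

tp <- sum(predicted == 1 & actual == 1) # true positives
fp <- sum(predicted == 1 & actual == 0) # false positives
fn <- sum(predicted == 0 & actual == 1) # false negatives

tp / (tp + fp)            # precision: 3 / 4 = 0.75
tp / (tp + fn)            # recall: 3 / 4 = 0.75
mean(predicted == actual) # accuracy: 8 / 10 = 0.80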

Notice how each of the metrics above requires binary (0/1) predictions rather than predicted probabilities. To handle this, we need to set a threshold on our predicted probabilities. One way to do this would be to label any prediction of 50% or above as a predicted pulsar star, and any prediction below 50% as not a pulsar star.

For example, if we pick 0.5 as a threshold, our precision on the test set would be 0.9114219.

    
Precision(test_set$target_class, ifelse(test_pred >= .5, 1, 0), positive = 1) # 0.9114219
    
    
    

Rather than just picking 0.5, though, we can try to optimize the cutoff we choose. One way to do this is to pick the threshold that maximizes the F1 Score, which is defined as the harmonic mean of precision and recall (see more here).
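As a quick sanity check on that definition, we can compute precision and recall at the 0.5 cutoff ourselves and confirm that the harmonic mean, 2 * precision * recall / (precision + recall), matches MLmetrics' F1_Score function (a small sketch, reusing the test-set predictions from above):


prec_05 <- Precision(test_set$target_class, ifelse(test_pred >= .5, 1, 0), positive = 1)
rec_05  <- Recall(test_set$target_class, ifelse(test_pred >= .5, 1, 0), positive = 1)

# harmonic mean of precision and recall
2 * prec_05 * rec_05 / (prec_05 + rec_05)

# should match MLmetrics' built-in F1_Score
F1_Score(test_set$target_class, ifelse(test_pred >= .5, 1, 0), positive = 1)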

Below, we calculate the F1 Score on the training set for each threshold 0.01, 0.02, 0.03, ..., 0.99. The threshold that gives the best F1 Score is .32, or 32%.

    
    
# F1 Score on the training set at each threshold 0.01, 0.02, ..., 0.99
f1_scores <- sapply(seq(0.01, 0.99, .01), function(thresh) F1_Score(train_set$target_class, ifelse(train_pred >= thresh, 1, 0), positive = 1))

which.max(f1_scores) # 32
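Note that which.max returns the index into f1_scores rather than the threshold itself. Since the thresholds are just the sequence 0.01, 0.02, ..., 0.99, we can index back into that sequence to recover the cutoff (the 32nd threshold is 0.32):


thresholds <- seq(0.01, 0.99, .01)
thresholds[which.max(f1_scores)] # 0.32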
    
    

Using this cutoff, we can calculate precision, recall, and accuracy on the test set.

    
Precision(test_set$target_class, ifelse(test_pred >= .32, 1, 0), positive = 1)

Recall(test_set$target_class, ifelse(test_pred >= .32, 1, 0), positive = 1)

Accuracy(ifelse(test_pred >= .32, 1, 0), test_set$target_class)
    
    

In general, there will be a trade-off between precision and recall, so the threshold you choose may also depend on which of those metrics you value more. Optimizing based on the F1 Score is a good way to pick a threshold that balances precision and recall.
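To see that trade-off directly, one option is to compute precision and recall across the same grid of thresholds and plot them side by side (a sketch, reusing the test-set predictions from above):


thresholds <- seq(0.01, 0.99, .01)

prec_by_thresh <- sapply(thresholds, function(thresh) Precision(test_set$target_class, ifelse(test_pred >= thresh, 1, 0), positive = 1))
rec_by_thresh  <- sapply(thresholds, function(thresh) Recall(test_set$target_class, ifelse(test_pred >= thresh, 1, 0), positive = 1))

# raising the threshold generally pushes precision up and recall down
plot(thresholds, prec_by_thresh, type = "l", ylim = c(0, 1), xlab = "threshold", ylab = "metric")
lines(thresholds, rec_by_thresh, lty = 2)
legend("bottomleft", legend = c("precision", "recall"), lty = c(1, 2))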

Gini

Another metric that can be used to evaluate classification models is the Gini coefficient. Gini is calculated as 2 * AUC - 1. Thus, we get 0.974172 * 2 - 1 = 0.948344.

    
Gini(test_pred, test_set$target_class) # 0.948344
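As a quick check, the same value follows from the formula above using the AUC we computed earlier:


2 * AUC(test_pred, test_set$target_class) - 1 # 0.948344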
    
    

Other metrics

MLmetrics also includes functions for regression metrics, such as RMSE and RAE.
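For example, for a regression model you could compare predictions against actuals with RMSE (root mean squared error) and RAE (relative absolute error). A minimal sketch with made-up vectors (not related to the pulsar model):


y_true <- c(3.0, 5.5, 7.2, 10.1)
y_pred <- c(2.8, 6.0, 7.0, 9.5)

RMSE(y_pred, y_true) # root mean squared error
RAE(y_pred, y_true)  # relative absolute error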

That’s it for this post! If you liked this article, please follow my blog on Twitter, or check out some recommended books here.

Andrew Treadway
