Software Engineering for Data Scientists (New book!)

Machine Learning, Python
I'm very excited to announce that the early-access preview (MEAP) of my upcoming book, Software Engineering for Data Scientists, is available now! Check it out at this link. Use promo code au35tre to save 30% on this book and any other products sold by Manning. Why Software Engineering for Data Scientists? Data science and software engineering have been merging more and more, especially over the last decade. Software Engineering for Data Scientists will help you learn more about software engineering and how it can make your life easier as a data scientist! This book covers the following key topics: source control; how to implement exception handling and write robust code; object-oriented programming for data scientists; how to monitor the progress of training machine learning models; scaling your Python…
How to plot XGBoost trees in R

Machine Learning, R
In this post, we're going to cover how to plot XGBoost trees in R. XGBoost is a very popular machine learning algorithm, which is frequently used in Kaggle competitions and has many practical use cases. Let's start by loading the packages we'll need. Note that plotting XGBoost trees requires the DiagrammeR package, so even if you already have xgboost installed, you'll need to make sure you have DiagrammeR as well.
[code lang="R"]
# load libraries
library(xgboost)
library(caret)
library(dplyr)
library(DiagrammeR)
[/code]
Next, let's read in our dataset. In this post, we'll be using this customer churn dataset. The label we'll be trying to predict is called "Exited" and is a binary variable with 1 meaning the customer churned (canceled account) vs. 0 meaning the customer did not churn (did…
What to study if you’re under quarantine

Python, R
If you're staying indoors more often because of the current COVID-19 outbreak and are looking for new things to study, here are a few ideas! Free 365 Data Science Courses 365 Data Science is making all of their courses free until April 15. They have a variety of courses across R, Python, SQL, and more. Their platform also has courses that give a great mathematical foundation for machine learning, which helps a lot as you get deeper into data science. They also have courses on deep learning, which is a hot field right now. In addition to pure data science, 365 Data Science also covers material on Git / GitHub, which is essential for any data scientist nowadays. Another nice feature of 365 Data Science is that they also offer…
How is information gain calculated?

Machine Learning, R
This post will explore the mathematics behind information gain. We'll start with the basic intuition behind information gain, and then explain why it is calculated the way it is. What is information gain? Information gain is a measure frequently used in decision trees to determine which variable to split the input dataset on at each step in the tree. Before we formally define this measure, we first need to understand the concept of entropy. Entropy measures the amount of information or uncertainty in a variable's possible values. How to calculate entropy The entropy of a random variable X is given by the following formula: -Σi[p(Xi) * log2(p(Xi))] Here, Xi denotes the ith possible value of X, and p(Xi) is the probability of that value. Why…
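To make the entropy formula concrete, here is a minimal sketch in base R. It assumes the probabilities p(Xi) are estimated from the observed frequencies of each value; the helper name `entropy` and the toy vectors are hypothetical and not taken from the post.

```r
# Shannon entropy (in bits) of a discrete variable, estimated from observed values.
# `entropy` is a hypothetical helper name used only for illustration.
entropy <- function(x) {
  p <- table(x) / length(x)  # empirical probability of each distinct value
  p <- p[p > 0]              # drop unused factor levels so log2(0) never appears
  -sum(p * log2(p))
}

# Toy examples (assumed data)
fair_coin <- c("H", "T", "H", "T")
biased    <- c("H", "H", "H", "T")
constant  <- c("H", "H", "H", "H")

entropy(fair_coin)  # 1 bit: two equally likely values
entropy(biased)     # about 0.811 bits: less uncertainty
entropy(constant)   # 0 bits: no uncertainty at all
```

The more evenly spread the values are, the higher the entropy, which is why a split that produces purer groups lowers it.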
Evaluate your R model with MLmetrics

Machine Learning, R
This post will explore using R's MLmetrics package to evaluate machine learning models. MLmetrics provides several functions to calculate common metrics for ML models, including AUC, precision, recall, accuracy, etc. Building an example model First, we need to build a model to use as an example. For this post, we'll be using a dataset on pulsar stars from Kaggle. Let's save the file as "pulsar_stars.csv". Each record in the file represents a pulsar star candidate. The goal will be to predict whether a record is a pulsar star based on the attributes available. To get started, let's load the packages we'll need and read in our dataset.
[code lang="R"]
library(MLmetrics)
library(dplyr)
stars = read.csv("pulsar_stars.csv")
[/code]
Next, let's split our data into train vs. test. We'll do a standard 70/30 split here.…
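One common way to do a 70/30 split in base R is sketched below. This is an illustration only, not necessarily the approach the full post takes: it uses a small synthetic data frame (`stars_demo`, with made-up column names) in place of the pulsar file, and the seed value is an arbitrary assumption.

```r
set.seed(0)  # assumed seed so the split is reproducible

# Synthetic stand-in for the pulsar data; column names are hypothetical
stars_demo <- data.frame(
  feature = rnorm(100),
  label   = rbinom(100, size = 1, prob = 0.1)
)

# Randomly pick 70% of the row indexes for training; the rest become the test set
n_train <- round(0.7 * nrow(stars_demo))
train_indexes <- sample(seq_len(nrow(stars_demo)), size = n_train)

train <- stars_demo[train_indexes, ]
test  <- stars_demo[-train_indexes, ]

nrow(train)  # 70
nrow(test)   # 30
```

Sampling row indexes without replacement guarantees that no record ends up in both sets.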
How to get an AUC confidence interval

Machine Learning, R
Background AUC is an important metric in machine learning for classification. It is often used as a measure of a model's performance. In effect, AUC is a value between 0 and 1 that measures how well a model rank-orders its predictions. For a detailed explanation of AUC, see this link. Since AUC is widely used, being able to get a confidence interval around this metric is valuable both for demonstrating a model's performance and for comparing two or more models. For example, if model A has a higher AUC than model B, but the 95% confidence intervals around the two AUC values overlap, then the models may not be statistically different in performance. We can get a confidence interval around AUC using R's pROC package, which…
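As a rough sketch of what this can look like with pROC, the snippet below builds a ROC object with `roc` and passes it to `ci.auc`. The labels and scores are synthetic (assumed for illustration, not data from the post), and the seed is arbitrary.

```r
library(pROC)

set.seed(0)  # assumed seed for reproducibility

# Synthetic labels and model scores (hypothetical data)
actual <- rbinom(200, size = 1, prob = 0.5)
scores <- actual * 0.3 + runif(200)  # scores tend to be higher when the label is 1

# Build the ROC object, stating explicitly that 0 is the control and 1 is the case
roc_obj <- roc(response = actual, predictor = scores,
               levels = c(0, 1), direction = "<", quiet = TRUE)

# 95% confidence interval around the AUC (pROC uses the DeLong method by default)
auc_ci <- ci.auc(roc_obj, conf.level = 0.95)

auc_ci  # lower bound, AUC estimate, upper bound
```

To compare two models in the spirit of the example above, you could compute `ci.auc` for each model's ROC object and check whether the two intervals overlap.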