3 recommended books on learning R

R
I sometimes get asked how I got started learning R. In this post I'll go through a few books I read along the way that have been highly useful.

The Art of R Programming

The Art of R Programming: A Tour of Statistical Software Design is one of the first R books I read. If you read the table of contents of this book, you'll see it doesn't cover much data science-related content. However, the book is great at covering the main data structures you need to actually program in R. You'll learn the ins and outs of vectors, data frames, matrices, lists, and so on. Another point I like about the book is that it's good at explaining the primary structures that you need to…
Read More
How is information gain calculated?

Machine Learning, R
This post will explore the mathematics behind information gain. We'll start with the basic intuition behind information gain, then explain why it is calculated the way it is.

What is information gain?

Information gain is a measure frequently used in decision trees to determine which variable to split the input dataset on at each step in the tree. Before we formally define this measure, we first need to understand the concept of entropy. Entropy measures the amount of information or uncertainty in a variable's possible values.

How to calculate entropy

The entropy of a random variable X is given by the following formula:

H(X) = -Σ_i [p(X_i) * log2(p(X_i))]

Here, X_i is the ith possible value of X, and p(X_i) is the probability of that value.

Why…
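As a rough illustration of the entropy formula and of how information gain uses it to score a candidate split, here is a minimal Python sketch. The function names `entropy` and `information_gain` and the toy labels are assumptions made for this example, not code from the post. Probabilities are estimated as simple frequencies of each label.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of the empirical distribution of `labels`."""
    total = len(labels)
    counts = Counter(labels)
    # -sum over each distinct value of p * log2(p), with p = count / total
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(labels, groups):
    """Entropy of the parent labels minus the size-weighted entropy of the child groups."""
    total = len(labels)
    weighted_child_entropy = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - weighted_child_entropy


# Hypothetical toy data: four records with a binary target
parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]  # each child is pure
useless_split = [["yes", "no"], ["yes", "no"]]  # each child is as mixed as the parent

print(entropy(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

With this toy data the parent has one bit of entropy. The split that separates the classes completely recovers all of it, while the split that leaves each child as mixed as the parent gains nothing, which is why a decision tree would prefer the first.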
Read More
Handling dates with Python’s maya package

Python
Background

In this post we'll discuss Python's maya package for parsing dates from strings. A previous article talked about the dateutil and dateparser libraries for finding dates in strings. maya is particularly useful for standardizing variations in a field or list of dates. maya can be installed using pip:

pip install maya

Standardizing dates with maya

Let's start with a basic example. First, we just need to import maya. Next, we'll use its parse function to convert the text into a MayaDT object. We can call the datetime method on the result to get a datetime from the string.

[code lang="python"]
import maya
maya.parse("march 1 2019").datetime()
[/code]

[code lang="python"]
maya.parse("9th of february 2019").datetime()
[/code]

[code lang="python"]
maya.parse("1/1/2020").datetime()
[/code]

Below are several more examples of different date variations.

[code lang="python"]
maya.parse("1-1-2020").datetime()
maya.parse("1…
Read More
Evaluate your R model with MLmetrics

Machine Learning, R
This post will explore using R's MLmetrics package to evaluate machine learning models. MLmetrics provides several functions to calculate common metrics for ML models, including AUC, precision, recall, accuracy, etc.

Building an example model

First, we need to build a model to use as an example. For this post, we'll be using a dataset on pulsar stars from Kaggle. Let's save the file as "pulsar_stars.csv". Each record in the file represents a pulsar star candidate. The goal will be to predict whether a record is a pulsar star based upon the attributes available. To get started, let's load the packages we'll need and read in our dataset.

[code lang="R"]
library(MLmetrics)
library(dplyr)

stars = read.csv("pulsar_stars.csv")
[/code]

Next, let's split our data into train vs. test. We'll do a standard 70/30 split here.…
Read More
How to read PDF files with Python

Python
Background

In a previous article, we talked about how to scrape tables from PDF files with Python. In this post, we'll cover how to extract text from several types of PDFs. To read PDF files with Python, we can focus most of our attention on two packages: pdfminer and pytesseract. pdfminer (specifically pdfminer.six, which is a more up-to-date fork of pdfminer) is an effective package to use if you're handling PDFs that are typed and whose text you can highlight. On the other hand, to read scanned-in PDF files with Python, the pytesseract package comes in handy, which we'll see later in the post.

Scraping highlightable text

For the first example, let's scrape a 10-K form from Apple (see here). First, we'll just download this file to a…
Read More
How to import Python classes into R

Pandas, Python, R
Background

This post is going to talk about how to import Python classes into R, which can be done using a really awesome package in R called reticulate. reticulate allows you to call Python code from R, including sourcing Python scripts, using Python packages, and porting functions and classes. To install reticulate, we can run:

[code lang="R"]
install.packages("reticulate")
[/code]

Creating a Python class

Let's create a simple class in Python.

[code lang="python"]
import pandas as pd

# define a python class
class explore:

    def __init__(self, df):
        self.df = df

    def metrics(self):
        desc = self.df.describe()
        return desc

    def dummify_vars(self):
        for field in self.df.columns:
            if isinstance(self.df[field][0], str):
                temp = pd.get_dummies(self.df[field])
                self.df = pd.concat([self.df, temp], axis = 1)
                self.df.drop(columns = [field], inplace = True)
[/code]

Porting the Python class to R

There's a…
Read More