R Archives - Page 2 of 4 - Open Source Automation

17 Mar 2020 by Andrew Treadway

How to create decorators in R

Python, R

Introduction

One of the coolest features of Python is its support for decorators. In short, decorators let us modify how a function behaves without changing the function's source code, which often makes code cleaner and easier to modify. For instance, decorators are really useful when you have a collection of functions that each repeat similar code. Fortunately, decorators can now also be created in R! The first example below is in Python - we'll get to R in a moment. If you already know about decorators in Python, feel free to skip ahead to the R section. Below we have a function that prints today's date. In our example, we create two functions. The first, print_start_end, takes another function as input. It…
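Since the excerpt cuts off before the code, here is a minimal Python sketch of the pattern it describes. Only the name print_start_end comes from the post; the function todays_date and the "Start" / "End" messages are assumptions made for illustration and may differ from the original.

```python
import functools
from datetime import date


def print_start_end(func):
    """Wrap func so a message is printed before and after it runs."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        print("Start")  # assumed message, not taken from the post
        result = func(*args, **kwargs)
        print("End")    # assumed message, not taken from the post
        return result
    return wrapper


@print_start_end
def todays_date():
    """Print today's date and return it (hypothetical example function)."""
    today = date.today()
    print(today)
    return today


todays_date()
```

The body of todays_date never changes; the extra behavior comes entirely from the wrapper, which is the point of the technique.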

25 Feb 2020 by Andrew Treadway

3 recommended books on learning R

R

I sometimes get asked how I got started learning R. I thought I would use this post to go through a few books I read along the way that have been highly useful.

The Art of R Programming

The Art of R Programming: A Tour of Statistical Software Design is one of the first R books I read. If you read the table of contents, you'll see it doesn't cover much data science-related content. However, the book is great at covering the main data structures you need to actually program in R. You'll learn the ins and outs of vectors, data frames, matrices, lists, and so on. Another point I like about the book is that it's good at explaining the primary structures that you need to…

18 Feb 2020 by Andrew Treadway

How is information gain calculated?

Machine Learning, R

This post will explore the mathematics behind information gain. We'll start with the basic intuition behind information gain, then explain why it is calculated the way it is.

What is information gain?

Information gain is a measure frequently used in decision trees to determine which variable to split the input dataset on at each step in the tree. Before we formally define this measure, we first need to understand the concept of entropy. Entropy measures the amount of information or uncertainty in a variable's possible values.

How to calculate entropy

The entropy of a random variable X is given by the following formula:

$$H(X) = -\sum_i p(X_i) \log_2 p(X_i)$$

Here, $X_i$ is the ith possible value of X, and $p(X_i)$ is the probability of that value.

Why…
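As a quick check of the entropy formula above, here is a small Python sketch that estimates each probability from the observed frequencies in a list of values. The function name entropy and the sample labels are illustrative assumptions, not code from the post.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    # Sum -p * log2(p) over each distinct value, with p = count / total.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# An evenly split variable is maximally uncertain: 1 bit for two outcomes.
print(entropy(["yes", "no", "yes", "no"]))

# A variable with a single possible value carries no uncertainty.
print(entropy(["yes", "yes", "yes", "yes"]))
```

The first call gives 1 bit and the second gives 0, matching the intuition that entropy is highest when the outcomes are equally likely.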

29 Jan 2020 by Andrew Treadway

Evaluate your R model with MLmetrics

Machine Learning, R

This post will explore using R's MLmetrics package to evaluate machine learning models. MLmetrics provides several functions to calculate common metrics for ML models, including AUC, precision, recall, accuracy, etc.

Building an example model

First, we need to build a model to use as an example. For this post, we'll be using a dataset on pulsar stars from Kaggle. Let's save the file as "pulsar_stars.csv". Each record in the file represents a pulsar star candidate. The goal will be to predict whether a record is a pulsar star based on the attributes available. To get started, let's load the packages we'll need and read in our dataset.

[code lang="R"]
# load packages
library(MLmetrics)
library(dplyr)

# read in the pulsar star candidates
stars = read.csv("pulsar_stars.csv")
[/code]

Next, let's split our data into train vs. test. We'll do a standard 70/30 split here.…
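To make the metrics named above concrete, here is a rough Python sketch of how accuracy, precision, and recall follow from the counts of true and false positives and negatives. It is not MLmetrics' implementation, and the function name and toy labels are made up for illustration rather than taken from the pulsar dataset.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Precision and recall are undefined when their denominator is zero.
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
    }


# Toy labels, purely illustrative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(classification_metrics(y_true, y_pred))
```

With these toy labels there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so all three metrics come out to 0.75.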

14 Jan 2020 by Andrew Treadway

How to import Python classes into R

Pandas, Python, R

Background

This post is going to talk about how to import Python classes into R, which can be done using a really awesome package in R called reticulate. reticulate allows you to call Python code from R, including sourcing Python scripts, using Python packages, and porting functions and classes. To install reticulate, we can run:

[code lang="R"]
install.packages("reticulate")
[/code]

Creating a Python class

Let's create a simple class in Python.

[code lang="python"]
import pandas as pd

# define a python class
class explore:

    def __init__(self, df):
        self.df = df

    def metrics(self):
        desc = self.df.describe()
        return desc

    def dummify_vars(self):
        for field in self.df.columns:
            if isinstance(self.df[field][0], str):
                temp = pd.get_dummies(self.df[field])
                self.df = pd.concat([self.df, temp], axis = 1)
                self.df.drop(columns = [field], inplace = True)
[/code]

Porting the Python class to R

There's a…

30 Dec 2019 by Andrew Treadway

mapply and Map in R

R

An older post on this blog talked about several alternative base apply functions. This post will talk about how to apply a function across multiple vectors or lists with Map and mapply in R. These functions are generalizations of sapply and lapply, which allow you to more easily loop over multiple vectors or lists simultaneously. Map Suppose we have two lists of vectors and we want to divide the nth vector in one list by the nth vector in the second list. Map makes this straightforward to accomplish, while keeping the code clean to read. Map returns a list by default, similar to lapply. Below, we create two sample lists of vectors. [code lang="R"] values1 <- list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8,…

20 Aug 2019 by Andrew Treadway

How to get an AUC confidence interval

Machine Learning, R

Background

AUC is an important metric in machine learning for classification and is often used as a measure of a model's performance. In effect, AUC is a value between 0 and 1 that measures how well a model rank-orders its predictions. For a detailed explanation of AUC, see this link. Since AUC is widely used, being able to get a confidence interval around this metric is valuable both to better demonstrate a model's performance and to better compare two or more models. For example, if model A has a higher AUC than model B, but the 95% confidence intervals around the two AUC values overlap, then the models may not be statistically different in performance. We can get a confidence interval around AUC using R's pROC package, which…
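To illustrate what a confidence interval around AUC means, here is a self-contained Python sketch that uses a percentile bootstrap. This is a generic resampling technique chosen for illustration and is not necessarily how pROC computes its interval; the function names, labels, and scores below are all assumptions.

```python
import random


def auc(labels, scores):
    """AUC as the chance a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count a win as 1 and a tie as 0.5 over every positive/negative pair.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for AUC."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # AUC needs both classes in the resample
            continue
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    alpha = (1 - level) / 2
    lower = stats[int(alpha * n_boot)]
    upper = stats[int((1 - alpha) * n_boot) - 1]
    return lower, upper


# Made-up labels and model scores, purely illustrative.
labels = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.3, 0.35, 0.4, 0.5, 0.6, 0.65, 0.7, 0.8, 0.2]

print(auc(labels, scores))
print(bootstrap_auc_ci(labels, scores))
```

With only ten observations the interval is wide, which is exactly the situation where comparing two models on their point estimates alone can be misleading.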

16 Aug 2019 by Andrew Treadway

Really large numbers in R

R

This post will discuss ways of handling huge numbers in R using the gmp package.

The gmp package

The gmp package provides us a way of dealing with really large numbers in R. For example, let's suppose we want to multiply 10^250 by itself. Mathematically we know the result should be 10^500. But if we try this calculation in base R we get Inf for infinity, because the result exceeds the largest value a double-precision number can hold (about 1.8 × 10^308).

[code lang="R"]
num = 10^250
num^2 # Inf
[/code]

However, we can get around this using the gmp package. Here, we can convert the integer 10 to an object of the bigz class. This is an implementation that allows us to handle very large numbers. Once we convert an integer to a bigz object, we can use it to perform calculations with regular numbers…
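For comparison, the same overflow can be reproduced in Python, where the built-in int type is arbitrary precision and plays a role loosely analogous to bigz. This is only a sketch of the idea, not gmp code, and the variable names are illustrative.

```python
# Python integers are arbitrary precision, so this stays exact.
big = 10 ** 250
squared = big * big
print(squared == 10 ** 500)
print(len(str(squared)))  # a 1 followed by 500 zeros

# A double-precision float overflows, just like the base R calculation.
as_float = float(big)
overflowed = as_float * as_float
print(overflowed)
```

The integer product is exact, while the floating-point product overflows to infinity, mirroring the Inf seen in base R.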

23 Jul 2019 by Andrew Treadway

BeautifulSoup vs. Rvest

Python, R, Web Scraping

This post will compare Python's BeautifulSoup package to R's rvest package for web scraping. We'll also talk about additional functionality in rvest (that doesn't exist in BeautifulSoup) in comparison to a couple of other Python packages (including pandas and RoboBrowser).

Getting started

BeautifulSoup and rvest both involve creating an object that we can use to parse the HTML from a webpage. However, one immediate difference is that BeautifulSoup is just an HTML parser, so it doesn't connect to webpages. rvest, on the other hand, can connect to a webpage and scrape / parse its HTML in a single package. In BeautifulSoup, our initial setup looks like this:

[code lang="python"]
# load packages
from bs4 import BeautifulSoup
import requests

# connect to webpage
resp = requests.get("https://www.azlyrics.com/b/beatles.html")

# get BeautifulSoup object
soup…

12 Jul 2019 by Andrew Treadway

Testing the Collatz Conjecture with R

R

Background

The Collatz Conjecture is a famous unsolved problem in number theory. If you're not familiar with it, the conjecture is very simple to understand, yet no one has been able to mathematically prove that it is true (though it's been shown to hold for an enormous number of cases). The conjecture states the following: Start with any positive whole number. If the number is even, divide it by two. If the number is odd, multiply it by three and add one. Then repeat this logic with the resulting number. Eventually you'll end up with the number one. Effectively, this can be written like this:

For any whole number n:
    If n mod 2 == 0, then n = n / 2
    Else n = 3 * n…
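The rule described above can be sketched in a few lines of Python. The post itself works in R; this version, including the function name collatz_sequence, is an illustrative assumption rather than the author's code.

```python
def collatz_sequence(n):
    """Return the Collatz sequence starting at n and ending at 1."""
    if n < 1:
        raise ValueError("n must be a positive whole number")
    sequence = [n]
    while n != 1:
        # Halve even numbers; send odd numbers to 3n + 1.
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        sequence.append(n)
    return sequence


print(collatz_sequence(6))
# [6, 3, 10, 5, 16, 8, 4, 2, 1]
```

Note that the loop only terminates if the conjecture holds for the starting value, which is precisely what remains unproven in general.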

Category: R