Blog

Reading from databases with Python

Pandas, Python
Background - Reading from Databases with Python

This post covers several packages for working with databases from Python. We'll start with pyodbc, one of the more standard packages for this purpose, and then cover turbodbc, a very useful module that lets you run SQL queries efficiently (and generally faster) within Python.

pyodbc

pyodbc can be installed using pip:

[code]
pip install pyodbc
[/code]

Let's start by writing a simple SQL query using pyodbc. To do that, we first need to connect to a specific database. The examples here use a SQLite database on my machine, but you can do the same with many other database systems, such as SQL Server, MySQL, Oracle, etc.…
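pyodbc follows Python's DB-API 2.0 interface, so the connect, cursor, execute, fetch pattern looks much the same across drivers. As a minimal sketch that runs without an ODBC driver installed, the example below uses the standard library's sqlite3 module as a stand-in rather than pyodbc itself. The in-memory database, the employees table, and its rows are made up for illustration.

```python
import sqlite3

# Stand-in for a real database: an in-memory SQLite database (hypothetical data).
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Create and populate a small example table.
cursor.execute("CREATE TABLE employees (name TEXT, salary REAL)")
cursor.executemany(
    "INSERT INTO employees (name, salary) VALUES (?, ?)",
    [("Ann", 52000.0), ("Bob", 48000.0), ("Cara", 61000.0)],
)
conn.commit()

# Run a parameterized query and fetch all matching rows.
cursor.execute(
    "SELECT name, salary FROM employees WHERE salary > ? ORDER BY salary",
    (50000,),
)
rows = cursor.fetchall()

for name, salary in rows:
    print(name, salary)

conn.close()
```

With pyodbc, the main difference would be opening the connection with pyodbc.connect and a driver-specific connection string; the cursor calls would look much the same.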
3 recommended books on learning R

R
I sometimes get asked how I got started learning R, so I thought I would use this post to go through a few books I read along the way that have been highly useful.

The Art of R Programming

The Art of R Programming: A Tour of Statistical Software Design is one of the first R books I read. If you read the table of contents, you'll see it doesn't cover much data science-related content. However, the book is great at covering the main data structures you need to actually program in R. You'll learn the ins and outs of vectors, data frames, matrices, lists, and so on. Another point I like about the book is that it's good at explaining the primary structures that you need to…
How is information gain calculated?

Machine Learning, R
This post will explore the mathematics behind information gain. We'll start with the basic intuition behind information gain, and then explain why it is calculated the way it is.

What is information gain?

Information gain is a measure frequently used in decision trees to determine which variable to split the input dataset on at each step in the tree. Before formally defining this measure, we first need to understand the concept of entropy. Entropy measures the amount of information or uncertainty in a variable's possible values.

How to calculate entropy

The entropy of a random variable X is given by the following formula:

$-\sum_i p(x_i) \log_2 p(x_i)$

Here, $x_i$ is the ith possible value of X, and $p(x_i)$ is the probability of that value.

Why…
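To make the entropy formula concrete, here is a small Python sketch that estimates each probability from observed frequencies and sums the terms. The function name and the sample lists are made up for illustration; this is not code from the original post.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    # Estimate p(x_i) as the observed frequency of each distinct value.
    probs = [count / total for count in counts.values()]
    # Sum of -p(x_i) * log2(p(x_i)) over all distinct values.
    return sum(-p * math.log2(p) for p in probs)

# Hypothetical samples: two equally likely outcomes vs. one certain outcome.
fair_coin = ["H", "T", "H", "T"]
always_heads = ["H", "H", "H", "H"]

print(entropy(fair_coin))
print(entropy(always_heads))
```

Two equally likely outcomes carry one bit of uncertainty, while a variable that always takes the same value carries none.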
Handling dates with Python’s maya package

Python
Background

In this post we'll discuss Python's maya package for parsing dates from strings. A previous article talked about the dateutil and dateparser libraries for finding dates in strings. maya is particularly useful for standardizing variations in a field or list of dates. maya can be installed using pip:

pip install maya

Standardizing dates with maya

Let's start with a basic example. First, we just need to import maya. Next, we'll use its parse method to convert the text into a MayaDT object. We can then call the datetime method on the result to get a datetime from the string.

[code lang="python"]
import maya

maya.parse("march 1 2019").datetime()
[/code]

[code lang="python"]
maya.parse("9th of february 2019").datetime()
[/code]

[code lang="python"]
maya.parse("1/1/2020").datetime()
[/code]

Below are several more examples of different date variations.

[code lang="python"]
maya.parse("1-1-2020").datetime()
maya.parse("1…
[/code]
Evaluate your R model with MLmetrics

Machine Learning, R
This post will explore using R's MLmetrics package to evaluate machine learning models. MLmetrics provides several functions to calculate common metrics for ML models, including AUC, precision, recall, accuracy, etc.

Building an example model

First, we need to build a model to use as an example. For this post, we'll be using a dataset on pulsar stars from Kaggle. Let's save the file as "pulsar_stars.csv". Each record in the file represents a pulsar star candidate. The goal will be to predict whether a record is a pulsar star based upon the attributes available. To get started, let's load the packages we'll need and read in our dataset.

[code lang="R"]
library(MLmetrics)
library(dplyr)

stars = read.csv("pulsar_stars.csv")
[/code]

Next, let's split our data into train vs. test. We'll do a standard 70/30 split here.…
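Accuracy, precision, and recall all have simple definitions in terms of true and false positives and negatives. The following is a plain-Python sketch of those definitions for a binary classifier, written so it runs on its own. It is not MLmetrics' code; the function names, labels, and predictions are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    """Share of all predictions that are correct."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)

def precision(y_true, y_pred):
    """Share of predicted positives that are truly positive.
    Assumes at least one positive prediction."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp)

def recall(y_true, y_pred):
    """Share of actual positives that were predicted positive.
    Assumes at least one actual positive."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)

# Hypothetical labels (1 = pulsar, 0 = not a pulsar) and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

print(accuracy(y_true, y_pred), precision(y_true, y_pred), recall(y_true, y_pred))
```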
How to read PDF files with Python

Python
Background

In a previous article, we talked about how to scrape tables from PDF files with Python. In this post, we'll cover how to extract text from several types of PDFs. To read PDF files with Python, we can focus most of our attention on two packages - pdfminer and pytesseract. pdfminer (specifically pdfminer.six, which is a more up-to-date fork of pdfminer) is an effective package to use if you're handling PDFs that are typed and whose text you're able to highlight. On the other hand, to read scanned-in PDF files with Python, the pytesseract package comes in handy, which we'll see later in the post.

Scraping highlightable text

For the first example, let's scrape a 10-K form from Apple (see here). First, we'll just download this file to a…
How to import Python classes into R

Pandas, Python, R
Background

This post is going to talk about how to import Python classes into R, which can be done using a really awesome package in R called reticulate. reticulate allows you to call Python code from R, including sourcing Python scripts, using Python packages, and porting functions and classes. To install reticulate, we can run:

[code lang="R"]
install.packages("reticulate")
[/code]

Creating a Python class

Let's create a simple class in Python.

[code lang="python"]
import pandas as pd

# define a python class
class explore:

    def __init__(self, df):
        self.df = df

    def metrics(self):
        desc = self.df.describe()
        return desc

    def dummify_vars(self):
        for field in self.df.columns:
            if isinstance(self.df[field][0], str):
                temp = pd.get_dummies(self.df[field])
                self.df = pd.concat([self.df, temp], axis = 1)
                self.df.drop(columns = [field], inplace = True)
[/code]

Porting the Python class to R

There's a…
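Before porting the class, it can help to see what it does when used directly from Python. The sketch below repeats the class definition so it runs on its own and applies it to a small DataFrame; the sample data and the variable names ex and summary are made up for illustration.

```python
import pandas as pd

class explore:

    def __init__(self, df):
        self.df = df

    def metrics(self):
        # Summary statistics for the numeric columns.
        return self.df.describe()

    def dummify_vars(self):
        # Replace each string column with one indicator column per distinct value.
        for field in self.df.columns:
            if isinstance(self.df[field][0], str):
                temp = pd.get_dummies(self.df[field])
                self.df = pd.concat([self.df, temp], axis=1)
                self.df.drop(columns=[field], inplace=True)

# Hypothetical sample data: one numeric column and one string column.
df = pd.DataFrame({"price": [10.0, 12.5, 9.0], "color": ["red", "blue", "red"]})

ex = explore(df)
summary = ex.metrics()
ex.dummify_vars()

print(summary)
print(ex.df.columns.tolist())
```

After dummify_vars runs, the string column color is replaced by one indicator column per distinct value, while the numeric column is left as-is.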
mapply and Map in R

R
An older post on this blog talked about several alternative base apply functions. This post will talk about how to apply a function across multiple vectors or lists with Map and mapply in R. These functions generalize sapply and lapply, letting you more easily loop over multiple vectors or lists simultaneously.

Map

Suppose we have two lists of vectors and we want to divide the nth vector in one list by the nth vector in the second list. Map makes this straightforward to accomplish, while keeping the code clean to read. Map returns a list by default, similar to lapply. Below, we create two sample lists of vectors.

[code lang="R"]
values1 <- list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8,…
[/code]
Updates to yahoo_fin package

Python, Web Scraping
Package updates

First of all - thank you to everyone who has contacted me regarding the yahoo_fin package! Due to some changes in Yahoo Finance's website, I've updated the source code of yahoo_fin. To upgrade to the latest version (0.8.4), you can use pip:

[code]
pip install yahoo_fin --upgrade
[/code]

Get weekly and monthly stock prices

The most recent version includes all functionality from the previous version, but now also includes the ability to pull weekly and monthly historical stock prices in the get_data method.

[code lang="python"]
from yahoo_fin import stock_info as si

# default daily data
daily_data = si.get_data("amzn")

# get weekly data
weekly_data = si.get_data("amzn", interval = "1wk")

# get monthly data
monthly_data = si.get_data("amzn", interval = "1mo")
[/code]

Speed-up in functions

The options module includes an update…
3 Packages to Build a Spell Checker in Python

Python
This post is going to talk about three different packages for coding a spell checker in Python - pyspellchecker, TextBlob, and autocorrect.

pyspellchecker

The pyspellchecker package allows you to perform spelling corrections, as well as see candidate spellings for a misspelled word. To install the package, you can use pip:

[code]
pip install pyspellchecker
[/code]

Once installed, pyspellchecker is straightforward to use. Note that even though we use "pyspellchecker" when installing via pip, we just type "spellchecker" in the package import statement. The first step is to create a SpellChecker object, which we'll just call "spell".

[code lang="python"]
from spellchecker import SpellChecker

spell = SpellChecker()
[/code]

Now, we're ready to test this out with a few misspellings. We'll use a few words from this list of commonly misspelled…
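To give a feel for how candidate spellings can be generated, here is a pure-Python sketch of the edit-distance idea that spell checkers of this kind are commonly built on. It is an illustration under stated assumptions, not pyspellchecker's actual code: the tiny vocabulary, its frequencies, and the function names are all made up.

```python
import string

# Hypothetical toy vocabulary with made-up word frequencies.
WORD_FREQ = {"separate": 120, "definitely": 95, "receive": 80, "desperate": 40}

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def candidates(word):
    """Known words equal to `word`, or else known words one edit away."""
    if word in WORD_FREQ:
        return {word}
    return {w for w in edits1(word) if w in WORD_FREQ}

def correction(word):
    """Most frequent candidate, or the word itself if there are no candidates."""
    options = candidates(word)
    if not options:
        return word
    return max(options, key=WORD_FREQ.get)

for misspelled in ["seperate", "recieve", "definately"]:
    print(misspelled, "->", correction(misspelled))
```

A real spell checker would use a large word-frequency list and typically also consider words two edits away; this sketch stops at one edit to stay short.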