mapply and Map in R

mapply and Map in R

R
An older post on this blog talked about several alternative base apply functions. This post will talk about how to apply a function across multiple vectors or lists with Map and mapply in R. These functions are generalizations of sapply and lapply, which allow you to more easily loop over multiple vectors or lists simultaneously. Map Suppose we have two lists of vectors and we want to divide the nth vector in one list by the nth vector in the second list. Map makes this straightforward to accomplish, while keeping the code clean to read. Map returns a list by default, similar to lapply. Below, we create two sample lists of vectors. [code lang="R"] values1 <- list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8,…
Read More
Updates to yahoo_fin package

Updates to yahoo_fin package

Python, Web Scraping
Package updates First of all - thank you to everyone who has contacted me regarding the yahoo_fin package! Due to some changes in Yahoo Finance's website, I've updated the source code of yahoo_fin. To upgrade to the latest version (0.8.4), you can use pip: [code] pip install yahoo_fin --upgrade [/code] Get weekly and monthly stock prices The most recent version includes all functionality from the previous version, but now also includes the ability to pull weekly and monthly historical stock prices in the get_data method. [code lang="python"] from yahoo_fin import stock_info as si # default daily data daily_data = si.get_data("amzn") # get weekly data weekly_data = si.get_data("amzn", interval = "1wk") # get monthly data monthly_data = si.get_data("amzn", interval = "1mo") [/code] Speed-up in functions The options module includes an update…
Read More
3 Packages to Build a Spell Checker in Python

3 Packages to Build a Spell Checker in Python

Python
This post is going to talk about three different packages for coding a spell checker in Python - pyspellchecker, TextBlob, and autocorrect. pyspellchecker The pyspellchecker package allows you to perform spelling corrections, as well as see candidate spellings for a misspelled word. To install the package, you can use pip: [code] pip install pyspellchecker [/code] Once installed, the pyspellchecker is really straightforward to use. Note that even though we use "pyspellchecker" when installing via pip, we just type "spellchecker" in the package import statement. The first piece is to create a SpellChecker object, which we'll just call "spell". [code lang="python"] from spellchecker import SpellChecker spell = SpellChecker() [/code] Now, we're ready to test this out with a few misspellings. We'll use a few words from this list of commonly misspelled…
Read More
Guide to Fuzzy Matching with Python

Guide to Fuzzy Matching with Python

Python
This post is going to delve into the textdistance package in Python, which provides a large collection of algorithms to do fuzzy matching. The textdistance package Similar to the stringdist package in R, the textdistance package provides a collection of algorithms that can be used for fuzzy matching. To install textdistance using just the pure Python implementations of the algorithms, you can use pip like below: [code] pip install textdistance [/code] However, if you want to get the best possible speed out of the algorithms, you can tweak the pip install command like this: [code] pip install textdistance[extras] [/code] Once installed, we can import textdistance like below: [code lang="python"] import textdistance [/code] Levenshtein distance Levenshtein distance measures the minimum number of insertions, deletions, and substitutions required to change one string…
Read More
How to read Word documents with Python

How to read Word documents with Python

Pandas, Python
This post will talk about how to read Word Documents with Python. We're going to cover three different packages - docx2txt, docx, and my personal favorite: docx2python. The docx2txt package Let's talk about docx2text first. This is a Python package that allows you to scrape text and images from Word Documents. The example below reads in a Word Document containing the Zen of Python. As you can see, once we've imported docx2txt, all we need is one line of code to read in the text from the Word Document. We can read in the document using a method in the package called process, which takes the name of the file as input. Regular text, listed items, hyperlink text, and table text will all be returned in a single string. [code…
Read More
Python, Basket Analysis, and Pymining

Python, Basket Analysis, and Pymining

Python
Background Python's pymining package provides a collection of useful algorithms for item set mining, association mining, and more. We'll explore some of its functionality during this post by using it to apply basket analysis to tennis. When basket analysis is discussed, it's often in the context of retail - analyzing what combinations of products are typically bought together (or in the same "basket"). For example, in grocery shopping, milk and butter may be frequently purchased together. We can take ideas from basket analysis and apply them in many other scenarios. As an example - let's say we're looking at events like tennis tournaments where each tournament has different successive rounds i.e. quarterfinals, semifinals, finals etc. How would you figure out what combinations of players typically show up in the same…
Read More
How to get an AUC confidence interval

How to get an AUC confidence interval

Machine Learning, R
Background AUC is an important metric in machine learning for classification. It is often used as a measure of a model's performance. In effect, AUC is a measure between 0 and 1 of a model's performance that rank-orders predictions from a model. For a detailed explanation of AUC, see this link. Since AUC is widely used, being able to get a confidence interval around this metric is valuable to both better demonstrate a model's performance, as well as to better compare two or more models. For example, if model A has an AUC higher than model B, but the 95% confidence interval around each AUC value overlaps, then the models may not be statistically different in performance. We can get a confidence interval around AUC using R's pROC package, which…
Read More
Really large numbers in R

Really large numbers in R

R
This post will discuss ways of handling huge numbers in R using the gmp package. The gmp package The gmp package provides us a way of dealing with really large numbers in R. For example, let's suppose we want to multiple 10250 by itself. Mathematically we know the result should be 10500. But if we try this calculation in base R we get Inf for infinity. [code lang="R"] num = 10^250 num^2 # Inf [/code] However, we can get around this using the gmp package. Here, we can convert the integer 10 to an object of the bigz class. This is an implementation that allows us to handle very large numbers. Once we convert an integer to a bigz object, we can use it to perform calculations with regular numbers…
Read More
BeautifulSoup vs. Rvest

BeautifulSoup vs. Rvest

Python, R, Web Scraping
This post will compare Python's BeautifulSoup package to R's rvest package for web scraping. We'll also talk about additional functionality in rvest (that doesn't exist in BeautifulSoup) in comparison to a couple of other Python packages (including pandas and RoboBrowser). Getting started BeautifulSoup and rvest both involve creating an object that we can use to parse the HTML from a webpage. However, one immediate difference is that BeautifulSoup is just a web parser, so it doesn't connect to webpages. rvest, on the other hand, can connect to a webpage and scrape / parse its HTML in a single package. In BeautifulSoup, our initial setup looks like this: [code lang="python"] # load packages from bs4 import BeautifulSoup import requests # connect to webpage resp = requests.get(""https://www.azlyrics.com/b/beatles.html"") # get BeautifulSoup object soup…
Read More
Testing the Collatz Conjecture with R

Testing the Collatz Conjecture with R

R
Background The Collatz Conjecture is a famous unsolved problem in number theory. If you're not familiar with it - the conjecture is very simple to understand, yet, no one has been able to mathematically prove that the conjecture is true (though it's been shown to be true for an enormous number of cases). The conjecture states the following: Start with any whole number. If the number is even, divide by two. If the number is odd, multiply the number by three and add one. Then, repeat this logic with the result number. Eventually you'll end up with the number one. Effectively, this is can written like this: For any whole number n: If n mod 2 == 0, then n = n / 2 Else n = 3 * n…
Read More