Andrew Treadway, Author at Open Source Automation

13 Nov 2019 by Andrew Treadway

Guide to Fuzzy Matching with Python

Python

This post delves into the textdistance package in Python, which provides a large collection of algorithms for fuzzy matching.

The textdistance package

Similar to the stringdist package in R, the textdistance package provides a collection of algorithms that can be used for fuzzy matching. To install textdistance using just the pure Python implementations of the algorithms, you can use pip like below:

[code]
pip install textdistance
[/code]

However, if you want to get the best possible speed out of the algorithms, you can tweak the pip install command like this:

[code]
pip install textdistance[extras]
[/code]

Once installed, we can import textdistance like below:

[code lang="python"]
import textdistance
[/code]

Levenshtein distance

Levenshtein distance measures the minimum number of insertions, deletions, and substitutions required to change one string…
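To make the definition of Levenshtein distance concrete, here is a minimal pure-Python sketch of the standard dynamic-programming approach. It is an illustration only, not the textdistance implementation; the function name `levenshtein` is chosen here for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Keeping only two rows of the table at a time holds memory proportional to the length of the second string, which is usually enough for short strings such as names.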

14 Oct 2019 by Andrew Treadway

How to read Word documents with Python

Pandas, Python

This post will talk about how to read Word documents with Python. We're going to cover three different packages - docx2txt, docx, and my personal favorite: docx2python.

The docx2txt package

Let's talk about docx2txt first. This is a Python package that allows you to scrape text and images from Word documents. The example below reads in a Word document containing the Zen of Python. As you can see, once we've imported docx2txt, all we need is one line of code to read in the text from the Word document. We can read in the document using a function in the package called process, which takes the name of the file as input. Regular text, listed items, hyperlink text, and table text will all be returned in a single string. [code…

21 Aug 2019 by Andrew Treadway

Python, Basket Analysis, and Pymining

Python

Background

Python's pymining package provides a collection of useful algorithms for item set mining, association mining, and more. We'll explore some of its functionality in this post by using it to apply basket analysis to tennis. When basket analysis is discussed, it's often in the context of retail - analyzing what combinations of products are typically bought together (or in the same "basket"). For example, in grocery shopping, milk and butter may be frequently purchased together. We can take ideas from basket analysis and apply them in many other scenarios. As an example - let's say we're looking at events like tennis tournaments where each tournament has different successive rounds, i.e. quarterfinals, semifinals, finals, etc. How would you figure out what combinations of players typically show up in the same…
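The core idea of basket analysis, counting which items appear together, can be sketched with the standard library alone. The snippet below does not use pymining; the function `frequent_pairs`, the `min_support` threshold, and the placeholder player names are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Count how often each pair of items shares a basket and keep
    the pairs that appear in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical "baskets": the players in one round of three made-up tournaments
semifinals = [
    ["Player A", "Player B", "Player C", "Player D"],
    ["Player A", "Player B", "Player E", "Player F"],
    ["Player A", "Player C", "Player E", "Player G"],
]

print(frequent_pairs(semifinals, min_support=2))
```

Counting every pair directly is fine for small baskets; dedicated item set mining algorithms exist to avoid enumerating all combinations as the data grows.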

20 Aug 2019 by Andrew Treadway

How to get an AUC confidence interval

Machine Learning, R

Background

AUC is an important metric in machine learning for classification. It is often used as a measure of a model's performance. In effect, AUC is a value between 0 and 1 that measures how well a model rank-orders its predictions. For a detailed explanation of AUC, see this link. Since AUC is widely used, being able to get a confidence interval around this metric is valuable both to better demonstrate a model's performance and to better compare two or more models. For example, if model A has an AUC higher than model B, but the 95% confidence intervals around the two AUC values overlap, then the models may not be statistically different in performance. We can get a confidence interval around AUC using R's pROC package, which…
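As a rough illustration of what a confidence interval around AUC means, the sketch below computes a percentile bootstrap interval in plain Python. This is a swapped-in technique for illustration and is not necessarily what the pROC call in the post does; the function names, the 0/1 label encoding, and the example labels and scores are all assumptions.

```python
import random


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs where the positive
    case gets the higher score; ties count as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample cases with replacement,
    recompute AUC each time, and read off the alpha/2 and 1 - alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Hypothetical labels (1 = positive class) and model scores
labels = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.90, 0.60, 0.30]

print(auc(labels, scores))
print(bootstrap_auc_ci(labels, scores))
```

With only a handful of cases the interval will be wide, which is exactly the situation where comparing two point estimates of AUC can mislead.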

16 Aug 2019 by Andrew Treadway

Really large numbers in R

R

This post will discuss ways of handling huge numbers in R using the gmp package.

The gmp package

The gmp package provides us a way of dealing with really large numbers in R. For example, let's suppose we want to multiply 10^250 by itself. Mathematically we know the result should be 10^500. But if we try this calculation in base R we get Inf for infinity.

[code lang="R"]
num = 10^250
num^2 # Inf
[/code]

However, we can get around this using the gmp package. Here, we can convert the integer 10 to an object of the bigz class. This is an implementation that allows us to handle very large numbers. Once we convert an integer to a bigz object, we can use it to perform calculations with regular numbers…
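For comparison only, the same overflow can be reproduced in Python, where floating-point numbers hit the same limit as base R's doubles while the built-in int type is arbitrary precision. This is a side-by-side sketch, not part of the gmp workflow described in the post.

```python
import math

# Floating-point multiplication overflows the double range and yields inf
as_float = 1e250 * 1e250

# Python's built-in int has arbitrary precision, so the product is exact
as_int = 10**250 * 10**250

print(math.isinf(as_float))  # True
print(as_int == 10**500)     # True
print(len(str(as_int)))      # 501 digits: a 1 followed by 500 zeros
```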

23 Jul 2019 by Andrew Treadway

BeautifulSoup vs. Rvest

Python, R, Web Scraping

This post will compare Python's BeautifulSoup package to R's rvest package for web scraping. We'll also talk about additional functionality in rvest (that doesn't exist in BeautifulSoup) in comparison to a couple of other Python packages (including pandas and RoboBrowser).

Getting started

BeautifulSoup and rvest both involve creating an object that we can use to parse the HTML from a webpage. However, one immediate difference is that BeautifulSoup is just a parser, so it doesn't connect to webpages. rvest, on the other hand, can connect to a webpage and scrape / parse its HTML in a single package. In BeautifulSoup, our initial setup looks like this:

[code lang="python"]
# load packages
from bs4 import BeautifulSoup
import requests

# connect to webpage
resp = requests.get("https://www.azlyrics.com/b/beatles.html")

# get BeautifulSoup object
soup…

12 Jul 2019 by Andrew Treadway

Testing the Collatz Conjecture with R

R

Background

The Collatz Conjecture is a famous unsolved problem in number theory. If you're not familiar with it - the conjecture is very simple to understand, yet no one has been able to mathematically prove that it is true (though it's been shown to hold for an enormous number of cases). The conjecture states the following: Start with any positive whole number. If the number is even, divide it by two. If the number is odd, multiply it by three and add one. Then, repeat this logic with the resulting number. Eventually you'll end up with the number one. Effectively, this can be written like this:

For any positive whole number n:
If n mod 2 == 0, then n = n / 2
Else n = 3 * n…
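The rule above translates almost directly into code. The post itself works in R; the following is a minimal Python sketch of the same logic, with the function name `collatz_steps` chosen here for illustration.

```python
def collatz_steps(n: int) -> int:
    """Number of Collatz steps needed to get from n down to 1."""
    if n < 1:
        raise ValueError("n must be a positive whole number")
    steps = 0
    while n != 1:
        # even: halve it; odd: triple it and add one
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


print(collatz_steps(6))   # 8  (6, 3, 10, 5, 16, 8, 4, 2, 1)
print(collatz_steps(27))  # 111

# Every start value in this range reaches 1, otherwise the loop would never return
print(all(collatz_steps(n) >= 0 for n in range(1, 10001)))  # True
```

A check like this can only confirm the conjecture for the values tested; it says nothing about whether some larger starting number fails to reach one.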

25 Jun 2019 by Andrew Treadway

How to hide a password in R with the keyring package

R

This post will introduce using the keyring package to hide a password.

Short background

The keyring package is a library designed to let you access your operating system's credential store. In essence, it lets you store and retrieve passwords in your operating system, which allows you to avoid having a password in plaintext in an R script.

Storing a password

Storing a password with keyring is really straightforward. First, we just need to load the keyring package. Then we call a function called key_set_with_value. In this function, we'll input three different parameters - service, username and password.

[code lang="R"]
# load keyring package
library(keyring)

# Store email username with password
key_set_with_value(service = "user_email",
                   username = "your_address@example.com",
                   password = "test password")
[/code]

The username and password stored are just that -…

22 Jun 2019 by Andrew Treadway

Web Browsing and Parsing with RoboBrowser and requests_html

Python, Web Scraping

Background

So you've learned all about BeautifulSoup. What's next? Python is a great language for automating web operations. In a previous article we went through how to use BeautifulSoup and requests to scrape stock-related articles from Nasdaq's website. This post talks about a couple of alternatives to using BeautifulSoup directly. One way of scraping and crawling the web is to use Python's RoboBrowser package, which is built on top of requests and BeautifulSoup. Because it's built using each of these packages, writing code to scrape the web is a bit simpler, as we'll see below. RoboBrowser works similarly to the older Python 2.x package mechanize in that it allows you to simulate a web browser. A second option is using requests_html, which was also discussed here, and which we'll also…

11 Jun 2019 by Andrew Treadway

Does “Sell in May, Go Away” really work?

R

If you follow the stock market, you've probably heard the expression "Sell in May, Go Away." This expression generally refers to the belief that the stock market goes up between the end of October and the end of April, and that one should sell at the beginning of May to avoid losses. The general recommendation according to the theory is to hold money in a money market account during the "short period" of May through October, and then reinvest in the stock market in November. But how does this myth hold up in reality? Let's use R to find out! Our analysis will look strictly at the S&P 500's performance from 1970 to the present (so we won't dive into interest rate levels, money market accounts, etc.).

Getting started…

Author: Andrew Treadway