Reading from databases with Python

Pandas, Python
Background - Reading from Databases with Python

This post covers several packages for working with databases from Python. We'll start with pyodbc, one of the more standard packages for working with databases, and then cover a very useful module called turbodbc, which lets you run SQL queries efficiently (and generally faster) within Python.

pyodbc

pyodbc can be installed using pip:

[code]
pip install pyodbc
[/code]

Let's start by writing a simple SQL query using pyodbc. To do that, we first need to connect to a specific database. The examples laid out here use a SQLite database on my machine. However, you can do this with many other database systems as well, such as SQL Server, MySQL, Oracle, etc.…
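As a rough illustration of the connect-then-query flow described above, here is a minimal sketch. It deliberately swaps in Python's built-in sqlite3 module rather than pyodbc so that it runs without an ODBC driver; both follow the Python DB-API 2.0 pattern (connect, cursor, execute, fetch). The table name `sample` and its rows are made up for the example.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for a real database file.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Create and fill a small hypothetical table so there is something to query.
cursor.execute("CREATE TABLE sample (id INTEGER, name TEXT)")
cursor.executemany(
    "INSERT INTO sample VALUES (?, ?)",
    [(1, "alpha"), (2, "beta")],
)
conn.commit()

# Run a simple SQL query and fetch all result rows as a list of tuples.
cursor.execute("SELECT id, name FROM sample ORDER BY id")
rows = cursor.fetchall()

conn.close()
```

With pyodbc, the main difference would be the first step: the connection comes from pyodbc.connect with a connection string for whichever ODBC driver is installed, and the cursor calls look much the same.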
Handling dates with Python’s maya package

Python
Background

In this post we'll discuss Python's maya package for parsing dates from strings. A previous article talked about the dateutil and dateparser libraries for finding dates in strings. maya is really great for standardizing variations in a field or list of dates. maya can be installed using pip:

pip install maya

Standardizing dates with maya

Let's start with a basic example. First, we just need to import maya. Next, we'll use its parse method to convert the text into a MayaDT object. We can then call the datetime method on the result to get a datetime from the string.

[code lang="python"]
import maya
maya.parse("march 1 2019").datetime()
[/code]

[code lang="python"]
maya.parse("9th of february 2019").datetime()
[/code]

[code lang="python"]
maya.parse("1/1/2020").datetime()
[/code]

Below are several more examples of different date variations.

[code lang="python"]
maya.parse("1-1-2020").datetime()
maya.parse("1…
How to read PDF files with Python

Python
Background

In a previous article, we talked about how to scrape tables from PDF files with Python. In this post, we'll cover how to extract text from several types of PDFs. To read PDF files with Python, we can focus most of our attention on two packages: pdfminer and pytesseract.

pdfminer (specifically pdfminer.six, a more up-to-date fork of pdfminer) is an effective package to use if you're handling PDFs that are typed and whose text you can highlight. On the other hand, to read scanned-in PDF files with Python, the pytesseract package comes in handy, as we'll see later in the post.

Scraping highlightable text

For the first example, let's scrape a 10-K form from Apple (see here). First, we'll just download this file to a…
How to import Python classes into R

Pandas, Python, R
Background

This post is going to talk about how to import Python classes into R, which can be done using a really awesome package in R called reticulate. reticulate allows you to call Python code from R, including sourcing Python scripts, using Python packages, and porting functions and classes. To install reticulate, we can run:

[code lang="R"]
install.packages("reticulate")
[/code]

Creating a Python class

Let's create a simple class in Python.

[code lang="python"]
import pandas as pd

# define a python class
class explore:

    def __init__(self, df):
        self.df = df

    def metrics(self):
        desc = self.df.describe()
        return desc

    def dummify_vars(self):
        for field in self.df.columns:
            if isinstance(self.df[field][0], str):
                temp = pd.get_dummies(self.df[field])
                self.df = pd.concat([self.df, temp], axis = 1)
                self.df.drop(columns = [field], inplace = True)
[/code]

Porting the Python class to R

There's a…
Updates to yahoo_fin package

Python, Web Scraping
Package updates

First of all, thank you to everyone who has contacted me regarding the yahoo_fin package! Due to some changes in Yahoo Finance's website, I've updated the source code of yahoo_fin. To upgrade to the latest version (0.8.4), you can use pip:

[code]
pip install yahoo_fin --upgrade
[/code]

Get weekly and monthly stock prices

The most recent version includes all functionality from the previous version, and the get_data method can now also pull weekly and monthly historical stock prices.

[code lang="python"]
from yahoo_fin import stock_info as si

# default daily data
daily_data = si.get_data("amzn")

# get weekly data
weekly_data = si.get_data("amzn", interval = "1wk")

# get monthly data
monthly_data = si.get_data("amzn", interval = "1mo")
[/code]

Speed-up in functions

The options module includes an update…
3 Packages to Build a Spell Checker in Python

Python
This post is going to talk about three different packages for coding a spell checker in Python: pyspellchecker, TextBlob, and autocorrect.

pyspellchecker

The pyspellchecker package allows you to perform spelling corrections, as well as see candidate spellings for a misspelled word. To install the package, you can use pip:

[code]
pip install pyspellchecker
[/code]

Once installed, pyspellchecker is really straightforward to use. Note that even though we use "pyspellchecker" when installing via pip, we just type "spellchecker" in the package import statement. The first step is to create a SpellChecker object, which we'll just call "spell".

[code lang="python"]
from spellchecker import SpellChecker

spell = SpellChecker()
[/code]

Now, we're ready to test this out with a few misspellings. We'll use a few words from this list of commonly misspelled…
Guide to Fuzzy Matching with Python

Python
This post is going to delve into the textdistance package in Python, which provides a large collection of algorithms to do fuzzy matching.

The textdistance package

Similar to the stringdist package in R, the textdistance package provides a collection of algorithms that can be used for fuzzy matching. To install textdistance using just the pure Python implementations of the algorithms, you can use pip like below:

[code]
pip install textdistance
[/code]

However, if you want to get the best possible speed out of the algorithms, you can tweak the pip install command like this:

[code]
pip install textdistance[extras]
[/code]

Once installed, we can import textdistance like below:

[code lang="python"]
import textdistance
[/code]

Levenshtein distance

Levenshtein distance measures the minimum number of insertions, deletions, and substitutions required to change one string…
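To make the definition of Levenshtein distance concrete, here is a minimal pure-Python sketch of the standard dynamic-programming approach. It is not textdistance's implementation, only an illustration of the idea; the function name `levenshtein` is chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # previous[j] = distance between the processed prefix of a and b[:j].
    # Before any character of a is processed, that is just j insertions.
    previous = list(range(len(b) + 1))

    for i, char_a in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep on a match)
            ))
        previous = current

    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Keeping only two rows of the table at a time holds memory use proportional to the length of the second string, which is a common choice when only the distance itself is needed.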
How to read Word documents with Python

Pandas, Python
This post will talk about how to read Word documents with Python. We're going to cover three different packages: docx2txt, docx, and my personal favorite, docx2python.

The docx2txt package

Let's talk about docx2txt first. This is a Python package that allows you to scrape text and images from Word documents. The example below reads in a Word document containing the Zen of Python. As you can see, once we've imported docx2txt, all we need is one line of code to read in the text from the Word document. We can read in the document using a method in the package called process, which takes the name of the file as input. Regular text, listed items, hyperlink text, and table text will all be returned in a single string.

[code…
Python, Basket Analysis, and Pymining

Python
Background

Python's pymining package provides a collection of useful algorithms for item set mining, association mining, and more. We'll explore some of its functionality in this post by using it to apply basket analysis to tennis.

When basket analysis is discussed, it's often in the context of retail: analyzing what combinations of products are typically bought together (or in the same "basket"). For example, in grocery shopping, milk and butter may be frequently purchased together. We can take ideas from basket analysis and apply them in many other scenarios. As an example, let's say we're looking at events like tennis tournaments, where each tournament has successive rounds (quarterfinals, semifinals, finals, etc.). How would you figure out what combinations of players typically show up in the same…
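The core idea of basket analysis, counting how often items appear together, can be sketched with the standard library alone. The example below is plain pair counting and not one of pymining's algorithms; the baskets and the player labels "A" through "D" are made up for illustration, as is the threshold of 2.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: each set is one "basket", for example the players
# who reached the same round of a tournament.
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "C"},
    {"A", "B", "D"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair one canonical order, e.g. ("A", "B").
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(baskets)

# Keep only the pairs that appear together in at least 2 baskets.
frequent = {pair: n for pair, n in counts.items() if n >= 2}
print(frequent)
```

Counting every pair directly works for small data like this, but the number of combinations grows quickly as baskets get larger, which is the kind of cost dedicated item set mining algorithms are designed to reduce.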
BeautifulSoup vs. Rvest

Python, R, Web Scraping
This post will compare Python's BeautifulSoup package to R's rvest package for web scraping. We'll also compare additional functionality in rvest (that doesn't exist in BeautifulSoup) with a couple of other Python packages, including pandas and RoboBrowser.

Getting started

BeautifulSoup and rvest both involve creating an object that we can use to parse the HTML from a webpage. However, one immediate difference is that BeautifulSoup is just a parser, so it doesn't connect to webpages. rvest, on the other hand, can connect to a webpage and scrape / parse its HTML in a single package. In BeautifulSoup, our initial setup looks like this:

[code lang="python"]
# load packages
from bs4 import BeautifulSoup
import requests

# connect to webpage
resp = requests.get("https://www.azlyrics.com/b/beatles.html")

# get BeautifulSoup object
soup…