Faster alternatives to pandas

Pandas, Python
Background

If you've done any data analysis in Python, chances are you've used pandas. It is widely used in the data world, but if you've run into memory or speed issues with it, you're not alone. This post discusses several faster alternatives to pandas.

R's data.table in Python

If you've used R, you're probably familiar with the data.table package. A port of this library, datatable, is also available in Python. In this example, we show how you can read in a CSV file faster than with standard pandas. For our purposes, we'll be using an open source dataset from the UCI repository.

[code lang="python"]
import time

import datatable

# Time how long datatable takes to read the CSV file
start = time.time()
os_scan_data = datatable.fread("OS Scan_dataset.csv", header = None)
end = time.time()

print(end - start)
[/code]

Using datatable, we can read in…
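The timing pattern above can be wrapped in a small helper so that different readers are measured the same way. This is a minimal sketch, not part of the original post: the names `time_read` and `read_with_csv_module` are hypothetical, and the sample file is a throwaway CSV generated on the spot rather than the UCI dataset.

```python
import csv
import os
import tempfile
import time


def time_read(reader, path):
    """Call reader(path) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = reader(path)
    elapsed = time.perf_counter() - start
    return result, elapsed


def read_with_csv_module(path):
    """Baseline reader built on the standard library's csv module."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


# Write a small throwaway CSV so the sketch runs without the UCI dataset.
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as tmp:
    writer = csv.writer(tmp)
    for i in range(1000):
        writer.writerow([i, i * 2, i * 3])
    sample_path = tmp.name

rows, seconds = time_read(read_with_csv_module, sample_path)
print(f"read {len(rows)} rows in {seconds:.4f} seconds")

# Clean up the temporary file.
os.remove(sample_path)
```

If pandas and datatable are installed, the same helper should accept `pd.read_csv` or `datatable.fread` as the `reader` argument, which makes a side-by-side comparison on your own file straightforward.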

Automated EDA with Python

Pandas, Python
In this post, we will investigate the pandas_profiling and sweetviz packages, which can be used to speed up EDA (exploratory data analysis) with Python. In a previous article, we talked about an analogous package in R (see this link).

Getting started with pandas_profiling

pandas_profiling can be installed using pip, like this:

[code]
pip install pandas-profiling[notebook]
[/code]

Next, let's read in our dataset. The data we'll be using is a heart attack-related dataset, which can be found here.

[code lang="python"]
import pandas as pd

heart_data = pd.read_csv("heart.csv")
heart_data.head()
[/code]

Now, let's import ProfileReport from pandas_profiling.

[code lang="python"]
from pandas_profiling import ProfileReport

report = ProfileReport(heart_data, title = "Sample Report")
report
[/code]

If you're running this code in Jupyter Notebook, you should see the report generated within your notebook file. The report shows…
Reading from databases with Python

Pandas, Python
Background - Reading from Databases with Python

This post will talk about several packages for working with databases using Python. We'll start by covering pyodbc, which is one of the more standard packages used for working with databases, but we'll also cover a very useful module called turbodbc, which allows you to run SQL queries efficiently (and generally faster) within Python.

pyodbc

pyodbc can be installed using pip:

[code]
pip install pyodbc
[/code]

Let's start by writing a simple SQL query using pyodbc. To do that, we first need to connect to a specific database. In the examples laid out here, we will be using a SQLite database on my machine. However, you can do this with many other database systems as well, such as SQL Server, MySQL, Oracle, etc.…
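To make the connect-then-query flow concrete without needing an ODBC driver, here is a minimal sketch that swaps in Python's built-in sqlite3 module in place of pyodbc. Both follow the Python DB-API 2.0 pattern (connect, get a cursor, execute, fetch), so the overall shape should carry over. The `sales` table and its rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database; the table and its rows are made up for illustration.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

cursor.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cursor.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("east", 120.0), ("west", 75.5), ("east", 30.0)],
)
conn.commit()

# Parameterized query: the ? placeholder keeps values out of the SQL string.
cursor.execute(
    "SELECT region, SUM(amount) FROM sales WHERE region = ? GROUP BY region",
    ("east",),
)
east_totals = cursor.fetchall()
print(east_totals)

conn.close()
```

With pyodbc, the main difference is the connection step, which takes an ODBC connection string for your driver and database instead of a file path.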
How to import Python classes into R

Pandas, Python, R
Background

This post is going to talk about how to import Python classes into R, which can be done using a really awesome package in R called reticulate. reticulate allows you to call Python code from R, including sourcing Python scripts, using Python packages, and porting functions and classes. To install reticulate, we can run:

[code lang="R"]
install.packages("reticulate")
[/code]

Creating a Python class

Let's create a simple class in Python.

[code lang="python"]
import pandas as pd

# define a python class
class explore:

    def __init__(self, df):
        # store the data frame on the instance
        self.df = df

    def metrics(self):
        # summary statistics from pandas' describe method
        desc = self.df.describe()
        return desc

    def dummify_vars(self):
        # replace each string column with dummy (one-hot) columns
        for field in self.df.columns:
            if isinstance(self.df[field][0], str):
                temp = pd.get_dummies(self.df[field])
                self.df = pd.concat([self.df, temp], axis = 1)
                self.df.drop(columns = [field], inplace = True)
[/code]

Porting the Python class to R

There's a…
How to read Word documents with Python

Pandas, Python
This post will talk about how to read Word documents with Python. We're going to cover three different packages - docx2txt, docx, and my personal favorite: docx2python.

The docx2txt package

Let's talk about docx2txt first. This is a Python package that allows you to scrape text and images from Word documents. The example below reads in a Word document containing the Zen of Python. As you can see, once we've imported docx2txt, all we need is one line of code to read in the text from the Word document. We can read in the document using a function in the package called process, which takes the name of the file as input. Regular text, listed items, hyperlink text, and table text will all be returned in a single string. [code…
2 packages for extracting dates from a string of text in Python

Pandas, Python
This post will cover two different ways to extract dates from strings with Python. The focus here is on strings that contain additional text, not just the date. Scraping a date out of text can be useful in many different situations.

Option 1) dateutil

The first option we'll show is using the dateutil package. Here's an example:

[code lang="python"]
from dateutil.parser import parse

parse("Today is 12-01-18", fuzzy_with_tokens=True)
[/code]

Above, we use a function in dateutil called parse. The first parameter to this function is the string of text we want to search for a date. The second parameter, fuzzy_with_tokens, is set equal to True - this causes the function to return where the date is found in the string. In other words,…
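As a small follow-up sketch, the result of the call can be unpacked into its two parts. With fuzzy_with_tokens=True, dateutil's parse returns a tuple: the parsed datetime, plus a tuple of the text fragments it skipped over. The variable names below are my own.

```python
from dateutil.parser import parse

text = "Today is 12-01-18"

# With fuzzy_with_tokens=True, parse returns a (datetime, tokens) tuple.
found_date, skipped_tokens = parse(text, fuzzy_with_tokens=True)

print(found_date)       # the parsed datetime
print(skipped_tokens)   # tuple of the text fragments not used for the date
```

Note that dateutil reads an ambiguous date like 12-01-18 as month-day-year by default; passing dayfirst=True or yearfirst=True to parse changes that interpretation.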

Data Analysis with Python Course: How to read, wrangle, and analyze data

Pandas, Python
For anyone in the NYC area, I will be holding an in-person training session December 3rd on doing data analysis with Python. We will be covering the pandas, pyodbc, and matplotlib packages. Please register at Eventbrite here: https://www.eventbrite.com/e/data-analysis-with-python-how-to-read-wrangle-and-analyze-data-tickets-51945542516.

Overview

Learn how to apply Python to read, wrangle, visualize, and analyze data! This course provides a hands-on session where we'll walk through a prepared curriculum on doing data analysis with Python. All code and practice exercises during the session will be made available after the course is complete.

About the course

During this hands-on class, you will learn the fundamentals of doing data analysis in Python, the powerful pandas package, and pyodbc for connecting to databases. We will walk through using Python to analyze and answer key questions on sales…
Parsing Dates with Pandas

Pandas, Python
The pandas package is one of the most powerful Python packages available. One useful feature of pandas is its Timestamp class, which can convert strings in a variety of formats to dates. The problem we're trying to solve in this article is how to parse dates from strings that may contain additional text / words. We will look at this problem using pandas.

In the first step, we'll load the pandas package.

[code lang="python"]
'''Load pandas package '''
import pandas as pd
[/code]

Next, let's create a sample string containing a made-up date with other text. For now, assume the dates will not contain spaces (we will re-examine this later). Taking this assumption, we use the split method, available for strings in Python, to create a list of…
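One way the split-based idea could be sketched is shown below. This is an assumption about the general approach, not necessarily the exact code from the full post: split the string on whitespace, try pd.Timestamp on each piece, and keep whatever parses. The sample sentence and the `find_dates` helper are hypothetical.

```python
import pandas as pd

# Made-up sample text; the date contains no spaces, per the assumption above.
text = "The meeting was moved to 2018-12-01 unexpectedly"


def find_dates(sentence):
    """Return a Timestamp for each whitespace-separated token that parses as a date."""
    found = []
    for token in sentence.split():
        try:
            found.append(pd.Timestamp(token))
        except ValueError:
            # Not a date; skip this token.
            continue
    return found


dates = find_dates(text)
print(dates)
```

One caveat with this kind of sketch: pandas treats the words "today" and "now" as the current time, so text containing them would produce extra matches.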