Blog

Software Engineering for Data Scientists (New book!)

Machine Learning, Python
I'm very excited to announce that the early-access preview (MEAP) of my upcoming book, Software Engineering for Data Scientists, is available now! Check it out at this link. Use promo code au35tre to save 30% on this book and any other products sold by Manning.

Why Software Engineering for Data Scientists?

Data science and software engineering have been merging more and more, especially over the last decade. Software Engineering for Data Scientists will help you learn more about software engineering and how it can make your life easier as a data scientist! This book covers the following key topics:

Source control
How to implement exception handling and write robust code
Object-oriented programming for data scientists
How to monitor the progress of training machine learning models
Scaling your Python…
Read More
How to stop long-running code in Python

Python
Ever had long-running code and no idea when it's going to finish? If so, Python's stopit library is for you. In a previous post, we talked about how to create a progress bar to monitor Python code. This post will show you how to automatically stop long-running code with the stopit package.

Getting started with stopit

To get started with stopit, you can install it via pip:

[code]
pip install stopit
[/code]

In our first example, we'll use a context manager to stop the code we want to execute after a timeout limit is reached.

[code lang="python"]
import stopit

with stopit.ThreadingTimeout(5) as context_manager:

    # sample code we want to run...
    for i in range(10**8):
        i = i * 2

# Did code finish running in under…
Read More
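To make the idea of a timeout limit concrete, here is a minimal standard-library sketch of a loop that stops itself once a deadline passes. It is not stopit and not how stopit works internally; it is a swapped-in, cooperative technique where the code checks the clock itself. The function name run_with_deadline and its parameters are hypothetical, chosen only for this illustration.

```python
import time


def run_with_deadline(n_iterations, timeout_seconds):
    """Run a toy loop, stopping early once timeout_seconds have elapsed.

    Hypothetical helper for illustration only (not part of stopit).
    Returns a tuple: (finished, iterations_done).
    """
    deadline = time.monotonic() + timeout_seconds
    done = 0
    for i in range(n_iterations):
        # Check the clock only every 10,000 iterations to keep overhead low.
        if i % 10_000 == 0 and time.monotonic() >= deadline:
            return False, done
        _ = i * 2  # stand-in for the real work
        done += 1
    return True, done


finished, done = run_with_deadline(10**5, timeout_seconds=5)
print(finished, done)
```

The trade-off is that the loop has to cooperate by checking the deadline, whereas a context manager such as the one in the excerpt wraps code you do not want to modify.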
Faster alternatives to pandas

Pandas, Python
Background

If you've done any type of data analysis in Python, chances are you've used pandas. Although pandas is widely used in the data world, you're not alone if you've run into space or computational issues with it. This post discusses several faster alternatives to pandas.

R's data table in Python

If you've used R, you're probably familiar with the data.table package. A port of this library is also available in Python. In this example, we show how you can read in a CSV file faster than using standard pandas. For our purposes, we'll be using an open source dataset from the UCI repository.

[code lang="python"]
import time

import datatable

# time how long it takes to read the CSV file
start = time.time()
os_scan_data = datatable.fread("OS Scan_dataset.csv", header = None)
end = time.time()
print(end - start)
[/code]

Using datatable, we can read in…
Read More
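The start/end timing pattern in the excerpt above can be wrapped in a small reusable helper so that different readers are timed the same way. The sketch below uses only the standard library: the timed context manager, the timings dictionary, and the throwaway CSV file are all assumptions made for illustration, and csv.reader stands in for whichever reader you want to measure.

```python
import csv
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent inside the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


# Build a small throwaway CSV so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as f:
    csv.writer(f).writerows([i, i * 2] for i in range(1000))
    path = f.name

timings = {}
with timed("csv.reader", timings):
    with open(path, newline="") as f:
        rows = list(csv.reader(f))

os.remove(path)  # clean up the temporary file
print(len(rows), timings)
```

To compare libraries, wrap each read call in its own timed block with a different label and compare the values in timings.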

Automated EDA with Python

Pandas, Python
In this post, we will investigate the pandas_profiling and sweetviz packages, which can be used to speed up EDA (exploratory data analysis) with Python. In a previous article, we talked about an analogous package in R (see this link).

Getting started with pandas_profiling

pandas_profiling can be installed using pip, like this:

[code]
pip install pandas-profiling[notebook]
[/code]

Next, let's read in our dataset. The data we'll be using is a heart attack-related dataset, which can be found here.

[code lang="python"]
import pandas as pd

heart_data = pd.read_csv("heart.csv")
heart_data.head()
[/code]

Now, let's import ProfileReport from pandas_profiling.

[code lang="python"]
from pandas_profiling import ProfileReport

report = ProfileReport(heart_data, title = "Sample Report")
report
[/code]

If you're running this code in Jupyter Notebook, you should see the report generated within your notebook file. The report shows…
Read More
How to plot XGBoost trees in R

Machine Learning, R
In this post, we're going to cover how to plot XGBoost trees in R. XGBoost is a very popular machine learning algorithm, which is frequently used in Kaggle competitions and has many practical use cases.

Let's start by loading the packages we'll need. Note that plotting XGBoost trees requires the DiagrammeR package to be installed, so even if you have xgboost installed already, you'll need to make sure you have DiagrammeR also.

[code lang="R"]
# load libraries
library(xgboost)
library(caret)
library(dplyr)
library(DiagrammeR)
[/code]

Next, let's read in our dataset. In this post, we'll be using this customer churn dataset. The label we'll be trying to predict is called "Exited" and is a binary variable with 1 meaning the customer churned (canceled account) vs. 0 meaning the customer did not churn (did…
Read More
Python collections tutorial

Python
In this post, we'll discuss the underrated Python collections module, which is part of the standard library. collections lets you use several data structures beyond those built into base Python.

How to get a count of all the elements in a list

One very useful tool in collections is the Counter class, which you can use to return a count of all the elements in a list.

[code lang="python"]
import collections

nums = [3, 3, 4, 1, 10, 10, 10, 10, 5]
collections.Counter(nums)
[/code]

The Counter object that gets returned is also modifiable. Let's define a variable equal to the result above.

[code lang="python"]
counts = collections.Counter(nums)
counts[20] += 1
[/code]

Notice how we can add the number 20 to our Counter object without having to initialize it with a 0 value. Counter can…
Read More
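As a short follow-up to the Counter excerpt above, the sketch below reuses the same nums list and shows two other standard Counter features: most_common and adding Counters together. The variable names top_two and combined are just illustrative.

```python
from collections import Counter

nums = [3, 3, 4, 1, 10, 10, 10, 10, 5]
counts = Counter(nums)

# Missing keys read as 0, so an unseen element can be incremented directly.
counts[20] += 1

# most_common(n) returns the n highest counts as (element, count) pairs.
top_two = counts.most_common(2)
print(top_two)  # [(10, 4), (3, 2)]

# Counters can be added together; counts for shared keys are summed.
combined = counts + Counter([3, 5, 5])
print(combined[3], combined[5])  # 3 3
```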
How to create PDF files with Python

Python
In a previous article, we talked about several ways to read PDF files with Python. This post will cover two packages for creating PDF files with Python: pdfkit and ReportLab.

Create PDF files with Python and pdfkit

pdfkit was the first library I learned for creating PDF files. A nice feature of pdfkit is that you can use it to create PDF files from URLs. To get started, you'll need to install it along with a utility called wkhtmltopdf. Use pip to install pdfkit from PyPI:

[code]
pip install pdfkit
[/code]

Once you're set up, you can start using pdfkit. In the example below, we download Wikipedia's main page as a PDF file. To get pdfkit working, you'll need to either add wkhtmltopdf to your PATH, or configure…
Read More
Faster data exploration with DataExplorer

R
Data exploration is an important part of the modeling process. It can also take up a fair amount of time. The awesome DataExplorer package in R aims to make this process easier. To get started with DataExplorer, you'll need to install it like below:

[code lang="R"]
install.packages("DataExplorer")
[/code]

Let's use DataExplorer to explore a dataset on diabetes.

[code lang="R"]
# load DataExplorer
library(DataExplorer)

# read in dataset
diabetes_data <- read.csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv", header = FALSE)

# fix column names
names(diabetes_data) <- c("number_of_times_pregnant", "plasma_glucose_conc",
                          "diastolic_bp", "triceps_skinfold_thickness",
                          "two_hr_serum_insulin", "bmi",
                          "diabetes_pedigree_function", "age", "label")

# create report
create_report(diabetes_data)
[/code]

Running the create_report line of code above will generate an HTML report file containing a collection of useful information about the data. This includes:

Basic statistics, such as number of rows and columns, number of columns with…
Read More
How to get stock earnings data with Python

Python
In this post, we'll walk through a few examples for getting stock earnings data with Python. We will be using yahoo_fin, which was recently updated. The latest version now includes functionality to easily pull earnings calendar information for individual stocks or dates.

If you need to install yahoo_fin, you can use pip:

[code]
pip install yahoo_fin
[/code]

If you already have it installed and need to upgrade, you can update your version like this:

[code]
pip install yahoo_fin --upgrade
[/code]

To get started, let's import yahoo_fin:

[code lang="python"]
import yahoo_fin.stock_info as si
[/code]

Getting stock earnings calendar data

The first function we'll cover is get_earnings_history, which returns a list of dictionaries. Each dictionary contains an earnings date along with EPS actual / expected information. Let's test it out…
Read More
Technical analysis with Python

Python
In this post, we will introduce how to do technical analysis with Python. Python has several libraries for performing technical analysis of investments. We're going to compare three libraries: ta, pandas_ta, and bta-lib.

The ta library for technical analysis

One of the nicest features of the ta package is that it allows you to add dozens of technical indicators all at once. To get started, install the ta library using pip:

[code]
pip install ta
[/code]

Next, let's import the packages we need. We'll be using yahoo_fin to pull in stock price data. After running the code below, data contains the historical prices for AAPL.

[code lang="python"]
# load packages
import yahoo_fin.stock_info as si
import pandas as pd
from ta import add_all_ta_features

# pull data from Yahoo Finance
data = si.get_data("aapl")
[/code]

Next,…
Read More
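For readers new to technical indicators, here is a minimal pure-Python sketch of one of the simplest, a simple moving average. It does not use ta, pandas_ta, or bta-lib and is not their API; the function simple_moving_average and the closes list are hypothetical, and the prices are made-up numbers rather than real market data.

```python
def simple_moving_average(prices, window):
    """Return the simple moving average of prices over a fixed window.

    Positions with fewer than `window` prices available are filled with None.
    Hypothetical helper for illustration; not part of any library named above.
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")

    averages = []
    running_total = 0.0
    for i, price in enumerate(prices):
        running_total += price
        if i >= window:
            # Drop the price that just left the window.
            running_total -= prices[i - window]
        if i >= window - 1:
            averages.append(running_total / window)
        else:
            averages.append(None)
    return averages


# Made-up closing prices, used only to show the shape of the output.
closes = [10.0, 11.0, 12.0, 13.0, 14.0]
print(simple_moving_average(closes, 3))  # [None, None, 11.0, 12.0, 13.0]
```

Libraries like the ones compared in this post compute many such indicators for you, typically as new columns on a DataFrame of prices.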