Software Engineering for Data Scientists (New book!)

Machine Learning, Python
I'm very excited to announce that the early-access preview (MEAP) of my upcoming book, Software Engineering for Data Scientists, is available now! Check it out at this link. Use promo code au35tre to save 30% on this book and any products sold by Manning.

Why Software Engineering for Data Scientists?

Data science and software engineering have been merging more and more, especially over the last decade. Software Engineering for Data Scientists will help you learn more about software engineering and how it can make your life easier as a data scientist! This book covers the following key topics:

Source control
How to implement exception handling and write robust code
Object-oriented programming for data scientists
How to monitor the progress of training machine learning models
Scaling your Python…
How to stop long-running code in Python

Python
Ever had long-running code with no idea when it's going to finish? If you have, then Python's stopit library is for you. In a previous post, we talked about how to create a progress bar to monitor Python code. This post will show you how to automatically stop long-running code with the stopit package.

Getting started with stopit

To get started with stopit, you can install it via pip:

[code]
pip install stopit
[/code]

In our first example, we'll use a context manager to stop the code we want to execute after a timeout limit is reached.

[code lang="python"]
import stopit

with stopit.ThreadingTimeout(5) as context_manager:

    # sample code we want to run...
    for i in range(10**8):
        i = i * 2

# Did code finish running in under…
Faster alternatives to pandas

Pandas, Python
Background

If you've done any type of data analysis in Python, chances are you've used pandas. Though it is widely used in the data world, if you've run into memory or computational issues with it, you're not alone. This post discusses several faster alternatives to pandas.

R's data.table in Python

If you've used R, you're probably familiar with the data.table package. A port of this library is also available in Python. In this example, we show how you can read in a CSV file faster than with standard pandas. For our purposes, we'll be using an open source dataset from the UCI repository.

[code lang="python"]
import time
import datatable

# time how long datatable takes to read the CSV file
start = time.time()
os_scan_data = datatable.fread("OS Scan_dataset.csv", header = None)
end = time.time()
print(end - start)
[/code]

Using datatable, we can read in…
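The start/end timing pattern in the datatable snippet can be wrapped in a small reusable helper. This is only a minimal sketch and not part of datatable or pandas: the `timed` helper name is hypothetical, and it reads a tiny in-memory CSV (made-up values) with the standard library's csv module so that it runs without any extra dependencies.

```python
import csv
import io
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds for the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


# Tiny in-memory CSV standing in for a real file (made-up values)
sample_csv = "a,b\n1,2\n3,4\n"

timings = {}
with timed("csv.reader", timings):
    rows = list(csv.reader(io.StringIO(sample_csv)))

print(rows)
print(timings)
```

To compare readers, you could wrap each read call in its own `timed` block with a different label and then compare the entries in `timings`.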

Automated EDA with Python

Pandas, Python
In this post, we will investigate the pandas_profiling and sweetviz packages, which can be used to speed up EDA (exploratory data analysis) with Python. In a previous article, we talked about an analogous package in R (see this link).

Getting started with pandas_profiling

pandas_profiling can be installed using pip, like this:

[code]
pip install pandas-profiling[notebook]
[/code]

Next, let's read in our dataset. The data we'll be using is a heart attack-related dataset, which can be found here.

[code lang="python"]
import pandas as pd

heart_data = pd.read_csv("heart.csv")
heart_data.head()
[/code]

Now, let's import ProfileReport from pandas_profiling.

[code lang="python"]
from pandas_profiling import ProfileReport

report = ProfileReport(heart_data, title = "Sample Report")
report
[/code]

If you're running this code in Jupyter Notebook, you should see the report generated within your notebook file. The report shows…
Python collections tutorial

Python
In this post, we'll discuss the underrated Python collections module, which is part of the standard library. collections provides several data structures beyond Python's built-in types.

How to get a count of all the elements in a list

One very useful tool in collections is the Counter class, which you can use to return a count of all the elements in a list.

[code lang="python"]
import collections

nums = [3, 3, 4, 1, 10, 10, 10, 10, 5]
collections.Counter(nums)
[/code]

The Counter object that gets returned is also modifiable. Let's define a variable equal to the result above.

[code lang="python"]
counts = collections.Counter(nums)
counts[20] += 1
[/code]

Notice how we can add the number 20 to our Counter object without having to initialize it with a 0 value. Counter can…
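As a quick, self-contained illustration of the Counter behavior described above, here is a minimal sketch using the same nums list. The `top_two` variable name is just for illustration; `most_common` is a standard Counter method.

```python
import collections

nums = [3, 3, 4, 1, 10, 10, 10, 10, 5]

# Count how many times each element appears in the list
counts = collections.Counter(nums)

# Missing keys are treated as 0, so we can increment without initializing
counts[20] += 1

# most_common returns (element, count) pairs, highest counts first
top_two = counts.most_common(2)
print(top_two)  # [(10, 4), (3, 2)]
```

Looking up an element that was never counted, such as `counts[99]`, returns 0 rather than raising a KeyError.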
How to create PDF files with Python

Python
In a previous article, we talked about several ways to read PDF files with Python. This post will cover two packages used to create PDF files with Python: pdfkit and ReportLab.

Create PDF files with Python and pdfkit

pdfkit was the first library I learned for creating PDF files. A nice feature of pdfkit is that you can use it to create PDF files from URLs. To get started, you'll need to install it along with a utility called wkhtmltopdf. Use pip to install pdfkit from PyPI:

[code]
pip install pdfkit
[/code]

Once you're set up, you can start using pdfkit. In the example below, we download Wikipedia's main page as a PDF file. To get pdfkit working, you'll need to either add wkhtmltopdf to your PATH, or configure…
How to get stock earnings data with Python

Python
In this post, we'll walk through a few examples for getting stock earnings data with Python. We will be using yahoo_fin, which was recently updated. The latest version now includes functionality to easily pull earnings calendar information for individual stocks or dates. If you need to install yahoo_fin, you can use pip:

[code]
pip install yahoo_fin
[/code]

If you already have it installed and need to upgrade, you can update your version like this:

[code]
pip install yahoo_fin --upgrade
[/code]

To get started, let's import yahoo_fin:

[code lang="python"]
import yahoo_fin.stock_info as si
[/code]

Getting stock earnings calendar data

The first function we'll cover is get_earnings_history, which returns a list of dictionaries. Each dictionary contains an earnings date along with EPS actual / expected information. Let's test it out…
Technical analysis with Python

Python
In this post, we will introduce how to do technical analysis with Python. Python has several libraries for performing technical analysis of investments. We're going to compare three libraries: ta, pandas_ta, and bta-lib.

The ta library for technical analysis

One of the nicest features of the ta package is that it allows you to add dozens of technical indicators all at once. To get started, install the ta library using pip:

[code]
pip install ta
[/code]

Next, let's import the packages we need. We'll be using yahoo_fin to pull in stock price data.

[code lang="python"]
# load packages
import yahoo_fin.stock_info as si
import pandas as pd
from ta import add_all_ta_features

# pull data from Yahoo Finance
data = si.get_data("aapl")
[/code]

Now, data contains the historical prices for AAPL. Next,…
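To make the idea of a technical indicator concrete, here is a minimal sketch of one of the simplest ones, a simple moving average, written in plain Python. This is not the ta library's API: the `simple_moving_average` function name is hypothetical and the sample prices are made up for illustration.

```python
def simple_moving_average(prices, window):
    """Return the moving average at each position (None until a full window exists)."""
    if window <= 0:
        raise ValueError("window must be positive")

    averages = []
    running_total = 0.0
    for i, price in enumerate(prices):
        running_total += price
        # Drop the price that just fell out of the window
        if i >= window:
            running_total -= prices[i - window]
        if i >= window - 1:
            averages.append(running_total / window)
        else:
            averages.append(None)
    return averages


# Made-up closing prices, for illustration only
prices = [10.0, 11.0, 12.0, 13.0, 14.0]
sma = simple_moving_average(prices, 3)
print(sma)  # [None, None, 11.0, 12.0, 13.0]
```

Libraries like the ones compared in this post compute many indicators of this kind for you, typically on top of a pandas DataFrame of prices.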
Python’s rich library – a tutorial

Python, System Administration
The Python rich library is a package for producing clearer, styled, and colored output in the terminal. rich works across multiple operating systems, including Windows, Linux, and macOS. In this post, we'll give an introduction to what it can do for you. You can get started with rich by installing it with pip.

[code]
pip install rich
[/code]

Once you have it installed, open up the command line and type in python. To get the additional functionality from rich, you'll need to do one more step, which you can see below. Running this snippet will allow you to have styled / formatted code interactively. You'll only need to do this once.

[code lang="python"]
from rich import pretty
pretty.install()
[/code]

Here are a couple of examples of automatic coloring for…
3 ways to do RPA with Python

Python, System Administration
In this post, we'll cover a few packages for doing robotic process automation with Python. Robotic process automation, or RPA, is the process of automating mouse clicks and keyboard presses, i.e. simulating what a human user would do. RPA is used in a variety of applications, including data entry, accounting, finance, and more. We'll be covering pynput, pyautogui, and pywinauto. Each of these three packages can be used as a starting point for building your own RPA application, as well as for building UI testing apps.

pynput

The first package we'll discuss is pynput. One of the advantages of pynput is that it works on both Windows and macOS. Another nice feature is that it has functionality to monitor keyboard and mouse input. Let's get started with pynput by installing…