If you’re an avid R user, you probably use the famous dplyr package. Python has a package meant to be similar to dplyr, called dplython. This article will give an introduction for how to use dplython.
For the examples below, we’ll use a sample dataset that comes with R giving attributes about the US states, including population, area, and income levels. You can see the dataset by clicking here.
dplython can be installed using pip:.
pip install dplython
Once the package is installed, let’s load a few methods from it, and read in our dataset.
# load packages from dplython import select, DplyFrame, X, arrange, count, sift, head, summarize, group_by, tail, mutate import pandas as pd # read in data state_df = pd.read_csv("state_info.txt")
After we’ve read in our data, we need to convert it to an object called a DplyFrame, which we do using a method from dplython. This DplyFrame object will allow us to perform “dplyr-like” operations.
state_info = DplyFrame(state_df)
With dplython, selecting columns is done like below:
state_info >> select(X.State_Name, X.State_Region, X.Population)
Notice how we append X to each field name, which is a little different from the select command in dplyr. We also use >> instead of %>% as a pipe operator. If you’re not as familiar with dplyr from R, the pipe operator basically allows to you apply a function to the output of a function or object to the left the operator (the >> in dplython). In the case above, the select function is just being applied to state_info.
To select only the top 10 or bottom 10 rows, we can chain the head or tail methods, respectively, to our line of code above. Here the head and tail functions are being applied to the result to the left of each respective pipe operator.
# get top ten rows with selected columns state_info >> select(X.State_Name, X.State_Region, X.Population) >> head(10) # get bottom ten rows with selected columns state_info >> select(X.State_Name, X.State_Region, X.Population) >> tail(10)
Filtering can be done using the sift method. For example, if we want to get only the records where a state has an area less than 100,000 we would do this:
# filter initial data frame to states with areas less than 100,000 square miles state_info >> sift(X.Area < 100000) # filter rows and columns in one step state_info >> sift(X.Area < 100000) >> select(X.State_Name, X.State_Region, X.Population, X.Area)
To sort our DplyFrame by a column, we can use the arrange method, like in dplyr:
# sort DplyFrame in ascending order by Area state_info >> arrange(X.Area) # sort first by region, then by area state_info >> arrange(X.State_Region, X.Area)
Also similar to dplyr, we can use the mutate method to add fields to state_info:
with_extra_field = state_info >> mutate(Name_Region = X.State_Name + '-' + X.State_Region) # create multiple fields in one step more_fields = state_info >> mutate(Name_Region = X.State_Name + '-' + X.State_Region, Area_Per_Thousand_Sq_Mile = X.Area / 1000)
with_extra_field is now a DplyFrame containing the same fields as state_info, plus a field called “Name_Region”. more_fields contains two extra fields — one concatenating the state name and region, and one dividing the state area field by 1000.
Aggregating data can be done using the summarize and group_by methods:
# get average area by region state_info >> group_by(X.State_Region) >> summarize(average_region_area = X.Area.mean()) # or do multiple aggregations in one step state_info >> group_by(X.State_Region) >> summarize(average_region_area = X.Area.mean(), \ >> max_region_area = X.Area.max), \ >> max_region_income = X.Income.max())
The first line of code above will show the average (mean) state area by region. The second line calculates multiple aggregations in one step, including mean and max state areas by region, as well as max state income by region.
dplython can also transpose datasets using X.__T. (period followed by double underscore and second period).
# transpose state_info state_info >> X._.T()
dplython, like dplyr, has several “SQL-like” functions for joining datasets together:
from dplython import inner_join, left_join right_table = more_fields >> select(X.State_Name, X.Name_Region) # inner join joined_result = inner_join(state_info, right_table, on = "State_Name") # left_join left_join_result = left_join(state_info, right_table, on = "State_Name")
In each type of join above, the first and second parameters are the left and right tables being joined. The “on” parameter specifies what column or list of columns should be used as the key to join the tables together.
To get the count of the values in a specific column, use the count function:
# get a count of each state region state_info >> count(X.State_Region)
The sample_n function can be used to randomly sample rows from our dataset. You just need to pass whatever number of rows you want to randomly select (e.g. 10 in the below example).
from dplython import sample_n # randomly select ten rows from state_info state_info >> sample_n(10)
dplython can also sample a random proportion of a dataset. For example, if we want to randomly sample 70% of the rows, we can use the sample_frac function:
from dplython import sample_frac # randomly sample 70% of the rows from state_info state_info >> sample_frac(0.7)
The first line of code above will generate a DplyFrame with two columns: one showing the State Region, and the other showing the average area across each region. The second line contains this information, plus the max state area for each region, and the max state income for each region.
That’s it for this post! Please check out other Python posts of mine by clicking here.
Very excited to announce the early-access preview (MEAP) of my upcoming book, Software Engineering for…
Ever had long-running code that you don't know when it's going to finish running? If…
Background If you've done any type of data analysis in Python, chances are you've probably…
In this post, we will investigate the pandas_profiling and sweetviz packages, which can be used…
In this post, we're going to cover how to plot XGBoost trees in R. XGBoost…
In this post, we'll discuss the underrated Python collections package, which is part of the…