Python

Parsing Dates with Pandas

The pandas package is one of the most powerful Python packages available. One useful feature of pandas is its Timestamp method. This provides functionality to convert strings in a variety of formats to dates.

The problem we’re trying to solve in this article is how to parse dates from strings that may contain additional text / words. We will look at this problem using pandas.

In the first step, we’ll load the pandas package.


'''Load pandas package '''
import pandas as pd

Next, let’s create a sample string containing a made-up date with other text. For now, assume the dates will not contain spaces (we will re-examine this later). Taking this assumption, we use the split method, available for strings in Python, to create a list of the sub-strings of our variable, string, which are separated by spaces.


'''Define sample string'''
string = 'this is a 05-17-2015 test'

'''Split string by spaces'''
split_string = string.split(' ')

Now, we’ll define a wrapper function that uses the pandas Timestamp method to attempt to convert an input string to a time stamp object. If the conversion fails, our wrapper will just return None. This allows us to figure out which sub-strings might be dates.


'''Define function that tries to convert a string to a pandas timestamp'''
def try_get_date(x):
    
    try:
        return pd.Timestamp(x)
    except Exception:
        return None

Once we’ve defined our wrapper above, we can run it on each element in split_string using a list comprehension.


'''Run the try_get_date function on each word in split_string'''
result = [try_get_date(x) for x in split_string]

'''Filter out any None values from result '''
result = [x for x in result if x is not None]

'''Take the first occurrence (and only in this case) of a date '''
result = result[0]

Let’s take what we did above, and wrap it into a function called parse_date. To make this a little more generalized, we’ll add an if / else sequence at the bottom of the function’s body to handle strings that do not contain any dates, as well as those that may contain multiple dates. See examples below for how this plays out.


'''Wrap what we did above into a function'''
def parse_date(string):
    
    split_string = string.split(' ')

    result = [try_get_date(x) for x in split_string]
    result = [x for x in result if x is not None]
    
    if len(result) == 0:
        return 'Could not find date in string'
    elif len(result) > 1:
        return 'Multiple dates found'
    else:
        return result[0]

You can see that the Timestamp method is pretty powerful, and can handle converting strings in several different formats (slashes, hyphens, etc.).

From the results above, there’s a problem we currently have. If the word today appears in our string, the Timestamp method will convert this to today’s date. If we don’t want this to happen, we can make a simple adjustment to our function, like so:


def parse_date(string):
    
    split_string = string.split(' ')

    '''Filter out potential occurrences of the word 'today' in string'''
    split_string = [x for x in split_string if x != 'today']

    result = [try_get_date(x) for x in split_string]
    result = [x for x in result if x is not None]
    
    if len(result) == 0:
        return 'Could not find date in string'
    elif len(result) > 1:
        return 'Multiple dates found'
    else:
        return result[0]

Another piece we might to change is adjusting for when multiple dates are in the string. For now, our function alerts you if this happens, but doesn’t actually give you the dates. You can change this by returning result instead of result[0] in the bottom of parse_date.

As we mentioned earlier, at the moment, we’re only taking into account dates without spaces. So how would we deal with a string like this?


string = 'January 3, 2017 is two days after the New Year!'

Using the calendar package, which generally comes with a standard Python installation, we can get a list of month names and abbreviations. These are stored in the variables, month_names and month_abbrs, respectively.

We then search for any occurrences of month names / abbreviations in the input string. Our full, updated function is below.


from calendar import month_name, month_abbr

month_names = [x for x in month_name]
month_abbrs = [x for x in month_abbr]

def parse_date(string):
    

    split_string = string.split(' ')

    months = [substring for substring in split_string if substring in month_names]

    if len(months) == 0:
        
        months = [substring for substring in split_string if substring in month_abbrs]
        

        '''if a substring in the split_string list is equal to a month's name, then
        extract the rest of the date from the string'''
        if len(months) > 0:
        
            '''Get index of occurrence of month name'''
            index = split_string.index(months[0])
        
            '''Assume the day and year follow the month '''
            sublist = split_string[index:index+3]

            date = ' '.join(sublist)
        
            result = try_get_date(date)
        
            if result is not None:
                return result

    
    result = [try_get_date(x) for x in split_string]
    result = [x for x in result if x is not None]
    
    if len(result) == 0:
        return 'Could not find date in string'
    elif len(result) > 1:
        return 'Multiple dates found'
    else:
        return result[0]

Let’s break this down:


months = [substring for substring in split_string if substring in month_names]

The above line of code searches for month name / abbreviation appearances in the string. Next, provided this search is successful, we get the index in split_string where the month name / abbreviation occurs. Using our example of ‘January 3, 2017 is two days after the New Year!’, the index would equal zero since ‘January’ is in the 0th index of split_string, i.e. this:

The sublist variable below stores this plus the day and year strings following the month string.


if len(months) > 0:
        
    '''Get index of occurrence of month name'''
    index = split_string.index(months[0])
        
    '''Assume the day and year follow the month '''
    sublist = split_string[index:index+3]

The rest of what we added to our function is below:


date = ' '.join(sublist)
        
result = try_get_date(date)
        
if result is not None:
    return result

Once we have the sublist containing [‘January’,’3,’,’2017′], we concatenate this strings in the list together, so that we have:

date = ‘January 3, 2017’

Then, we just use our function, try_get_date to convert this string to a time stamp. There we have it! The rest of the function is what he had previously, which will search for other dates in the string that do not have actual month names / abbreviations.

There are other possible forms that dates can take, so our function isn’t quite perfect, but it’s a starting point for doing work like this.

Andrew Treadway