Word Frequency Analysis

In a previous article, we talked about using Python to scrape stock-related articles from the web. As an extension of this idea, we’re going to show you how to use the NLTK package to figure out how often different words occur in text, using scraped stock articles.

Initial Setup

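Let’s import the NLTK package, along with requests and BeautifulSoup, which we’ll need to scrape the stock articles. We’ll also import the standard-library string module, which we use later to filter out punctuation.


'''load packages'''
import string
import nltk
import requests
from bs4 import BeautifulSoup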

Pulling the data we’ll need

Below, we’re copying code from my scraping stocks article. This gives us a function, scrape_all_articles (along with two other helper functions), which we can use to pull the actual raw text from articles linked to from NASDAQ’s website.


def scrape_news_text(news_url):

    news_html = requests.get(news_url).content

    '''convert html to BeautifulSoup object'''
    news_soup = BeautifulSoup(news_html , 'lxml')

    paragraphs = [par.text for par in news_soup.find_all('p')]
    news_text = '\n'.join(paragraphs)

    return news_text

def get_news_urls(links_site):
    '''scrape the html of the site'''
    resp = requests.get(links_site)

    if not resp.ok:
        return None

    html = resp.content

    '''convert html to BeautifulSoup object'''
    soup = BeautifulSoup(html , 'lxml')

    '''get list of all links on webpage'''
    links = soup.find_all('a')

    urls = [link.get('href') for link in links]
    urls = [url for url in urls if url is not None]

    '''Filter the list of urls to just the news articles'''
    news_urls = [url for url in urls if '/article/' in url]

    return news_urls

def scrape_all_articles(ticker , upper_page_limit = 5):

    landing_site = 'http://www.nasdaq.com/symbol/' + ticker + '/news-headlines'

    '''get_news_urls returns None when a request fails,
       so fall back to an empty list'''
    all_news_urls = get_news_urls(landing_site) or []

    current_urls_list = all_news_urls.copy()

    index = 2

    '''Loop through each sequential page, scraping the links from each.
       An empty list (failed request or no article links) ends the loop'''
    while current_urls_list and (index <= upper_page_limit):

        '''Construct URL for page in loop based off index'''
        current_site = landing_site + '?page=' + str(index)
        current_urls_list = get_news_urls(current_site) or []

        '''Append current webpage's list of urls to all_news_urls'''
        all_news_urls = all_news_urls + current_urls_list

        index = index + 1

    '''Remove duplicate urls'''
    all_news_urls = list(set(all_news_urls))

    '''Now, we have a list of urls, we need to actually scrape the text'''
    all_articles = [scrape_news_text(news_url) for news_url in all_news_urls]

    return all_articles
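
To check the link-filtering step in get_news_urls without making any network requests, here is a minimal sketch that runs the same '/article/' filter over an inline HTML string. It swaps in the standard library's html.parser for BeautifulSoup, and the sample HTML and the names LinkCollector and filter_news_urls are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get('href')
            if href is not None:
                self.hrefs.append(href)


def filter_news_urls(html):
    """Return only the links that look like news articles."""
    collector = LinkCollector()
    collector.feed(html)
    return [url for url in collector.hrefs if '/article/' in url]


# Made-up HTML standing in for a scraped headlines page
sample_html = '''
<a href="/article/netflix-earnings-preview">Earnings preview</a>
<a href="/symbol/nflx">Quote page</a>
<a name="no-href">Anchor without a link</a>
<a href="https://example.com/article/streaming-wars">Streaming wars</a>
'''

news_urls = filter_news_urls(sample_html)
print(news_urls)
```

Only the two links containing '/article/' should survive; the quote page and the anchor without an href are dropped.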

Let’s run our function to pull a few articles on Netflix (ticker symbol ‘NFLX’).


articles = scrape_all_articles('NFLX' , 10)

Above, we use our function to search through the first ten pages of NASDAQ’s listing of articles for Netflix. This gives us a total of 102 articles (at the time of this writing). The variable, articles, contains a list of the raw text of each article. We can view a sample one by printing the following:


print(articles[0])

Now, let’s set article equal to one of the articles we have.


article = articles[0]

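To get word frequencies of this article, we are going to perform an operation called tokenization. Tokenization breaks a string of text into individual words (and punctuation marks), which we’ll need to calculate word frequencies. To tokenize article, we use the nltk.tokenize.word_tokenize function. If NLTK raises a LookupError here, first download the tokenizer data named in the error message, e.g. with nltk.download('punkt'); the stop word list used in the next step similarly needs nltk.download('stopwords').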


tokens = nltk.tokenize.word_tokenize(article)
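
To get a feel for what tokenization produces without downloading any NLTK data, here is a rough stand-in built on a regular expression. It is not equivalent to nltk.tokenize.word_tokenize, which treats contractions, quotes and abbreviations more carefully; simple_tokenize and the sample sentence are made up for illustration.

```python
import re


def simple_tokenize(text):
    """Hypothetical stand-in tokenizer: split text into word-like
    runs and single punctuation symbols."""
    # Either a run of letters, digits and apostrophes,
    # or one character that is neither whitespace nor word-like
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)


sample = "Netflix shares rose 5% on Monday, analysts said."
sample_tokens = simple_tokenize(sample)
print(sample_tokens)
```

Note that punctuation such as the comma and the percent sign comes back as separate tokens, which is why a later step filters punctuation out.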

Now, if you print out tokens, you’ll see that it includes a lot of words like ‘the’, ‘a’, ‘an’ etc. These are known as ‘stop words.’ We can filter these out of tokens using stopwords from nltk.corpus. Let’s also make all the words upper case. This will allow us to avoid case sensitivity issues when we get any word frequency distributions.


from nltk.corpus import stopwords

'''Get list of English stop words '''
take_out = stopwords.words('english')

'''Make all words in tokens uppercase'''
tokens = [word.upper() for word in tokens]

'''Make all stop words upper case'''
take_out = [word.upper() for word in take_out]

'''Filter out stop words from tokens list'''
tokens = [word for word in tokens if word not in take_out]

*NLTK also provides stop word lists for other languages.
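
One possible refinement: because `word not in take_out` scans the whole list for every token, storing the stop words in a set makes each lookup faster on long texts. A minimal sketch, using a tiny hand-written stop word list in place of stopwords.words('english') so that it runs without NLTK data (the sample tokens and names are made up):

```python
# Made-up tokens and a tiny hand-written stop word list (assumptions,
# standing in for word_tokenize output and NLTK's English stop words)
sample_tokens = ['The', 'stock', 'of', 'Netflix', 'is', 'up',
                 'and', 'the', 'market', 'is', 'flat']
sample_stop_words = ['the', 'of', 'is', 'and']

# Uppercase the stop words once and keep them in a set for fast lookups
take_out_set = {word.upper() for word in sample_stop_words}

# Uppercase each token, then keep it only if it is not a stop word
filtered = [word.upper() for word in sample_tokens
            if word.upper() not in take_out_set]
print(filtered)
```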

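In addition to filtering out stop words, we also probably want to get rid of punctuation (e.g. commas). This can be done by filtering out any elements in tokens that are in string.punctuation, a string containing the common ASCII punctuation characters. The second filter below drops any remaining tokens that start with a punctuation character, such as the `` and '' quote tokens the tokenizer produces.


import string

'''Drop tokens found in string.punctuation (e.g. ',' or '.')'''
tokens = [word for word in tokens if word not in string.punctuation]

'''Drop tokens that begin with a punctuation character'''
tokens = [word for word in tokens if word[0] not in string.punctuation]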
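
To see why two passes are used, here is a small sketch on made-up tokens. Since string.punctuation is a string, `word not in string.punctuation` is a substring test: it removes single marks like ',' but lets multi-character tokens such as `` or ... through, and the second pass catches those by their first character.

```python
import string

# Made-up tokens resembling tokenizer output after uppercasing
sample_tokens = ['NETFLIX', ',', "'S", '``', 'STOCK', "''", 'ROSE', '...', '.']

# Pass 1: drop tokens that appear inside the punctuation string
step_one = [word for word in sample_tokens
            if word not in string.punctuation]

# Pass 2: drop tokens whose first character is punctuation
step_two = [word for word in step_one
            if word[0] not in string.punctuation]

print(step_one)
print(step_two)
```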

Now, we’re ready to get the word frequency distribution of the article in question. This is done using nltk.FreqDist, like below. FreqDist is a dictionary-like class (a subclass of Python’s collections.Counter), where each key is a unique word in the text and the corresponding value is how many times that word appears. Setting this object equal to word_frequencies, we sort its (word, count) pairs (word_frequencies.items()) by the frequency of each word in descending order.


'''Returns a dictionary of words mapped to how
   often they occur'''
word_frequencies = nltk.FreqDist(tokens)

'''Sort the above result by the frequency of each word'''
sorted_counts = sorted(word_frequencies.items() , key = lambda x: x[1] ,
                       reverse = True)
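
Since FreqDist is a subclass of collections.Counter, the same ordering is also available from its most_common method. A minimal sketch using Counter directly on made-up tokens, so that it runs without NLTK installed:

```python
from collections import Counter

# Made-up tokens standing in for a cleaned article
sample_tokens = ['NETFLIX', 'STOCK', 'NETFLIX', 'SHARES', 'STOCK', 'NETFLIX']

counts = Counter(sample_tokens)

# The explicit sort used in this article
sorted_counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)

# The built-in shortcut: all pairs, or just the top n
all_pairs = counts.most_common()
top_two = counts.most_common(2)

print(sorted_counts)
print(top_two)
```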

Getting a function to calculate word frequency…

Let’s create a function from what we did that takes a single article, and returns the sorted word frequencies.


def get_word_frequency(article):

    tokens = nltk.tokenize.word_tokenize(article)

    '''Get list of English stop words '''
    take_out = stopwords.words('english')
    take_out = [word.upper() for word in take_out]

    '''Convert each item in tokens to uppercase'''
    tokens = [word.upper() for word in tokens]

    '''Filter out stop words and punctuation '''
    tokens = [word for word in tokens if word not in take_out]

    tokens = [word for word in tokens if word not in string.punctuation]

    tokens = [word for word in tokens if word[0] not in string.punctuation]

    '''Get word frequency distribution'''
    word_frequencies = nltk.FreqDist(tokens)

    '''Sort word frequency distribution by number of times each word occurs'''
    sorted_counts = sorted(word_frequencies.items() , key = lambda x: x[1] ,
                           reverse = True)

    return sorted_counts

Now, we could run our function across every article in our list, like this:


articles = [article for article in articles if article != '']
results = [get_word_frequency(article) for article in articles]

The results variable contains word frequencies for each individual article. Using this information, we can get the most frequently occurring word in each article.


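'''Take the top (word, count) pair from each article's sorted list,
   skipping any article left with no tokens after filtering'''
top_pairs = [counts[0] for counts in results if counts]

'''Keep just the word from each pair'''
most_frequent = [pair[0] for pair in top_pairs]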

Next, we can figure out the most common top-occurring words across the articles.


most_frequent = nltk.FreqDist(most_frequent)
most_frequent = sorted(most_frequent.items() , key = lambda x: x[1] , 
                       reverse = True)
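
The two steps above (take each article's top word, then tally those top words) can be tried on toy data. In this sketch, sample_results is made up and simply mimics the shape of results: one sorted list of (word, count) pairs per article.

```python
from collections import Counter

# Made-up data in the same shape as results
sample_results = [
    [('NETFLIX', 9), ('STOCK', 4)],
    [('PERCENT', 6), ('NETFLIX', 5)],
    [],                               # an article with no tokens left
    [('NETFLIX', 7), ('SHARES', 2)],
]

# Top word of each article, skipping empty lists
top_words = [counts[0][0] for counts in sample_results if counts]

# How often each word was the top word, most frequent first
top_word_tally = Counter(top_words).most_common()

print(top_words)
print(top_word_tally)
```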

Filtering out articles using word frequency

If you print out most_frequent, you can see the words ‘NETFLIX’, ‘PERCENT’, and ‘STOCK’ are at the top of the list. Using word frequencies could be useful in giving a quick check to test whether an article actually has much to do with the stock that it’s listed under. For instance, some of the Netflix articles may be linked to the stock because they mentioned it in passing, or in a minor part of the text, while actually having more to do with another stock(s). Using our frequency function above, we could filter out articles that mention the stock name infrequently, like in the snippet below.


'''Create a dictionary that maps each article to its sorted word frequencies'''
article_to_freq = {article:freq for article,freq in zip(articles , results)}

'''Filter out articles that don't mention 'Netflix' at least 3 times.
   Each freq is a list of (word, count) pairs, so convert it to a
   dictionary to look up the count for 'NETFLIX' (tokens were uppercased)'''
article_to_freq = {article:freq for article,freq in
                           article_to_freq.items()
                           if dict(freq).get('NETFLIX' , 0) >= 3}



Note, this isn’t a perfect form of topic modeling, but it is something you can do really quickly to make educated guesses about whether an article actually has to do with the topic you want. You can also make this process better by filtering out articles that don’t contain other words, as well. For instance, if you’re looking for articles specifically about Netflix’s stock, you might not want to include articles about new shows etc. on Netflix. So, you could maybe filter out articles that don’t mention words like ‘stock’ or ‘investing.’
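
As a rough sketch of that idea, the check below requires both a minimum number of mentions of the company name and at least one mention of a topic word. The helper names (count_of, is_relevant), the threshold, the topic words and the toy frequency lists are all assumptions for illustration.

```python
def count_of(word, sorted_counts):
    """Look up a word's count in a list of (word, count) pairs."""
    return dict(sorted_counts).get(word.upper(), 0)


def is_relevant(sorted_counts, name='NETFLIX', min_mentions=3,
                topic_words=('STOCK', 'INVESTING')):
    """Hypothetical relevance check: enough mentions of the name,
    plus at least one of the topic words."""
    if count_of(name, sorted_counts) < min_mentions:
        return False
    return any(count_of(word, sorted_counts) > 0 for word in topic_words)


# Toy frequency lists in the shape returned by get_word_frequency
about_the_stock = [('NETFLIX', 8), ('STOCK', 5), ('SHARES', 2)]
about_a_show = [('NETFLIX', 6), ('SEASON', 4), ('EPISODE', 3)]
passing_mention = [('DISNEY', 9), ('STOCK', 4), ('NETFLIX', 1)]

print(is_relevant(about_the_stock))
print(is_relevant(about_a_show))
print(is_relevant(passing_mention))
```

Only the first toy article passes: the second never mentions a topic word, and the third mentions the company just once.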

One last note…

Another way of thinking about word frequency in our situation would be to get word counts across all articles at once. You can do this easily enough by concatenating (or joining together) each article in our list.


overall_text = ' '.join(articles)
top_words = get_word_frequency(overall_text)
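
If the per-article frequencies have already been computed, another option is to add them up instead of tokenizing one large string again. This should give nearly the same totals as the joined-text approach. A minimal sketch on made-up data in the shape of results (the names sample_results, overall and overall_top are illustrative):

```python
from collections import Counter

# Made-up per-article frequencies in the shape of results
sample_results = [
    [('NETFLIX', 3), ('STOCK', 2)],
    [('SHARES', 5), ('NETFLIX', 1)],
]

# Accumulate counts article by article; update() adds to existing counts
overall = Counter()
for sorted_counts in sample_results:
    overall.update(dict(sorted_counts))

# Sorted (word, count) pairs across all articles
overall_top = overall.most_common()
print(overall_top)
```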

This type of analysis can go much deeper into the world of natural language processing, but that would go well beyond a single blog post, so that’s the end for now!

Andrew Treadway
