The following article shows an example of how to scrape stock-related news articles from the web using Python 3. Specifically, we’ll be looking at articles linked from http://www.nasdaq.com. If you’re not familiar with list comprehensions, you may want to review them first, as we’ll be using them in our code.
Let’s start with a specific stock, say Netflix. Articles linked to a specific stock ticker on Nasdaq’s website follow this pattern:
http://www.nasdaq.com/symbol/TICKER/news-headlines,
where TICKER is replaced with whatever ticker you want. In our case, we will start by dealing specifically with Netflix’s (NFLX) stock. So our site of interest is:
http://www.nasdaq.com/symbol/nflx/news-headlines
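As an aside, here’s a minimal sketch of how you could build that URL for any ticker (make_headlines_url is just an illustrative helper name, not something we rely on later):

# illustrative helper: build the Nasdaq news-headlines URL for a given ticker
def make_headlines_url(ticker):
    return 'http://www.nasdaq.com/symbol/' + ticker.lower() + '/news-headlines'

make_headlines_url('NFLX')  # 'http://www.nasdaq.com/symbol/nflx/news-headlines'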
The first step is to load the requests and BeautifulSoup packages. Here, we’ll also set the variable site equal to the URL above.
# load packages
import requests
from bs4 import BeautifulSoup

site = 'http://www.nasdaq.com/symbol/nflx/news-headlines'
Next, we use requests to get the HTML of the site we need. If you actually go to that URL in your browser, you’ll see it only shows the first few articles associated with Netflix. You can press the ‘Next’ button to see the other articles. We’ll come back to those other articles later.
Once we have the HTML of the site, we’ll convert it to a BeautifulSoup object, which will make parsing the HTML much easier.
Using this object, we use the find_all method to get all of the links on the page. Links on a webpage can be identified by searching for ‘a’ (anchor) tags. More information on HTML links can be found here.
Once we have the link objects in a list, we get their URLs. We can do this by using the get method of each link object to find its href attribute. If you’re not familiar with href attributes, they basically contain the URLs of the links they are associated with. Thus, getting the href attribute of each link on the page is equivalent to getting the URL of each link.
# scrape the html of the site
html = requests.get(site).content

# convert html to BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')

# get list of all links on the webpage
links = soup.find_all('a')

urls = [link.get('href') for link in links]
urls = [url for url in urls if url is not None]

len(urls)  # 226
As you can see above, there are over 200 links on the Netflix article main page! However, we don’t actually want to see the contents of all of these. What we’re after is just the news-related articles pertaining to Netflix. You can get these by filtering the list of URLs down to those that contain the sub-string ‘/article/’. We’ll store this list in the variable news_urls.
# filter the list of urls to just the news articles
news_urls = [url for url in urls if '/article/' in url]

len(news_urls)  # 6
Now that we have the list of URLs for news articles, let’s take the first one and store it as a variable, news_url.
Then, we’ll use a similar process as above to scrape the HTML of this URL, and convert that HTML to a BeautifulSoup object.
Our next step is to scrape the text from all paragraphs in the article. This can be done using the find_all method of news_soup, our BeautifulSoup object, to search for all of the ‘p’ tags on the page. This result is stored in the paragraphs variable.
The text from the webpage’s paragraphs is stored in a list, with one paragraph’s text per element. Therefore, we add one additional line of code to join the text from each element in the list together, so that the variable news_text contains all the text from the article.
# set news_url equal to the 0th element in news_urls
news_url = news_urls[0]

# get the HTML of the webpage associated with news_url
news_html = requests.get(news_url).content

# convert news_html to a BeautifulSoup object
news_soup = BeautifulSoup(news_html, 'lxml')

# use news_soup to get the text from all paragraphs on the page
paragraphs = [par.text for par in news_soup.find_all('p')]

# lastly, join all the text in the list above into a single string
news_text = '\n'.join(paragraphs)
It’s generally a good idea to package repetitive code into functions. So let’s do that for our logic above:
# package what we did above into a function
def scrape_news_text(news_url):

    news_html = requests.get(news_url).content

    # convert html to BeautifulSoup object
    news_soup = BeautifulSoup(news_html, 'lxml')

    paragraphs = [par.text for par in news_soup.find_all('p')]
    news_text = '\n'.join(paragraphs)

    return news_text
Now we have a function, scrape_news_text, that takes the URL of an article as its sole input and returns the text of that article.
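As a quick sanity check (a sketch that assumes the news_urls list from earlier is still in scope; sample_text is just an illustrative variable name), we could run:

# try the function on the first article URL we collected earlier
sample_text = scrape_news_text(news_urls[0])

# print the first few hundred characters to confirm we pulled back article text
print(sample_text[:300])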
As mentioned above, the Nasdaq page we’re scraping article links from only shows a handful of links at a time. To see the rest, you have to keep paging through the site, either by pressing the ‘Next’ button or by clicking on a specific page number.
So if we wish to scrape articles linked from multiple pages, we need to generalize the code above.
# generalized function to get all news-related article urls from a Nasdaq webpage
def get_news_urls(links_site):

    # scrape the html of the site
    resp = requests.get(links_site)
    if not resp.ok:
        return None
    html = resp.content

    # convert html to BeautifulSoup object
    soup = BeautifulSoup(html, 'lxml')

    # get list of all links on the webpage
    links = soup.find_all('a')
    urls = [link.get('href') for link in links]
    urls = [url for url in urls if url is not None]

    # filter the list of urls to just the news articles
    news_urls = [url for url in urls if '/article/' in url]

    return news_urls
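As a quick check, we could point this function at the Netflix headlines page we defined earlier (nflx_news_urls is just an illustrative name):

# get the article urls linked from the Netflix headlines landing page we defined earlier
nflx_news_urls = get_news_urls(site)

# see how many article links were found (get_news_urls returns None if the request failed)
if nflx_news_urls is not None:
    print(len(nflx_news_urls))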
Now, let’s put the two functions we’ve created above to use in a single function that loops over multiple pages.
def scrape_all_articles(ticker, upper_page_limit = 5):

    landing_site = 'http://www.nasdaq.com/symbol/' + ticker + '/news-headlines'

    # get the article urls linked from the landing page (treat a failed request as an empty list)
    all_news_urls = get_news_urls(landing_site) or []

    current_urls_list = all_news_urls.copy()

    index = 2

    # loop through each sequential page, scraping the links from each
    while current_urls_list and index <= upper_page_limit:

        # construct the URL for the page in this iteration based off the index
        current_site = landing_site + '?page=' + str(index)
        current_urls_list = get_news_urls(current_site) or []

        # append the current webpage's list of urls to all_news_urls
        all_news_urls = all_news_urls + current_urls_list

        index = index + 1

    # remove any duplicate urls
    all_news_urls = list(set(all_news_urls))

    # now that we have a list of urls, we need to actually scrape the text
    all_articles = [scrape_news_text(news_url) for news_url in all_news_urls]

    return all_articles
Now we have a function that takes a ticker as input and pulls back the articles linked from Nasdaq’s site for that ticker. Notice we also added a parameter, upper_page_limit, which limits the number of pages of article links that will be searched.
In other words, let’s say this limit is set to 5. Then, the function above will get the article links from each of the following pages:
http://www.nasdaq.com/symbol/TICKER/news-headlines
http://www.nasdaq.com/symbol/TICKER/news-headlines?page=2
http://www.nasdaq.com/symbol/TICKER/news-headlines?page=3
http://www.nasdaq.com/symbol/TICKER/news-headlines?page=4
http://www.nasdaq.com/symbol/TICKER/news-headlines?page=5
The while loop in scrape_all_articles steps through each page up to the upper_page_limit parameter, scraping the article links from each. After all of the URLs have been retrieved, the function uses scrape_news_text to get the article text associated with each link, and this collection of article texts is returned as all_articles.
Now if we want to scrape the text from all Netflix-related articles linked across the first 5 pages, we can just do the following:
nflx_articles = scrape_all_articles('nflx', 5)

print(nflx_articles[0])
OK — so we have a function that can scrape articles for any stock! How can we put this to more use? What if we want to scrape the articles associated with every stock in the Dow Jones?
Well, first we need to get the list of all stocks currently in the Dow. These can be found on https://finance.yahoo.com/quote/%5EDJI/components?p=%5EDJI.
If you go to this page, you’ll see a table of the Dow stocks, with their tickers, company names, and price information. We can scrape this table using the read_html function in the pandas package, as shown below.
import pandas as pd

dow_info = pd.read_html('https://finance.yahoo.com/quote/%5EDJI/components?p=%5EDJI')[1]

dow_tickers = dow_info.Symbol.tolist()
Now the variable dow_tickers contains the list of Dow tickers. For reference, you can also get this list directly using the yahoo_fin package, like this:
from yahoo_fin import stock_info as si

# return list of Dow tickers
dow_tickers = si.tickers_dow()
With this list and the scraping function above, we can run the line of code below, which uses a dictionary comprehension.
If you haven’t seen a dictionary comprehension before, it works similarly to a list comprehension, except that a dict is created as a result. In our case, the dict comprehension loops over each ticker in the Dow and maps each one (i.e. each key) to the articles associated with that ticker (the articles being the value).
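As a small, self-contained illustration (separate from the scraping logic itself), here’s a toy dict comprehension that maps a couple of tickers to their Nasdaq headline URLs:

# toy example of a dict comprehension: each ticker (key) maps to its headlines URL (value)
{ticker : 'http://www.nasdaq.com/symbol/' + ticker.lower() + '/news-headlines' for ticker in ['AAPL', 'MSFT']}

# result:
# {'AAPL': 'http://www.nasdaq.com/symbol/aapl/news-headlines',
#  'MSFT': 'http://www.nasdaq.com/symbol/msft/news-headlines'}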
So in the code below, we are looping over the tickers in the dow_tickers list, and scraping articles from the first page in the article collection (i.e. http://www.nasdaq.com/symbol/TICKER/news-headlines) for each ticker. Thus, the keys of the resultant dict are the tickers making up dow_tickers, and the values of the dict are the articles corresponding to each ticker.
dow_articles = {ticker : scrape_all_articles(ticker , 1) for ticker in dow_tickers}
We now have the text from all of the articles linked from the first page in the article collection for each respective Dow stock! Since we used a dictionary comprehension, you can view the list of articles associated with each stock by using its ticker as a key, like so:
dow_articles['AAPL']
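If you want a quick overview of what was collected, a short sketch like the following (just an illustration) prints the number of articles scraped per ticker:

# count how many articles were scraped for each Dow ticker
for ticker, articles in dow_articles.items():
    print(ticker, len(articles))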
That’s it for this blog post!
If you want to learn more about web scraping, you should check out Web Scraping with Python: Collecting Data from the Modern Web.