The yahoo_fin package contains functions to scrape stock-related data from Yahoo Finance and NASDAQ. The official documentation covers the full API, while this post provides a few more in-depth examples. Also, please check out my yahoo_fin playlist on YouTube; the first video covers installation and getting historical / real-time stock prices.
The functions in yahoo_fin are divided into two modules, stock_info and options. This post will focus on introducing stock_info. For more on using the options module, check out this post.
Let’s get started by importing the stock_info module from yahoo_fin.
import yahoo_fin.stock_info as si
One of the core functions available is called get_data, which retrieves historical price data for an individual stock. To call this function, just pass whatever ticker you want:
si.get_data("nflx") # gets Netflix's data
si.get_data("aapl") # gets Apple's data
si.get_data("amzn") # gets Amazon's data
You can also pull data for a specific date range, like below:
si.get_data("amzn", start_date = "01/01/2017", end_date = "01/31/2017")
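If you want a rolling window rather than fixed dates, the date strings can be built with the standard library. The sketch below assumes the MM/DD/YYYY layout used in the call above; date_range_kwargs is a hypothetical helper, not part of yahoo_fin.

```python
from datetime import date, timedelta


def date_range_kwargs(end, days):
    """Build start_date/end_date keyword arguments covering `days` days up to `end`."""
    start = end - timedelta(days=days)
    fmt = "%m/%d/%Y"  # same MM/DD/YYYY layout as the date strings in this post
    return {"start_date": start.strftime(fmt), "end_date": end.strftime(fmt)}


kwargs = date_range_kwargs(date(2017, 1, 31), days=30)
print(kwargs)

# Intended use (needs yahoo_fin and a network connection):
# si.get_data("amzn", **kwargs)
```

Passing date.today() as the end date would give the most recent 30 days instead.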
Now, suppose you want to pull the price data for all the stocks in the S&P 500. This might take a few minutes, depending on your internet connection, but it can be done like this:
# get list of S&P 500 tickers
sp = si.tickers_sp500()
# pull data for each S&P stock
price_data = {ticker : si.get_data(ticker) for ticker in sp}
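The above code will create a dictionary where the keys are the S&P tickers and the values are the corresponding price datasets. If you want to combine the various datasets into a single data frame, you could use functools together with pandas:
from functools import reduce
import pandas as pd
combined = reduce(lambda x, y: pd.concat([x, y]), price_data.values())
This uses the reduce function from functools to collapse the collection of stock price data frames into a single data frame, stacking them two at a time with pd.concat. (DataFrame.append is not used here because it was removed in pandas 2.0.) Calling pd.concat(list(price_data.values())) directly gives the same result in one step.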
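When looping over hundreds of tickers, a single failed request stops a dictionary comprehension partway through. One way around that is to collect failures instead of raising, as in the minimal sketch below. fetch_all and fake_fetch are hypothetical names used for illustration; fake_fetch stands in for si.get_data so the example runs offline.

```python
def fetch_all(tickers, fetch):
    """Call fetch(ticker) for each ticker, recording failures instead of stopping."""
    data, failed = {}, []
    for ticker in tickers:
        try:
            data[ticker] = fetch(ticker)
        except Exception as exc:  # scraping can fail for many different reasons
            failed.append((ticker, repr(exc)))
    return data, failed


def fake_fetch(ticker):
    """Offline stand-in for si.get_data (hypothetical); fails for one ticker."""
    if ticker == "BAD":
        raise ValueError("no data")
    return [1.0, 2.0]


data, failed = fetch_all(["AAA", "BAD", "CCC"], fake_fetch)
print(sorted(data))
print([ticker for ticker, _ in failed])

# Intended use (needs yahoo_fin and a network connection):
# price_data, failed = fetch_all(si.tickers_sp500(), si.get_data)
```

The failed list can then be retried later or simply logged.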
Click here to learn how to scrape real-time stock prices.
Financials, such as income statements, balance sheets, and cash flows can be scraped using yahoo_fin. For a full tutorial on this, check out this link.
Let’s scrape this information for Amazon.
income_statement = si.get_income_statement("amzn")
balance_sheet = si.get_balance_sheet("amzn")
cash_flow = si.get_cash_flow("amzn")
Now, the income_statement variable contains a data frame scraped from Yahoo Finance. If we wanted to use this to see how net income or gross profit changed across the last several years, we could examine those records like this:
income_statement.loc["netIncome"]
income_statement.loc["grossProfit"]
The names of the records in the income statement can be seen by running the below line of code. This also works for the cash flow and balance sheet variables.
income_statement.index
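Once a record is selected, year-over-year changes are a short calculation. The sketch below uses a small made-up table rather than real Amazon figures, and assumes the layout implied by the .loc calls in this post: line items as rows and year-ends as columns.

```python
import pandas as pd

# Made-up figures, laid out with line items as rows and year-ends as columns
income_statement = pd.DataFrame(
    {
        "2019-12-31": [100.0, 40.0],
        "2018-12-31": [80.0, 30.0],
        "2017-12-31": [50.0, 20.0],
    },
    index=["netIncome", "grossProfit"],
)

# Sort so the oldest year comes first, then compare each year to the one before it
net_income = income_statement.loc["netIncome"].sort_index()
yoy_growth = net_income / net_income.shift(1) - 1
print(yoy_growth)
```

The first entry is NaN because there is no earlier year to compare against.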
Let’s look at the balance sheet result. If you print the balance_sheet variable, you’ll see it contains a data frame scraped from this link.
Suppose you want to see how Amazon’s inventory has changed over the last three year-ends. This information is available on its balance sheet, so you could figure that out using the below code:
balance_sheet.loc["inventory"]
Parsing information from the cash flow result is similar to the above examples.
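Since records are looked up by name, a misspelled name raises a KeyError. The sketch below shows one possible way to suggest close matches from the index. The helper names are hypothetical, and the table and its row names are made up for illustration, so check cash_flow.index for the real ones.

```python
import difflib

import pandas as pd


def suggest_line_items(statement, name):
    """Return up to three row names that closely resemble `name`."""
    return difflib.get_close_matches(name, list(statement.index), n=3)


def get_line_item(statement, name):
    """Return one row of a statement, or raise a KeyError that lists similar names."""
    if name in statement.index:
        return statement.loc[name]
    raise KeyError(f"{name!r} not found; similar rows: {suggest_line_items(statement, name)}")


# Made-up cash flow table; the row names are illustrative only
cash_flow = pd.DataFrame(
    {"2019-12-31": [10.0, -4.0], "2018-12-31": [8.0, -3.0]},
    index=["netIncome", "capitalExpenditures"],
)

print(get_line_item(cash_flow, "netIncome"))
print(suggest_line_items(cash_flow, "capitalExpenditure"))
```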
Getting the major holders of a stock can be done with the get_holders function.
holders = si.get_holders("amzn")
If you run the above line of code, you’ll see it returns a dictionary. The keys of the dictionary correspond with the headers on the Holders Page i.e. “Major Holders”, “Top Mutual Fund Holders”, “Top Institutional Holders” etc. The values are the corresponding tables that are shown beneath these respective headers on the Holders Page.
For instance, if you want to get the largest institutional holder, you could write this:
info = holders["Top Institutional Holders"]
print(info.Holder[0])
The institutional holders are sorted in decreasing order by number of shares owned, so the 0th-indexed record contains the holder with the most shares. If you want to get the average number of shares owned by the top ten institutional holders, or the average value invested, you could run the below code:
# get average number of shares owned by top 10
# institutional investors
info.Shares.mean()
# similarly, get average value invested
info.Value.mean()
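Scraped tables sometimes hold numbers as text, in which case .mean() will not behave as expected. Whether that happens here depends on the page and the package version, so treat the sketch below as a precaution under that assumption. The table is made up, and to_number is a hypothetical helper.

```python
import pandas as pd

# Made-up table using the Holder, Shares and Value column names from this post
info = pd.DataFrame(
    {
        "Holder": ["Fund A", "Fund B", "Fund C"],
        "Shares": ["3,000", "2,000", "1,000"],  # assumption: values arrive as text
        "Value": [300.0, 200.0, 100.0],
    }
)


def to_number(column):
    """Strip thousands separators and convert a column to numbers."""
    cleaned = column.astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


shares = to_number(info["Shares"])
# fraction of the listed shares held by the largest holder
top_fraction = shares.iloc[0] / shares.sum()
print(shares.mean(), top_fraction)
```

Values that cannot be parsed become NaN and are skipped by .mean().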
Data from the Analysts page can be scraped using the get_analysts_info function.
analysts_data = si.get_analysts_info("amzn")
Similarly to pulling holder information, the get_analysts_info function will return a dictionary. In this case, the keys are also the headers of the webpage the data is being scraped from — i.e. the Analysts page for the particular stock. For the Analysts page, this means the keys include “Earnings Estimate”, “Revenue Estimate”, “EPS Trends” etc. Once again, the values are the corresponding tables beneath each of these respective headers.
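Because the result is a dictionary of tables, a quick way to get oriented is to list each key with the shape of its table. The sketch below uses a made-up dictionary with placeholder columns in place of the real result; summarize_tables is a hypothetical helper that would work equally well on the holders dictionary.

```python
import pandas as pd


def summarize_tables(tables):
    """Map each table name to its (rows, columns) shape."""
    return {name: table.shape for name, table in tables.items()}


# Made-up stand-in for the dictionary returned by si.get_analysts_info;
# the column names are placeholders, not the real ones
analysts_data = {
    "Earnings Estimate": pd.DataFrame({"col_a": [1.0, 2.0], "col_b": [1.5, 2.5]}),
    "Revenue Estimate": pd.DataFrame({"col_a": [10.0]}),
}

for name, shape in summarize_tables(analysts_data).items():
    print(name, shape)
```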
Yahoo_fin also provides functions to pull ticker lists. An example earlier in this post showed how to get the tickers in the S&P 500, but you can also pull the ones comprising the Dow Jones, or the NASDAQ.
# get list of Dow stocks
dow_list = si.tickers_dow()
# get list of NASDAQ stocks
nasdaq_list = si.tickers_nasdaq()
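Since these functions return lists of ticker symbols, Python sets make it easy to compare them. The lists in the sketch below are hand-typed samples for illustration, not actual output from tickers_dow or tickers_nasdaq.

```python
# Hand-typed sample lists, not real output from yahoo_fin
dow_sample = ["AAPL", "MSFT", "JPM"]
nasdaq_sample = ["AAPL", "MSFT", "NFLX"]

# tickers that appear in both lists
in_both = sorted(set(dow_sample) & set(nasdaq_sample))
# tickers that appear only in the first list
only_dow = sorted(set(dow_sample) - set(nasdaq_sample))

print(in_both)
print(only_dow)
```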