How to scrape news articles with Python

Python, Web Scraping
In this post we're going to discuss how to scrape news articles with Python. This can be done using the handy newspaper package.

Introduction to Python's newspaper package

The newspaper package can be installed using pip:

[code]
pip install newspaper
[/code]

Once it's installed, we can get started. newspaper can either scrape a single article from a given URL or find the links on a webpage to other news articles. Let's start with handling a single article. First, we need to import the Article class. Next, we use this class to download the content from the URL to our news article. Then, we use the parse method to parse the HTML. Lastly, we can print out the text of the article using .text.

Scraping a single article…
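Putting those steps together, here's a minimal sketch of the single-article workflow described above (the URL is a placeholder):

[code lang="python"]
from newspaper import Article

# point the Article class at a news story (placeholder URL)
url = "https://example.com/some-news-article"
article = Article(url)

# download the page's HTML, then parse out the article content
article.download()
article.parse()

# print the full text of the article
print(article.text)
[/code]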
Read More
yahoo_fin 0.8.6

Python, Web Scraping
yahoo_fin, a package for scraping stock prices and financial data, was recently updated to version 0.8.6. The latest version has several new features, including the following:

- Users can now pull quarterly fundamentals data (in addition to yearly). This includes balance sheets, income statements, and cash flow statements.
- Earnings data can be extracted with the new get_earnings method.
- If you're looking to get multiple pieces of fundamentals data for the same stock (i.e. cash flows, income statements, and balance sheets), make sure to check out the new get_financials method, which allows you to easily pull data points from each of these sources in a single request (see the sketch after this list).
- New methods were added to retrieve historical dividend payouts and splits information.
- A bug in the ticker extraction methods causing a…
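Here's a rough sketch of how these methods can be used (get_earnings and get_financials are named above; the yearly argument for switching to quarterly data is an assumption):

[code lang="python"]
import yahoo_fin.stock_info as si

# quarterly balance sheet (yearly = False is assumed to switch
# from annual to quarterly data)
quarterly_bs = si.get_balance_sheet("aapl", yearly = False)

# earnings data via the new get_earnings method
earnings = si.get_earnings("aapl")

# balance sheet, income statement, and cash flows in one request
financials = si.get_financials("aapl")
[/code]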
Read More
How to download fundamentals data with Python

Python, Web Scraping
How can we download fundamentals data with Python? In this post, we'll extract fundamentals data from Yahoo Finance using the yahoo_fin package. For more on yahoo_fin, including installation instructions, check out its full documentation here or my YouTube video tutorials here.

Getting started

Now, let's import the stock_info module from yahoo_fin. This will provide the functionality we need to scrape fundamentals data from Yahoo Finance. We'll also import the pandas package, as we'll use it later to work with data frames.

[code lang="python"]
import yahoo_fin.stock_info as si
import pandas as pd
[/code]

Next, we'll dive into getting common company metrics, starting with P/E ratios.

How to get P/E (Price-to-Earnings) Ratios

There are a couple of ways to get…
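One way is to read the P/E ratio off a stock's summary quote table, sketched below (get_quote_table is part of yahoo_fin, but the "PE Ratio (TTM)" field label is assumed from Yahoo Finance's page layout):

[code lang="python"]
# pull the summary data shown on a stock's Yahoo Finance quote page
quote = si.get_quote_table("aapl")

# the result is a dict; the P/E ratio is one of its fields
pe_ratio = quote["PE Ratio (TTM)"]
[/code]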
Read More
Python for Web Scraping and APIs Online Course

Python, Web Scraping
I recently worked with a great team at 365 Data Science to create an extensive online Python for Web Scraping and APIs course! The course is now available on Udemy. Check it out here!

The course covers many important topics, including the following:

- All about APIs: GET and POST requests
- HTML syntax: what it's for and why it's useful for web scraping
- BeautifulSoup: scraping links, text, and other elements from a webpage
- How to easily scrape tables from webpages
- How to automatically download files from the web
- Handling modern challenges: how to scrape JavaScript-rendered webpages
- Exploring the requests-html package, a powerful alternative to BeautifulSoup

Click here to learn more about Python for Web Scraping and APIs Fundamentals! For other resources on learning Python or R, also check…
Read More
Updates to yahoo_fin package

Python, Web Scraping
Package updates

First of all, thank you to everyone who has contacted me regarding the yahoo_fin package! Due to some changes in Yahoo Finance's website, I've updated the source code of yahoo_fin. To upgrade to the latest version (0.8.4), you can use pip:

[code]
pip install yahoo_fin --upgrade
[/code]

Get weekly and monthly stock prices

The most recent version retains all the functionality of the previous version, and the get_data method can now also pull weekly and monthly historical stock prices.

[code lang="python"]
from yahoo_fin import stock_info as si

# default daily data
daily_data = si.get_data("amzn")

# get weekly data
weekly_data = si.get_data("amzn", interval = "1wk")

# get monthly data
monthly_data = si.get_data("amzn", interval = "1mo")
[/code]

Speed-up in functions

The options module includes an update…
Read More
BeautifulSoup vs. Rvest

Python, R, Web Scraping
This post compares Python's BeautifulSoup package to R's rvest package for web scraping. We'll also talk about functionality that rvest offers but BeautifulSoup doesn't, comparing rvest to a couple of other Python packages (including pandas and RoboBrowser).

Getting started

BeautifulSoup and rvest both involve creating an object that we can use to parse the HTML from a webpage. However, one immediate difference is that BeautifulSoup is just a parser, so it doesn't connect to webpages. rvest, on the other hand, can connect to a webpage and scrape and parse its HTML in a single package. In BeautifulSoup, our initial setup looks like this:

[code lang="python"]
# load packages
from bs4 import BeautifulSoup
import requests

# connect to webpage
resp = requests.get("https://www.azlyrics.com/b/beatles.html")

# get BeautifulSoup object
soup…
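For reference, here's a minimal sketch of the complete setup (the final line and the parser choice are assumptions, not necessarily what the full post uses):

[code lang="python"]
# load packages
from bs4 import BeautifulSoup
import requests

# connect to webpage; BeautifulSoup itself can't do this step
resp = requests.get("https://www.azlyrics.com/b/beatles.html")

# get BeautifulSoup object from the downloaded HTML
# ("html.parser" is an assumption; lxml is another common choice)
soup = BeautifulSoup(resp.content, "html.parser")
[/code]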
Read More
Web Browsing and Parsing with RoboBrowser and requests_html

Python, Web Scraping
Background

So you've learned all about BeautifulSoup. What's next? Python is a great language for automating web operations. In a previous article we went through how to use BeautifulSoup and requests to scrape stock-related articles from Nasdaq's website. This post covers a couple of alternatives to using BeautifulSoup directly.

One option for scraping and crawling the web is Python's RoboBrowser package, which is built on top of requests and BeautifulSoup. Because it builds on those packages, the code needed to scrape the web is a bit simpler, as we'll see below. RoboBrowser works similarly to the older Python 2.x package mechanize in that it allows you to simulate a web browser. A second option is requests_html, which was also discussed here, and which we'll also…
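To give a flavor of the RoboBrowser style, here's a minimal sketch (the URL is a placeholder and the parser choice is an assumption):

[code lang="python"]
from robobrowser import RoboBrowser

# create a browser session; RoboBrowser handles the requests for us
browser = RoboBrowser(parser = "html.parser")

# open a page, then query it like a BeautifulSoup object
browser.open("https://example.com")
links = browser.select("a")
[/code]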
Read More
How to get options data with Python

Python, Web Scraping
In a previous post, we talked about how to get real-time stock prices with Python. This post goes through how to download financial options data with Python, using the yahoo_fin package.

The yahoo_fin package comes with a module called options, which allows you to scrape option chains and get option expiration dates. To get started, we'll just import this module from yahoo_fin.

[code lang="python"]
from yahoo_fin import options
[/code]

How to get options expiration dates

Every option contract has an expiration date. To get all of the option expiration dates for a particular stock, we can use the get_expiration_dates method in the options module. This method is equivalent to scraping all of the date selection boxes on the options page for an individual stock (e.g.…
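A quick sketch of these calls (the ticker is just an example, and get_options_chain is assumed from the module's description above):

[code lang="python"]
# all available expiration dates for a ticker
dates = options.get_expiration_dates("nflx")

# the full option chain (calls and puts) for the nearest expiration
chain = options.get_options_chain("nflx")
[/code]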
Read More
Creating a word cloud on R-bloggers posts

R, Web Scraping
This post goes through how to create a word cloud of article titles scraped from the awesome R-bloggers. Our goal is to use R's rvest package to search through 50 successive pages on the site for article titles. The stringr package will be used for string cleaning, and the tm package for creating a term-document frequency matrix. We will then create a word cloud based on the words in these titles.

First, we'll load the packages we need.

[code lang="R"]
# load packages
library(rvest)
library(stringr)
library(tm)
library(wordcloud)
[/code]

Let's write a function that will take a webpage as input and return all the scraped article titles.

[code lang="R"]
scrape_post_titles <- function(site) {

  # scrape HTML from input site
  source_html <- read_html(site)

  # grab the title attributes…
Read More