Web Scraping Archives - Open Source Automation

05 Aug 2020 by Andrew Treadway

How to scrape news articles with Python

In this post we're going to discuss how to scrape news articles with Python. This can be done using the handy newspaper package.

Introduction to Python's newspaper package

The newspaper package can be installed using pip:

[code]
pip install newspaper
[/code]

Once it's installed, we can get started. newspaper can work either by scraping a single article from a given URL, or by finding the links on a webpage to other news articles. Let's start with handling a single article. First, we need to import the Article class. Next, we use this class to download the content from the URL of our news article. Then, we use the parse method to parse the HTML. Lastly, we can print out the text of the article using .text.

Scraping a single article…
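
The single-article steps described above (import Article, download, parse, read .text) can be sketched as follows. This is a minimal sketch rather than the post's own listing: the function name scrape_article_text is hypothetical, and it assumes the package is importable as newspaper (on Python 3 it is published on PyPI as newspaper3k).

```python
def scrape_article_text(url):
    """Download one news article and return its plain text."""
    # Imported inside the function so this sketch can be loaded
    # even where the newspaper package is not installed.
    from newspaper import Article

    article = Article(url)  # point an Article object at the page
    article.download()      # fetch the raw HTML
    article.parse()         # parse the HTML into title, text, etc.
    return article.text     # the body text of the article
```

Calling scrape_article_text with the URL of a news story should return the article body as a single string. It needs network access, so no output is shown here.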

14 Jul 2020 by Andrew Treadway

yahoo_fin 0.8.6

Python, Web Scraping

yahoo_fin, a package for scraping stock prices and financial data, was recently updated to version 0.8.6. The latest version has several new features, including the following:

- Users can now pull quarterly fundamentals data (in addition to yearly). This includes balance sheets, income statements, and cash flow statements.
- Earnings data can be extracted with the new get_earnings method.
- If you're looking to get multiple pieces of fundamentals data for the same stock (i.e. cash flows, income statements, and balance sheets), make sure to check out a new method called get_financials. This method allows you to easily pull data points from each of these sources in a single request.
- New methods were added to retrieve historical dividend payouts and splits information.
- A bug in the ticker extraction methods causing a…
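
As a rough illustration of the two methods named above, the sketch below calls them together for one ticker. It assumes get_financials and get_earnings live in the stock_info module and accept a ticker symbol as their first argument. The helper name pull_fundamentals and the dictionary keys it returns are hypothetical, and nothing is claimed here about the structure of the data that comes back.

```python
def pull_fundamentals(ticker):
    """Collect statement and earnings data for one ticker."""
    # Imported inside the function so this sketch can be loaded
    # even where yahoo_fin is not installed.
    from yahoo_fin import stock_info as si

    return {
        # cash flows, income statements and balance sheets in one call
        "financials": si.get_financials(ticker),
        # earnings data from the new get_earnings method
        "earnings": si.get_earnings(ticker),
    }
```

A call such as pull_fundamentals("amzn") would hit Yahoo Finance over the network, so results depend on the site being reachable and unchanged.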

05 May 2020 by Andrew Treadway

How to download fundamentals data with Python

Python, Web Scraping

How can we download fundamentals data with Python? In this post we'll explore how, extracting fundamentals data from Yahoo Finance using the yahoo_fin package. For more on yahoo_fin, including installation instructions, check out its full documentation here or my YouTube video tutorials here.

Getting started

Now, let's import the stock_info module from yahoo_fin. This will provide us with the functionality we need to scrape fundamentals data from Yahoo Finance. We'll also import the pandas package, as we'll be using it later to work with data frames.

[code lang="python"]
import yahoo_fin.stock_info as si
import pandas as pd
[/code]

Next, we'll dive into getting common company metrics, starting with P/E ratios.

How to get P/E (Price-to-Earnings) Ratios

There are a couple of ways to get…

21 Apr 2020 by Andrew Treadway

How to check if groceries are in stock and automatically buy them with R

R, Web Scraping

Background

Anyone who's bought groceries online recently has seen the huge increase in demand due to the COVID-19 outbreak and quarantines. In this post, you'll learn how to buy groceries on Amazon using R! To do that, we'll be using the RSelenium package. In case you're not familiar, Selenium is a browser automation tool. It works like a normal browser, except that you write code to perform operations, such as navigating to websites, filling in forms online, clicking links and buttons, etc. In this way, it's similar to writing a macro in Excel, except for a web browser. Several different languages, including Python and R, have packages that allow you to use Selenium by writing code in their language. R's package for this, as mentioned above, is RSelenium.

Getting started…

25 Mar 2020 by Andrew Treadway

Python for Web Scraping and APIs Online Course

Python, Web Scraping

I recently worked with a great team at 365 Data Science to create an extensive online Python for Web Scraping and APIs course! The course is now available on Udemy. Check it out here! The course covers many important topics, including the following:

- All about APIs: GET and POST requests
- HTML syntax: what it's for and why it's useful for web scraping
- BeautifulSoup: scraping links, text, and other elements from a webpage
- How to easily scrape tables from webpages
- How to automatically download files from the web
- Handling modern challenges: how to scrape JavaScript-rendered webpages
- Exploring the requests-html package, a powerful alternative to BeautifulSoup

Click here to learn more about Python for Web Scraping and APIs Fundamentals! For other resources on learning Python or R, also check…

16 Dec 2019 by Andrew Treadway

Updates to yahoo_fin package

Python, Web Scraping

Package updates

First of all, thank you to everyone who has contacted me regarding the yahoo_fin package! Due to some changes in Yahoo Finance's website, I've updated the source code of yahoo_fin. To upgrade to the latest version (0.8.4), you can use pip:

[code]
pip install yahoo_fin --upgrade
[/code]

Get weekly and monthly stock prices

The most recent version includes all functionality from the previous version, but now also includes the ability to pull weekly and monthly historical stock prices in the get_data method.

[code lang="python"]
from yahoo_fin import stock_info as si

# default daily data
daily_data = si.get_data("amzn")

# get weekly data
weekly_data = si.get_data("amzn", interval = "1wk")

# get monthly data
monthly_data = si.get_data("amzn", interval = "1mo")
[/code]

Speed-up in functions

The options module includes an update…

23 Jul 2019 by Andrew Treadway

BeautifulSoup vs. Rvest

Python, R, Web Scraping

This post will compare Python's BeautifulSoup package to R's rvest package for web scraping. We'll also talk about additional functionality in rvest (that doesn't exist in BeautifulSoup) in comparison to a couple of other Python packages (including pandas and RoboBrowser).

Getting started

BeautifulSoup and rvest both involve creating an object that we can use to parse the HTML from a webpage. However, one immediate difference is that BeautifulSoup is just a parser, so it doesn't connect to webpages itself. rvest, on the other hand, can connect to a webpage and scrape / parse its HTML in a single package. In BeautifulSoup, our initial setup looks like this:

[code lang="python"]
# load packages
from bs4 import BeautifulSoup
import requests

# connect to webpage
resp = requests.get("https://www.azlyrics.com/b/beatles.html")

# get BeautifulSoup object
soup…

22 Jun 2019 by Andrew Treadway

Web Browsing and Parsing with RoboBrowser and requests_html

Python, Web Scraping

Background

So you've learned all about BeautifulSoup. What's next? Python is a great language for automating web operations. In a previous article we went through how to use BeautifulSoup and requests to scrape stock-related articles from Nasdaq's website. This post talks about a couple of alternatives to using BeautifulSoup directly. One way of scraping and crawling the web is to use Python's RoboBrowser package, which is built on top of requests and BeautifulSoup. Because it's built using each of these packages, writing code to scrape the web is a bit simplified as we'll see below. RoboBrowser works similarly to the older Python 2.x package mechanize in that it allows you to simulate a web browser. A second option is using requests_html, which was also discussed here, and which we'll also…

17 Apr 2019 by Andrew Treadway

How to get options data with Python

Python, Web Scraping

In a previous post, we talked about how to get real-time stock prices with Python. This post will go through how to download financial options data with Python. We will be using the yahoo_fin package. The yahoo_fin package comes with a module called options. This module allows you to scrape option chains and get option expiration dates. To get started we'll just import this module from yahoo_fin.

[code lang="python"]
from yahoo_fin import options
[/code]

How to get options expiration dates

Any option contract has an expiration date. To get all of the option expiration dates for a particular stock, we can use the get_expiration_dates method in the options module. This method is equivalent to scraping all of the date selection boxes on the options page for an individual stock (e.g.…
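
A minimal sketch of the get_expiration_dates call described above. It assumes the method takes a ticker symbol as its first argument. The wrapper name list_expiration_dates is hypothetical, and the format of each returned date is whatever Yahoo Finance provides, which is not asserted here.

```python
def list_expiration_dates(ticker):
    """Return the option expiration dates listed for one ticker."""
    # Imported inside the function so this sketch can be loaded
    # even where yahoo_fin is not installed.
    from yahoo_fin import options

    # One entry per date available in the options page's date selector.
    return list(options.get_expiration_dates(ticker))
```

Because the call scrapes a live page, the dates returned for a given ticker will change over time.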

29 Jan 2019 by Andrew Treadway

Creating a word cloud on R-bloggers posts

R, Web Scraping

This post will go through how to create a word cloud of article titles scraped from the awesome R-bloggers. Our goal will be to use R's rvest package to search through 50 successive pages on the site for article titles. The stringr package will be used for string cleaning, and the tm package for creating a term-document matrix. We will then create a word cloud based on the words comprising these titles. First, we'll load the packages we need.

[code lang="R"]
# load packages
library(rvest)
library(stringr)
library(tm)
library(wordcloud)
[/code]

Let's write a function that will take a webpage as input and return all the scraped article titles.

[code lang="R"]
scrape_post_titles <- function(site) {
    # scrape HTML from input site
    source_html <- read_html(site)
    # grab the title attributes…
