Web Scraping Archives - Page 2 of 2 - Open Source Automation

19 Jan 2019 by Andrew Treadway

Scraping data from a JavaScript webpage with Python

This post will walk through how to use the requests_html package to scrape options data from a JavaScript-rendered webpage. requests_html serves as an alternative to Selenium and PhantomJS, and provides a clear syntax similar to the awesome requests package. The code we'll walk through is packaged into functions in the options module in the yahoo_fin package, but this article will show how to write the code from scratch using requests_html so that you can use the same idea to scrape other JavaScript-rendered webpages.

Note: requests_html requires Python 3.6+. If you don't have requests_html installed, you can install it using pip:

[cc]
pip install requests_html
[/cc]

Motivation

Let's say we want to scrape options data for a particular stock. As an example, let's look at Netflix (since it's well known). If…
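As a rough illustration of the requests_html approach described in this excerpt, here is a minimal sketch of fetching a page and letting its JavaScript run before reading the HTML. The helper name fetch_rendered_html and its session parameter are hypothetical (they are not from the post); HTMLSession, get, and html.render are the pieces of requests_html the sketch relies on.

```python
def fetch_rendered_html(url, session=None):
    """Return a page's HTML after its JavaScript has been executed.

    `fetch_rendered_html` and `session` are illustrative names, not part of
    the original post. Passing a session makes the helper easy to exercise
    with a stand-in object.
    """
    if session is None:
        # Imported lazily so the function can be defined without the package
        from requests_html import HTMLSession
        session = HTMLSession()

    response = session.get(url)
    # render() runs the page's scripts in a headless Chromium instance
    # (requests_html downloads Chromium the first time this is called)
    response.html.render()
    return response.html.html


# Example call (placeholder URL, needs network access):
# html = fetch_rendered_html("https://example.com/page-that-needs-javascript")
```

Once the page has been rendered, the returned HTML can be parsed the same way as any static page.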

31 Jul 2018 by Andrew Treadway

How to get live stock prices with Python

Python, Web Scraping

In a previous post, I gave an introduction to the yahoo_fin package. The latest version of the package includes new functionality allowing you to scrape live (real-time) stock prices from Yahoo Finance. In this article, we'll go through a couple of ways of getting real-time data from Yahoo Finance for stocks, as well as how to pull cryptocurrency price information.

The get_live_price function

First, we just need to load the stock_info module from yahoo_fin.

[code lang="python"]
# import stock_info module from yahoo_fin
from yahoo_fin import stock_info as si
[/code]

Then, obtaining the current price of a stock is as simple as one line of code:

[code lang="python"]
# get live price of Apple
si.get_live_price("aapl")

# or Amazon
si.get_live_price("amzn")

# or any other ticker
si.get_live_price(ticker)
[/code]

Note: Passing tickers is not…
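If you want prices for several tickers at once, one option is to wrap get_live_price in a small helper. The sketch below is only an illustration: get_live_prices and its fetch parameter are hypothetical names, and the demonstration at the bottom uses a stand-in fetcher with made-up numbers so it runs without a network connection.

```python
def get_live_prices(tickers, fetch=None):
    """Return a dict mapping each ticker to its current price.

    `get_live_prices` and `fetch` are illustrative names. By default the
    helper calls yahoo_fin's get_live_price once per ticker.
    """
    if fetch is None:
        # Imported lazily so the helper can be tried without yahoo_fin installed
        from yahoo_fin import stock_info as si
        fetch = si.get_live_price

    return {ticker: fetch(ticker) for ticker in tickers}


# Offline demonstration with a stand-in fetcher.
# These numbers are made up for the example; they are not real quotes.
fake_quotes = {"aapl": 100.0, "amzn": 200.0}
prices = get_live_prices(["aapl", "amzn"], fetch=fake_quotes.get)
```

With yahoo_fin installed, calling get_live_prices(["aapl", "amzn"]) without the fetch argument would request the live values instead.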

16 Jul 2018 by Andrew Treadway

How to download image files with RoboBrowser

Python, Web Scraping

In a previous post, we showed how RoboBrowser can be used to fill out online forms for getting historical weather data from Wunderground. This article will talk about how to use RoboBrowser to batch download collections of image files from Pexels, a site which offers free downloads. If you're looking to work with images, or want to build a training set for an image classifier with Python, this post will help you do that.

In the first part of the code, we'll load the RoboBrowser class from the robobrowser package, create a browser object which acts like a web browser, and navigate to the Pexels website.

[code lang="python"]
# load the RoboBrowser class from robobrowser
from robobrowser import RoboBrowser

# define base site
base = "https://www.pexels.com/"

# create browser object,…

25 Jan 2018 by Andrew Treadway

Coding with the Yahoo_fin Package

Python, Web Scraping

Background on yahoo_fin

The yahoo_fin package contains functions to scrape stock-related data from Yahoo Finance and NASDAQ. You can view the official documentation by clicking this link, but the post below will provide a few more in-depth examples. Also, please check out my yahoo_fin playlist on YouTube. The first video is below, which covers installation and getting historical / real-time stock prices.

The functions in yahoo_fin are divided into two modules, stock_info and options. This post will focus on introducing stock_info. For more on using the options module, check out this post. Let's get started by importing the stock_info module from yahoo_fin.

[code lang="python"]
import yahoo_fin.stock_info as si
[/code]

Downloading price data

One of the core functions available is called get_data, which retrieves historical price data for an individual stock.…

12 Oct 2017 by Andrew Treadway

Word Frequency Analysis

Python, Web Scraping

In a previous article, we talked about using Python to scrape stock-related articles from the web. As an extension of this idea, we’re going to show you how to use the NLTK package to figure out how often different words occur in text, using scraped stock articles.

Initial Setup

Let's import the NLTK package, along with requests and BeautifulSoup, which we'll need to scrape the stock articles.

[code language="python" style="font-size: 8px"]
'''load packages'''
import nltk
import requests
from bs4 import BeautifulSoup
[/code]

Pulling the data we'll need

Below, we're copying code from my scraping stocks article. This gives us a function, scrape_all_articles (along with two other helper functions), which we can use to pull the actual raw text from articles linked to from NASDAQ's website.

[code language="python"]
def scrape_news_text(news_url):
    news_html…
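To give a feel for what a word frequency count looks like before any scraping is involved, here is a small standard-library sketch. Note that it swaps in collections.Counter and a simple regular expression in place of NLTK, so it is not the post's method; the function name word_frequencies and the sample sentence are made up for illustration.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=5):
    """Return the top_n most common words in text as (word, count) pairs.

    Illustrative helper, not from the original post: it lowercases the text
    and treats runs of letters and apostrophes as words, which is a much
    cruder tokenizer than what NLTK provides.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)


# Made-up sample sentence standing in for scraped article text
sample = "Netflix shares rose as Netflix added subscribers; shares of rivals fell."
top_words = word_frequencies(sample, top_n=2)
```

The same function can be applied to the raw text of a scraped article once you have it as a string.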

19 Sep 2017 by Andrew Treadway

RoboBrowser: Automating Online Forms

Python, Web Scraping

Background

RoboBrowser is a Python 3.x package for crawling through the web and submitting online forms. It works similarly to the older Python 2.x package, mechanize. This post is going to give a simple introduction using RoboBrowser to submit a form on Wunderground for scraping historical weather data.

Initial setup

RoboBrowser can be installed via pip:

[code lang="python"]
pip install robobrowser
[/code]

Let's do the initial setup of the script by loading the RoboBrowser package. We'll also load pandas, as we'll be using that a little bit later.

[code lang="python"]
from robobrowser import RoboBrowser
import pandas as pd
[/code]

Create RoboBrowser Object

Next, we create a RoboBrowser object. This object functions similarly to an actual web browser. It allows you to navigate to different websites, fill in forms, and get…

24 Aug 2017 by Andrew Treadway

Scraping Articles About Stocks

Python, Web Scraping

See recommended books here.

The following article will show you an example of how to scrape articles about stocks from the Web using Python 3. Specifically, we'll be looking at articles linked from http://www.nasdaq.com. If you're not familiar with list comprehensions, you may want to check this, as we'll be using them in our code.

Initial, Specific Example

Let's start with a specific stock -- say, Netflix. Articles linked to a specific stock ticker from Nasdaq's website have the following pattern:

http://www.nasdaq.com/symbol/TICKER/news-headlines

where TICKER is replaced with whatever ticker you want. In our case, we will start by dealing specifically with Netflix's (NFLX) stock. So our site of interest is:

http://www.nasdaq.com/symbol/nflx/news-headlines

The first step is to load the requests and BeautifulSoup packages. Here, we'll also set the variable site equal to…
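The URL pattern from this excerpt can be sketched as a small helper, combined with a list comprehension to build the news-headlines link for several tickers. The function name nasdaq_news_url is hypothetical, and lowercasing the ticker is an assumption based on the nflx example in the excerpt; the site's URL scheme may also have changed since the post was written.

```python
def nasdaq_news_url(ticker):
    """Build a news-headlines URL for a ticker using the pattern in the post.

    Illustrative helper: the ticker is lowercased to match the "nflx"
    example, which is an assumption about what the site expects.
    """
    pattern = "http://www.nasdaq.com/symbol/{}/news-headlines"
    return pattern.format(ticker.strip().lower())


# A list comprehension builds one URL per ticker
tickers = ["NFLX", "aapl", "AMZN"]
urls = [nasdaq_news_url(ticker) for ticker in tickers]
```

Each URL in urls could then be passed to requests and BeautifulSoup in the same way as the single Netflix page.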

Category: Web Scraping