Getting data from PDFs the easy way with R

Getting data from PDFs the easy way with R

R
Earlier this year, a new package called tabulizer was released in R, which allows you to automatically pull out tables and text from PDFs. Note, this package only works if the PDF's text is highlightable (if it's typed) -- i.e. it won't work for scanned-in PDFs, or image files converted to PDFs. If you don't have tabulizer installed, just run install.packages("tabulizer") to get started. Initial Setup After you have tabulizer installed, we'll load it, and define a variable referencing an example PDF. [code lang="R"] library(tabulizer) site <- "http://www.sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf" [/code] The PDFs you manipulate with this package don't have to be located on your machine -- you can use tabulizer to reference a PDF by a URL. For our first example, we're going to use a sample PDF file found here:…
Read More

How to get live stock prices with Python

Python, Web Scraping
In a previous post, I gave an introduction to the yahoo_fin package. The most updated version of the package includes new functionality allowing you to scrape live stock prices from Yahoo Finance (real-time). In this article, we'll go through a couple ways of getting real-time data from Yahoo Finance for stocks, as well as how to pull cryptocurrency price information. The get_live_price function First, we just need to load the stock_info module from yahoo_fin. [code lang="python"] # import stock_info module from yahoo_fin from yahoo_fin import stock_info as si [/code] Then, obtaining the current price of a stock is as simple as one line of code: [code lang="python"] # get live price of Apple si.get_live_price("aapl") # or Amazon si.get_live_price("amzn") # or any other ticker si.get_live_price(ticker) [/code] Note: Passing tickers is not…
Read More
How to download image files with RoboBrowser

How to download image files with RoboBrowser

Python, Web Scraping
In a previous post, we showed how RoboBrowser can be used to fill out online forms for getting historical weather data from Wunderground. This article will talk about how to use RoboBrowser to batch download collections of image files from Pexels, a site which offers free downloads. If you're looking to work with images, or want to build a training set for an image classifier with Python, this post will help you do that. In the first part of the code, we'll load the RoboBrowser class from the robobrowser package, create a browser object which acts like a web browser, and navigate to the Pexels website. [code lang="python"] # load the RoboBrowser class from robobrowser from robobrowser import RoboBrowser # define base site base = "https://www.pexels.com/" # create browser object,…
Read More
R: How to create, delete, move, and more with files

R: How to create, delete, move, and more with files

File Manipulation, R, System Administration
Though Python is usually thought of over R for doing system administration tasks, R is actually quite useful in this regard. In this post we're going to talk about using R to create, delete, move, and obtain information on files. How to get and change the current working directory Before working with files, it's usually a good idea to first know what directory you're working in. The working directory is the folder that any files you create or refer to without explicitly spelling out the full path fall within. In R, you can figure this out with the getwd function. To change this directory, you can use the aptly named setwd function. [code lang="R"] # get current working directory getwd() # set working directory setwd("C:/Users") [/code] Creating Files and Directories…
Read More
ICA on Images with Python

ICA on Images with Python

Machine Learning, Python
Click here to see my recommended reading list. What is Independent Component Analysis (ICA)? If you're already familiar with ICA, feel free to skip below to how we implement it in Python. ICA is a type of dimensionality reduction algorithm that transforms a set of variables to a new set of components; it does so such that that the statistical independence between the new components is maximized. This is similar to Principle Component Analysis (PCA), which maps a collection of variables to statistically uncorrelated components, except that ICA goes a step further by maximizing statistical independence rather than just developing components that are uncorrelated. Like other dimensionality reduction methods, ICA seeks to reduce the number of variables in a set of data, while retaining key information. In the example we…
Read More
Coding with the Yahoo_fin Package

Coding with the Yahoo_fin Package

Python, Web Scraping
Background on yahoo_fin The yahoo_fin package contains functions to scrape stock-related data from Yahoo Finance and NASDAQ. You can view the official documentation by clicking this link, but the below post will provide a few more in-depth examples. Also, please check out my yahoo_fin playlist on YouTube. The first video is below, which covers installation and getting historical / real-time stock prices. The functions in yahoo_fin are divided into two modules, stock_info and options. This post will focus on introducing stock_info. For more on using the options module, check out this post. Let's get started by importing the stock_info module from yahoo_fin. [code lang="python"] import yahoo_fin.stock_info as si [/code] Downloading price data One of the core functions available is called get_data, which retrieves historical price data for an individual stock.…
Read More
Timing Python Processes

Timing Python Processes

Python
Timing Python processes is made possible with several different packages. One of the most common ways is using the standard library package, time, which we'll demonstrate with an example. However, another package that is very useful for timing a process -- and particularly telling you how far along a process has come -- is tqdm. As we'll show a little further down the post, tqdm will actually print out a progress bar as a process is running. Basic timing example Suppose we want to scrape the HTML from some collection of links. In this case, we're going to get a collection of URLs from Bloomberg's homepage. To do this, we'll use BeautifulSoup to get a list of full-path URLs. From the code below, this gives us a list of around…
Read More