How to read PDF files with Python

Python
Background

In a previous article, we talked about how to scrape tables from PDF files with Python. In this post, we'll cover how to extract text from several types of PDFs. To read PDF files with Python, we can focus most of our attention on two packages: pdfminer and pytesseract.

pdfminer (specifically pdfminer.six, a more up-to-date fork of pdfminer) is an effective package if you're handling PDFs that are typed, meaning you can highlight the text. To read scanned-in PDF files with Python, on the other hand, the pytesseract package comes in handy, as we'll see later in the post.

Scraping highlightable text

For the first example, let's scrape a 10-K form from Apple (see here). First, we'll just download this file to a…
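The split between the two packages described above can be sketched roughly as follows. This is only an illustration, not the code from the full post: the function names `has_text_layer` and `read_pdf`, the 20-character threshold, and the use of the pdf2image package to turn scanned pages into images are all assumptions made here. The idea is to try pdfminer.six first and fall back to pytesseract only when almost no text comes back.

```python
def has_text_layer(text: str, min_chars: int = 20) -> bool:
    """Heuristic (an assumption, not a rule): treat the PDF as typed if
    extraction returned at least min_chars non-whitespace characters."""
    return len("".join(text.split())) >= min_chars


def read_pdf(path: str) -> str:
    """Return the text of a PDF, using OCR only when no text layer is found."""
    # Imported lazily so has_text_layer can be used without these packages.
    from pdfminer.high_level import extract_text  # from pdfminer.six

    text = extract_text(path)
    if has_text_layer(text):
        return text

    # Fallback for scanned-in PDFs: render each page to an image, then OCR it.
    # pdf2image is an assumed helper package; it is not named in the excerpt.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(path)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)
```

Calling `read_pdf("some_file.pdf")` on a local file (a hypothetical file name) would return its text either way, though the OCR route also needs the Tesseract and Poppler binaries installed on the machine.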
Getting data from PDFs the easy way with R

R
Earlier this year, a new package called tabulizer was released in R, which allows you to automatically pull out tables and text from PDFs. Note that this package only works if the PDF's text is highlightable (that is, if it's typed); it won't work for scanned-in PDFs or image files converted to PDFs. If you don't have tabulizer installed, just run install.packages("tabulizer") to get started.

Initial Setup

After you have tabulizer installed, we'll load it and define a variable referencing an example PDF.

[code lang="R"]
# Load the package
library(tabulizer)

# URL of the example PDF
site <- "http://www.sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf"
[/code]

The PDFs you manipulate with this package don't have to be located on your machine; you can use tabulizer to reference a PDF by a URL. For our first example, we're going to use a sample PDF file found here:…