Getting data from PDFs the easy way with R

Getting data from PDFs the easy way with R

R
Earlier this year, a new package called tabulizer was released in R, which allows you to automatically pull out tables and text from PDFs. Note, this package only works if the PDF's text is highlightable (if it's typed) -- i.e. it won't work for scanned-in PDFs, or image files converted to PDFs. If you don't have tabulizer installed, just run install.packages("tabulizer") to get started. Initial Setup After you have tabulizer installed, we'll load it, and define a variable referencing an example PDF. [code lang="R"] library(tabulizer) site <- "http://www.sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf" [/code] The PDFs you manipulate with this package don't have to be located on your machine -- you can use tabulizer to reference a PDF by a URL. For our first example, we're going to use a sample PDF file found here:…
Read More