3 ways to scrape tables from PDFs with Python

This post will go through a few ways of scraping tables from PDFs with Python. To learn more about scraping tables and other data from PDFs with R, click here. Note that these options only work for PDFs containing typed text, not scanned-in images.

tabula-py

tabula-py is a very nice package that lets you both scrape tables from PDFs and convert PDFs directly into CSV files. tabula-py can be installed using pip:

[code]
pip install tabula-py
[/code]

If you have issues with installation, check this. Once installed, tabula-py is straightforward to use. Below we use it to scrape all the tables from a paper on classification regarding the Iris dataset (available here).

[code lang="python"]
import tabula

file = "http://lab.fs.uni-lj.si/lasin/wp/IMIT_files/neural/doc/seminar8.pdf"

# Read every table on every page of the PDF
tables = tabula.read_pdf(file, pages="all", multiple_tables=True)
[/code]

…
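The direct PDF-to-CSV conversion mentioned above could look roughly like the sketch below. It relies on tabula-py's `convert_into` function. The helper names `csv_name_for` and `pdf_tables_to_csv` are made up for illustration and are not part of tabula-py, and the sketch assumes tabula-py and a working Java runtime are installed.

```python
from pathlib import Path


def csv_name_for(pdf_path: str) -> str:
    """Hypothetical helper: derive a CSV file name from a PDF path or URL."""
    file_name = pdf_path.rsplit("/", 1)[-1]
    return Path(file_name).with_suffix(".csv").name


def pdf_tables_to_csv(pdf_path: str, csv_path: str = None, pages="all") -> str:
    """Hypothetical wrapper: write the tables found in a PDF to one CSV file."""
    # Imported inside the function so this module still loads when
    # tabula-py (or Java) is not available.
    import tabula

    csv_path = csv_path or csv_name_for(pdf_path)
    # convert_into extracts the tables and writes them straight to disk
    tabula.convert_into(pdf_path, csv_path, output_format="csv", pages=pages)
    return csv_path
```

Calling `pdf_tables_to_csv(file)` with the `file` variable from the example above should, under those assumptions, write the extracted tables to `seminar8.csv` in the working directory.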