
3 ways to scrape tables from PDFs with Python

This post will go through a few ways of scraping tables from PDFs with Python. To learn more about scraping tables and other data from PDFs with R, click here. Note that these options will only work for PDFs with typed text, not scanned-in images.

tabula-py

tabula-py is a very nice package that allows you to both scrape tables from PDFs and convert PDFs directly into CSV files. tabula-py can be installed using pip:


pip install tabula-py

If you have issues with installation, check this. Once installed, tabula-py is straightforward to use. Below we use it to scrape all the tables from a paper on classification of the Iris dataset (available here).


import tabula

file = "http://lab.fs.uni-lj.si/lasin/wp/IMIT_files/neural/doc/seminar8.pdf"

tables = tabula.read_pdf(file, pages="all", multiple_tables=True)

The result stored in tables is a list of data frames, one for each table found in the PDF file. To search for all the tables in a file, you have to specify the parameters pages="all" and multiple_tables=True.
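To sanity check the result, you can count the data frames and peek at the first one. This is just a minimal sketch; tables[0] is simply whichever table tabula-py happened to find first.


# tables is a list of pandas data frames
print(len(tables))

# peek at the first table found
print(tables[0].head())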

You can also use tabula-py to convert a PDF file directly into a CSV. The first line below will find just the first table in the PDF and output it to a CSV. If we add the parameter pages="all", we can write all of the PDF's tables to the CSV.


# output just the first table in the PDF to a CSV
tabula.convert_into(file, "iris_first_table.csv")

# output all the tables in the PDF to a CSV
tabula.convert_into(file, "iris_all.csv", all = True)


tabula-py can also scrape all of the PDFs in a directory in just one line of code, and drop the tables from each into CSV files.


tabula.convert_into_by_batch("/path/to/files", output_format="csv", pages="all")

We can perform the same operation but write the files out as JSON instead, like below.


tabula.convert_into_by_batch("/path/to/files", output_format="json", pages="all")

Camelot

Camelot is another possibility for scraping tables from PDFs. Camelot can be installed like so:


pip install camelot-py[cv]

Camelot has some additional dependencies, including Ghostscript, which are listed here. Once installed, we can use Camelot similarly to tabula-py to scrape PDF tables.


file = "http://lab.fs.uni-lj.si/lasin/wp/IMIT_files/neural/doc/seminar8.pdf"

tables = camelot.read_pdf(file, pages = "1-end")

This returns a TableList object. To access any of the tables found by index, you can do this:


# get the first table (index 0) as a data frame
tables[0].df

# get the fourth table (index 3) as a data frame
tables[3].df

One cool feature of Camelot is that you also get a “parsing report” for each table, giving an accuracy metric, the page the table was found on, and the percentage of whitespace present in the table.



tables[0].parsing_report

tables[3].parsing_report

From here we can see that the first identified table (index 0) is essentially whitespace. If we look at the raw PDF, we can see there's no table on that page, so it's safe to ignore this empty data frame.
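Since the parsing report is just a dictionary, you can use it to filter out false positives like this one programmatically. Below is a minimal sketch; the accuracy and whitespace cutoffs are arbitrary choices you should tune for your own documents.


# keep tables with decent accuracy that aren't mostly whitespace
# (the 90 and 50 cutoffs are arbitrary; tune them for your PDFs)
good_tables = [
    t for t in tables
    if t.parsing_report["accuracy"] > 90 and t.parsing_report["whitespace"] < 50
]

for t in good_tables:
    print(t.parsing_report["page"], t.df.shape)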

Like tabula-py, you can export all the scraped tables to a file. Camelot supports (as of this writing) CSV, JSON, Excel, HTML, and SQLite. If you choose CSV, Camelot will create a separate CSV file for each table by default. You can create a zip file of these CSVs by adding the parameter compress=True. Choosing to export to Excel will create a single workbook containing an individual worksheet for each table.


# export all tables at once to CSV files
tables.export("camelot_tables.csv", f = "csv")

# export all tables at once to CSV files in a single zip
tables.export("camelot_tables.csv", f = "csv", compress = True)

# export each table to a separate worksheet in an Excel file
tables.export("camelot_tables.xlsx", f = "excel")

If you want to export just a single table, each Table object has pandas-style methods like to_csv and to_excel (and its df attribute is a regular pandas data frame), so you can write one out directly.


tables[3].to_csv("camelot_third_table.csv")

tables[3].to_excel("camelot_third_table.xlsx")
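The other formats Camelot supports work the same way at the single-table level. A quick sketch, reusing the same tables object:


tables[3].to_json("camelot_third_table.json")

tables[3].to_html("camelot_third_table.html")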


Excalibur

If you’re looking for a web interface to use for extracting PDF tables, you can check out Excalibur, which is built on top of Camelot.

If Camelot is already installed, you can just use pip to install Excalibur:


pip install excalibur-py

You can get started with Excalibur from the command line. After you open the command line, just type the following:


excalibur initdb

The above command will initialize the metadata database the application needs. Next, run the command below to start the web server via Flask:


excalibur webserver

If you open a web browser to localhost (by default, http://localhost:5000), you should see the Excalibur interface.

From here, you’ll be able to upload a PDF file of your choice, and Excalibur will do the rest.

For more on working with PDF files, check out this post for how to read PDF text with Python.

Please check out my other Python posts here.

Andrew Treadway
