Background
In a previous article, we talked about how to scrape tables from PDF files with Python. In this post, we’ll cover how to extract text from several types of PDFs. To read PDF files with Python, we can focus most of our attention on two packages – pdfminer and pytesseract.
pdfminer (specifically pdfminer.six, which is a more up-to-date fork of pdfminer) is an effective package to use if you’re handling PDFs that are typed and you’re able to highlight the text. On the other hand, to read scanned-in PDF files with Python, the pytesseract package comes in handy, which we’ll see later in the post.
Scraping highlightable text
For the first example, let’s scrape a 10-K form from Apple (see here). First, we’ll just download this file to a local directory and save it as “apple_10k.pdf”. The first package we’ll use to extract text is pdfminer. To install the version of the package we need, you can use pip (note we’re installing pdfminer.six):
pip install pdfminer.six
Next, let’s import the extract_text method from pdfminer.high_level. This module within pdfminer provides higher-level functions for scraping text from PDF files. As can be seen below, the extract_text function lets us extract text from a PDF with one line of code (minus the package import)! This is an advantage of pdfminer over some other packages like PyPDF2.
from pdfminer.high_level import extract_text

text = extract_text("apple_10k.pdf")
print(text)
The code above will extract the text from each page in the PDF. If we want to limit our extraction to specific pages, we just need to pass that specification to extract_text using the page_numbers parameter.
# extract text from the first 10 pages
text10 = extract_text("apple_10k.pdf", page_numbers = range(10))

# get text from pages 0, 2, and 4
text_pages = extract_text("apple_10k.pdf", page_numbers = [0, 2, 4])
Scraping a password-protected PDF
If the PDF we want to scrape is password-protected, we just need to pass the password as a parameter to the same method as above.
text = extract_text("apple_10k.pdf", password = "top secret password")
Scraping text from scanned-in images
If a PDF contains scanned-in images of text, it can still be scraped, but a few additional steps are required. In this case, we’re going to use two other Python packages – pytesseract and Wand. The second of these is used to convert PDFs into image files, while pytesseract is used to extract text from images. Since pytesseract doesn’t work directly on PDFs, we have to first convert our sample PDF into an image (or a collection of image files).
Initial setup
Let’s get started by setting up the Wand package. Wand can be installed using pip:
pip install Wand
This package also requires a tool called ImageMagick to be installed (see here for more details).
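For example, on macOS ImageMagick is typically installed with Homebrew, while on Debian/Ubuntu the MagickWand development package is available through apt (exact package names can vary by platform and version):

brew install imagemagick

sudo apt-get install libmagickwand-dev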
There are other packages that can convert PDFs into image files. For example, pdf2image is another choice, but we’ll use Wand in this tutorial.
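As a rough sketch of that alternative (assuming pdf2image and its poppler dependency are installed; the filename and 500 DPI setting simply mirror the Wand example later in this post), the conversion would look something like this:

from pdf2image import convert_from_path

# each element of pages is a PIL image corresponding to one page of the PDF
pages = convert_from_path("scanned_apple_10k_snippet.pdf", dpi=500)

# save each page as a separate PNG file
for index, page in enumerate(pages):
    page.save("scanned_apple_10k_snippet" + str(index + 1) + ".png", "PNG")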
Additionally, let’s go ahead and install pytesseract. This package can also be installed using pip:
pip install pytesseract
pytesseract depends upon tesseract being installed (see here for instructions). tesseract is an underlying utility that performs OCR (Optical Character Recognition) on images to extract text.
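For example, on macOS Tesseract can typically be installed with Homebrew, while on Debian/Ubuntu it is available through apt (package names may differ on other systems):

brew install tesseract

sudo apt-get install tesseract-ocr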
Converting PDFs into image files
Now that our setup is complete, we can convert a PDF into a collection of image files. The way we do this is by converting each individual page into an image file. In addition to using Wand, we’re also going to import the os package to help create the name of each image output file.
For this example, we’re going to take a scanned-in version of the first three pages of the 10-K form from earlier in this post.
from wand.image import Image
import os

pdf_file = "scanned_apple_10k_snippet.pdf"

files = []

with Image(filename=pdf_file, resolution=500) as conn:
    for index, image in enumerate(conn.sequence):
        image_name = os.path.splitext(pdf_file)[0] + str(index + 1) + '.png'
        Image(image).save(filename=image_name)
        files.append(image_name)
In the with statement above, we open a connection to the PDF file. The resolution parameter specifies the DPI we want for the image outputs – in this case, 500. Within the for loop, we build the output filename, save the image using Image.save, and lastly append the filename to the list of image files. This way, we can later loop over the list of image files and scrape the text from each.
This should create three separate image files:
["scanned_apple_10k_snippet1.png", "scanned_apple_10k_snippet2.png", "scanned_apple_10k_snippet3.png"]
Using pytesseract on each image file
Next, we can use pytesseract to extract the text from each image file. In the code below, we store the extracted text from each page as a separate element in a list.
import pytesseract
from PIL import Image  # Pillow's Image (not wand's), used to open each image file

all_text = []
for file in files:
    text = pytesseract.image_to_string(Image.open(file))
    all_text.append(text)
Alternatively, we can use a list comprehension like the one below:
all_text = [pytesseract.image_to_string(Image.open(file)) for file in files]
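If you’d rather work with the document as a single string, the per-page results can be joined afterwards, for example:

# combine the per-page OCR results into one string
full_text = "\n".join(all_text)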
That’s all for now. If you enjoyed this post, please follow my blog on Twitter!