tesseract Archives - Open Source Automation

21 Jan 2020 by Andrew Treadway

How to read PDF files with Python

Python

Background

In a previous article, we talked about how to scrape tables from PDF files with Python. In this post, we'll cover how to extract text from several types of PDFs. To read PDF files with Python, we can focus most of our attention on two packages - pdfminer and pytesseract.

pdfminer (specifically pdfminer.six, which is a more up-to-date fork of pdfminer) is an effective package to use if you're handling PDFs that are typed, meaning you can highlight the text. To read scanned-in PDF files with Python, on the other hand, the pytesseract package comes in handy, as we'll see later in the post.

Scraping highlightable text

For the first example, let's scrape a 10-K form from Apple (see here). First, we'll just download this file to a…
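As a rough sketch of how the two packages divide the work, the snippet below tries pdfminer.six first and falls back to OCR with pytesseract when almost no text comes back. The helper names, the `min_chars` threshold, and the use of pdf2image to turn PDF pages into images are assumptions made for illustration, not necessarily what the full post does. The OCR path also needs the Tesseract and Poppler binaries installed on the system.

```python
def has_text_layer(text, min_chars=20):
    """Heuristic (an assumption, not from the post): treat the PDF as typed
    if extraction produced at least `min_chars` non-whitespace characters."""
    return len("".join(text.split())) >= min_chars


def read_typed_pdf(path):
    """Extract highlightable text with pdfminer.six."""
    # pdfminer.six is imported under the module name `pdfminer`.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def read_scanned_pdf(path, dpi=300):
    """OCR a scanned PDF: render each page to an image, then run Tesseract."""
    # pdf2image is an assumed helper for rendering pages; it is not named in the text above.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(path, dpi=dpi)  # list of PIL images, one per page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def read_pdf(path):
    """Try the text layer first; fall back to OCR if it looks empty."""
    text = read_typed_pdf(path)
    if has_text_layer(text):
        return text
    return read_scanned_pdf(path)
```

Calling `read_pdf("some_file.pdf")` (a hypothetical file name) returns a single string in both cases. The third-party imports sit inside the functions so that the OCR dependencies are only required when a scanned file is actually encountered.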
