Data extraction from PDF files using Python.

June 14, 2021

Data Extraction from PDF:

Extracting text from a PDF file means pulling out the data you actually need from it. Many Python libraries are available for text extraction from PDF files:

PyPDF
invoice2data
textract
PDFMiner
pdfplumber
tabula (for extracting tabular data)

1. PyPDF: Let’s start with PyPDF.

PyPDF is capable of:

Splitting documents page by page.
Merging documents page by page.
Cropping pages.
Extracting document information.
Encrypting and decrypting PDF files.

To install PyPDF, type the command below in the terminal:

pip install PyPDF2

PyPDF2 is not limited to Python 2; it runs on Python 3. PyPDF3 (pip install PyPDF3) is a separate fork of PyPDF2, not a Python 3 edition of it.
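Two of the capabilities listed above, splitting a document page by page and extracting document information, can be sketched as follows. This is a minimal sketch, assuming PyPDF2 version 2.0 or later (the PdfReader and PdfWriter classes); PyPDF3 and older PyPDF2 releases use different class and method names. The helper names and the output naming scheme are illustrative only.

```python
from pathlib import Path


def page_filename(stem: str, page_number: int) -> str:
    """Build an output name such as 'report_page_001.pdf' (illustrative naming scheme)."""
    return f"{stem}_page_{page_number:03d}.pdf"


def split_pdf(source: str, output_dir: str) -> list:
    """Write every page of `source` to its own PDF file and return the new paths."""
    # Imported inside the function so the naming helper works without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(source)
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for number, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()          # one writer per output file
        writer.add_page(page)
        target = out_dir / page_filename(Path(source).stem, number)
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written


def document_info(source: str) -> dict:
    """Return the document information (title, author, ...) as plain strings."""
    from PyPDF2 import PdfReader

    # metadata is None when the PDF carries no information dictionary.
    metadata = PdfReader(source).metadata or {}
    return {key: str(value) for key, value in metadata.items()}
```

For a hypothetical file invoice.pdf, split_pdf("invoice.pdf", "pages") would write one PDF per page into the pages folder, and document_info("invoice.pdf") would return whatever metadata the file stores.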

2. Tabula: tabula-py is a simple Python wrapper around tabula-java that reads tables from a PDF. Besides scraping tables from PDF files, it can also convert a PDF file directly into a CSV file.

To install the tabula library:

pip install tabula-py

Parameters of tabula.read_pdf:

1. pages (optional): Accepts str, int, or list of int. It specifies the pages to extract from.

2. guess (optional): Accepts bool. It guesses the portion of each page to analyze.

3. stream (optional): Accepts bool. It forces the PDF to be extracted using stream-mode extraction, which suits tables whose cells are not separated by ruling lines.

4. multiple_tables (optional): Accepts bool. This parameter is very useful: it allows handling multiple tables within a page. By default, it is set to True.
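The parameters above can be put together as in the sketch below. It assumes tabula-py is installed together with a Java runtime, which tabula-java needs; the helper names read_options, extract_tables and pdf_to_csv are made up for illustration, while tabula.read_pdf and tabula.convert_into are the library calls.

```python
def read_options(pages="all", guess=True, stream=False, multiple_tables=True) -> dict:
    """Collect the keyword arguments discussed above in one place."""
    return {
        "pages": pages,                      # str, int, or list of int
        "guess": guess,                      # let tabula guess the table area
        "stream": stream,                    # force stream-mode extraction
        "multiple_tables": multiple_tables,  # handle several tables per page
    }


def extract_tables(pdf_path: str, **overrides):
    """Return the tables tabula finds in the PDF as pandas DataFrames."""
    # Imported here because tabula-py also requires Java at call time.
    import tabula

    return tabula.read_pdf(pdf_path, **read_options(**overrides))


def pdf_to_csv(pdf_path: str, csv_path: str, pages="all") -> None:
    """Convert the tables of a PDF file directly into a CSV file."""
    import tabula

    tabula.convert_into(pdf_path, csv_path, output_format="csv", pages=pages)
```

With a hypothetical file report.pdf, extract_tables("report.pdf", pages=[1, 2], stream=True) would read the first two pages in stream mode, and pdf_to_csv("report.pdf", "report.csv") would write the tables to a CSV file.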

3. Textract: The textract package is very useful for extracting text from documents. It supports a growing list of file types for text extraction.

For example, it supports the following document types:

.csv, .pdf, .json, .jpg, .txt, .wav, .docx and many more.

To install textract:

pip install textract

Getting/Extracting company name from a PDF file:

Finding and extracting company names from PDF files is challenging, but not impossible. I faced a lot of issues while extracting company names from PDF files.

Here is the solution: with the help of regex, we can easily extract the company name from the PDF file, provided the file contains an email address or a company URL.

For example, if a PDF file contains an email address such as gregory.novac@google.com or a company URL such as www.google.com, then we can easily get the company name (here, google) from the domain.

To solve the above problem, we use Python's textract library.

To install textract:

pip install textract
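The idea can be sketched as follows: extract the text with textract, find email addresses and www URLs with regex, and take the company name from the domain. This is only a sketch under assumptions, not a complete solution: the patterns and function names are illustrative, and the label just before the top-level domain is treated as the company name, which gives the wrong answer for domains such as example.co.uk.

```python
import re

# Capture the domain part of an email address, e.g. "google.com".
EMAIL_DOMAIN = re.compile(r"[\w.+-]+@([\w-]+(?:\.[\w-]+)+)")
# Capture the domain that follows "www." in a URL, with or without a scheme.
URL_DOMAIN = re.compile(r"(?:https?://)?www\.([\w-]+(?:\.[\w-]+)+)", re.IGNORECASE)


def company_names(text: str) -> list:
    """Return the distinct company names found in emails and URLs, in order."""
    names = []
    for pattern in (EMAIL_DOMAIN, URL_DOMAIN):
        for domain in pattern.findall(text):
            # Assumption: the label before the top-level domain is the company,
            # e.g. "google" in "mail.google.com".
            name = domain.lower().split(".")[-2]
            if name not in names:
                names.append(name)
    return names


def company_names_from_pdf(pdf_path: str) -> list:
    """Extract the text of a PDF with textract and search it for company names."""
    # Imported here so the regex helper works without textract installed.
    import textract

    raw = textract.process(pdf_path)  # textract returns the text as bytes
    return company_names(raw.decode("utf-8", errors="ignore"))
```

For text containing gregory.novac@google.com and www.google.com, company_names returns the single name google, because both the email and the URL point to the same domain.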

For more information about textract library, please visit: https://textract.readthedocs.io/en/stable/python_package.html

Thank you for reading.


Pythoholic: Python concepts and projects.