Data Extraction from PDF Files Using Python
Data Extraction from PDF:
Extracting text from a PDF file means pulling out the data you actually need from it. Many Python libraries are available for text extraction from PDF files:
- PyPDF
- invoice2data
- textract
- PDFMiner
- pdfplumber
- tabula (for extracting tabular data)
Let’s start with PyPDF:
1. PyPDF: PyPDF is capable of:
- Splitting documents page by page.
- Merging documents page by page.
- Cropping pages.
- Extracting document information.
- Encrypting and decrypting PDF files.
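To install PyPDF, type the command below in the terminal:
pip install PyPDF2
PyPDF2 works with Python 3. PyPDF3 is a separate fork of the same code base (pip install PyPDF3), and the project is now maintained under the name pypdf.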
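To make the capabilities listed above more concrete, here is a minimal sketch of two of them: extracting text page by page and splitting a document into one file per page. It assumes PyPDF2 2.x or later (the PdfReader / PdfWriter names); the helper names page_filename, extract_page_texts and split_pages are made up for this example and are not part of the library.

```python
from pathlib import Path


def page_filename(stem, page_number):
    """Build the output name for one split page (hypothetical naming scheme)."""
    return f"{stem}_page_{page_number}.pdf"


def extract_page_texts(pdf_path):
    """Return a list holding the extracted text of each page."""
    # Imported lazily so this file still loads where PyPDF2 is not installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can come back empty for scanned (image-only) pages.
    return [page.extract_text() or "" for page in reader.pages]


def split_pages(pdf_path, out_dir):
    """Write every page of pdf_path to its own PDF and return the new paths."""
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for number, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)  # one page per output document
        target = out_dir / page_filename(Path(pdf_path).stem, number)
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

Calling split_pages("report.pdf", "pages") would, under these assumptions, create pages/report_page_1.pdf, pages/report_page_2.pdf and so on, where report.pdf is a placeholder file name.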
2. Tabula: tabula-py is a simple Python wrapper of tabula-java that reads tables from a PDF. It lets you scrape tables from PDF files and also convert a PDF file directly into a CSV file.
To install the tabula library:
pip install tabula-py
Parameters of tabula.read_pdf():
1. pages (optional): str, int, or list of int. Specifies the pages to extract from.
2. guess (optional): bool. Guess the portion of each page to analyze.
3. stream (optional): bool. Force the PDF to be extracted using stream-mode extraction, which relies on the whitespace between columns rather than on ruling lines.
4. multiple_tables (optional): bool. This parameter is very useful: it allows handling multiple tables within a page. By default, it is set to True.
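The parameters above could be passed to tabula roughly as follows. This is only a sketch: the option values in EXAMPLE_OPTIONS are illustrative choices, the helper names merge_options, read_tables and pdf_to_csv are invented for this example, and tabula-py needs a Java runtime installed because it wraps tabula-java.

```python
# Illustrative keyword arguments for tabula.read_pdf (values are assumptions).
EXAMPLE_OPTIONS = {
    "pages": "all",           # str, int, or list of int
    "guess": True,            # let tabula guess the table area on each page
    "stream": False,          # set True for tables without ruling lines
    "multiple_tables": True,  # return every table found, not just one
}


def merge_options(**overrides):
    """Return EXAMPLE_OPTIONS with any caller-supplied values applied on top."""
    return {**EXAMPLE_OPTIONS, **overrides}


def read_tables(pdf_path, **overrides):
    """Read the tables of a PDF; with multiple_tables=True this is a list of DataFrames."""
    # Imported lazily so this file still loads where tabula-py is not installed.
    import tabula

    return tabula.read_pdf(pdf_path, **merge_options(**overrides))


def pdf_to_csv(pdf_path, csv_path, pages="all"):
    """Convert the tables of a PDF straight into a CSV file."""
    import tabula

    tabula.convert_into(pdf_path, csv_path, output_format="csv", pages=pages)
```

For a PDF whose tables have no visible cell borders, something like read_tables("report.pdf", stream=True, pages=[1, 2]) may give better results; report.pdf is a placeholder name.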
3. Textract: The textract package is very useful for extracting text from documents. It supports a growing list of file types for text extraction.
Example: it supports the following document types:
.csv, .pdf, .json, .jpg, .txt, .wav, .docx and many more.
To install textract:
pip install textract
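Once textract is installed, extracting text comes down to a single call to textract.process, which returns bytes. The sketch below wraps that call and decodes the result; the helper names decode_output and extract_text are made up for this example, and UTF-8 is assumed as the output encoding.

```python
def decode_output(raw, encoding="utf-8"):
    """Turn the bytes returned by textract into a str (UTF-8 assumed)."""
    # errors="replace" keeps going if a stray byte cannot be decoded.
    return raw.decode(encoding, errors="replace")


def extract_text(path):
    """Extract the text of a supported document (.pdf, .docx, .txt, ...)."""
    # Imported lazily so this file still loads where textract is not installed.
    import textract

    return decode_output(textract.process(path))
```

A call such as extract_text("invoice.pdf") would then return the document text as one string, with invoice.pdf standing in for your own file.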
Getting/Extracting company name from a PDF file:
Finding and extracting company names from PDF files is challenging, but not impossible. I faced a lot of issues while extracting company names from PDF files.
Here is the solution: with the help of regex we can extract the company name from the PDF file, provided the file contains an email address or a company URL.
For example, if a PDF file contains an email address or company URL such as gregory.novac@google.com or www.google.com, the company name (here, google) can be read from the domain.
To solve this problem, we use Python's textract library to get the text out of the PDF.
To install textract:
pip install textract
For more information about textract library, please visit: https://textract.readthedocs.io/en/stable/python_package.html
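The regex idea described above can be sketched like this, assuming the PDF text has already been extracted into a string (for example with textract). The patterns and the function name find_company_names are my own illustration, not a fixed recipe: they take the first label of the domain after the @ sign or after www. and treat it as the company name.

```python
import re

# Email address: capture the first domain label after the @ sign.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+)\.[A-Za-z]{2,}")
# URL starting with www. (scheme optional): capture the label after www.
URL_RE = re.compile(r"(?:https?://)?www\.([A-Za-z0-9-]+)\.[A-Za-z]{2,}")


def find_company_names(text):
    """Return candidate company names found in emails and URLs, without repeats."""
    names = []
    for pattern in (EMAIL_RE, URL_RE):
        for match in pattern.finditer(text):
            name = match.group(1).lower()
            if name not in names:
                names.append(name)
    return names


sample = "Contact gregory.novac@google.com or visit www.google.com"
print(find_company_names(sample))  # prints ['google']
```

This is only a heuristic. Personal addresses (gmail, yahoo and so on) will also show up as "companies", and for a domain with a subdomain in front, the subdomain is what gets captured, so the results usually need some filtering.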