Extracting and Mining PDF Data

Question

Keetj

January 9, 2020 00:03

I have a PDF file (an admission application). I want to read and search the PDF, extract terms with similar meanings, and then convert this data into a DataFrame to save as an xlsm file. HELP!

Topic etl

Category Data Science

Carlos Mougan · Accepted Answer · January 9, 2020 00:03

In my opinion, you have four possibilities:

You can process the PDF directly with tabula.
You can convert the PDF to text with pdftotext, then parse the text with Python.
You can use an external tool to convert the PDF to Excel or CSV, then open the resulting file with the appropriate Python module.
You can also convert the PDF to an image file, then use recent OCR software that reconstructs tables automatically from the image to get the data.
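
As a rough illustration of the pdftotext route, here is a minimal sketch. It assumes the application form comes out as "Label: value" lines and that the pdftotext command-line tool is installed. The SYNONYMS map, the field names and the file names are all hypothetical, so adapt them to your own form.

```python
import re
import subprocess

# Hypothetical synonym map: canonical column name -> labels treated as equivalent.
SYNONYMS = {
    "name": ["name", "full name", "applicant name"],
    "birth_date": ["date of birth", "dob", "birth date"],
    "email": ["email", "e-mail", "email address"],
}


def pdf_to_text(pdf_path):
    """Run the pdftotext CLI and return the extracted text ("-" sends it to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def extract_fields(text, synonyms=SYNONYMS):
    """Map 'Label: value' lines to canonical field names using the synonym map."""
    lookup = {
        label.lower(): canon
        for canon, labels in synonyms.items()
        for label in labels
    }
    record = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+?)\s*:\s*(.+?)\s*$", line)
        if match:
            canon = lookup.get(match.group(1).lower())
            # Keep the first occurrence of each field; ignore unknown labels.
            if canon and canon not in record:
                record[canon] = match.group(2)
    return record


# Usage sketch (needs pandas and openpyxl; "application.pdf" is a placeholder):
#   import pandas as pd
#   record = extract_fields(pdf_to_text("application.pdf"))
#   pd.DataFrame([record]).to_excel("application.xlsx", index=False)
```

The usage lines write a plain .xlsx file. Producing a macro-enabled .xlsm file is not covered by this sketch and may need extra steps. A fixed synonym list only catches labels you anticipate, so terms that are similar in meaning but worded unexpectedly would need a fuzzier matching approach.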

This answer comes from:

https://stackoverflow.com/questions/47533875/how-to-extract-table-as-text-from-the-pdf-using-python/53050405

Your question is very similar to:

Regards
