Extracting and Mining PDF Data

I have a PDF file (an admission application). I want to read and search the PDF, extract terms with similar meanings, and then convert this data into a DataFrame to save as an xlsm file. HELP!


In my opinion, you have four options:

  • You can read tables directly from the PDF with tabula (or its Python wrapper, tabula-py).

  • You can convert the PDF to text with pdftotext, then parse the text with Python.

  • You can use an external tool to convert the PDF to Excel or CSV, then open the resulting file with the appropriate Python module.

  • You can convert the PDF to an image, then run OCR software on it (some recent tools reconstruct tables automatically from the picture) to get the data.
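As a rough illustration of the second option, here is a minimal sketch, not a definitive implementation. It assumes the pdftotext command-line tool (from poppler-utils) is installed, that the form text comes out as "Label: value" lines, and that pandas with openpyxl is available for the export. The FIELD_SYNONYMS map and the function names are hypothetical; you would fill the map with the label variants that actually occur in your application form. It writes a plain .xlsx file, since a macro-enabled .xlsm is only needed when the workbook carries VBA macros.

```python
import re
import shutil
import subprocess

# Hypothetical synonym map: canonical column name -> label variants
# (lowercase) that should be treated as the same field.
FIELD_SYNONYMS = {
    "name": ["name", "full name", "applicant name"],
    "date_of_birth": ["date of birth", "dob", "birth date"],
    "email": ["email", "e-mail", "email address"],
}


def pdf_to_text(pdf_path):
    """Run the pdftotext CLI and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    # "-layout" preserves the physical layout; "-" writes to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def extract_fields(text, synonyms=FIELD_SYNONYMS):
    """Map 'Label: value' lines onto canonical field names."""
    # Invert the synonym map so each label variant points to its field.
    lookup = {
        variant: field
        for field, variants in synonyms.items()
        for variant in variants
    }
    record = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+?)\s*:\s*(.+?)\s*$", line)
        if not match:
            continue  # not a "Label: value" line
        # Normalize case and inner whitespace before the lookup.
        label = " ".join(match.group(1).lower().split())
        field = lookup.get(label)
        if field is not None and field not in record:
            record[field] = match.group(2)  # keep the first occurrence
    return record


def save_records(records, out_path):
    """Write a list of record dicts to an Excel file via pandas."""
    import pandas as pd  # third-party; needs openpyxl for .xlsx output

    pd.DataFrame(records).to_excel(out_path, index=False)
```

With a file named application.pdf (a placeholder), the pieces would be chained as save_records([extract_fields(pdf_to_text("application.pdf"))], "application.xlsx"). Exact label matching is the simplest way to group terms with similar meaning; fuzzy matching could replace the dictionary lookup if the labels vary too much to enumerate.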

Regards
