PDF to JSON libraries

I am looking for a library that converts PDF to JSON. Basically, in that JSON each paragraph heading is a key and its value is the content of the paragraph. Is there a Python library for that? I am already using pdfminer, but it just converts to plain text and cannot preserve the structure/organisation of the document. For now it is OK not to read images and tables, although a library that handles those would be great.

Topic nlp

Category Data Science


It cannot persist the structure/organisation of the document.

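To tackle this, you can look at pdftotext, a command-line tool from the Poppler utilities that is preinstalled on many Linux distributions. When run with the -layout option, it keeps the text positioned as closely as possible to the original page layout, which makes headings and paragraph breaks easier to detect afterwards.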
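As a rough sketch of how the layout-preserving output could be turned into the heading-to-paragraph JSON asked for in the question: the code below calls pdftotext with -layout and then groups blank-line-separated blocks under the nearest preceding heading. The function names and the heading heuristic (a short single line that does not end in punctuation) are assumptions made for illustration, not part of pdftotext, and the heuristic will likely need tuning for real documents.

```python
import json
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run `pdftotext -layout` and return its output (needs poppler-utils)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" sends the text to stdout
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def looks_like_heading(block):
    """Assumed heuristic: one short line that does not end in punctuation."""
    return (
        len(block.splitlines()) == 1
        and len(block) <= 60
        and not block.endswith((".", ",", ":", ";"))
    )


def text_to_sections(text):
    """Group blank-line-separated blocks into {heading: paragraph text}."""
    sections = {}
    current = "_preamble"  # collects text that appears before any heading
    # pdftotext separates pages with form feeds; treat them as block breaks.
    for block in re.split(r"\n\s*\n", text.replace("\f", "\n\n")):
        block = block.strip()
        if not block:
            continue
        if looks_like_heading(block):
            current = " ".join(block.split())
            sections.setdefault(current, "")
        else:
            body = " ".join(block.split())  # collapse layout padding
            sections[current] = (sections.get(current, "") + " " + body).strip()
    return sections


def pdf_to_json(pdf_path):
    """Convert a PDF to a JSON string keyed by detected headings."""
    return json.dumps(text_to_sections(pdf_to_layout_text(pdf_path)), indent=2)
```

Calling pdf_to_json("report.pdf") (a placeholder file name) would then give one JSON object per document. Splitting the work this way keeps the text-grouping step independent of pdftotext, so the same function could be fed pdfminer output as well.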

For now it is ok to not read images and table although if there is a library to do that would be great.

Table extraction is a different (and more complicated) problem. However, if the tables are not too complex, you can use Camelot or Tabula to extract them. I would suggest Camelot, as it provides many tunable parameters that you can adjust for your specific tables. More details are in the Camelot documentation.
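A minimal sketch of the Camelot route, assuming the camelot-py package is installed and that the first row of each table is its header (that will not hold for every PDF). The helper names rows_to_records and pdf_tables_to_json are made up for this example; camelot.read_pdf, its pages and flavor arguments, and the .df attribute of each table come from Camelot itself.

```python
import json


def rows_to_records(rows):
    """Treat the first row as the header and return one dict per data row."""
    if not rows:
        return []
    header = [str(cell).strip() for cell in rows[0]]
    return [
        dict(zip(header, (str(cell).strip() for cell in row)))
        for row in rows[1:]
    ]


def pdf_tables_to_json(pdf_path, pages="1", flavor="lattice"):
    """Extract tables with Camelot and return them as a JSON string."""
    import camelot  # imported lazily so rows_to_records works without it

    # "lattice" suits tables with ruled lines; "stream" suits tables that
    # are separated only by whitespace.
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    extracted = []
    for i in range(len(tables)):
        # Each table exposes its cells as a pandas DataFrame via .df
        extracted.append(rows_to_records(tables[i].df.values.tolist()))
    return json.dumps(extracted, indent=2)
```

If a table comes back empty or badly split, switching flavor between "lattice" and "stream" is usually the first parameter worth trying.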
