pdf to json libraries

Question

Hitesh Somani

May 21, 2022, 15:06

I am looking for a library that converts PDF to JSON. Basically, in that JSON the paragraph heading is the key and the value is the content of the paragraph. Is there a Python library for that? I am already using pdfminer, but that just converts to plain text. It cannot persist the structure/organisation of the document. For now it is OK not to read images and tables, although a library that can do that would be great.

Topic nlp

Category Data Science

Gyan Ranjan · Accepted Answer · December 16, 2020, 08:17

It cannot persist the structure/organisation of the document.

To tackle this, you can look at pdftotext (usually available by default on Linux systems). When run with the -layout option, it keeps the text arranged as closely as possible to the original layout of the PDF page.
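As a rough sketch of how this could feed the heading-to-paragraph JSON asked about in the question: the snippet below calls pdftotext with -layout through subprocess and then groups lines under headings. pdftotext itself only produces text, so the grouping step is an assumption of this example, not something the tool does. The helper names (pdf_to_layout_text, looks_like_heading, text_to_sections, sections_to_json) and the heading heuristic are hypothetical and will likely need tuning for real documents.

```python
import json
import subprocess


def pdf_to_layout_text(pdf_path: str) -> str:
    """Run `pdftotext -layout` and return the extracted text.

    Assumes the pdftotext binary is on PATH; "-" sends the output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def looks_like_heading(line: str) -> bool:
    """Hypothetical heuristic: a short title-case or upper-case line
    that does not end like a sentence."""
    text = line.strip()
    if not text or len(text) > 80 or text.endswith((".", ",")):
        return False
    return text.isupper() or text.istitle()


def text_to_sections(text: str) -> dict:
    """Group body lines under the most recent heading.

    Lines that appear before the first heading are skipped in this sketch.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if looks_like_heading(stripped):
            current = stripped
            sections.setdefault(current, "")
        elif current is not None:
            sections[current] = (sections[current] + " " + stripped).strip()
    return sections


def sections_to_json(sections: dict) -> str:
    """Serialize the heading -> paragraph mapping as JSON."""
    return json.dumps(sections, indent=2, ensure_ascii=False)
```

Usage would look like sections_to_json(text_to_sections(pdf_to_layout_text("paper.pdf"))), where "paper.pdf" is a placeholder path. Multi-column pages or headings that are not in title case would need a different heuristic.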

For now it is OK not to read images and tables, although a library that can do that would be great.

Table extraction is a different (and more complicated) problem. However, if the tables are not too complex, you can use Camelot or Tabula to extract them. I would suggest going with Camelot, as it also provides a lot of tunable parameters that you can adjust for your specific tables. (More details: Camelot)
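For illustration, a minimal sketch of pulling tables out with Camelot and turning them into JSON might look like the following. It assumes Camelot is installed and that the first row of each extracted table is its header, which is not always the case. The helper names (rows_to_records, pdf_tables_to_json) are made up for this example; camelot.read_pdf and the table's df attribute are the library's own.

```python
import json


def rows_to_records(rows):
    """Turn a list of rows into a list of dicts.

    Assumption: the first row holds the column names.
    """
    if not rows:
        return []
    header = [str(cell).strip() for cell in rows[0]]
    return [dict(zip(header, row)) for row in rows[1:]]


def pdf_tables_to_json(pdf_path, pages="1", flavor="lattice"):
    """Extract tables with Camelot and return them as a JSON string.

    flavor="lattice" suits tables with ruled lines; "stream" is the
    alternative for tables separated by whitespace.
    """
    import camelot  # third-party dependency, imported only when needed

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    result = []
    for table in tables:
        # table.df is a pandas DataFrame holding the cell text
        rows = table.df.values.tolist()
        result.append(rows_to_records(rows))
    return json.dumps(result, indent=2, ensure_ascii=False)
```

Calling pdf_tables_to_json("report.pdf", pages="1") with a placeholder file name would return one list of records per detected table. If the output looks wrong, trying the other flavor is usually the first parameter to change.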
