extracting data from unstructured pdfs

Question

capiono

March 1, 2022 12:01

I have about 200,000 PDFs in 20 different designs. For example, in an organization, 20 different departments each issue monthly award submission requirements, and each department has its own document format. The organization collects these documents.

Now I need to extract the paragraphs, bullet points, or sentences from each of these PDFs, organize them properly, label each one as a requirement or not, and save the results to storage. This process needs to be repeatable and automated for any future PDF.

Many of the PDFs are unstructured: they have no tags, no bookmarks, and no table of contents.

What is the best technique or method for handling this type of problem?

Topic ocr text-mining nlp

Category Data Science

Vivek Singhal · Accepted Answer · January 29, 2022 14:20

For each of the 20 designs, you may have to design a custom section extractor, built from annotated examples, using vision AI algorithms. Then run OCR on the extracted sections using Tesseract or another OCR library.

I am not sure whether this can be generalized to arbitrary document designs going forward. Each new document design needs a custom vision solution first.
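A minimal sketch of the per-design approach described above, under several assumptions: the design names (`dept_a`, `dept_b`), the bounding boxes, and the `extract_sections` helper are all hypothetical, and the fixed boxes stand in for whatever a trained vision model or manual annotation would produce. The cropping and OCR steps are passed in as functions so the sketch does not depend on any particular library.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# (left, top, right, bottom) in pixels of the rendered page image
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Section:
    name: str
    box: Box


# Hypothetical layouts, one entry per document design. Real boxes would come
# from annotated samples or a layout-detection model, not hard-coded values.
LAYOUTS: Dict[str, List[Section]] = {
    "dept_a": [
        Section("header", (0, 0, 1000, 120)),
        Section("requirements", (0, 120, 1000, 900)),
    ],
    "dept_b": [
        Section("requirements", (50, 200, 950, 1200)),
    ],
}


def extract_sections(
    page: Any,
    design: str,
    crop: Callable[[Any, Box], Any],
    ocr: Callable[[Any], str],
) -> Dict[str, str]:
    """Crop each known section of the page for this design and OCR it."""
    try:
        sections = LAYOUTS[design]
    except KeyError:
        # An unseen design has no extractor yet: it must be annotated first.
        raise ValueError(f"no section extractor for design {design!r}") from None
    return {s.name: ocr(crop(page, s.box)).strip() for s in sections}
```

With Pillow and pytesseract installed, one plausible wiring is `crop=lambda img, box: img.crop(box)` and `ocr=pytesseract.image_to_string`, applied to each page after it has been rendered to an image. The `ValueError` branch reflects the limitation noted above: a design that is not in `LAYOUTS` cannot be processed until someone defines its sections.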
