Extracting data from unstructured PDFs

I have about 200,000 PDFs that come in 20 different designs. That is, within an organization, 20 departments each issue monthly award submission requirements, and each department has its own document format. The organization collects all of these documents.

I need to extract the paragraphs, bullet points, or sentences from each of these PDFs, organize them properly, label each one as a requirement or not, and save the result to storage. The process needs to be repeatable and automated for any future PDF.

Many of the PDFs are unstructured: they have no tags, no bookmarks, and no table of contents.

What is the best technique or method for handling this type of problem?

Topic ocr text-mining nlp

Category Data Science


For each of the 20 designs, you may have to design a custom, annotation-based section extractor using vision AI algorithms. Then run OCR with Tesseract or another OCR library on the extracted sections.
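
The two-step idea (extract sections per design, then OCR each section) could be sketched roughly as follows. This is only an illustration under stated assumptions: the design names, region names, and pixel boxes in `TEMPLATES` are hypothetical, and fixed bounding boxes stand in for whatever a trained vision model would actually predict. The cropping and OCR steps are passed in as functions so that any OCR library can be plugged in.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# (left, upper, right, lower) in pixels
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Region:
    name: str
    box: Box


# Hypothetical templates: one list of regions per document design.
# In practice these boxes would come from annotation or a vision model.
TEMPLATES: Dict[str, List[Region]] = {
    "dept_a": [
        Region("title", (0, 0, 1000, 120)),
        Region("requirements", (0, 120, 1000, 900)),
    ],
    "dept_b": [
        Region("title", (50, 40, 950, 200)),
        Region("requirements", (50, 200, 950, 1100)),
    ],
}


def extract_sections(
    page: Any,
    design: str,
    crop: Callable[[Any, Box], Any],
    ocr: Callable[[Any], str],
) -> Dict[str, str]:
    """Crop each region defined for `design` and run OCR on the crop."""
    if design not in TEMPLATES:
        # A new design has no template yet and needs one before it can be processed.
        raise KeyError(f"no template for design {design!r}")
    return {
        region.name: ocr(crop(page, region.box)).strip()
        for region in TEMPLATES[design]
    }
```

Assuming Pillow and pytesseract are installed and `page` is a page image, `crop` could be `lambda img, box: img.crop(box)` and `ocr` could be `pytesseract.image_to_string`. Keeping the templates in a plain dictionary makes the per-design cost explicit: every new design means one more entry.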

I am not sure this can be generalized to arbitrary document designs going forward. Each new document design would first need its own custom vision solution.
