Classification of scanned documents in PDF files using deep learning or NLP

Question

Sherlock

February 3, 2022, 14:03

I know how to classify images with a CNN, but my problem is different: a single PDF file contains several types of scanned documents on different pages, and some document types span multiple pages.

I need to identify which documents are present in the PDF and the page numbers on which they appear. If a scanned document spans multiple pages, I should return the page range, e.g. 1 - 10.

Input: PDF files containing the scanned target documents

Output: the classified document name and its page numbers

Can anyone guide me on how to build a model that addresses this problem?

Thank you

Topic similar-documents image-classification deep-learning nlp python

Category Data Science

Peter · Accepted Answer · August 30, 2021, 11:31

Since this is an unsupervised problem, you will need to extract "topics" from the text using topic modeling. There are a number of tools available in Python, e.g. in scikit-learn or spaCy.
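
To make the topic-modeling idea concrete, here is a minimal sketch using scikit-learn's LatentDirichletAllocation. The page texts are made-up placeholders standing in for OCR output, and the number of topics (2) is an assumption you would have to tune for real data.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical OCR output: one string per PDF page (placeholder data).
page_texts = [
    "invoice number total amount due payment terms",
    "invoice total amount tax payment due date",
    "passport nationality date of birth place of issue",
    "passport holder nationality surname given names",
]

# Bag-of-words counts; lowercasing and English stop-word removal are built in.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
counts = vectorizer.fit_transform(page_texts)

# n_components is the number of topics; it has to be chosen or tuned by hand.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
page_topic_weights = lda.fit_transform(counts)  # shape: (n_pages, n_topics)

# Assign each page to its highest-weight topic.
page_topics = page_topic_weights.argmax(axis=1).tolist()

for page_number, topic in enumerate(page_topics, start=1):
    print(f"page {page_number}: topic {topic}")
```

Note that the topic ids are arbitrary integers, not document names. You would still have to inspect the top words of each topic and map it to a document type yourself.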

Basic workflow:

Extract text from the PDF (scanned pages usually need OCR first)
Text preprocessing (lowercasing, stemming, etc.)
Topic modeling
Return the "topic" for each page
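
The last step, turning one label per page into the page ranges asked for in the question, could be sketched as follows. The helper name pages_to_ranges and the example labels are hypothetical. The sketch assumes that the pages of one document are consecutive.

```python
from itertools import groupby


def pages_to_ranges(page_labels):
    """Collapse per-page labels into (label, first_page, last_page) runs.

    page_labels[i] is the label of page i + 1 (pages are 1-indexed).
    """
    ranges = []
    page = 1
    for label, run in groupby(page_labels):
        length = sum(1 for _ in run)  # number of consecutive pages with this label
        ranges.append((label, page, page + length - 1))
        page += length
    return ranges


# Hypothetical per-page labels for a six-page PDF.
labels = ["invoice", "invoice", "invoice", "passport", "contract", "contract"]

for label, first, last in pages_to_ranges(labels):
    pages = str(first) if first == last else f"{first} - {last}"
    print(f"{label}: {pages}")
# prints:
# invoice: 1 - 3
# passport: 4
# contract: 5 - 6
```

If the same label shows up again later in the file, it is reported as a separate range, which is probably what you want when a PDF contains two copies of the same document type.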
