Classification of scanned documents in PDF files using deep learning or NLP

I know how to classify images with a CNN, but my problem is different: I have PDF files in which several types of scanned documents appear on different pages. Some document types span multiple pages within the PDF.

I need to classify the contents and return which documents are present and the page numbers on which they appear. If a scanned document spans multiple pages, I should return the range of page numbers, e.g. 1-10.

Input: PDF files containing the scanned target documents.

Output: the classified document name and its page numbers.

Can anyone guide me on how to build a model that addresses this problem?

Thank you.

Topic similar-documents image-classification deep-learning nlp python

Category Data Science


Assuming you have no labeled example pages, this is an unsupervised problem, so you can try to extract "topics" from each page using topic modeling. There are a number of tools available in Python, e.g. from scikit-learn (sklearn) or spaCy.
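As a rough illustration of the topic-modeling idea, here is a minimal sketch using scikit-learn. It assumes the text of each page has already been extracted (for scanned pages, by an OCR step not shown here). The function name `topic_per_page`, its parameters, and the sample page texts are hypothetical, and NMF on TF-IDF features is just one possible choice of topic model, not the only one.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer


def topic_per_page(page_texts, n_topics=2, random_state=0):
    """Return the index of the dominant topic for each page (hypothetical helper)."""
    # Lowercase the text and drop common English stop words while vectorizing.
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    tfidf = vectorizer.fit_transform(page_texts)

    # Factorize the page-term matrix; each row of `weights` holds one page's
    # affinity to each of the n_topics topics.
    model = NMF(n_components=n_topics, init="nndsvda", random_state=random_state)
    weights = model.fit_transform(tfidf)

    # The dominant topic of a page is the one with the largest weight.
    return weights.argmax(axis=1).tolist()


# Made-up OCR output for a four-page PDF (assumption, for illustration only).
pages = [
    "invoice number total amount due payment",
    "invoice amount tax total payment terms",
    "passport nationality date of birth surname",
    "passport surname given names nationality",
]
print(topic_per_page(pages, n_topics=2))
```

The topic indices are arbitrary cluster IDs, so they still have to be mapped to document names by inspecting a few pages per topic. The number of topics is a guess that needs tuning on real data.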

Basic workflow:

  • Extract text from each PDF page (scanned pages need OCR first)
  • Text preprocessing (lowercasing, stemming, etc.)
  • Topic modeling
  • Return the "topic" per page
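Once every page has a label, the page ranges asked for in the question can be obtained by merging consecutive pages that share the same label. A minimal sketch with only the standard library follows; the function name `page_ranges` and the example labels are assumptions made for illustration.

```python
from itertools import groupby


def page_ranges(labels):
    """Collapse per-page labels into (label, first_page, last_page) runs.

    `labels` holds one label per page in page order; page numbers are 1-indexed.
    """
    ranges = []
    page = 1
    for label, run in groupby(labels):
        length = sum(1 for _ in run)  # number of consecutive pages with this label
        ranges.append((label, page, page + length - 1))
        page += length
    return ranges


# Hypothetical per-page labels for a five-page PDF.
labels = ["invoice", "invoice", "invoice", "passport", "invoice"]
for name, first, last in page_ranges(labels):
    print(f"{name}: {first}-{last}" if first != last else f"{name}: {first}")
```

Note that this only detects a boundary when the label changes, so two separate documents of the same type on adjacent pages would be merged into one range.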
