Document Content

I have a set of .pdf/.docx documents with content. I need to find the document that best matches a given sentence. For instance:

 Sentence: Security in the work environment

The system should return the most appropriate document, i.e. one that contains at least the content expressed in the sentence. It should be a sort of search bar with advanced capabilities. I have one constraint: I cannot rely on an a priori classification, since the number of documents and their categories can vary over time.

How should I address this kind of task?

Topic semantic-similarity nlp

Category Data Science


If you are asking how to integrate this, I would leverage existing search technologies, such as storing the documents in a MongoDB database or using Solr indices, just to name a few.

If you are asking about the implementation details, take a look at topic modeling, tf-idf, cosine similarity, synonym replacement, and k-nearest neighbors to get you started. Many of these techniques can be applied either at index time, on incoming documents, or at query time, to narrow the search space with minimal work even if your documents and queries change dynamically. You'll probably want to set aside a test set of sample queries, each with the ranked documents you expect to be returned, so you can benchmark your improvements.
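To make the tf-idf and cosine similarity suggestion concrete, here is a minimal sketch in plain Python. It assumes the text has already been extracted from the .pdf/.docx files into strings; the file names, the sample texts, and the helper names (`build_index`, `search`, `reciprocal_rank`) are made up for illustration. A real system would more likely use a search engine or an established library, and would add stemming, stop-word removal, or synonym handling.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vector(tokens, idf):
    """Weight raw term counts by IDF; terms unseen at index time are dropped."""
    counts = Counter(tokens)
    return {term: n * idf[term] for term, n in counts.items() if term in idf}


def build_index(docs):
    """Index time: build the IDF table and one sparse TF-IDF vector per document."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    # Smoothed IDF, so terms that occur in every document keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = {name: tfidf_vector(toks, idf) for name, toks in tokenized.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, vectors, idf, top_k=3):
    """Query time: rank documents by cosine similarity to the query vector."""
    q = tfidf_vector(tokenize(query), idf)
    scored = [(cosine(q, vec), name) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:top_k]]


def reciprocal_rank(results, expected):
    """1/position of the expected document in the results, or 0.0 if absent."""
    for position, (name, _score) in enumerate(results, start=1):
        if name == expected:
            return 1.0 / position
    return 0.0


# Hypothetical corpus: text assumed to be already extracted from the files.
docs = {
    "safety_policy.docx": "Workplace security rules and safety in the work environment.",
    "holiday_calendar.pdf": "Public holidays and office closing dates for the year.",
    "it_guidelines.pdf": "Password rules and network security for office computers.",
}

vectors, idf = build_index(docs)
results = search("Security in the work environment", vectors, idf)
for name, score in results:
    print(f"{score:.3f}  {name}")
```

Because the index is rebuilt from whatever documents are present, no predefined categories are needed. Averaging `reciprocal_rank` over a set of sample queries with known best documents gives one simple number to track while you experiment with the other techniques.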
