Is there an algorithm or NN to match two documents that are not closely similar?

Is there an algorithm or NN to match two documents? One is a claim description (e.g. a CV or product offer) and the other is a requirements description (e.g. a vacancy description or RFP). They are not similar, so this is not really a document similarity problem per se.

Which embedding is better to use on the document corpus (Doc2vec, Word2vec, or just TF-IDF, etc.), and what kind of NN architecture on top of it would produce a matching score vector/matrix showing how well the input claim docs match the requirement docs? Or does some text analytics algorithm for this already exist?

Thanks in advance for help.

Topic deep-learning text-mining neural-network similarity machine-learning

Category Data Science


One way to interpret your question is as matching two documents that have similar semantic content but might not use the exact same words.

Word Mover’s Distance (WMD) could be useful. WMD is an algorithm for finding the distance between pairs of text documents. It is based on word embeddings (e.g., word2vec), which encode the semantic meaning of words as dense vectors.

WMD measures the dissimilarity between two text documents as the minimum cumulative distance that the embedded words of one document need to "travel" to reach the embedded words of the other document.

For example:

[Figure omitted] Source: "From Word Embeddings To Document Distances" (Kusner et al., 2015)
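To make the "travel" idea concrete, here is a minimal pure-Python sketch of WMD, not a reference implementation. It assumes every token carries equal weight (1/length of its document), uses Euclidean distance between word vectors, and solves the resulting transport problem with a simple min-cost flow. The 2-D `toy_vectors` are made up for illustration; in practice you would plug in real word2vec (or similar) embeddings and probably use an optimized library instead.

```python
import math

def word_movers_distance(doc_a, doc_b, vectors):
    """WMD between two token lists, assuming each token has equal mass."""
    if not doc_a or not doc_b:
        raise ValueError("both documents must contain at least one token")
    a = [vectors[w] for w in doc_a]
    b = [vectors[w] for w in doc_b]
    n, m = len(a), len(b)
    # Scale masses to integers: each doc_a token supplies m units,
    # each doc_b token demands n units, so n * m units move in total.
    source, sink = 0, n + m + 1
    graph = [[] for _ in range(n + m + 2)]  # edge: [to, capacity, cost, rev_index]

    def add_edge(u, v, cap, cost):
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])  # residual edge

    for i in range(n):
        add_edge(source, 1 + i, m, 0.0)
        for j in range(m):
            add_edge(1 + i, 1 + n + j, n * m, math.dist(a[i], b[j]))
    for j in range(m):
        add_edge(1 + n + j, sink, n, 0.0)

    total_cost, flow = 0.0, 0
    while flow < n * m:
        # Bellman-Ford: cheapest path from source to sink in the residual graph.
        dist = [math.inf] * len(graph)
        prev = [None] * len(graph)
        dist[source] = 0.0
        for _ in range(len(graph) - 1):
            changed = False
            for u in range(len(graph)):
                if dist[u] == math.inf:
                    continue
                for k, (v, cap, cost, _rev) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v] - 1e-12:
                        dist[v] = dist[u] + cost
                        prev[v] = (u, k)
                        changed = True
            if not changed:
                break
        # Find how much mass fits along that path, then push it.
        push, v = n * m - flow, sink
        while v != source:
            u, k = prev[v]
            push = min(push, graph[u][k][1])
            v = u
        v = sink
        while v != source:
            u, k = prev[v]
            graph[u][k][1] -= push
            graph[v][graph[u][k][3]][1] += push
            v = u
        flow += push
        total_cost += push * dist[sink]
    return total_cost / (n * m)  # undo the integer scaling

# Hypothetical 2-D "embeddings": related words are placed close together.
toy_vectors = {
    "obama": (0.0, 0.0), "president": (0.0, 1.0),
    "speaks": (5.0, 0.0), "greets": (5.0, 1.0),
    "media": (10.0, 0.0), "press": (10.0, 1.0),
    "illinois": (15.0, 0.0), "chicago": (15.0, 1.0),
    "band": (0.0, 20.0), "gave": (5.0, 20.0),
    "concert": (10.0, 20.0), "japan": (15.0, 20.0),
}
doc_a = ["obama", "speaks", "media", "illinois"]
doc_b = ["president", "greets", "press", "chicago"]
doc_c = ["band", "gave", "concert", "japan"]

print(word_movers_distance(doc_a, doc_b, toy_vectors))  # related, no shared words
print(word_movers_distance(doc_a, doc_c, toy_vectors))  # unrelated
```

With these toy vectors, `doc_a` and `doc_b` share no words yet end up much closer to each other than `doc_a` and `doc_c`, which is the property that makes WMD a candidate for matching a CV against a vacancy description. Note that this sketch scales poorly with document length; it is meant to show the mechanics only.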
