Is there an algorithm or NN to match two documents, basically not closely similar?

Question


Yuriy P

April 19, 2022, 06:07

Is there an algorithm or NN to match two documents? One is a claim description (e.g. a CV or a product offer) and the other is a requirements description (e.g. a vacancy description or an RFP). They are not similar to each other, so this is basically not document similarity per se.

Which embedding is better to use on the document corpus (Doc2vec, Word2vec, or just TF-IDF, etc.), and what kind of NN architecture on top of it would work to produce a matching-score vector/matrix showing how well the input claim docs match the requirement docs? Or does some existing text analytics algorithm already do this?

Thanks in advance for help.

Topic deep-learning text-mining neural-network similarity machine-learning

Category Data Science

Brian Spiering · Accepted Answer · October 29, 2021, 14:05

One way to interpret your question is as matching two documents that have similar semantic content but might not use the exact same words.

Word Mover’s Distance (WMD) could be useful. WMD is an algorithm for computing the distance between pairs of text documents. It is based on word embeddings (e.g., word2vec), which encode the semantic meaning of words as dense vectors.

WMD measures the dissimilarity between two text documents as the minimum cumulative distance that the embedded words of one document need to "travel" to reach the embedded words of the other document.
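To make the "travel" idea concrete, here is a minimal sketch in plain Python. It does not compute exact WMD (that requires solving an optimal transport problem). Instead it computes the relaxed WMD lower bound described in the same paper, where each word simply moves all of its mass to the nearest word in the other document. The 2-D vectors in `EMBEDDINGS` and the example documents are made up for illustration; in practice you would plug in pretrained embeddings such as word2vec.

```python
import math
from collections import Counter

# Hypothetical 2-D word vectors, invented for this example only.
# Real use would load pretrained embeddings (e.g. word2vec).
EMBEDDINGS = {
    "python": (1.0, 0.0),
    "developer": (0.9, 0.3),
    "programmer": (0.9, 0.2),
    "engineer": (0.8, 0.4),
    "chef": (0.0, 1.0),
    "cook": (0.1, 0.9),
    "kitchen": (0.2, 1.0),
}


def nbow(tokens):
    """Normalized bag-of-words: word -> share of the document's mass."""
    counts = Counter(t for t in tokens if t in EMBEDDINGS)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def one_sided_cost(src, dst):
    """Move each source word's full mass to its nearest destination word."""
    return sum(
        mass * min(math.dist(EMBEDDINGS[w], EMBEDDINGS[v]) for v in dst)
        for w, mass in src.items()
    )


def relaxed_wmd(tokens_a, tokens_b):
    """Relaxed WMD: a lower bound on the exact Word Mover's Distance."""
    a, b = nbow(tokens_a), nbow(tokens_b)
    if not a or not b:
        raise ValueError("each document needs at least one known word")
    # Take the tighter (larger) of the two one-directional relaxations.
    return max(one_sided_cost(a, b), one_sided_cost(b, a))


cv = "python developer".split()
vacancy = "python programmer engineer".split()
unrelated = "chef cook kitchen".split()

score_match = relaxed_wmd(cv, vacancy)
score_mismatch = relaxed_wmd(cv, unrelated)
print("cv vs vacancy:  ", score_match)
print("cv vs unrelated:", score_mismatch)
```

With these toy vectors, the CV ends up closer to the vacancy than to the unrelated text even though they share only one word. For exact WMD on real embeddings, gensim exposes `KeyedVectors.wmdistance`.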

For example: [figure not preserved; source: "From Word Embeddings To Document Distances" paper]
