Deep learning techniques for concept similarity?

Given a corpus of product descriptions (say, vacuum cleaners), I'm looking for a way to group the documents that are all of the same type (where a type can be cordless vacuums, shampooer, carpet cleaner, industrial vacuum, etc.).

The approach I'm exploring is NER. I'm labeling a set of these documents with tags such as (KIND, BRAND, MODEL). The idea is that I'd then run new documents through the model and extract the tokens corresponding to those tags. I would then construct a feature vector for each document consisting of a boolean value for each tag. From there, a simple dot product would surface all documents related to some base document (i.e., these documents are all similar to this one).
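For concreteness, here is a minimal sketch of that pipeline. The NER step is stubbed out with hard-coded (tag, token) pairs, and the document names are made up:

```python
import numpy as np

# Hypothetical NER output: the set of (tag, token) pairs found in each document.
docs_tags = {
    "doc_a": {("KIND", "cordless vacuum"), ("BRAND", "Dyson")},
    "doc_b": {("KIND", "cordless vacuum"), ("BRAND", "Shark"), ("MODEL", "IZ300")},
    "doc_c": {("KIND", "carpet cleaner"), ("BRAND", "Bissell")},
}

# Build a shared vocabulary of (tag, token) pairs and a boolean vector per document.
vocab = sorted(set().union(*docs_tags.values()))
index = {pair: i for i, pair in enumerate(vocab)}

def to_vector(pairs):
    v = np.zeros(len(vocab))
    for pair in pairs:
        v[index[pair]] = 1.0
    return v

vectors = {doc: to_vector(pairs) for doc, pairs in docs_tags.items()}

# Dot product against a base document ranks the others by the number of shared tags.
base = vectors["doc_a"]
for doc, v in vectors.items():
    print(doc, float(base @ v))
```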

Question

What are other general approaches that might be a good fit for this task?

Tags: similar-documents, deep-learning, nlp

Category: Data Science


There are plenty of existing NLP models that already handle this task.

For instance, the bert-base-nli-mean-tokens model on Hugging Face.
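As a minimal sketch using the sentence-transformers library (the product descriptions below are invented):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bert-base-nli-mean-tokens")

descriptions = [
    "Lightweight cordless stick vacuum with removable battery",
    "Cordless vacuum cleaner, 40 min runtime, wall-mounted dock",
    "Upright carpet shampooer with heated cleaning solution",
]

# Encode each description into a dense embedding vector.
embeddings = model.encode(descriptions)

# Pairwise cosine similarity: higher scores mean more similar descriptions.
scores = util.cos_sim(embeddings, embeddings)
print(scores)
```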

This model is quite general, though; you will likely want one better adapted to your domain, or fine-tune your own model on top of an existing one like BERT.

Here is a list of models for sentence similarity on Hugging Face: https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads

RoBERTa, MiniLM, and MPNet are generally good options.

Some of them require more compute than lighter solutions, but with a decent GPU, inference is fast enough.
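To address the grouping itself, you can cluster the embeddings directly. A sketch assuming one of the MiniLM checkpoints (the model name and descriptions here are just examples, and the 0.6 distance threshold is an arbitrary starting point you would tune):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

descriptions = [
    "Cordless stick vacuum with 45 minute battery life",
    "Handheld cordless vacuum for car interiors",
    "Carpet shampooer with dual rotating brushes",
    "Industrial wet/dry vacuum, 16 gallon tank",
]

embeddings = model.encode(descriptions, normalize_embeddings=True)

# Cluster by cosine distance; the threshold controls how tight each group is.
clustering = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.6, metric="cosine", linkage="average"
).fit(embeddings)

for label, text in zip(clustering.labels_, descriptions):
    print(label, text)
```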
