How to identify text similarity based on training data?

Question

How to identify text similarity based on training data?

Shivam Agrawal

2022年2月18日 05:00

I have a set of documents (1 to 11) for which the labeling is done.

Lets Assume:

Doc No: 1,3,5,7 - Belongs to Type A
Doc No: 2,4,9 - Belongs to Type B
Doc No: 8,10 - Belongs to Type C
Doc No, 6,11 - Belongs to No one

Now, let us say I have new incoming docs - 11,12,13 .. and so on, and I would like to know which Type (A, B, C or none ) they belong to based on the text similarity of existing docs in that Type. Can someone please suggest how I can achieve this?

Should I create my own corpus of data and consider it a supervised problem?

Topic text-classification gensim word2vec lda recommender-system

Category Data Science

Brian Spiering · Accepted Answer · 2021年9月8日 13:39

Since you have labeled training data, it is a supervised machine learning problem. It is a text classification problem, given document inputs train a model that will later predict what group new documents belong to.

There are a variety of machine learning algorithms to solve this problem. Common options are Naive Bayes and Deep Learning.

It is clearer if the data structured like this:

Data	Label / Target
Doc 1	"A"
Doc 2	"B"
Doc 3	"A"
Doc 4	"C"

learner · Accepted Answer · 2020年7月16日 21:55

I would consider some unsupervised techniques followed by supervised labelling. Basically, represent your incoming documents as dense vectors and compute similarity between the already labelled documents. Then, label them with the most similar document.

Ideas on how to solve:

Run Latent Dirichlet Allocation (LDA) on all the documents.
Each labelled document is then the probability distribution over topics
It looks like Document 1 : [0.1 0.3 0.0 ...], Document 2: [0.8 0.3 0.1 ...], ...
Finally, for all incoming documents, compute similarity with all the already labelled documents.
Label the incoming document with the label of most similar document, which is already labelled.

Another idea:

Replace the LDA with Word2Vec based models.

fractalnature · Accepted Answer · 2020年7月10日 17:00

1

fractalnature answered at 2020年7月10日 17:00

Yes this is a supervised problem. I'd suggest following the example in this article.

How to identify text similarity based on training data?

About