How to identify text similarity based on training data?

I have a set of documents (1 to 11) for which the labeling is done.

Let's assume:

Doc No: 1,3,5,7 - Belongs to Type A
Doc No: 2,4,9 - Belongs to Type B
Doc No: 8,10 - Belongs to Type C
Doc No: 6,11 - Belongs to none of the types

Now, let us say I have new incoming docs (12, 13, 14, and so on), and I would like to know which type (A, B, C, or none) each one belongs to, based on its text similarity to the existing docs of that type. Can someone please suggest how I can achieve this?

Should I create my own corpus of data and consider it a supervised problem?

Topic text-classification gensim word2vec lda recommender-system

Category Data Science


Since you have labeled training data, this is a supervised machine learning problem. More specifically, it is a text classification problem: given the labeled documents as input, you train a model that will later predict which group a new document belongs to.

A variety of machine learning algorithms can solve this problem. Common options are Naive Bayes and deep learning models.

It is clearer if the data is structured like this:

Data     Label / Target
Doc 1    "A"
Doc 2    "B"
Doc 3    "A"
Doc 4    "B"
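As a rough illustration of the Naive Bayes option, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. The training texts, the function names (`tokenize`, `train_naive_bayes`, `predict`) and the two-class setup are made up for the example; in practice you would probably use a library implementation and your real documents.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Very naive tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def train_naive_bayes(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_doc_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-probability for the text."""
    class_doc_counts, word_counts, vocab = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_doc_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][token]
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data, not taken from the question.
train_docs = [
    "invoice payment due amount",
    "football match goal score",
    "invoice amount tax payment",
    "match score team goal",
]
train_labels = ["A", "B", "A", "B"]

model = train_naive_bayes(train_docs, train_labels)
predicted = predict(model, "payment of the invoice")
```

Documents that belong to no type can be handled by treating "none" as one more label in the training data, as the question already does for docs 6 and 11.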

I would consider some unsupervised techniques followed by supervised labelling. Basically, represent your incoming documents as dense vectors and compute their similarity to the already labelled documents. Then give each incoming document the label of its most similar labelled document.

Ideas on how to solve:

  • Run Latent Dirichlet Allocation (LDA) on all the documents.
  • Each labelled document is then represented as a probability distribution over topics.
  • It looks like Document 1: [0.1 0.3 0.0 ...], Document 2: [0.5 0.3 0.1 ...], ...
  • Finally, for each incoming document, infer its topic distribution and compute its similarity to every labelled document.
  • Label the incoming document with the label of its most similar labelled document.
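The last two steps can be sketched roughly as follows, assuming the topic distributions have already been produced by some LDA implementation. The vectors below are invented for illustration, cosine similarity is just one possible similarity measure, and the names `cosine_similarity` and `label_by_nearest` are hypothetical helpers, not part of any library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_u * norm_v)


def label_by_nearest(new_vec, labelled):
    """Return (label, doc id) of the labelled document most similar to new_vec."""
    best_doc = max(
        labelled,
        key=lambda doc: cosine_similarity(new_vec, doc["topics"]),
    )
    return best_doc["label"], best_doc["id"]


# Hypothetical topic distributions (3 topics) for a few labelled documents.
labelled_docs = [
    {"id": 1, "label": "A", "topics": [0.7, 0.2, 0.1]},
    {"id": 2, "label": "B", "topics": [0.1, 0.8, 0.1]},
    {"id": 8, "label": "C", "topics": [0.1, 0.1, 0.8]},
    {"id": 6, "label": "none", "topics": [0.34, 0.33, 0.33]},
]

# Hypothetical topic distribution inferred for an incoming document.
new_doc_topics = [0.6, 0.3, 0.1]

label, nearest_id = label_by_nearest(new_doc_topics, labelled_docs)
```

Taking the single nearest document is sensitive to outliers; voting over the k most similar labelled documents is a common variation.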

Another idea:

  • Replace the LDA with Word2Vec based models.
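One simple way to get a document vector from word embeddings is to average the vectors of the words in the document. This is an assumption about what a "Word2Vec based model" could look like here, not the only option. The sketch below uses a tiny hand-written embedding table in place of a trained Word2Vec model, and `document_vector` is a hypothetical helper.

```python
def document_vector(text, embeddings):
    """Average the vectors of the words that have an embedding.

    Returns None when no word of the text is in the embedding table.
    """
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Hypothetical 2-dimensional embeddings standing in for a trained model.
toy_embeddings = {
    "invoice": [0.9, 0.1],
    "payment": [0.8, 0.2],
    "goal": [0.1, 0.9],
    "match": [0.2, 0.8],
}

doc_vec = document_vector("Invoice payment overdue", toy_embeddings)
```

The resulting vectors can then be compared with those of the labelled documents in the same way as the topic distributions in the steps above.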

Yes, this is a supervised problem. I'd suggest following the example in this article.
