How to identify text similarity based on training data?
I have a set of documents (numbered 1 to 11) that have already been labeled.
Let's assume:
Doc No: 1, 3, 5, 7 - belong to Type A
Doc No: 2, 4, 9 - belong to Type B
Doc No: 8, 10 - belong to Type C
Doc No: 6, 11 - belong to no Type
Now, let us say I have new incoming docs (12, 13, and so on), and I would like to know which Type (A, B, C, or none) each one belongs to, based on its text similarity to the existing docs of each Type. Can someone please suggest how I can achieve this?
Should I build my own labeled corpus from these documents and treat this as a supervised classification problem?
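To make the setup concrete, here is a rough sketch of one way the problem could be framed, not a recommended solution. It assumes plain TF-IDF vectors and cosine similarity written with the Python standard library only (not gensim or word2vec), scores a new doc by its mean similarity to the docs of each Type, and falls back to "none" below a cutoff. The toy texts, the `classify` helper, and the `threshold` value of 0.2 are all hypothetical placeholders.

```python
import math
from collections import Counter

# Hypothetical toy texts standing in for the labeled documents of each Type.
LABELED = {
    "A": ["the cat sat on the mat", "a cat chased the mouse"],
    "B": ["stock prices fell sharply today",
          "the market rallied as stock prices rose"],
    "C": ["the team won the football match",
          "a late goal decided the football game"],
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_idf(texts):
    """Smoothed inverse document frequency for every word in the corpus."""
    n = len(texts)
    df = Counter()
    for text in texts:
        df.update(set(tokenize(text)))
    return {word: math.log((1 + n) / (1 + count)) + 1.0
            for word, count in df.items()}


def vectorize(text, idf):
    """Sparse TF-IDF vector; words never seen in the corpus are dropped."""
    tf = Counter(tokenize(text))
    return {word: count * idf[word] for word, count in tf.items() if word in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(text, labeled, threshold=0.2):
    """Return (label, scores); label is "none" if no Type reaches threshold."""
    all_texts = [doc for docs in labeled.values() for doc in docs]
    idf = build_idf(all_texts)
    query = vectorize(text, idf)
    scores = {}
    for label, docs in labeled.items():
        sims = [cosine(query, vectorize(doc, idf)) for doc in docs]
        scores[label] = sum(sims) / len(sims)  # mean similarity to the Type
    best = max(scores, key=scores.get)
    return (best if scores[best] >= threshold else "none"), scores


label, scores = classify("stock prices rose today", LABELED)
print(label, scores)
```

With these toy texts, a doc that shares words only with the Type B docs is assigned to B, and a doc that shares no words with any labeled doc falls through to "none". The cutoff would need tuning on real data, and the docs already known to belong to no Type (6 and 11) could serve as a check on it.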
Topic text-classification gensim word2vec lda recommender-system
Category Data Science