Is it possible to classify the documents of a corpus using only the label names?
I have a corpus of 23,000 documents that need to be classified into 5 categories. I do not have any labeled data, just free-form text documents and the category labels (yes, one-word labels, not topics).
So I followed a 2-step approach:
- Synthetically generate labeled data with a rule-based labeling approach (recall is very low: only about 1 in 8 documents gets labeled).
- Use this labeled data to predict labels for the remaining documents.
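To make step 1 concrete, here is a minimal sketch of what such a rule-based labeler might look like, assuming simple keyword rules. The category names, keywords, sample documents, and the `rule_label` helper are all hypothetical and only illustrate the idea; they are not the actual rules or data.

```python
import re

# Hypothetical categories and trigger keywords (assumption, for illustration only).
RULES = {
    "billing": {"invoice", "refund", "payment"},
    "shipping": {"delivery", "courier", "tracking"},
}

def rule_label(text):
    """Return a label if exactly one category's keywords match, else None (abstain)."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    hits = [label for label, keywords in RULES.items() if tokens & keywords]
    return hits[0] if len(hits) == 1 else None

# Made-up sample documents.
docs = [
    "Please send the invoice again, the payment failed.",
    "My delivery is late and the tracking page is empty.",
    "I would like to talk to someone about my account.",
]

labels = [rule_label(d) for d in docs]

# Fraction of documents that received any label at all.
coverage = sum(label is not None for label in labels) / len(docs)
```

Abstaining when no rule fires, or when rules for several categories fire, keeps the labels fairly precise, but it is also why so few documents end up labeled.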
I have attempted the following approaches for step 2:
- Topic modeling on the rule-labeled documents to extract significant terms, then using those terms to label the remaining documents.
- Finding significant terms using sentence embeddings.
- Using sentence embeddings as features for a classifier.
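The first approach can be sketched roughly as follows. This is a simplified stand-in, not the actual pipeline: it swaps topic modeling for a smoothed log-ratio score of how often a term appears inside a label versus outside it. The function names, the `top_k` parameter, and the toy documents are assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def significant_terms(labeled_docs, top_k=3):
    """For each label, keep the top_k terms by smoothed log-ratio (inside vs. outside the label)."""
    counts = defaultdict(Counter)
    for text, label in labeled_docs:
        counts[label].update(tokenize(text))

    total = Counter()
    for label_counts in counts.values():
        total.update(label_counts)
    vocab_size = len(total)
    all_tokens = sum(total.values())

    terms = {}
    for label, label_counts in counts.items():
        in_total = sum(label_counts.values())
        out_total = all_tokens - in_total
        scores = {}
        for term in total:
            # Add-one smoothing so terms unseen on one side do not give log(0).
            p_in = (label_counts[term] + 1) / (in_total + vocab_size)
            p_out = (total[term] - label_counts[term] + 1) / (out_total + vocab_size)
            scores[term] = math.log(p_in / p_out)
        top = sorted(scores, key=scores.get, reverse=True)[:top_k]
        terms[label] = {t: scores[t] for t in top}
    return terms

def label_by_terms(text, terms):
    """Sum the scores of matching significant terms per label; abstain if nothing matches."""
    tokens = set(tokenize(text))
    scores = {
        label: sum(w for t, w in label_terms.items() if t in tokens)
        for label, label_terms in terms.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

# Toy rule-labeled documents (made up).
labeled = [
    ("invoice payment refund overdue", "billing"),
    ("refund for the invoice", "billing"),
    ("delivery tracking courier delayed", "shipping"),
    ("courier lost the delivery", "shipping"),
]
terms = significant_terms(labeled)

unlabeled = ["The refund has not arrived", "Where is my courier", "Hello there"]
predictions = [label_by_terms(d, terms) for d in unlabeled]
```

Because the significant terms are learned only from documents the rules already caught, they tend to repeat the rule vocabulary, which may be one reason this approach struggles to reach the remaining documents.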
But I haven't been successful in getting good results for my document classifier. Are there any other methods that can be used to classify the documents?
All help is greatly appreciated.
Topic document-term-matrix text-classification similar-documents nlp
Category Data Science