Is it possible to classify documents of corpus using labels?

Question

Is it possible to classify documents of corpus using labels?

Uttakarsh Tikku

February 22, 2022, 15:00

I have a corpus of 23,000 documents that need to be classified into 5 different categories. I do not have any labeled data, just free-form text documents and the labels (yes, one-word labels, not topics).

So I followed a 2-step approach:

1. Synthetically generate labeled data using a rule-based labeling approach. The recall is, unsurprisingly, very low: only about 1 in 8 documents gets labeled.
2. Use this labeled data to identify labels for the remaining documents.
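
For illustration, step 1 might look something like the sketch below. The keyword sets, label names, and the `rule_label` helper are all hypothetical and not taken from the actual rules used here. The idea is that a document gets a label only when exactly one label's keywords match, which is why coverage stays low.

```python
import re

# Hypothetical keyword rules: one set of trigger words per label.
RULES = {
    "sports": {"match", "league", "goal"},
    "finance": {"stock", "loan", "bank"},
}

def rule_label(text, rules=RULES):
    """Return a label only when exactly one label's keywords match; otherwise abstain."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    hits = [label for label, keywords in rules.items() if tokens & keywords]
    return hits[0] if len(hits) == 1 else None

docs = [
    "The league match ended with a late goal.",
    "The bank raised its loan rates.",
    "The bank sponsored the league final.",   # two labels match: abstain
    "Nothing relevant is mentioned here.",    # no label matches: abstain
]

labels = [rule_label(d) for d in docs]
# Fraction of documents that received any label at all.
coverage = sum(label is not None for label in labels) / len(docs)
print(labels, coverage)
```

Abstaining on ambiguous documents keeps the synthetic labels fairly precise, at the cost of leaving most of the corpus unlabeled for step 2.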

I have attempted the following approaches for step 2:

- Topic modeling on the rule-labeled documents to extract significant terms, then using those terms to label the remaining documents.
- Finding significant terms using sentence embeddings.
- Using sentence embeddings as features for my classifier.

But I haven't been successful in getting good results for my document classifier. Are there any other methods that can be used to classify the documents?

All help is greatly appreciated.

Topic document-term-matrix text-classification similar-documents nlp

Category Data Science

tehem · Accepted Answer · August 26, 2020, 01:49

You could simply encode the documents using BERT and cluster them based on their content, provided they are sufficiently different in the kind of content they contain.
Another approach would be to train a document segmentation model that segments documents based on their semantic structure, and then classify the documents based on their masked skeletons. This, however, requires a large dataset to train. Fortunately, one is available online: PubLayNet. Augment it with a few representations of your own documents for better generalization over the test set.

I've read about the second approach being used to classify patents, legal documents, research papers, etc., with good results. However, it would take a long time to train.

I'd recommend simply clustering the documents based on their text embeddings (the first approach) and then naming the clusters. If that doesn't work satisfactorily, try the deep learning method for document semantic masking.
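
A minimal sketch of the cluster-then-name idea, under some loud assumptions: the 2-D vectors below are made-up stand-ins for real BERT embeddings, the k-means is a small pure-Python version rather than a library implementation, and the clusters are named by majority vote over a few documents that already carry a rule-based label (`seed_labels` is a hypothetical mapping from document index to label).

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def kmeans(points, k, iterations=20):
    # Farthest-first initialisation keeps this toy example deterministic.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    assignment = []
    for _ in range(iterations):
        new_assignment = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = mean(members)
    return assignment

def name_clusters(assignment, seed_labels):
    """Name each cluster after the most common seed label among its members."""
    votes = {}
    for idx, label in seed_labels.items():
        votes.setdefault(assignment[idx], Counter())[label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}

# Made-up 2-D "embeddings"; real ones would come from a BERT-style encoder.
embeddings = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.3),
]
seed_labels = {0: "sports", 3: "finance"}  # hypothetical rule-based labels

assignment = kmeans(embeddings, k=2)
names = name_clusters(assignment, seed_labels)
predicted = [names.get(a) for a in assignment]  # None if a cluster has no seeds
print(predicted)
```

For the corpus in the question, `k` would presumably be 5, and any cluster that ends up with no seed documents would still need to be named by hand.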
