Binary document classification using keywords for a very small dataset

Question

s21

May 9, 2022, 14:00

I have a set of 150 documents with their assigned binary class. I also have 1000 unlabeled documents. Each document is about the length of a journal paper. Each class has 15 associated keywords.

I want to be able to predict the assigned class of the documents using this information. Does anyone have any ideas of how I could approach this problem?

Topic binary-classification text-classification classification nlp machine-learning

Category Data Science

Erwan · Accepted Answer · September 21, 2021, 10:35

This problem is called text classification (it belongs to the more general case of document classification). There are plenty of resources online about this, and there are also a lot of research papers on the topic.

General text classification consists of two steps:

Represent the text as features
Train a classification model

The first step is specific to text, whereas the second step is general ML. There are plenty of options for representing text as features, from the traditional bag-of-words (BoW) representation to word embeddings. I explained the principle of the traditional BoW representation in another question.
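To make the two steps concrete, here is a minimal sketch using only the Python standard library. It is not the method from any particular resource: the choice of a multinomial Naive Bayes classifier, the function names (bag_of_words, train_naive_bayes, predict) and the toy documents are all assumptions made for illustration. In practice a library such as scikit-learn would normally handle both steps.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Step 1: represent a text as word counts (word order is ignored)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def train_naive_bayes(texts, labels):
    """Step 2: fit a multinomial Naive Bayes model on bag-of-words counts."""
    word_counts = {}              # label -> Counter of word occurrences
    doc_counts = Counter(labels)  # label -> number of training documents
    vocab = set()
    for text, label in zip(texts, labels):
        bow = bag_of_words(text)
        word_counts.setdefault(label, Counter()).update(bow)
        vocab.update(bow)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def predict(model, text):
    """Return the label with the highest log-posterior for the text."""
    bow = bag_of_words(text)
    n_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, counts in model["word_counts"].items():
        total = sum(counts.values())
        score = math.log(model["doc_counts"][label] / n_docs)
        for word, n in bow.items():
            if word in model["vocab"]:  # ignore words never seen in training
                # Laplace (add-one) smoothing avoids zero probabilities
                score += n * math.log((counts[word] + 1) / (total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, made up for illustration only
train_texts = [
    "the protein binds the receptor in the cell",
    "gene expression in the cell membrane",
    "the market price of the stock fell",
    "investors sold stock as the price dropped",
]
train_labels = ["bio", "bio", "finance", "finance"]

model = train_naive_bayes(train_texts, train_labels)
prediction_bio = predict(model, "receptor protein expression")
prediction_fin = predict(model, "stock price")
```

With only 150 labeled documents, a simple model like this is a reasonable baseline before trying anything heavier. Restricting the vocabulary to the known class keywords would be one way to shrink the feature space, though that is a design choice to validate on held-out data rather than a given.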
