Binary document classification using keywords for a very small dataset

I have a set of 150 documents with their assigned binary class. I also have 1000 unlabeled documents. Each document is about the length of a journal paper. Each class has 15 associated keywords.

I want to be able to predict the assigned class of the documents using this information. Does anyone have any ideas of how I could approach this problem?

Topic binary-classification text-classification classification nlp machine-learning

Category Data Science


This problem is called text classification, which is a special case of the more general document classification problem. There are plenty of resources online about it, and a lot of research papers on the topic.

General text classification consists of two steps:

  1. Represent the text as features
  2. Train a classification model

The first step is specific to text, whereas the second step is general ML. There are plenty of options for representing text as features, from the traditional bag-of-words (BoW) representation to word embeddings. In this question I explained the principle of the traditional BoW representation.
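To make the two steps concrete, here is a minimal sketch in plain Python. It assumes a simple lowercase word tokenizer for the BoW step and uses multinomial Naive Bayes with add-one smoothing as the classifier; that model is only one possible choice for step 2, not something this answer prescribes. The function names (`bag_of_words`, `train_naive_bayes`, `predict`) and the toy documents and labels are invented for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Step 1: represent a text as a bag of words (token -> count)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def train_naive_bayes(docs, labels):
    """Step 2: fit a multinomial Naive Bayes model on BoW counts."""
    vocab = set()
    word_counts = {}              # label -> word counts over its documents
    doc_counts = Counter(labels)  # label -> number of training documents
    for text, label in zip(docs, labels):
        bow = bag_of_words(text)
        vocab.update(bow)
        word_counts.setdefault(label, Counter()).update(bow)
    return {"vocab": vocab, "word_counts": word_counts, "doc_counts": doc_counts}


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    bow = bag_of_words(text)
    n_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, counts in model["word_counts"].items():
        total = sum(counts.values())
        # Log prior of the class.
        score = math.log(model["doc_counts"][label] / n_docs)
        for word, n in bow.items():
            if word in model["vocab"]:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += n * math.log((counts[word] + 1) / (total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy labeled data, invented for illustration only.
train_docs = [
    "the striker scored a goal in the match",
    "the team won the football match",
    "the election results favoured the senate candidate",
    "parliament passed the new tax law",
]
train_labels = ["sport", "sport", "politics", "politics"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, "a late goal decided the match"))
print(predict(model, "the senate debated the tax law"))
```

With only 150 labeled documents, a simple model on sparse BoW features like this is a reasonable baseline to try before anything heavier. In practice a library such as scikit-learn provides tested implementations of both steps.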
