Classification of similar text input features with a text output label

I hope somebody can provide guidance/input/advice on my project, where I believe AI can help.

I have a general understanding of AI, but I lack formal training.

I've never built a neural net from scratch on my own.

Task

Build a classification model able to assign labels to input text data.

Unlike a textbook example, the input is free text, i.e. neither categorical nor numerical.

To complicate matters, the predictors in the training data I use are often similar to each other.

Data

  • input: short text data consisting of job descriptions, e.g. senior marketing manager

    The shortest entry consists of a single word, while the longest runs up to ~20 words.

    The input data forms a closed list (~130k entries), but new, unseen text might occur.
  • labels: closed list of 65 text labels (and a corresponding id)

Strategies tested

For a previous project, I built a word embedding model with gensim Word2Vec using the same data.

So I used this model to get the vector representation of each word in the input text and then took the centroid of those vectors as the embedding of the whole text.
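For reference, the averaging step looks roughly like this (a minimal sketch assuming a gensim 4.x Word2Vec model; the model file name, the whitespace tokenization, and the `texts` variable are placeholders):

```python
# Minimal sketch: text embedding as the centroid of its word vectors,
# using a previously trained gensim Word2Vec model (file name is a
# placeholder).
import numpy as np
from gensim.models import Word2Vec

w2v = Word2Vec.load("job_titles_w2v.model")

def centroid_embedding(text):
    # Keep only the words present in the Word2Vec vocabulary
    vectors = [w2v.wv[w] for w in text.lower().split() if w in w2v.wv]
    if not vectors:                        # all words out of vocabulary
        return np.zeros(w2v.vector_size)
    return np.mean(vectors, axis=0)        # centroid = element-wise mean

X = np.vstack([centroid_embedding(t) for t in texts])  # texts: job descriptions
```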

I used these embeddings to train/test the following (a comparison sketch follows the list):

  • ordinary classifiers: DecisionTree, RandomForest, NaiveBayes, KNeighborsClassifier - max accuracy ~41%
  • multiclass meta-estimators: OneVsRestClassifier, OneVsOneClassifier, OutputCodeClassifier - max accuracy ~43%
  • the keras-based text classifier from "keras multiple text features input and single text label output classification" - max accuracy ~46%

    I'm not really sure whether this net is appropriate for my task; from its description it seems so to me.
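For what it's worth, a comparison along these lines can be scripted compactly. The sketch below assumes the centroid feature matrix X from the snippet above and an array y of label ids; the specific models are illustrative:

```python
# Sketch: train/test a few scikit-learn classifiers on the centroid
# embeddings and compare their accuracies.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

candidates = {
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(),
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(),
    "one_vs_rest": OneVsRestClassifier(LogisticRegression(max_iter=1000)),
}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    print(f"{name}: {clf.score(X_test, y_test):.3f}")
```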

Given such low accuracies, I did not tune the hyperparameters of these classifiers.

I think I should build a model with higher baseline performance first.

New Strategy

I thought that training Doc2Vec embeddings on the input data and feeding them to the keras neural net should improve the classifier's performance.

Here's what I had in mind (sketched in code after the list):

  • create a grid of Doc2Vec (hyper)parameters
  • train a new Doc2Vec model for each combination
  • evaluate each new model with a simple, quick classifier (e.g. logistic regression)
  • use the highest-accuracy model to train/test the keras classification neural net
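A minimal sketch of steps 1–3, assuming gensim's Doc2Vec with a logistic-regression probe (the parameter grid, `texts`, and `y` are placeholders):

```python
# Sketch of the planned Doc2Vec grid search, scored with a quick
# logistic-regression cross-validation. `texts` and `y` stand for
# the job descriptions and their label ids.
from itertools import product
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

docs = [TaggedDocument(words=t.lower().split(), tags=[i])
        for i, t in enumerate(texts)]

results = {}
for vector_size, window, epochs in product([50, 100, 300], [2, 5], [20, 40]):
    d2v = Doc2Vec(docs, vector_size=vector_size, window=window,
                  epochs=epochs, min_count=2, workers=4)
    X = np.vstack([d2v.dv[i] for i in range(len(docs))])
    score = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                            cv=3, scoring="accuracy").mean()
    results[(vector_size, window, epochs)] = score

best = max(results, key=results.get)
print("best params:", best, "accuracy:", round(results[best], 3))
```

The best-scoring parameter combination from `results` would then feed step 4.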

Questions

Before embarking on this (lengthy) process, I wish to ask for advice:

  • Is this approach reasonable or can someone recommend a more clever way?
  • Can anyone recommend which Doc2Vec parameters to focus on and which to set?
  • Is the keras neural net I've found appropriate for this task or should I modify it? If so, how?

Many thanks to anyone kind enough to read all of this and provide suggestions.

Tags: doc2vec, text-classification, gensim, keras, nlp


I suggest you use the state of the art for this kind of problem: a BERT-based approach. It is well documented and very accessible, given the large number of examples available online.

The approach consists of taking a pre-trained neural network model from the BERT family (Transformer encoders, normally trained on a masked language modeling task over a large dataset) and fine-tuning it on your data.

This would allow you to profit from:

  • BERT's subword vocabulary, which avoids out-of-vocabulary words by decomposing unseen words into smaller known fragments (illustrated in the sketch after this list).
  • The power of transfer learning, which mitigates situations where you don't have a lot of data.
  • State-of-the-art performance. You can check on Papers with Code that BERT-based approaches are frequent top performers on standard text classification benchmarks.
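To illustrate the first point: the tokenizer splits a word it has never seen into known subword pieces instead of mapping it to an unknown token (a small sketch assuming the bert-base-uncased checkpoint):

```python
# Sketch: BERT's WordPiece tokenizer decomposes rare or unseen words
# into known subword fragments, so there are no out-of-vocabulary words.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("senior growth-hacking evangelist"))
# Rare words come out as several '##'-prefixed fragments rather than
# a single unknown token.
```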

One of the most well-documented and well-maintained Python libraries for this is Hugging Face Transformers. You can have a look at some examples of how to do text classification on a custom dataset here.
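As a starting point, a fine-tuning run can be as short as the following sketch (assuming the transformers and datasets libraries; the CSV file and column names are placeholders for your own data):

```python
# Minimal sketch: fine-tune a BERT model for 65-way text classification
# with Hugging Face Transformers. "job_titles.csv" with columns
# "text" and "label" is a placeholder for your own dataset.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("csv", data_files="job_titles.csv")["train"]
split = dataset.train_test_split(test_size=0.1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Job descriptions are short, so a small max_length keeps training cheap
    return tokenizer(batch["text"], truncation=True, max_length=32)

split = split.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=65)

args = TrainingArguments(output_dir="bert-job-titles",
                         num_train_epochs=3,
                         per_device_train_batch_size=32,
                         evaluation_strategy="epoch")

trainer = Trainer(model=model, args=args,
                  train_dataset=split["train"],
                  eval_dataset=split["test"],
                  tokenizer=tokenizer)   # enables dynamic padding per batch
trainer.train()
```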

Regarding the computational resources needed to train the system: you can use your own GPUs if you have any, train on CPU (which may be feasible given your short sequence lengths), or use Google Colab, which is very handy if you don't need a lot of training time.


That approach is reasonable. Short text inputs combined with a multi-class output make for a challenging problem.

gensim's Doc2Vec hyperparameters probably matter less than collecting more data or reducing the number of labels.

It might be useful to try more advanced models, such as a Transformer or the Switch Transformer.
