Classification of similar text input features with a text output label
I hope somebody can provide guidance/input/advice on my project, where I believe AI can help.
I have a general understanding of AI, but I lack formal training.
I've never built a neural net from scratch on my own.
Task
Build a classification model able to assign labels to input text data.
Unlike a textbook example, the input is free text, so it is neither categorical nor numerical.
To complicate matters, the predictors in the training data I use are often similar to each other.
Data
- input: short text data consisting of job descriptions, e.g. senior marketing manager
  the shortest entry consists of a single word, while the longest has up to ~20 words.
  the input data forms a closed list (~130k entries), but new, unseen text might occur.
- labels: closed list of 65 text labels (and corresponding ids)
Strategies tested
For a previous project, I built a word embedding model with gensim Word2Vec using the same data.
I used this model to get the vector representation of each word in the input text, then took the centroid (mean) of those word vectors as the embedding of the whole text.
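To make the centroid step concrete, here is a minimal sketch of what I mean. The dict `toy_vectors` and its values are made up and only stand in for the trained Word2Vec lookup (with gensim it would be `model.wv`, which supports the same `in` test and indexing); `centroid_embedding` is just an illustrative name, not a library function.

```python
from typing import Dict, List, Optional, Sequence


def centroid_embedding(
    tokens: Sequence[str], vectors: Dict[str, List[float]]
) -> Optional[List[float]]:
    """Average the vectors of all known tokens; return None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # Every word is out of vocabulary, so there is nothing to average.
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Made-up 3-dimensional vectors standing in for a trained Word2Vec model.
toy_vectors = {
    "senior": [1.0, 0.0, 2.0],
    "marketing": [0.0, 2.0, 0.0],
    "manager": [2.0, 1.0, 1.0],
}

embedding = centroid_embedding("senior marketing manager".split(), toy_vectors)
```

Note that word order is lost in the average, and that entries sharing most of their words end up with nearly identical embeddings.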
I used this embedding to train and test the following:
- ordinary classifiers: DecisionTree, RandomForest, NaiveBayes, KNeighborsClassifier - max accuracy ~41%
- multiclass meta-classifiers: OneVsRestClassifier, OneVsOneClassifier, OutputCodeClassifier - max accuracy ~43%
- keras-based text classifier found in "keras multiple text features input and single text label output classification" - max accuracy ~46%
  not really sure if this net is appropriate for my task, but from the description it seems so to me
Given such low accuracies, I did not tune the hyperparameters of these classifiers.
I think I should build a model with higher baseline performance first.
New Strategy
I thought that training a Doc2Vec embedding of the input data and using it to train the keras neural net should improve the performance of my classifier.
Here's what I had in mind:
- create a grid of Doc2Vec (hyper)parameters
- train a new Doc2Vec model for each combination
- test each new model using a simple and quick classifier (e.g. logistic regression)
- use the highest-accuracy model to train and test the keras classification neural net
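The steps above could be sketched roughly as follows. The parameter names in `PARAM_GRID` are real gensim `Doc2Vec` arguments, but the values are placeholders I picked for illustration, and `iter_grid`, `score_doc2vec` and `best_params` are hypothetical helper names. `score_doc2vec` assumes gensim 4.x and scikit-learn are installed, plus a train/test split that already exists; I have not run it on my data.

```python
from itertools import product

# Illustrative grid only: real Doc2Vec argument names, placeholder values.
PARAM_GRID = {
    "vector_size": [50, 100],
    "window": [2, 5],
    "min_count": [1, 2],
    "dm": [0, 1],  # 0 = PV-DBOW, 1 = PV-DM
}


def iter_grid(grid):
    """Yield one dict per combination of the grid values."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def score_doc2vec(params, train_texts, train_labels, test_texts, test_labels):
    """Train one Doc2Vec model and return logistic-regression test accuracy."""
    # Imported here so the grid helpers work without these packages installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.linear_model import LogisticRegression

    corpus = [
        TaggedDocument(words=text.lower().split(), tags=[i])
        for i, text in enumerate(train_texts)
    ]
    model = Doc2Vec(epochs=20, **params)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

    # Infer vectors for both splits so they are produced the same way.
    x_train = [model.infer_vector(t.lower().split()) for t in train_texts]
    x_test = [model.infer_vector(t.lower().split()) for t in test_texts]

    clf = LogisticRegression(max_iter=1000).fit(x_train, train_labels)
    return clf.score(x_test, test_labels)


def best_params(grid, score_fn):
    """Return the parameter combination with the highest score."""
    return max(iter_grid(grid), key=score_fn)
```

The search would then be `best_params(PARAM_GRID, lambda p: score_doc2vec(p, train_texts, train_labels, test_texts, test_labels))`, where the four data arguments come from whatever split is used. Even this small grid means 16 Doc2Vec trainings on ~130k entries, which is why I am asking before starting.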
Questions
Before embarking on this (lengthy) process, I wish to ask for advice:
- Is this approach reasonable or can someone recommend a more clever way?
- Can anyone recommend which Doc2Vec parameters to focus on and which to set?
- Is the keras neural net I've found appropriate for this task or should I modify it? If so, how?
Many thanks to anyone kind enough to read all of this and provide suggestions.