Vectorize one-line text data

How can I vectorize one-line text data? I have used TF-IDF with bigrams and trigrams, but I am not getting good results. My data consists of purchase order (PO) descriptions, each a single line, which I need to classify. It is a multi-class, imbalanced problem with a small training set of around 700 PO descriptions. There are 7 classes, the class distribution is roughly exponential, and one class dominates. My take is that TF-IDF should not work well here, since both the term frequency and the inverse document frequency (IDF) values will be very small. Also, can we write user-defined functions to create the vectors? If so, what should they be?

Please suggest some alternative approaches as well.



Alternatively, you could use a pretrained embedding model such as word2vec or GloVe to turn each description into a fixed-length vector.
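One common way to get a single fixed-length vector per description from word-level embeddings is to average the vectors of the words it contains. The sketch below assumes that approach; it is not necessarily what the answer had in mind. The helper name `average_word_vectors` and the tiny `toy_vectors` lookup are hypothetical stand-ins. In practice the lookup would be a real pretrained model, i.e. any object that supports `word in lookup` and `lookup[word]`.

```python
import numpy as np


def average_word_vectors(text, word_vectors, dim):
    """Average the vectors of all known words in `text`.

    `word_vectors` is any mapping from word to a 1-D array of length `dim`.
    Words missing from the mapping are skipped; if no word is known,
    a zero vector is returned so every text still gets a vector.
    """
    tokens = text.lower().split()
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Hypothetical 3-dimensional lookup standing in for a pretrained model.
toy_vectors = {
    "steel": np.array([1.0, 0.0, 0.0]),
    "pipe": np.array([0.0, 1.0, 0.0]),
    "laptop": np.array([0.0, 0.0, 1.0]),
}

# "20mm" is not in the lookup, so only "steel" and "pipe" are averaged.
vec = average_word_vectors("Steel pipe 20mm", toy_vectors, dim=3)
empty = average_word_vectors("unknown words only", toy_vectors, dim=3)
```

Averaging ignores word order, which is usually acceptable for short descriptions, but rare product codes that are missing from the pretrained vocabulary contribute nothing to the vector.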


Using bigrams and trigrams is likely to generate a large number of features, whereas with a small dataset the traditional approach is to reduce the number of features. You could start by removing the least frequent words and n-grams (e.g. those with fewer than 3 occurrences), and/or apply feature selection based on information gain. The result might not be very accurate, but at least you reduce the risk of overfitting.


Check my answer to this question. Nowadays there are many pretrained embedders to choose from. They give you a fixed-size numerical feature vector. You don't even have to go the DNN route; XGBoost will work just fine.
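As a rough sketch of the second half of that advice, the code below trains gradient-boosted trees on fixed-size vectors. Several things here are assumptions: scikit-learn's `GradientBoostingClassifier` is used as a stand-in for XGBoost (whose `XGBClassifier` offers a similar `fit`/`predict` interface), the "embeddings" are synthetic random vectors rather than the output of a real embedder, and the balanced sample weights are an optional extra added because the question mentions class imbalance.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)

# Synthetic placeholder for real embeddings: 60 vectors of dimension 16,
# with an imbalanced label distribution (40 / 15 / 5).
y = np.array([0] * 40 + [1] * 15 + [2] * 5)
centers = rng.normal(size=(3, 16))
X = centers[y] + 0.1 * rng.normal(size=(60, 16))

# Weight each sample inversely to its class frequency so that the
# dominant class does not drown out the rare ones.
weights = compute_sample_weight("balanced", y)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X, y, sample_weight=weights)

predictions = clf.predict(X)
train_accuracy = clf.score(X, y)
```

The training accuracy computed here only confirms that the pipeline runs. With around 700 imbalanced samples, a stratified cross-validation with a per-class metric such as macro F1 would be a more honest measure.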
