How to decide which method to use: TF-IDF or BoW?

I have a huge NLP dataset, and classifying it takes a very long time.

Therefore, trying each feature extraction method separately is time-consuming and inefficient.

Is there a way to tell which method (TF-IDF or Bag of Words) is more likely to give the highest F1 score?

I tried testing them on a smaller subset (1,000 records). It was fast, but the best method on a small subset is not necessarily the best on the complete dataset.

Is there any other way to decide which method to use?

Topic bag-of-words tfidf feature-extraction nlp

Category Data Science


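I agree with the other answer here. In general, though, BoW encodes each document as raw word counts, while TF-IDF reweights those counts so that very common words like "are", "is", "the", etc., which do not lead to intelligence discovery in text, are down-weighted (they are not removed). So comparing BoW and TF-IDF as competing alternatives is not entirely appropriate: TF-IDF builds on BoW counts, and the two serve different purposes.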
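
To make the relationship between the two concrete, here is a minimal sketch. It assumes scikit-learn (the question does not name a library) and a made-up three-sentence corpus. It shows that TF-IDF features can be obtained by reweighting BoW counts, and that a word appearing in every document gets a low IDF weight but stays in the vocabulary:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus, invented purely for illustration.
docs = [
    "the cat is on the mat",
    "the dog is in the house",
    "the cat and the dog are friends",
]

# Bag of words: raw token counts per document.
counts = CountVectorizer().fit_transform(docs)

# TF-IDF computed by reweighting those counts...
tfidf_from_counts = TfidfTransformer().fit_transform(counts)

# ...and TF-IDF computed directly from the text.
vec = TfidfVectorizer()
tfidf_direct = vec.fit_transform(docs)

# Both routes should give the same matrix.
same = np.allclose(tfidf_from_counts.toarray(), tfidf_direct.toarray())

# "the" occurs in every document, "mat" in only one.
idf_the = vec.idf_[vec.vocabulary_["the"]]
idf_mat = vec.idf_[vec.vocabulary_["mat"]]
weight_the_doc0 = tfidf_direct[0, vec.vocabulary_["the"]]

print("same matrix:", same)
print("idf(the) < idf(mat):", idf_the < idf_mat)
print("weight of 'the' in doc 0 is non-zero:", weight_the_doc0 > 0)
```

If common words should actually be dropped rather than down-weighted, both vectorizers accept a `stop_words` argument, so that choice is independent of BoW versus TF-IDF.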

In general, I would recommend using a pre-trained model such as BERT, which has already been trained on massive text corpora like Wikipedia and can provide commercial-grade accuracy even with limited additional data used for fine-tuning on top of the base BERT model.

Hand-coding feature extraction on a large text corpus with NLTK or scikit-learn tools such as BoW or TF-IDF vectorizers (e.g. CountVectorizer), lemmatizers, or stemmers is unlikely to match the sophistication of commercial-grade pre-trained models such as BERT or OpenAI GPT.


There is no specific recipe for this kind of experimentation. Below are some important points to keep in mind before running experiments:

  1. If you are using a neural network (NN), dense vectors like word2vec or fastText may give better results than BoW/TF-IDF.

  2. If you have many out-of-vocabulary (OOV) words, fastText may give better output than basic word2vec.

  3. If you are using linear algorithms like Logistic Regression or a linear SVM, BoW/TF-IDF may have some advantage over averaging all the word vectors in the sentence. But this is not always true.

  4. For tree-based algorithms, training time may increase with BoW/TF-IDF features because of the very high feature dimensionality.
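
Since nothing predicts the winner in advance, one possible way to keep experiments cheap is to compare both vectorizers on nested subsamples of increasing size and check whether the ranking stays stable as the sample grows. The sketch below is only an illustration under assumptions: it uses scikit-learn with Logistic Regression (one of the linear models mentioned in point 3), macro-averaged F1, and a synthetic toy corpus standing in for the real data. The helper name `compare_vectorizers` is hypothetical.

```python
import random

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def compare_vectorizers(texts, labels, sizes, cv=3, seed=0):
    """Return {size: {"bow": mean_f1, "tfidf": mean_f1}} on random subsamples."""
    rng = random.Random(seed)
    order = list(range(len(texts)))
    rng.shuffle(order)
    results = {}
    for size in sizes:
        idx = order[:size]  # nested subsamples: larger sizes extend smaller ones
        X = [texts[i] for i in idx]
        y = [labels[i] for i in idx]
        scores = {}
        for name, vectorizer in (
            ("bow", CountVectorizer()),
            ("tfidf", TfidfVectorizer()),
        ):
            # Keeping the vectorizer inside the pipeline means it is refit
            # on the training part of every fold, avoiding leakage.
            model = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
            fold_scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
            scores[name] = fold_scores.mean()
        results[size] = scores
    return results


# Synthetic stand-in corpus (hypothetical data, only to make the sketch runnable).
_rng = random.Random(42)
_topic_words = {
    0: ["goal", "match", "team", "score"],
    1: ["stock", "market", "price", "trade"],
}
_filler = ["the", "is", "are", "and", "of", "today"]
texts, labels = [], []
for _ in range(300):
    label = _rng.randint(0, 1)
    words = _rng.choices(_topic_words[label], k=3) + _rng.choices(_filler, k=5)
    _rng.shuffle(words)
    texts.append(" ".join(words))
    labels.append(label)

results = compare_vectorizers(texts, labels, sizes=[60, 150, 300])
for size, scores in results.items():
    print(size, scores)
```

If the gap between the two methods keeps the same sign across the larger sizes, that is some evidence (not a guarantee) that the ranking will carry over to the full dataset. If the ranking flips between sizes, the subsamples are probably still too small to decide.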
