How to decide which method to use: TF-IDF or BOW

Question

asmgx

March 8, 2022, 12:05

I have a huge NLP dataset, and classifying it takes a very long time.

Trying each feature extraction method separately is therefore time-consuming and inefficient.

Is there a way to tell which method (TF-IDF or Bag of Words) is more likely to give the highest F1 score?

I tried testing them on a smaller subset (1,000 records). It was fast, but the best method on a small subset is not necessarily the best on the complete dataset.

Is there any other way to decide which method to use?

Topic bag-of-words tfidf feature-extraction nlp

Category Data Science

Vivek Singhal · Accepted Answer · February 3, 2022, 08:33

I agree with the other answer here, but in general BOW is for word encoding, while TF-IDF down-weights common words like "are", "is", "the", etc., which carry little useful information about the text. So comparing BOW and TF-IDF is not appropriate, as they have different uses.
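To make the difference concrete, here is a minimal pure-Python sketch of both encodings. The toy corpus is made up for illustration, and the smoothed IDF formula is just one common variant, so treat the exact numbers as an assumption. In practice you would probably use a library such as scikit-learn's `CountVectorizer` and `TfidfVectorizer` instead.

```python
import math
from collections import Counter

# Toy corpus, invented purely for illustration.
docs = [
    "the movie is great",
    "the movie is boring",
    "the plot is great",
]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Bag of words: one raw count per vocabulary word, per document.
bow = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

# Document frequency: in how many documents each word appears.
n_docs = len(tokenized)
df = {w: sum(w in doc for doc in tokenized) for w in vocab}

# One common smoothed IDF variant (an assumption, other variants exist).
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1.0 for w in vocab}

# TF-IDF: the same counts, rescaled by how rare each word is.
tfidf = [[count * idf[w] for w, count in zip(vocab, row)] for row in bow]

for w in vocab:
    print(f"{w:>7}  df={df[w]}  idf={idf[w]:.3f}")
```

Note that a word like "the", which occurs in every document, is still present in the TF-IDF vectors. It simply gets the lowest weight, while a rarer word like "boring" gets the highest.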

In general, I would recommend using pre-trained models such as BERT, which is already trained on a massive text corpus such as Wikipedia and can provide commercial-grade accuracy even with limited additional data for fine-tuning on top of the base BERT model.

Hand-coding feature extraction on a large text corpus with NLTK or scikit-learn tools such as BOW, TF-IDF, lemmatizers/stemmers, or CountVectorizer is not going to match the sophistication of commercial-grade pre-trained models such as BERT or OpenAI GPT.

Uday · Accepted Answer · March 3, 2021, 08:02

There is no specific way to handle this kind of experimentation. Below are some important points to keep in mind before experimenting:

If you are using a neural network, dense vectors like word2vec or fastText may give better results than BoW/TF-IDF.
If you have many out-of-vocabulary (OOV) words, fastText may give better output than basic word2vec.
If you are using linear algorithms like Logistic Regression or Linear SVM, BoW/TF-IDF may have some advantage over averaging all the word vectors in the sentence, but this is not always true.
For tree-based algorithms, training time may increase if you use BoW/TF-IDF features, because the feature vectors are very long.
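The feature-length point above can be illustrated with a small sketch. Everything here is hypothetical: the corpus is made up, and the 3-dimensional "embeddings" are placeholder numbers standing in for what a real word2vec or fastText model would provide.

```python
# Toy corpus, invented purely for illustration.
docs = ["cheap flights to paris", "cheap hotels in rome", "flights to rome"]
tokenized = [d.split() for d in docs]

vocab = sorted({w for doc in tokenized for w in doc})
index = {w: i for i, w in enumerate(vocab)}


def bow_vector(tokens):
    """Count vector with one slot per vocabulary word."""
    vec = [0] * len(vocab)
    for t in tokens:
        if t in index:
            vec[index[t]] += 1
    return vec


# Hypothetical dense vectors; a trained embedding model would supply real ones.
DIM = 3
embeddings = {w: [float(len(w)), float(i % 3), 1.0] for i, w in enumerate(vocab)}


def mean_vector(tokens):
    """Average the word vectors of a sentence, skipping unknown words."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * DIM
    return [sum(col) / len(known) for col in zip(*known)]


bow_features = [bow_vector(t) for t in tokenized]
dense_features = [mean_vector(t) for t in tokenized]

# BoW width equals the vocabulary size; the averaged width stays at DIM.
print(len(bow_features[0]), len(dense_features[0]))
```

With a real corpus the vocabulary can easily reach tens of thousands of words, so the BoW/TF-IDF matrix becomes very wide (and sparse), while the averaged embedding keeps the same small width regardless of vocabulary size.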
