Why would I use TF-IDF after Bag-of-Words (CountVectorizer)?
In my recent studies of Machine Learning NLP tasks, I found a very nice tutorial teaching how to build your first text classifier.
The point is that I always believed you had to choose between Bag-of-Words, word embeddings, or TF-IDF, but in this tutorial the author uses Bag-of-Words (CountVectorizer) and then applies TF-IDF on top of the features generated by Bag-of-Words:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

text_clf = Pipeline([
    ('vect', CountVectorizer()),    # tokenize and build raw term-count features
    ('tfidf', TfidfTransformer()),  # reweight the counts with TF-IDF
    ('clf', MultinomialNB()),       # classify on the weighted features
])
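For what it's worth, I tried a small sketch (on a toy corpus I made up) to check that chaining CountVectorizer and TfidfTransformer gives the same features as scikit-learn's combined TfidfVectorizer, assuming default settings for both:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus purely for illustration
docs = ["the cat sat", "the dog sat", "the cat ran"]

# Two-step: raw term counts, then TF-IDF reweighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step: TfidfVectorizer does both in a single estimator
one_step = TfidfVectorizer().fit_transform(docs)

# With matching defaults the two matrices should coincide
print(np.allclose(two_step.toarray(), one_step.toarray()))
```

On my run the two matrices matched, so the two-step pipeline seems to be just an explicit version of TfidfVectorizer.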
Is that a valid technique? Why would I do this?
Topic bag-of-words tfidf nlp
Category Data Science