How to decide to go with BOW or TFIDF

I know that there are feature selection methods such as Mutual Information and Information Gain.

But for datasets with thousands of records and thousands of features, it is time-consuming to train a model with both BOW and TFIDF just to decide which one works better.

Is there a way to choose between them without spending all this time?

Topic bag-of-words tfidf nlp

Category Data Science


Technically BOW includes all the methods where words are considered as a set, i.e. without taking order into account. Thus TFIDF belongs to BOW methods: TFIDF is a weighting scheme applied to words considered as a set. There can be many other options for weighting the words in a set.

Compared to regular TF-weighted BOW, the TFIDF weighting scheme gives more weight to words which appear in fewer documents and less weight to words which appear in many documents. The rationale is that a word which appears in many documents is unlikely to be relevant, since it doesn't help in selecting the most similar document. Typically the most frequent words are grammatical words (also called stop words, e.g. determiners, pronouns, etc.), but in a corpus made of sci-fi books, for example, some words such as "robot" or "planet" will also be very common. By contrast, a word like "elephant" would be very rare in a sci-fi context, so it is given more weight because it is more discriminative.
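To make the difference between the two weightings concrete, here is a minimal sketch in plain Python. The three-document corpus is made up for illustration, and idf = log(N / df) is only one common variant of the formula; libraries typically add smoothing and normalization on top.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration: three short "sci-fi" documents.
docs = [
    "the robot landed on the planet",
    "the robot saw an elephant on the planet",
    "the planet had a robot",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word.
df = Counter(word for tokens in tokenized for word in set(tokens))

def tf(tokens):
    """Raw term counts, i.e. the plain TF-weighted bag of words."""
    return Counter(tokens)

def tfidf(tokens):
    """Term count times idf = log(N / df); an unsmoothed variant (assumption)."""
    counts = Counter(tokens)
    return {w: c * math.log(n_docs / df[w]) for w, c in counts.items()}

tf_doc = tf(tokenized[1])
tfidf_doc = tfidf(tokenized[1])
for word in ("the", "robot", "elephant"):
    print(word, tf_doc[word], round(tfidf_doc[word], 3))
```

In this toy corpus "the" and "robot" occur in every document, so their idf is log(1) = 0 and they drop out entirely under TFIDF, while "elephant" occurs in only one document and keeps a positive weight. Under plain TF, "the" has the highest count in that document.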

This is meaningful in Information Retrieval tasks where the goal is to find a document similar to a query, and by extension it's useful in most tasks where the goal is to compare text documents by their semantic similarity. It is not meaningful and often counter-productive in classification tasks related to the style of text, as opposed to its semantic content.

Note that Okapi BM25 is a similar weighting scheme which is less well known than TFIDF but has been shown to work better in many applications.
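For reference, a rough sketch of BM25 scoring in plain Python follows. The function name `bm25_scores`, the toy documents, and the defaults k1=1.5 and b=0.75 are assumptions chosen for illustration, and the idf used here is the non-negative variant with "+ 1" inside the logarithm; other implementations differ in these details.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against a tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each word.
    df = Counter(w for d in docs for w in set(d))
    scores = []
    for d in docs:
        counts = Counter(d)
        score = 0.0
        for term in query:
            if term not in counts:
                continue
            # Rare terms get a higher idf; the "+ 1" keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            f = counts[term]
            # k1 makes term frequency saturate; b normalizes by document length.
            norm = f + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * f * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy documents, invented for illustration.
docs = [s.split() for s in [
    "the robot landed on the planet",
    "the robot saw an elephant on the planet",
    "the planet had a robot",
]]
scores = bm25_scores(["elephant"], docs)
print(scores)
```

Compared with TFIDF, the main differences are that repeated occurrences of a term contribute less and less, and that long documents are penalized relative to the average document length.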


It depends on the problem you are trying to solve. If you already know the signal in the dataset, i.e. the words which drive the decision, then go with Bag of Words. This is useful when you are doing something like text classification.

On the other hand, TF-IDF is useful when you don't know the signal in the dataset. It is a good option if you want to measure text similarity.
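As a small illustration of the text similarity case, the sketch below compares cosine similarity under raw counts and under TF-IDF weights. The five documents are made up, the helper names are hypothetical, and idf = log(N / df) is an unsmoothed variant, so treat the numbers as indicative only.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = [s.split() for s in [
    "the elephant on the planet",
    "the robot on the planet",
    "an elephant",
    "the robot on the ship",
    "the planet and the robot",
]]
n_docs = len(docs)
df = Counter(w for d in docs for w in set(d))

def count_vector(tokens):
    """Plain bag of words: raw term counts."""
    return dict(Counter(tokens))

def tfidf_vector(tokens):
    """Term counts scaled by idf = log(N / df)."""
    return {w: c * math.log(n_docs / df[w]) for w, c in Counter(tokens).items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

for name, vectorize in (("counts", count_vector), ("tfidf", tfidf_vector)):
    vecs = [vectorize(d) for d in docs]
    # Document 0 vs. document 1 (shares common words) and vs. document 2 (shares "elephant").
    print(name, round(cosine(vecs[0], vecs[1]), 3), round(cosine(vecs[0], vecs[2]), 3))
```

With raw counts, the first two documents look very similar mainly because they share "the", "on" and "planet". With TF-IDF weights that similarity drops, while the similarity to the short document that shares only the rarer word "elephant" goes up.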
