How to decide to go with BOW or TFIDF

Question

How to decide to go with BOW or TFIDF

asmgx

April 21, 2021 13:58

I know that there are methods that help with feature selection, such as Mutual Information, Information Gain, etc.

But for datasets with thousands of records and thousands of features, it is time-consuming to train the model with both BOW and TFIDF just to decide which method is better.

Is there a way to decide which method to choose without spending all this time?

Topic bag-of-words tfidf nlp

Category Data Science

Erwan · Accepted Answer · April 21, 2021 13:58

Technically BOW includes all the methods where words are considered as a set, i.e. without taking order into account. Thus TFIDF belongs to BOW methods: TFIDF is a weighting scheme applied to words considered as a set. There can be many other options for weighting the words in a set.
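As a small illustration of "words considered as a set, without taking order into account", here is a minimal sketch. The helper name `bag_of_words` and the two sentences are made up for this example, and tokenization is just lowercase whitespace splitting.

```python
from collections import Counter

def bag_of_words(text):
    """Count each lowercase whitespace-separated token, ignoring order."""
    return Counter(text.lower().split())

# Two sentences with different meanings but the same words.
a = bag_of_words("the robot chased the elephant")
b = bag_of_words("the elephant chased the robot")

# Word order is lost, so both sentences get the same representation.
same_bag = a == b
print(same_bag)
```

Any weighting scheme applied on top of these counts, TFIDF included, inherits this order-insensitivity.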

Compared to regular TF-weighted BOW, the TFIDF weighting scheme gives more weight to words which appear in fewer documents and less weight to words which appear in many documents. The rationale is that a word which appears in many documents is unlikely to be relevant, since it doesn't help in selecting the most similar document. Typically the most frequent words are grammatical words (also called stop words, e.g. determiners, pronouns, etc.), but in a corpus made of sci-fi books, for example, some words such as "robot" or "planet" will also be very common. In contrast, a word like "elephant" would be very rare in a sci-fi context, so it is given more weight because it is more discriminative.
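To make the difference concrete, here is a minimal sketch on a made-up three-document "sci-fi" corpus. It assumes the textbook definition idf = log(N / df); the function names `tf_weights` and `tfidf_weights` are hypothetical. Libraries such as scikit-learn use smoothed variants of idf by default, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration).
docs = [
    "the robot landed on the planet",
    "the robot left the planet",
    "the elephant met the robot",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears.
df = Counter(word for doc in tokenized for word in set(doc))

def tf_weights(doc):
    """Plain TF-weighted BOW: raw count of each word in the document."""
    return dict(Counter(doc))

def tfidf_weights(doc):
    """TF-IDF with the textbook idf = log(N / df) (an assumption of this sketch)."""
    return {w: c * math.log(n_docs / df[w]) for w, c in Counter(doc).items()}

tf_last = tf_weights(tokenized[2])
tfidf_last = tfidf_weights(tokenized[2])

for word in sorted(tf_last):
    print(f"{word:10s} tf={tf_last[word]}  tfidf={tfidf_last[word]:.3f}")
```

With this definition, words that occur in every document ("the", "robot") get a TFIDF weight of zero, while "elephant", which occurs in only one document, keeps a positive weight even though its raw count is lower than that of "the".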

This is meaningful in Information Retrieval tasks where the goal is to find a document similar to a query, and by extension it's useful in most tasks where the goal is to compare text documents by their semantic similarity. It is not meaningful and often counter-productive in classification tasks related to the style of text, as opposed to its semantic content.

Note that Okapi BM25 is a similar weighting scheme; it is less well known than TFIDF but has been shown to work better in many applications.
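For reference, a minimal sketch of BM25 scoring follows. It assumes one common formulation (the idf variant with "+ 1" inside the logarithm) and commonly used parameter values k1 = 1.5 and b = 0.75. The function name `bm25_scores` and the toy corpus are invented for this example, and this is not a substitute for a tuned library implementation.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs  # average document length
    df = Counter(w for d in tokenized for w in set(d))  # document frequency

    scores = []
    for doc in tokenized:
        counts = Counter(doc)
        score = 0.0
        for term in query.split():
            f = counts[term]
            if f == 0:
                continue  # term absent from this document contributes nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = f + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * f * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "the robot landed on the planet",
    "the robot left the planet",
    "the elephant met the robot",
]
scores = bm25_scores("elephant planet", docs)
print(scores)
```

As with TFIDF, the rarer query term ("elephant") contributes more than the more common one ("planet"); in addition, BM25 penalizes longer documents and caps the effect of repeated terms.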

Abhishek Verma · Accepted Answer · April 21, 2021 08:02

It depends on the problem you are trying to solve. If you already know the signal in the dataset, i.e. the words that drive the decision, then go with Bag of Words. This is useful for tasks such as text classification.

On the other hand, TF-IDF is useful when you don't know the signal in the dataset. If you want to measure text similarity, it is a good option.
