How to justify logarithmically scaled frequency for tf in tf-idf?

I am studying tf-idf (term frequency - inverse document frequency). The original logic for tf is straightforward: the count of term t in a document divided by the total number of terms in that document.

However, I came across the log-scaled frequency: log(1 + count of term t in the document). Please refer to Wikipedia.

It does not involve the total number of terms in the document. For example, say document 1 has 10 words in total and one of them is "happy". Using the original definition, tf(happy) = 1/10 = 0.1. Document 2 also contains "happy" once, but it has 1,000 words in total, so tf(happy) = 1/1000 = 0.001. You can see that tf(happy) for document 1 is very different from that for document 2.

However, if we use the log-scaled frequency, both are log(1 + 1), regardless of the lengths of the documents (one has only 10 words, while the other has 1,000).
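To make the comparison concrete, here is a minimal Python sketch of the two tf variants, using the hypothetical counts from the example above (the document names and lengths are just those illustrative numbers, not real data):

```python
import math

# Hypothetical documents from the example above: each contains "happy" once;
# doc 1 has 10 words in total, doc 2 has 1,000 words in total.
count_happy = 1
doc_lengths = {"doc1": 10, "doc2": 1000}

for name, n_words in doc_lengths.items():
    tf_relative = count_happy / n_words   # count / total number of terms
    tf_log = math.log(1 + count_happy)    # log-scaled raw count
    print(name, round(tf_relative, 4), round(tf_log, 4))

# doc1 0.1    0.6931
# doc2 0.001  0.6931   -> identical under the log-scaled variant
```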

How can such logic be justified? Thanks.

Topic logarithmic tfidf nlp

Category Data Science


The logic is that you are looking only at the tf part. The tf component may or may not be normalised along the single-document dimension, and in the logarithmic case it is not; in principle you could even use a boolean scale, taking $1$ if the word appears in the document and $0$ otherwise. What you are missing is the idf part, which is precisely the component that weights the importance of the word across the corpus (it is based on how many documents in the whole corpus contain the term).
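To illustrate, here is a rough Python sketch of the full tf-idf product on a tiny made-up corpus (the corpus, the terms, and the unsmoothed $\log(N/\mathrm{df})$ idf form are illustrative assumptions; real implementations usually add smoothing). The corpus-level weighting you expect comes from the idf factor, not from the tf variant:

```python
import math
from collections import Counter

# A tiny toy corpus (hypothetical) to show where the corpus-level
# weighting comes from: the idf factor, not the tf variant.
corpus = [
    "happy day happy time".split(),
    "sad day".split(),
    "day after day".split(),
]
n_docs = len(corpus)

# Document frequency: in how many documents does each term appear?
df = Counter()
for doc in corpus:
    df.update(set(doc))

def tfidf(term, doc):
    tf_log = math.log(1 + doc.count(term))  # log-scaled tf, no length normalisation
    idf = math.log(n_docs / df[term])       # idf: rarer across the corpus -> larger
    return tf_log * idf

print(tfidf("happy", corpus[0]))  # "happy" is in only 1 of 3 docs -> weight ~1.21
print(tfidf("day", corpus[0]))    # "day" is in every doc -> idf 0 -> weight 0.0
```

For comparison, scikit-learn's TfidfVectorizer offers a similar log-scaled tf through its sublinear_tf option (which replaces tf with $1 + \log(\mathrm{tf})$), while the idf term handles the corpus side.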
