How to justify logarithmically scaled frequency for tf in tf-idf?

I am studying tf-idf (term frequency - inverse document frequency). The original logic for tf is straightforward: the count of term t in a document divided by the total number of terms in that document.

However, I came across the log-scaled frequency: log(1 + count of term t in the document). Please refer to Wikipedia.

It does not involve the total number of terms in the document. For example, say document 1 has 10 words in total and one of them is "happy". Using the original definition, tf(happy) = 1/10 = 0.1. Document 2 also contains "happy" once, but it has 1,000 words in total, so tf(happy) = 1/1000 = 0.001. You can see that tf(happy) for document 1 is very different from that for document 2.

However, if we use the log-scaled frequency, both are log(1 + 1), regardless of the lengths of the documents (one has only 10 words, while the other has 1,000).
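To make the comparison concrete, here is a minimal Python sketch of the two tf variants, using the hypothetical counts from the example above (the document names and lengths are just those illustrative numbers, not real data):

```python
import math

# Hypothetical documents from the example above: each contains "happy" once;
# doc 1 has 10 words in total, doc 2 has 1,000 words in total.
count_happy = 1
doc_lengths = {"doc1": 10, "doc2": 1000}

for name, n_words in doc_lengths.items():
    tf_relative = count_happy / n_words   # count / total number of terms
    tf_log = math.log(1 + count_happy)    # log-scaled raw count
    print(name, round(tf_relative, 4), round(tf_log, 4))

# doc1 0.1    0.6931
# doc2 0.001  0.6931   -> identical under the log-scaled variant
```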

How can such logic be justified? Thanks.

Topic logarithmic tfidf nlp

Category Data Science


The logic is that you are looking only at the tf part. The tf component may or may not be normalised along the single-document dimension, and in the logarithmic case it is not; in principle you could even use a boolean scale, taking $1$ if the word appears in the document and $0$ otherwise. What you are missing is the idf part, which is precisely the component that weights the importance of the word across the corpus (it is based on how many documents in the whole corpus contain the term).
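To illustrate, here is a rough Python sketch of the full tf-idf product on a tiny made-up corpus (the corpus, the terms, and the unsmoothed $\log(N/\mathrm{df})$ idf form are illustrative assumptions; real implementations usually add smoothing). The corpus-level weighting you expect comes from the idf factor, not from the tf variant:

```python
import math
from collections import Counter

# A tiny toy corpus (hypothetical) to show where the corpus-level
# weighting comes from: the idf factor, not the tf variant.
corpus = [
    "happy day happy time".split(),
    "sad day".split(),
    "day after day".split(),
]
n_docs = len(corpus)

# Document frequency: in how many documents does each term appear?
df = Counter()
for doc in corpus:
    df.update(set(doc))

def tfidf(term, doc):
    tf_log = math.log(1 + doc.count(term))  # log-scaled tf, no length normalisation
    idf = math.log(n_docs / df[term])       # idf: rarer across the corpus -> larger
    return tf_log * idf

print(tfidf("happy", corpus[0]))  # "happy" is in only 1 of 3 docs -> weight ~1.21
print(tfidf("day", corpus[0]))    # "day" is in every doc -> idf 0 -> weight 0.0
```

For comparison, scikit-learn's TfidfVectorizer offers a similar log-scaled tf through its sublinear_tf option (which replaces tf with $1 + \log(\mathrm{tf})$), while the idf term handles the corpus side.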
