What is the shape of the vector after it passes through the TfidfVecorizer fit_transform() method?

I am trying to understand what happens inside the IDF part of the TFIDF vectorizer.

The official scikit-learn page says that the shape is (4,9) for a corpus of 4 documents having 9 unique features.

I get the Term Frequency (TF) part, it makes sense to me that ( for every unique feature(9), for each document(4) we calculate each term's frequency, so we get a matrix of shape (4,9)

But what does not make sense to me is the IDF part the formula for IDF is:

$$\text{idf}(t,D) = \text{log} \ {{N} \over {| \{ d \in D:t \in d \} | }}$$

with

  • $N$: total number of documents in the corpus $N = |D|$

  • $| \{ d \in D:t \in d \} |$ : number of documents where the term $t$ appears (i.e., $\text{tf}(t,d) \neq 0$). If the term is not in the corpus, this will lead to a division-by-zero. It is therefore common to adjust the denominator to $1 + | \{ d \in D:t \in d \} |$.

So applying this formula, for every feature (9) we calculate the log(total number of documents / number of documents having the term or feature in it) I think This will result in a shape of (1,9), please correct my understanding here.

Topic scikit-learn nlp

Category Data Science


The inverse document frequency transformation of TFIDF does not affect the shape of your vector. The shape is only altered during the tokenization phase which, in your case, results in a shape of (4, 9). The inverse document frequency portion only scales your existing features in accordance with how frequently these tokens arise elsewhere in your corpus. This can be thought as a one-to-one mapping of your original term-frequency vector (the tokenized corpus) to the tfidf vector (scaled using information on how frequent your vectors tokens exist elsewhere in your corpus).

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.