What is the shape of the vector after it passes through the TfidfVecorizer fit_transform() method?
I am trying to understand what happens inside the IDF part of the TFIDF vectorizer.
The official scikit-learn page says that the shape is (4,9)
for a corpus of 4 documents having 9 unique features.
I get the Term Frequency (TF) part, it makes sense to me that ( for every unique feature(9), for each document(4) we calculate each term's frequency, so we get a matrix of shape (4,9)
But what does not make sense to me is the IDF part the formula for IDF is:
$$\text{idf}(t,D) = \text{log} \ {{N} \over {| \{ d \in D:t \in d \} | }}$$
with
$N$: total number of documents in the corpus $N = |D|$
$| \{ d \in D:t \in d \} |$ : number of documents where the term $t$ appears (i.e., $\text{tf}(t,d) \neq 0$). If the term is not in the corpus, this will lead to a division-by-zero. It is therefore common to adjust the denominator to $1 + | \{ d \in D:t \in d \} |$.
So applying this formula, for every feature (9) we calculate the log(total number of documents / number of documents having the term or feature in it) I think This will result in a shape of (1,9), please correct my understanding here.
Topic scikit-learn nlp
Category Data Science