HashingVectorizer
HashingVectorizer is a vectorizer that uses the hashing trick to map token strings to feature (column) indices. It converts a collection of text documents into a sparse matrix that holds the token occurrence counts.
HashingVectorizer and CountVectorizer are meant to do the same thing: convert a collection of text documents to a matrix of token occurrences. The difference is that HashingVectorizer does not store the resulting vocabulary (i.e. the unique tokens).
With HashingVectorizer, each token maps directly to a column position in a matrix whose size is pre-defined. For example, if the matrix has 10,000 columns, each token maps to one of those 10,000 columns. This mapping happens via hashing; the hash function used is MurmurHash3.
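As a rough sketch of how this looks in scikit-learn (the corpus and the n_features value below are only illustrative), HashingVectorizer can transform documents without ever learning or storing a vocabulary:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Illustrative toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# n_features fixes the number of columns up front; each token is hashed
# (MurmurHash3) to one of these columns, so no vocabulary is stored.
vectorizer = HashingVectorizer(n_features=2**10, norm=None, alternate_sign=False)
X = vectorizer.transform(docs)

print(X.shape)   # (2, 1024): one row per document, 1024 hashed columns
print(type(X))   # a scipy sparse matrix holding token occurrence counts
```

Because the mapping is a fixed hash function rather than a learned dictionary, there is no fit step and no way to recover which token a given column corresponds to.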
TfidfVectorizer
TF-IDF is the result of research by two people: Hans Peter Luhn, credited for his work on term frequency (1957), and Karen Spärck Jones, who contributed inverse document frequency (1972).
TF-IDF stands for Term Frequency-Inverse Document Frequency, and the tf-idf weight is often used in information retrieval and text mining. It is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the TF-IDF weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.
TF-IDF is a popular approach to weighting terms for NLP tasks because it assigns a value to a term according to its importance in a document, scaled by its importance across all documents in the corpus. This down-weights words that occur frequently throughout the corpus (such as common English function words) and emphasizes words that are more descriptive of a particular document.
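As a quick, non-authoritative example of how this is typically used with scikit-learn's TfidfVectorizer (the corpus below is made up; note that scikit-learn uses a smoothed IDF variant and L2-normalizes rows by default, so its numbers differ slightly from the textbook formulas in the next section):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative corpus: "the" appears in every document, so its IDF (and
# therefore its tf-idf weight) is low; rarer words get higher weights.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird flew over the house",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # sparse matrix of tf-idf weights

# Inspect the learned vocabulary and the weights for the first document.
for term, col in sorted(vectorizer.vocabulary_.items()):
    print(term, round(X[0, col], 3))
```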
Computing TF-IDF
Typically, the TF-IDF weight is composed of two terms:
TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:
\begin{align*}
TF(t) = \frac{\text{Number of times term t appears in a document}}{\text{Total number of terms in the document}}
\end{align*}
IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times yet carry little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones by computing the following:
\begin{align*}
IDF(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents with term t in it}} \right)
\end{align*}
\begin{align*}
\text{TF-IDF}(t) = TF(t) \times IDF(t)
\end{align*}
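To make the formulas concrete, here is a small worked example with hypothetical counts (all numbers below are made up for illustration; the base of the logarithm is a free choice and only rescales the weights):

```python
import math

# Hypothetical counts: a 100-word document in which the term "cat" appears
# 3 times, drawn from a corpus of 10,000,000 documents, 1,000 of which
# contain "cat".
tf = 3 / 100                            # TF(t)  = 0.03
idf = math.log10(10_000_000 / 1_000)    # IDF(t) = 4.0 (base-10 log)
tf_idf = tf * idf                       # TF-IDF(t) = 0.12

print(f"TF={tf}, IDF={idf}, TF-IDF={tf_idf:.2f}")
```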