What is the difference between a hashing vectorizer and a tfidf vectorizer

I'm converting a corpus of text documents into word vectors for each document. I've tried this using a TfidfVectorizer and a HashingVectorizer

I understand that a HashingVectorizer does not take into consideration the IDF scores like a TfidfVectorizer does. The reason I'm still working with a HashingVectorizer is the flexibility it gives while dealing with huge datasets, as explained here and here. (My original dataset has 30 million documents)

Currently, I am working with a sample of 45339 documents, so, I have the ability to work with a TfidfVectorizer also. When I use these two vectorizers on the same 45339 documents, the matrices that I get are different.

hashing = HashingVectorizer()
with LSM('corpus.db') as corpus:
    hashing_matrix = hashing.fit_transform(corpus)
print(hashing_matrix.shape) 

hashing matrix shape (45339, 1048576)

tfidf = TfidfVectorizer()
with LSM('corpus.db') as corpus:
    tfidf_matrix = tfidf.fit_transform(corpus)
print(tfidf_matrix.shape) 

tfidf matrix shape (45339, 663307)

I want to understand better the differences between a HashingVectorizer and a TfidfVectorizer, and the reason why these matrices are in different sizes - particularly in the number of words/terms.



Hashing vectorizer


Hashing vectorizer is a vectorizer that uses the hashing trick to map token strings to feature (integer) indices. It converts a collection of text documents into a sparse matrix that holds the token occurrence counts.

HashingVectorizer and CountVectorizer are meant to do the same thing: convert a collection of text documents to a matrix of token occurrences. The difference is that HashingVectorizer does not store the resulting vocabulary (i.e. the unique tokens).

With HashingVectorizer, each token maps directly to a column position in a matrix whose size is pre-defined. For example, if your matrix has 10,000 columns, each token maps to one of those 10,000 columns. This mapping happens via hashing; the hash function used is MurmurHash3.
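The contrast with CountVectorizer is easy to see in a few lines. This is a minimal sketch on a toy corpus of my own (the names docs, count_vec and hash_vec are just illustrative):

from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]

# CountVectorizer builds and stores an explicit vocabulary_
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)
print(counts.shape)            # (2, 9): one column per unique token
print(count_vec.vocabulary_)   # e.g. {'the': 8, 'cat': 1, ...}

# HashingVectorizer maps each token straight to one of n_features columns
# via MurmurHash3; there is no vocabulary_ to build, store, or inspect
hash_vec = HashingVectorizer(n_features=2**10)
hashed = hash_vec.transform(docs)  # no fitting needed: the mapping is stateless
print(hashed.shape)                # (2, 1024)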

TfidfVectorizer


TF-IDF is the result of research by two people: Hans Peter Luhn, credited for his work on term frequency (1957), and Karen Spärck Jones, who contributed the inverse document frequency (1972).

TF-IDF stands for Term Frequency-Inverse Document Frequency, a weighting scheme often used in information retrieval and text mining. It is a statistical measure of how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document, but is offset by how frequently the word occurs in the corpus. Variations of the TF-IDF weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

TF-IDF is a popular approach for weighting terms in NLP tasks because it assigns a value to a term according to its importance in a document, scaled by its rarity across all documents in your corpus. In effect it down-weights very common words (such as English function words) and favours words that are more descriptive of your text.

Computing TF-IDF

Typically, the TF-IDF is composed by two terms:

TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term is likely to appear many more times in a long document than in a short one. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:

\begin{align*} TF(t) = \frac{\text{Number of times term t appears in a document}}{\text{Total number of terms in the document}} \end{align*}

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as "is", "of", and "that", may appear a lot of times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

\begin{align*} IDF(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents with term t in it}} \right) \end{align*}

\begin{align*} \text{TF-IDF}(t) = TF(t) \times IDF(t) \end{align*}
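As a sanity check, the two formulas above can be coded directly. This is a small worked sketch on a made-up three-document corpus (note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2 normalization by default, so its numbers will differ slightly from this textbook version):

import math

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]

def tf(term, doc):
    # Term frequency: count of the term divided by the document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log(N / number of docs containing the term)
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "the" appears in most documents, so its IDF (and hence its weight) is low;
# "mat" is specific to the first document, so it gets a higher weight
print(tf_idf("the", docs[0], docs))   # ~0.14
print(tf_idf("mat", docs[0], docs))   # ~0.18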


The main difference is that HashingVectorizer applies a hashing function to term frequency counts in each document, whereas TfidfVectorizer scales those term frequency counts by penalising terms that appear more widely across the corpus. There’s a great summary here.

  • Hash functions are an efficient way of mapping terms to features; they don’t necessarily need to be applied only to term frequencies, but that’s how HashingVectorizer is employed here. Given your 45339 documents, I suspect the feature vectors have length 1048576 because that is the default n_features of 2^20; you could reduce this and make the matrix cheaper to process, but with an increased risk of collision, where the function maps different terms to the same feature.

  • Depending on the use case for the word vectors, it may be possible to reduce the length of the hash feature vector (and thus the complexity) significantly, with an acceptable loss of accuracy/effectiveness due to the increased collisions; a small sketch after this list illustrates the trade-off. Scikit-learn has some hashing parameters that can assist, for example alternate_sign.

  • If the hashing matrix is wider than the dictionary, many of the column entries in the hashing matrix will be empty, and not just because a given document doesn't contain a specific term but because they're empty across the whole matrix. If it is narrower, the function may send multiple terms to the same feature slot; this is the 'collision' we've been talking about. HashingVectorizer has a setting that works to mitigate this, alternate_sign, which is on by default and described here.

  • ‘Term frequency - inverse document frequency’ takes term frequencies in each document and weights them by penalising words that appear more frequently across the whole corpus. The intuition is that terms found situationally are more likely to be representative of a specific document’s topic. This differs from a hashing function in that a full dictionary of the words in the corpus is needed in order to calculate the inverse document frequency. I expect your tf.idf matrix dimensions are 45339 documents by 663307 distinct words in the corpus; Manning et al provide more detail and examples of the calculation.
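To make the collision point concrete, here is a small sketch (my own toy example) that squeezes a handful of terms into a 4-column hash space and then into the default-sized one:

from sklearn.feature_extraction.text import HashingVectorizer

docs = ["red green blue", "yellow purple orange cyan magenta"]

# A deliberately tiny hash space forces distinct terms to share columns
# (collisions). alternate_sign=True (the default) lets colliding counts
# partially cancel instead of simply piling up.
tiny = HashingVectorizer(n_features=4, alternate_sign=True, norm=None)
print(tiny.transform(docs).toarray())

# A roomy hash space makes collisions very unlikely, so each of the
# 8 distinct terms (almost certainly) ends up in its own column.
roomy = HashingVectorizer(n_features=2**20, norm=None)
print(roomy.transform(docs).nnz)   # expected: 8 non-zero entries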

‘Mining of Massive Datasets’ by Leskovec et al has a ton of detail on both feature hashing and tf.idf, the authors made the pdf available here.


HashingVectorizer and CountVectorizer (note: not TfidfVectorizer) are meant to do the same thing: convert a collection of text documents to a matrix of token occurrences.

If you are looking to get term frequencies weighted by their relative importance (IDF), then TfidfVectorizer is what you should use. If you need the raw counts or normalized counts (term frequency), then you should use CountVectorizer or HashingVectorizer.

To learn about HashingVectorizer, see this article on HashingVectorizer vs. CountVectorizer.

For more information about Tfidfvectorizer, see this article on How to Use Tfidftransformer and Tfidfvectorizer.
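A quick way to see that choice in practice is a rough sketch on an invented three-sentence corpus:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog barked", "the cat and the dog played"]

# CountVectorizer (or HashingVectorizer): raw occurrence counts,
# every term contributes equally
counts = CountVectorizer().fit_transform(docs)
print(counts.toarray())

# TfidfVectorizer: the same counts reweighted so that terms present in
# many documents (like "the") contribute less than document-specific ones
weights = TfidfVectorizer().fit_transform(docs)
print(weights.toarray().round(2))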


The HashingVectorizer has a parameter n_features, which is 1048576 by default. When hashing, it doesn't actually compute a dictionary mapping terms to a unique index for each one. Instead, each term is hashed into a table large enough that you don't expect too many collisions: hash(term) mod table_size. You can make the returned matrix any size you want by setting n_features. You should adjust this to be in the right ballpark for your corpus if you don't think the default is reasonable (making it larger will cause fewer collisions, though it takes more memory).

from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer()
print(vectorizer.transform(['a very small document']).shape)
(1, 1048576)

small_vectorizer = HashingVectorizer(n_features=5)
print(small_vectorizer.transform(['a very small document']).shape)    
(1, 5)
