How to use HashingVectorizer with the multinomial naive Bayes algorithm

I used TfidfVectorizer and passed its output to MultinomialNB for document classification, and it worked fine.

But now I need to process a huge set of documents, for example more than 1 lakh (100,000), and when I pass their content to TfidfVectorizer my local computer hangs. It seems to be a performance issue, so I was advised to use HashingVectorizer instead.

I used the code below for classification (just replacing TfidfVectorizer with HashingVectorizer):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

stop_words = open("english_stopwords").read().split("\n")
vect = HashingVectorizer(stop_words=stop_words, ngram_range=(1,5))
X_train_dtm = vect.fit_transform(training_content_list)
X_predict_dtm = vect.transform(predict_content_list)
nb = MultinomialNB()
nb.fit(X_train_dtm, training_label_list)  # raises the ValueError shown below
predicted_label_list = nb.predict(X_predict_dtm)

Got error:

File "/home/rajesh/www/rajesh/docuchief2/project/web/env/lib/python3.6/site-packages/sklearn/naive_bayes.py", line 720, in _count
    raise ValueError("Input X must be non-negative")
ValueError: Input X must be non-negative

As I understand it, TfidfVectorizer computes its values from word occurrences, so they are never negative and it works. HashingVectorizer uses different logic, and I cannot figure out how to make its output work with MultinomialNB.

Can someone please help me solve this performance issue? Can I use TfidfVectorizer for a huge training dataset, and if so, how? If not, how can I use HashingVectorizer here?

Topic naive-bayes-algorithim hashingvectorizer tfidf machine-learning

Category Data Science


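In newer versions of sklearn (0.19 and later), use HashingVectorizer(alternate_sign=False). By default the vectorizer alternates the sign of the hashed values, which is what produces the negative entries that MultinomialNB rejects.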
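As a minimal sketch of that suggestion, the snippet below reruns the question's pipeline with alternate_sign=False. The toy documents and labels are made up for illustration and stand in for training_content_list, training_label_list and predict_content_list; the ngram_range is also reduced to keep the example small.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data (assumption): stands in for the real document and label lists.
training_content_list = [
    "cheap pills buy now",
    "limited offer buy cheap",
    "meeting agenda for monday",
    "please review the project agenda",
]
training_label_list = ["spam", "spam", "ham", "ham"]
predict_content_list = ["buy cheap pills", "monday project meeting"]

# alternate_sign=False keeps every hashed value non-negative,
# which is what MultinomialNB requires.
vect = HashingVectorizer(alternate_sign=False, ngram_range=(1, 2))

# HashingVectorizer is stateless, so transform() is enough (no fit needed).
X_train_dtm = vect.transform(training_content_list)
X_predict_dtm = vect.transform(predict_content_list)

nb = MultinomialNB()
nb.fit(X_train_dtm, training_label_list)
predicted_label_list = nb.predict(X_predict_dtm)
print(predicted_label_list)
```

Since the vectorizer stores no vocabulary, memory use no longer grows with the number of distinct n-grams, which is probably the main cost of TfidfVectorizer with ngram_range=(1,5).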


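The non_negative argument does not exist in some versions of sklearn. Try using decode_error='ignore', which skips bytes that cannot be decoded. If you're working with a large dataset, you can also reduce hash collisions by increasing the number of features (the default is 2**20). Note that neither setting changes the sign of the hashed values, so the "must be non-negative" error can still appear while the hash is signed: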

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None)
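To check where negative values come from, a quick sketch like the following compares the smallest entry of the hashed matrix with and without alternate_sign=False at the same n_features. The sample sentences are made up for illustration. With the default settings the minimum is typically negative even when collisions are rare, because the sign comes from the hash itself.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Illustrative documents (assumption), not taken from the question.
docs = [
    "the quick brown fox",
    "jumps over the lazy dog",
    "hashing trick with signed values",
]

# Same feature space size, differing only in how the sign is handled.
signed = HashingVectorizer(n_features=2**21).transform(docs)
unsigned = HashingVectorizer(n_features=2**21,
                             alternate_sign=False).transform(docs)

print("min with default settings:", signed.min())
print("min with alternate_sign=False:", unsigned.min())
```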

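You need to ensure that the hashing vectorizer doesn't produce negative values. In older versions of sklearn, the way to do this is HashingVectorizer(non_negative=True); that argument was later removed in favor of alternate_sign=False.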
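Once the features are non-negative, the stateless vectorizer also makes it possible to train in batches instead of holding all documents in memory at once, which may help with the original performance problem. The sketch below assumes a recent sklearn (hence alternate_sign=False rather than non_negative=True); iter_batches is a hypothetical helper and the documents are toy data.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB


def iter_batches(docs, labels, batch_size):
    """Yield consecutive (docs, labels) slices. Hypothetical helper."""
    for start in range(0, len(docs), batch_size):
        end = start + batch_size
        yield docs[start:end], labels[start:end]


# Toy data (assumption): in practice these would be read from disk lazily.
docs = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "limited offer buy cheap",
    "please review the project agenda",
]
labels = ["spam", "ham", "spam", "ham"]

# partial_fit needs the full set of classes on its first call.
classes = np.unique(labels)

vect = HashingVectorizer(alternate_sign=False)
nb = MultinomialNB()
for batch_docs, batch_labels in iter_batches(docs, labels, batch_size=2):
    X_batch = vect.transform(batch_docs)  # no fit step, no stored vocabulary
    nb.partial_fit(X_batch, batch_labels, classes=classes)

print(nb.predict(vect.transform(["buy cheap pills"])))
```

Each batch is hashed into the same fixed feature space, so the model sees consistent columns without ever building a vocabulary over the full corpus.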
