Is it good practice to remove the numeric values from the text data during preprocessing?

I'm doing preprocessing on a text dataset. It contains several kinds of numeric values, such as dates (1st July), years (2019), tentative values (3-5 years, 10+ advantages), unique identifiers (room no 31, user rank 45), and percentages (100%). Is it recommended to discard these numerics before creating a vectorizer (BoW/TF-IDF) for any model development (classification/regression)? Any quick help on this is much appreciated. Thank you
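If one does decide to drop purely numeric tokens, a minimal sketch of that preprocessing step, using only Python's `re` module (the `strip_numerics` helper and its regex are illustrative, not a standard recipe):

```python
import re

def strip_numerics(text: str) -> str:
    # Remove purely numeric tokens (years, counts, ranks, ranges like 3-5)
    # and percentages; surrounding words are left untouched.
    text = re.sub(r"\b\d+(?:[.,-]\d+)*\b%?", " ", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

print(strip_numerics("user rank 45 grew 100% since 2019"))
# Cleaned text can then be fed to a BoW/TF-IDF vectorizer as usual.
```

Whether this helps is task-dependent: for topic classification the numbers are often noise, but for tasks where quantities carry signal (e.g. regression on review text), discarding them can hurt.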
Category: Data Science

What is the difference between a hashing vectorizer and a TF-IDF vectorizer?

I'm converting a corpus of text documents into word vectors, one per document. I've tried this using both a TfidfVectorizer and a HashingVectorizer. I understand that a HashingVectorizer does not take IDF scores into consideration the way a TfidfVectorizer does. The reason I'm still working with a HashingVectorizer is the flexibility it gives when dealing with huge datasets, as explained here and here. (My original dataset has 30 million documents.) Currently, I am working with a sample of 45,339 documents, so, …
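The core difference can be illustrated without sklearn at all: a hashing vectorizer is stateless, mapping each token to a fixed column via a hash function, so it never builds a vocabulary and never computes IDF weights. A toy sketch of this hashing trick (the small bucket count and the `hash_vectorize` helper are illustrative, not sklearn's actual implementation):

```python
import hashlib

N_FEATURES = 16  # tiny for illustration; sklearn defaults to 2**20

def hash_vectorize(doc: str) -> list:
    # Stateless "hashing trick": each token is mapped to a fixed bucket
    # by a hash function, so no vocabulary needs to be stored or shared.
    vec = [0] * N_FEATURES
    for tok in doc.lower().split():
        idx = int(hashlib.md5(tok.encode()).hexdigest(), 16) % N_FEATURES
        vec[idx] += 1  # raw count; no IDF reweighting is possible here
    return vec

# Two independent processes agree on indices without sharing any state:
assert hash_vectorize("the cat sat") == hash_vectorize("the cat sat")
```

This statelessness is exactly what makes it scale to tens of millions of documents; the trade-offs are the lack of IDF weighting and the possibility of hash collisions.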
Category: Data Science

How to implement HashingVectorizer in the multinomial Naive Bayes algorithm

I had used TfidfVectorizer and passed its output to MultinomialNB for document classification, and it was working fine. But now I need to process a huge set of documents, over 1 lakh (100,000), and when I try to pass these document contents to TfidfVectorizer, my local computer hangs. It seems to be a performance issue, so I got a suggestion to use HashingVectorizer. I used the code below for classification (just replacing TfidfVectorizer with HashingVectorizer): stop_words = open("english_stopwords").read().split("\n") vect = HashingVectorizer(stop_words=stop_words, …
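One caveat when making this swap: MultinomialNB requires non-negative features, but HashingVectorizer's default `alternate_sign=True` produces signed values (to compensate for hash collisions), which makes MultinomialNB raise an error. Passing `alternate_sign=False` keeps the counts non-negative. A minimal sketch (the documents and labels below are made up purely for illustration):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["spam spam offer", "meeting at noon", "free offer now", "lunch at noon"]
labels = [1, 0, 1, 0]

# alternate_sign=False keeps all feature values >= 0, as MultinomialNB
# requires; the default (True) would produce signed entries.
vect = HashingVectorizer(n_features=2**10, alternate_sign=False)
X = vect.transform(docs)  # no fit() needed: the vectorizer is stateless

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vect.transform(["free spam offer"])))
```

Because there is no `fit` step, new batches of documents can be transformed and fed to `partial_fit` incrementally, which is the usual pattern for corpora too large for memory.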
Category: Data Science

CountVectorizer vs HashingVectorizer for text

I'd like to tokenize a column of my training data (word-wise n-grams), but I'm working with a very large dataset distributed across a compute cluster. For this use case, CountVectorizer doesn't work well because it requires maintaining a vocabulary state, and thus can't be parallelized easily. Instead, for distributed workloads, I read that I should use a HashingVectorizer. My issue is that there are no generated labels now. Throughout training and at the end, I'd like to see which words …
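Since the hash is one-way, there is no stored vocabulary to inspect, but a workaround sometimes used is to hash a candidate word list after the fact, building an approximate reverse map from column index to words (collisions mean one column may cover several words). A toy sketch of the idea in plain Python (the `bucket` helper and the candidate list are illustrative, not part of any library API):

```python
import hashlib
from collections import defaultdict

N_FEATURES = 2**8

def bucket(token: str) -> int:
    # Same stateless hash on every worker -> consistent column indices.
    return int(hashlib.md5(token.encode()).hexdigest(), 16) % N_FEATURES

# Build an approximate index -> words map by hashing known candidates.
candidates = ["model", "cluster", "token", "vector"]
index_to_words = defaultdict(list)
for w in candidates:
    index_to_words[bucket(w)].append(w)

for idx, words in sorted(index_to_words.items()):
    print(idx, words)
```

The candidate list can come from a cheap single pass over a sample of the data; the map is only as complete as that list, which is the price paid for not maintaining shared vocabulary state.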
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.