CountVectorizer vs HashingVectorizer for text
I'd like to tokenize a column of my training data into word-level n-grams, but I'm working with a very large dataset distributed across a compute cluster. For this use case, CountVectorizer doesn't work well because it has to build and maintain a global vocabulary as state, so it can't be parallelized easily.
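A minimal sketch (scikit-learn on a toy corpus; the corpus itself is made up) of the shared state I mean:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the quick brown fox", "the lazy dog"]
vec = CountVectorizer(ngram_range=(1, 2))  # word-level unigrams and bigrams
X = vec.fit_transform(corpus)

# fit() builds a global token -> column mapping; every worker would need
# an identical copy of this dictionary for the columns to line up.
print(vec.vocabulary_)  # e.g. {'the': 9, 'quick': 7, 'the quick': 11, ...}
```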
Instead, I read that for distributed workloads I should use a HashingVectorizer. My issue is that it no longer produces human-readable feature names: throughout training and at the end, I'd like to see which words were most important to my model, but with hashing the feature indices can't be mapped back to tokens.
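To illustrate what I mean (same toy document as above):

```python
from sklearn.feature_extraction.text import HashingVectorizer

vec = HashingVectorizer(ngram_range=(1, 2), n_features=2**20)
X = vec.transform(["the quick brown fox"])  # stateless: no fit() needed

# All I get back are hashed column indices; there is no vocabulary_
# attribute to translate a column index back into the token it came from.
print(X.nonzero()[1])
```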
Is there something I can do to keep the human-readable feature names that CountVectorizer generates while still taking advantage of a parallel distributed cluster the way HashingVectorizer does?
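For reference, the closest I can get is probing tokens through the vectorizer one at a time to build a reverse lookup, but that only covers tokens I enumerate up front and hash collisions make it lossy:

```python
from sklearn.feature_extraction.text import HashingVectorizer

vec = HashingVectorizer(n_features=2**20, alternate_sign=False, norm=None)

candidate_tokens = ["quick", "lazy"]  # hypothetical list I'd have to supply myself
bucket_to_token = {}
for tok in candidate_tokens:
    col = vec.transform([tok]).nonzero()[1][0]  # bucket this token hashes to
    bucket_to_token[col] = tok  # colliding tokens silently overwrite each other
```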
Topic hashingvectorizer tokenization nlp distributed
Category Data Science