TF-IDF for 400,000+ unique words in corpus?

I have a corpus with over 400,000 unique words. I would like to build a TF-IDF matrix for this corpus. I have tried doing this on my laptop (16GB RAM) and Google Colab, but am unable to do so due to memory constraints. What is the best way to go about this?

Tags: google-cloud-platform, tfidf, memory, nlp

Category: Data Science


Without knowing your domain, one cannot say whether this is an appropriate number of features or not. However, consider this: WordNet's database contains 155,327 words organized into 175,979 synsets, for a total of 207,016 word-sense pairs [2].

Does your domain really rely on roughly twice as many words as WordNet contains?

I'm familiar with sklearn's TF-IDF implementation [1]; your mileage may vary on Google Cloud Platform. My sense is that your vocabulary is inflated by a combination of the following:

  1. Stop words
  2. Words in multiple tenses - run, runs, ran, running, etc.
  3. Misspelled words - running, runing etc.
  4. Low frequency words
  5. High frequency words
  6. N-grams
  7. Sparse/dense vector output

Stop words: Use your own list or a list from a known source to remove them. They don't contribute much to the resulting vectors; put another way, you will not be able to distinguish one document from another while these are present.
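For instance, with sklearn's TfidfVectorizer, the built-in English list or your own list of strings can be passed via the stop_words parameter (a minimal sketch on a toy two-document corpus; get_feature_names_out assumes scikit-learn 1.0+):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog ran in the park"]  # toy corpus

# Built-in English stop word list; a custom list of strings works the same way.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Words like "the", "on", "in" no longer appear in the vocabulary.
print(vectorizer.get_feature_names_out())
```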

Multiple tenses: This needs preprocessing, but you could use nltk to lemmatize your words. Again, you'll know best whether converting run/ran/running etc. to run will hinder or improve your output.
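A minimal sketch with nltk's WordNetLemmatizer (it needs the WordNet data downloaded once; tagging every token as a verb here is a simplification, a real pipeline would supply proper POS tags):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()

# Treating every token as a verb so run/runs/running/ran all collapse to "run".
tokens = ["run", "runs", "running", "ran"]
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])
```

If you want the lemmatization to happen inside the vectorizer, the same logic can be wrapped in a small callable and handed to TfidfVectorizer through its tokenizer parameter.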

Misspelled words: There are dictionaries that find the nearest word to correct to. This is domain dependent and could alter acronyms in ways that hurt performance.
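As one illustration (this assumes the third-party pyspellchecker package, which the question does not mention; any dictionary-based corrector follows the same pattern):

```python
# Assumes: pip install pyspellchecker
from spellchecker import SpellChecker

spell = SpellChecker()

tokens = ["running", "runing", "tfidf"]
# unknown() flags tokens missing from the dictionary; correction() proposes the
# nearest word. Note that domain terms and acronyms get flagged too, which is
# exactly how this step can hurt performance.
for t in spell.unknown(tokens):
    print(t, "->", spell.correction(t))
```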

Low frequency/high frequency words: Again, these are words that occur either too many times or too few times to help distinguish documents. sklearn's TF-IDF implementation has two parameters, max_df and min_df, that you can use to throttle your feature set.
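For example (the cut-offs below are placeholders; with 400,000+ terms you would tune them against your own corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the cat ran after the dog",
    "a dog sat on the mat",
    "the dog barked at the cat",
]  # toy corpus

# min_df=2: drop terms that appear in fewer than 2 documents (rare words, typos).
# max_df=0.75: drop terms that appear in more than 75% of documents ("the" here).
vectorizer = TfidfVectorizer(min_df=2, max_df=0.75)
X = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))
```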

N-grams: Are you using the standard tokenizer, which returns unigrams, or are you asking for unigrams, bigrams, trigrams, and/or more? Higher orders are useful because different information is encoded at different levels, but they also blow up the feature space.
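A quick way to see the cost (a sketch with a toy corpus; vocabulary_ holds the final term-to-index mapping):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the quick brown fox", "the lazy brown dog"]  # toy corpus

unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(docs)
uni_bi = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

# Each extra n-gram order multiplies the feature space, which matters when
# you are already memory bound.
print(len(unigrams.vocabulary_), len(uni_bi.vocabulary_))
```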

Sparse/dense vector output: Finally, and this applies to the sklearn/Python implementation, the output of the TF-IDF vectorizer is a sparse matrix. Converting it to a dense matrix is handy when debugging, but makes a very big difference with a large document set: it will consume a lot of memory and slow down subsequent processing. Make sure you are retaining a sparse matrix.
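A sketch of the difference (the 1,000,000-document figure below is just an illustrative back-of-the-envelope number):

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog ran in the park"]  # toy corpus

X = TfidfVectorizer().fit_transform(docs)
print(sparse.issparse(X))  # True: fit_transform returns a scipy sparse matrix

# Fine on a toy corpus, but for, say, 1,000,000 documents x 400,000 terms a
# dense float64 matrix needs about 1e6 * 4e5 * 8 bytes, roughly 3.2 TB of RAM.
# X_dense = X.toarray()  # avoid this on a large corpus
```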

My recommendation is that you try each of these and examine the output for validity for your downstream processes before combining.

References

[1] https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

[2] https://en.wikipedia.org/wiki/WordNet
