How to handle memory issues when training word embeddings on a large dataset?
I want to train a word-predictability task to generate word embeddings. The document collection contains 243k full article documents, and my implementation is in Torch. I am struggling with the sheer size of the dataset and need ideas on how to train word embeddings on a collection this large. Access to the research computing GPU nodes is time-limited, so I only get short sessions, which is why I opted for incremental model training:
- Incremental model training: one way to train on the entire dataset is to train the model on one chunk of the data, save it, then load the pre-trained model and continue training on the next chunk. The problem I face with this approach is how to maintain the vocabulary/dictionary of words. In word-embedding methods the dictionary/vocab plays an important role: we sweep over all documents and build a vocab of words whose count exceeds a minimum frequency. This vocab is essentially a hash map that assigns an index to each word, and in the training samples we replace words with their indices for simplicity in the model. With incremental training, how do I build the dictionary incrementally? Do I have to create the vocab/dictionary over the entire document collection first and only then train incrementally, or is there a way to extend the vocab during incremental training as well? (A sketch of the two-pass variant I am considering is given after this list.)
- Another problem is the memory limit on the size of the vocab data structure. I am implementing my model in Torch, which is Lua based, and Lua puts a limit on the size of its tables, so I cannot hold the vocab for the entire collection in a single table. How can I get around such memory limits? (The tds-based workaround I am looking at is sketched after this list.)
- Taking inspiration from GloVe vectors. In their paper they say: "We trained our model on five corpora of varying sizes: a 2010 Wikipedia dump with 1 billion tokens; a 2014 Wikipedia dump with 1.6 billion tokens; Gigaword 5 which has 4.3 billion tokens; the combination Gigaword5 + Wikipedia2014, which has 6 billion tokens; and on 42 billion tokens of web data, from Common Crawl. We tokenize and lowercase each corpus with the Stanford tokenizer, build a vocabulary of the 400,000 most frequent words, and then construct a matrix of co-occurrence counts X." Any idea how the GloVe vectors were trained on such a big corpus with such a big vocabulary, and how the memory restrictions in their case might have been handled? (My current guess is sketched after the list.) Paper reference: http://nlp.stanford.edu/pubs/glove.pdf
- Any ideas on how to limit the size of the dataset used for generating word embeddings? How would the performance or coverage of the embeddings be affected by increasing or decreasing the number of documents? Is it a good idea to use sampling techniques to sample documents from the dataset, and if so, which sampling techniques would you suggest? (A simple random-sampling sketch of what I have in mind follows below.)
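For the first bullet, the two-pass variant I am considering is: build the full dictionary first with a cheap counting pass over all chunks, then keep it fixed while training incrementally. A minimal sketch (the chunk file names and MIN_FREQ are placeholders for my own setup):

```lua
-- pass 1: stream every chunk once and only count words (cheap per chunk)
local MIN_FREQ = 5
local chunk_paths = {'chunk_001.txt', 'chunk_002.txt'}   -- hypothetical chunk files

local counts = {}
for _, path in ipairs(chunk_paths) do
  for line in io.lines(path) do
    for word in line:gmatch('%S+') do
      counts[word] = (counts[word] or 0) + 1
    end
  end
end

-- pass 2: freeze the vocab -- keep words above the minimum frequency, assign indices
local vocab, next_idx = {}, 0
for word, c in pairs(counts) do
  if c >= MIN_FREQ then
    next_idx = next_idx + 1
    vocab[word] = next_idx
  end
end
-- training then proceeds chunk by chunk against this fixed word -> index map
```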
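For the second bullet, the direction I have been looking at is the torch/tds package (https://github.com/torch/tds), which, as far as I understand, allocates its hash/vector structures in C memory outside the LuaJIT heap, so the usual limits on plain Lua tables should not apply. A sketch of how I would hold the counts, assuming I read the tds API correctly:

```lua
local tds = require 'tds'

-- tds.Hash is used like a plain Lua table but lives outside the LuaJIT heap
local counts = tds.Hash()
for line in io.lines('corpus_chunk.txt') do     -- hypothetical chunk file
  for word in line:gmatch('%S+') do
    counts[word] = (counts[word] or 0) + 1
  end
end
print(#counts)   -- number of distinct words seen so far
```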
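For the GloVe bullet, my current guess is that the co-occurrence matrix is never materialised densely: 400,000 x 400,000 single-precision entries would already be around 640 GB, so presumably only the nonzero pairs are kept (in memory and/or sharded to disk). A sketch of the sparse accumulation I have in mind, using the 1/d distance weighting described in the paper (the key format and window size are my own choices):

```lua
local WINDOW = 10                      -- context window size (my own choice)
local cooc = {}                        -- sparse map: key "i,j" -> weighted count

local function add_pair(i, j, w)
  local key = i .. ',' .. j
  cooc[key] = (cooc[key] or 0) + w
end

-- `tokens` is a table of vocab indices for one document
local function accumulate(tokens)
  for pos = 1, #tokens do
    for off = 1, WINDOW do
      local other = pos + off
      if other > #tokens then break end
      -- word pairs d tokens apart contribute 1/d to the count, as in the paper
      add_pair(tokens[pos], tokens[other], 1 / off)
      add_pair(tokens[other], tokens[pos], 1 / off)
    end
  end
end

accumulate({3, 17, 3, 42, 8})          -- toy example with made-up indices
```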
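For the last bullet, the simplest thing I can think of is uniform random sampling of documents, which should roughly preserve the topic mix while mainly reducing coverage of rare words. A sketch of what I mean (the document-id list and the fraction are placeholders):

```lua
require 'torch'

-- keep a uniform random fraction of the documents
local function sample_docs(doc_ids, fraction, seed)
  torch.manualSeed(seed or 1234)
  local n = #doc_ids
  local perm = torch.randperm(n)       -- random permutation of 1..n
  local keep = math.floor(n * fraction)
  local sampled = {}
  for i = 1, keep do
    sampled[#sampled + 1] = doc_ids[perm[i]]
  end
  return sampled
end

-- e.g. keep 25% of the 243k documents:
-- local subset = sample_docs(all_doc_ids, 0.25)
```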
Topic torch word-embeddings deep-learning dataset
Category Data Science