How to access an embedding table that is too large to fully load into memory?

I'm currently trying to find a way to load/deserialize a .json file containing Flair word embeddings that is too large to fit into my RAM at once (a >60 GB .json file with 32 GB of RAM). My current code for loading the embedding table is below.

import json

import numpy as np
import tensorflow as tf


def get_embedding_table(config):
    # Deserializes the entire >60 GB JSON file at once -- this is what
    # exhausts memory.
    with open(config.words_id2vector_filename, 'r') as f:
        words_id2vec = json.load(f)

    # Re-order the vectors by their integer id.
    words_vectors = [0] * len(words_id2vec)
    for word_id, vec in words_id2vec.items():
        words_vectors[int(word_id)] = vec

    # Extra random vector for out-of-vocabulary words.
    words_vectors.append(list(np.random.uniform(0, 1, config.embedding_dim)))

    words_embedding_table = tf.Variable(
        name='words_emb_table', initial_value=words_vectors, dtype=tf.float32)
    return words_embedding_table

The rest of the code that I am trying to reproduce with a different word embedding can be found here.

I wonder if it is somehow possible to access the embedding table without deserializing the entire .json file, for example by reading it sequentially, splitting it into chunks, or reading it directly from disk. I would greatly appreciate your input!

Topic tensorflow word-embeddings nlp

Category Data Science


There are a couple of options:

  1. Parse the JSON incrementally with a streaming parser such as ijson, writing each (id, vector) pair into a preallocated array as it arrives, and build the tf.Variable only once the array is complete.

  2. Reduce the precision of the numbers from dtype=tf.float32 to dtype=tf.bfloat16, which halves the memory footprint of the table.
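A minimal sketch of option 1, assuming the file is a single top-level JSON object mapping word ids to vectors (as the asker's loading code implies); the tiny file written here is a hypothetical stand-in for the real >60 GB file, and the vocabulary size would in practice come from the config:

```python
import json
import os
import tempfile

import ijson  # third-party streaming JSON parser: pip install ijson
import numpy as np

# Tiny stand-in for the huge words_id2vector file (assumed layout:
# {"<word id>": [<embedding floats>], ...}).
tmp = tempfile.NamedTemporaryFile("w", suffix=".json", delete=False)
json.dump({"0": [0.1, 0.2], "1": [0.3, 0.4], "2": [0.5, 0.6]}, tmp)
tmp.close()

vocab_size, embedding_dim = 3, 2

# Preallocate the table once; float16 needs half the RAM of float32.
table = np.empty((vocab_size, embedding_dim), dtype=np.float16)

# ijson.kvitems streams the (key, value) pairs of the top-level JSON
# object one at a time, so the whole document is never held in memory.
with open(tmp.name, "rb") as f:
    for word_id, vec in ijson.kvitems(f, ""):
        table[int(word_id)] = [float(x) for x in vec]

os.remove(tmp.name)
```

Once the array is filled, it can be handed to tf.Variable as the initial_value in a single step, instead of appending to the variable piecemeal.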
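The saving from option 2 is easy to check with NumPy (using float16 as a stand-in, since NumPy has no native bfloat16; the shapes here are illustrative, not the asker's real dimensions):

```python
import numpy as np

vocab_size, embedding_dim = 1000, 300

# Same table in 32-bit and 16-bit precision.
full = np.random.uniform(0, 1, (vocab_size, embedding_dim)).astype(np.float32)
half = full.astype(np.float16)

print(full.nbytes)  # 1200000
print(half.nbytes)  # 600000
```

The same ratio holds for tf.bfloat16 versus tf.float32: 2 bytes per value instead of 4, at the cost of precision in the stored embeddings.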
