How to access an embedding table that is too large to fully load into memory?

I'm currently trying to find a way to load/deserialize a .json file containing Flair word embeddings that is too large to fit into my RAM at once (a >60 GB .json file with 32 GB of RAM). My current code for loading the embedding table is below.

import json

import numpy as np
import tensorflow as tf


def get_embedding_table(config):
    # Deserializes the entire >60 GB JSON file at once -- this is what
    # exhausts memory.
    with open(config.words_id2vector_filename, 'r') as f:
        words_id2vec = json.load(f)

    # Re-order the vectors by their integer id.
    words_vectors = [0] * len(words_id2vec)
    for word_id, vec in words_id2vec.items():
        words_vectors[int(word_id)] = vec

    # Extra random vector for out-of-vocabulary words.
    words_vectors.append(list(np.random.uniform(0, 1, config.embedding_dim)))

    words_embedding_table = tf.Variable(
        name='words_emb_table', initial_value=words_vectors, dtype=tf.float32)
    return words_embedding_table

The rest of the code that I am trying to reproduce with a different word embedding can be found here.

I wonder if it is somehow possible to access the embedding table without deserializing the entire .json file, for example by reading it sequentially, splitting it into chunks, or reading it directly from disk. I would greatly appreciate your input!

Topic tensorflow word-embeddings nlp

Category Data Science


There are a couple of options:

  1. Parse the JSON incrementally with a streaming parser such as ijson, writing each (id, vector) pair into a preallocated array as it arrives, and build the tf.Variable only once the array is complete.

  2. Reduce the precision of the numbers from dtype=tf.float32 to dtype=tf.bfloat16, which halves the memory footprint of the table.
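A minimal sketch of option 1, assuming the file is a single top-level JSON object mapping word ids to vectors (as the asker's loading code implies); the tiny file written here is a hypothetical stand-in for the real >60 GB file, and the vocabulary size would in practice come from the config:

```python
import json
import os
import tempfile

import ijson  # third-party streaming JSON parser: pip install ijson
import numpy as np

# Tiny stand-in for the huge words_id2vector file (assumed layout:
# {"<word id>": [<embedding floats>], ...}).
tmp = tempfile.NamedTemporaryFile("w", suffix=".json", delete=False)
json.dump({"0": [0.1, 0.2], "1": [0.3, 0.4], "2": [0.5, 0.6]}, tmp)
tmp.close()

vocab_size, embedding_dim = 3, 2

# Preallocate the table once; float16 needs half the RAM of float32.
table = np.empty((vocab_size, embedding_dim), dtype=np.float16)

# ijson.kvitems streams the (key, value) pairs of the top-level JSON
# object one at a time, so the whole document is never held in memory.
with open(tmp.name, "rb") as f:
    for word_id, vec in ijson.kvitems(f, ""):
        table[int(word_id)] = [float(x) for x in vec]

os.remove(tmp.name)
```

Once the array is filled, it can be handed to tf.Variable as the initial_value in a single step, instead of appending to the variable piecemeal.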
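The saving from option 2 is easy to check with NumPy (using float16 as a stand-in, since NumPy has no native bfloat16; the shapes here are illustrative, not the asker's real dimensions):

```python
import numpy as np

vocab_size, embedding_dim = 1000, 300

# Same table in 32-bit and 16-bit precision.
full = np.random.uniform(0, 1, (vocab_size, embedding_dim)).astype(np.float32)
half = full.astype(np.float16)

print(full.nbytes)  # 1200000
print(half.nbytes)  # 600000
```

The same ratio holds for tf.bfloat16 versus tf.float32: 2 bytes per value instead of 4, at the cost of precision in the stored embeddings.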
