How to access an embedding table that is too large to fully load into memory?
I'm trying to find a way to load/deserialize a .json file of Flair word embeddings that is too large to fit in my RAM at once (a >60 GB .json file on a machine with 32 GB of RAM). My current code for loading the embedding table is below.
import json

import numpy as np
import tensorflow as tf

def get_embedding_table(config):
    # Deserializes the whole id -> vector mapping at once; this is the
    # step that exhausts RAM for the 60 GB file.
    with open(config.words_id2vector_filename, 'r') as f:
        words_id2vec = json.load(f)
    words_vectors = [0] * len(words_id2vec)
    for word_id, vec in words_id2vec.items():
        words_vectors[int(word_id)] = vec
    # One extra randomly initialized row (e.g. for unknown words).
    words_vectors.append(list(np.random.uniform(0, 1, config.embedding_dim)))
    return tf.Variable(name='words_emb_table',
                       initial_value=words_vectors, dtype=tf.float32)
The rest of the code that I am trying to reproduce with a different word embedding can be found here.
I wonder whether it is possible to access the embedding table without deserializing the entire .json file, for example by reading it sequentially, somehow splitting it, or reading it directly from disk.
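To make the "sequential reading" option concrete, one idea I had (untested) is to stream-parse the file with the third-party ijson library and write the vectors into a disk-backed NumPy memmap in a single pass, so the full table never has to sit in memory. This is only a sketch; vocab_size, embedding_dim, and the file paths below are placeholders for values that would come from my config.

import ijson
import numpy as np

JSON_PATH = 'words_id2vec.json'      # placeholder path
TABLE_PATH = 'words_emb_table.npy'   # placeholder path
vocab_size = 3_000_000               # assumed known: number of ids in the JSON
embedding_dim = 300                  # assumed known: length of each vector

# Create a float32 array backed by a file on disk, with one extra row
# for the randomly initialized vector from my original code.
table = np.lib.format.open_memmap(
    TABLE_PATH, mode='w+', dtype=np.float32,
    shape=(vocab_size + 1, embedding_dim))

with open(JSON_PATH, 'rb') as f:
    # ijson.kvitems streams (key, value) pairs of the top-level JSON
    # object without materializing the whole mapping in memory.
    for word_id, vec in ijson.kvitems(f, ''):
        table[int(word_id)] = np.asarray(vec, dtype=np.float32)

table[vocab_size] = np.random.uniform(0, 1, embedding_dim)
table.flush()

After this one-off conversion, np.load(TABLE_PATH, mmap_mode='r') should give array-like access that pages rows in from disk on demand, so each batch could gather just the rows it needs instead of initializing the full tf.Variable. Does something along these lines make sense, or is there a more standard approach? I would greatly appreciate your input!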
Topic: tensorflow, word-embeddings, nlp
Category: Data Science