How to compute the mean of word embeddings and then compare strings using sklearn.metrics.pairwise
I am totally new to this topic and have been stuck on this code for a while, and I am not sure how to solve it correctly. My goal is to build a short-text embedding as a vector representation of the text: the word embeddings are aggregated by mean averaging to infer a single vector for the whole text. I loaded pre-trained vectors with gensim.models, then ran each word through the model and checked whether it is in the vocabulary. If it is, I embed it and then aggregate the mean average (not sure if this is correct). After that, I want to compare the two texts with cosine similarity, but I am not sure how.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

first_sentence_list = ['driver', 'backs', 'into', 'stroller', 'with', 'child', ',', 'drives', 'off']
second_sentence_list = ['driver', 'backs', 'into', 'mom', ',', 'stroller', 'with', 'child', 'then', 'drives', 'off']

def meanEmbeddings(text_list):
    model = load_wiki_en_vectors()
    test = []
    # loop over the words of the given sentence
    for word in text_list:
        try:
            word_embeding = model.get_vector(word, norm=True)
            test.append(np.mean(word_embeding, axis=0))  # not sure if doing the mean averaging here is right
        except KeyError:
            continue
    return test
res_1 = meanEmbeddings(first_sentence_list)
# [0.0023045307, 0.0033775743, ...]
res_2 = meanEmbeddings(second_sentence_list)
# [0.0023045307, 0.0033775743, ...]
Afterwards I want to do the similarity check using sklearn's pairwise cosine similarity. The problem is that the two results have different lengths (the first 9, the second 11):
cos = cosine_similarity([res_1],[res_2])
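For context, here is a minimal sketch of what I think the intended pipeline looks like. Since `load_wiki_en_vectors` is not shown, it uses a plain dict of random vectors as a stand-in for the gensim model; the idea is to average the per-word vectors across words (`axis=0` over the stacked vectors), so each sentence yields one vector of the embedding dimension regardless of sentence length, and `cosine_similarity` then accepts the two vectors directly:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in for a gensim KeyedVectors model: a word -> vector lookup.
# (With a real model this would be model.get_vector(word, norm=True).)
rng = np.random.default_rng(0)
words = ['driver', 'backs', 'into', 'stroller', 'with', 'child',
         ',', 'drives', 'off', 'mom', 'then']
vocab = {w: rng.normal(size=4) for w in words}  # tiny 4-dim vectors for illustration

def mean_embedding(tokens, lookup):
    # Collect the vector for every in-vocabulary token,
    # skipping out-of-vocabulary words.
    vecs = [lookup[t] for t in tokens if t in lookup]
    # Average across tokens (axis=0): the result has the embedding
    # dimension, not the sentence length.
    return np.mean(vecs, axis=0)

s1 = ['driver', 'backs', 'into', 'stroller', 'with', 'child', ',', 'drives', 'off']
s2 = ['driver', 'backs', 'into', 'mom', ',', 'stroller', 'with', 'child', 'then', 'drives', 'off']

v1 = mean_embedding(s1, vocab)
v2 = mean_embedding(s2, vocab)

# Both sentence vectors now have the same length (the embedding
# dimension), so cosine_similarity can compare them directly.
sim = cosine_similarity([v1], [v2])[0, 0]
```

The key difference from my code above is that the mean is taken once, over the stacked per-word vectors, rather than per word inside the loop.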
Topic gensim word2vec word-embeddings scikit-learn python
Category Data Science