How to calculate the mean of word embeddings and then compare strings using sklearn.metrics.pairwise

I am totally new to this topic and have been stuck on this code for a while without knowing how to solve it correctly. My goal is to build a short-text embedding from the vector representations of the words in the text: the word embeddings are aggregated via mean averaging to infer a single vector for the whole text. I load the word vectors with gensim.models, then look up each word of the sentence in the model. If the word is in the vocabulary, I embed it and then take the mean (I am not sure this part is correct). After that, I want to compare the two texts with cosine similarity, but I am not sure how.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

first_sentence_list = ['driver', 'backs', 'into', 'stroller', 'with', 'child', ',', 'drives', 'off']
second_sentence_list = ['driver', 'backs', 'into', 'mom', ',', 'stroller', 'with', 'child', 'then', 'drives', 'off']

# load_wiki_en_vectors() is defined elsewhere (not shown)

def meanEmbeddings(text_list):
    model = load_wiki_en_vectors()

    test = []
    # loop over the words of the given sentence
    for word in text_list:
        try:
            word_embedding = model.get_vector(word, norm=True)
            test.append(np.mean(word_embedding, axis=0))  # not sure if doing the mean averaging here is right
        except KeyError:
            continue
    return test


res_1 = meanEmbeddings(first_sentence_list)
# [0.0023045307, 0.0033775743, ...]
res_2 = meanEmbeddings(second_sentence_list)
# [0.0023045307, 0.0033775743, ...]

After that, I want to do the similarity check using sklearn's pairwise cosine_similarity. This is where the problem is: the two results have different lengths (9 for the first and 11 for the second).


cos = cosine_similarity([res_1],[res_2])

Topic gensim word2vec word-embeddings scikit-learn python

Category Data Science


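This should work; you don't need to append. In your version, np.mean is applied to each word vector separately, which collapses every word to a single number, so the result has one entry per word and its length depends on the sentence. Instead, drop the out-of-vocabulary words and take the mean across the word vectors (axis=0), so every sentence maps to one vector with the same dimensionality as the embeddings.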

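    import numpy as np

    def meanEmbeddings(model, words):
        # remove out-of-vocabulary words (gensim >= 4; on gensim 3.x use model.vocab)
        words = [word for word in words if word in model.key_to_index]
        if len(words) >= 1:
            # model[words] has shape (n_words, dim); axis=0 averages over the words
            return np.mean(model[words], axis=0)
        else:
            return []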
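
To compare two sentences afterwards, something along these lines might work. This is only a minimal sketch: `toy_vectors` is a made-up 3-dimensional dictionary standing in for the real gensim model, `mean_embedding` and its `dim` parameter are hypothetical names, and returning a zero vector for a sentence with no known words is my own assumption rather than part of the answer above. The point it illustrates is that both sentence vectors end up with the same shape, and that cosine_similarity expects 2-D input, hence the reshape.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy embeddings standing in for a trained gensim model.
toy_vectors = {
    "driver":   np.array([1.0, 0.0, 0.0]),
    "stroller": np.array([0.0, 1.0, 0.0]),
    "child":    np.array([0.0, 0.0, 1.0]),
    "mom":      np.array([0.0, 1.0, 1.0]),
}


def mean_embedding(vectors, words, dim=3):
    """Average the vectors of all in-vocabulary words into one vector."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        # Assumption: fall back to a zero vector when no word is known.
        return np.zeros(dim)
    # Stack to (n_words, dim) and average over the word axis.
    return np.mean(known, axis=0)


first = ["driver", "backs", "into", "stroller", "with", "child"]
second = ["driver", "backs", "into", "mom", "stroller", "with", "child"]

vec_1 = mean_embedding(toy_vectors, first)
vec_2 = mean_embedding(toy_vectors, second)

# cosine_similarity expects 2-D arrays of shape (n_samples, n_features).
similarity = cosine_similarity(vec_1.reshape(1, -1), vec_2.reshape(1, -1))[0, 0]
print(vec_1.shape, vec_2.shape, similarity)
```

With the real model, you would pass the result of meanEmbeddings(model, words) for each sentence instead of the toy vectors; the sentence lengths (9 and 11 words) no longer matter because both averages live in the embedding space.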
