How to compute the mean of word embeddings and then compare strings using sklearn.metrics.pairwise
I am totally new to this topic and have been stuck on this code for a while, and I am not sure how to solve it correctly. My goal is to build a short-text embedding as a vector representation of the text: the word embeddings are aggregated by mean averaging to infer a single vector for the whole text. I loaded pre-trained vectors with gensim.models, then ran each word through the model and checked whether it is in the vocabulary. If it is, I embed it and then aggregate the mean average (not sure if this is correct). After that, I want to compare the two texts with cosine similarity, but I am not sure how.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

first_sentence_list = ['driver', 'backs', 'into', 'stroller', 'with', 'child', ',', 'drives', 'off']
second_sentence_list = ['driver', 'backs', 'into', 'mom', ',', 'stroller', 'with', 'child', 'then', 'drives', 'off']

def meanEmbeddings(text_list):
    model = load_wiki_en_vectors()
    test = []
    # loop over the words of the given sentence
    for word in text_list:
        try:
            word_embeding = model.get_vector(word, norm=True)
            test.append(np.mean(word_embeding, axis=0))  # not sure if doing the mean averaging here is right
        except KeyError:
            continue
    return test
res_1 = meanEmbeddings(first_sentence_list)
# [0.0023045307, 0.0033775743, ...]
res_2 = meanEmbeddings(second_sentence_list)
# [0.0023045307, 0.0033775743, ...]
Afterwards I want to do the similarity check using sklearn's pairwise cosine similarity. The problem is that the two results have different lengths (the first 9, the second 11):
cos = cosine_similarity([res_1],[res_2])
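For context, here is a minimal sketch of what I think the intended pipeline looks like. Since `load_wiki_en_vectors` is not shown, it uses a plain dict of random vectors as a stand-in for the gensim model; the idea is to average the per-word vectors across words (`axis=0` over the stacked vectors), so each sentence yields one vector of the embedding dimension regardless of sentence length, and `cosine_similarity` then accepts the two vectors directly:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in for a gensim KeyedVectors model: a word -> vector lookup.
# (With a real model this would be model.get_vector(word, norm=True).)
rng = np.random.default_rng(0)
words = ['driver', 'backs', 'into', 'stroller', 'with', 'child',
         ',', 'drives', 'off', 'mom', 'then']
vocab = {w: rng.normal(size=4) for w in words}  # tiny 4-dim vectors for illustration

def mean_embedding(tokens, lookup):
    # Collect the vector for every in-vocabulary token,
    # skipping out-of-vocabulary words.
    vecs = [lookup[t] for t in tokens if t in lookup]
    # Average across tokens (axis=0): the result has the embedding
    # dimension, not the sentence length.
    return np.mean(vecs, axis=0)

s1 = ['driver', 'backs', 'into', 'stroller', 'with', 'child', ',', 'drives', 'off']
s2 = ['driver', 'backs', 'into', 'mom', ',', 'stroller', 'with', 'child', 'then', 'drives', 'off']

v1 = mean_embedding(s1, vocab)
v2 = mean_embedding(s2, vocab)

# Both sentence vectors now have the same length (the embedding
# dimension), so cosine_similarity can compare them directly.
sim = cosine_similarity([v1], [v2])[0, 0]
```

The key difference from my code above is that the mean is taken once, over the stacked per-word vectors, rather than per word inside the loop.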
Topic gensim word2vec word-embeddings scikit-learn python
Category Data Science