Word Embedding for Item Names (integer, one-hot encoding)

Question

Ken Kim

June 20, 2019 05:56

I am looking for a way to compute the similarity between two item names using integer encoding or one-hot encoding.

For example, "lane connector" vs. "truck crane".

I have 100,000 item names, each consisting of 2 to 3 words as above.

Each item also has a size (36mm, 12M, 2400*1200, ...) and a unit (ea, m2, m3, hr, ...).

I want to represent (item name, size, unit) as a vector. To do this, I need to convert the text to numbers in some way. All I have found so far are word2vec-style approaches, but my data has no context corpus, so I don't think it is possible to learn any context from it.

Topic word word-embeddings nlp python

Category Data Science

bkshi · Accepted Answer · June 20, 2019 05:56

Okay, so as I understand it, you just have a list of words and want to get word vectors for them. You are correct that you cannot train a word2vec model, as that requires a corpus. What you can do instead is use a pre-trained model (word2vec or GloVe). I suggest word2vec, since gensim has a pretty simple implementation. You can download Google's pre-trained model here. Then you can use the following code to build word_embed for a given word_list.

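import gensim

# Load Google's pre-trained word2vec vectors (the .bin file must be downloaded first).
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

word_list = ['lane', 'connector', 'truck', 'crane']  # example words taken from the question
word_embed = {}  # maps each known word to its 300-dimensional vector
for word in word_list:
    # Keep only the words that exist in the pre-trained vocabulary.
    if word in model:
        word_embed[word] = model[word]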

Also, you'll have to pre-process your word list so that you get as many matches as possible from the pre-trained embeddings (for example, removing stop words such as "the"). If a word is still not found in the pre-trained embeddings, you can either initialize its vector randomly or use an average of the embeddings you do have.
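To make the pre-processing and averaging idea concrete, here is a minimal sketch of turning an item name into a single vector and comparing two names by cosine similarity. The names toy_vectors, STOP_WORDS, preprocess, name_vector and cosine_similarity are hypothetical, and the 3-dimensional vectors are made-up stand-ins for the real pre-trained embeddings (in practice you would look words up in the loaded gensim model). Words that are not found are simply skipped here, which is one possible choice, not the only one.

```python
import math

# Made-up 3-d vectors standing in for pre-trained embeddings (assumption:
# in real use these would come from the loaded model, e.g. model[word]).
toy_vectors = {
    "lane": [0.9, 0.1, 0.0],
    "connector": [0.7, 0.3, 0.1],
    "truck": [0.1, 0.9, 0.2],
    "crane": [0.2, 0.8, 0.3],
}
STOP_WORDS = {"a", "an", "the"}  # assumed minimal stop-word list


def preprocess(name):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in name.lower().split() if w not in STOP_WORDS]


def name_vector(name, vectors, dim=3):
    """Average the vectors of the known words; all zeros if none are known."""
    known = [vectors[w] for w in preprocess(name) if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def cosine_similarity(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


a = name_vector("lane connector", toy_vectors)
b = name_vector("a truck crane", toy_vectors)
print(cosine_similarity(a, b))
```

With real 300-dimensional vectors the same averaging applies; only the lookup table and the dim argument change.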

vipin bansal · Accepted Answer · June 20, 2019 05:51

I'm not sure this is possible with this dataset. Word2Vec generates word embeddings by learning from the associations between words within a sentence.

So I don't think you can apply Word2Vec to this dataset, which doesn't appear to contain any such associations, except in a few places where you could match (or cluster on) parameters like:

Units
Size/dimension of the item-name
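As a rough sketch of how those two parameters could be turned into numbers for matching or clustering: the unit can be one-hot encoded and the numeric parts of the size can be extracted. The names UNITS, one_hot_unit and size_numbers are hypothetical, the unit list is just the examples from the question, and the size parser deliberately ignores the measurement suffix (mm vs. M), which a real solution would need to handle.

```python
import re

# Units taken from the examples in the question; extend as needed (assumption).
UNITS = ["ea", "m2", "m3", "hr"]


def one_hot_unit(unit, units=UNITS):
    """Return a one-hot list with a 1 at the position of `unit`."""
    return [1 if unit == u else 0 for u in units]


def size_numbers(size):
    """Extract the numeric parts of a size string such as '2400*1200'."""
    return [float(n) for n in re.findall(r"\d+(?:\.\d+)?", size)]


print(one_hot_unit("m2"))         # [0, 1, 0, 0]
print(size_numbers("2400*1200"))  # [2400.0, 1200.0]
print(size_numbers("36mm"))       # [36.0]
```

An unknown unit yields an all-zero list here; whether that is acceptable depends on the data.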

I'd be interested to see a solution for this type of problem.
