Word Embedding for Item Names (integer, one-hot encoding)

I am looking for a way to measure the similarity between two item names using integer encoding or one-hot encoding.

For example, "lane connector" vs. "truck crane".

I have 100,000 item names, each consisting of 2 to 3 words as above.

Each item also has a size (36mm, 12M, 2400*1200...) and a unit (ea, m2, m3, hr...).

I want to represent (item name, size, unit) as a vector. To do this, I need to convert the text to numbers in some way. All I have found so far is word2vec, but my data has no context corpus, so I don't think it is possible to learn any context from it.

Topic word word-embeddings nlp python

Category Data Science


Okay, so as I understand it, you just have a list of words and want to get word vectors for them. You are correct that you cannot train a word2vec model yourself, since that requires a corpus. What you can do instead is use a pre-trained model (word2vec or GloVe). I suggest word2vec, as gensim has a pretty simple implementation. You can download Google's pre-trained model here. Then you can use the following code to build word_embed for a given word_list.

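import gensim

# Load Google's pre-trained vectors (the .bin file is assumed to be in the working directory)
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

word_list = ['lane', 'connector', 'truck', 'crane']  # example words; replace with your own
word_embed = {}  # maps each word to its vector
for word in word_list:
    # key_to_index is the gensim 4.x vocabulary lookup (older versions used model.vocab)
    if word in model.key_to_index:
        word_embed[word] = model[word]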

Also, you'll have to pre-process your word list so that you get as many matches as possible from the pre-trained embeddings (for example, removing stop words such as "the"). If a word is still not found in the pre-trained embeddings, you can either initialize its vector randomly or use an average of the embeddings you did find.
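As a rough sketch of the pre-processing and lookup steps above, the snippet below cleans an item name, looks up each word, and averages the vectors it finds into one vector per name, then compares two names with cosine similarity. Averaging the words of a name is an assumption here (one common way to handle 2 to 3 word names), and toy_vectors, STOP_WORDS, tokenize, name_vector and cosine are hypothetical names with made-up values standing in for the pre-trained model.

```python
import math
import re

# Hypothetical toy vectors standing in for lookups in a pre-trained model
# (in practice each list would come from something like model[word]).
toy_vectors = {
    "lane": [0.9, 0.1, 0.0],
    "connector": [0.2, 0.8, 0.1],
    "truck": [0.1, 0.2, 0.9],
    "crane": [0.0, 0.3, 0.8],
}

STOP_WORDS = {"a", "an", "the"}  # assumed minimal stop-word list


def tokenize(name):
    """Lowercase the name, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", name.lower()) if t not in STOP_WORDS]


def name_vector(name, vectors):
    """Average the vectors of the known words; return None if no word is known."""
    found = [vectors[t] for t in tokenize(name) if t in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


v1 = name_vector("lane connector", toy_vectors)
v2 = name_vector("a truck crane", toy_vectors)
similarity = cosine(v1, v2)
```

With real pre-trained vectors, names that return None would need one of the fallbacks mentioned above (a random vector or an average).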


I'm not sure this is possible with this data set. Word2Vec generates word embeddings by learning which words appear together in sentences ("word association").

So I don't think you can apply Word2Vec to this dataset, which doesn't seem to have any such association, except in places where you can match (or cluster) items on parameters like:

  1. Units
  2. Size/dimension of the item
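A minimal sketch of how these two parameters could be turned into numbers, assuming the unit is one-hot encoded and the size string is split into its numeric parts plus a letter suffix. UNITS, one_hot and parse_size are hypothetical names, and the unit list only contains the examples given in the question.

```python
import re

UNITS = ["ea", "m2", "m3", "hr"]  # units from the question; extend as needed


def one_hot(value, categories):
    """Return a 0/1 list with a 1 at the position of value (all zeros if unseen)."""
    return [1 if value == c else 0 for c in categories]


def parse_size(size):
    """Split a size string such as '36mm' or '2400*1200' into numbers and letters."""
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", size)]
    suffix = re.sub(r"[^a-z]", "", size.lower())  # letters left after removing digits and symbols
    return numbers, suffix


unit_vec = one_hot("m2", UNITS)
numbers, suffix = parse_size("36mm")
```

Items with the same unit vector and similar size numbers can then be grouped or compared directly, without any learned embedding.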

I would be interested to see a solution for this type of problem.
