How can I get semantic word embeddings for compound terms?

Question

hipoglucido

September 4, 2017, 23:05

I need to build semantic word embeddings for compound terms like "electronic engineer" or "microsoft excel". One approach would be to use a standard pretrained model and average the vectors of the individual words but, since I have a corpus from my domain, is there a better approach?

To be more precise:

The data I have is a corpus of millions of documents. Each document is about half a page long and contains these compound terms. However, there may be compound terms that do not appear in the corpus.

Thanks

Topic word machine-learning

Category Data Science

Robin · Accepted Answer · August 25, 2017, 12:16

If you want an exact answer, please ask a precise question, i.e. define what data you have and what exactly you want.

That said, in general you need a dataset of texts that contain these compound terms. How to treat compound terms is a whole scientific field in itself, but since you're talking about semantic word embeddings, I suggest you take a look at the paper "Distributed Representations of Words and Phrases and their Compositionality". The same authors who introduced word2vec describe a simple method there to go from word representations to phrase representations: word pairs that co-occur more often than their individual frequencies would suggest are merged into single tokens. The words "microsoft excel" become "microsoft_excel" and get their own unique embedding.
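To make the idea concrete, here is a minimal pure-Python sketch of the bigram scoring from that paper, score(a, b) = (count(a b) - delta) / (count(a) * count(b)). The function names, the toy corpus, and the delta and threshold values are assumptions chosen for illustration; a real corpus needs much larger counts and a tuned threshold.

```python
from collections import Counter


def find_phrases(sentences, delta=1, threshold=0.1):
    """Return the set of adjacent word pairs whose score exceeds `threshold`.

    score(a, b) = (count(a b) - delta) / (count(a) * count(b))
    `delta` discounts pairs that are seen only a few times.
    """
    unigrams = Counter(word for s in sentences for word in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    return {
        pair
        for pair, n in bigrams.items()
        if (n - delta) / (unigrams[pair[0]] * unigrams[pair[1]]) > threshold
    }


def merge_phrases(sentence, phrases, sep="_"):
    """Replace every detected pair in `sentence` with a single joined token."""
    merged, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in phrases:
            merged.append(sentence[i] + sep + sentence[i + 1])
            i += 2  # skip both words of the merged pair
        else:
            merged.append(sentence[i])
            i += 1
    return merged


# Toy corpus (hypothetical); each sentence is already tokenized.
corpus = [
    "i use microsoft excel daily".split(),
    "microsoft excel handles large spreadsheets".split(),
    "she learned microsoft excel quickly".split(),
    "he works as an electronic engineer".split(),
    "the electronic engineer fixed the board".split(),
]

phrases = find_phrases(corpus)
merged_corpus = [merge_phrases(s, phrases) for s in corpus]
print(merged_corpus[0])
```

Training word2vec on the merged corpus then yields one vector per merged token. Running the detection a second time on the merged output would allow phrases longer than two words to form.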

If you want a Python implementation, take a look at the Phrases class in gensim.models.phrases. It implements the phrase detection described in that paper.
