How can I get semantic word embeddings for compound terms?

I need to build semantic embeddings for compound terms like "electronic engineer" or "microsoft excel". One approach would be to take a standard pretrained model and average the word vectors, but since I have a corpus from my domain, is there a better approach?

To be more precise:

The data I have is a corpus of millions of documents. Each document is about half a page long and contains these compound terms. However, there may be compound terms that do not appear in the corpus.

Thanks



If you want an exact answer, please ask a precise question, i.e. describe what data you have and what exactly you want.

That said, in general you need a dataset of texts that contain these compound terms. How to treat compound terms is a whole research field in itself, but since you're talking about semantic word embeddings, I suggest you take a look at the paper "Distributed Representations of Words and Phrases and their Compositionality". There, the same people who introduced word2vec describe a simple method for going from word representations to phrase representations, which incidentally gives a way to merge compound terms into single tokens. The words "microsoft excel" become "microsoft_excel" and get their own embedding.
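To make the idea concrete, here is a minimal pure-Python sketch of the bigram scoring from that paper, score(a, b) = (count(a b) - delta) / (count(a) * count(b)), followed by a merge step for pairs scoring above a threshold. The function names, the toy corpus, and the delta and threshold values are my own illustrative choices, not anything prescribed by the paper.

```python
from collections import Counter


def score_bigrams(sentences, delta=1.0):
    """Score each adjacent word pair: (count(ab) - delta) / (count(a) * count(b))."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return {
        (a, b): (n - delta) / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }


def merge_phrases(tokens, scores, threshold):
    """Join adjacent pairs whose score exceeds the threshold into one token."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            merged.append("_".join(pair))
            i += 2  # skip both words of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy corpus (hypothetical); a real one would be your tokenized documents.
corpus = [
    ["i", "use", "microsoft", "excel", "daily"],
    ["microsoft", "excel", "is", "a", "spreadsheet"],
    ["she", "opened", "microsoft", "excel", "again"],
    ["an", "electronic", "engineer", "uses", "microsoft", "excel"],
]

scores = score_bigrams(corpus)
print(merge_phrases(["she", "uses", "microsoft", "excel"], scores, threshold=0.1))
```

On a corpus this small, "electronic engineer" occurs only once and is discounted to a score of zero by delta, so it is not merged; with millions of documents, frequent domain terms should clear a sensibly chosen threshold. After merging, you train word2vec on the rewritten corpus as usual.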

If you want a Python implementation of that, take a look at the Phrases class in the gensim.models.phrases module. It implements the phrase detection described in that paper.
