How to create word2vec for phrases and then calculate cosine similarity

I just started using word2vec and don't know how to create vectors (using word2vec) for two lists, each containing a mix of single words and phrases, or how to then calculate the cosine similarity between these two lists.

For example:

list1 = ['blogs', 'vmware', 'server', 'virtual', 'oracle update', 'virtualization', 'application', 'infrastructure', 'management']
list2 = ['microsoft visual studio', 'desktop virtualization', 'microsoft exchange server', 'cloud computing', 'windows server 2008']

Any help would be appreciated.

Topic data-analysis cosine-distance word2vec python

Category Data Science

word2vec produces one vector per token, so you cannot apply it directly to a multi-word phrase. You should use something like doc2vec, which gives a vector for each phrase:

phrase = model.infer_vector(['microsoft', 'visual', 'studio'])

You can also average or sum the word2vec vectors of the words in each phrase (here w2v(word) stands for looking up a word's vector in a trained model), e.g.

phrase = w2v('microsoft') + w2v('visual') + w2v('studio')

This way, a phrase vector has the same length as a word vector, so the two can be compared directly. Still, methods like doc2vec are generally better than a simple average or sum. Finally, you can compare each word in the first list to every phrase in the second list and pick the closest phrase.
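As a rough illustration of the averaging approach and the word-to-phrase comparison described above, here is a minimal sketch in plain Python. The three-dimensional vectors in TOY_VECTORS are invented for the example and are not real word2vec output; the helper names (phrase_vector, cosine, closest_phrase) are also hypothetical. With a trained gensim model you would look the vectors up via model.wv[word] instead.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only.
# Real vectors would come from a trained model (e.g. gensim's model.wv[word]).
TOY_VECTORS = {
    "server": [0.9, 0.1, 0.0],
    "virtualization": [0.1, 0.9, 0.1],
    "desktop": [0.2, 0.6, 0.3],
    "microsoft": [0.5, 0.2, 0.6],
    "exchange": [0.6, 0.0, 0.5],
    "cloud": [0.3, 0.7, 0.2],
    "computing": [0.2, 0.8, 0.1],
}


def phrase_vector(phrase, vectors):
    """Average the vectors of the known words in a phrase (None if no word is known)."""
    known = [vectors[w] for w in phrase.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_phrase(item, candidates, vectors):
    """Return (candidate, similarity) for the candidate most similar to item."""
    query = phrase_vector(item, vectors)
    if query is None:
        return None, None
    best, best_sim = None, None
    for cand in candidates:
        vec = phrase_vector(cand, vectors)
        if vec is None:
            continue  # skip phrases made up entirely of unknown words
        sim = cosine(query, vec)
        if best_sim is None or sim > best_sim:
            best, best_sim = cand, sim
    return best, best_sim


list1 = ["server", "virtualization"]
list2 = ["desktop virtualization", "microsoft exchange server", "cloud computing"]
for item in list1:
    print(item, "->", closest_phrase(item, list2, TOY_VECTORS))
```

Words missing from the vocabulary are simply skipped here, which is one possible policy among several. If you only need a single similarity score between two whole word lists, gensim's KeyedVectors also provides n_similarity, which compares the mean vectors of the two lists.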

Note that a phrase like "cloud computing" has a completely different meaning from the word "cloud". Such phrases, especially frequent ones, are therefore better treated as a single word, e.g.

phrase = w2v('cloud_computing')
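To get tokens such as cloud_computing into the training data, the phrases have to be joined before the model is trained. The sketch below assumes you already have a set of known phrases (KNOWN_PHRASES is a made-up example) and greedily merges the longest match at each position; merge_phrases is a hypothetical helper, not part of any library. gensim's Phrases class can learn such collocations from a corpus instead of relying on a fixed list.

```python
def merge_phrases(tokens, phrases, max_len=3):
    """Join the longest known phrase starting at each position with underscores."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word phrases.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                merged.append("_".join(candidate))
                i += n
                break
        else:
            # No phrase starts here: keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical phrase inventory, for illustration only.
KNOWN_PHRASES = {("cloud", "computing"), ("microsoft", "visual", "studio")}

tokens = "we moved microsoft visual studio builds to cloud computing".split()
print(merge_phrases(tokens, KNOWN_PHRASES))
```

The merged token lists can then be fed to word2vec training like any other sentences, so each phrase gets its own vector.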

Extra directions:

  1. Here is an answer by Astariul on Stack Overflow that uses a function from the word2vec package to calculate the similarity between two sets of words.

  2. Take a look at fastText, which works better when there are many misspelled or out-of-vocabulary words.
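The reason fastText copes with misspelled and unseen words is that it builds a word's vector from its character n-grams (3 to 6 characters by default, with < and > marking the word boundaries), so two spellings that share most n-grams end up with similar vectors. The sketch below only illustrates the n-gram extraction; char_ngrams is a hypothetical helper, not the fastText API.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


print(char_ngrams("where", 3, 3))

# Two spellings of the same word share many n-grams.
shared = set(char_ngrams("virtualization")) & set(char_ngrams("virtualisation"))
print(len(shared), "shared n-grams, e.g.", sorted(shared)[:5])
```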


Vector representations of phrases (called term vectors) are used in projects such as search result optimization and question answering.

A textbook example is "Chinese river" ~ {"Yangtze_River", "Qiantang_River"}.

The example above identifies phrases based on nouns mentioned in the Freebase DB. There are alternatives, such as:

  1. Identify all nouns and other phrases based on POS tagging
  2. Identify all bi-grams, tri-grams

Then filter the list above based on usage (e.g. only retain terms that have been used at least 500 times in a large corpus such as Wikipedia).
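The bi-gram/tri-gram option combined with the usage filter can be sketched as follows. This is a minimal illustration, assuming the corpus is already tokenized into lists of words; frequent_ngrams is a hypothetical helper, the three-sentence corpus is made up, and min_count is set to 2 only so the toy data produces a result (the text above suggests something like 500 for a Wikipedia-sized corpus).

```python
from collections import Counter


def frequent_ngrams(sentences, sizes=(2, 3), min_count=2):
    """Count all n-grams of the given sizes and keep those seen at least min_count times."""
    counts = Counter()
    for tokens in sentences:
        for n in sizes:
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {ngram: c for ngram, c in counts.items() if c >= min_count}


# Tiny made-up corpus, for illustration only.
corpus = [
    "cloud computing lowers cost".split(),
    "we use cloud computing daily".split(),
    "desktop virtualization and cloud computing".split(),
]
print(frequent_ngrams(corpus))
```

Raw frequency alone will also keep uninformative pairs such as "of the" on a real corpus, so in practice it is usually combined with POS filtering or a collocation score.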

Once the terms have been identified, the word vector algorithm works as is:

  1. Train word vector model
  2. Concatenate phrases into single tokens and retrain the model
  3. Merge the two models

Following patent from Google has more details

Other papers that have examples of domains where term vectors have been evaluated / used :

