I am working with a data set of more than 100,000 records. This is what the data looks like:

email_id    cust_id campaign_name
123         4567     World of Zoro
123         4567     Boho XYZ
123         4567     Guess ABC
234         5678     Anniversary X
234         5678     World of Zoro
234         5678     Fathers day
234         5678     Mothers day
345         7890     Clearance event
345         7890     Fathers day
345         7890     Mothers day
345         7890     Boho XYZ
345         7890     Guess ABC
345         7890     Sale

I am trying to understand the campaign sequence and find the next likely campaign for each customer.

Assume I have processed my data and stored it in 'camp'.
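For concreteness, here is a minimal sketch of how 'camp' might be built from the rows above. It assumes the rows are already in chronological order and that one sequence per cust_id is wanted; the `rows` literal and the helper name `build_sequences` are made up for this illustration.

```python
from collections import defaultdict

# Sample rows as (email_id, cust_id, campaign_name), in the order shown above.
rows = [
    (123, 4567, "World of Zoro"),
    (123, 4567, "Boho XYZ"),
    (123, 4567, "Guess ABC"),
    (234, 5678, "Anniversary X"),
    (234, 5678, "World of Zoro"),
    (234, 5678, "Fathers day"),
    (234, 5678, "Mothers day"),
    (345, 7890, "Clearance event"),
    (345, 7890, "Fathers day"),
    (345, 7890, "Mothers day"),
    (345, 7890, "Boho XYZ"),
    (345, 7890, "Guess ABC"),
    (345, 7890, "Sale"),
]


def build_sequences(rows):
    """Group campaign names by cust_id, keeping the original row order."""
    by_customer = defaultdict(list)
    for _email_id, cust_id, campaign in rows:
        by_customer[cust_id].append(campaign)
    # Dicts keep insertion order, so customers come out in first-seen order.
    return list(by_customer.values())


camp = build_sequences(rows)
```

Each campaign name stays a single token here, so a multi-word name such as 'World of Zoro' is treated as one unit rather than three words.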

With Word2Vec-

from gensim.models import Word2Vec

model = Word2Vec(sentences=camp, size=100, window=4, min_count=5, workers=4, sg=0)

The problem with this model is that it accepts tokens and, when queried for similarities, returns tokens with similarity scores.

Word2Vec accepts this form of input-


And gives this form of output -


Since I want to predict the campaign that occurs most frequently in combination with a target campaign, I was wondering if there is any way I can give the input below to the model and get campaign names in the output.

My input would be -

[['World of Zoro','Boho XYZ','Guess ABC'],
 ['Anniversary X','World of Zoro','Fathers day','Mothers day'],
 ['Clearance event','Fathers day','Mothers day','Boho XYZ','Guess ABC','Sale']]

Output -

model.wv.most_similar('World of Zoro')
[Sale,0.98],[Mothers day,0.97]

I am also not sure whether Word2Vec or any similar algorithm has functionality that can help find the next possible campaign for individual users.

  1. Word2Vec operates on words, whereas you want to compare 'texts' (series of words of varied length). For that, doc2vec might be more appropriate.

  2. You have very short 'texts' (names of campaigns), so generating embeddings from them (and only from them) won't give great results. You could start with some pretrained vectors, but even then you probably won't achieve much more. But the question is:

  3. It's not clear from your examples and explanation what kind of text you want to find the most similar campaign name for. On the one hand, you write about finding the campaign that occurs most frequently with a given word; in that case you could simply compute statistics of words and campaigns. On the other hand, in your example you pass a whole text; in that case the points above apply, and you would find the most similar text using text vectors.
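The 'statistics' option from point 3 can be sketched in plain Python. This assumes 'camp' holds one ordered list of campaign names per customer, as in the question. The function names `next_campaign_counts` and `predict_next` are made up for this illustration, and counting only the campaign that immediately follows is just one possible choice.

```python
from collections import Counter, defaultdict

# Sequences copied from the question: one ordered list per customer.
camp = [
    ["World of Zoro", "Boho XYZ", "Guess ABC"],
    ["Anniversary X", "World of Zoro", "Fathers day", "Mothers day"],
    ["Clearance event", "Fathers day", "Mothers day", "Boho XYZ", "Guess ABC", "Sale"],
]


def next_campaign_counts(sequences):
    """Count how often each campaign is immediately followed by another."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pair every campaign with the one directly after it.
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, campaign, top_n=3):
    """Return up to top_n (campaign, count) pairs seen right after `campaign`."""
    return counts.get(campaign, Counter()).most_common(top_n)


counts = next_campaign_counts(camp)
print(predict_next(counts, "Fathers day"))
```

With only three sample sequences the counts are tiny. On the full data set the same counts could be normalised into frequencies, or extended to look more than one step ahead.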


