How to train custom word2vec embeddings to find related articles?

I am a beginner in machine learning. My project is to build an AI-based search engine that shows related articles when a user searches on a website. For this I decided to train my own word embeddings.

I found two methods for this (there is a small sketch of both after the list):

  • One is to train a network to predict the next word (i.e. inputs = [the quick, the quick brown, the quick brown fox] and outputs = [brown, fox, lazy]).
  • The other is to train on pairs of nearby words (i.e. [brown, fox], [brown, quick]).
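
To make the two methods concrete, here is a rough Python sketch of the training examples I mean, built from one toy sentence (the tokenization and the window size are only for illustration):

    # Toy sentence; real training would use a whole corpus of articles.
    tokens = "the quick brown fox".split()

    # Method 1: next-word prediction
    # (input = the words so far, output = the word that follows).
    next_word_examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
    # -> [(['the'], 'quick'), (['the', 'quick'], 'brown'),
    #     (['the', 'quick', 'brown'], 'fox')]

    # Method 2: pairs of nearby words (skip-gram style), window of 1 here.
    window = 1
    pair_examples = [
        (tokens[i], tokens[j])
        for i in range(len(tokens))
        for j in range(max(0, i - window), min(len(tokens), i + window + 1))
        if i != j
    ]
    # -> [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ...]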

Which method should I use? And after training, how do I convert a sentence into a single vector so that I can apply cosine similarity? For example, the sentence "the quick brown fox" gives me 4 vectors; how should I combine them into one vector that I can feed, together with another sentence's vector, to cosine similarity (which takes only one vector per side)?

Tags: embeddings, word-embeddings, nlp



I find your question a bit convoluted, so I will answer with the following bullet points:

  • Train your own word embeddings: There are many implementations out there; gensim is one (see the first sketch after this list).
  • Find related articles: On that point, without being an expert, I would suggest doing some research on topic modelling. There are a lot of libraries for that as well (a minimal example follows below).
  • Word embeddings to sentence embeddings: This step is not as straightforward, because the semantics change when you simply combine word vectors. You can use Word Mover's Distance, or one of the numerous methods that train sentence embeddings in a supervised or unsupervised way (the last sketch below shows a simple averaging baseline and WMD).
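
For the first point, here is a minimal sketch of training word2vec with gensim (assuming gensim 4.x; the corpus, vector size and other parameters are placeholders you would tune on your own articles). Note that gensim's sg flag switches between skip-gram, which matches your "nearest words" method, and CBOW; your next-word setup is closer to a language model and is not what word2vec implements:

    from gensim.models import Word2Vec

    # Each article must be tokenized into a list of words;
    # this two-"article" corpus is only a placeholder.
    corpus = [
        ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
        ["machine", "learning", "powers", "modern", "search", "engines"],
    ]

    model = Word2Vec(
        corpus,
        vector_size=100,  # dimensionality of the word vectors
        window=5,         # context window around each word
        min_count=1,      # keep rare words for this toy example
        sg=1,             # 1 = skip-gram, 0 = CBOW
        epochs=50,
    )

    print(model.wv["fox"])                       # one vector per word
    print(model.wv.most_similar("fox", topn=3))  # nearest neighbours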
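
For the second point, one common topic-modelling route is LDA, which gensim also ships. A sketch under the same assumptions (num_topics and passes are arbitrary here):

    from gensim import corpora
    from gensim.models import LdaModel

    # Placeholder tokenized articles.
    texts = [
        ["search", "engine", "index", "query", "ranking"],
        ["fox", "dog", "animal", "forest"],
    ]

    dictionary = corpora.Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(text) for text in texts]

    lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)

    # Topic distribution of the first article; articles whose
    # distributions are close can be treated as related.
    print(lda.get_document_topics(bow_corpus[0]))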
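
On the third point, and on the cosine-similarity part of your question: the simplest (and lossy) baseline is to average a sentence's word vectors into a single vector and compare those with cosine similarity. Word Mover's Distance skips that collapse and compares the sets of word vectors directly; gensim exposes it as wmdistance, which needs an extra optimal-transport package installed. A sketch reusing the model from the first snippet:

    import numpy as np

    def sentence_vector(tokens, wv):
        # Average the vectors of the in-vocabulary words (crude baseline).
        vectors = [wv[t] for t in tokens if t in wv]
        if not vectors:
            return np.zeros(wv.vector_size)
        return np.mean(vectors, axis=0)

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    s1 = "the quick brown fox".split()
    s2 = "the lazy dog".split()

    v1 = sentence_vector(s1, model.wv)
    v2 = sentence_vector(s2, model.wv)
    print(cosine_similarity(v1, v2))  # higher = more similar

    # Word Mover's Distance: lower = more similar; no single
    # sentence vector is ever built.
    print(model.wv.wmdistance(s1, s2))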
