Semantic network using word2vec
I have thousands of headlines and would like to build a semantic network using word2vec, specifically the pretrained Google News vectors. My sentences look like this:
Titles
Dogs are humans’ best friends
A dog died because of an accident
You can clean dogs’ paws using natural products.
A cat was found in the kitchen
And so on.
What I would like to do is find specific patterns in this data, e.g. similarity between topics on dogs and cats, using semantic networks. Could you give me some advice on how to do this?
Code:
import pandas as pd
import gensim
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.manifold import TSNE
main_data.Titles = np.where(main_data.Titles.isnull(),'NA', main_data.Titles)
article_titles = main_data['Titles']
titles_list = [title for title in article_titles]
big_title_string = ' '.join(titles_list)
tokens = word_tokenize(big_title_string)
words = [word.lower() for word in tokens if word.isalpha()]
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]
model = gensim.models.KeyedVectors.load_word2vec_format('path/GoogleNews-vectors-negative300.bin', binary = True) 
model.vector_size
vector_list = [model[word] for word in words if word in model.vocab]
words_filtered = [word for word in words if word in model.vocab]
word_vec_zip = zip(words_filtered, vector_list)
word_vec_dict = dict(word_vec_zip)
df = pd.DataFrame.from_dict(word_vec_dict, orient='index')
tsne = TSNE(n_components = 2, init = 'random', random_state = 10, perplexity = 100)
tsne_df = tsne.fit_transform(df[:400])
sns.set()
fig, ax = plt.subplots(figsize = (11.7, 8.27))
sns.scatterplot(x = tsne_df[:, 0], y = tsne_df[:, 1], alpha = 0.5)
from adjustText import adjust_text
texts = []
words_to_plot = list(np.arange(0, 400, 10))
for word in words_to_plot:
    texts.append(plt.text(tsne_df[word, 0], tsne_df[word, 1], df.index[word], fontsize = 14))
    
adjust_text(texts, force_points = 0.4, force_text = 0.4, 
            expand_points = (2,1), expand_text = (1,2),
            arrowprops = dict(arrowstyle = '-', color = 'black', lw = 0.5))
plt.show()
However, I cannot understand how to interpret the results. I think they are wrong, and this is probably not a good approach for building a semantic network. Maybe I am missing something. For instance, the code still keeps stopwords after this line:
words = [word for word in words if word not in stop_words]
The resulting plot is difficult to read and explain (at least, for me).
I would greatly appreciate any tips and advice on how to build a semantic network that shows semantic similarity between titles.
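To make the goal concrete, here is a minimal sketch of the kind of title-level network I have in mind. It is not a definitive method. Each title becomes a node, represented by the average of its word vectors, and two titles are linked when their cosine similarity reaches a threshold. The 3-dimensional `toy_vectors`, the helper names (`title_vector`, `cosine`, `build_network`) and the threshold of 0.9 are all assumptions made for illustration. In practice the vectors would come from the Google News model.

```python
import math
from itertools import combinations

# Hypothetical 3-d vectors standing in for the 300-d Google News vectors.
# With the real model, the lookup would be model[word] instead.
toy_vectors = {
    "dogs": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "paws": [0.7, 0.3, 0.1],
    "cat": [0.6, 0.4, 0.0],
    "kitchen": [0.0, 0.2, 0.9],
    "accident": [0.1, 0.9, 0.2],
}

# Titles after lowercasing and stopword removal (tokens only).
titles = [
    ["dogs", "humans", "best", "friends"],
    ["dog", "died", "accident"],
    ["clean", "dogs", "paws", "natural", "products"],
    ["cat", "found", "kitchen"],
]


def title_vector(tokens, vectors):
    """Average the vectors of the tokens that are in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no token of this title has a vector
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_network(titles, vectors, threshold):
    """Return {(i, j): similarity} for title pairs at or above threshold."""
    vecs = [title_vector(tokens, vectors) for tokens in titles]
    edges = {}
    for i, j in combinations(range(len(titles)), 2):
        if vecs[i] is None or vecs[j] is None:
            continue  # skip titles with no known words
        sim = cosine(vecs[i], vecs[j])
        if sim >= threshold:
            edges[(i, j)] = sim
    return edges


# The threshold is an arbitrary choice for this toy data.
edges = build_network(titles, toy_vectors, threshold=0.9)
print(edges)
```

The resulting edge dictionary could then be passed to a graph library for drawing or community detection. Lowering the threshold gives a denser network, so the value would need tuning on the real headlines.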
Topic semantic-similarity word2vec neural-network nlp python
Category Data Science
