Document Similarity to List of Words in Sentiment Analysis
How would you go about finding document similarity to a list of words in Sentiment Analysis?
I'm looking to find document similarity to multiple lists of words in sentiment analysis. I have been working on this with my intern, but he is sorting by the sentiment average to find the most similar score for each list, or for combinations of the lists of words. I assume this isn't the best approach. I was thinking similarity should be a separate score, as described below, and I will attempt that.
Suppose he wanted a separate similarity score for each document against each list: for example, a set of moods, themes, or feelings, with 10 .txt files that each contain words fitting one theme or mood, like the one below.
I have been learning NLP on the side to help him, and now I want to attempt this myself. Any suggestions or feedback are greatly welcome.
I was thinking: should I instead use doc2vec to get a similarity score for each list separately, and just use the sentiment score as another score?
Happy.txt
cheerful
contented
delighted
ecstatic
elated
glad
joyful
joyous
jubilant
lively
merry
overjoyed
peaceful
pleasant
pleased
thrilled
upbeat
blessed
blest
blissful
blithe
can't complain
captivated
chipper
Each column is a thing (movie, product, celebrity, whatever), and each thing has been reviewed.
Example review:
thing1 was freaky awesome we have to do that again!!!
Each thing has a bunch of text documents reviewing it, each either positive, neutral, or negative, and has a sentiment score.
It then gets a similarity score against each .txt word list.
So each thing would have a separate score for each mood and for the sentiment, with any categorical features one-hot encoded. He would then be able to find the thing most similar to a mood, or to almost any combination of them.
Score     | Thing1 | Thing2 | Thing3
Happy     | 0.857  | 0.126  | 0.836
Sad       | 0.221  | 0.999  | 0.236
Romantic  | 0.765  | 0.126  | 0.657
Humorous  | 0.231  | 0.986  | 0.353
Sentiment | 0.987  | 0.237  | 0.736
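One possible way to produce per-mood scores like those in the table is a plain bag-of-words cosine similarity between a review and each word list. This is only a minimal sketch, not doc2vec: the mood lists, the sample review, and the helper names `tokenize`, `cosine_similarity`, and `mood_scores` are all hypothetical, and in practice each list would be read from a file such as Happy.txt.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two bag-of-words Counters (0.0 if either is empty)."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mood_scores(document, mood_lists):
    """Score one document against every mood word list."""
    doc_counts = Counter(tokenize(document))
    return {
        mood: cosine_similarity(doc_counts, Counter(tokenize(" ".join(words))))
        for mood, words in mood_lists.items()
    }


# Hypothetical, shortened word lists used only for illustration.
mood_lists = {
    "Happy": ["cheerful", "delighted", "glad", "thrilled"],
    "Sad": ["gloomy", "miserable", "unhappy"],
}
review = "We were delighted and thrilled, not gloomy at all."
scores = mood_scores(review, mood_lists)
print(scores)
```

Exact word matching ignores synonyms and negation ("not gloomy" still counts toward Sad), which is one reason an embedding-based score such as doc2vec may be worth trying instead.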
**I also one-hot encoded the categorical features**
Category can be one or more
Category A  | 1 | 0 | 1
Category B  | 0 | 1 | 1
Category C  | 1 | 0 | 0
Price can only be one
Price 1-5   | 1 | 0 | 0
Price 5-10  | 1 | 0 | 0
Price 11-20 | 0 | 1 | 1
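The encoding above could be sketched in plain Python roughly as follows. The helper `one_hot` and the `things` dictionary are hypothetical names, and Thing1's single price band is an assumption, since the table marks two bands for it even though price can only be one.

```python
CATEGORIES = ["Category A", "Category B", "Category C"]
PRICE_BANDS = ["Price 1-5", "Price 5-10", "Price 11-20"]


def one_hot(values, vocabulary):
    """Return a 0/1 list with a 1 for every vocabulary entry found in values."""
    present = set(values)
    return [1 if item in present else 0 for item in vocabulary]


# Category membership follows the table; Thing1's price band is assumed.
things = {
    "Thing1": {"categories": ["Category A", "Category C"], "price": "Price 1-5"},
    "Thing2": {"categories": ["Category B"], "price": "Price 11-20"},
    "Thing3": {"categories": ["Category A", "Category B"], "price": "Price 11-20"},
}

encoded = {
    name: one_hot(info["categories"], CATEGORIES) + one_hot([info["price"]], PRICE_BANDS)
    for name, info in things.items()
}

for name, vector in encoded.items():
    print(name, vector)
```

The category part may contain several ones (multi-label), while the price part contains exactly one, matching the "one or more" versus "only one" distinction.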
**The happiest**
Thing1 0.857
Thing3 0.836
Thing2 0.126
**The saddest**
Thing2 0.999
Thing3 0.236
Thing1 0.221
**Most similar to Thing3** (columns: Happy, Sad, Romantic, Humorous)
Thing3 0.836 0.236 0.657 0.353
Thing1 0.857 0.221 0.765 0.231
Thing2 0.126 0.999 0.126 0.986
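The rankings above could be reproduced along these lines, using the mood scores from the table. Cosine similarity is an assumption here, since the post does not say which measure ranks things against Thing3, and `rank_by_mood` and `most_similar` are hypothetical helper names.

```python
import math

MOODS = ["Happy", "Sad", "Romantic", "Humorous"]

# Mood scores per thing, in the order given by MOODS (values from the table).
scores = {
    "Thing1": [0.857, 0.221, 0.765, 0.231],
    "Thing2": [0.126, 0.999, 0.126, 0.986],
    "Thing3": [0.836, 0.236, 0.657, 0.353],
}


def rank_by_mood(mood):
    """Names of all things, sorted from highest to lowest score for one mood."""
    idx = MOODS.index(mood)
    return sorted(scores, key=lambda name: scores[name][idx], reverse=True)


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(target):
    """Other things as (name, similarity) pairs, most similar to target first."""
    pairs = [
        (name, cosine(scores[target], vector))
        for name, vector in scores.items()
        if name != target
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


print(rank_by_mood("Happy"))
print(rank_by_mood("Sad"))
print(most_similar("Thing3"))
```

The sentiment score and the one-hot columns could be appended to each vector in the same way, although mixing them with mood scores may call for some weighting or scaling.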
Using doc2vec, I did something similar with a bunch of Disney Princess books, which led me to this train of thought. Hopefully it is the right one; I want to help him finish before his internship ends.
Doc2Vec
import codecs
import gensim

# Read and tag each book into disney_corpus
disney_corpus = []
for book_filename in book_filenames:
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        disney_corpus.append(
            gensim.models.doc2vec.TaggedDocument(
                gensim.utils.simple_preprocess(  # Clean the text with simple_preprocess
                    book_file.read()),
                ["{}".format(book_filename)]))  # Tag each book with its filename
# Larger values for epochs should improve the model's accuracy.
model = gensim.models.Doc2Vec(vector_size=300,
                              min_count=3,
                              epochs=100)
model.build_vocab(disney_corpus)
# gensim 3.x API; in gensim 4+, use len(model.wv.key_to_index) and model.dv instead.
print("model's vocabulary length:", len(model.wv.vocab))
model's vocabulary length: 1838
model.train(disney_corpus, epochs=model.epochs, total_examples=model.corpus_count)
model.docvecs.most_similar(0) # Aladdin
[('disney\\TheLittleMermaid.rtf', 0.07826487720012665),
('disney\\Mulan.rtf', -0.035049568861722946),
('disney\\BeautyAndTheBeast.rtf', -0.08333050459623337)]
model.docvecs.most_similar(1) #BeautyAndTheBeast
[('disney\\TheLittleMermaid.rtf', 0.06666166335344315),
('disney\\Mulan.rtf', 0.02150556817650795),
('disney\\Aladdin.rtf', -0.08333051204681396)]
model.docvecs.most_similar(2) # Mulan
[('disney\\TheLittleMermaid.rtf', 0.12576593458652496),
('disney\\BeautyAndTheBeast.rtf', 0.02150557190179825),
('disney\\Aladdin.rtf', -0.035049568861722946)]
model.docvecs.most_similar(3) # TheLittleMermaid
[('disney\\Mulan.rtf', 0.12576593458652496),
('disney\\Aladdin.rtf', 0.07826487720012665),
('disney\\BeautyAndTheBeast.rtf', 0.06666165590286255)]
Topic doc2vec similar-documents nlp
Category Data Science