Document Similarity to List of Words in Sentiment Analysis

Question

JohnT

August 17, 2020, 17:40

How would you go about finding document similarity to a list of words in Sentiment Analysis?

I am looking to find document similarity to multiple lists of words as part of a sentiment analysis project. I had been working on this with my intern, but he is sorting by the sentiment average to find the most similar score for each list, or for combinations of the lists. I assume this isn't the best approach. I think the similarity should be a separate score, as described below, and I will attempt it.

Suppose he wants a separate similarity score for each document against each list: for example, a set of moods, themes, or feelings, stored as 10 .txt files, each containing words that fit one theme or mood, like the one below.

I have been learning NLP on the side to help him, and now I want to attempt this myself. Any suggestions or feedback are greatly welcome.

I was thinking: should I instead use doc2vec to get a similarity score separately, and just use the sentiment score as another score?

Happy.txt

cheerful
contented
delighted
ecstatic
elated
glad
joyful
joyous
jubilant
lively
merry
overjoyed
peaceful
pleasant
pleased
thrilled
upbeat
blessed
blest
blissful
blithe
can't complain
captivated
chipper

Each column is a thing (movie, product, celebrity, whatever), and each thing has been reviewed.

Example:

thing1 was freaky awesome we have to do that again!!!

Each thing is a bunch of text documents reviewing it as positive, neutral, or negative, and it has a sentiment score.

Then it gets a similarity score to each .txt word list.
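As a rough sketch of what a per-list similarity score could look like, the snippet below scores one review against each mood word list with a plain bag-of-words cosine similarity. This is a simpler stand-in technique, not doc2vec, and the names `tokenize`, `cosine_similarity`, and `mood_scores`, the shortened Happy list, the Sad words, and the sample review are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two word-count mappings."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mood_scores(review, mood_lists):
    """Score one review against every mood word list."""
    review_counts = Counter(tokenize(review))
    return {
        mood: cosine_similarity(review_counts, Counter(tokenize(" ".join(words))))
        for mood, words in mood_lists.items()
    }


# Hypothetical word lists; in practice each would be read from its .txt file.
mood_lists = {
    "Happy": ["cheerful", "delighted", "glad", "thrilled"],
    "Sad": ["gloomy", "miserable", "unhappy"],
}

review = "I was delighted and thrilled, we have to do that again!"
scores = mood_scores(review, mood_lists)
print(scores)
```

Exact word matching only gives credit when a review uses a listed word verbatim, so an embedding-based score such as doc2vec may capture more of the mood.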


So each thing would have a separate score for each mood and for the sentiment, plus one-hot encoding for any categorical features. Then he would be able to get the thing most similar to a mood, or to almost any combination of them.

          | Thing1 | Thing2 | Thing3
Happy     | 0.857  | 0.126  | 0.836
Sad       | 0.221  | 0.999  | 0.236
Romantic  | 0.765  | 0.126  | 0.657
Humorous  | 0.231  | 0.986  | 0.353

Sentiment | 0.987  | 0.237  | 0.736

**I also one-hot encoded the category features**

A thing can belong to one or more categories:

           | Thing1 | Thing2 | Thing3
Category A | 1      | 0      | 1
Category B | 0      | 1      | 1
Category C | 1      | 0      | 0

A thing can fall in only one price range:

            | Thing1 | Thing2 | Thing3
Price 1-5   | 1      | 0      | 0
Price 5-10  | 1      | 0      | 0
Price 11-20 | 0      | 1      | 1
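The encoding above could be produced along these lines. This is only a sketch in plain Python: the helper names `multi_hot` and `one_hot` and the `things` dictionary are hypothetical, and Thing1 is assumed to sit in the 1-5 price range only, since a thing is meant to have a single price range.

```python
CATEGORIES = ["A", "B", "C"]
PRICE_RANGES = ["1-5", "5-10", "11-20"]


def multi_hot(labels, vocabulary):
    """Encode a set of labels: one slot per vocabulary entry, several may be 1."""
    return [1 if entry in labels else 0 for entry in vocabulary]


def one_hot(label, vocabulary):
    """Encode a single label: exactly one slot is 1."""
    if label not in vocabulary:
        raise ValueError("unknown label: {}".format(label))
    return [1 if entry == label else 0 for entry in vocabulary]


# Hypothetical metadata mirroring the example tables.
things = {
    "Thing1": {"categories": {"A", "C"}, "price": "1-5"},
    "Thing2": {"categories": {"B"}, "price": "11-20"},
    "Thing3": {"categories": {"A", "B"}, "price": "11-20"},
}

# Category slots first, then price-range slots.
features = {
    name: multi_hot(info["categories"], CATEGORIES)
    + one_hot(info["price"], PRICE_RANGES)
    for name, info in things.items()
}

for name, vector in features.items():
    print(name, vector)
```

Keeping the multi-label and single-label encoders separate makes the "only one price range" rule explicit, because `one_hot` cannot set more than one slot.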


**The happiest**

Thing1 0.857
Thing3 0.836
Thing2 0.126

**The saddest**

Thing2 0.999
Thing3 0.236
Thing1 0.221
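Once the per-mood scores exist, rankings like the two above are just a sort. A minimal sketch, assuming the scores are held in a nested dictionary (the layout and the name `rank_by_mood` are hypothetical), using the Happy and Sad values from the score table:

```python
mood_table = {
    "Happy": {"Thing1": 0.857, "Thing2": 0.126, "Thing3": 0.836},
    "Sad": {"Thing1": 0.221, "Thing2": 0.999, "Thing3": 0.236},
}


def rank_by_mood(table, mood):
    """Return (thing, score) pairs sorted from highest to lowest score."""
    return sorted(table[mood].items(), key=lambda item: item[1], reverse=True)


happiest = rank_by_mood(mood_table, "Happy")
saddest = rank_by_mood(mood_table, "Sad")

for thing, score in happiest:
    print(thing, score)
```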

**Most similar to Thing3**

       | Happy | Sad   | Romantic | Humorous
Thing3 | 0.836 | 0.236 | 0.657    | 0.353

Thing1 | 0.857 | 0.221 | 0.765    | 0.321
Thing2 | 0.126 | 0.999 | 0.126    | 0.986
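One way to turn "most similar to Thing3" into a number is to treat each thing's mood scores as a vector and compare vectors with cosine similarity. The sketch below assumes that choice of metric (others, such as Euclidean distance, would also work), and the names `profiles` and `most_similar` are hypothetical. The values are taken from the mood score table, in the order Happy, Sad, Romantic, Humorous.

```python
import math

# Mood score vectors: [Happy, Sad, Romantic, Humorous]
profiles = {
    "Thing1": [0.857, 0.221, 0.765, 0.231],
    "Thing2": [0.126, 0.999, 0.126, 0.986],
    "Thing3": [0.836, 0.236, 0.657, 0.353],
}


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(name, table):
    """Rank every other thing by similarity to the named thing."""
    target = table[name]
    others = [
        (other, cosine_similarity(target, vector))
        for other, vector in table.items()
        if other != name
    ]
    return sorted(others, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("Thing3", profiles)
for thing, similarity in ranking:
    print(thing, round(similarity, 3))
```

The sentiment score and the one-hot features could be appended to each vector in the same way, though features on different scales may need weighting before they are compared.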

Using doc2vec, I did something similar with a bunch of Disney Princess books, which led me to this train of thought. Hopefully it is the right train of thought; I want to help him finish before his internship ends.

Doc2Vec


import codecs

import gensim

# book_filenames is assumed to be a list of paths to the book files,
# defined earlier (not shown here).

# Read and tag each book into disney_corpus
disney_corpus = []
for book_filename in book_filenames:
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        disney_corpus.append(
            gensim.models.doc2vec.TaggedDocument(
                # Clean the text with simple_preprocess
                gensim.utils.simple_preprocess(book_file.read()),
                # Tag each book with its filename
                ["{}".format(book_filename)]))

# Larger values for epochs should improve the model's accuracy.
model = gensim.models.Doc2Vec(vector_size=300,
                              min_count=3,
                              epochs=100)

model.build_vocab(disney_corpus)

# model.wv.vocab and model.docvecs are the gensim 3.x names;
# gensim 4.x uses model.wv.key_to_index and model.dv instead.
print("model's vocabulary length:", len(model.wv.vocab))

model's vocabulary length: 1838

model.train(disney_corpus, epochs=100, total_examples=len(disney_corpus))


model.docvecs.most_similar(0) # Aladdin

[('disney\\TheLittleMermaid.rtf', 0.07826487720012665),
 ('disney\\Mulan.rtf', -0.035049568861722946),
 ('disney\\BeautyAndTheBeast.rtf', -0.08333050459623337)]


model.docvecs.most_similar(1) # BeautyAndTheBeast

[('disney\\TheLittleMermaid.rtf', 0.06666166335344315),
 ('disney\\Mulan.rtf', 0.02150556817650795),
 ('disney\\Aladdin.rtf', -0.08333051204681396)]

model.docvecs.most_similar(2) # Mulan

[('disney\\TheLittleMermaid.rtf', 0.12576593458652496),
 ('disney\\BeautyAndTheBeast.rtf', 0.02150557190179825),
 ('disney\\Aladdin.rtf', -0.035049568861722946)]

model.docvecs.most_similar(3) # TheLittleMermaid

[('disney\\Mulan.rtf', 0.12576593458652496),
 ('disney\\Aladdin.rtf', 0.07826487720012665),
 ('disney\\BeautyAndTheBeast.rtf', 0.06666165590286255)]

Topic doc2vec similar-documents nlp

Category Data Science
