Is summing a cosine similarity matrix a good way to determine overall similarity?

I'm trying to compare research abstracts for similarity, so I'm using word embeddings to convert each word into a 1x768 vector, turning an abstract into an embedding matrix of shape (#ofwords, 768). Computing cosine similarity between every pair of word vectors from two abstracts returns a matrix of shape (#ofwords1, #ofwords2), which I then sum up to get an overall score. What I'm wondering is whether summing all the values in a cosine similarity matrix is really a good way to determine overall similarity between two different texts. Is there a better, or less computationally expensive, way to do this?
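For concreteness, here is a minimal numpy sketch of the approach described above (random vectors stand in for the real 768-dimensional word embeddings):

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T  # shape (#ofwords1, #ofwords2)

rng = np.random.default_rng(0)
abstract1 = rng.normal(size=(120, 768))  # stand-in for a 120-word abstract
abstract2 = rng.normal(size=(95, 768))   # stand-in for a 95-word abstract

sim = cosine_similarity_matrix(abstract1, abstract2)
overall_score = sim.sum()  # the summed score in question
```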

Tags: cosine-distance, nlp



A similar but more advanced approach would be BERTScore. It also computes pairwise cosine similarity between (BERT) token embeddings, but instead of aggregating every entry it uses greedy matching: each token is paired only with the most similar token in the other text, and those best-match similarities are averaged into precision, recall, and F1 scores. (See Figure 1 of the BERTScore paper for an illustration of the greedy matching.)

However, it should be noted that BERTScore is designed to be used for paragraphs and not documents.
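As a rough sketch of that greedy matching, reusing a pairwise cosine-similarity matrix like the one in the question (the idf weighting from the paper is omitted here):

```python
import numpy as np

def bertscore_like(sim):
    """Greedy matching on a (#ofwords1, #ofwords2) cosine-similarity matrix:
    each token contributes only its single best match, instead of every
    entry being summed."""
    precision = sim.max(axis=1).mean()  # best match for each token of text 1
    recall = sim.max(axis=0).mean()     # best match for each token of text 2
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

The full method, including idf weighting, is also available as the `bert-score` package on PyPI.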

Another, more traditional approach would be doc2vec, which learns a single fixed-size vector per document.
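A minimal sketch with gensim's Doc2Vec (the corpus and hyperparameters below are toy values, purely illustrative):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; real abstracts would be properly tokenized
abstracts = [
    "we compare research abstracts using word embeddings".split(),
    "transformer models encode tokens as contextual vectors".split(),
]
corpus = [TaggedDocument(words, [i]) for i, words in enumerate(abstracts)]

# Toy hyperparameters; vector_size and epochs need tuning on real data
model = Doc2Vec(corpus, vector_size=64, min_count=1, epochs=40)

# Each abstract is summarized by a single fixed-size vector
vec1 = model.infer_vector(abstracts[0])
vec2 = model.infer_vector(abstracts[1])
```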


A computationally simpler way to compare documents is to calculate the cosine similarity between the averages of the word embeddings of each document. The average of the word embeddings is one way to summarize a document as a single vector, so each pairwise comparison becomes a single cosine computation instead of #ofwords1 × #ofwords2 of them.
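A sketch of this, assuming the same (#ofwords, 768) arrays as in the question:

```python
import numpy as np

def doc_vector(word_embeddings):
    """Summarize a (#ofwords, 768) embedding matrix as its mean vector."""
    return word_embeddings.mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
abstract1 = rng.normal(size=(120, 768))  # stand-ins for real embeddings
abstract2 = rng.normal(size=(95, 768))

similarity = cosine(doc_vector(abstract1), doc_vector(abstract2))
```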
