Is summing a cosine similarity matrix a good way to determine overall similarity?

I'm trying to compare research abstracts for similarity, so I'm using word embeddings to convert each word into a 1x768 vector, turning an abstract into an embedding matrix of shape (#ofwords, 768). Computing cosine similarity between every pair of word vectors from two abstracts returns a matrix of shape (#ofwords1, #ofwords2), which I then sum up to get an overall score. What I'm wondering is whether summing all the values in a cosine similarity matrix is really a good way to determine overall similarity between two different texts. Is there a better, or less computationally expensive, way to do this?
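For concreteness, here is a minimal numpy sketch of the approach described above (random vectors stand in for the real 768-dimensional word embeddings):

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T  # shape (#ofwords1, #ofwords2)

rng = np.random.default_rng(0)
abstract1 = rng.normal(size=(120, 768))  # stand-in for a 120-word abstract
abstract2 = rng.normal(size=(95, 768))   # stand-in for a 95-word abstract

sim = cosine_similarity_matrix(abstract1, abstract2)
overall_score = sim.sum()  # the summed score in question
```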

Tags: cosine-distance, nlp



A similar but more advanced approach would be BERTScore. It also computes pairwise cosine similarity between (BERT) token embeddings, but instead of aggregating every entry it uses greedy matching: each token is paired only with the most similar token in the other text, and those best-match similarities are averaged into precision, recall, and F1 scores. (See Figure 1 of the BERTScore paper for an illustration of the greedy matching.)

However, it should be noted that BERTScore is designed to be used for paragraphs and not documents.
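As a rough sketch of that greedy matching, reusing a pairwise cosine-similarity matrix like the one in the question (the idf weighting from the paper is omitted here):

```python
import numpy as np

def bertscore_like(sim):
    """Greedy matching on a (#ofwords1, #ofwords2) cosine-similarity matrix:
    each token contributes only its single best match, instead of every
    entry being summed."""
    precision = sim.max(axis=1).mean()  # best match for each token of text 1
    recall = sim.max(axis=0).mean()     # best match for each token of text 2
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

The full method, including idf weighting, is also available as the `bert-score` package on PyPI.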

Another, more traditional approach would be doc2vec, which learns a single fixed-size vector per document.
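A minimal sketch with gensim's Doc2Vec (the corpus and hyperparameters below are toy values, purely illustrative):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; real abstracts would be properly tokenized
abstracts = [
    "we compare research abstracts using word embeddings".split(),
    "transformer models encode tokens as contextual vectors".split(),
]
corpus = [TaggedDocument(words, [i]) for i, words in enumerate(abstracts)]

# Toy hyperparameters; vector_size and epochs need tuning on real data
model = Doc2Vec(corpus, vector_size=64, min_count=1, epochs=40)

# Each abstract is summarized by a single fixed-size vector
vec1 = model.infer_vector(abstracts[0])
vec2 = model.infer_vector(abstracts[1])
```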


A computationally simpler way to compare documents is to calculate the cosine similarity between the averages of the word embeddings of each document. The average of the word embeddings is one way to summarize a document as a single vector, so each pairwise comparison becomes a single cosine computation instead of #ofwords1 × #ofwords2 of them.
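A sketch of this, assuming the same (#ofwords, 768) arrays as in the question:

```python
import numpy as np

def doc_vector(word_embeddings):
    """Summarize a (#ofwords, 768) embedding matrix as its mean vector."""
    return word_embeddings.mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
abstract1 = rng.normal(size=(120, 768))  # stand-ins for real embeddings
abstract2 = rng.normal(size=(95, 768))

similarity = cosine(doc_vector(abstract1), doc_vector(abstract2))
```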
