Cosine similarity between sentence embeddings is always positive

I have a list of documents and I am looking for a) duplicates; b) documents that are very similar. To do so, I proceed as follows:

  1. Embed the documents using paraphrase-xlm-r-multilingual-v1.
  2. Calculate the cosine similarity between the vector embeddings (code below).

All the cosine similarity values I get are between 0 and 1. Why is that? Shouldn't I also get negative cosine similarity values? The sentence embeddings contain both positive and negative elements.

import numpy as np
from numpy.linalg import norm

num_docs = np.array(sentence_embedding).shape[0]

# Pairwise cosine similarity; the diagonal (self-similarity) is left at 0.
cos_sim = np.zeros([num_docs, num_docs])
for ii in range(num_docs):
    for jj in range(num_docs):
        if ii != jj:
            cos_sim[ii, jj] = np.dot(sentence_embedding[ii], sentence_embedding[jj]) / (norm(sentence_embedding[ii]) * norm(sentence_embedding[jj]))
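
The same matrix can also be computed without explicit loops; a minimal vectorized sketch, assuming sentence_embedding is a list of equal-length 1-D numpy arrays:

import numpy as np

emb = np.asarray(sentence_embedding)                        # shape: (num_docs, dim)
normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # unit-length rows
cos_sim = normed @ normed.T                                 # full cosine-similarity matrix
np.fill_diagonal(cos_sim, 0.0)                              # mimic the loop above, which skips ii == jj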


Disclaimer: this is a tentative explanation; it offers a possible answer, but it does not contain a proof.

First of all, contrary to some of the comments, cosine similarity is not always in the range $[0,1]$. That range holds if the vectors contain only non-negative values, but once negative values are allowed, negative cosine similarity is possible. Take, for example, the two vectors $(-1,1)$ and $(1,-1)$, which give a cosine similarity of $-1$ since they lie on the same line but point in opposite directions.
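
A two-line check of that example, just illustrating the formula with numpy:

import numpy as np
from numpy.linalg import norm

a, b = np.array([-1.0, 1.0]), np.array([1.0, -1.0])
print(np.dot(a, b) / (norm(a) * norm(b)))   # -1.0: same line, opposite directions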

Going back to the question, we should ask whether it is possible to have both positive and negative values in the vectors and still get only positive cosine similarities. The answer is yes: it is possible if the embedding vectors are contained within one nappe of a conical surface with its apex at the origin (see Wikipedia: Conical surface). Roughly speaking, if you rotate the positive orthant, you still get only positive cosine similarities.
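
A small sketch of that geometric picture: the vectors below have both positive and negative components, but they all stay within roughly 40 degrees of a shared axis, so every pairwise angle is below 90 degrees and every cosine similarity is positive (the axis direction and the 0.8 scaling are arbitrary choices for the illustration):

import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 200
axis = rng.standard_normal(d)
axis /= np.linalg.norm(axis)                    # shared unit axis of the cone

vecs = []
for _ in range(n):
    noise = rng.standard_normal(d)
    noise -= (noise @ axis) * axis              # keep only the part orthogonal to the axis
    noise *= 0.8 / np.linalg.norm(noise)        # tan(angle to axis) = 0.8 -> angle < 45 degrees
    vecs.append(axis + noise)
vecs = np.array(vecs)

normed = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
cos = normed @ normed.T
print("component range:", vecs.min(), vecs.max())   # mixed signs
print("minimum cosine similarity:", cos.min())      # strictly positive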

Why would that happen with paraphrase-xlm-r-multilingual-v1? The paper describing the model, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (Reimers and Gurevych), explains in the training details section that a pooling layer is applied on top of the BERT-like pre-trained model to obtain a fixed-size encoding. The default pooling mode is averaging (mean pooling). Suppose the token-level output vectors contain values spread over a range $[-a,a]$. Pooling those vectors by averaging moves the resulting vectors closer together, producing smaller angles and therefore pushing the cosine similarity towards positive values. Pooling with the max-tokens mode has a similar effect. This greatly increases the probability that the resulting embeddings yield only positive similarities, even though the embeddings themselves are allowed to contain both positive and negative values. As I said, I do not have a proof, but considering how pooling works, and when many outputs are pooled, this seems a logical consequence. It does not mean that negative similarities are impossible.

A way to verify this experimentally, if you have a reasonably large set of random sentences, would be to plot a histogram of their pairwise cosine similarities and visually check that there are few values near zero and that the frequencies increase as you move towards $1$. Again, this would only be a hint.
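
A sketch of that check, assuming you have a list named sentences and the sentence-transformers and matplotlib packages installed:

import numpy as np
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-xlm-r-multilingual-v1")
emb = model.encode(sentences)                               # numpy array, one row per sentence

normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
cos = normed @ normed.T
vals = cos[np.triu_indices_from(cos, k=1)]                  # off-diagonal pairs only

plt.hist(vals, bins=50, range=(-1, 1))
plt.xlabel("cosine similarity")
plt.ylabel("count")
plt.show()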

[Later edit]

I have run some experiments myself to address the insights provided by @albus_c (thank you).

First of all, I do not have sentences at hand and I do not use Python anyway, so I generated artificial data (vectors with random values drawn from a standard normal distribution) in a matrix whose rows are the instance vectors to be compared via cosine similarity. I noticed an interesting phenomenon: on average, the cosine similarity between random vectors covers a narrower range of absolute values as the length of the vectors increases.

[Figure: sampled cosine similarity vs. random vector size]

In the graph above we can see that for small vector sizes the empirical distribution covers the whole range $[-1,1]$, and that this range shrinks as the vector size grows. This matters because the role of a pooling layer is to reduce the size of the input vectors while retaining the important information. As a consequence, if the pooling is aggressive, the range of cosine similarities will widen on average. This is what @albus_c noticed, I think.
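
The shrinking range is easy to reproduce; a sketch that draws random standard-normal vectors of several sizes and plots their pairwise cosine similarities (for such vectors the spread contracts roughly like $1/\sqrt{d}$):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def pairwise_cosine(x):
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    cos = x @ x.T
    return cos[np.triu_indices_from(cos, k=1)]

sizes = [2, 8, 32, 128, 768]
fig, axes = plt.subplots(1, len(sizes), figsize=(15, 3), sharey=True)
for ax, d in zip(axes, sizes):
    vecs = rng.standard_normal((200, d))        # 200 random vectors of dimension d
    ax.hist(pairwise_cosine(vecs), bins=40, range=(-1, 1))
    ax.set_title(f"dim = {d}")
plt.show()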

I also implemented a 2D pooling layer function over the random sample, with both average and max pooling. What I noticed, contrary to my intuition, is that averaging does not shrink the range of the cosine but keeps it roughly the same. Due to the previous effect, however (pooling shrinks the vector size and widens the cosine range as a consequence), the net effect is that the cosine range increases. With max pooling, on the other hand, the cosine range shrinks and shifts drastically towards positive values, as can be seen in the graph below.

[Figure: cosine similarity before and after pooling]

In the graph above, the upper-left panel shows the histogram of cosine similarities for random vectors of size $768$. The upper-right panel shows cosine similarities for vectors of size $384$, for comparison. I then applied a 2D pooling layer with patch size $2$ and stride $2$. The lower-left panel shows the similarities after max pooling: the values clearly shift towards $1$, into the positive range. The lower-right panel shows the similarities after mean pooling: the range widens compared to the original (upper left) but is similar to the range for vectors of the same size (upper right).
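
A simplified 1-D analogue of that experiment (patch size 2, stride 2 on random standard-normal vectors of size 768; the original used a 2D pooling layer, so the exact shapes differ, but the qualitative picture should be similar):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

def pairwise_cosine(x):
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    cos = x @ x.T
    return cos[np.triu_indices_from(cos, k=1)]

vecs = rng.standard_normal((300, 768))
patches = vecs.reshape(300, 384, 2)              # non-overlapping patches of size 2, stride 2
mean_pooled = patches.mean(axis=2)               # mean pooling: 768 -> 384
max_pooled = patches.max(axis=2)                 # max pooling:  768 -> 384

panels = [("original, dim 768", vecs),
          ("mean pooled, dim 384", mean_pooled),
          ("max pooled, dim 384", max_pooled)]
fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for ax, (title, data) in zip(axes, panels):
    ax.hist(pairwise_cosine(data), bins=40, range=(-1, 1))
    ax.set_title(title)
plt.show()

One intuition for the max-pooling shift: the maximum of two standard normals has a positive mean, so every pooled coordinate picks up a positive bias that all vectors share, pushing their pairwise cosines towards $1$; mean pooling keeps the coordinates centred at zero, so only the dimension-reduction effect shows up.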

I did not work out an analytic explanation for this; these are only simulations. The normal shape that appears is due to how I generated the data; with real data it can look different, but I expect the tendencies to remain the same. I have also experimented with different pooling sizes. If the size of the pooling patch increases, the effect increases dramatically for max pooling while remaining about the same for averaging. If the stride of the pooling is smaller than the patch size (the patches overlap), a correlation appears between the resulting vectors, and the cosine range shrinks further because of that correlation, for both max and average pooling.

I think a proper analytical explanation can also be given, and if I get results and find the time I will update the answer again, but I do not expect it to change what we already see in the simulations.
