Comparing the similarity structure of two distance matrices (computed from sentence embeddings)

I apologize if this question lacks clarity; my mathematical background on the topic is limited, and I was hoping to find some guidance. I would like to compare two distance matrices that contain pairwise semantic (cosine) similarities for a set of 33 sentences. The matrices were created from sentence embeddings, i.e., embeddings of full sentences in a vector space (I used Google's Universal Sentence Encoder, so the vectors have 512 dimensions). The two sets of sentences underlying the distance matrices describe the same 33 contents/events, but one set contains more detailed descriptions and the other less detailed (more condensed) ones. I would like to test how this loss of detail affects the structure of the distance matrices.
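For concreteness, here is a minimal sketch of how the two matrices can be built (the `detailed` and `condensed` lists are placeholders for the actual 33 sentences, and the TensorFlow Hub module URL is just the version I happened to use):

```python
import tensorflow_hub as hub
from scipy.spatial.distance import pdist, squareform

# Load the Universal Sentence Encoder (512-dimensional sentence embeddings).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Placeholder lists: each should hold the 33 sentence strings.
detailed = ["first detailed description", "second detailed description"]
condensed = ["first condensed description", "second condensed description"]

def cosine_distance_matrix(sentences):
    """Embed the sentences and return the pairwise cosine distance matrix."""
    vectors = embed(sentences).numpy()                  # shape (n, 512)
    return squareform(pdist(vectors, metric="cosine"))  # distance = 1 - similarity

D_detailed = cosine_distance_matrix(detailed)
D_condensed = cosine_distance_matrix(condensed)
```

In particular, I'm interested in the following two questions: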

1. Can the similarity structure of the condensed sentences be described in a lower-dimensional space than that of the detailed sentences?

--The first thing that came to mind was dimensionality reduction techniques (PCA, etc.).
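To make this concrete, here is a sketch of one way to quantify the effective dimensionality of each matrix via classical MDS (principal coordinates analysis), which is essentially PCA applied directly to a distance matrix: the eigenvalue spectrum of the double-centered squared distance matrix shows how much variance each dimension carries. `D_detailed` and `D_condensed` are the matrices from the sketch above, and the 90% variance threshold is an arbitrary choice:

```python
import numpy as np

def mds_eigenvalues(D):
    """Classical MDS: eigenvalues of the double-centered squared distance matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    B = -0.5 * J @ (D ** 2) @ J            # Gram matrix recovered from distances
    eigvals = np.linalg.eigvalsh(B)[::-1]  # sort in descending order
    return np.clip(eigvals, 0, None)       # clip small negative eigenvalues to 0

def dims_for_variance(eigvals, threshold=0.90):
    """Smallest number of dimensions capturing `threshold` of the total variance."""
    frac = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(frac, threshold)) + 1

print("detailed: ", dims_for_variance(mds_eigenvalues(D_detailed)))
print("condensed:", dims_for_variance(mds_eigenvalues(D_condensed)))
```

If the condensed set really lives in a lower-dimensional space, its eigenvalue spectrum should fall off faster, i.e., fewer dimensions should be needed to reach the threshold.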

2. (This might be directly related to question 1.) Is the range of distances for the detailed sentences wider than for the condensed sentences? That is, are there more extreme distances (both high and low) in the detailed set?

--I was wondering whether multidimensional scaling (MDS) could be useful here. I would expect to see more clustering for the detailed set, but perhaps that doesn't make sense? A sketch of what I have in mind follows.
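The sketch compares the spread of the pairwise distances directly (upper triangle only, to avoid the diagonal and duplicate entries) and plots a 2-D MDS embedding of each matrix; it assumes scikit-learn and matplotlib, with `D_detailed` and `D_condensed` as above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

def upper_triangle(D):
    """Flatten the distinct pairwise distances (above the diagonal)."""
    return D[np.triu_indices_from(D, k=1)]

# Compare the spread of the two distance distributions.
for name, D in [("detailed", D_detailed), ("condensed", D_condensed)]:
    d = upper_triangle(D)
    iqr = np.subtract(*np.percentile(d, [75, 25]))
    print(f"{name}: min={d.min():.3f} max={d.max():.3f} "
          f"std={d.std():.3f} IQR={iqr:.3f}")

# 2-D MDS embedding of each matrix for a visual check of clustering.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (name, D) in zip(axes, [("detailed", D_detailed), ("condensed", D_condensed)]):
    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=0).fit_transform(D)
    ax.scatter(coords[:, 0], coords[:, 1])
    ax.set_title(name)
plt.show()
```

Does it make sense to read "wider range of distances" off summary statistics like these, or is there a more principled test?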

Any input would be greatly appreciated! Thank you!!

Tags: semantic-similarity, distance, nlp, dimensionality-reduction
