Dissimilarity Matrix of non-metric proximity data

Question


ninji

February 9, 2022, 03:06

We currently have a coding exercise in which we are asked to implement Constant Shift Embedding (Paper). This in itself is not a big problem. All the algorithm needs is a symmetric, zero-diagonal dissimilarity matrix of some non-metric proximity data. The algorithm then embeds that information into a vector space, so you can apply commonly known denoising and dimensionality reduction methods to improve the results of, for example, k-means clustering.
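For context, a minimal sketch of the embedding step as I understand it, using NumPy. The function name `constant_shift_embedding` and the parameter `p` (target dimension) are my own choices for illustration, not taken from the paper or the exercise framework:

```python
import numpy as np

def constant_shift_embedding(D, p):
    """Sketch: embed a symmetric, zero-diagonal dissimilarity matrix D into R^p."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    ones = np.ones((n, n))
    Q = np.eye(n) - ones / n  # centering matrix

    # Centered similarity matrix; it can have negative eigenvalues
    # when the data is non-metric.
    S_c = -0.5 * Q @ D @ Q
    lam_min = np.linalg.eigvalsh((S_c + S_c.T) / 2)[0]  # smallest eigenvalue (<= 0)

    # Shift every off-diagonal dissimilarity by the constant -2 * lam_min.
    D_shift = D - 2.0 * lam_min * (ones - np.eye(n))
    S_shift = -0.5 * Q @ D_shift @ Q  # positive semidefinite after the shift
    S_shift = (S_shift + S_shift.T) / 2

    # Keep the p leading eigen-directions as coordinates.
    eigvals, eigvecs = np.linalg.eigh(S_shift)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:p]
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))
```

With `p = n - 1` the squared Euclidean distances between the returned rows should equal the original dissimilarities plus one constant on all off-diagonal entries; choosing a smaller `p` is where the denoising comes in.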

Given the e-mail communications in this data set, how would you go about choosing a reasonable dissimilarity matrix?

The data is simply a list of unique pairs (A, B) such that at least one e-mail has been sent from node A to node B. This gives rise to a graph of around 1000 nodes and 25000 edges.

A first step might be to create the adjacency matrix of this graph, treated as undirected (the framework already provides it).

I'm thankful for any pointers in the right direction.

EDIT: Overnight I had an idea:

Let's say we only have 8 nodes. Now compare the proximity vectors (adjacency rows) of two vertices. Suppose, for example, the two vectors look like this:

1 0 0 0 1 0 1 1

0 1 0 0 0 1 0 1

Their dissimilarity would be 5, since the vectors differ in 5 positions.

Now just normalize by the total number of nodes, which gives 5/8.

This way we also incorporate how many neighbors two nodes share instead of only looking at direct edges, and might therefore get better results when we later try to cluster the nodes.
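A minimal sketch of this idea in plain Python, assuming the adjacency matrix is available as a list of 0/1 rows. The names `row_dissimilarity` and `hamming_dissimilarity_matrix` are made up for illustration and are not part of the framework:

```python
def row_dissimilarity(u, v):
    """Fraction of positions at which two 0/1 proximity vectors differ."""
    differing = sum(1 for x, y in zip(u, v) if x != y)
    return differing / len(u)

def hamming_dissimilarity_matrix(A):
    """Symmetric, zero-diagonal matrix of pairwise row dissimilarities."""
    n = len(A)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = row_dissimilarity(A[i], A[j])
    return D

# The two example vectors from above.
u = [1, 0, 0, 0, 1, 0, 1, 1]
v = [0, 1, 0, 0, 0, 1, 0, 1]
print(row_dissimilarity(u, v))  # 0.625, i.e. 5/8
```

One thing to watch: if the adjacency matrix has a zero diagonal, two directly connected nodes always differ at each other's positions, which adds 2/n to their dissimilarity. Setting the diagonal to 1 before comparing would be one possible adjustment, though that is only a suggestion.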

Let me know what you think.

Topic similarity information-retrieval

Category Data Science

Juan Esteban de la Calle · Accepted Answer · April 30, 2019, 22:21

Maybe I did not completely understand your question, but I think the answer you are looking for is one of the following:

You may want to fill an n-by-n matrix with 1 if person $i$ has sent e-mail(s) to person $j$, and 0 otherwise.
Maybe you want to fill the n-by-n matrix with the number of e-mails sent from person $i$ to person $j$.

Both measures are distances in the mathematical definition.

For clarity:

You could program the dissimilarity matrix as $M[i,j] = 1$ if the pair of people $(i, j)$ exists in your data, and $M[i,j] = 0$ otherwise.
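As a rough sketch, building such a matrix from the pair list could look like the following. It assumes the nodes are already numbered 0 to n-1 and that the graph is treated as undirected; `adjacency_from_pairs` and `to_dissimilarity` are hypothetical helper names. The conversion $1 - M[i,j]$ with a zero diagonal is my own assumption about how to turn the indicator into something usable as a dissimilarity:

```python
def adjacency_from_pairs(pairs, n):
    """Binary n-by-n matrix: M[i][j] = 1 if the pair (i, j) occurs in the data."""
    M = [[0] * n for _ in range(n)]
    for a, b in pairs:
        if a == b:
            continue  # ignore self-mails so the diagonal stays 0
        M[a][b] = 1
        M[b][a] = 1  # symmetrize: treat the graph as undirected
    return M

def to_dissimilarity(M):
    """Assumed conversion: connected pairs get 0, unconnected pairs get 1."""
    n = len(M)
    return [[0 if i == j else 1 - M[i][j] for j in range(n)] for i in range(n)]

pairs = [(0, 1), (1, 2)]
M = adjacency_from_pairs(pairs, 3)
D = to_dissimilarity(M)
```

The result is symmetric with a zero diagonal, which is the shape of input the embedding expects. Whether it is informative enough for clustering is a separate question.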
