Dissimilarity Matrix of non-metric proximity data
We currently have a coding exercise in which we are asked to implement Constant Shift Embedding (Paper). This in itself is not a big problem. All the algorithm needs is a symmetric dissimilarity matrix with a zero diagonal, computed from some non-metric proximity data. The algorithm then embeds this information into a vector space, so that commonly known denoising and dimensionality reduction methods can be used to improve the results of, for example, k-means clustering.
Given the e-mail communications in this data set, how would I go about choosing a reasonable dissimilarity matrix?
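For context, here is a minimal sketch of the embedding step as I understand it from the paper: center the dissimilarity matrix, shift the spectrum by the smallest eigenvalue so the similarity matrix becomes positive semidefinite, and read off coordinates from the leading eigenvectors. The function name `constant_shift_embedding` and the parameter `n_components` are my own choices, and this is not the framework's code.

```python
import numpy as np

def constant_shift_embedding(D, n_components=2):
    """Sketch of Constant Shift Embedding for a symmetric, zero-diagonal D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n   # centering projection
    S_c = -0.5 * Q @ D @ Q                # centered similarity matrix
    S_c = (S_c + S_c.T) / 2               # remove round-off asymmetry
    w, V = np.linalg.eigh(S_c)            # eigenvalues in ascending order
    lam_min = w[0]                        # smallest eigenvalue (<= 0 here)
    w_shift = w - lam_min                 # spectrum of S_c - lam_min * I, all >= 0
    order = np.argsort(w_shift)[::-1][:n_components]
    X = V[:, order] * np.sqrt(w_shift[order])
    return X, lam_min
```

With all components kept, the squared Euclidean distances between the rows of `X` should equal the off-diagonal entries of `D` shifted by the constant `-2 * lam_min`; truncating to fewer components is where the denoising comes in.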
The data is simply a list of unique pairs, where at least one e-mail has been sent from node A to node B. This gives rise to a graph of around 1000 nodes and 25000 edges.
A first step might be to create the adjacency matrix of this undirected graph (this matrix is already provided in the framework).
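For anyone without the framework, building such a matrix could look roughly like this. It assumes the nodes are already numbered from 0 to n-1 and that `pairs` is the list of (A, B) tuples; the name `adjacency_from_pairs` is made up for illustration.

```python
import numpy as np

def adjacency_from_pairs(pairs, n_nodes):
    """Symmetric 0/1 adjacency matrix from a list of (a, b) index pairs."""
    A = np.zeros((n_nodes, n_nodes), dtype=np.uint8)
    for a, b in pairs:
        if a != b:          # ignore self-mails so the diagonal stays zero
            A[a, b] = 1
            A[b, a] = 1     # undirected: mirror every edge
    return A
```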
I'm thankful for any pointers in the right direction.
EDIT: Overnight I had an idea:
Let's say we only have 8 nodes. Now compare the proximity vectors of two vertices element by element. Suppose, for example, that the two vectors look like this:
1 0 0 0 1 0 1 1
0 1 0 0 0 1 0 1
Their dissimilarity would be 5, since the vectors differ in 5 positions.
Now just normalize by the total number of nodes, which gives 5/8.
This also incorporates how many neighbors two nodes share, instead of only looking at direct edges, and might therefore give better results when we later cluster the nodes.
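This is the normalized Hamming distance between the proximity vectors. A minimal sketch of computing it for all pairs at once, assuming the proximity vectors are the 0/1 rows of the adjacency matrix (the function name `hamming_dissimilarity` is my own):

```python
import numpy as np

def hamming_dissimilarity(A):
    """Normalized Hamming distance between all pairs of 0/1 rows of A."""
    A = np.asarray(A, dtype=np.int64)
    n = A.shape[1]                    # length of each proximity vector
    common = A @ A.T                  # positions where both rows are 1
    ones = A.sum(axis=1)              # number of 1s in each row
    # mismatches = ones in row i + ones in row j - 2 * shared ones
    mismatches = ones[:, None] + ones[None, :] - 2 * common
    return mismatches / n

u = [1, 0, 0, 0, 1, 0, 1, 1]
v = [0, 1, 0, 0, 0, 1, 0, 1]
print(hamming_dissimilarity([u, v])[0, 1])  # 0.625, i.e. 5/8
```

The result is symmetric with a zero diagonal, which is the input shape the embedding algorithm expects. The matrix-product form avoids building an n x n x n array, which would be too large for around 1000 nodes.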
Let me know what you think.
Topic similarity information-retrieval
Category Data Science