Perform clustering from a similarity matrix

I have a list of songs, and for each one I have extracted a feature vector. I calculated a similarity score between each pair of vectors and stored the scores in a similarity matrix. I would like to cluster the songs based on this similarity matrix to identify clusters, or genres of a sort.

I have used the networkx package to create a force-directed graph from the similarity matrix, using the spring layout. Then I used KMeans clustering on the position of the nodes from that graph and this resulted in clusters that made sense. However, I'm not sure that this is the correct approach as it is fundamentally linked with the positions given by the spring layout.

I have also tried running Spectral Clustering on the similarity matrix, but it is too slow.

Is it fundamentally flawed to take the node positions from a spring layout of the similarity graph and feed them into KMeans to extract clusters? If so, what would be an alternative way to cluster elements given a similarity matrix?



I am not sure that the positions of the force-directed graph perform better than direct clustering on the original data.

A typical clustering approach when you have a distance matrix is to apply hierarchical clustering. With scikit-learn, you can use a type of hierarchical clustering called agglomerative clustering, e.g.:

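from sklearn.cluster import AgglomerativeClustering

# Precomputed pairwise distance matrix (symmetric, zero diagonal)
data_matrix = [[0, 0.8, 0.9], [0.8, 0, 0.2], [0.9, 0.2, 0]]

model = AgglomerativeClustering(
    metric='precomputed',  # scikit-learn >= 1.2; older releases call this `affinity`
    n_clusters=2,
    linkage='complete',  # 'ward' is not allowed with a precomputed matrix
).fit(data_matrix)

print(model.labels_)  # one cluster label per row of data_matrix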

(source)

For this, you should express your similarities as distances (e.g. 1 - similarity).

For new data, you can apply a k-nearest neighbor classifier on top of the clusters.
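A minimal sketch of that idea, assuming you still have the original feature vectors and the cluster labels produced by the clustering step. The arrays `features`, `cluster_labels` and `new_songs` are hypothetical placeholders, and `n_neighbors=1` is just an illustrative choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical feature vectors for already-clustered songs and their cluster labels.
features = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
cluster_labels = np.array([0, 0, 1, 1])

# Treat the cluster labels as classes and fit a nearest-neighbor classifier.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(features, cluster_labels)

# Assign previously unseen songs to the cluster of their nearest neighbor.
new_songs = np.array([[0.1, 0.2], [4.8, 5.0]])
predicted = knn.predict(new_songs)
print(predicted)
```

This uses Euclidean distance on the features by default, which may not match the similarity score used to build the matrix; if that matters, `KNeighborsClassifier` also accepts `metric='precomputed'` with distances from new songs to the training songs.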


I am not sure I fully understand your question, but generally the similarity metric you use to build the graph determines how the resulting clusters should be interpreted.

That said, a cluster map sounds like a good match for your use case. That is, a heatmap of the similarity (or correlation) matrix whose rows and columns are reordered according to linkage clustering on your datapoints. An example is available here:

Source https://seaborn.pydata.org/examples/structured_heatmap.html

You can easily experiment with something like this using the seaborn library and its seaborn.clustermap function.
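As an illustration only, here is one way this might look for a precomputed similarity matrix. The song names and similarity values are invented, and building the linkage from 1 - similarity and passing it through `row_linkage`/`col_linkage` is an assumption about what suits this use case, not something taken from the seaborn example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Hypothetical similarity matrix for four songs (symmetric, ones on the diagonal).
names = ["song_a", "song_b", "song_c", "song_d"]
similarity = pd.DataFrame(
    [[1.0, 0.9, 0.2, 0.1],
     [0.9, 1.0, 0.3, 0.2],
     [0.2, 0.3, 1.0, 0.8],
     [0.1, 0.2, 0.8, 1.0]],
    index=names, columns=names,
)

# Build the linkage from distances (1 - similarity) rather than letting
# clustermap treat each row of the matrix as a feature vector.
distance = 1.0 - similarity.to_numpy()
np.fill_diagonal(distance, 0.0)
tree = linkage(squareform(distance), method="average")

# Use the same tree for rows and columns so the heatmap stays symmetric.
grid = sns.clustermap(similarity, row_linkage=tree, col_linkage=tree, cmap="viridis")

# Row order after clustering; similar songs end up next to each other.
row_order = list(grid.dendrogram_row.reordered_ind)
print([names[i] for i in row_order])
```

Calling `grid.savefig("clustermap.png")` writes the figure to disk. Blocks of high similarity along the diagonal of the reordered heatmap are candidate clusters.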
