Interpreting the labels returned by AgglomerativeClustering
First of all, I'm quite new to Python and even newer to scikit-learn, and I'm self-taught, so please forgive me if this question is trivial; it doesn't seem trivial to me.
So, I have the following cosine similarity matrix as a DataFrame:
       m1     m2     m3     m4     m5
m1  1.000  0.179  0.775  0.673  0.544
m2  0.299  1.000  0.333  0.521  0.232
m3  0.656  0.440  1.000  0.444  0.722
m4  0.578  0.154  0.623  1.000  0.891
m5  0.345  0.312  0.722  0.221  1.000
I want to recover every merge step of the dendrogram, i.e. the cluster assignment for each possible number of clusters. To do that, I wrote this function:
from sklearn.cluster import AgglomerativeClustering
import numpy as np
import pandas as pd

def clusters(sim, link_name):
    # Start at n - 1 clusters and work down to a single cluster
    clusters_num = len(sim.columns) - 1
    clusters_collection = []
    while clusters_num >= 1:
        # One label array per requested number of clusters
        labels = AgglomerativeClustering(n_clusters=clusters_num, affinity='cosine', linkage=link_name).fit_predict(sim)
        clusters_collection.append(labels)
        clusters_num -= 1
    return clusters_collection

sim_matrix = pd.read_excel(r'C:\Users\damia\OneDrive\Desktop\logistic management tool\Es sim asimmetrica\sim asimmetrica.xlsx')
# Use the column names (m1..m5) as row labels too
sim_matrix.index = sim_matrix.columns
print(sim_matrix)
print(clusters(sim_matrix, 'average'))
The results are the following:
       m1     m2     m3     m4     m5
m1  1.000  0.179  0.775  0.673  0.544
m2  0.299  1.000  0.333  0.521  0.232
m3  0.656  0.440  1.000  0.444  0.722
m4  0.578  0.154  0.623  1.000  0.891
m5  0.345  0.312  0.722  0.221  1.000
[array([0, 3, 0, 1, 2], dtype=int64), array([0, 1, 0, 0, 2], dtype=int64), array([0, 1, 0, 0, 0], dtype=int64), array([0, 0, 0, 0, 0], dtype=int64)]
So apparently it groups m1 and m3 first, but I expected it to group m4 and m5, because they have the highest similarity value (0.891).
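To make the label arrays easier to read, here is a minimal sketch that turns each array into named groups. The label arrays are copied from the output above; `groups_from_labels` is a hypothetical helper, not part of scikit-learn. Note that the label numbers are only arbitrary cluster ids: items sharing a number are in the same cluster, and the numbers themselves say nothing about merge order.

```python
from collections import defaultdict

names = ["m1", "m2", "m3", "m4", "m5"]

# Label arrays copied from the printed output (n_clusters = 4, 3, 2, 1)
labelings = [
    [0, 3, 0, 1, 2],
    [0, 1, 0, 0, 2],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

def groups_from_labels(labels, names):
    """Hypothetical helper: collect item names that share a cluster label."""
    groups = defaultdict(list)
    for name, label in zip(names, labels):
        groups[label].append(name)
    # Sort the groups so the result does not depend on the arbitrary label ids
    return sorted(groups.values())

for labels in labelings:
    print(len(set(labels)), "clusters:", groups_from_labels(labels, names))
```

Read this way, the first array puts m1 and m3 together, and the second adds m4 to that group while m5 is still on its own.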
I've done this exercise on paper before and the correct grouping order with average linkage should be:
- m4 + m5
- m1 + m3
- m1 + m3 + m4 + m5
- all together
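The paper calculation can be sketched in plain Python as below. This is only an illustration under one loud assumption: the matrix is asymmetric, and the post does not say how that was handled on paper, so the sketch takes the larger of the two values for each pair. The function names are hypothetical, and this is a direct average-linkage computation on the similarity values, not what scikit-learn does internally.

```python
from itertools import combinations

names = ["m1", "m2", "m3", "m4", "m5"]
sim = [
    [1.000, 0.179, 0.775, 0.673, 0.544],
    [0.299, 1.000, 0.333, 0.521, 0.232],
    [0.656, 0.440, 1.000, 0.444, 0.722],
    [0.578, 0.154, 0.623, 1.000, 0.891],
    [0.345, 0.312, 0.722, 0.221, 1.000],
]

def pair_similarity(i, j):
    # ASSUMPTION: symmetrise by taking the larger of the two directions
    return max(sim[i][j], sim[j][i])

def average_similarity(a, b):
    # Average linkage: mean similarity over all cross-cluster pairs
    total = sum(pair_similarity(i, j) for i in a for j in b)
    return total / (len(a) * len(b))

def average_linkage_merges(n):
    # Start with every item in its own cluster (clusters are index tuples)
    clusters = [(i,) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters with the highest average similarity
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: average_similarity(*pair))
        clusters = [c for c in clusters if c not in (a, b)]
        merged = tuple(sorted(a + b))
        clusters.append(merged)
        merges.append([names[i] for i in merged])
    return merges

merges = average_linkage_merges(len(names))
for step, members in enumerate(merges, start=1):
    print(step, " + ".join(members))
```

Under that assumption the merge order printed is m4 + m5, then m1 + m3, then m1 + m3 + m4 + m5, then everything, which matches the list above. A different way of handling the asymmetry (for example, averaging the two directions) can change the order.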