Interpreting the labels returned by AgglomerativeClustering

I'm quite new to Python and even newer to scikit-learn, and I'm self-taught, so please forgive me if this question is trivial. It doesn't look trivial to me.

So, I have the following cosine similarity matrix as a DataFrame:

       m1     m2     m3     m4     m5
m1  1.000  0.179  0.775  0.673  0.544
m2  0.299  1.000  0.333  0.521  0.232
m3  0.656  0.440  1.000  0.444  0.722
m4  0.578  0.154  0.623  1.000  0.891
m5  0.345  0.312  0.722  0.221  1.000

I want to recover every merge step of the dendrogram. To do that, I wrote this function:

from sklearn.cluster import AgglomerativeClustering
import numpy as np
import pandas as pd

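def clusters(sim, link_name):
    # Start one below the number of items: with n_clusters equal to the
    # item count, every item would simply be its own cluster.
    clusters_num = len(sim.columns) - 1

    clusters_collection = []
    # Refit once per cluster count, down to a single cluster.
    while clusters_num >= 1:
        labels = AgglomerativeClustering(n_clusters=clusters_num, affinity='cosine', linkage=link_name).fit_predict(sim)
        clusters_collection.append(labels)
        clusters_num = clusters_num - 1

    return clusters_collection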

sim_matrix = pd.read_excel(r'C:\Users\damia\OneDrive\Desktop\logistic management tool\Es sim asimmetrica\sim asimmetrica.xlsx')
sim_matrix.index = sim_matrix.columns
print(sim_matrix)

print(clusters(sim_matrix, 'average'))

The results are the following:

       m1     m2     m3     m4     m5
m1  1.000  0.179  0.775  0.673  0.544
m2  0.299  1.000  0.333  0.521  0.232
m3  0.656  0.440  1.000  0.444  0.722
m4  0.578  0.154  0.623  1.000  0.891
m5  0.345  0.312  0.722  0.221  1.000
[array([0, 3, 0, 1, 2], dtype=int64), array([0, 1, 0, 0, 2], dtype=int64), array([0, 1, 0, 0, 0], dtype=int64), array([0, 0, 0, 0, 0], dtype=int64)]
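
To make the arrays easier to read: each array holds one cluster label per row of the matrix, in row order, and items that share a label belong to the same cluster. The label numbers themselves carry no meaning. Here is a minimal sketch that turns the printed arrays into named groups. The helper name `groups_from_labels` is made up for illustration, and the label lists are copied by hand from the output above.

```python
from collections import defaultdict

names = ["m1", "m2", "m3", "m4", "m5"]

# Label arrays copied from the printed output (n_clusters = 4, 3, 2, 1).
label_sets = [
    [0, 3, 0, 1, 2],
    [0, 1, 0, 0, 2],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

def groups_from_labels(labels, names):
    """Group item names by their cluster label (hypothetical helper)."""
    groups = defaultdict(list)
    for name, label in zip(names, labels):
        groups[label].append(name)
    # Sort the groups so the result does not depend on the label ids.
    return sorted(groups.values())

for labels in label_sets:
    print(len(set(labels)), "clusters:", groups_from_labels(labels, names))
```

Read this way, the run joined m1 and m3 first, then added m4, then m5, and m2 last.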

So apparently it groups m1 and m3 as a first move, but I was expecting it to group m4 and m5 because they have the highest similarity value (0.891).
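
One possible explanation, offered as an assumption rather than a confirmed diagnosis: as far as I understand, `affinity='cosine'` makes scikit-learn treat each row passed to `fit_predict` as a feature vector and compute the cosine distance between rows, instead of reading the cell values as precomputed similarities. The sketch below checks that reading using only the standard library and the matrix printed above. `cosine_distance` is a helper written for this illustration.

```python
import math
from itertools import combinations

names = ["m1", "m2", "m3", "m4", "m5"]
rows = [
    [1.000, 0.179, 0.775, 0.673, 0.544],
    [0.299, 1.000, 0.333, 0.521, 0.232],
    [0.656, 0.440, 1.000, 0.444, 0.722],
    [0.578, 0.154, 0.623, 1.000, 0.891],
    [0.345, 0.312, 0.722, 0.221, 1.000],
]

def cosine_distance(u, v):
    """1 - cosine similarity between two row vectors (illustrative helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Distance between every pair of rows, treating each row as a feature vector.
row_dist = {
    (names[i], names[j]): cosine_distance(rows[i], rows[j])
    for i, j in combinations(range(len(names)), 2)
}

closest = min(row_dist, key=row_dist.get)
print("closest rows:", closest)
print("m1 vs m3:", round(row_dist[("m1", "m3")], 3))
print("m4 vs m5:", round(row_dist[("m4", "m5")], 3))
```

Under this reading, the rows for m1 and m3 are the closest pair, which would be consistent with the first merge in the output, while the rows for m4 and m5 are further apart even though the cell value between them is 0.891.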

I've done this exercise on paper before and the correct grouping order with average linkage should be:

  1. m4 + m5
  2. m1 + m3
  3. m1 + m3 + m4 + m5
  4. all together
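
The paper calculation can be sketched in plain Python. This is an illustration under stated assumptions, not a description of what scikit-learn does internally. Since the matrix is not symmetric, the sketch reads each pair's similarity from the upper triangle only (so m4/m5 is 0.891, not 0.221), and at each step it merges the two clusters with the highest average pairwise similarity. The function name `average_linkage_order` is made up here.

```python
from itertools import combinations

names = ["m1", "m2", "m3", "m4", "m5"]
sim = [
    [1.000, 0.179, 0.775, 0.673, 0.544],
    [0.299, 1.000, 0.333, 0.521, 0.232],
    [0.656, 0.440, 1.000, 0.444, 0.722],
    [0.578, 0.154, 0.623, 1.000, 0.891],
    [0.345, 0.312, 0.722, 0.221, 1.000],
]

def pair_sim(i, j):
    # Assumption: use the upper-triangle entry for each unordered pair.
    return sim[min(i, j)][max(i, j)]

def average_sim(a, b):
    """Average similarity over all cross-cluster pairs (average linkage)."""
    return sum(pair_sim(i, j) for i in a for j in b) / (len(a) * len(b))

def average_linkage_order(n_items):
    """Return the merged clusters, in merge order, as tuples of item indices."""
    clusters = [(i,) for i in range(n_items)]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the highest average similarity.
        a, b = max(combinations(clusters, 2), key=lambda p: average_sim(*p))
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        merges.append(merged)
    return merges

for step, merged in enumerate(average_linkage_order(len(names)), start=1):
    print(step, " + ".join(names[i] for i in merged))
```

Under these assumptions the merge order matches the list above. In scikit-learn, the closest equivalent might be to pass a distance matrix (for instance 1 - similarity) with `affinity='precomputed'` (the parameter is named `metric` in recent releases). How it handles a non-symmetric input is something to check before relying on it.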
