Unsupervised Hierarchical Agglomerative Clustering

Question

David Waterworth

April 16, 2022, 07:04

I've read a number of papers where the authors talk about Unsupervised Hierarchical Agglomerative Clustering. They seem to imply that the algorithm determines the number of clusters based on a hyper-parameter:

We define the heterogeneity metric within a cluster to be the average of all-pair Jaccard distances, and at each step merge two clusters if the heterogeneity of the resultant cluster is below a specified threshold.

When I search for Python implementations of agglomerative clustering I keep coming up with sklearn, which requires the number of clusters to be specified a priori. In most examples this number is chosen by plotting a dendrogram and then, apparently, eyeballing the chart to decide how many clusters there are. For example, in https://towardsdatascience.com/machine-learning-algorithms-part-12-hierarchical-agglomerative-clustering-example-in-python-1e18e0075019 I'd argue it's impossible to determine from the chart alone whether 3 or 5 is optimal (based on the largest vertical distance). I believe this is Ward's method, but I'm not sure it's the same as merging clusters where the heterogeneity is below a threshold.

Is this possible in sklearn, or is there another Python implementation which does this? I feel that, at the very least, there should be a way to process the dendrogram programmatically rather than plotting it.

Topic agglomerative clustering

Category Data Science

David Waterworth · Accepted Answer · February 13, 2021, 23:25

I think I've figured out how to implement the algorithm described in the paper I'm studying. I suspect they used scipy.cluster.hierarchy.

Anyway, my process is:

1. Generate a distance matrix y from my list of examples x.
2. Compute the linkage using scipy.cluster.hierarchy.linkage.
3. Generate flat clusters using scipy.cluster.hierarchy.fcluster.
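The three steps can be sketched as follows. This is only a minimal illustration: the toy binary matrix, the Jaccard metric passed to scipy.spatial.distance.pdist, the "average" linkage method, and the threshold value 0.5 are all assumptions made for the example, not details confirmed by the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical toy data: one row per example, one boolean column per feature.
x = np.array(
    [
        [1, 1, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 0],
        [0, 0, 0, 1, 1, 1],
        [1, 0, 0, 0, 0, 1],
    ],
    dtype=bool,
)

# Step 1: condensed pairwise distance matrix (Jaccard is an assumption here).
y = pdist(x, metric="jaccard")

# Step 2: build the merge tree; "average" linkage is assumed for illustration.
Z = linkage(y, method="average")

# Step 3: cut the tree so that no merge above the threshold is kept.
threshold = 0.5  # assumed value
labels = fcluster(Z, t=threshold, criterion="distance")

print(labels)  # one integer cluster label per row of x
```

With criterion="distance", fcluster groups observations whose cophenetic distance is at most t, so the number of clusters falls out of the threshold instead of being supplied up front.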

The last step is where the threshold mentioned in the paper is applied. I still have a question about how to use fcluster to generate clusters based on heterogeneity.
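One caveat: average linkage averages the distances between two clusters, whereas the quoted heterogeneity is the average over all pairs inside the merged cluster, so the two thresholds are not necessarily equivalent. Below is a rough sketch of the quoted rule written directly, without fcluster. The helper names heterogeneity and threshold_agglomerate are hypothetical, and the choice to always merge the pair with the lowest resulting heterogeneity is my assumption, since the quote does not state the merge order. It is a naive loop meant for small inputs only.

```python
import itertools

import numpy as np
from scipy.spatial.distance import pdist, squareform


def heterogeneity(members, D):
    """Average of all pairwise distances inside a cluster (0.0 for a singleton)."""
    if len(members) < 2:
        return 0.0
    idx = np.asarray(members)
    sub = D[np.ix_(idx, idx)]
    upper = np.triu_indices(len(idx), k=1)  # each unordered pair once
    return float(sub[upper].mean())


def threshold_agglomerate(D, max_heterogeneity):
    """Greedy merging: stop once the best merge is not below the threshold."""
    clusters = [[i] for i in range(D.shape[0])]
    while len(clusters) > 1:
        best = None
        for a, b in itertools.combinations(range(len(clusters)), 2):
            h = heterogeneity(clusters[a] + clusters[b], D)
            if best is None or h < best[0]:
                best = (h, a, b)
        h, a, b = best
        if h >= max_heterogeneity:
            break  # no candidate merge stays below the threshold
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


# Hypothetical toy data: boolean feature rows compared with Jaccard distance.
x = np.array(
    [
        [1, 1, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 0],
        [0, 0, 0, 1, 1, 1],
        [1, 0, 0, 0, 0, 1],
    ],
    dtype=bool,
)
D = squareform(pdist(x, metric="jaccard"))
clusters = threshold_agglomerate(D, max_heterogeneity=0.5)  # assumed threshold
print(clusters)
```

The heterogeneity helper can also be applied to the flat clusters returned by fcluster to check whether each one stays below the intended threshold.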

What I've found confusing is that there are a lot of tutorials on determining the number of clusters for sklearn.cluster.AgglomerativeClustering that use scipy.cluster.hierarchy.linkage and then scipy.cluster.hierarchy.dendrogram to plot a dendrogram, which is then used to visually identify how many clusters are required.
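The visual step can be replaced by reading the merge heights straight from the linkage matrix, whose third column holds the distance at which each merge happened. The sketch below automates the "largest vertical distance" heuristic; the synthetic blobs, the Ward method, and the midpoint cut are assumptions for illustration, and the heuristic itself is only one of several possible rules.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical data: three well-separated 2-D blobs of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Ward linkage on raw observations (Euclidean distance).
Z = linkage(points, method="ward")

# Merge heights, in the order the merges were made (non-decreasing for Ward).
heights = Z[:, 2]

# Find the largest jump between consecutive merge heights and cut inside it.
gaps = np.diff(heights)
i = int(np.argmax(gaps))
cut = (heights[i] + heights[i + 1]) / 2.0

labels = fcluster(Z, t=cut, criterion="distance")
n_clusters = len(np.unique(labels))
print(n_clusters)
```

On the sklearn side, AgglomerativeClustering also accepts n_clusters=None together with a distance_threshold argument, so a threshold-based cut does not strictly require scipy.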
