score or cost function for AgglomerativeClustering

I am learning AgglomerativeClustering using sklearn. It is fairly easy to use, for example:

    from sklearn.cluster import AgglomerativeClustering

    # create clusters
    hc = AgglomerativeClustering(n_clusters=10, affinity='euclidean', linkage='ward')
    # save clusters for chart
    y_hc = hc.fit_predict(points)

In my case, n_clusters is dynamic within a range, say from 9 to 15. I would like to run some cost or score function so that I can plot a chart and pick a value from it.

However, it seems AgglomerativeClustering doesn't have a score function, unlike KMeans, where I can use either inertia_ or score to make such a plot.
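For context, this is roughly the kind of plot I mean with KMeans (a minimal sketch with placeholder random data; the 9-15 range just mirrors my case):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    points = rng.normal(size=(200, 2))  # placeholder data for illustration

    # inertia_ gives a value to plot for each candidate n_clusters
    inertias = []
    for k in range(9, 16):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
        inertias.append(km.inertia_)
    print(inertias)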

So is there a way I can plot a chart and justify my choice of n_clusters?

Tags: scikit-learn, clustering



Don't run agglomerative clustering repeatedly with different values of n_clusters; that is unnecessary.

Agglomerative clustering is a two-step process (but the sklearn API is suboptimal here; consider using scipy itself instead!):

  1. Construct a dendrogram
  2. Decide where to cut the dendrogram

The first step is expensive, so you should only do it once. It does not yet produce clusters, but the dendrogram can help you decide on the number of clusters. In the (cheap) second step, the clusters are then extracted. You can use the merge height from the dendrogram in a similar way to inertia with k-means; in fact, with Ward linkage the height is very closely related to inertia. But the same drawbacks apply: the value is (for most linkages) monotone, so it is not too helpful for naively choosing the number of clusters.
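A minimal sketch of this workflow with scipy (the random data and the 9-15 range are placeholders taken from the question; linkage and fcluster are from scipy.cluster.hierarchy, whose third column Z[:, 2] holds the merge heights):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    points = rng.normal(size=(200, 2))  # placeholder data for illustration

    # Step 1 (expensive, done once): build the full dendrogram with Ward linkage.
    Z = linkage(points, method='ward')

    # Step 2 (cheap): cut the dendrogram for each candidate number of clusters.
    ks = range(9, 16)
    heights = []
    for k in ks:
        labels = fcluster(Z, t=k, criterion='maxclust')  # cluster labels for k clusters
        heights.append(Z[-k, 2])  # height of the merge that reduces k+1 clusters to k

    # Plot the merge heights as an inertia-like criterion.
    plt.plot(list(ks), heights, marker='o')
    plt.xlabel('number of clusters')
    plt.ylabel('merge height (Ward linkage)')
    plt.show()

Keep in mind the monotonicity caveat above: like inertia, the heights only ever decrease as the number of clusters grows, so look for a large jump rather than a minimum.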
