score or cost function for AgglomerativeClustering

I am learning AgglomerativeClustering using sklearn. It is fairly easy to use, for example:

    from sklearn.cluster import AgglomerativeClustering

    # create clusters
    hc = AgglomerativeClustering(n_clusters=10, affinity='euclidean', linkage='ward')
    # save clusters for chart
    y_hc = hc.fit_predict(points)

In my case, n_clusters is dynamic within a range, say from 9 to 15. I would like to run some cost or score function so that I can plot a chart and pick a value from it.

However, it seems AgglomerativeClustering doesn't have a score function, unlike KMeans, where I can use either inertia_ or score to make such a plot.
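For context, this is roughly the kind of plot I mean with KMeans (a minimal sketch with placeholder random data; the 9-15 range just mirrors my case):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    points = rng.normal(size=(200, 2))  # placeholder data for illustration

    # inertia_ gives a value to plot for each candidate n_clusters
    inertias = []
    for k in range(9, 16):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
        inertias.append(km.inertia_)
    print(inertias)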

So is there a way I can plot a chart and justify my choice of n_clusters?

Tags: scikit-learn, clustering



Don't run agglomerative clustering repeatedly with different values of n_clusters; that is unnecessary.

Agglomerative clustering is a two-step process (but the sklearn API is suboptimal here; consider using scipy itself instead!):

  1. Construct a dendrogram
  2. Decide where to cut the dendrogram

The first step is expensive, so you should only do it once. It does not yet produce clusters, but the dendrogram can help you decide on the number of clusters. In the (cheap) second step, the clusters are then extracted. You can use the merge height from the dendrogram in a similar way to inertia with k-means; in fact, with Ward linkage the height is very closely related to inertia. But the same drawbacks apply: the value is (for most linkages) monotone, so it is not too helpful for naively choosing the number of clusters.
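A minimal sketch of this workflow with scipy (the random data and the 9-15 range are placeholders taken from the question; linkage and fcluster are from scipy.cluster.hierarchy, whose third column Z[:, 2] holds the merge heights):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    points = rng.normal(size=(200, 2))  # placeholder data for illustration

    # Step 1 (expensive, done once): build the full dendrogram with Ward linkage.
    Z = linkage(points, method='ward')

    # Step 2 (cheap): cut the dendrogram for each candidate number of clusters.
    ks = range(9, 16)
    heights = []
    for k in ks:
        labels = fcluster(Z, t=k, criterion='maxclust')  # cluster labels for k clusters
        heights.append(Z[-k, 2])  # height of the merge that reduces k+1 clusters to k

    # Plot the merge heights as an inertia-like criterion.
    plt.plot(list(ks), heights, marker='o')
    plt.xlabel('number of clusters')
    plt.ylabel('merge height (Ward linkage)')
    plt.show()

Keep in mind the monotonicity caveat above: like inertia, the heights only ever decrease as the number of clusters grows, so look for a large jump rather than a minimum.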
