Find the shared properties of cluster samples

Question

qwertzuiop

May 29, 2022 06:04

I have a dataset which contains ~15 features. With the elbow method, I found out that the optimal number of clusters is probably four. Therefore, I applied the K-means algorithm with four clusters. Now, I would like to understand why these clusters have been formed the way they are. In other words, I would like to identify the shared properties of the points of a specific cluster.

My idea is the following:

Let's pretend that C1 are the coordinates of the centroid of the first cluster and that P1 and P2 are two points of this cluster.

$$ C1 = \begin{pmatrix} 5\\ 2\\ 4\\ \end{pmatrix} $$

$$ P1 = \begin{pmatrix} 8\\ 2\\ 6\\ \end{pmatrix} \qquad P2 = \begin{pmatrix} 9\\ 2\\ 0\\ \end{pmatrix} $$

If we compute, for each coordinate, the average absolute distance of P1 and P2 from the centroid C1, we obtain this:

$$ DistAverage = \begin{pmatrix} (|8-5|+|9-5|)/2\\ (|2-2|+|2-2|)/2\\ (|6-4|+|0-4|)/2\\ \end{pmatrix} = \begin{pmatrix} 3.5\\ 0\\ 3\\ \end{pmatrix} $$

Would this mean that the second feature is a shared property of the points of this cluster (since its average distance is 0)?
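For reference, the computation above can be sketched in plain Python. This is only an illustration using the example values; it assumes absolute differences are intended, and the function name `mean_abs_deviation` is made up for this sketch.

```python
# Values taken from the example above.
centroid = [5, 2, 4]
points = [[8, 2, 6], [9, 2, 0]]

def mean_abs_deviation(points, centroid):
    """Per-feature mean absolute deviation of the points from the centroid."""
    n = len(points)
    return [
        sum(abs(p[j] - centroid[j]) for p in points) / n
        for j in range(len(centroid))
    ]

dist_average = mean_abs_deviation(points, centroid)
print(dist_average)  # [3.5, 0.0, 3.0]
```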

I hope that the question was clear enough.

Topic k-means clustering

Category Data Science

Miguel Raevenswood · Accepted Answer · August 9, 2021 20:58

As the other answer stated, there are plenty of metrics that one can use to determine why certain clusters were chosen over others. To add to that answer, there are other ones you can look into, in this link, that can help answer your question:

Inertia
Dunn Index

To summarize these two: inertia is the sum of squared distances between each point and the centroid of its cluster, with a lower inertia being better. The Dunn Index is the ratio of the smallest distance between clusters to the largest distance within a cluster, with a higher score indicating a better clustering.
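A minimal sketch of both metrics in plain Python, to make the definitions concrete. The function names and the toy data are made up for illustration. The Dunn Index has several variants; this one assumes the closest pair of points for the between-cluster distance and the farthest pair of points for the within-cluster distance.

```python
import math
from itertools import combinations

def inertia(clusters):
    """Sum of squared distances from each point to its cluster's mean."""
    total = 0.0
    for pts in clusters:
        dim = len(pts[0])
        centroid = [sum(p[j] for p in pts) / len(pts) for j in range(dim)]
        total += sum(math.dist(p, centroid) ** 2 for p in pts)
    return total

def dunn_index(clusters):
    """Smallest between-cluster distance divided by largest cluster diameter.

    Assumes at least two clusters and at least one cluster with two points.
    """
    min_separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    max_diameter = max(
        math.dist(p, q)
        for pts in clusters
        for p, q in combinations(pts, 2)
    )
    return min_separation / max_diameter

# Toy data (made up for illustration): two tight, well-separated clusters.
clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
]
print("inertia:", inertia(clusters))
print("Dunn index:", dunn_index(clusters))
```

Both numbers describe the clustering as a whole, so they help compare alternative clusterings rather than explain what the points of one cluster have in common.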

As for specific "shared properties", I would say that might be specific to the project at hand. In the link that I previously shared, there is a useful chart showing two possible cluster types for the same scatterplot.

In case 1, the clusters share income levels while, in case 2, the clusters share debt levels. The article goes on to explain that case 2 would be the better one because you can describe the clusters as four different categories: high income/high debt, high income/low debt, low income/high debt, and low income/low debt. This is better than the two categories that we could derive from case 1, namely low income and high income. Debt would therefore be the better "shared property" for describing the clusters.

Brian Spiering · Accepted Answer · August 9, 2021 15:45

There are many evaluation metrics that can quantify within-cluster properties versus between-cluster properties.

You are describing something similar to the Davies–Bouldin index, which compares the scatter within each cluster to the separation between clusters.
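As a rough illustration, the Davies–Bouldin index can be sketched in plain Python as follows. It assumes Euclidean distance and uses the mean distance to the centroid as the scatter of a cluster; the function names and the toy data are made up for this example. Lower values indicate compact, well-separated clusters.

```python
import math

def centroid_of(pts):
    """Coordinate-wise mean of a list of points."""
    dim = len(pts[0])
    return [sum(p[j] for p in pts) / len(pts) for j in range(dim)]

def davies_bouldin(clusters):
    """Mean, over clusters, of the worst (scatter_i + scatter_j) / separation_ij."""
    centroids = [centroid_of(pts) for pts in clusters]
    # Scatter: average distance of a cluster's points to its own centroid.
    scatter = [
        sum(math.dist(p, c) for p in pts) / len(pts)
        for pts, c in zip(clusters, centroids)
    ]
    k = len(clusters)
    worst = []
    for i in range(k):
        worst.append(max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k) if j != i
        ))
    return sum(worst) / k

# Toy data (made up): same cluster shapes, far apart vs. close together.
far_apart = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
]
close_together = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(2.0, 2.0), (2.0, 3.0), (3.0, 2.0)],
]
print(davies_bouldin(far_apart))
print(davies_bouldin(close_together))
```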

Has QUIT--Anony-Mousse · Accepted Answer · July 28, 2019 17:15

Obviously you can check the variance of each attribute.

But unless the data is badly scaled, you will likely need a combination of attributes to explain the differences between clusters.
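A minimal sketch of the variance check, in plain Python. It compares each attribute's variance inside one cluster to its variance over the whole dataset; the function name `variance_ratios` and the toy data are made up for illustration.

```python
from statistics import pvariance

def variance_ratios(cluster_rows, all_rows):
    """Per feature: variance inside the cluster / variance over all rows."""
    ratios = []
    for j in range(len(all_rows[0])):
        overall = pvariance([row[j] for row in all_rows])
        inside = pvariance([row[j] for row in cluster_rows])
        # A constant feature has zero overall variance; report NaN for it.
        ratios.append(inside / overall if overall > 0 else float("nan"))
    return ratios

# Toy data (made up): feature 0 separates the two groups, feature 1 does not.
cluster_a = [(1.0, 5.0), (1.2, 9.0), (0.8, 1.0)]
cluster_b = [(9.0, 1.0), (9.2, 5.0), (8.8, 9.0)]

ratios = variance_ratios(cluster_a, cluster_a + cluster_b)
print(ratios)
```

A ratio close to 0 suggests the points of the cluster agree on that attribute, while a ratio close to 1 suggests the attribute varies as much inside the cluster as it does overall. Because each attribute is checked on its own, this will miss clusters that are only explained by a combination of attributes.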
