Find the shared properties of cluster samples

I have a dataset which contains ~15 features. With the elbow method, I found out that the optimal number of clusters is probably four. Therefore, I applied the K-means algorithm with four clusters. Now, I would like to understand why these clusters have been formed the way they are. In other words, I would like to identify the shared properties of the points of a specific cluster.

My idea is the following:

Let's pretend that C1 are the coordinates of the centroid of the first cluster and that P1 and P2 are two points of this cluster.

$$ C1 = \begin{pmatrix} 5\\ 2\\ 4\\ \end{pmatrix} $$

$$ P1 = \begin{pmatrix} 8\\ 2\\ 6\\ \end{pmatrix} \quad P2 = \begin{pmatrix} 9\\ 2\\ 0\\ \end{pmatrix} $$

If we compute, for each coordinate, the average absolute distance of P1 and P2 from C1, we obtain this:

$$ DistAverage = \begin{pmatrix} (|8-5|+|9-5|)/2\\ (|2-2|+|2-2|)/2\\ (|6-4|+|0-4|)/2\\ \end{pmatrix} = \begin{pmatrix} 3.5\\ 0\\ 3\\ \end{pmatrix} $$

Would this mean that the second feature is a shared property of the points in this cluster (since its average distance is 0)?
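For reference, the computation above can be sketched in plain Python. The function name `mean_abs_deviation` is only an illustrative choice, and the values are the made-up ones from the example, not real data.

```python
def mean_abs_deviation(centroid, points):
    """Per-feature mean absolute distance of the points from the centroid."""
    n = len(points)
    return [
        sum(abs(p[j] - centroid[j]) for p in points) / n
        for j in range(len(centroid))
    ]

# Toy values from the example above (hypothetical, not real data).
c1 = [5, 2, 4]
p1 = [8, 2, 6]
p2 = [9, 2, 0]

dist_average = mean_abs_deviation(c1, [p1, p2])
print(dist_average)  # [3.5, 0.0, 3.0]
```

A value close to 0 for a feature means the points of the cluster barely vary along that feature, at least relative to its scale.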

I hope that the question was clear enough.

Topic k-means clustering

Category Data Science


As the other answer stated, there are plenty of metrics that one can use to determine why certain clusters were chosen over others. To add to that answer, there are other ones you can look into, described in this link, that can help answer your question.

  1. Inertia
  2. Dunn Index

To summarize these two: inertia is the sum of squared distances between each point and the centroid of its cluster, with a lower inertia being better. The Dunn Index is the ratio of the smallest distance between clusters to the largest distance within a cluster, with a higher score indicating better clustering.
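As a rough illustration of both metrics, here is a minimal sketch in plain Python. It assumes Euclidean distance and one common variant of the Dunn Index (closest pair of points between clusters over the largest cluster diameter); the helper names and the toy `clusters` data are made up for the example.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def inertia(clusters):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(math.dist(p, c) ** 2 for p in points)
    return total

def dunn_index(clusters):
    """Smallest between-cluster distance divided by the largest within-cluster distance."""
    min_between = min(
        math.dist(p, q)
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
        for p in a
        for q in b
    )
    max_within = max(
        math.dist(p, q) for points in clusters for p in points for q in points
    )
    return min_between / max_within

# Two tight, well-separated toy clusters (hypothetical data).
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 0.0), (10.0, 2.0)],
]

print(inertia(clusters))     # low value: points sit close to their centroids
print(dunn_index(clusters))  # high value: clusters are far apart relative to their size
```

If you fit the model with scikit-learn, the fitted `KMeans` object already exposes the inertia as its `inertia_` attribute, so only the Dunn Index would need a hand-written helper like this one.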

As for specific "shared properties", I would say that might be specific to the project at hand. In the link that I previously shared, there is a useful chart showing two possible cluster types for the same scatterplot.

[Image: two possible clusterings, case 1 and case 2, of the same income vs. debt scatterplot]

In case 1, the clusters share income levels while, in case 2, the clusters share debt levels. The article goes on to explain that case 2 would be the better one because you can describe the clusters as four different categories: high income/high debt, high income/low debt, low income/high debt, and low income/low debt. This is better than the two categories that we could derive from case 1, namely low income and high income. This would make debt the better cluster "shared property".


There are many evaluation metrics that can quantify the within-cluster properties vs. the between-cluster properties.

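You are describing something similar to the scatter term of the Davies–Bouldin index, which is a measure of how spread out the points within a cluster are around its centroid. The index compares this within-cluster scatter to the distance between cluster centroids, and lower values are better.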
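A minimal sketch of the Davies–Bouldin index in plain Python, assuming Euclidean distance and the usual definition (for each cluster, take the worst ratio of summed scatters to centroid distance, then average over clusters). The helper names and the toy data are illustrative only.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def scatter(points):
    """Average Euclidean distance of the points to their own centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

def davies_bouldin(clusters):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance."""
    cents = [centroid(pts) for pts in clusters]
    scat = [scatter(pts) for pts in clusters]
    k = len(clusters)
    worst = [
        max(
            (scat[i] + scat[j]) / math.dist(cents[i], cents[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k

# Two tight, well-separated toy clusters (hypothetical data).
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 0.0), (10.0, 2.0)],
]

db = davies_bouldin(clusters)
print(db)  # small value: little scatter compared to the centroid separation
```

scikit-learn also ships this metric as `sklearn.metrics.davies_bouldin_score(X, labels)`, which is probably the more convenient route on a real dataset.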


Obviously you can check the variance of each attribute.

But unless the data is badly scaled, you will likely need a combination of attributes to explain the differences between clusters.
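The per-attribute variance check could look roughly like this, using only the standard library. The function name `feature_variances` and the toy clusters are assumptions for the sake of the example; population variance is used, but sample variance would work just as well.

```python
from statistics import pvariance

def feature_variances(points):
    """Population variance of each attribute across the points of one cluster."""
    return [pvariance(column) for column in zip(*points)]

# Hypothetical clusters: each maps a label to the points assigned to it.
clusters = {
    "cluster_1": [(8.0, 2.0, 6.0), (9.0, 2.0, 0.0)],
    "cluster_2": [(1.0, 7.0, 3.0), (1.0, 9.0, 5.0)],
}

variances = {label: feature_variances(pts) for label, pts in clusters.items()}
for label, var in variances.items():
    print(label, var)
```

An attribute with a variance near 0 in one cluster is a candidate "shared property" of that cluster, but comparing it with the variance of the same attribute over the whole dataset helps rule out attributes that simply vary little everywhere.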
