Choosing attributes for k-means clustering
K-means clustering tries to minimize the within-cluster scatter, which for a fixed dataset is equivalent to maximizing the scatter between clusters. It does so over all attributes at once.
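For reference, this is the objective as I understand it, written so that the sum over attributes is explicit. The notation is my own assumption: $C_k$ is the set of points in cluster $k$, $x_{ij}$ is the value of attribute $j$ for point $i$, and $\mu_{kj}$ is the mean of attribute $j$ within cluster $k$.

```latex
\min_{C_1,\dots,C_K} \; \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \left( x_{ij} - \mu_{kj} \right)^2
```

Every attribute $j$ contributes its own term to the total, so an attribute with a large spread can dominate the sum regardless of whether it is relevant.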
I am learning about this method on several datasets. In one of them, countries are compared based on attributes related to their Human Development Index. However, some of the attributes are completely unrelated to this dimension, for example the total population of each country. How should I deal with these attributes? As mentioned above, k-means tries to minimize the scatter over all attributes, which would mean that these additional attributes could hurt the clusters. For example, I know that k-means cannot recover three clusters that are tightly grouped along one dimension but completely scattered along another.
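Here is a minimal sketch of what I mean, on made-up data rather than the real country dataset. The attribute names `x` and `y`, the scales, and the helper `wcss` are all hypothetical. It does not run k-means itself; it only compares the k-means objective for the "true" grouping against a grouping that simply slices the irrelevant attribute into three bands.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: three groups well separated on attribute x,
# plus an unrelated attribute y with a much larger spread.
n = 100
true_labels = np.repeat([0, 1, 2], n)
x = rng.normal(loc=true_labels * 5.0, scale=0.5)
y = rng.uniform(0.0, 1000.0, size=3 * n)
X = np.column_stack([x, y])


def wcss(X, labels):
    """Within-cluster sum of squares, returned separately per attribute."""
    total = np.zeros(X.shape[1])
    for k in np.unique(labels):
        pts = X[labels == k]
        total += ((pts - pts.mean(axis=0)) ** 2).sum(axis=0)
    return total


# Alternative partition: cut the irrelevant attribute y into three bands.
band_labels = np.digitize(y, [1000 / 3, 2000 / 3])

true_cost = wcss(X, true_labels)
band_cost = wcss(X, band_labels)

print("true partition   per-attribute WCSS:", true_cost, "total:", true_cost.sum())
print("y-band partition per-attribute WCSS:", band_cost, "total:", band_cost.sum())
```

On this toy data the partition along `y` has the lower total objective, even though it is much worse on `x`, so as far as I can tell k-means would have no reason to prefer the grouping I actually care about.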
Should one just exclude some attributes based on prior knowledge? Or is there perhaps a process that identifies irrelevant attributes?
Topic noise unsupervised-learning k-means clustering
Category Data Science