Scaling negative and positive variables when performing a k-means cluster analysis

Question

Jeff

May 12, 2022 11:04

I'm looking to perform a k-means cluster analysis on a data set whose variables span both positive and negative ranges. Because the ranges vary so much, the data will need to be scaled, but my concern is with the variables that take negative values. Should I apply some sort of log transformation to all the data so that everything is scaled to positive values? For example:
Variable A: 3.4, 5.6, 1.3, 7.6, 8.3
Variable B: 1, 2, 3, 2, 1
Variable C: -1.3, -1.4, -2.3, -4.2, -1.3

Topic k-means

Category Data Science

gcalongi · Accepted Answer · April 9, 2022 03:55

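You'll want to standardize each variable to zero mean and unit standard deviation (a z-score). This works the same way for negative and positive values, so no log transformation is needed (and a log is undefined for negative values anyway). For example, in MATLAB, for all values of Variable A this would be something like: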

VarA = zscore(VarA);

And then you'll want to repeat that for each variable before running k-means. Make sure you normalize each variable separately. This will put everything on the same scale so that the Euclidean distances are not weighted based on the width of the variable distributions.
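As a rough sketch of the steps above, using the three example variables from the question: the snippet below puts them into one matrix (one column per variable), standardizes each column, and runs k-means. The matrix name X, the choice of k = 2, and the fixed random seed are assumptions for illustration only, not something the question specifies. zscore and kmeans both require the Statistics and Machine Learning Toolbox.

```matlab
% Data from the question: one row per observation, one column per variable
X = [3.4  1  -1.3;
     5.6  2  -1.4;
     1.3  3  -2.3;
     7.6  2  -4.2;
     8.3  1  -1.3];

% Standardize each column separately (mean 0, standard deviation 1).
% zscore operates column-wise on a matrix, so no loop over variables is needed.
Xz = zscore(X);

% Equivalent without zscore (implicit expansion, R2016b or later):
% Xz = (X - mean(X)) ./ std(X);

disp(mean(Xz));   % column means, approximately 0
disp(std(Xz));    % column standard deviations, 1

% Cluster the standardized data. k = 2 is an arbitrary choice for illustration.
rng(1);                        % fix the random seed so the run is repeatable
[idx, C] = kmeans(Xz, 2);      % idx: cluster label per row, C: cluster centroids
```

The negative values in Variable C need no special handling here: after standardization every column is centered at 0 with unit spread, regardless of the sign of the original values.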

There is another good explanation of this on the Stats Stack Exchange.
