Scaling negative and positive variables when performing a k-means cluster analysis

I'm looking to perform a k-means cluster analysis on a data set whose variables span both positive and negative ranges. Because the ranges vary so much, the data will need to be scaled, but my concern is with the variables that take negative values. Should I apply some sort of log transformation to all the data so that everything is scaled to positive values? For example:
Variable A: 3.4, 5.6, 1.3, 7.6, 8.3
Variable B: 1, 2, 3, 2, 1
Variable C: -1.3, -1.4, -2.3, -4.2, -1.3



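You'll want to standardize each variable to zero mean and unit standard deviation (a z-score). Negative values are not a problem for this, so no log transformation is needed; a log is not even defined for negative numbers. For example, in MATLAB, for all values of Variable A this would be something like: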

VarA = zscore(VarA);

Then repeat that for each variable before running k-means. Make sure you standardize each variable separately. This puts everything on the same scale, so that the Euclidean distances are not dominated by the variables with the widest distributions.
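
For anyone not using MATLAB, here is a minimal Python sketch of the same per-variable step, using only the standard library and the example values from the question. The helper name `zscore` and the `data`, `scaled`, and `rows` variables are illustrative choices, not part of any library; the sample standard deviation (n - 1) is used because that is the default in MATLAB's `zscore`.

```python
from statistics import mean, stdev


def zscore(values):
    """Standardize a list of numbers to mean 0 and sample standard deviation 1."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (divides by n - 1)
    # Note: a constant variable has sigma == 0 and would need special handling.
    return [(v - mu) / sigma for v in values]


# Example values from the question (Variable C is entirely negative).
data = {
    "A": [3.4, 5.6, 1.3, 7.6, 8.3],
    "B": [1, 2, 3, 2, 1],
    "C": [-1.3, -1.4, -2.3, -4.2, -1.3],
}

# Standardize each variable separately, never the whole table at once.
scaled = {name: zscore(values) for name, values in data.items()}

# One tuple per observation, ready to hand to a k-means implementation.
rows = list(zip(scaled["A"], scaled["B"], scaled["C"]))

for name, values in scaled.items():
    print(name, [round(v, 2) for v in values])
```

After this step every variable, including the all-negative Variable C, has a mix of positive and negative values centered on zero, so the sign of the original data has no bearing on the clustering.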

There is another good explanation of this on the Stats Stack Exchange.
