How do I determine the best statistical transformation for standardization (e.g., log, square root) to remove bias between different datasets?

I'm currently working on applying data science to a High Performance Computing cluster by analyzing the log files it generates, trying to see whether there is a pattern that leads to a system failure (specifically STALE FILE HANDLE errors in the GPFS file system, for now). I categorize the log messages and cluster the time intervals based on the number of instances of each message per interval. Since some messages are far more frequent than others in any given time frame, I don't want the clustering to be biased towards the variable with the maximum variance.

Topic: hpc, dataset, statistics, clustering, bigdata

Category: Data Science


It's unclear exactly what the OP is asking (so this response is somewhat general), but the table below lists common contexts and the transformations that are typically applied:

sales, revenue, income, price --> log(x)

distance --> 1/x, 1/x^2, log(x)

market share, preference share --> e^x / (1 + e^x)

right-skewed distribution --> sqrt(x), log(x) (caution: log is undefined for x <= 0)

left-skewed distribution --> x^2
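
For the OP's situation, message counts per time interval, a minimal sketch in R of the usual recipe is shown below: log-transform the counts to damp the heavy right tail, then standardize each column so that no single message type dominates the clustering. The column names and data here are purely hypothetical.

```r
# Hypothetical example: rows are time intervals, columns are message categories.
set.seed(1)
counts <- data.frame(
  err_stale  = rpois(200, lambda = 2),    # rare message type
  info_mount = rpois(200, lambda = 50),   # very frequent message type
  warn_quota = rpois(200, lambda = 10)
)

# log1p handles zero counts (log(1 + x)); scale() z-scores each column so
# every message type contributes comparably to the distance computation.
X <- scale(log1p(counts))

# Cluster the time intervals; 3 centers is arbitrary here.
km <- kmeans(X, centers = 3, nstart = 25)
table(km$cluster)
```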

You can also use John Tukey's three-point method, as discussed in this post. When simple transformations don't work, use the Box-Cox transformation. In R, the car package can do this: estimate the exponent with lambda <- coef(powerTransform(x)) and then call bcPower(x, lambda) to transform the variable. Consider Box-Cox transformations on all variables with skewed distributions before computing correlations or creating scatterplots.
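
As a sketch of that Box-Cox workflow (assuming the car package is installed; the data x here are simulated and must be strictly positive, as Box-Cox requires):

```r
library(car)

set.seed(42)
x <- rexp(500, rate = 0.2) + 1   # toy right-skewed, strictly positive data

pt <- powerTransform(x)          # maximum-likelihood estimate of the Box-Cox lambda
lambda <- coef(pt)               # extract the estimated lambda
x_bc <- bcPower(x, lambda)       # apply the Box-Cox power transformation

# Inspect the effect, e.g. hist(x) vs. hist(x_bc)
```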
