How do I determine the best statistical transformation for standardization (e.g., log, square root) to remove bias between different datasets?

I'm currently working on applying data science to a High Performance Computing cluster by analyzing the log files it generates, trying to see whether there is a pattern that leads to a system failure (specifically STALE FILE HANDLE errors in the GPFS file system, for now). I categorize the log messages and cluster the time intervals based on the number of instances of each message per interval. Since some messages are far more frequent than others in any given time frame, I don't want the clustering to be biased towards the variable with the maximum variance.

Topic: hpc, dataset, statistics, clustering, bigdata

Category: Data Science


It's unclear exactly what the OP is asking (so this response is somewhat general), but the table below lists common contexts and the transformations that are typically applied:

sales, revenue, income, price --> log(x)

distance --> 1/x, 1/x^2, log(x)

market share, preference share --> e^x / (1 + e^x)

right-skewed distribution --> sqrt(x), log(x) (caution: log is undefined for x <= 0)

left-skewed distribution --> x^2
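
For the OP's situation, message counts per time interval, a minimal sketch in R of the usual recipe is shown below: log-transform the counts to damp the heavy right tail, then standardize each column so that no single message type dominates the clustering. The column names and data here are purely hypothetical.

```r
# Hypothetical example: rows are time intervals, columns are message categories.
set.seed(1)
counts <- data.frame(
  err_stale  = rpois(200, lambda = 2),    # rare message type
  info_mount = rpois(200, lambda = 50),   # very frequent message type
  warn_quota = rpois(200, lambda = 10)
)

# log1p handles zero counts (log(1 + x)); scale() z-scores each column so
# every message type contributes comparably to the distance computation.
X <- scale(log1p(counts))

# Cluster the time intervals; 3 centers is arbitrary here.
km <- kmeans(X, centers = 3, nstart = 25)
table(km$cluster)
```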

You can also use John Tukey's three-point method, as discussed in this post. When simple transformations don't work, use the Box-Cox transformation. In R, the car package can do this: estimate the exponent with lambda <- coef(powerTransform(x)) and then call bcPower(x, lambda) to transform the variable. Consider Box-Cox transformations on all variables with skewed distributions before computing correlations or creating scatterplots.
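
As a sketch of that Box-Cox workflow (assuming the car package is installed; the data x here are simulated and must be strictly positive, as Box-Cox requires):

```r
library(car)

set.seed(42)
x <- rexp(500, rate = 0.2) + 1   # toy right-skewed, strictly positive data

pt <- powerTransform(x)          # maximum-likelihood estimate of the Box-Cox lambda
lambda <- coef(pt)               # extract the estimated lambda
x_bc <- bcPower(x, lambda)       # apply the Box-Cox power transformation

# Inspect the effect, e.g. hist(x) vs. hist(x_bc)
```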
