How do I choose the best statistical transformation (e.g., log, square root) to standardize data and remove scale bias between different datasets?
I'm currently working on applying data science to a High Performance Computing cluster by analyzing the log files it generates, trying to see whether there is a pattern that leads to a system failure (specifically STALE FILE HANDLE errors in the GPFS file system, for now). I am categorizing the log messages and clustering based on the number of instances of each category per time interval. Since some message categories are far more frequent than others in any given time frame, I don't want the clustering to be biased toward the category with the maximum variance.
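To make the setup concrete, here is a rough sketch of the preprocessing I have in mind. The count matrix here is synthetic, and the log1p transform followed by z-scoring is just one candidate pipeline (square root would slot in the same way), not a settled choice:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical count matrix: one row per time interval, one column per
# message category; values are how often each category appeared.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=[200, 5, 40, 1], size=(500, 4)).astype(float)

# log1p compresses the heavy right tail typical of count data and handles
# zeros, so high-volume categories contribute less raw variance.
log_counts = np.log1p(counts)

# Z-score each category to mean 0 and unit variance, so Euclidean-distance
# clustering weights all categories equally regardless of original scale.
X = StandardScaler().fit_transform(log_counts)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))  # cluster sizes
```

My question is how to decide, in a principled way, which transformation (log, square root, something else) is appropriate for this kind of data before the standardization step.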
Topic hpc dataset statistics clustering bigdata
Category Data Science