Analysing process data with subgroups and checking for correlation between sensors

I have a dataset of process data from several pieces of equipment, each with many sensors. I would like to compute the correlation between the sensors to see whether any of them are strongly correlated, and potentially reduce the size of my dataset. The data covers many different processes of varying lengths, run on different equipment. For now I am assuming that the equipment makes no difference, so I do not want to include it in my analysis (yet).

When computing the correlation, I am unsure whether I should account for the different processes in the dataset. For example, two processes may have very different running times and very different sensor readings, and I am worried that pooling them into a single calculation might hide any correlation. What is the recommended approach for handling this situation, and is there a proper term for it?
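To make the concern concrete, here is a minimal sketch on synthetic data (not my real data). The process names `P1` and `P2`, the two sensors, the baselines and the row counts are all made up for illustration. It compares the pooled Pearson correlation with the correlation inside each process, and with the pooled correlation after centring each sensor on its per-process mean.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def make_process(n_rows, base_a, base_b, rng):
    """Hypothetical process: sensor b rises with sensor a, each on its own baseline."""
    rows = []
    for _ in range(n_rows):
        t = rng.uniform(0.0, 1.0)
        a = base_a + t
        b = base_b + t + rng.gauss(0.0, 0.05)
        rows.append((a, b))
    return rows


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumed setup: a short process and a long one with very different baselines.
data = {
    "P1": make_process(50, base_a=0.0, base_b=10.0, rng=rng),
    "P2": make_process(500, base_a=10.0, base_b=0.0, rng=rng),
}

# 1) Pooled: ignore the process label entirely.
pooled = [row for rows in data.values() for row in rows]
pooled_r = pearson([a for a, _ in pooled], [b for _, b in pooled])

# 2) Within each process separately.
within_r = {
    name: pearson([a for a, _ in rows], [b for _, b in rows])
    for name, rows in data.items()
}

# 3) Pooled again, after subtracting each process's own mean from each sensor.
centred = []
for rows in data.values():
    mean_a = sum(a for a, _ in rows) / len(rows)
    mean_b = sum(b for _, b in rows) / len(rows)
    centred.extend((a - mean_a, b - mean_b) for a, b in rows)
centred_r = pearson([a for a, _ in centred], [b for _, b in centred])

print("pooled:", round(pooled_r, 3))
print("within:", {name: round(r, 3) for name, r in within_r.items()})
print("centred:", round(centred_r, 3))
```

In this toy setup the two sensors move together inside each process, but the pooled figure is dominated by the difference in baselines between the processes and comes out negative. Centring per process is only one possible way to handle it, and I do not know whether it is the recommended one.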

This link shows the kind of correlation calculation I would like to perform (though obviously not with the sample data shown there): https://spark.apache.org/docs/2.2.0/ml-statistics.html

