Correcting for one of multiple strong batch effects in a dataset
I am wondering which statistical tools to use when analysing data that have multiple strong batch effects (distributions vary from one batch to another). I would like to correct for the batch effect that originates from one variable, without removing the potential batch effects that come from other variables.
If this is unclear, a short example is probably the best way to explain my problem:
Imagine that we have 10 persons taking part in an experiment. The experiment is as follows:
- each person is given a set of 1 000 numbered tennis balls whose physical properties are all known (for instance, the weight, diameter and colour of each ball)
- each person is then asked to throw the 1 000 tennis balls one by one as far as they can, and to record their results
Following the experiment, we will know for each of the 10 000 tennis balls:
- the distance at which the ball was thrown
- who threw the ball
- the ball's weight
- the ball's diameter
- the ball's colour
Now, because not everyone has the same capabilities for throwing tennis balls (be it in terms of muscle strength or something else), we can expect to see some strong batch effects within the data (for instance, we could observe that a ball thrown by the first person will, on average, have travelled farther than a ball with the same weight and diameter thrown by the second participant, etc.).
Correcting for this kind of batch effect could be done in multiple ways if everyone had been given the same set of balls (in a setting with normal distributions, standardisation would probably work fine). Now, imagine that when organising the experiment, we did not pay enough attention and ended up giving some people heavier tennis balls, some others smaller tennis balls, etc.
At the end of the experiment, we indeed find, using a Chi2 test (or, say, a Kruskal-Wallis H test), that not everyone was given a set of balls drawn at random from all 10 000 balls.
How can we then correct for who threw the ball, without removing the batch effects that arise because the sets of balls were different?
The main problem is that by correcting for the batch effect using regular standardisation (for instance), we will probably end up also removing the effect due to the fact that some people were given heavier or larger balls.
Or, in other words, using the example: how could we account for the difference in strength between the first and the second participant while not correcting for the fact that the first participant had, on average, heavier tennis balls than the second?
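To make the concern concrete, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the two throwers, the linear model `distance = strength - 0.2 * weight + noise`, and all numeric values. It shows that standardising the distances within each thrower forces both group means to zero, so the part of the gap that is due to ball weight disappears together with the part that is due to strength.

```python
import random
import statistics

random.seed(0)

def simulate(strength, mean_weight, n=1000):
    """Hypothetical model: distance = strength - 0.2 * weight + noise."""
    rows = []
    for _ in range(n):
        weight = random.gauss(mean_weight, 2.0)
        distance = strength - 0.2 * weight + random.gauss(0.0, 1.0)
        rows.append((weight, distance))
    return rows

# Thrower A is stronger but was handed heavier balls than thrower B.
throws = {"A": simulate(30.0, 60.0), "B": simulate(25.0, 55.0)}

def zscore(values):
    """Standardise a list of values to mean 0 and unit standard deviation."""
    mu = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]

# Mean raw distance per thrower (mixes strength and ball weight).
raw_means = {
    who: statistics.fmean(d for _, d in rows) for who, rows in throws.items()
}

# Per-thrower standardisation of the distances.
standardised = {
    who: zscore([d for _, d in rows]) for who, rows in throws.items()
}

for who in throws:
    print(who, "raw mean:", round(raw_means[who], 2),
          "standardised mean:", round(statistics.fmean(standardised[who]), 6))
```

Under these assumed numbers, the raw gap between A and B mixes a strength difference of 5 with a weight-driven difference of about 1 in the opposite direction. After per-thrower standardisation both means are zero, so neither component survives.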
At first, I was thinking of fitting a Generalized Linear Model with the distance at which the balls were thrown as the dependent variable and all the other variables as regressors, and then subtracting from the dependent variable only the estimated effect of the variable for who threw the ball. I am however unsure whether this would make statistical sense, which is why I am asking whether this approach would work, or whether other techniques should be used instead.
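For concreteness, the idea described above could be sketched as follows. This is only an illustration under stated assumptions, not a validated method: it uses ordinary least squares (the Gaussian, identity-link special case of a GLM) via `numpy.linalg.lstsq`, it assumes the thrower effect is additive, and it runs on simulated data whose coefficients and distributions are made up. Colour is left out to keep the sketch short. It also relies on each thrower having balls with a spread of weights; if thrower and ball properties were perfectly confounded, the two effects could not be separated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_throwers, n_balls = 10, 1000

# Simulated data (all numbers are made up for illustration).
thrower = np.repeat(np.arange(n_throwers), n_balls)
strength = rng.normal(30.0, 3.0, n_throwers)       # true thrower effects
mean_weight = rng.normal(58.0, 2.0, n_throwers)    # unequal ball sets
weight = rng.normal(mean_weight[thrower], 2.0)
diameter = rng.normal(6.7, 0.1, thrower.size)
distance = (strength[thrower] - 0.2 * weight + 1.5 * diameter
            + rng.normal(0.0, 1.0, thrower.size))

# Design matrix: one dummy column per thrower (no separate intercept),
# followed by the ball covariates.
dummies = (thrower[:, None] == np.arange(n_throwers)).astype(float)
X = np.column_stack([dummies, weight, diameter])
coef, *_ = np.linalg.lstsq(X, distance, rcond=None)
thrower_effect = coef[:n_throwers]

# Subtract only the estimated thrower effect, re-centred on its mean so the
# corrected distances stay on the original scale. The weight and diameter
# terms are left in the data untouched.
corrected = distance - thrower_effect[thrower] + thrower_effect.mean()

print("estimated weight coefficient:", round(float(coef[n_throwers]), 3))
```

Because the ball covariates are in the model, the thrower coefficients are estimated at fixed weight and diameter, so subtracting them should leave the weight-driven differences between the sets of balls in `corrected`.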
Topic normalization preprocessing feature-scaling regression
Category Data Science