Dealing with historic data drift

I'm trying to predict a continuous target in an industrial context.

The problem I'm facing is that some of the predictors have changed over time; for example, the pressure in the machine was increased. This influenced some of the other predictors, but it hasn't influenced my target.

As an example (in R formula notation):

$Y \sim U_1$: the target depends on some unobservable variable.

$X_j \sim U_1 + X_i$: one of my observed variables depends on the unobservable variable and on another observed variable. Therefore $X_j$ is helpful for predicting $Y$.

Now $X_i$ has changed a couple of times. This clearly hasn't influenced my target. But I'm also not really able to learn the relation $Y \sim X_j$ because $X_j$ has changed with $X_i$.
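
To make this concrete, here is a minimal simulation of that setup (all names, shift sizes and noise levels are made up): $X_i$ jumps between operating regimes, $X_j$ moves with it, and $Y$ does not, so the pooled $Y \sim X_j$ relationship looks much weaker than it is within a single regime.

```python
# Y depends only on the unobserved U1; X_j depends on U1 and on X_i,
# and X_i is shifted by operational changes that never touch Y.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 3000
u1 = rng.normal(size=n)                      # unobservable driver of Y
regime = np.repeat([0.0, 2.0, 5.0], n // 3)  # e.g. machine pressure raised twice
x_i = regime + rng.normal(scale=0.1, size=n)
x_j = u1 + x_i + rng.normal(scale=0.1, size=n)
y = u1 + rng.normal(scale=0.1, size=n)

df = pd.DataFrame({"x_i": x_i, "x_j": x_j, "y": y})
# Pooled over all regimes, the Y ~ X_j correlation is diluted by the level
# shifts in X_i; within a single regime it is strong.
print(df["y"].corr(df["x_j"]))                                # pooled
print(df.iloc[: n // 3]["y"].corr(df.iloc[: n // 3]["x_j"]))  # first regime only
```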

I know some of these dependencies for a fact from physics, but there's no way I can fix this by hand because there are about 1000 variables.

When reading about data drift, it's always about how to adjust an existing model to a sudden change, but in my case the changes already happened in the past.

The time periods where nothing has changed are too short to just use the latest batch, but it seems like just using the whole dataset without any adjustments doesn't work either.

Can anyone give advice on how to address this?

(Right now I'm using XGBoost but I'm open to other models)

Topic concept-drift

Category Data Science


Based on your post (my emphasis):

> The problem I'm facing is that some of the predictors have changed over time; for example, the pressure in the machine was increased. This influenced some of the other predictors, *but it hasn't influenced my target*.

So to be clear, the input parameters changed but made no impact on the output parameter?

There are two possibilities:

  1. The data is part of a single population; in that case the change in the input parameters (over that range) has little or no impact on the output parameter. This can be addressed by creating a new feature that is a delta from some baseline and seeing if that new feature is a better fit for your model.
| Pressure | Piston Position | Leakage Flow (Out) |
|----------|-----------------|--------------------|
| 10       | 20              | 30                 |
| 11       | 22              | 30                 |
| 12       | 23              | 33                 |

The inputs in row 2 have changed relative to row 1, but there is no change in the output. Therefore a small change in parameters 1 and 2 has no impact on the output (physical examples include overcoming hysteresis, localised energy storage, a data buffer, or a bucket that has to fill with water before spilling over).

In this case the output isn't sensitive to the small changes, and perhaps they are best expressed relative to the original value. So the new features could be:

| Pressure | Piston Position | Leakage Flow (Out) | F1 | F2 |
|----------|-----------------|--------------------|----|----|
| 10       | 20              | 30                 | 0  | 0  |
| 11       | 22              | 30                 | 0  | 0  |
| 12       | 23              | 33                 | 1  | 1  |

So F1 becomes 1 only when the difference between row 1 and the current row exceeds some threshold (here 1) and is 0 otherwise; F2 works the same way with its own threshold. A minimal sketch of this is given after the list.

  2. The data is part of more than one population. As an example, say you're taking measurements of a product. Over time the measurement device loses calibration and you see a drift in the measurement. The only real fix is to determine the drift due to the loss of accuracy and apply a corrective factor as a function of time or base value; see the second sketch below.
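
For point 1, here is a minimal sketch of the delta-from-baseline indicator features in Python/pandas. The column names follow the example tables, and the thresholds are chosen purely to reproduce the F1/F2 values shown above:

```python
import pandas as pd

# Example data from the tables above.
df = pd.DataFrame(
    {
        "pressure": [10, 11, 12],
        "piston_position": [20, 22, 23],
        "leakage_flow_out": [30, 30, 33],
    }
)

baseline = df.iloc[0]  # the first row acts as the reference state

# F1/F2 become 1 once an input has moved further from baseline than a
# threshold (thresholds of 1 and 2 are picked only to match the table above).
df["f1"] = (df["pressure"] - baseline["pressure"]).abs().gt(1).astype(int)
df["f2"] = (df["piston_position"] - baseline["piston_position"]).abs().gt(2).astype(int)
print(df)
```

In a real dataset the baseline would be a known stable period rather than literally the first row, and the thresholds would come from the physics or from the data.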
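
For point 2, here is a minimal sketch of estimating a drift term as a function of time and removing it before modelling. The linear-drift assumption and all numbers are illustrative; this only works if the true quantity has no trend of its own over the same window:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
t = np.arange(n)                          # time index of the measurements
true_value = rng.normal(loc=50.0, scale=1.0, size=n)
measured = true_value + 0.01 * t          # sensor slowly losing calibration

# Fit measurement vs. time; the slope estimates the drift rate
# (valid only because the simulated true value has no trend of its own).
slope, intercept = np.polyfit(t, measured, deg=1)
corrected = measured - slope * t          # apply the corrective factor

print(f"estimated drift per time step: {slope:.4f}")
print(f"spread before: {measured.std():.2f}, after: {corrected.std():.2f}")
```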
