Normalization in production

I am currently writing a machine learning pipeline for my time series application. At the end of each month, I gather the new data, normalize it to [0, 1], retrain the ML model with the new observations only, and predict future values.

Question

Should I be re-reading the entire dataset each time I get a new observation, normalizing the entire dataset, creating the ML model, and then predicting?

How I got stuck:

  • Let's say I have 1 feature and at t-1 all of the values have min/max = [0, 1000]
  • At t, a new observation comes in with value = 1001
  • How should I normalize the new value given that the ML model has been trained with different min/max?

Thank you

Topic batch-normalization normalization python machine-learning

Category Data Science


Normalizing the entire dataset for a single new observation may not be practical. If normalization yields a value outside [0, 1], consider clipping it to 0 or 1 (as the case may be) as an approximation. Usually, this is sufficient.

Do remember to flag the event with appropriate markers and alarms so that the risks are known to the users of the prediction. If these alarms fire often enough, you may want to reformulate the model so it does not depend on [0, 1] normalization, or change the data-shaping logic to ensure you always stay within the range.
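A minimal sketch of the clip-and-flag idea above, using the question's example of a training range of [0, 1000] and a new value of 1001 (the function name and flag handling are illustrative, not from the original):

```python
import numpy as np

def normalize_with_flag(x, train_min, train_max):
    """Scale x to [0, 1] using the training-set min/max,
    clipping out-of-range values and flagging them for alarms."""
    scaled = (x - train_min) / (train_max - train_min)
    out_of_range = (scaled < 0.0) | (scaled > 1.0)
    return np.clip(scaled, 0.0, 1.0), out_of_range

# Training data spanned [0, 1000]; a new value of 1001 arrives.
values, flags = normalize_with_flag(np.array([500.0, 1001.0]), 0.0, 1000.0)
# values → [0.5, 1.0]; flags → [False, True], so the 1001 case can raise an alarm
```

The flag array lets downstream consumers distinguish a genuine 1.0 from a clipped out-of-range value.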


Normalization is a transformation of the data. The parameters of that transformation should be found on the training dataset. Then the same parameters should be applied during prediction.

You should not re-fit the normalization parameters during prediction. A machine learning model maps feature values to target labels; you should not change the feature values without also changing the mapping. If you change just the feature values, you risk inconsistent mappings.

If you are training on a specific range of features and then during prediction there are out-of-range feature values, there are two choices:

  1. Clip it to the limits of the range; in the case of 1001, it would be transformed to 1.

  2. Decide whether it makes sense to make a prediction for that feature value at all; some machine learning models should not extrapolate.
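The fit-once-then-freeze pattern above can be sketched as a small scaler class (the class name is illustrative; it mirrors the fit/transform convention popularized by scikit-learn):

```python
import numpy as np

class TrainOnlyMinMax:
    """Min-max scaler whose parameters are found once on the training
    set and then frozen for all future predictions."""

    def fit(self, X):
        # Normalization parameters come from the training data only.
        self.min_ = float(np.min(X))
        self.max_ = float(np.max(X))
        return self

    def transform(self, X, clip=True):
        scaled = (np.asarray(X, dtype=float) - self.min_) / (self.max_ - self.min_)
        # Choice 1: clip out-of-range values to the training range.
        # Choice 2 (clip=False): return the raw value so the caller can
        # decide whether extrapolating is sensible.
        return np.clip(scaled, 0.0, 1.0) if clip else scaled

scaler = TrainOnlyMinMax().fit(np.array([0.0, 250.0, 1000.0]))
scaler.transform([1001.0])               # → [1.0] (clipped, choice 1)
scaler.transform([1001.0], clip=False)   # → [1.001] (inspect/refuse, choice 2)
```

scikit-learn's `MinMaxScaler` supports the same workflow directly (with a `clip` option in recent versions); the point is that `fit` happens on the training set and only `transform` happens at prediction time.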


It really depends.

Why? Updating everything in production (pre-processing, fitting, etc.) can get extremely expensive. If you have a complex architecture, it is not worth it.

Alternatives

  1. Approximate the covariate shift: if you know the distribution of your future data, you can adjust your normalisation parameters (for example) in advance.

  2. Save your future data every time you make a prediction; it can be cheaper to quickly store the data in a DB and, depending on your system, do updates weekly or monthly.
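Alternative 2 can be sketched as a scaler that serves predictions with frozen parameters, buffers incoming observations, and only refits on a schedule (class and method names are illustrative; the "DB" here is just an in-memory list):

```python
class BufferedScaler:
    """Serve predictions with the current (frozen) normalization
    parameters, buffer new observations, and refit only on a
    weekly/monthly schedule instead of on every prediction."""

    def __init__(self, history):
        self.history = list(history)  # stand-in for the stored dataset
        self.buffer = []              # observations awaiting the next update
        self._refit()

    def _refit(self):
        self.min_ = min(self.history)
        self.max_ = max(self.history)

    def normalize(self, x):
        # Buffer the new value for the next scheduled update, but
        # normalize it with the *current* parameters (clipped).
        self.buffer.append(x)
        scaled = (x - self.min_) / (self.max_ - self.min_)
        return min(max(scaled, 0.0), 1.0)

    def scheduled_update(self):
        # Run weekly/monthly: fold the buffer into history and refit.
        self.history.extend(self.buffer)
        self.buffer.clear()
        self._refit()

s = BufferedScaler([0.0, 1000.0])
s.normalize(1001.0)      # clipped to 1.0 under the old [0, 1000] range
s.scheduled_update()     # range is now [0, 1001]
```

This keeps per-prediction cost low while still letting the normalization range catch up periodically.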
