What parameters to use when normalising training, validation, and testing data?

I know a similar post was made here, but I wanted to ask some follow-up questions. I am conducting a cross-validation search to find values for a set of hyper-parameters and need to normalise the data.

If we split up the data as follows:

  1. Split the full dataset into a 'training' set (call this set 'A' for now) and a testing set
  2. Split 'A' into a smaller training set (call this set 'B' for now) and a validation set

what parameters should be used when normalising the datasets?
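
Concretely, the split I have in mind looks something like this (a sketch assuming scikit-learn's train_test_split; the data and split ratios are just placeholders):

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Placeholder data: 100 samples, 5 features
    X = np.random.rand(100, 5)
    y = np.random.rand(100)

    # Step 1: hold out the final test set; 'A' is everything that remains
    X_A, X_test, y_A, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Step 2: split 'A' into the actual training set 'B' and a validation set
    X_B, X_val, y_B, y_val = train_test_split(X_A, y_A, test_size=0.25, random_state=0)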

Am I correct in thinking that:

  1. We compute the means and standard deviations on dataset 'B' and use them to normalise 'B'
  2. We then normalise the validation set using those same parameters obtained from set 'B'
  3. Once we have used the validation set to find the hyper-parameters with cross-validation, we normalise set 'A' and extract its parameters
  4. Use the parameters from set 'A' to normalise the testing set

Is this correct, or have I misunderstood something? I know this is basic, but I can't seem to find a straightforward answer to this anywhere.
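
In code, I imagine the procedure looking roughly like this (continuing from the split above; StandardScaler is just my stand-in for the normaliser):

    from sklearn.preprocessing import StandardScaler

    # Steps 1-2: fit the scaler on 'B' only, then apply it to 'B' and the validation set
    scaler_B = StandardScaler().fit(X_B)         # means/stds estimated on 'B'
    X_B_norm = scaler_B.transform(X_B)
    X_val_norm = scaler_B.transform(X_val)

    # ... hyper-parameter search with cross-validation happens here ...

    # Steps 3-4: with hyper-parameters fixed, refit the scaler on all of 'A'
    scaler_A = StandardScaler().fit(X_A)         # means/stds estimated on 'A'
    X_A_norm = scaler_A.transform(X_A)
    X_test_norm = scaler_A.transform(X_test)     # test set normalised with 'A' parameters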

Tags: training, normalization, cross-validation, python


I am not exactly sure what you mean by "what parameters should be used when normalizing datasets."

However, it is important to note:

Normalization is a preprocessing step applied to some or all of the input features before the model is built; its "parameters" are the statistics (such as per-feature means and standard deviations) estimated from the data, not parameters of the model itself.
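
As a minimal illustration (with made-up numbers), the "parameters" of standardization are just the per-feature means and standard deviations:

    import numpy as np

    X_train = np.array([[1.0, 200.0],
                        [2.0, 300.0],
                        [3.0, 400.0]])

    # The normalization parameters: one mean and one std per feature (column)
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)

    X_train_norm = (X_train - mu) / sigma    # each column now has mean 0, std 1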

But in answer to your question:

You always normalize the train and the test set using the same parameters, and those parameters are computed on the training data (otherwise, how would you be able to compare the results?).
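
In scikit-learn terms, that means fitting the scaler on the training data only and reusing it, unchanged, on the test data (a sketch with placeholder arrays):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X_train = np.random.rand(80, 5)          # placeholder training features
    X_test = np.random.rand(20, 5)           # placeholder test features

    scaler = StandardScaler().fit(X_train)   # parameters come from the training set only
    X_train_norm = scaler.transform(X_train)
    X_test_norm = scaler.transform(X_test)   # the same parameters are reused on the test set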
