Should I impute the missing values before the train-validation split?

Question

Should I impute the missing values before the train-validation split?

blueglass

April 16, 2022, 11:03

Validation is supposed to provide an unbiased evaluation of a model fit on the training data. In that case, imputation before the train-validation split could cause indirect data leakage: the fill values would be computed from all rows, so the data that is supposed to act as test data would already be contaminated by the imputation.
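To make the concern concrete, here is a minimal sketch with a made-up toy column (the values and the `observed` helper are assumptions for illustration only). It shows that a mean computed before the split depends on the validation rows, while a mean computed after the split does not.

```python
from statistics import mean

# Hypothetical toy column; None marks a missing value.
train = [1.0, 2.0, None, 3.0]
validation = [100.0, None]

def observed(values):
    """Return only the non-missing values."""
    return [v for v in values if v is not None]

# Imputing before the split: the fill value is computed from all rows,
# so it already depends on the validation data.
fill_before_split = mean(observed(train + validation))

# Imputing after the split: the fill value uses the training rows only.
fill_after_split = mean(observed(train))

print(fill_before_split)  # 26.5
print(fill_after_split)   # 2.0
```

In the first case, the value used to fill the training rows has been shifted by a validation observation the model should never have seen.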

So the correct approach would be to calculate the statistics (mean, mode) using only the training data, and use them to fill the missing values in both the training and the validation data. This would be repeated for every training-validation partition, and the metrics would then be calculated to tune the hyperparameters. Finally, the statistics would be recalculated on the whole training set (training plus validation data) and used to fill the missing values in the training and test data. Am I wrong? I've been tearing my hair out thinking about this. I would like a good explanation.
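The procedure described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the data are made up, the folds are contiguous and unshuffled, only mean imputation of a single column is shown, and `fit_mean`, `fill`, and `k_fold_indices` are hypothetical helper names.

```python
from statistics import mean

def fit_mean(rows):
    """Return the mean of the observed (non-None) values."""
    return mean([v for v in rows if v is not None])

def fill(rows, value):
    """Replace every missing entry with the given fill value."""
    return [value if v is None else v for v in rows]

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) for k contiguous folds, without shuffling."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

# Hypothetical data: one feature column, None marks a missing value.
train_full = [1.0, None, 3.0, 5.0, None, 7.0]
test = [None, 4.0]

fold_fill_values = []
for train_idx, val_idx in k_fold_indices(len(train_full), 3):
    fold_train = [train_full[i] for i in train_idx]
    fold_val = [train_full[i] for i in val_idx]

    # Statistics come from the training part of this fold only.
    fill_value = fit_mean(fold_train)
    fold_train_filled = fill(fold_train, fill_value)
    fold_val_filled = fill(fold_val, fill_value)
    fold_fill_values.append(fill_value)
    # A model would be fit on fold_train_filled and scored on
    # fold_val_filled here to compare hyperparameter settings.

# After tuning: recompute the statistic on the whole training set
# (training plus validation) and apply it to training and test data.
final_fill = fit_mean(train_full)
train_filled = fill(train_full, final_fill)
test_filled = fill(test, final_fill)
```

In scikit-learn, placing a `SimpleImputer` inside a `Pipeline` and passing that pipeline to `cross_val_score` or `GridSearchCV` refits the imputer on the training part of each fold, which has the same effect as the loop above.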

Topic data-imputation preprocessing

Category Data Science
