Should I impute the missing values before the train-validation split?

Validation is supposed to provide an unbiased evaluation of a model fit on the training data. In that case, imputation before the train-validation split could cause indirect data leakage, because the data that is supposed to act as test data has already contributed to the imputation statistics and is therefore contaminated.
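To make the concern concrete, here is a minimal sketch with a made-up toy column (the values and the names `train`, `valid`, and `observed` are purely illustrative). It compares the fill value obtained from all rows with the one obtained from the training rows only:

```python
from statistics import mean

# Hypothetical toy column; None marks a missing value.
train = [1.0, 2.0, None, 3.0]
valid = [100.0, None]

def observed(values):
    """Return only the non-missing entries."""
    return [v for v in values if v is not None]

# Leaky: the fill value is computed from training and validation rows together.
leaky_fill = mean(observed(train + valid))

# Clean: the fill value is computed from the training rows only.
clean_fill = mean(observed(train))

# The same training column, filled with each statistic.
train_leaky = [leaky_fill if v is None else v for v in train]
train_clean = [clean_fill if v is None else v for v in train]

print(leaky_fill, clean_fill)
```

In the leaky variant the value written into the training row depends on the validation row `100.0`, so the model sees information from data it is later evaluated on.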

So the correct approach would be to calculate the statistics (mean, mode) using only the training data and use them to fill the missing values in both the training and validation data. This is repeated for every training/validation partition, and the metrics computed on each partition are used to tune the hyperparameters. After that, the statistics are calculated again on the whole training set (training plus validation data) and used to fill the missing values in that set and in the test data. Am I wrong? I've been tearing my hair out thinking about this, and I would like a good explanation.
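The procedure described above can be sketched with scikit-learn, assuming that library is what is being used (the question does not say). The synthetic data, the choice of `LogisticRegression`, and the grid over `C` are all assumptions made for illustration. Putting the imputer inside a `Pipeline` means it is refit on the training part of every fold, and `GridSearchCV` with its default `refit=True` then refits the best pipeline on the whole training-plus-validation set:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the entries

# Hold out the test set before any statistic is computed.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The imputer is a pipeline step, so within each CV fold its means are
# learned from that fold's training rows only.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("clf", LogisticRegression()),
])

# Tune a hyperparameter with 5-fold cross-validation on the train+validation data.
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_trainval, y_trainval)

# After the search, the best pipeline has been refit on all of X_trainval,
# so these means come from training plus validation data, never from the test set.
fitted_means = search.best_estimator_.named_steps["impute"].statistics_

# The test set is imputed with those means and scored once.
test_score = search.score(X_test, y_test)
print(search.best_params_, test_score)
```

The design point is that the split always comes first and the imputer is treated as part of the model, so it is fit wherever the model is fit and only applied elsewhere.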

