K-Fold cross validation and data leakage

I want to do K-Fold cross validation, and I also want to do normalization or feature scaling within each fold. Say we have k folds: at each step we take one fold as the validation set and the remaining k-1 folds as the training set. I then want to fit the feature scaling and data imputation on that training set and apply the same transformation to the validation set, repeating this for each step. I am trying to avoid data leakage as much as possible while still rescaling my validation sets to get better results.

How can I do this with a few lines of code?
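To make the procedure concrete, here is the manual version I have in mind, using iris as a stand-in dataset (the imputer and scaler are fit on the k-1 training folds only, then applied to the held-out fold):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in kf.split(X):
    X_tr, X_val = X[train_idx], X[val_idx]
    y_tr, y_val = y[train_idx], y[val_idx]

    # Fit the imputer on the training folds only, then transform both sets
    imputer = SimpleImputer(strategy="mean").fit(X_tr)
    X_tr, X_val = imputer.transform(X_tr), imputer.transform(X_val)

    # Same for the scaler: no validation-fold statistics leak into training
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_val = scaler.transform(X_tr), scaler.transform(X_val)

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(clf.score(X_val, y_val))

print(f"Mean accuracy cv: {np.mean(scores)}")
```

This works, but it is verbose, which is why I am asking for a shorter way.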

Secondly, is it necessary to do this? I don't see many people do it for k-fold validation. I have often seen feature scaling and imputation applied to the entire dataset first, followed by k-fold cross validation. But doesn't this cause data leakage?

Topic: data-leakage, data-imputation, feature-scaling, cross-validation

Category: Data Science


A reproducible example with no data leakage:

Here the data are scaled (and imputed) using only the training folds at each step of the k-fold stage. This works because the preprocessing lives inside a Pipeline, and cross_val_score clones and refits the whole pipeline on each split.

import numpy as np

from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Imputation, scaling and the model live in one pipeline, so every
# cross-validation split refits the preprocessing on its training folds only
model = Pipeline([
    ("imputing", SimpleImputer(strategy="mean")),
    ("scaling", StandardScaler()),
    ("modeling", LogisticRegression(random_state=42, class_weight="balanced")),
]).fit(X_train, y_train)

cv_scores = cross_val_score(estimator=model, X=X_train, y=y_train, scoring="accuracy")

print(f"Mean accuracy cv: {np.mean(cv_scores)}")

# Final evaluation on the held-out test set
print(f"Test accuracy: {model.score(X_test, y_test)}")

Note that in this case it is easy to apply the whole pipeline because all the features share the same data type, but imagine you have categorical and continuous features: then you need to apply different preprocessing and imputation to each.

In that case a combination of ColumnTransformer and Pipeline would do the job.

For reference check: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html
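A minimal sketch of that combination, on a toy mixed-type dataset (the column names and the generated data here are made up for illustration): each column group gets its own imputer and transformer, and the ColumnTransformer sits inside the pipeline so the no-leakage property is preserved across folds.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy mixed-type data with some missing values (hypothetical columns)
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "age": rng.normal(40, 10, 200),
    "income": rng.normal(50_000, 15_000, 200),
    "city": rng.choice(["a", "b", "c"], 200).astype(object),
})
X.loc[::20, "age"] = np.nan
X.loc[::30, "city"] = np.nan
y = rng.integers(0, 2, 200)

numeric = ["age", "income"]
categorical = ["city"]

# Separate imputation/encoding per column type, all fit within each fold
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("modeling", LogisticRegression(random_state=42, max_iter=1000)),
])

cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Mean accuracy cv: {np.mean(cv_scores)}")
```

The linked example in the scikit-learn docs follows the same structure with a real dataset.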
