Updating a train/val/test set

It is considered best practice to split your data into a train and test set at the start of a data science / machine learning project (and to split your train set further into a validation set for hyperparameter optimisation).

If it turns out that the distribution in your train set isn't the same as in your test set (perhaps one group is completely missing from the test set, or another group is over-represented in it), what do you do?

Is it a problem, after learning that the distributions in the two sets differ, to re-compute your train and test sets? Practically this must be acceptable, since you won't always know upfront whether the sets are representative. However, it is a form of data leakage: you are applying information learnt during the task to the creation of these sets, information that wasn't necessarily available before you started.

How does one deal with such a scenario?
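One quick sanity check before deciding anything is to compare group proportions across the two splits. A minimal sketch, assuming the labels fit in a pandas Series (the labels here are made up purely for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels for illustration: group "c" is rare (5 of 100).
y = pd.Series(["a"] * 50 + ["b"] * 45 + ["c"] * 5)

# Plain random split, no stratification.
y_train, y_test = train_test_split(y, test_size=0.25, random_state=0)

# Compare the class proportions in each split; a rare group may be
# missing or over-represented on one side purely by chance.
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

If the proportions differ badly, that is the situation described above.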



In the case of imbalanced datasets, we can either stratify while splitting the data, use cross-validation, or both. Stratifying/CV upfront helps mitigate the data leakage and developer confirmation bias described in the question.

Stratify

from sklearn.model_selection import train_test_split

# stratify=y preserves the class proportions of y in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    test_size=0.25)

Cross validation

>>> from sklearn import svm
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import cross_val_score
>>> X, y = load_iris(return_X_y=True)
>>> clf = svm.SVC(kernel='linear', C=1, random_state=42)
>>> scores = cross_val_score(clf, X, y, cv=5)
>>> scores
array([0.96..., 1. , 0.96..., 0.96..., 1. ])

In this example there are 5 splits, so 5 models are trained and 5 scores produced.

RepeatedStratifiedKFold helps further: it preserves class proportions in every fold and repeats the whole procedure several times.

from numpy import mean
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

# model is any classifier, e.g. the SVC above.
# Note: stratified CV is for classification, so a classification metric
# such as 'accuracy' is used ('mean_squared_error' is not a valid
# sklearn scoring name; the regression scorer is 'neg_mean_squared_error').
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Mean accuracy: %.4f' % mean(scores))

In this example there are 5 splits and 2 repeats, so 10 models in total are trained and their scores aggregated.


There are two really different scenarios:

The training and test data are obtained from the same dataset

If the data has been randomly split between training and test set, this is extremely unlikely to happen in the first place. If the data contains some small groups/classes, then the split can be made not only randomly but also with stratified sampling: this prevents 100% of a group from ending up on one side of the split by chance. And if some classes or groups still appear too rarely (only once or twice), they should usually be discarded or replaced by some generic value at a preprocessing stage.
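As a sketch of the stratified option, with made-up labels where group "c" holds only 10 of 100 samples:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels with a small group "c" -- illustration only.
y = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
X = list(range(len(y)))

# stratify=y keeps the class proportions identical on both sides,
# so the rare group cannot vanish from the test set by chance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0)

print(Counter(y_train))  # 48 "a", 24 "b", 8 "c"
print(Counter(y_test))   # 12 "a", 6 "b", 2 "c"
```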

The training and test data are obtained independently

This is the serious case. There are situations where external constraints make it unavoidable to have slightly different distributions in the training and test set, even though this breaches the main assumption of supervised learning. Note that if the distributions differ too much, it's very likely a lost cause. In this scenario one doesn't have a choice: if the training and test set are provided separately, then it's expected that the performance is measured on the test set "as is". So it's a matter of working within this constraint: some specific preprocessing may be necessary (e.g. introducing a special group/class 'unknown' in the training set), a robust method may be preferable, possibly planning a default prediction (majority class) for invalid instances, etc.
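A minimal sketch of the 'unknown' preprocessing idea, with hypothetical category values (the names and variables are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical categorical feature; "madrid" never appears in training.
train_city = pd.Series(["london", "paris", "london", "berlin"])
test_city = pd.Series(["paris", "madrid"])

# Collapse categories unseen during training into an explicit
# "unknown" value, so the model can still score such test rows.
known = set(train_city.unique())
test_city_clean = test_city.where(test_city.isin(known), "unknown")
print(test_city_clean.tolist())  # ['paris', 'unknown']
```

To make the model aware of this value, "unknown" can also be injected into the training data (e.g. by mapping a few rare training categories to it).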

Mistakes happen

In any case, if one realizes that there's something wrong in the design of the splitting process or any other problem which makes it necessary to re-shuffle the data, well, it's sometimes better to redo the whole process despite the risk of data leakage. Of course it's better if this can be avoided, but it's not as if the ML police is going to arrest you ;)
