Updating a train/val/test set
It is considered best practice to split your data into a train and test set at the start of a data science / machine learning project (and then to split your train set further into a validation set for hyperparameter optimisation).
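For concreteness, here is a minimal sketch of that standard split using scikit-learn's `train_test_split`; the arrays `X` and `y` are toy placeholders for real features and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and labels standing in for real data (placeholders).
rng = np.random.default_rng(0)
X = rng.random((1000, 5))
y = rng.integers(0, 3, size=1000)

# Carve off a held-out test set first, then split the remainder into
# train and validation sets for hyperparameter optimisation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)
```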
If it turns out that the distribution in your train set isn't the same as in your test set (perhaps one group is completely missing from the test set, or another group is over-represented in it, as illustrated below), what do you do?
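To illustrate the failure mode: a purely random split can leave a rare group under- or even un-represented in the test set, and scikit-learn's `stratify` argument is one way to guard against this. A sketch with made-up class frequencies:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: class 2 is rare (about 5% of samples).
rng = np.random.default_rng(0)
y = rng.choice([0, 1, 2], size=400, p=[0.55, 0.40, 0.05])
X = rng.random((400, 4))

# Depending on the seed, a plain random split may leave class 2
# missing or under-represented in the test set.
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.1, random_state=1)
print(np.bincount(y_test_plain, minlength=3))

# Stratifying on y preserves the class proportions in both sets.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.1, random_state=1, stratify=y
)
print(np.bincount(y_test_strat, minlength=3))
```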
Is it a problem, after learning that the distributions in the two sets are different, to re-compute your train and test sets? Practically this must be OK, and you won't always know upfront whether the distributions in the sets are representative. However, it is also a form of data leakage, since you are applying information you have learnt from the data to the creation of these sets, information which wasn't necessarily to hand before you started your task.
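(For context, "learning that the distributions are different" might look something like the following sketch, which uses a two-sample Kolmogorov-Smirnov test on a single feature column; the feature arrays here are deliberately shifted synthetic stand-ins, not real data:)

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical train/test values of one feature (synthetic stand-ins).
rng = np.random.default_rng(0)
feature_train = rng.normal(0.0, 1.0, size=800)
feature_test = rng.normal(0.5, 1.0, size=200)  # deliberately shifted

# One way to flag a distribution mismatch between the two sets.
stat, p_value = ks_2samp(feature_train, feature_test)
print(f"KS statistic={stat:.3f}, p-value={p_value:.3g}")
```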
How does one deal with such a scenario?
Topic: test, validation, training, dataset
Category: Data Science