Stratified sampling - use of proxy variable
I use stratified sampling to split my data into train/validation/test sets. Is it appropriate to define the strata using information extracted from the dataset itself? For example, could I use a machine-learning model to produce a proxy variable and then define the strata from it?
My worry is potential data leakage, though I wasn't able to find any counter-argument.
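To make the setup concrete, here is a minimal sketch of the kind of split described above. Everything specific in it is an assumption for illustration: the synthetic data, the choice of scikit-learn, `KMeans` as the model that produces the proxy variable, three strata, and the 60/20/20 split. Note that the proxy is fitted on all rows, including those that end up in the test set, which is the step the leakage worry is about.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Synthetic data, purely illustrative (hypothetical features and target).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] + rng.normal(scale=0.1, size=300)

# Proxy variable: cluster labels learned from the features only.
# The model is fitted on the full dataset, before any splitting.
proxy = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# First split off the test set, stratifying on the proxy.
X_trainval, X_test, y_trainval, y_test, proxy_trainval, proxy_test = train_test_split(
    X, y, proxy, test_size=0.2, stratify=proxy, random_state=0
)

# Then split the remainder into train and validation, again stratified.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=proxy_trainval, random_state=0
)

# Share of each stratum in the full data and in the test set.
full_share = np.bincount(proxy, minlength=3) / len(proxy)
test_share = np.bincount(proxy_test, minlength=3) / len(proxy_test)
print(full_share, test_share)
```

In this sketch the proxy depends only on the features `X`, not on the target `y`. A proxy that was modelled using the target would be a different case.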
Topic bootstrapping sampling dataset
Category Data Science