Stratified sampling - use of proxy variable

To split the data into train/validation/test sets, I use stratified sampling. Is it appropriate to define the strata using information extracted from the dataset itself? For example, could I use a machine-learning model to produce a proxy variable that defines the strata?

My worry is potential data leakage.

I wasn't able to find any counter-argument, though.
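
To make the setup concrete, here is a minimal sketch of the kind of split described above. Everything specific in it is an assumption for illustration and not part of the original question: scikit-learn as the library, synthetic regression data, KMeans cluster labels as the proxy variable, and a 70/15/15 split. Note that the proxy is fitted on the features of the full dataset before splitting, which is exactly the step the leakage concern is about; the sketch only shows the mechanics and does not settle whether that is acceptable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption, for illustration only).
X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=0)

# Proxy variable (assumption): cluster labels learned from the features only.
# The target y is not used here, but the clustering does see every row,
# including rows that will later land in the validation and test sets.
n_strata = 4
strata = KMeans(n_clusters=n_strata, n_init=10, random_state=0).fit_predict(X)

# Split row indices so the same partition can be applied to X, y and strata.
idx = np.arange(len(X))

# First split: 70% train, 30% held out, stratified on the proxy variable.
train_idx, held_idx = train_test_split(
    idx, test_size=0.3, stratify=strata, random_state=0
)

# Second split: held-out rows into validation and test halves,
# stratified on the proxy labels of those rows.
val_idx, test_idx = train_test_split(
    held_idx, test_size=0.5, stratify=strata[held_idx], random_state=0
)


def stratum_shares(labels):
    """Return the fraction of rows that falls into each stratum."""
    return np.bincount(labels, minlength=n_strata) / len(labels)


overall = stratum_shares(strata)
for name, part in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    gap = np.abs(stratum_shares(strata[part]) - overall).max()
    print(f"{name}: {len(part)} rows, max share gap vs. full data = {gap:.4f}")
```

One variation that avoids fitting on held-out rows would be to split first and fit the proxy model on the training rows only, but then the proxy could no longer be used to stratify that first split, so the two goals pull against each other.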
