Sampling to keep as much multivariate variance as possible

Has anyone considered a sampling technique that aims to keep as much of the variance as possible (e.g. as many unique values as possible, or continuous variables spread as widely as possible)?

The benefit would be that you can develop code against the sample and still exercise the edge cases in the data.

You can always take a representative sample later.

So I am wondering whether people have tried sampling for maximum variance before, and whether there is a clever way to draw a sample with the highest possible variance (an approximation is just fine, of course).

Topic multivariate-distribution variance sampling

Category Data Science


It depends on what you mean by sampling. Is it sampling between or within features?
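
If "within features" is taken to mean choosing which rows to keep, one possible approximation is greedy farthest-point (max-min) selection: start from a random row, then repeatedly add the row that is farthest from everything chosen so far. This is only a sketch under that assumption. The function name `farthest_point_sample` and its parameters are made up for illustration, it uses plain Euclidean distance on numeric rows, and it does not maximize variance exactly; it just tends to pull in extreme and isolated rows.

```python
import math
import random


def farthest_point_sample(rows, k, seed=0):
    """Greedily pick up to k row indices that are spread far apart.

    rows: list of equal-length numeric tuples (hypothetical input format).
    Returns fewer than k indices if the remaining rows are all exact
    duplicates of rows already chosen.
    """
    if not 0 < k <= len(rows):
        raise ValueError("k must be between 1 and len(rows)")
    rng = random.Random(seed)
    first = rng.randrange(len(rows))
    chosen = [first]
    # min_dist[i] is the distance from row i to its nearest chosen row.
    min_dist = [math.dist(r, rows[first]) for r in rows]
    while len(chosen) < k:
        nxt = max(range(len(rows)), key=min_dist.__getitem__)
        if min_dist[nxt] == 0:
            break  # everything left duplicates a chosen row
        chosen.append(nxt)
        for i, r in enumerate(rows):
            d = math.dist(r, rows[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return chosen


# Three rows clustered near the origin plus three far-away corners.
rows = [(0, 0), (0.1, 0), (0, 0.1), (10, 0), (0, 10), (10, 10)]
picked = farthest_point_sample(rows, k=4)
```

With features on very different scales, the columns would probably need standardizing first, otherwise the widest column dominates the distance. The loop costs O(n·k) distance evaluations, so it is meant for development-sized samples.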

For sampling between features, scikit-learn has a built-in VarianceThreshold feature selector, which removes features whose variance does not exceed a given threshold.
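
A minimal sketch of how VarianceThreshold could be used, assuming scikit-learn and NumPy are installed. The toy array and the threshold of 0.01 are made-up values for illustration. Note that it drops columns (features), not rows, so it does not by itself produce a smaller sample of observations.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: column 0 is constant, column 1 barely varies,
# column 2 varies widely.
X = np.array([
    [1.0, 0.10, 10.0],
    [1.0, 0.11, 20.0],
    [1.0, 0.10, 30.0],
    [1.0, 0.11, 40.0],
])

# Keep only features whose variance is above the threshold.
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

# Indices of the columns that survived the filter.
kept = selector.get_support(indices=True)
```

Because variance depends on the scale of each feature, a single threshold is only meaningful if the features are on comparable scales.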
