Getting a balanced sample across many variables

Let’s say each element in my population has several attributes. Let’s call them A, B, C, D, E, F.

Let’s say, for simplicity, each attribute takes 10 values (though it could be any number between 2 and 30). I want to draw a sample in which the distribution of every attribute matches that of the population. For example, if about 15% of the population has value 1 for attribute A, about 15% of my sample should too.

How should I choose the sample size and select a sample that has these properties?



If I understand correctly, you want, for each feature, the value below which a given percentage of the observations in your population falls. If so, you might want to try something like this using np.percentile, which returns one such value per column:

import numpy as np
import pandas as pd

data = [np.random.randint(1, 11, 6) for _ in range(20)]  # fake data: 20 rows, integer values 1 to 10
df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D', 'E', 'F'])  # fake dataframe

# 15 being your desired percentage; the result holds one percentile value per column
balanced_sample = [np.percentile(df[col], 15) for col in df.columns]
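
If what you need is instead a subset of rows whose per-attribute value frequencies match the population, a different approach may fit better. The following is only a minimal sketch, not part of the percentile method above: it draws several simple random samples with pandas and keeps the one whose marginal proportions are closest to the population's. The helper names marginal_gap and best_random_sample, the population size of 5000, the sample size of 500 and the number of tries are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def marginal_gap(population, sample):
    """Largest absolute difference in value proportions over all columns."""
    gaps = []
    for col in population.columns:
        pop_share = population[col].value_counts(normalize=True)
        smp_share = sample[col].value_counts(normalize=True)
        # Align on the population's values; values absent from the sample count as 0.
        smp_share = smp_share.reindex(pop_share.index, fill_value=0.0)
        gaps.append((pop_share - smp_share).abs().max())
    return max(gaps)


def best_random_sample(population, n, tries=50, seed=0):
    """Draw `tries` simple random samples of size n and keep the closest one."""
    rng = np.random.default_rng(seed)
    best, best_gap = None, float("inf")
    for _ in range(tries):
        state = int(rng.integers(0, 2**32 - 1))
        candidate = population.sample(n=n, random_state=state)
        gap = marginal_gap(population, candidate)
        if gap < best_gap:
            best, best_gap = candidate, gap
    return best, best_gap


# Fake population: 5000 rows, six attributes with integer values 1 to 10.
rng = np.random.default_rng(42)
population = pd.DataFrame(rng.integers(1, 11, size=(5000, 6)),
                          columns=list("ABCDEF"))

sample, gap = best_random_sample(population, n=500)
print(f"largest marginal deviation: {gap:.3f}")
```

Trying a few values of n and watching how the reported deviation shrinks is one rough way to pick a sample size: stop increasing n once the deviation is below whatever tolerance you can accept.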
