Getting a balanced sample across many variables

Question


user

April 24, 2022, 11:04

Let’s say each element in my population has several attributes. Let’s call them A, B, C, D, E, F.

Let’s say, for simplicity, that each attribute takes 10 values (though it could be any number between 2 and 30). I want to draw a sample in which every feature has the same distribution as in the whole population. For example, if about 15% of the population has value 1 for feature A, about 15% of my sample should too.

How should I choose the sample size, and how do I select a sample with these properties?

Topic multivariate-distribution distribution sampling statistics

Category Data Science

etiennedm · Accepted Answer · August 4, 2020, 16:24

If I understand correctly, for each feature you want the value below which a given percentage of the observations in your population falls. If so, you could try something like this using np.percentile:

import numpy as np
import pandas as pd

# Fake data: 20 rows, 6 columns of random integers from 1 to 9 (the upper bound 10 is exclusive)
data = np.random.randint(1, 10, size=(20, 6))
df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D', 'E', 'F'])

# 15 is the desired percentage: for each column, compute the value
# below which 15% of the observations fall
balanced_sample = [np.percentile(df[col], 15) for col in df.columns]
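Note that the percentile snippet above returns one threshold value per column, not a set of rows. If the goal is a subset of rows whose per-feature proportions track the population's, one possible sketch (not part of the original answer) is to draw a simple random sample with pandas and measure how far its marginal distributions drift from the population's. The population below is made up, and `max_marginal_gap` is a hypothetical helper name introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical population: 10,000 rows, six attributes with integer values 1..10
rng = np.random.default_rng(0)
cols = ["A", "B", "C", "D", "E", "F"]
population = pd.DataFrame(rng.integers(1, 11, size=(10_000, len(cols))), columns=cols)


def max_marginal_gap(population, sample):
    """Largest absolute difference between sample and population
    value proportions, taken over every column and every value."""
    gaps = []
    for col in population.columns:
        pop_share = population[col].value_counts(normalize=True)
        samp_share = sample[col].value_counts(normalize=True)
        # A value missing from the sample counts as a share of 0
        samp_share = samp_share.reindex(pop_share.index, fill_value=0.0)
        gaps.append((pop_share - samp_share).abs().max())
    return max(gaps)


# Simple random sample: marginals match the population only approximately
sample = population.sample(n=1_000, random_state=0)
gap = max_marginal_gap(population, sample)
print(f"largest marginal gap (simple random sample): {gap:.3f}")

# Stratifying on one attribute matches that attribute almost exactly,
# but only that attribute; the others are still left to chance
stratified = population.groupby("A", group_keys=False).sample(frac=0.1, random_state=0)
gap_a = max_marginal_gap(population[["A"]], stratified[["A"]])
print(f"gap on A after stratifying on A: {gap_a:.4f}")
```

For a simple random sample, the standard error of an estimated proportion p is roughly sqrt(p(1-p)/n), so the sample size n can be chosen from the tolerance you are willing to accept on each proportion. Stratifying on all six attributes at once is usually impractical, because the number of value combinations grows multiplicatively with each attribute.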
