Sampling items from a population of subpopulations

Question


dzieciou

January 16, 2021, 14:50

I have a population of $n$ items to label and a budget to label only $m$ of them ($m \ll n$) before training. The population can be partitioned into subpopulations recursively. In other words, the whole population can be represented as a tree of subpopulations: $x_1$ splits into subpopulations $x_2$ and $x_7$, $x_2$ splits into $x_3$ and $x_4$, and so on. Some subpopulations are more diverse than others and have more subpopulations of their own.

What algorithm should I use to sample $m$ items so that they cover the most diverse set of subpopulations? Is stratified sampling the way to go?

My scenario is to label 500 products out of 27K products organized into a taxonomy of product categories. There are 17 root categories that split into 118 subcategories, which in turn split into 597 sub-subcategories, and so on. For example, the root categories include Produce and Bakery, and Produce contains the Vegetables and Fruits subcategories. Note that not only the number of products but also the number of leaf categories in the tree is much larger than my budget $m$, i.e., the number of products I can select for labeling.
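To make the setup concrete, here is a minimal sketch of one possible reading of "stratified sampling on a tree": split the budget roughly evenly among the children of each node, never giving a child more than it contains, and recurse until the leaves, where items are drawn uniformly at random. The nested-dict representation of the taxonomy and the helper names `count_items`, `split_budget`, and `sample_tree` are assumptions made for illustration only, and the toy products are made up. This is not claimed to be the best algorithm for the problem.

```python
import random


def count_items(node):
    """Number of items in a subtree (a leaf is a list, an inner node a dict)."""
    if isinstance(node, list):
        return len(node)
    return sum(count_items(child) for child in node.values())


def split_budget(capacities, budget, rng):
    """Split `budget` roughly evenly, never exceeding any capacity."""
    alloc = [0] * len(capacities)
    open_idx = [i for i, cap in enumerate(capacities) if cap > 0]
    while budget > 0 and open_idx:
        rng.shuffle(open_idx)  # random tie-breaking when budget is scarce
        share = max(1, budget // len(open_idx))
        for i in open_idx:
            if budget == 0:
                break
            give = min(share, capacities[i] - alloc[i], budget)
            alloc[i] += give
            budget -= give
        # Children that are already exhausted drop out of the next round.
        open_idx = [i for i in open_idx if alloc[i] < capacities[i]]
    return alloc


def sample_tree(node, budget, rng):
    """Draw up to `budget` items, spreading them across subpopulations."""
    if isinstance(node, list):  # leaf: uniform sample without replacement
        return rng.sample(node, min(budget, len(node)))
    children = list(node.values())
    alloc = split_budget([count_items(c) for c in children], budget, rng)
    picked = []
    for child, child_budget in zip(children, alloc):
        if child_budget > 0:
            picked.extend(sample_tree(child, child_budget, rng))
    return picked


# Toy taxonomy (hypothetical data, far smaller than the real 27K products).
taxonomy = {
    "Produce": {
        "Vegetables": ["carrot", "leek", "kale"],
        "Fruits": ["apple", "pear"],
    },
    "Bakery": {"Bread": ["baguette", "rye loaf"]},
}
rng = random.Random(0)
sample = sample_tree(taxonomy, 4, rng)
print(sample)
```

An even split at every level favors coverage of categories over proportionality to category size, so small categories are over-represented relative to their share of products. When the budget at a node is smaller than its number of children, as with 500 items and 597 sub-subcategories, some children necessarily receive nothing, and this sketch chooses them at random.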

Topic labelling sampling

Category Data Science
