Sampling data based on the average and variance of another dataset

Question


Minions

October 30, 2021, 10:44

I have a set of textual datasets with the following average and variance of token lengths:

Dataset1
avg = 28.18, var = 393.03
Dataset2
avg = 32.70, var = 644.79
Dataset3
avg = 36.94, var = 805.50
Dataset4
avg = 28.56, var = 436.86

Dataset5
avg = 53.13, var = 612.18

How can I sample a smaller subset of instances from Dataset5 whose avg and var are similar (or equal, if possible) to those of any of the datasets above?

I am using Pandas dataframes, where each dataset has 2 columns [text, tokens_length].

Topic mean variance sampling pandas dataset

Category Data Science

Erwan · Accepted Answer · October 30, 2021, 10:44

I would try to use a genetic algorithm. A simple representation of the problem in terms of a genetic algorithm would go like this:

A "gene" represents an instance: it is either selected or not (boolean).
An "individual" is a set of selected genes/instances, represented as a binary vector with one entry per instance.

An "individual" represents a candidate solution, and it can be evaluated by simply calculating the mean and std. dev. of the subset: at every iteration, a candidate solution closer to the target mean and std. dev. is more likely to be selected.
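To make the evaluation concrete, here is a minimal sketch of how one candidate could be scored. The function name `fitness`, the relative-error formula, and the choice of population variance are my own assumptions, not something the method prescribes; any measure that grows as the subset drifts away from the target would do.

```python
from statistics import fmean, pvariance


def fitness(mask, lengths, target_mean, target_var):
    """Score a candidate subset: 0 is a perfect match, more negative is worse.

    mask    -- one boolean per instance (True = instance is selected)
    lengths -- token length of every instance, same order as mask
    """
    chosen = [x for x, keep in zip(lengths, mask) if keep]
    if len(chosen) < 2:
        return float("-inf")  # too few instances to compare statistics

    # Assumption: relative errors, so mean and std. dev. weigh about equally.
    target_std = target_var ** 0.5
    mean_err = abs(fmean(chosen) - target_mean) / target_mean
    # Assumption: population variance (divide by n), converted to std. dev.
    std_err = abs(pvariance(chosen) ** 0.5 - target_std) / target_std
    return -(mean_err + std_err)


# Toy example with made-up lengths: compare two candidate subsets
# against a hypothetical target of mean 20 and variance 200/3.
lengths = [10, 20, 30, 40, 50, 60]
print(fitness([True, True, True, False, False, False], lengths, 20.0, 200 / 3))
print(fitness([False, False, False, True, True, True], lengths, 20.0, 200 / 3))
```

With the dataframes from the question, `lengths` would presumably come from something like `df["tokens_length"].tolist()`. Note that pandas' `.var()` uses the sample variance (`ddof=1`) by default, so the target variance should be computed the same way as the one inside the scoring function.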

The standard genetic algorithm works like this:

1. Randomly generate a set of, say, 100 individuals (the first generation).
2. Calculate the "performance" of every individual (how close the mean and std. dev. of its subset are to the target).
3. Select, say, the top 10 individuals according to their performance, then produce the next generation of 100 individuals by cross-over among these top 10. A cross-over means picking two individuals A and B and producing a new individual in which the value of every gene/instance is taken from the same gene in either A or B.
4. Optionally add some random mutations to the new individuals' genes.
5. Iterate again from step 2. Keep iterating until some stop condition is satisfied, for example when the average performance over the last 5 generations no longer increases.
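The loop above can be sketched in plain Python roughly as follows. This is an illustration under assumptions rather than a tuned implementation: the names `error` and `evolve`, the 50% initial selection probability, the mutation rate, and the relative-error measure are all my own choices. The stop condition is also simplified: it stops when the best error has not improved for `patience` generations, instead of tracking the average performance.

```python
import random
from statistics import fmean, pstdev


def error(mask, lengths, target_mean, target_std):
    """Distance of a subset from the target (lower is better)."""
    chosen = [x for x, keep in zip(lengths, mask) if keep]
    if len(chosen) < 2:
        return float("inf")
    return (abs(fmean(chosen) - target_mean) / target_mean
            + abs(pstdev(chosen) - target_std) / target_std)


def evolve(lengths, target_mean, target_var, pop_size=100, n_parents=10,
           mutation_rate=0.01, max_generations=200, patience=5, seed=0):
    rng = random.Random(seed)
    target_std = target_var ** 0.5
    n = len(lengths)

    # Step 1: random first generation (each instance selected with p = 0.5).
    population = [[rng.random() < 0.5 for _ in range(n)]
                  for _ in range(pop_size)]
    best_mask, best_err, stale = None, float("inf"), 0

    for _ in range(max_generations):
        # Step 2: evaluate every individual and rank them, best first.
        scored = sorted(
            ((error(m, lengths, target_mean, target_std), m) for m in population),
            key=lambda pair: pair[0],
        )
        # Simplified stop condition: no improvement for `patience` generations.
        if best_mask is None or scored[0][0] < best_err:
            best_err, best_mask = scored[0]
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break

        # Step 3: keep the top individuals as parents, then cross them over.
        parents = [m for _, m in scored[:n_parents]]
        population = []
        for _ in range(pop_size):
            a, b = rng.sample(parents, 2)
            # Uniform cross-over: each gene comes from A or B at random.
            child = [rng.choice((ga, gb)) for ga, gb in zip(a, b)]
            # Step 4: flip each gene with a small probability (mutation).
            child = [(not g) if rng.random() < mutation_rate else g
                     for g in child]
            population.append(child)

    return best_mask, best_err
```

Assuming the Dataset5 dataframe is called `df5` (a hypothetical name), a call could look like `mask, err = evolve(df5["tokens_length"].tolist(), 28.18, 393.03)`, after which `df5.loc[mask]` should give the selected rows. Nothing here controls the size of the subset; if a specific size is wanted, a penalty term could be added to `error`.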

There are probably some good genetic algorithm libraries around, but I've never used any myself (the basic method is fairly simple to implement).
