Sampling Highly Imbalanced Large Dataset

Question

Harshit Gupta

May 18, 2022, 20:50

I am working on a model that will run monthly on 8M users. My training data is organized as monthly snapshots, e.g.:

Jan 2021 snapshot: 8M total, 233 positives, rest negative
Feb 2021 snapshot: 8M total, 599 positives, rest negative
Mar 2021 snapshot: 8M total, 600 positives, rest negative
Apr 2021 snapshot: 8M total, 750 positives, rest negative

and so on, for each following month up to March 2022.
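To make the degree of imbalance concrete, here is a small sketch that computes the positive rate per snapshot from the counts listed above. The dictionary and function names are just illustrative, and 8M is taken as the exact row count per snapshot, which is an approximation.

```python
# Counts copied from the snapshots listed in the question.
# Assumption: every snapshot has exactly 8,000,000 rows.
TOTAL_PER_SNAPSHOT = 8_000_000
positives = {"Jan 2021": 233, "Feb 2021": 599, "Mar 2021": 600, "Apr 2021": 750}


def positive_rate(n_pos, n_total=TOTAL_PER_SNAPSHOT):
    """Fraction of rows in a snapshot that carry the positive label."""
    return n_pos / n_total


rates = {month: positive_rate(n) for month, n in positives.items()}

for month, rate in rates.items():
    # Show the rate as a percentage and as "1 in N" for readability.
    print(f"{month}: {rate:.6%} positive, about 1 in {round(1 / rate):,}")
```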

I'm keeping March 2022 as the test set; it has 2,000 positives out of 8M rows, and the rest are negative.

I can't use all 8M rows from each snapshot in the training set, so how should I sample my data? Currently, I take all positive-class rows and only 300K negative-class rows from each snapshot. This gives a training set of 420K negative-class samples and 8K positive-class samples.
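Roughly, the per-snapshot sampling I describe looks like the sketch below: keep every positive row and draw a fixed number of negatives uniformly at random. The function name `sample_snapshot` and the toy data are hypothetical. The returned weights are an optional addition that is not part of my current setup; they only record how many original negatives each kept negative stands for.

```python
import random


def sample_snapshot(labels, n_negatives, seed=0):
    """Keep all positives and a uniform random subset of negatives.

    labels: sequence of 0/1 labels for one snapshot.
    Returns (indices, weights). Each kept negative gets weight
    (total negatives / kept negatives), so the weighted row count
    matches the full snapshot. The weights are an assumption added
    for illustration, not something the original setup uses.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]

    # Never ask for more negatives than the snapshot contains.
    k = min(n_negatives, len(neg_idx))
    kept_neg = rng.sample(neg_idx, k)

    neg_weight = len(neg_idx) / k if k else 0.0
    indices = pos_idx + kept_neg
    weights = [1.0] * len(pos_idx) + [neg_weight] * k
    return indices, weights


# Toy snapshot (made-up sizes): 5 positives among 10,000 rows.
toy_labels = [1] * 5 + [0] * 9_995
idx, w = sample_snapshot(toy_labels, n_negatives=500, seed=42)
print(len(idx), "rows kept;", sum(toy_labels[i] for i in idx), "positives")
```

In practice this would be applied to each monthly snapshot separately and the results concatenated into one training set.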

Am I sampling correctly? Which cross-validation technique is appropriate here, and what metric should I use to select the best model?

Topic binary-classification class-imbalance

Category Data Science
