Sampling a Highly Imbalanced Large Dataset
I am working on a model that will run monthly on 8M users. My training data consists of monthly snapshots, e.g.:
- Jan 2021 snapshot: 8M total, 233 positives, rest negative
- Feb 2021 snapshot: 8M total, 599 positives, rest negative
- Mar 2021 snapshot: 8M total, 600 positives, rest negative
- Apr 2021 snapshot: 8M total, 750 positives, rest negative
and so on through March 2022.
I'm keeping the March 2022 snapshot as the test set; it has 2,000 positives out of 8M rows, and the rest are negative.
I can't use all 8M rows from every snapshot for training, so how should I sample the data? Currently I keep every positive row and only 300K negative rows from each snapshot. This gives a training set of 420K negative samples and 8K positive samples.
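To make the current approach concrete, here is a rough pandas sketch of "keep all positives, sample a fixed number of negatives per snapshot". The column names `snapshot` and `label`, the helper name `downsample_negatives`, and the toy data are assumptions made only for illustration; my real table is much larger and has different columns.

```python
import numpy as np
import pandas as pd


def downsample_negatives(df, n_neg_per_snapshot,
                         snapshot_col="snapshot", label_col="label", seed=0):
    """Keep every positive row and at most n_neg_per_snapshot negatives per snapshot.

    Column names are hypothetical placeholders, not the real schema.
    """
    positives = df[df[label_col] == 1]
    negatives = df[df[label_col] == 0]

    sampled = [positives]
    # Sample negatives independently inside each snapshot so every month
    # contributes the same number of negative rows.
    for i, (_, group) in enumerate(negatives.groupby(snapshot_col, sort=True)):
        n = min(n_neg_per_snapshot, len(group))
        sampled.append(group.sample(n=n, random_state=seed + i))

    # Restore the original row order for readability.
    return pd.concat(sampled).sort_index()


# Tiny synthetic stand-in for the monthly snapshots (about 1% positives).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "snapshot": np.repeat(["2021-01", "2021-02", "2021-03"], 1000),
    "label": (rng.random(3000) < 0.01).astype(int),
})

train = downsample_negatives(toy, n_neg_per_snapshot=100)
print(train.groupby(["snapshot", "label"]).size())
```

Note that this changes the positive rate in the training set relative to the raw snapshots, which is part of why I am unsure whether the sampling is right.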
Am I sampling correctly? Which cross-validation technique is appropriate, and what metric should I use to select the best model?
Topic binary-classification class-imbalance
Category Data Science