Sampling a Highly Imbalanced Large Dataset

I am working on a model that will run monthly on 8M users. My training data consists of monthly snapshots, e.g.:

  1. Jan 2021 snapshot: 8M total, 233 positives, rest negative
  2. Feb 2021 snapshot: 8M total, 599 positives, rest negative
  3. March 2021 snapshot: 8M total, 600 positives, rest negative
  4. April 2021 snapshot: 8M total, 750 positives, rest negative

and so on, through March 2022.

I'm keeping the March 2022 snapshot as the test set; it has 2,000 positives out of 8M rows, and the rest are negative.

I can't include all 8M rows from each snapshot in the training set, so how should I sample the data? Currently I take every positive row and only 300K negative rows from each snapshot. This gives a training set of 420K negative samples and 8K positive samples.
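To make the current approach concrete, here is a minimal sketch of it in pandas, run on toy data. The column names (`snapshot`, `label`, `feature`), the helper `downsample_negatives`, and the toy sizes are all hypothetical and only illustrate the idea of keeping every positive and a fixed number of random negatives per snapshot; it is not the actual pipeline.

```python
import numpy as np
import pandas as pd


def downsample_negatives(df, n_neg_per_snapshot, snapshot_col="snapshot",
                         label_col="label", seed=0):
    """Keep every positive row and at most n_neg_per_snapshot random
    negative rows from each snapshot. (Hypothetical helper.)"""
    positives = df[df[label_col] == 1]
    negatives = df[df[label_col] == 0]
    # Shuffle the negatives once, then keep the first n rows of each snapshot.
    # This is a uniform random sample per snapshot and also works when a
    # snapshot has fewer than n negatives.
    sampled_negatives = (
        negatives.sample(frac=1, random_state=seed)
        .groupby(snapshot_col)
        .head(n_neg_per_snapshot)
    )
    return pd.concat([positives, sampled_negatives]).sort_index()


# Toy stand-in for the real data: 2 snapshots of 1,000 rows each,
# with 3 and 5 positives respectively (made-up numbers).
rng = np.random.default_rng(0)
frames = []
for snap, n_pos in [("2021-01", 3), ("2021-02", 5)]:
    labels = np.zeros(1000, dtype=int)
    labels[:n_pos] = 1
    frames.append(pd.DataFrame({
        "snapshot": snap,
        "label": labels,
        "feature": rng.normal(size=1000),
    }))
toy = pd.concat(frames, ignore_index=True)

train = downsample_negatives(toy, n_neg_per_snapshot=100)
print(train.groupby(["snapshot", "label"]).size())
```

On the real data the same call would use `n_neg_per_snapshot=300_000`, applied only to the training snapshots so that the March 2022 test snapshot keeps its original class distribution.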

Am I sampling correctly? Which cross-validation technique is appropriate, and which metric should I use to select the best model?

