class imbalance - applied SMOTE - next steps
I am new to ML and have learnt a lot from your valuable posts. I need your advice on the following situation, and guidance on whether these steps make sense. I have a binary classification problem; my dataset has a severe imbalance of approximately 2% positive cases (4,000 positives out of 200,000 total). I separated my dataset into train and test sets (80/20 stratified split). My train set now has 160,000 cases (3,200 positive) and my test set has 40,000 cases (800 positive).
Next, from the train set I created a 50/50 SMOTE sample with ~9,000 positive cases (the original 3,200 plus ~5,800 SMOTE-generated synthetic positives) and ~9,000 negative cases, for a total of ~18,000 cases in the "new train".
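For context on what that oversampling step does: SMOTE creates each synthetic positive by interpolating between a real minority-class sample and one of its minority-class nearest neighbours. A minimal, language-agnostic sketch of that interpolation (toy data, not the actual caret/DMwR implementation):

```python
import random

def smote_point(x, neighbor):
    """Create one synthetic point on the segment between a minority
    sample and one of its minority-class nearest neighbours."""
    lam = random.random()  # interpolation weight in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

random.seed(0)
x = [1.0, 2.0]          # a real positive case (hypothetical features)
neighbor = [3.0, 4.0]   # one of its positive-class nearest neighbours
synthetic = smote_point(x, neighbor)
# every coordinate of the synthetic point lies between the two parents
print(synthetic)
```

Because the synthetic positives are convex combinations of real ones, the "new train" positives are not 9,000 independent observations, which is one reason the in-sample ROC of 0.9 is optimistic.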
Next, I developed bagged/boosted tree-based classifiers on the "new train" with ROC as the training criterion in R caret (I get a high ROC of around 0.9, which is usual). I then applied the model(s) to the test set, which has the original class imbalance, and obtained predicted probabilities (the AUC is now 0.65). I identified the optimal classification threshold with the pROC package to create the final class predictions on the test dataset.
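For reference, pROC's `coords(roc, "best")` picks the threshold by Youden's J (sensitivity + specificity − 1) by default, if I recall its documentation correctly. The idea can be sketched language-agnostically as:

```python
def best_threshold(scores, labels):
    """Pick the score cutoff maximising Youden's J = TPR - FPR,
    the usual 'best' ROC-threshold criterion."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / pos - fp / neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Toy predicted probabilities: positives tend to score higher
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   0]
print(best_threshold(scores, labels))
```

Note that Youden's J weights sensitivity and specificity equally; with a 2% positive rate you may prefer a cost-sensitive threshold or a precision-recall based one instead.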
I need your advice on whether I am transitioning correctly from the model trained on the 50/50 SMOTE data to the test set with the original imbalance of roughly 50:1. Is there anything fundamentally wrong in what I am doing? Any suggestions on how to improve the process would be very helpful. Are any corrections needed to the SMOTE-based probabilities on the test set, or is thresholding the best I can do?
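On the probability-correction part of my question, I found that one standard adjustment for a prior shift (training at 50% positives, testing at ~2%) rescales each predicted probability by the ratio of test to train class priors and renormalises (the Saerens/Elkan prior-correction). A sketch, with the 0.5 and 0.02 priors taken from my setup, the variable names being mine:

```python
def correct_prior(p, train_prior, test_prior):
    """Adjust a posterior probability from the class prior the model
    was trained under to the prior expected at test time."""
    num = p * test_prior / train_prior
    den = num + (1 - p) * (1 - test_prior) / (1 - train_prior)
    return num / den

# Model trained on balanced SMOTE data, deployed at 2% positives:
p_raw = 0.6  # probability from the SMOTE-trained model (illustrative)
p_adj = correct_prior(p_raw, train_prior=0.5, test_prior=0.02)
print(round(p_adj, 4))
```

Is this kind of recalibration appropriate here, given that SMOTE also interpolates synthetic points rather than purely resampling, or does thresholding on the test set already absorb the prior shift?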
Topic smote class-imbalance r
Category Data Science