Why is SMOTE not used in prize-winning Kaggle solutions?

Synthetic Minority Over-sampling Technique (SMOTE) is a well-known method for tackling imbalanced datasets. There are many highly cited papers out there claiming that it boosts accuracy in imbalanced-data scenarios.
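
For context, here is a minimal sketch of how SMOTE is typically applied, using the imbalanced-learn library; the toy dataset and parameter values are illustrative assumptions, not anything from a specific competition:

```python
# Minimal sketch of a typical SMOTE workflow with imbalanced-learn.
# The synthetic dataset and parameters are illustrative only.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Build a toy imbalanced binary classification problem (~9:1 ratio).
X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.9, 0.1], random_state=42
)
print(Counter(y))  # roughly 9000 negatives vs. 1000 positives

# SMOTE interpolates between a minority sample and its k nearest
# minority-class neighbours to create new synthetic minority samples.
smote = SMOTE(k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print(Counter(y_resampled))  # classes are now balanced 1:1
```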

But when I look at Kaggle competitions, it is rarely used; to the best of my knowledge, there are no prize-winning Kaggle/ML competitions where it was used to achieve the best solution. Why is SMOTE not used on Kaggle?

I even see applied research papers, where millions of dollars are at stake, in which SMOTE is not used: Practical Lessons from Predicting Clicks on Ads at Facebook.

Is this because it is not the best possible strategy? Is it a research niche with no optimal real-life application? Is there any high-reward ML competition where it was used to achieve the best solution?

I guess I am just hesitant to believe that creating synthetic data actually helps.

Topic: smote kaggle class-imbalance machine-learning

Category: Data Science


After some debate on social networks and much asking around (see the Twitter thread), the best answer I can find is that it does not work.

I would love to see a real-life example where it actually works and retract this answer (see the tweet by JFPuget).

Here is a recap from other sources and social media:


I think this is a highly interesting topic, one that has been around for a long time with, as we can see, no clear conclusion. From my applied experience, I would use under-/over-sampling techniques when:

[EDIT]

  • we know in advance that our dataset has unrealistic ratios between positive and negative target labels; in that case, under-/over-sampling lets us rebalance toward a more realistic ratio (see the sketch after the concluding paragraph below)

  • our dataset contains incorrect samples, so we need to filter out (e.g., by undersampling) wrong data points (systematic errors introduced when retrieving the raw data) that we are unlikely to encounter in a real inference scenario

  • the positives are known to occur (as in fraud detection) but we have not yet had time to collect enough positive samples, so oversampling could be interesting

These cases aim to correct an unreliable training dataset, making it more similar to the real scenario. I think this is the reason not to apply under-/over-sampling techniques in real cases (or simply when the dataset is correctly built). My understanding is that on Kaggle the datasets are already well designed and aimed directly at modeling. So, when the imbalanced dataset represents the real problem distribution, it is the algorithm's responsibility to capture the pattern of the data as is.
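
As a sketch of the first bullet above, rebalancing toward a known real-world ratio (rather than a perfect 1:1 split) might look like this with imbalanced-learn; the initial 1:99 ratio and the 1:4 target ratio are assumptions made up for the example:

```python
# Sketch: correct an unrealistic class ratio in the training set by
# undersampling the majority class toward an assumed real-world ratio.
# The 1:4 target ratio below is purely illustrative.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Toy dataset with a 1:99 ratio that (we assume) overstates the imbalance.
X, y = make_classification(
    n_samples=50_000, n_features=20, weights=[0.99, 0.01], random_state=0
)
print(Counter(y))

# sampling_strategy=0.25 requests minority/majority = 1/4 after resampling,
# i.e. the ratio we believe matches the real inference scenario.
rus = RandomUnderSampler(sampling_strategy=0.25, random_state=0)
X_bal, y_bal = rus.fit_resample(X, y)
print(Counter(y_bal))
```

Note that plain random under-/over-sampling, unlike SMOTE, does not create synthetic points; it only changes the class ratio of the training set.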
