How to manage sampling bias between training data and real-world data?
I'm currently working on a binary classification problem.
My training dataset is rather small, with only 1000 elements.
(I don't know if it is relevant: my problem is similar to spam filtering, where an item can be more or less likely to be spam, but I simplified it into a black-or-white issue and use the probability output by the models as a likelihood score.)
Among those 1000 elements:
- 70% are from class 1
- 30% are from class 2
I have no way to know the distribution of the real-world data; I can only guess that the distribution of my training set is not representative of the real-world one (it could be 50/50, 40/60, 90/10, ...).
From what I have read (please correct me if I'm wrong), this issue would be described as sampling bias, and addressing it would require domain adaptation (which I'm not really familiar with).
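To make the question concrete, below is a rough sketch of the only kind of correction I could think of: rescaling the predicted probabilities by a guessed real-world class prior (the 50/50 target prior is an arbitrary placeholder, and the synthetic data just stands in for my real dataset). I have no idea whether this is a sound approach.

```python
# Minimal sketch of prior correction, assuming scikit-learn.
# The 50/50 "target_prior" is a hypothetical guess, since the real-world
# class distribution is unknown.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for my 1000-element, 70/30 dataset
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p = clf.predict_proba(X_te)            # posteriors learned under the ~70/30 training prior

train_prior = np.array([0.7, 0.3])     # class proportions in my training data
target_prior = np.array([0.5, 0.5])    # guessed real-world proportions (unknown!)

# Rescale each posterior by the ratio of priors, then renormalize each row
p_adj = p * (target_prior / train_prior)
p_adj = p_adj / p_adj.sum(axis=1, keepdims=True)
```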
I would like to be able to predict the class accurately despite the difference in distribution between my training set and the real world.
Are there methods to compensate for this discrepancy in data distribution?
Or is there a way to ignore the distribution altogether?
Should I use specific models/algorithms?
Are there things I should or shouldn't do in this particular case?
Thank you for your help
Topic: bias, distribution, domain-adaptation, model-selection, data-cleaning
Category: Data Science