How to manage sampling bias between training data and real-world data?

I'm currently working on a binary classification problem.

My training dataset is rather small, with only 1000 elements.

(I don't know if it is relevant: my problem is similar to spam filtering, where an item can be more or less likely to be spam, but I simplified it into a black-or-white issue and use the probability given by the models to assign a likelihood score.)
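
For concreteness, here is a minimal sketch of what I mean by using the model's probability as a likelihood score (scikit-learn assumed; the logistic regression and the random placeholder data are just stand-ins, not my actual pipeline):

    # Sketch of my setup: the predicted probability of the "spam-like" class
    # is used as a likelihood score (model and data are placeholders).
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))              # placeholder features
    y = np.where(rng.random(1000) < 0.7, 1, 2)   # ~70/30 split, like my data

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # predict_proba columns follow model.classes_, so column 1 is class 2
    likelihood_score = model.predict_proba(X_test)[:, 1]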

Among those 1000 elements:

  • 70% are from class 1
  • 30% are from class 2

I have no way to know the distribution of the real-world data; I can only guess that the distribution of my training set is not representative of the real-world one (it could be 50/50, or 40/60, or 90/10, ...).
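
To illustrate my concern with a toy number (this is just my own back-of-the-envelope reasoning, so please correct it if it is wrong): if the model's probabilities reflect the 70/30 training prior, the same raw score would mean something quite different under another prior.

    # Toy illustration: rescaling a calibrated posterior from the training
    # prior to a guessed real-world prior (my own rough reasoning, not a recipe).
    def adjust_for_prior(p, train_prior, target_prior):
        """Rescale p = P(class 2 | x), calibrated under P(class 2) = train_prior,
        so that it reflects P(class 2) = target_prior."""
        num = p * target_prior / train_prior
        den = num + (1 - p) * (1 - target_prior) / (1 - train_prior)
        return num / den

    p = 0.60  # score my model gives to class 2 (trained on the 30% class 2 data)
    print(adjust_for_prior(p, train_prior=0.30, target_prior=0.70))  # ~0.89
    print(adjust_for_prior(p, train_prior=0.30, target_prior=0.10))  # ~0.28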

From what I have read (please correct me if I'm wrong), this issue would be defined as sampling bias and would call for domain adaptation (which I'm not really familiar with).
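
For instance, one thing I have seen mentioned in this context is reweighting the training samples towards an assumed real-world prior. Here is a sketch of what I imagine that would look like (scikit-learn assumed; the 50/50 target prior is pure guesswork, which is exactly my problem). Is this the kind of technique that is meant?

    # Sketch: training-time reweighting towards a guessed target prior
    # (the 50/50 target is an assumption, not something I actually know).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def prior_weights(y, target_prior):
        # Per-sample weights so each class contributes according to target_prior
        counts = {c: np.sum(y == c) for c in np.unique(y)}
        return np.array([target_prior[c] / counts[c] for c in y]) * len(y)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))              # placeholder features
    y = np.where(rng.random(1000) < 0.7, 1, 2)   # ~70/30, like my data

    weights = prior_weights(y, target_prior={1: 0.5, 2: 0.5})
    model = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=weights)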

I would like to be able to predict the class accurately, despite the difference in distribution from the real-world data.

Are there methods to compensate for this discrepancy in data distribution?

Or is there a way to ignore the distribution entirely?

Should I use specific models/algorithms?

Are there things I should or shouldn't do in this particular case?

Thank you for your help

Tags: bias, distribution, domain-adaptation, model-selection, data-cleaning

