Discrepancy between training set and real-world data set: domain adaptation?

I have read in literature that in some cases the training set is not representative for a real-world dataset. However, I cannot seem to find a proper term describing this phenomenon; what is the proper term to address this problem?

Edit:

So far I have settled for the term domain adaptation, shortly described as a field in machine learning which aims to learn from a certain data distribution in order to predict data coming from a different (but related) target distribution.

Topic domain-adaptation dataset predictive-modeling machine-learning

Category Data Science


The case that you are describing is referred in the literature as sample selection bias [1]. This case is a part of the area of Transfer Learning/Domain Adaptation. The training set does not represent the real world data-set well, which means that there is a difference between the distributions of the training and test sets. Another term from the Domain Adaptation area that is referred to the same problem is the Covariate Shift.

  1. B. Zadrozny, “Learning and Evaluating Classifiers under Sample Selection Bias,” Proc. 21st Int’l Conf. Machine Learning, July 2004.

Overfitting?

This happens when you make a model too specific to the training set, so that it performs very well on that particular training data, but then it is not able to generalize to other data ("real world data") and so performs poorly in reality.


Extrapolation? Happens a lot when your data distributions change over time, so a system that is well modeled in the training set wont know how to deal with values that are not in a similar range. More of a general term, so it might be what you're looking for.

It also has different effects depending on the technique you use. Something like random forests is not very good at extrapolation, where others like logistic regression can still perform OK.


You may be looking for sampling bias. Also the other case (where training set does in fact represent the real world data-set well) is generally known as representative sample.

Hope this helps.

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.