Discrepancy between training set and real-world data set: domain adaptation?

Question

Archie

July 16, 2017 21:04

I have read in the literature that in some cases the training set is not representative of the real-world data set. However, I cannot find a proper term for this phenomenon. What is the correct term for this problem?

Edit:

So far I have settled on the term domain adaptation, briefly described as a field of machine learning that aims to learn from one data distribution in order to make predictions on data coming from a different (but related) target distribution.

Topic domain-adaptation dataset predictive-modeling machine-learning

Category Data Science

Christos Karatsalos · Accepted Answer · July 16, 2017 21:04

The case you are describing is referred to in the literature as sample selection bias [1]. It is part of the area of Transfer Learning/Domain Adaptation. The training set does not represent the real-world data set well, which means that the training and test sets follow different distributions. Another term from the Domain Adaptation area that is used for this problem is covariate shift, in which the distribution of the inputs changes while the relationship between inputs and labels stays the same.

[1] B. Zadrozny, "Learning and Evaluating Classifiers under Sample Selection Bias," Proc. 21st Int'l Conf. Machine Learning, July 2004.
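
To make the distribution mismatch concrete, here is a minimal sketch of importance weighting, one common way to correct for covariate shift. It is an illustration under strong assumptions, not the method from [1]: both input densities are taken to be known Gaussians (in practice they would have to be estimated), and the names `normal_pdf`, `importance_weighted_mean` and `per_example_loss` are hypothetical helpers invented for this example.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_weighted_mean(xs, values, train_params, target_params):
    """Self-normalised weighted average with weights p_target(x) / p_train(x)."""
    weights = [
        normal_pdf(x, *target_params) / normal_pdf(x, *train_params) for x in xs
    ]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def per_example_loss(x):
    """Stand-in for any per-example quantity, e.g. a model's loss at x."""
    return x * x


rng = random.Random(0)
train_params = (0.0, 1.0)   # assumed (mean, std) of the training inputs
target_params = (1.0, 1.0)  # assumed (mean, std) of the real-world inputs

train_x = [rng.gauss(*train_params) for _ in range(20000)]
target_x = [rng.gauss(*target_params) for _ in range(20000)]

train_values = [per_example_loss(x) for x in train_x]

# Plain average over the training sample: biased for the target domain.
naive_estimate = sum(train_values) / len(train_values)

# Reweighted average over the same training sample.
weighted_estimate = importance_weighted_mean(
    train_x, train_values, train_params, target_params
)

# Reference value computed directly on target-domain samples.
target_estimate = sum(per_example_loss(x) for x in target_x) / len(target_x)

print(f"naive: {naive_estimate:.2f}")
print(f"weighted: {weighted_estimate:.2f}")
print(f"target: {target_estimate:.2f}")
```

With these settings the naive average should land near 1, while the weighted average and the target-domain reference should both land near 2, since the expected value of x squared is 1 under the training distribution and 2 under the target distribution.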

lkavenagh · Accepted Answer · October 30, 2016 23:39

Overfitting?

This happens when you make a model too specific to the training set, so that it performs very well on that particular training data, but then it is not able to generalize to other data ("real world data") and so performs poorly in reality.
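
A toy sketch of this effect, assuming a one-dimensional problem with 20% label noise. A 1-nearest-neighbour "memorizer" fits the training set perfectly but does worse on fresh data than a fixed threshold rule. All function names and settings here are made up for the illustration.

```python
import random


def noisy_label(x, flip_prob, rng):
    """True class is 1 when x > 0, else 0; flipped with probability flip_prob."""
    label = 1 if x > 0 else 0
    return 1 - label if rng.random() < flip_prob else label


def make_data(n, flip_prob, rng):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [noisy_label(x, flip_prob, rng) for x in xs]
    return xs, ys


def memorizer_predict(train_x, train_y, x):
    """1-nearest-neighbour: copies the label of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def threshold_predict(x):
    """A deliberately simple rule that ignores individual training points."""
    return 1 if x > 0 else 0


def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


rng = random.Random(0)
train_x, train_y = make_data(400, 0.2, rng)
test_x, test_y = make_data(2000, 0.2, rng)

memo_train_acc = accuracy(
    [memorizer_predict(train_x, train_y, x) for x in train_x], train_y
)
memo_test_acc = accuracy(
    [memorizer_predict(train_x, train_y, x) for x in test_x], test_y
)
simple_train_acc = accuracy([threshold_predict(x) for x in train_x], train_y)
simple_test_acc = accuracy([threshold_predict(x) for x in test_x], test_y)

print(f"memorizer: train={memo_train_acc:.2f} test={memo_test_acc:.2f}")
print(f"threshold: train={simple_train_acc:.2f} test={simple_test_acc:.2f}")
```

The memorizer reproduces the noisy training labels exactly, so its training accuracy is perfect, but the noise it memorized does not carry over to new points. The threshold rule scores about the same on both sets because it never adapted to individual training examples.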

TBSRounder · Accepted Answer · September 30, 2016 15:21

Extrapolation? This happens a lot when your data distributions change over time, so a system that is well modeled on the training set won't know how to deal with values outside the range it was trained on. It is more of a general term, so it might be what you're looking for.

It also has different effects depending on the technique you use. Something like a random forest is not very good at extrapolation, whereas others like logistic regression can still perform OK.
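
A rough sketch of why this happens, using stand-ins rather than the actual models named above: a piecewise-constant binned predictor plays the role of a tree-based model (trees also predict a constant per region), and an ordinary least-squares line plays the role of a parametric model. The data, the bin count and all function names are assumptions chosen for the illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def fit_bins(xs, ys, n_bins):
    """Piecewise-constant model: the mean of y inside each equal-width bin."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        i = min(int((x - lo) / width), n_bins - 1)
        sums[i] += y
        counts[i] += 1
    # Assumes every bin received at least one training point.
    means = [s / c for s, c in zip(sums, counts)]
    return lo, width, means


def predict_bins(model, x):
    """Inputs outside the training range fall into the nearest edge bin."""
    lo, width, means = model
    i = int((x - lo) / width)
    return means[max(0, min(i, len(means) - 1))]


# Training inputs cover [0, 10]; the true relationship is y = 2x + 1.
train_x = [i / 10 for i in range(101)]
train_y = [2 * x + 1 for x in train_x]

slope, intercept = fit_line(train_x, train_y)
bins_model = fit_bins(train_x, train_y, n_bins=10)

new_x = 20.0  # far outside the training range
true_y = 2 * new_x + 1
line_pred = slope * new_x + intercept
bins_pred = predict_bins(bins_model, new_x)

print(f"true={true_y:.1f} line={line_pred:.1f} bins={bins_pred:.1f}")
```

The line keeps following the trend and predicts a value close to the true one, while the binned model can only return the average it saw in its last bin, so its prediction stays stuck near the top of the training range. The line does well here only because the true relationship really is linear; a parametric model with the wrong shape would extrapolate badly too.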

hssay · Accepted Answer · August 31, 2016 13:34

You may be looking for sampling bias. The opposite case (where the training set does in fact represent the real-world data set well) is generally known as a representative sample.

Hope this helps.
