What do "underfitting" and "overfitting" really mean? They have never been clearly defined
I always get lost when dealing with these terms, especially when asked about relationships such as underfitting ↔ high bias (low variance) or overfitting ↔ high variance (low bias). Here is my argument:
- From wiki:
In statistics, **overfitting** is the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably. An overfitted model is a statistical model that contains more parameters than can be justified by the data. The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.
**Underfitting** occurs when a statistical model cannot adequately capture the underlying structure of the data. An underfitted model is a model where some parameters or terms that would appear in a correctly specified model are missing.
Based on these definitions, both underfitting and overfitting produce biased models, and I really cannot tell which one has the higher bias. Furthermore, fitting the training data too closely but failing on the test data does not necessarily mean high variance.
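For reference, the standard squared-error decomposition that these labels implicitly appeal to (not stated in either quoted source) writes the expected test error at a point $x$, over draws of the training set $D$, as bias squared plus variance plus irreducible noise:

```latex
\mathbb{E}_{D}\!\left[\big(y - \hat{f}_{D}(x)\big)^{2}\right]
  = \underbrace{\big(f(x) - \mathbb{E}_{D}[\hat{f}_{D}(x)]\big)^{2}}_{\text{bias}^{2}}
  + \underbrace{\mathbb{E}_{D}\!\left[\big(\hat{f}_{D}(x) - \mathbb{E}_{D}[\hat{f}_{D}(x)]\big)^{2}\right]}_{\text{variance}}
  + \underbrace{\sigma^{2}}_{\text{noise}}
```

Note that the decomposition itself does not say which of the first two terms dominates for an under- or overfitted model; that mapping is the convention being questioned here.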
- From Stanford CS229 Notes
- High Bias ←→ Underfitting
- High Variance ←→ Overfitting
- Large σ² ←→ Noisy data
If we define underfitting and overfitting directly in terms of high bias and high variance, my question is: suppose the true model is f = 0 with σ² = 100, and I compare method A, a complex NN + xgboost trees + a random forest, against method B, a simplified binary tree with a single leaf that always predicts 0.1. Which one is overfitting? Which one is underfitting?
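The thought experiment above can be simulated. This is a minimal sketch under assumptions of my own: a 1-nearest-neighbour regressor stands in for the complex ensemble of method A (it can memorise the training noise exactly), and method B is just the constant prediction 0.1. The data sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 10.0  # noise standard deviation, so sigma^2 = 100

# True model f(x) = 0: the labels are pure noise around zero.
X_train = rng.uniform(-1, 1, 200)
y_train = rng.normal(0.0, sigma, 200)
X_test = rng.uniform(-1, 1, 200)
y_test = rng.normal(0.0, sigma, 200)

def predict_1nn(x_query):
    """Predict with the label of the nearest training point (pure memorisation)."""
    idx = np.abs(x_query[:, None] - X_train[None, :]).argmin(axis=1)
    return y_train[idx]

# "Method A" stand-in: zero training error, but the memorised noise hurts at test time.
mse_a_train = np.mean((predict_1nn(X_train) - y_train) ** 2)
mse_a_test = np.mean((predict_1nn(X_test) - y_test) ** 2)

# Method B: constant prediction 0.1 everywhere; error is about sigma^2 on both sets.
mse_b_train = np.mean((0.1 - y_train) ** 2)
mse_b_test = np.mean((0.1 - y_test) ** 2)

print(f"Method A: train MSE {mse_a_train:6.1f}, test MSE {mse_a_test:6.1f}")
print(f"Method B: train MSE {mse_b_train:6.1f}, test MSE {mse_b_test:6.1f}")
```

Under the high-variance reading, method A is overfitting: its train MSE is 0 while its test MSE is roughly 2σ², because its predictions vary with the memorised noise. Method B sits near σ² on both sets, which is the best achievable here since f = 0, so calling it "underfitting" seems to hinge on definition rather than performance.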
Topic bias overfitting terminology machine-learning
Category Data Science