What do "underfitting" and "overfitting" really mean? They have never been clearly defined

I always get lost when dealing with these terms, especially when asked about relationships such as underfitting ↔ high bias (low variance) or overfitting ↔ high variance (low bias). Here is my argument:

  1. From Wikipedia:

In statistics, **overfitting** is the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably. An overfitted model is a statistical model that contains more parameters than can be justified by the data. The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.

**Underfitting** occurs when a statistical model cannot adequately capture the underlying structure of the data. An under-fitted model is a model where some parameters or terms that would appear in a correctly specified model are missing.

Based on this definition, both under-fitting and over-fitting are biased, and I really cannot tell which one has the higher bias. Furthermore, fitting the training data too closely but failing on the test data does not necessarily imply high variance.

  2. From the Stanford CS229 notes:

High bias ←→ Underfitting
High variance ←→ Overfitting
Large σ² ←→ Noisy data

If we define underfitting and overfitting directly in terms of high bias and high variance, my question is: suppose the true model is f = 0 with σ² = 100, and I use method A (a complex NN + xgboost tree + random forest ensemble) and method B (a simplified binary tree with a single leaf predicting 0.1). Which one is overfitting, and which one is underfitting?
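To make the scenario concrete, here is a small simulation I can run (the nearest-neighbour "memorizer" and the constant predictor below are simplified stand-ins of my own for methods A and B, not the actual models):

```python
import random

random.seed(0)
SIGMA = 10.0  # noise standard deviation, so sigma^2 = 100; true model f(x) = 0

def sample(n):
    # y = f(x) + noise = 0 + N(0, sigma^2)
    return [(random.random(), random.gauss(0.0, SIGMA)) for _ in range(n)]

train_set, test_set = sample(200), sample(200)

def mse(pairs, predict):
    return sum((y - predict(x)) ** 2 for x, y in pairs) / len(pairs)

# Stand-in for method A: memorise the training set by returning the label
# of the nearest training point -- zero training error, i.e. it fits the noise.
def memorizer(x):
    return min(train_set, key=lambda p: abs(p[0] - x))[1]

# Stand-in for method B: a one-leaf tree that always predicts 0.1.
def one_leaf(x):
    return 0.1

print("A:", mse(train_set, memorizer), mse(test_set, memorizer))  # 0.0 on train, large on test
print("B:", mse(train_set, one_leaf), mse(test_set, one_leaf))    # roughly sigma^2 on both
```

The memorizer achieves zero training error yet does worse than the nearly-optimal constant on fresh data, which is the behaviour usually labelled overfitting; yet by the bias/variance labels above it is not obvious how to classify either method.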

Topic bias overfitting terminology machine-learning

Category Data Science


Personally, I find Victor Lavrenko's explanation of underfitting and overfitting the most intuitive and concise definition:

This definition is useful for at least these two reasons:

  1. It is not always an easy task to measure a model's complexity, as presented in most of the diagrams that explain this concept

  2. You avoid the pitfall of comparing the same model's metrics on the train and test sets, as described here:

...We can identify if a machine learning model has overfit by first evaluating the model on the training dataset and then evaluating the same model on a holdout test dataset. If the performance of the model on the training dataset is significantly better than the performance on the test dataset, then the model may have overfit the training dataset....

This situation is not clearly defined, since there is no "standard" difference between train and test error that can guarantee your model is overfitting; as far as I know, there is no precise meaning of "significantly better" than the test set.

But of course I am not saying that you should not care about training vs. testing error.
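A tiny sketch of why that heuristic is fuzzy: any threshold on the train/test gap is an arbitrary choice. The helper and the tolerance below are made up purely for illustration:

```python
def overfit_gap(train_score, test_score, tolerance=0.05):
    """Flag a possible overfit when the training score exceeds the test
    score by more than a chosen tolerance. The tolerance is arbitrary:
    there is no standard threshold, which is exactly the problem."""
    return (train_score - test_score) > tolerance

print(overfit_gap(0.99, 0.80))  # True  -> large gap, likely overfitting
print(overfit_gap(0.91, 0.89))  # False -> gap within the (arbitrary) tolerance
```

Change `tolerance` and the verdict changes, which is why the train/test gap is a diagnostic hint rather than a definition.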



"An overfitted model is a statistical model that contains more parameters than can be justified by the data"

This is an idea that is well past its "best before date". In the early days of computational statistics, the most common way of controlling the complexity of a model was to limit the number of parameters (e.g. feature selection for linear models). But that hasn't been true for a long time. The early 1970s saw the introduction of ridge regression, which introduced the idea of regularisation to control the capacity of a model: it adds a penalty term to the training criterion that penalises large weight magnitudes. This is mathematically equivalent to placing an upper bound on the squared norm of the weight vector, which implements a simple form of "structural risk minimisation" (c.f. SVMs) - if we increase the bound slightly, the model can do anything it could do before, plus a few other things. So the regularisation parameter forms a set of nested model classes of increasing complexity. This means we can have over-parameterised models that don't over-fit, and indeed that is pretty much what modern machine learning algorithms are all about.
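As a minimal sketch of the ridge idea (my own illustration, with made-up random data), the closed-form solution w = (XᵀX + λI)⁻¹Xᵀy keeps every parameter but shrinks the whole weight vector as the penalty λ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))  # 50 samples, 10 features (arbitrary demo data)
y = rng.normal(size=50)

def ridge_weights(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam * I)^(-1) X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Increasing the penalty shrinks the weight vector: the model keeps all
# ten parameters, but its effective capacity goes down.
for lam in (0.0, 1.0, 100.0):
    print(lam, np.linalg.norm(ridge_weights(X, y, lam)))
```

The printed norms decrease monotonically with λ, even though the parameter count never changes - capacity and parameter count are different things.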

So one thing that would reduce the confusion is not to conflate over-fitting (fitting the data too closely) with over-parameterisation (having more parameters than strictly necessary to represent the underlying structure of the data).

When we "fit" a model, we generally mean we adjust the parameters of the model so that its output more closely resembles the calibration data according to some criterion that measures the data "misfit". So over-fitting basically means reducing the value of the data-misfit function too much. How much is "too much"? If it makes generalisation performance worse, that is "too much".

If you can make generalisation performance better by using a more complex model (or training it for longer) then your model is currently "underfitting" the data.

Over/under-fitting is not defined in terms of bias or variance; it is defined in terms of the training error (the data misfit) and the generalisation properties of the model. Bias and variance are useful terms for understanding the consequences of over- and under-fitting. The diagrams help though.


You can look at the following figure to get a graphical intuition. Visit the source for a detailed illustration.

Source : https://www.kaggle.com/getting-started/166897


I'll try to make it as simple as possible. Underfitting is when your model has high bias and low variance. The model learns too little from the training data, so the training score is low (high bias) and the test score is low as well. You get underfitting when your model is too simple for the data, or the data is too complex for your model to capture.

Here is an example of underfitting:

[figure: train and test scores both stay low]

As we can see, both the train and test scores are poor, which means the model has learned little from the data and predicts poorly on the test set.

Techniques to reduce underfitting:

  1. Increase model complexity

  2. Increase number of features, performing feature engineering

  3. Remove noise from the data.

  4. Increase the number of epochs or increase the duration of training to get better results.
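As a quick illustration of technique 1 (increasing model complexity): fitting polynomials of rising degree to a nonlinear target steadily lowers the training error. The data and the degrees below are made up for the demo:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 40)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=x.size)  # nonlinear target + noise

def train_mse(degree):
    # Least-squares polynomial fit of the given degree, scored on the training set.
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    return float(np.mean((y - pred) ** 2))

# A degree-1 (linear) model underfits the sine curve; raising the degree
# (i.e. increasing model complexity) brings the training error down.
for d in (1, 3, 7):
    print(d, train_mse(d))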

Overfitting is when you have low bias and high variance. So the model learns everything from the training dataset (high train score aka low bias) but is not able to perform good on the test set (low test score aka high variance) You get overfitting when your model is too complex for the data or your data is too simple for the model.

Here is an example of overfitting:-

enter image description here

As we can see, the training loss decreases initially (low bias) but the test/validation loss, after decreasing to a certain point starts gradually increasing. Also apparent is the large gap between train and test lines.

Techniques to reduce overfitting :

  1. Increase training data.

  2. Reduce model complexity.

  3. Early stopping during the training phase (have an eye over the loss over the training period as soon as loss begins to increase stop training).

  4. Ridge Regularization and Lasso Regularization

  5. Use dropout for neural networks to tackle overfitting.

Hope that clears the confusion!

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.