When is a Model Underfitted?

Logic often states that by underfitting a model, its capacity to generalize is increased. That said, clearly at some point underfitting causes a model to become worse regardless of the complexity of the data.

How do you know when your model has struck the right balance and is not underfitting the data it seeks to model?

Note: This is a followup to my question, "Why Is Overfitting Bad?"

In simple terms, when the values predicted by your model are exactly or nearly equal to the true values, you can say that the model is not underfitting.

If the predicted values are not close to the true values, then the model can be said to be underfitting.

To answer your question, it is important to understand the frame of reference you are looking for. If you are looking for what you are philosophically trying to achieve in model fitting, check out Rubens' answer; he does a good job of explaining that context.

However, in practice your question is almost entirely defined by business objectives.

To give a concrete example, let's say that you are a loan officer. You issue loans of \$3,000, and when people pay you back you make \$50. Naturally, you are trying to build a model that predicts whether a person will default on their loan. Let's keep this simple and say that the outcomes are either full payment or default.

From a business perspective, you can sum up a model's performance with a contingency matrix:

[Image: contingency matrix of predicted versus actual outcomes]

When the model predicts someone is going to default, do they? To determine the downsides of over and under fitting, I find it helpful to think of it as an optimization problem, because in each cell of predicted versus actual outcomes there is either a cost or a profit to be made:

[Image: the same matrix with the cost or profit of each cell]

In this example, predicting a default that is a default means avoiding any risk, and predicting a non-default that doesn't default makes \$50 per loan issued. Where things get dicey is when you are wrong: if a customer defaults when you predicted non-default, you lose the entire loan principal, and if you predict default when a customer actually would not have defaulted, you suffer \$50 of missed opportunity. The numbers here are not important, just the approach.
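The cost framing can be sketched in a few lines of Python. This is only an illustration under the numbers of the example: the `PAYOFF` table and the `total_profit` helper are names made up here, and the five applicants are hypothetical.

```python
# Payoff per loan for each (predicted_default, actual_default) cell,
# using the amounts from the loan example (in dollars).
PAYOFF = {
    (True, True): 0,       # predicted default, would have defaulted: no loan, no loss
    (False, False): 50,    # predicted repayment, customer repaid: profit of 50
    (False, True): -3000,  # predicted repayment, customer defaulted: principal lost
    (True, False): -50,    # predicted default, would have repaid: missed opportunity
}

def total_profit(predicted, actual):
    """Sum the payoff over paired predictions and outcomes (True means default)."""
    return sum(PAYOFF[(p, a)] for p, a in zip(predicted, actual))

# Hypothetical outcomes for five applicants.
predicted = [False, False, True, False, True]
actual = [False, False, True, True, False]
print(total_profit(predicted, actual))  # prints -2950
```

A single missed default wipes out the profit of many good loans, which is why the cost of each cell matters more than raw accuracy here.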

With this framework we can now begin to understand the difficulties associated with over and under fitting.

Over fitting in this case would mean that your model works far better on your development/test data than it does in production. Or, to put it another way, your model in production will far underperform what you saw in development. This false confidence will probably cause you to take on far riskier loans than you otherwise would, and leaves you very vulnerable to losing money.

On the other hand, under fitting in this context will leave you with a model that just does a poor job of matching reality. While the results of this can be wildly unpredictable (the opposite of the word you want describing your predictive models), what commonly happens is that standards are tightened to compensate, leading to fewer customers overall and to lost good customers.

Under fitting suffers the opposite difficulty from over fitting: it gives you lower confidence. Insidiously, the lack of predictability still leads you to take on unexpected risk, all of which is bad news.

In my experience the best way to avoid both of these situations is validating your model on data that is completely outside the scope of your training data, so you can have some confidence that you have a representative sample of what you will see 'in the wild'.

Additionally, it is always a good practice to revalidate your models periodically, to determine how quickly your model is degrading, and if it is still accomplishing your objectives.
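As a rough sketch of what periodic revalidation could look like, the snippet below flags the periods in which the error has drifted too far above the level measured at validation time. The function name, the tolerance, and the monthly error rates are all hypothetical.

```python
def degraded_periods(baseline_error, period_errors, tolerance=0.05):
    """Return the indices of periods whose error exceeds the baseline by more than tolerance."""
    return [i for i, err in enumerate(period_errors)
            if err - baseline_error > tolerance]

# Hypothetical error rates: the validation-time baseline,
# then one measurement per month in production.
baseline = 0.10
monthly = [0.10, 0.11, 0.13, 0.17, 0.21]
print(degraded_periods(baseline, monthly))  # prints [3, 4]
```

How large a tolerance is acceptable is again a business decision, not a statistical one.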

Just to sum things up, your model is under fitted when it does a poor job of predicting both the development and production data.

A model underfits when it is too simple with regard to the data it is trying to model.

One way to detect such a situation is to use the bias–variance approach, which can be represented like this:

[Image: bias–variance graph]

Your model is underfitted when you have a high bias.

To know whether your bias or your variance is too high, look at the training and test errors:

High bias: This learning curve shows high error on both the training and test sets, so the algorithm is suffering from high bias:

[Image: learning curve for a high-bias algorithm]

High variance: This learning curve shows a large gap between training and test set errors, so the algorithm is suffering from high variance.

[Image: learning curve for a high-variance algorithm]

If an algorithm is suffering from high variance:

  • more data will probably help
  • otherwise reduce the model complexity

If an algorithm is suffering from high bias:

  • increase the model complexity
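The two diagnoses above can be written down as a small rule of thumb. This is a minimal sketch, assuming you already have a training error and a test error; the `diagnose` function and both thresholds are hypothetical and would have to be chosen for your own problem.

```python
def diagnose(train_error, test_error, acceptable_error, max_gap):
    """Classify a fit from its training and test errors.

    acceptable_error and max_gap are thresholds chosen for the problem at hand.
    """
    if train_error > acceptable_error:
        # The model cannot even fit the data it was trained on.
        return "high bias (underfitting)"
    if test_error - train_error > max_gap:
        # Good on the training data, much worse on unseen data.
        return "high variance (overfitting)"
    return "reasonable fit"

print(diagnose(0.30, 0.32, acceptable_error=0.10, max_gap=0.05))
print(diagnose(0.02, 0.25, acceptable_error=0.10, max_gap=0.05))
```

The first call reports high bias because both errors are high; the second reports high variance because of the large gap between them.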

I would advise watching Coursera's Machine Learning course, section "10: Advice for applying Machine Learning", from which I took the above graphs.

Simply put, one common approach is to start with a simple model, which will most probably underfit at first, and to increase its complexity until early signs of overfitting are witnessed using a resampling technique such as cross validation or the bootstrap.

You increase the complexity either by adding parameters (number of hidden neurons for artificial neural networks, number of trees in a random forest) or by relaxing the regularization term in your model (often named lambda, or C for support vector machines).
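Here is a toy sketch of that procedure. Several things are assumptions made only for illustration: the data is synthetic, the model is a piecewise-constant fit whose complexity is the number of bins, and a single held-out split stands in for a proper resampling scheme such as cross validation.

```python
import math
import random

def fit_binned_means(xs, ys, n_bins):
    """Piecewise-constant model on [0, 1): one mean per bin. More bins means more complexity."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    # Bins that received no training point fall back to the overall mean.
    return [s / c if c else overall for s, c in zip(sums, counts)]

def mse(model, xs, ys):
    """Mean squared error of a binned model on the given points."""
    n_bins = len(model)
    errors = [(model[min(int(x * n_bins), n_bins - 1)] - y) ** 2
              for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

# Synthetic data: one period of a sine wave plus Gaussian noise.
rng = random.Random(0)
n = 80
xs = [i / n for i in range(n)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.2) for x in xs]

# Every fourth point is held out for validation.
train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 4 != 0]
valid = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 4 == 0]
tx, ty = zip(*train)
vx, vy = zip(*valid)

# Sweep the complexity from very simple to very flexible.
results = {}
for n_bins in (1, 2, 4, 8, 16, 32, 64):
    model = fit_binned_means(tx, ty, n_bins)
    results[n_bins] = (mse(model, tx, ty), mse(model, vx, vy))
    print(n_bins, results[n_bins])

# Keep the complexity with the lowest validation error.
best = min(results, key=lambda b: results[b][1])
print("best number of bins:", best)
```

The training error keeps falling as bins are added, while the validation error falls at first and then rises again; the point where it turns is the early sign of overfitting mentioned above.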

CAPM (Capital Asset Pricing Model) in Finance is a classic example of an underfit model. It was built on the beautiful theory that "Investors only pay for risk they can't diversify away", so expected excess returns are proportional to the asset's sensitivity to market returns.

As a formula [0]: Ra = Rf + Beta (Rm - Rf), where Ra is the expected return of the asset, Rf is the risk-free rate, Rm is the market rate of return, and Beta is the asset's sensitivity to the equity premium (Rm - Rf).

This is beautiful, elegant, and wrong. Investors seem to require a higher rate of return on small stocks and value stocks (defined by book-to-market or dividend yield).

Fama and French [1] presented an update to the model, which adds additional Betas for Size and Value.
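To make the difference concrete, here is a small sketch of both models. The function names are made up, SMB and HML are the usual labels for the size and value premiums in the three-factor model, and every number below is hypothetical.

```python
def capm_expected_return(rf, beta, rm):
    """CAPM: Ra = Rf + Beta * (Rm - Rf)."""
    return rf + beta * (rm - rf)

def three_factor_expected_return(rf, beta_mkt, rm, beta_size, smb, beta_value, hml):
    """CAPM plus two extra terms: a size premium (SMB) and a value premium (HML)."""
    return (capm_expected_return(rf, beta_mkt, rm)
            + beta_size * smb
            + beta_value * hml)

# Hypothetical annual figures, for illustration only.
rf, rm = 0.02, 0.08
print(capm_expected_return(rf, 1.0, rm))
print(three_factor_expected_return(rf, 1.0, rm, 0.5, 0.03, 0.4, 0.04))
```

With a single Beta, a small value stock and a large growth stock with the same market sensitivity get the same expected return; the two extra terms are what let the richer model tell them apart.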

So how do you know in a general sense? When the predictions you are making are wrong, and another variable with a logical explanation increases the prediction quality. It's easy to understand why someone might think small stocks are risky, independent of non-diversifiable risk. It's a good story, backed by the data.

[0] http://www.investopedia.com/terms/c/capm.asp
[1] http://en.wikipedia.org/wiki/Fama%E2%80%93French_three-factor_model

Models are but abstractions of what is seen in real life. They are designed to abstract away the nitty-gritty details of the real system under observation, while keeping sufficient information to support the desired analysis.

If a model is overfit, it takes into account too many details about what is being observed, and small changes in that object may cause the model to lose precision. On the other hand, if a model is underfit, it evaluates so few attributes that noteworthy changes in the object may be ignored.

Note also that an underfit may be seen as an overfit, depending on the dataset. If your input can be classified 99% correctly with a single attribute, you overfit the model to the data by simplifying the abstraction to a single characteristic. In this case, you would be either generalizing the remaining 1% of the data into the 99% class, or specifying the model so narrowly that it can only see one class.

A reasonable way to check that a model is neither over nor underfit is to perform cross-validation. You split your dataset into k parts and pick one of them to evaluate your model, while using the other k - 1 parts to train it, repeating this for each part in turn. Assuming that the input itself is not biased, you should have as much variety of data to train and evaluate on as you would have when using the model in real life.
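A bare-bones sketch of k-fold cross-validation could look like the following. The helper names are hypothetical, and the "model" is deliberately trivial (it always predicts the training mean) so that the splitting logic stays in focus.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_val_mse(xs, ys, k, fit, predict):
    """Average held-out mean squared error over k folds."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in test]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / k

# Deliberately simple model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def predict_mean(model, x):
    return model

xs = list(range(20))
ys = [2.0 * x for x in xs]
print(cross_val_mse(xs, ys, k=5, fit=fit_mean, predict=predict_mean))
```

On this linear data the mean predictor scores badly on every fold, which is exactly the signature of an underfit model: the held-out error is high no matter which part is left out.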

