Bias and variance: in the model or in the predictions?

This topic confuses me. In the literature and in articles, when discussing bias and variance in machine learning, specifically in the context of cross-validation, do authors mean high bias (underfitting) and high variance (overfitting) in the model? Or do they mean the bias and variance of the predictions obtained across the cross-validation iterations? And how should each case be handled?

Topic bias variance cross-validation machine-learning

Category Data Science


Bias and variance describe the predictions of a model, and together they determine whether it is a good one. Since a perfect model (low bias, low variance) does not exist, you often have to choose between a model with high bias/low variance and one with low bias/high variance. This applies to the distribution of the predictions.

Cross-validation gives you a more reliable value for your metrics (like accuracy), because the stochastic nature of model training can produce different metric values each time you run it: some randomness comes from random seeds, and some from the way the train/validation/test split is done.

So it is good practice to use cross-validation: the metrics are averaged over several train/validation splits, and the mean values are more realistic for comparing different cases/models/parameters.
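As a minimal sketch of that averaging idea (the helper names `k_fold_indices`, `cross_validate`, and the majority-class toy model below are my own illustration, not a library API), each fold is held out once, the model is scored on it, and the per-fold scores are averaged:

```python
import random
from statistics import mean, stdev

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, train_and_score, k=5, seed=0):
    """Train/evaluate k times, each time holding out one fold; return the scores."""
    folds = k_fold_indices(len(X), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        scores.append(train_and_score(
            [X[j] for j in train_idx], [y[j] for j in train_idx],
            [X[j] for j in test_idx], [y[j] for j in test_idx]))
    return scores

def majority_classifier(X_tr, y_tr, X_te, y_te):
    """Toy 'model': always predict the majority class seen in training."""
    pred = max(set(y_tr), key=y_tr.count)
    return sum(yt == pred for yt in y_te) / len(y_te)

X = list(range(20))
y = [0] * 12 + [1] * 8
scores = cross_validate(X, y, majority_classifier, k=5)
print(f"mean accuracy: {mean(scores):.2f} +/- {stdev(scores):.2f}")
```

Reporting the mean together with the spread across folds is exactly what makes the comparison between models more trustworthy than a single split.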


I finally understood "Bias and Variance" with Logistic Regression (LR). LR is known to have high bias but low variance. For example, LR accuracy might be only 90% while you'd get 95% with a Decision Tree (which tends to overfit). But with LR in production being fed new data, you're very likely to keep seeing 90% accuracy, while the Decision Tree shows large accuracy swings (variance) on new data (70%-95%).

Hope that helps.


In some cases, you may have a model that's a black box: you feed in input features and get output predictions, without knowing or caring what happens in the middle. In those situations, the model is, in a way, defined by its output; two models that produce the same predictions are indistinguishable from one another, even though they may be entirely distinct models. In these situations, saying a model is biased or has high variance is equivalent to saying the predictions of the model are biased or have high variance, so it is acceptable to describe both the model and its output in terms of bias/variance. Cross-validation gives you an estimate of that bias/variance for your model/predictions.
