Why do we take $\alpha\sum B_j^2$ as the penalty in Ridge Regression?

$$RSS_{RIDGE}=\sum_{i=1}^n(y_i-\hat{y}_i)^2+\alpha\sum_{j=1}^p B_j^2$$

Why are we taking $\alpha\sum B_j^2$ as the penalty here? We add this term to reduce the variance of the machine learning model, but how does this term reduce variance? If I instead added, say, $e^x$ or any other increasing function, then it would also reduce the variance. I want to understand how this particular term reduces the error.

Topic ridge-regression machine-learning-model machine-learning

Category Data Science


Expanding my comments into an answer.

Ridge regression is by definition an augmentation of the least squares method, intended especially for problems where the data may be highly correlated (what is called multicollinearity).

Let's assume the dependent variable is $y$ and $x_i$ are the independent variables.

Then assume the true mapping between $X$ and $y$ is:

$$y = \sum_i \beta_i x_i$$.

Ordinary Least Squares (OLS) assumes the mapping is given by $\hat{\beta_i}$ coefficients which solve the least squares problem:

$$\hat{\beta} = (X^TX)^{-1}X^Ty$$

The above formula indeed minimises the squared error $RSS = \sum_i(y_i-\hat{y}_i)^2$.
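As a quick illustration (my own example, not part of the original answer; the synthetic data and variable names are made up), the closed-form solution above can be computed directly with NumPy:

```python
import numpy as np

# Synthetic data with known coefficients: y = 3*x1 - 2*x2 + noise
rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=n)

# OLS closed form: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)                        # approximately [3, -2]

# Residual sum of squares for the fitted coefficients
rss = np.sum((y - X @ beta_hat) ** 2)
print(rss)
```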

However, if some columns in $X$ are correlated then ordinary least squares may not provide the best solution.
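One way to make this concrete (again, my own sketch, not taken from the answer): when two columns of $X$ are nearly collinear, the matrix $X^TX$ that OLS has to invert becomes ill conditioned, so small amounts of noise are amplified into large swings in the estimated coefficients. This is easy to check numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)

# Independent second column vs. one that is almost a copy of x1
X_indep = np.column_stack([x1, rng.normal(size=n)])
X_corr = np.column_stack([x1, x1 + 0.01 * rng.normal(size=n)])

# A large condition number means (X^T X)^{-1} amplifies noise,
# i.e. the OLS coefficient estimates have high variance.
print(np.linalg.cond(X_indep.T @ X_indep))  # modest
print(np.linalg.cond(X_corr.T @ X_corr))    # enormous
```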

The Least Squares Method Finds the Best and Unbiased Coefficients

You may know that the least squares method finds the coefficients that best fit the data. One more condition to add is that it also finds the unbiased coefficients. Here, unbiased means that OLS doesn't consider which independent variable is more important than the others; it simply finds the coefficients for a given data set. In short, there is only one set of betas to be found, the one resulting in the lowest 'Residual Sum of Squares (RSS)'. The question then becomes: "Is a model with the lowest RSS truly the best model?"

Bias vs. Variance

The answer to the question above is "not really". As hinted at by the word 'unbiased', we need to consider 'bias' too. Bias here means how equally a model cares about its predictors. Let's say there are two models to predict an apple's price from two predictors, 'sweetness' and 'shine'; one model is unbiased and the other is biased. First, the unbiased model tries to find the relationship between the two features and the price, just as the OLS method does. This model will fit the observations as closely as possible to minimize the RSS. However, this can easily lead to overfitting: the model will not perform as well on new data because it is built so specifically for the given data that it may not fit new data.

The biased model, on the other hand, weights its variables unequally, treating each predictor differently. Going back to the example, we might want to care only about 'sweetness' when building the model, and this should perform better on new data. The reason will become clear after understanding bias vs. variance. If you're not familiar with the bias vs. variance topic, I strongly recommend the Bias-Variance Tradeoff reference below, which will give you insight.

In order to affect the bias-variance tradeoff, ridge regression adds a constraint to the estimation process, i.e.:

The magnitude of the $\beta$ coefficients must satisfy a criterion (the "ridge"), e.g.:

$$\|\beta\|_2^2 \le C^2$$

This constraint can then be incorporated into the least squares objective (e.g. via a Lagrange multiplier). It affects the bias-variance tradeoff and can be helpful in cases such as highly correlated data or data of unequal significance. The resulting new $RSS$ is exactly the penalised form you ask about in the question.
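To spell this step out (a standard derivation, added here for completeness rather than taken from the references), the constrained problem is equivalent to the penalised problem for a suitable $\alpha \ge 0$, and the penalised problem has a closed-form solution:

$$\min_{\beta}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 \quad\text{subject to}\quad \|\beta\|_2^2 \le C^2 \quad\Longleftrightarrow\quad \min_{\beta}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2+\alpha\sum_{j=1}^{p}\beta_j^2$$

$$\hat{\beta}_{ridge} = (X^TX+\alpha I)^{-1}X^Ty$$

The extra $\alpha I$ keeps $X^TX+\alpha I$ invertible and well conditioned even when the columns of $X$ are correlated, which is exactly what shrinks the variance of the coefficient estimates (at the price of some bias). The squared penalty is used rather than an arbitrary increasing function such as $e^x$ because it is convex, differentiable and penalises large coefficients of either sign, which is what makes this closed form and the shrinkage interpretation possible.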

Again, the squared error is minimised together with a constraint on the size of the coefficients in order to affect the bias-variance tradeoff.
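As a final sketch (my own illustration using scikit-learn's LinearRegression and Ridge, which are not mentioned in the references; the data and the value alpha=1.0 are made up for the example), you can see the variance reduction directly: with nearly collinear predictors the OLS coefficients swing wildly, while the ridge coefficients stay close to the true values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)           # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.5, size=n)   # true coefficients are [1, 1]

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)    # often large and of opposite sign
print("Ridge coefficients:", ridge.coef_)  # both close to 1, far more stable
```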

Reference:

  1. Ridge Regression for Better Usage
  2. Bias-Variance Tradeoff
