Expanding my comments into an answer.
Ridge regression is, by definition, an augmentation of the least squares method, designed especially for problems where the independent variables are highly correlated (so-called multicollinearity).
Let's assume the dependent variable is $y$ and the $x_i$ are the independent variables.
Then assume the true mapping between $X$ and $y$ is:
$$y = \sum_i \beta_i x_i.$$
Ordinary Least Squares (OLS) estimates this mapping by the coefficients $\hat{\beta}_i$ that solve the least squares problem:
$$\hat{\beta} = (X^TX)^{-1}X^Ty$$
The above formula indeed minimises the squared error $RSS = \sum_i(y_i-\hat{y}_i)^2$.
However, if some columns in $X$ are correlated then ordinary least squares may not provide the best solution.
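To make this concrete, here is a minimal NumPy sketch (the data and noise levels are made up for illustration) that fits OLS via the normal equation above on two nearly collinear predictors. The RSS stays small, but the estimated coefficients are unstable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear predictors: x2 is x1 plus a little noise.
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2])

# True coefficients are (1, 1); add observation noise to y.
y = X @ np.array([1.0, 1.0]) + rng.normal(scale=0.1, size=n)

# OLS via the normal equations: beta_hat = (X^T X)^{-1} X^T y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
rss = np.sum((y - X @ beta_ols) ** 2)

print("OLS coefficients:", beta_ols)  # typically far from (1, 1) and very sensitive to the noise
print("RSS:", rss)                    # yet the RSS is small: the fit is fine, the coefficients are not
```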
The Least Squares Method Finds the Best and Unbiased Coefficients
You may know that the least squares method finds the coefficients that best
fit the data. One more condition to add is that it also finds the
unbiased coefficients. Here, unbiased means that OLS does not consider
which independent variable is more important than the others; it simply
finds the coefficients for a given data set. In short, there is only
one set of betas to be found, resulting in the lowest Residual Sum of
Squares (RSS). The question then becomes: "Is a model with the lowest
RSS truly the best model?"
Bias vs. Variance
The answer to the question above is "Not really". As hinted by the
word "unbiased", we also need to consider bias. Here, bias means how
equally a model cares about its predictors. Let's say there are two
models to predict an apple's price from two predictors, "sweetness" and
"shine"; one model is unbiased and the other is biased.
First, the unbiased model tries to find the relationship between the
two features and the prices, just as the OLS method does. This model
will fit the observations as perfectly as possible to minimize the
RSS. However, this could easily lead to overfitting issues. In other
words, the model will not perform as well with new data because it is
built for the given data so specifically that it may not fit new data.
The biased model, in contrast, treats its predictors unequally. Going
back to the example, we would only care about "sweetness" when building
the model, and this should perform better on new data. The reason
becomes clear once you understand bias vs. variance. If you're not
familiar with the bias-variance topic, I strongly recommend you watch
this video, which will give you insight.
To shift this bias-variance tradeoff, ridge regression adds a small constraint to the estimation process, i.e.:
The magnitude of the $\beta$ coefficients must satisfy a criterion (the "ridge"), e.g.:
$$\|\beta\|_2^2 \le C^2$$
This constraint can then be folded into the least squares objective (e.g. via a Lagrange multiplier), giving the penalised criterion
$$RSS_{ridge} = \sum_i(y_i-\hat{y}_i)^2 + \lambda\sum_j \beta_j^2,$$
which is essentially the modified $RSS$ you ask about in the question. The constraint shifts the bias-variance tradeoff and can be helpful in cases such as highly correlated data or predictors of unequal importance.
Again, the squared error is minimised, but now subject to a constraint on the coefficients, which is what changes the bias-variance tradeoff.
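As a sketch (with the same kind of made-up collinear data as above, and $\lambda = 1$ chosen arbitrarily), the penalised problem has the well-known closed-form solution $\hat{\beta}_{ridge} = (X^TX + \lambda I)^{-1}X^Ty$, and the shrinkage effect can be seen directly:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)      # nearly collinear with x1
X = np.column_stack([x1, x2])
y = X @ np.array([1.0, 1.0]) + rng.normal(scale=0.1, size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)  # unstable under collinearity
beta_ridge = ridge_fit(X, y, lam=1.0)         # penalised, stabilised estimate

print("OLS:  ", beta_ols)
print("Ridge:", beta_ridge)  # both coefficients end up close to 1, sharing the signal
```

Increasing $\lambda$ shrinks the coefficients further (more bias, less variance); $\lambda = 0$ recovers plain OLS.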
References:
- Ridge Regression for Better Usage
- Bias-Variance Tradeoff