What is the intuition behind decreasing the slope when using regularization?

While training a logistic regression model, using regularization can help distribute weights and avoid reliance on some particular weight, making the model more robust.

E.g.: suppose my input vector is 4-dimensional with values [1,1,1,1]. The output can be 1 if my weight vector is [1,0,0,0] or [0.25,0.25,0.25,0.25]. The L2 penalty would prefer the latter weight vector (because pow(1,2) = 1 > 4*pow(0.25,2) = 0.25). I understand intuitively why L2 regularization can be beneficial here.
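For concreteness, here is a minimal NumPy sketch of that comparison (the input and the two candidate weight vectors are exactly the ones from the example above):

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0, 1.0])
w_sparse = np.array([1.0, 0.0, 0.0, 0.0])
w_spread = np.array([0.25, 0.25, 0.25, 0.25])

# Both weight vectors produce the same prediction for this input ...
print(x @ w_sparse, x @ w_spread)                 # 1.0 1.0

# ... but the spread-out weights have a much smaller squared L2 norm,
# so the L2 penalty prefers them.
print(np.sum(w_sparse**2), np.sum(w_spread**2))   # 1.0 0.25
```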

But in the case of linear regression, L2 regularization reduces the slope. Why does reducing the slope provide better performance? Is increasing the slope also an alternative?



L2 regularization can reduce the slope (i.e., the size of the coefficients) in linear regression.

The fundamental idea is that larger coefficients are less likely to generalize. Regularization increases the cost associated with larger coefficients.
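As a rough illustration (not from the original answer), the sketch below fits ordinary least squares and ridge regression to the same noisy one-feature data and prints the slopes; the `alpha=10.0` regularization strength is an arbitrary assumption:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=30)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # alpha chosen only for illustration

# The ridge slope is pulled toward zero relative to the unregularized slope,
# because the penalty makes a large coefficient more expensive.
print("OLS slope:  ", ols.coef_[0])
print("Ridge slope:", ridge.coef_[0])
```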


There are a lot of misconceptions on this topic.

(satinder singh) Why does reducing the slope provide better performance? Is increasing the slope also an alternative?

Reducing the weights does not by itself lead to better performance. In the limit of infinite regularization your resulting model will be a constant (assuming every weight multiplies an independent variable), and the quality of such a model is obviously bad. The goal of regularization is to prevent overfitting by penalizing large weights.

But why are large weights problematic? Imagine the following set of three points: $(0,0)$, $(\varepsilon,1)$ and $(1, 1)$. If you fit the polynomial $y(x_n)=w_0 + w_1x_n + w_2 x_n^2$ exactly through these points, you will obtain the coefficients $w_0=0$, $w_1=1+\varepsilon^{-1}$, and $w_2=-\varepsilon^{-1}$. For $\varepsilon \to 0$ the coefficients diverge. If you look at these three points you will see that the resulting curve is just overfitting to the data. This example shows that large weights are a sign of overfitting.
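Here is a minimal NumPy sketch (not part of the original answer) that verifies this numerically by solving the exact interpolation problem for shrinking $\varepsilon$:

```python
import numpy as np

# Fit y = w0 + w1*x + w2*x^2 exactly through (0, 0), (eps, 1), (1, 1)
# and watch the coefficients diverge as eps -> 0.
for eps in [0.1, 0.01, 0.001]:
    x = np.array([0.0, eps, 1.0])
    y = np.array([0.0, 1.0, 1.0])
    V = np.vander(x, 3, increasing=True)   # columns: 1, x, x^2
    w0, w1, w2 = np.linalg.solve(V, y)
    # Expected: w0 = 0, w1 = 1 + 1/eps, w2 = -1/eps
    print(f"eps={eps}: w0={w0:.4g}, w1={w1:.4g}, w2={w2:.4g}")
```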

In order to counteract this effect we can introduce a regularization term $R(\mathbf{w})$ (which is zero for $\mathbf{w}=\mathbf{0}$) and construct a regularized loss function $E_\text{reg}(\mathbf{w}) = E(\mathbf{w}) + \lambda R(\mathbf{w})$. For $\lambda \to 0$ we recover the original unregularized loss function $E(\mathbf{w})$. For $\lambda \to \infty$ the regularized loss is dominated by the regularization term, which is minimized for $\mathbf{w}=\mathbf{0}$; so infinite regularization will certainly prevent your model from overfitting, at the cost of a useless constant model. The goal of regularization is to determine an optimal $\lambda_\text{optimal}$ that prevents the model from overfitting to the training data (i.e., prevents large weights) while still allowing it to generalize to test data.
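To make the role of $\lambda$ concrete, the sketch below minimizes the ridge-regularized squared loss $E_\text{reg}(\mathbf{w}) = \lVert V\mathbf{w} - \mathbf{y}\rVert^2 + \lambda \lVert\mathbf{w}\rVert^2$ on the same three points in closed form; the particular $\lambda$ values are only illustrative:

```python
import numpy as np

eps = 0.01
x = np.array([0.0, eps, 1.0])
y = np.array([0.0, 1.0, 1.0])
V = np.vander(x, 3, increasing=True)           # design matrix: 1, x, x^2

for lam in [0.0, 0.01, 1.0, 100.0]:
    # Closed-form ridge solution: w = (V^T V + lam * I)^-1 V^T y
    w = np.linalg.solve(V.T @ V + lam * np.eye(3), V.T @ y)
    print(f"lambda={lam:>6}: w = {np.round(w, 3)}")

# lambda -> 0 recovers the huge interpolating coefficients,
# lambda -> infinity drives all coefficients toward zero (a constant model).
```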

(vivek) As far as I know, only L1 has the impact of reducing the coefficients of less effective features, not L2.

Both $\mathcal{L}_1$ and $\mathcal{L}_2$ regularization will reduce the influence of less important features by shrinking the associated weights. $\mathcal{L}_1$ regularization has the ability to set some coefficients exactly to $0$, whereas $\mathcal{L}_2$ will in general lead to small weight magnitudes but not exactly $0$.
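A small scikit-learn sketch of that difference on synthetic data where only the first two of ten features matter (the data and the `alpha` value are assumptions made for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 5.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.5).fit(X, y)   # L2 penalty

# L1 typically zeroes out the irrelevant coefficients exactly,
# while L2 only shrinks them toward small nonzero values.
print("Lasso:", np.round(lasso.coef_, 3))
print("Ridge:", np.round(ridge.coef_, 3))
```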


Please refer to this article on L1 and L2 regularization: https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261

L1 (called Lasso) and L2 (called Ridge) essentially restrain the learning process of gradient descent (the loss reduction) in an attempt to reduce overfitting. As far as I know, only L1 has the impact of reducing the coefficients of less effective features, not L2.


By using regularization and shrinking the parameters, we reduce the sample variance of the estimates and thereby reduce the tendency to fit random noise, which is exactly what we wish to avoid. We can't increase the slope, because we want to reduce overfitting, not amplify it.

L2 doesn’t necessarily reduce the number of features, but rather reduces the magnitude/impact that each feature has on the model by reducing its coefficient value.

Shrinking has a positive effect when we have overestimated a coefficient and a negative effect when we have underestimated it. But we are not shrinking every coefficient equally; we shrink by a factor that is larger when the estimate is further away from zero.

Shrinking all the slopes towards zero will make some of them more accurate and some of them less accurate, but you can see how it would make them collectively more accurate.
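As a rough numerical illustration of that variance reduction (synthetic data and an `alpha=5.0` chosen only as an assumption), the sketch below refits ordinary least squares and ridge regression on many noisy resamples and compares the spread of the estimated slopes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
true_w = np.array([1.0, 0.5, 0.0, -0.5, 2.0])

ols_slopes, ridge_slopes = [], []
for _ in range(500):
    y = X @ true_w + rng.normal(scale=2.0, size=20)   # a fresh noisy sample
    ols_slopes.append(LinearRegression().fit(X, y).coef_)
    ridge_slopes.append(Ridge(alpha=5.0).fit(X, y).coef_)

# Shrinkage lowers the variance of the slope estimates (at the cost of some bias),
# which is what reduces the tendency to fit the random noise.
print("OLS   slope std per feature:", np.round(np.std(ols_slopes, axis=0), 3))
print("Ridge slope std per feature:", np.round(np.std(ridge_slopes, axis=0), 3))
```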
