Why does non-differentiable regularization lead to setting coefficients to 0?

L2 regularization shrinks the values in the parameter vector toward zero. L1 regularization sets some coefficients in the parameter vector exactly to 0.

More generally, I've seen that non-differentiable regularization functions lead to setting coefficients to 0 in the parameter vector. Why is that the case?

ISLR discusses this topic in detail; it can be understood by looking at the contours of the error and constraint functions shown in the figure below:

[Figure: contours of the error and constraint functions for the lasso (left) and ridge regression (right)]

"Each of the ellipses centered around β̂ represents a contour: this means that all of the points on a particular ellipse have the same RSS value. As the ellipses expand away from the least squares coefficient estimates, the RSS increases. Equations (6.8) and (6.9) indicate that the lasso and ridge regression coefficient estimates are given by the first point at which an ellipse contacts the constraint region. Since ridge regression has a circular constraint with no sharp points, this intersection will not generally occur on an axis, and so the ridge regression coefficient estimates will be exclusively non-zero. However, the lasso constraint has corners at each of the axes, and so the ellipse will often intersect the constraint region at an axis. When this occurs, one of the coefficients will equal zero. In higher dimensions, many of the coefficient estimates may equal zero simultaneously."
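To see this geometry play out numerically, here is a minimal sketch (assuming scikit-learn; the synthetic data and the alpha value are my own choices) that fits Ridge and Lasso on the same data and counts the exactly-zero coefficients:

```python
# Sketch: compare Ridge (L2) and Lasso (L1) fits on the same synthetic data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 10 features, only 3 of which actually influence y
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=10.0).fit(X, y)

print("Ridge coefficients exactly 0:", np.sum(ridge.coef_ == 0.0))
print("Lasso coefficients exactly 0:", np.sum(lasso.coef_ == 0.0))
```

Ridge typically leaves every coefficient small but nonzero, while Lasso drives most of the uninformative ones to exactly zero.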


Look at the penalty terms in Ridge and Lasso linear regression:

Ridge (L2):

$$\text{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$

Lasso (L1):

$$\text{RSS} + \lambda \sum_{j=1}^{p} \lvert\beta_j\rvert$$

Note the absolute value (L1 norm) in the Lasso penalty compared to the squared value (L2 norm) in the Ridge penalty.
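A one-dimensional version of the problem makes the effect of that absolute value explicit. Consider a single coefficient β with least squares value z and penalty weight λ (an illustration of my own, not from ISLR):

$$\arg\min_{\beta}\,(\beta - z)^2 + \lambda \beta^2 = \frac{z}{1+\lambda}, \qquad \arg\min_{\beta}\,(\beta - z)^2 + \lambda\,\lvert\beta\rvert = \operatorname{sign}(z)\,\max\!\left(\lvert z\rvert - \frac{\lambda}{2},\, 0\right).$$

The ridge solution merely rescales z and is nonzero whenever z is. The lasso solution (soft-thresholding) is exactly 0 for every z with |z| ≤ λ/2: because |β| is not differentiable at 0, the minimum can sit at the kink for a whole range of data, which is the algebraic counterpart of the corner in the constraint region above.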

In Introduction to Statistical Learning (Ch. 6.2.2) it reads: "As with ridge regression, the lasso shrinks the coefficient estimates towards zero. However, in the case of the lasso, the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large. Hence, much like best subset selection, the lasso performs variable selection."

http://www-bcf.usc.edu/~gareth/ISL/
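To see the same mechanism inside an optimizer, here is a minimal sketch of coordinate descent for the lasso (plain NumPy; the objective ||y − Xβ||²/(2n) + λ‖β‖₁, the function names, and the iteration count are my own assumptions). Each coordinate update is the soft-thresholding operator, which snaps small coefficients to exactly zero:

```python
import numpy as np

def soft_threshold(rho, lam):
    # Closed-form minimizer of 0.5*(b - rho)^2 + lam*|b|:
    # exactly 0 whenever |rho| <= lam.
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Coordinate descent for ||y - X @ beta||^2 / (2n) + lam * ||beta||_1.
    # Assumes no column of X is all zeros.
    n, p = X.shape
    beta = np.zeros(p)
    r = y.astype(float).copy()                    # residual y - X @ beta (beta = 0)
    for _ in range(n_iter):
        for j in range(p):
            x_j = X[:, j]
            rho = x_j @ (r + x_j * beta[j]) / n   # fit of x_j to the partial residual
            z = (x_j @ x_j) / n
            b_new = soft_threshold(rho, lam) / z
            r += x_j * (beta[j] - b_new)          # keep the residual in sync
            beta[j] = b_new
    return beta
```

The subgradient of λ|β_j| at 0 is the whole interval [−λ, λ], so β_j = 0 is optimal for every ρ with |ρ| ≤ λ, not just for ρ = 0. With a differentiable ridge penalty (λ/2)β_j², the analogous update is β_j = ρ/(z + λ): shrunk toward zero, but exactly zero only when ρ = 0.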
