Why does non-differentiable regularization lead to setting coefficients to 0?

L2 regularization shrinks the values in the parameter vector toward zero. L1 regularization sets some coefficients in the parameter vector exactly to 0.

More generally, I've seen that non-differentiable regularization functions lead to setting coefficients to 0 in the parameter vector. Why is that the case?

ISLR discusses this topic in detail; it can be understood by looking at the contours of the error and constraint functions shown in the figure below:

[Figure: contours of the error and constraint functions for the lasso (left) and ridge regression (right)]

"Each of the ellipses centered around β̂ represents a contour: this means that all of the points on a particular ellipse have the same RSS value. As the ellipses expand away from the least squares coefficient estimates, the RSS increases. Equations (6.8) and (6.9) indicate that the lasso and ridge regression coefficient estimates are given by the first point at which an ellipse contacts the constraint region. Since ridge regression has a circular constraint with no sharp points, this intersection will not generally occur on an axis, and so the ridge regression coefficient estimates will be exclusively non-zero. However, the lasso constraint has corners at each of the axes, and so the ellipse will often intersect the constraint region at an axis. When this occurs, one of the coefficients will equal zero. In higher dimensions, many of the coefficient estimates may equal zero simultaneously."
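To see this geometry play out numerically, here is a minimal sketch (assuming scikit-learn; the synthetic data and the alpha value are my own choices) that fits Ridge and Lasso on the same data and counts the exactly-zero coefficients:

```python
# Sketch: compare Ridge (L2) and Lasso (L1) fits on the same synthetic data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 10 features, only 3 of which actually influence y
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=10.0).fit(X, y)

print("Ridge coefficients exactly 0:", np.sum(ridge.coef_ == 0.0))
print("Lasso coefficients exactly 0:", np.sum(lasso.coef_ == 0.0))
```

Ridge typically leaves every coefficient small but nonzero, while Lasso drives most of the uninformative ones to exactly zero.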


Look at the penalty terms in Ridge and Lasso linear regression:

Ridge (L2):

$$\text{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$

Lasso (L1):

$$\text{RSS} + \lambda \sum_{j=1}^{p} \lvert\beta_j\rvert$$

Note the absolute value (L1 norm) in the Lasso penalty compared to the squared value (L2 norm) in the Ridge penalty.
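A one-dimensional version of the problem makes the effect of that absolute value explicit. Consider a single coefficient β with least squares value z and penalty weight λ (an illustration of my own, not from ISLR):

$$\arg\min_{\beta}\,(\beta - z)^2 + \lambda \beta^2 = \frac{z}{1+\lambda}, \qquad \arg\min_{\beta}\,(\beta - z)^2 + \lambda\,\lvert\beta\rvert = \operatorname{sign}(z)\,\max\!\left(\lvert z\rvert - \frac{\lambda}{2},\, 0\right).$$

The ridge solution merely rescales z and is nonzero whenever z is. The lasso solution (soft-thresholding) is exactly 0 for every z with |z| ≤ λ/2: because |β| is not differentiable at 0, the minimum can sit at the kink for a whole range of data, which is the algebraic counterpart of the corner in the constraint region above.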

In Introduction to Statistical Learning (Ch. 6.2.2) it reads: "As with ridge regression, the lasso shrinks the coefficient estimates towards zero. However, in the case of the lasso, the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large. Hence, much like best subset selection, the lasso performs variable selection."

http://www-bcf.usc.edu/~gareth/ISL/
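To see the same mechanism inside an optimizer, here is a minimal sketch of coordinate descent for the lasso (plain NumPy; the objective ||y − Xβ||²/(2n) + λ‖β‖₁, the function names, and the iteration count are my own assumptions). Each coordinate update is the soft-thresholding operator, which snaps small coefficients to exactly zero:

```python
import numpy as np

def soft_threshold(rho, lam):
    # Closed-form minimizer of 0.5*(b - rho)^2 + lam*|b|:
    # exactly 0 whenever |rho| <= lam.
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Coordinate descent for ||y - X @ beta||^2 / (2n) + lam * ||beta||_1.
    # Assumes no column of X is all zeros.
    n, p = X.shape
    beta = np.zeros(p)
    r = y.astype(float).copy()                    # residual y - X @ beta (beta = 0)
    for _ in range(n_iter):
        for j in range(p):
            x_j = X[:, j]
            rho = x_j @ (r + x_j * beta[j]) / n   # fit of x_j to the partial residual
            z = (x_j @ x_j) / n
            b_new = soft_threshold(rho, lam) / z
            r += x_j * (beta[j] - b_new)          # keep the residual in sync
            beta[j] = b_new
    return beta
```

The subgradient of λ|β_j| at 0 is the whole interval [−λ, λ], so β_j = 0 is optimal for every ρ with |ρ| ≤ λ, not just for ρ = 0. With a differentiable ridge penalty (λ/2)β_j², the analogous update is β_j = ρ/(z + λ): shrunk toward zero, but exactly zero only when ρ = 0.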
