Why is L1 regularization rarely used compared to L2 regularization in Deep Learning?

L1 regularization promotes sparsity, so unimportant weights are driven toward (or exactly to) 0. In Deep Learning models, the input usually consists of thousands or millions of features/pixels, and the network usually contains millions or even billions of weights.
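For concreteness, writing the training loss as $\mathcal{L}(w)$ and the regularization strength as $\lambda$ (just notation for this question), the two penalized objectives are:

$$
\mathcal{L}_{L1}(w) = \mathcal{L}(w) + \lambda \sum_i |w_i|,
\qquad
\mathcal{L}_{L2}(w) = \mathcal{L}(w) + \frac{\lambda}{2} \sum_i w_i^2 .
$$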

Intuitively and theoretically, such feature selection should be very helpful for reducing overfitting in Deep Learning models: not all features/weights are important, so selecting the important ones from millions of weights reduces the complexity of the learned function. This in turn reduces the chance of memorizing the training set and forces the network to learn from the important features rather than from useless relations.

However, from the Deep Learning papers I have been reading, mostly in Computer Vision, L1 seems to be rarely used in the famous/strong/SOTA architectures and algorithms, except in pruning-related work. In contrast, L2 is used in most of them (through weight decay, which is equivalent to L2 regularization when using plain SGD).
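For concreteness, here is how the two are typically applied in practice, as a minimal PyTorch-style sketch (the model, data, and regularization strengths below are made up purely for illustration): L2 usually comes through the optimizer's `weight_decay` argument, while L1 has to be added to the loss by hand.

```python
import torch
import torch.nn as nn

# Toy model and data, purely for illustration
model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
criterion = nn.MSELoss()

# L2 via weight decay: for plain SGD, subtracting lr * weight_decay * w each step
# is the same as adding (weight_decay / 2) * ||w||^2 to the loss.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

# L1 has no built-in optimizer hook, so it is typically added to the loss manually.
lambda_l1 = 1e-4  # illustrative value

optimizer.zero_grad()
loss = criterion(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())
(loss + lambda_l1 * l1_penalty).backward()
optimizer.step()
```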

Is there any reason behind this?

Topic regularization deep-learning neural-network machine-learning

Category Data Science


The derivative of the $L1$ regularization term, which is needed during backprop, is more computationally expensive than the derivative of the $L2$ term.
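For reference, the per-weight gradient contributions of the two penalties (using the notation above) are:

$$
\frac{\partial}{\partial w_i}\,\lambda |w_i| = \lambda \,\operatorname{sign}(w_i) \quad (\text{undefined at } w_i = 0),
\qquad
\frac{\partial}{\partial w_i}\,\frac{\lambda}{2} w_i^2 = \lambda w_i .
$$

The $L1$ term is also non-differentiable at zero, so gradient-based training has to fall back on a subgradient or a proximal/soft-thresholding step there.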

Also, $L1$ regularization leads to a sparse weight vector, which is not desirable in most cases.
