Why is L1 regularization rarely used compared to L2 regularization in Deep Learning?

L1 regularization promotes sparsity, so unimportant weights are driven toward (or exactly to) 0. In Deep Learning models, the input usually consists of thousands or millions of features/pixels, and the network usually contains millions or even billions of weights.
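For concreteness, writing the training loss as $\mathcal{L}(w)$ and the regularization strength as $\lambda$ (just notation for this question), the two penalized objectives are:

$$
\mathcal{L}_{L1}(w) = \mathcal{L}(w) + \lambda \sum_i |w_i|,
\qquad
\mathcal{L}_{L2}(w) = \mathcal{L}(w) + \frac{\lambda}{2} \sum_i w_i^2 .
$$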

Intuitively and theoretically, such feature selection should be very helpful for reducing overfitting in Deep Learning models: not all features/weights are important, so selecting the important ones from millions of weights reduces the complexity of the learned function. This in turn reduces the chance of memorizing the training set and forces the network to learn from the important features rather than from useless relations.

However, from the Deep Learning papers I have been reading, mostly in Computer Vision, L1 seems to be rarely used in the famous/strong/SOTA architectures and algorithms, except in pruning-related work. In contrast, L2 is used in most of them (through weight decay, which is equivalent to L2 regularization when using plain SGD).
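For concreteness, here is how the two are typically applied in practice, as a minimal PyTorch-style sketch (the model, data, and regularization strengths below are made up purely for illustration): L2 usually comes through the optimizer's `weight_decay` argument, while L1 has to be added to the loss by hand.

```python
import torch
import torch.nn as nn

# Toy model and data, purely for illustration
model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
criterion = nn.MSELoss()

# L2 via weight decay: for plain SGD, subtracting lr * weight_decay * w each step
# is the same as adding (weight_decay / 2) * ||w||^2 to the loss.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

# L1 has no built-in optimizer hook, so it is typically added to the loss manually.
lambda_l1 = 1e-4  # illustrative value

optimizer.zero_grad()
loss = criterion(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())
(loss + lambda_l1 * l1_penalty).backward()
optimizer.step()
```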

Is there any reason behind this?

Topic regularization deep-learning neural-network machine-learning

Category Data Science


The derivative of the $L1$ regularization term, which is needed during backprop, is more computationally expensive than the derivative of the $L2$ term.
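For reference, the per-weight gradient contributions of the two penalties (using the notation above) are:

$$
\frac{\partial}{\partial w_i}\,\lambda |w_i| = \lambda \,\operatorname{sign}(w_i) \quad (\text{undefined at } w_i = 0),
\qquad
\frac{\partial}{\partial w_i}\,\frac{\lambda}{2} w_i^2 = \lambda w_i .
$$

The $L1$ term is also non-differentiable at zero, so gradient-based training has to fall back on a subgradient or a proximal/soft-thresholding step there.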

Also, $L1$ regularization leads to a sparse weight vector, which is not desirable in most cases.
