Confusion with L2 Regularization in Back-propagation

In very simple language, this is L2 regularization: $Loss_R = Loss_N + \sum w_i^2$, where $Loss_N$ is the loss without regularization and $Loss_R$ is the loss with regularization. When implementing it [Ref], we simply add the derivative of the new penalty to the current delta weight: $dw = dw_N + constant \cdot w$, where $dw_N$ is the weight delta without regularization. What I think: L2 regularization is achieved with the last step only, i.e. the weight is penalized. My question is: why do we then add …
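A minimal sketch of the update being described (plain gradient descent; `eta` and `lam` are illustrative names, and the "constant" above is $2$ for the penalty as written, or $2\lambda$ if the penalty is scaled by $\lambda$):

```python
import numpy as np

def sgd_step_l2(w, dw_n, eta=0.1, lam=0.01):
    """One gradient-descent step with an L2 penalty lam * sum(w**2).

    dw_n is the gradient of the unregularized loss; the penalty's
    derivative 2 * lam * w is simply added to it before the update.
    """
    dw = dw_n + 2 * lam * w   # derivative of lam * w^2 added to the plain gradient
    return w - eta * dw       # the weight is pulled toward zero ("weight decay")
```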
Category: Data Science

Importing Excel-format data into R/RStudio and using the glmnet package?

I have no problem importing Excel-formatted data into R/RStudio and using all the other R packages I work with. But when I want to use the glmnet package to develop a regularization model, I invariably run into the following error (after specifying my regularization model and attempting to run it): Error in storage.mode(y) <- "double": (list) object cannot be coerced to type 'double' Here is what I have already tried to resolve this: De-format the numbers in Excel (no …
Category: Data Science

Regularizing the intercept - particular case

Yesterday I posted this thread Regularizing the intercept where I had a question about penalizing the intercept. In short, I asked whether there exist cases where penalizing the intercept leads to a lower expected prediction error, and the answer was: Of course there exist scenarios where it makes sense to penalize the intercept, if that aligns with domain knowledge. However, in the real world, more often we do not just penalize the magnitude of the intercept, but enforce it to be zero. …
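For concreteness, a small sketch of the two practical options mentioned in the answer (using scikit-learn's Ridge, whose intercept is fitted but never penalized by alpha; fit_intercept=False is the "enforce it to be zero" case):

```python
import numpy as np
from sklearn.linear_model import Ridge

X = np.random.randn(100, 3)
y = 2.0 + X @ np.array([1.0, -0.5, 0.3]) + 0.1 * np.random.randn(100)

# Default: the intercept is fitted freely and left out of the penalty.
ridge_free = Ridge(alpha=1.0).fit(X, y)

# "Enforce it to be zero": no intercept is fitted at all.
ridge_zero = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)

print(ridge_free.intercept_, ridge_zero.intercept_)  # roughly 2.0 vs exactly 0.0
```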
Category: Data Science

Correct theoretical regularized objective function for XGB/LGBM (regression task)

I am writing an academic paper on the application of Machine Learning methods to Time Series Forecasting and I am unsure about how to write down the theoretical part about the regularized objective function for XGBoost. Below you can find the equation given by the developers of the XGB algorithm for the regularized objective function (equation 2). The paper is called "XGBoost: A Scalable Tree Boosting System" by Chen & Guestrin (2016). In the Python API from the xgb library …
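The equation in question, (2) in Chen & Guestrin (2016), defines the regularized objective as

$$\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2,$$

where $l$ is a differentiable convex loss between the prediction $\hat{y}_i$ and the target $y_i$, the $f_k$ are the individual trees, $T$ is the number of leaves of a tree, and $w$ is its vector of leaf weights.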
Category: Data Science

Regularizing the intercept

I am reading The Elements of Statistical Learning, and regarding regularized logistic regression it says: "As with the lasso, we typically do not penalize the intercept term", and I am wondering in which situations you would penalize the intercept. Looking at regularization in general, couldn't one think of scenarios where penalizing the intercept would lead to a better EPE (expected prediction error)? Although we increase the bias, wouldn't we in some scenarios still reduce the EPE? EDIT It might be …
Category: Data Science

Custom regularisation for logistic regression

My understanding of l2 regularisation: the weights of the model are assumed to have a Gaussian prior distribution centered around 0. The MAP estimate over the data then adds an extra penalty to the cost function. My problem statement: I am making a reasonable assumption (based on domain knowledge) that my features are independent, which means I can use the weights of the features to infer the importance of features in influencing Y. From domain knowledge, I want to assume priors about the ratio of …
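The first paragraph can be made explicit: assuming an i.i.d. Gaussian prior $w_i \sim \mathcal{N}(0,\sigma^2)$ on the weights, the MAP estimate is

$$\hat{w} = \arg\max_{w}\;\big[\log p(D \mid w) + \log p(w)\big] = \arg\min_{w}\;\Big[-\log p(D \mid w) + \tfrac{1}{2\sigma^2}\lVert w\rVert_2^2\Big],$$

i.e. the usual cost plus an L2 penalty with $\lambda = 1/(2\sigma^2)$; a different prior (for instance one expressing a belief about a ratio of weights) would simply swap in a different penalty term.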
Category: Data Science

Why is my loss blowing up after adding regularization

I tried to add L2 regularization to a network class I wrote; however, when I train it, the loss blows up even though accuracy also increases. Can someone explain where I am going wrong? (I am using the formulas from here.) The update to the mini-batch (the (1-eta*(lmbda/n)) coefficient on w is what I added): def update_mini_batch(self, mini_batch, eta, lmbda, n): # n is the number of training samples being trained from # Turn the mini_batch with one dimensional samples into …
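Based on the formulas the question links to (Nielsen's notation), a minimal sketch of the intended update for a single weight matrix; the names are illustrative:

```python
def l2_weight_update(w, nabla_w, eta, lmbda, n, m):
    """One L2-regularized update for a single weight matrix.

    w        -- current weights
    nabla_w  -- gradient of the unregularized cost, summed over the mini-batch
    eta      -- learning rate, lmbda -- regularization strength
    n        -- total number of training samples, m -- mini-batch size
    """
    # Note the two different denominators: the weight-decay factor uses the
    # full training-set size n, while the gradient term uses the mini-batch
    # size m. Mixing them up is a common source of a misbehaving loss.
    return (1 - eta * lmbda / n) * w - (eta / m) * nabla_w
```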
Category: Data Science

Version of Perceptron

If we change the $ywx<0$ condition (for performing update) to $ywx<1$ like in SVM (but without adding regularization to maximize the margin), is there any difference from the basic perceptron (the one with the aforementioned $ywx<0$ condition)?
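A minimal sketch of the two variants (illustrative helper; rows of X are feature vectors, labels y are in {-1, +1}, eta is the step size):

```python
import numpy as np

def train_perceptron(X, y, margin=0.0, eta=1.0, epochs=10):
    """Perceptron-style training: update whenever y * (w . x) < margin.

    margin=0.0 is the classic perceptron; margin=1.0 is the SVM-like
    condition from the question (with no regularization term added).
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) < margin:   # mistake or insufficient margin
                w += eta * yi * xi            # same update rule in both variants
    return w
```

With margin=1.0, correctly classified points that fall inside the margin still trigger updates, so the two variants can end up with different weight vectors even on the same separable data.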
Category: Data Science

Why is l1 regularization rarely used compared to l2 regularization in Deep Learning?

l1 regularization increases sparsity, so unimportant weights are decreased closer to 0. In Deep Learning models, the input usually consists of thousands or millions of features/pixels, and the network usually contains millions to even billions of weights. Intuitively and theoretically, such feature selection should be very helpful in Deep Learning models to reduce overfitting, since not all features/weights are important; selecting the important ones from millions of weights reduces the function complexity, which therefore reduces the possibility of "memorizing" the …
Category: Data Science

Regularization for intercept parameter

Why is the regularization parameter not applied to the intercept parameter? From what I have read about the cost functions for Linear and Logistic regression, the regularization parameter (λ) is applied to all terms except the intercept. For example, here are the cost functions for linear and logistic regression respectively (Notice that j starts from 1):
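The cost functions referred to are presumably the standard regularized forms (Ng-style notation, with $\theta_0$ the unpenalized intercept and $j$ running from 1 to $n$):

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log h_\theta(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$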
Category: Data Science

Regularization hyperparam tuning during training

I have an idea for a regularization-hyperparam selection method, which I haven't encountered before and can't find on Google, but I'm sure someone has already tried it, and I'm wondering what the best practices are. The most common method for hyperparam selection is to select different hyperparams (e.g. some value for L2 regularization), train NNs with them, test the NNs on some validation set, and select the best one. My idea is to train a single NN and …
Category: Data Science

Is it possible to explain why Lasso models eliminated certain coefficient?

Is it possible to understand why Lasso models eliminated specific coefficients? During the modelling, many of the highly correlated features in the data are being eliminated by Lasso regression. Is it possible to tell why precisely these features are being eliminated from the model? (Is it the presence of any other features, multicollinearity, etc.?) I want to explain the lasso model behaviour. Your help is highly appreciated.
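A small illustrative sketch of the phenomenon (synthetic data; with two nearly identical features, Lasso typically keeps one and zeroes out the other, and which one survives can flip with tiny changes in the data or alpha):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)    # almost an exact copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=200)

coef = Lasso(alpha=0.1).fit(X, y).coef_
print(coef)   # typically one coefficient near 3, the other exactly 0
```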
Category: Data Science

Request: Confirmation on my understanding of overfitting and regularization concepts

Overfitted models tend to have largely different (some very high, some comparatively low) coefficients/weights for different feature values. So this means the model (when drawn as a graph) will have high variation in slopes, and even a small change in a training data value (feature value) can lead to a large change in output. To smooth the overfitted model/curve that has high slope variation, we use regularization (example: L1/L2). L1 regularization removes unnecessary/less influential features from the model, making the model less complex. …
Category: Data Science

L1 regularization to first layer or all the layers

I have lots of features in the input to a Fully Connected Neural Network (FCNN) and was thinking of adding L1 regularization so that only the most relevant features are selected. I found how to add it following this link, and added it to the weights of the first layer (my FCNN is 4 layers deep). However, when I manually check the weights, all of them are now super small (<1e-4) and none of them are zero as I expected (that's why I …
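A minimal sketch of one way to do this (PyTorch-style; layer sizes and l1_lambda are illustrative). Note that a plain (sub)gradient step on an L1 penalty shrinks weights toward zero but rarely makes them exactly zero, which is consistent with the tiny-but-nonzero weights described; exact zeros usually require proximal updates or explicit thresholding:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 64), nn.ReLU(),   # first layer: the weights we want sparse
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
l1_lambda = 1e-3

def training_step(x, y):
    pred = model(x)
    loss = nn.functional.mse_loss(pred, y)
    # L1 penalty on the first layer's weights only
    loss = loss + l1_lambda * model[0].weight.abs().sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```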
Category: Data Science

What exactly is activity sparsity and why is it beneficial?

I have been reading about weight sparsity and activity sparsity with regard to convolutional neural networks. Weight sparsity I understood as having more trainable weights being exactly zero, which would essentially mean having fewer connections, allowing for a smaller memory footprint and quicker inference on test data. Additionally, it would help against overfitting (which I understand in terms of smaller weights leading to simpler models/Ockham's razor). From what I understand now, activity sparsity is analogous in that it would lead …
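For concreteness, a minimal sketch of how an activity (i.e. activation) penalty is commonly added, here via Keras's activity_regularizer (layer sizes are illustrative): the L1 term is applied to the layer's outputs rather than its weights, pushing many activations toward zero for each input.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Penalize the layer's *outputs* (activations), not its weights.
    tf.keras.layers.Dense(
        128, activation="relu",
        activity_regularizer=tf.keras.regularizers.L1(1e-4),
        input_shape=(784,),
    ),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```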
Category: Data Science

What is the intuition behind decreasing the slope when using regularization?

While training a logistic regression model, using regularization can help distribute weights and avoid reliance on some particular weight, making the model more robust. E.g., suppose my input vector is 4-dimensional. The input values are [1,1,1,1]. The output can be 1 if my weight matrix has values [1,0,0,0] or [0.25,0.25,0.25,0.25]. The L2 norm would favour the latter weight matrix (because pow(1, 2) > 4*pow(0.25, 2)). I understand intuitively why l2 regularization can be beneficial here. But in case of linear …
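Working the example out: the squared L2 penalty of $[1,0,0,0]$ is $1^2 = 1$, while that of $[0.25,0.25,0.25,0.25]$ is $4 \times 0.25^2 = 0.25$; since both weight vectors produce the same output on the input $[1,1,1,1]$ ($1\cdot1 = 4\cdot0.25 = 1$), the L2 penalty prefers the spread-out weights.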
Category: Data Science

Convolutional Neural Network overfitting

I built a CNN to learn to classify EEG data (only about 4000 training examples, 2 classes, 50-50 class balance). Each training example is 64x512, with 5 channels each. I've tried to keep the network as simple/small as possible for testing: ConvLayer (4 filters) → MaxPool → Dropout 50% → Fully connected (50 neurons) → Dropout 50% → Softmax. I'm also using weight decay (L2 reg, lambda = 0.001). The problem is that no matter how I play with the filter parameters (size, stride, number), my …
Category: Data Science

Problems with Graphical Lasso

I'm trying to use the Graphical Lasso algorithm (more specifically the R package glasso) to find an estimated graph representing the connections between a set of nodes by estimating a precision matrix. I have a feature matrix containing the values of multiple features for each of the nodes, and the sample covariance matrix obtained from the product between this matrix and its transpose is used as the input for the glasso function, along with the l1 regularization coefficient $\lambda$. However, …
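The question concerns the R glasso package; purely as an illustrative sketch of the same estimator, here is the analogous call in Python via scikit-learn's GraphicalLasso, which takes a data matrix with observations in rows and the variables/nodes in columns and applies the l1 penalty alpha to the precision matrix:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 observations of 10 variables (nodes)

model = GraphicalLasso(alpha=0.1).fit(X)  # alpha plays the role of lambda
precision = model.precision_              # sparse estimated precision matrix
edges = np.abs(precision) > 1e-8          # nonzero off-diagonal entries = graph edges
```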
Category: Data Science
