Confusion with L2 Regularization in Back-propagation

In very simple language, this is L2 regularization: $Loss_R = Loss_N + \sum_i w_i^2$, where $Loss_N$ is the loss without regularization and $Loss_R$ is the loss with regularization. When implementing [Ref], we simply add the derivative of the new penalty to the current weight delta: $dw = dw_N + constant \cdot w$, where $dw_N$ is the weight delta without regularization. What I think: L2 regularization is achieved with the last step only, i.e. the weight is penalized. My question is: why do we then add …
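A minimal numpy sketch (hypothetical single-layer model, squared-error loss) of how the two formulas in the question relate: the penalty is added to the loss once, and the extra weight term in the update is just that penalty's derivative, $2\lambda w$ (the factor 2 is usually folded into the constant).

```python
import numpy as np

# Hypothetical example: y_hat = X @ w, squared-error loss plus an L2 penalty.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))
w = rng.normal(size=(3, 1))
lam = 0.1  # regularization strength (the "constant" in the question)

y_hat = X @ w
loss_N = np.mean((y_hat - y) ** 2)          # loss without regularization
loss_R = loss_N + lam * np.sum(w ** 2)      # loss with the L2 penalty added

# Differentiating the penalty lam * sum(w^2) w.r.t. w gives 2 * lam * w,
# which is why the weight itself appears in the update rule.
dw_N = 2 * X.T @ (y_hat - y) / len(X)       # gradient of the unregularized loss
dw_R = dw_N + 2 * lam * w                   # gradient of the regularized loss

lr = 0.01
w = w - lr * dw_R                           # the penalty acts only through this extra term
```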
Category: Data Science

Calculating gradient descent

When using mini-batch gradient descent, we perform backpropagation after each batch, i.e. we calculate the gradient after each batch. We also capture y-hat for each sample in the batch and finally calculate the loss function over the whole batch; we then use this loss to calculate the gradient, correct? Now, as the chain rule states, we calculate the gradient this way for the neural network below: the question is, if we calculate the gradient after passing …
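For the first part of the question, a small numpy sketch of one mini-batch step (a hypothetical linear model is used here instead of the network in the figure): predictions are computed for every sample, the loss is averaged over the batch, and backpropagation runs once per batch on that averaged loss.

```python
import numpy as np

# Hypothetical linear model y_hat = X @ w; one mini-batch update step.
rng = np.random.default_rng(1)
X_batch = rng.normal(size=(32, 4))   # one mini-batch of 32 samples
y_batch = rng.normal(size=(32, 1))
w = rng.normal(size=(4, 1))
lr = 0.01

# Forward pass: y_hat is computed for every sample in the batch.
y_hat = X_batch @ w

# The loss is averaged over the whole batch...
batch_loss = np.mean((y_hat - y_batch) ** 2)

# ...and backpropagation is run once per batch on that averaged loss,
# producing a single gradient that is used for a single weight update.
grad_w = 2 * X_batch.T @ (y_hat - y_batch) / len(X_batch)
w = w - lr * grad_w
```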
Category: Data Science

Is the Cross Entropy Loss important at all, because at Backpropagation only the Softmax probability and the one hot vector are relevant?

Is the Cross Entropy Loss (CEL) important at all, given that in Backpropagation (BP) only the Softmax (SM) probability and the one-hot vector are relevant? When applying BP, the derivative of the CEL is the difference between the output probability (SM) and the one-hot encoded vector. To me, the CEL output, which is very sophisticated, does not seem to play any role in learning. I'm expecting there is a fallacy in my reasoning, so could somebody please help me out?
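A small numpy check that may help untangle this: the scalar CEL value is only used for monitoring, but the simple form of the gradient, $p - y$, is itself the derivative *of the cross entropy*; perturbing the logits and re-evaluating the loss recovers exactly that expression.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])       # logits
y = np.array([0.0, 1.0, 0.0])        # one-hot target

p = softmax(z)
cel = -np.sum(y * np.log(p))         # the scalar CE value (used for monitoring)

# Analytic gradient of CE(softmax(z), y) w.r.t. the logits: p - y.
grad_analytic = p - y

# Numerical check: perturb each logit and re-evaluate the *loss* itself,
# which shows the CE function is still what defines the gradient.
eps = 1e-6
grad_numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    grad_numeric[i] = (-np.sum(y * np.log(softmax(zp)))
                       + np.sum(y * np.log(softmax(zm)))) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```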
Category: Data Science

Backtracking filter coefficients of Convolutional Neural Networks

I'm starting to learn how convolutional neural networks work, and I have a question regarding the filters. Apparently, these are randomly generated when the model is created, and then, as the data is fed in, they are corrected just like the weights are in backtracking. However, how does this work for filters? To my understanding, backtracking works by calculating how much a given weight contributed to the total error after an output has been predicted, and then correcting it accordingly. I've …
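A short PyTorch sketch (toy shapes, squared-error loss as a stand-in for a real objective) illustrating that filters are ordinary trainable weights: backpropagation fills in a gradient for every filter coefficient, and the optimizer step moves them exactly as it would dense-layer weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(in_channels=1, out_channels=2, kernel_size=3)  # randomly initialized filters
opt = torch.optim.SGD(conv.parameters(), lr=0.1)

x = torch.randn(1, 1, 8, 8)          # dummy image
target = torch.randn(1, 2, 6, 6)     # dummy target feature map

out = conv(x)
loss = ((out - target) ** 2).mean()
loss.backward()                       # backprop: every filter entry gets its own gradient

print(conv.weight.grad.shape)         # torch.Size([2, 1, 3, 3]) - one gradient per filter value
before = conv.weight.detach().clone()
opt.step()                            # filters are corrected just like ordinary weights
print(torch.allclose(before, conv.weight))  # False - the filter values moved
```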
Category: Data Science

Generator losses in WGAN and potential convergence failure

I have been training a WGAN for a while now, with my generator training once every five epochs. I have tried several model architectures (number of filters) and also tried varying their relationship to each other. No matter what I do, my output is essentially noise. On further reading, this seems to be a classic case of convergence failure. Over time, my generator loss gets more and more negative while my discriminator loss remains around -0.4. My guess is that since …
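For reference when interpreting those numbers, a minimal sketch of the usual WGAN loss terms (assuming the standard critic formulation; the random tensors below are placeholders standing in for real critic outputs):

```python
import torch

critic_real = torch.randn(64)   # placeholder critic scores on real samples
critic_fake = torch.randn(64)   # placeholder critic scores on generated samples

critic_loss = critic_fake.mean() - critic_real.mean()    # what the critic minimizes
generator_loss = -critic_fake.mean()                      # what the generator minimizes

# Because critic scores are unbounded, the raw generator loss can drift arbitrarily
# far negative; the quantity usually monitored for convergence is the Wasserstein
# estimate below rather than either loss on its own.
wasserstein_estimate = critic_real.mean() - critic_fake.mean()
```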
Category: Data Science

Backpropagation with log likelihood cost function and softmax activation

In Michael Nielsen's online book on neural networks, in chapter 3, he introduces a new cost function called the log-likelihood function, defined as $$ C = -\ln(a_y^L) $$ Suppose we have 10 output neurons. When backpropagating the error, only the gradient w.r.t. the $y^{th}$ output neuron is non-zero and all others are zero. Is that right? If so, how is equation (81) below true? $$\frac{\partial C}{\partial b_j^L} = a_j^L - y_j $$ I'm getting the expression …
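A short derivation in Nielsen's notation (assuming, as in that chapter, a softmax output layer with $a_j^L = e^{z_j^L}/\sum_k e^{z_k^L}$ and a one-hot target $y_j = \delta_{jy}$) shows where equation (81) comes from:

$$
\frac{\partial C}{\partial z_j^L}
= -\frac{1}{a_y^L}\,\frac{\partial a_y^L}{\partial z_j^L}
= -\frac{1}{a_y^L}\, a_y^L\left(\delta_{jy} - a_j^L\right)
= a_j^L - \delta_{jy}
= a_j^L - y_j,
\qquad
\frac{\partial C}{\partial b_j^L} = \frac{\partial C}{\partial z_j^L}\,\frac{\partial z_j^L}{\partial b_j^L} = a_j^L - y_j .
$$

Only the gradient with respect to the activation $a_y^L$ is non-zero, but because the softmax couples every weighted input $z_j^L$ to $a_y^L$, the gradients with respect to the weighted inputs, and hence the biases, are non-zero for all $j$.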
Category: Data Science

Backpropagation in NN

During the backward pass, which gradients are kept and which gradients are discarded? Why are some gradients discarded? I know that the forward pass computes the output of the network given the inputs and computes the loss, and that the backward pass computes the gradient of the loss with respect to each weight.
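A tiny PyTorch sketch of the usual framework behavior: gradients with respect to parameters (leaf tensors) are stored, while gradients of intermediate results are used during the backward pass and then discarded, since only the parameter gradients are needed for the update.

```python
import torch

w = torch.randn(3, requires_grad=True)   # a parameter (leaf tensor)
x = torch.randn(3)                       # input (no gradient needed)

h = w * x            # intermediate result (non-leaf tensor)
loss = h.sum()
loss.backward()

print(w.grad)        # kept: gradients w.r.t. parameters are what the optimizer needs
print(h.grad)        # None: gradients of intermediates are discarded after backward
                     # (call h.retain_grad() before backward() to keep them for inspection)
```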
Category: Data Science

Force neural network to only produce positive values

I have a custom neural network written from scratch in Python, and a dataset where negative target/response values are impossible; however, my model sometimes produces negative forecasts/fits, which I'd like to avoid completely. Rather than transform the input data or clip the final forecasts, I'd like to force my neural network to only produce positive values (or values above a given threshold) during forward and back propagation. I believe I understand what needs to be done …
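One common way to do this (an assumption here, not necessarily what the asker has in mind) is to put a strictly positive activation such as softplus on the output layer, so positivity is built into the forward pass and its derivative is simply chained in during backpropagation. A numpy sketch:

```python
import numpy as np

def softplus(z):
    # Numerically stable softplus: always > 0, so forecasts cannot go negative.
    return np.maximum(z, 0) + np.log1p(np.exp(-np.abs(z)))

def softplus_grad(z):
    # d softplus / dz = sigmoid(z); multiply the incoming gradient by this in backprop.
    return 1.0 / (1.0 + np.exp(-z))

threshold = 0.0                       # raise this to enforce a minimum value above zero

z = np.array([-3.0, 0.0, 2.5])        # raw pre-activation of the output layer
y_hat = threshold + softplus(z)       # strictly above the threshold

# In the backward pass, the gradient w.r.t. z is the gradient w.r.t. y_hat
# times softplus_grad(z); e.g. for squared error:
y = np.array([0.1, 1.0, 3.0])
d_yhat = 2 * (y_hat - y) / len(y)
d_z = d_yhat * softplus_grad(z)
```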
Category: Data Science

Is Loss value (e.g., MSE loss) used in the calculation for parameter update when doing gradient descent?

My question is really simple. I know the theory behind gradient descent and parameter updates; what I haven't found clarity on is whether the loss value (e.g., the MSE value) is used, i.e., multiplied in at the start when we do backpropagation for gradient descent (e.g., multiplying the MSE loss value by 1 and then doing backprop, since at the start of backprop we begin with the value 1, i.e., the derivative of x w.r.t. x is 1). If the loss value isn't used …
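A small numpy sketch of the point in question: backprop is seeded with $\partial L/\partial L = 1$ and then multiplies *derivatives* of the loss, never the scalar loss value itself; adding a constant to the loss changes its value but not the gradient, so the parameter update is unchanged.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 1.5, 2.0])

mse = np.mean((y_hat - y) ** 2)

# Backprop starts from d(loss)/d(loss) = 1 and chains derivatives from there.
grad_wrt_yhat = 1.0 * 2 * (y_hat - y) / len(y)   # the leading 1.0 is dL/dL

# Sanity check: shift the loss by a constant. Its value changes, its gradient does not.
shifted_loss = mse + 100.0
grad_shifted = 2 * (y_hat - y) / len(y)          # derivative of the constant is zero
print(mse, shifted_loss)                          # different values
print(np.allclose(grad_wrt_yhat, grad_shifted))   # True - same gradient, same update
```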
Category: Data Science

What is the dimensionality of the bias term in neural networks?

I am trying to build a neural network (3 layers, 1 hidden) in Python on the classic Titanic dataset. I want to include a bias term, following Siraj's examples and the 3Blue1Brown tutorials, and update the bias by backpropagation, but I know my dimensionality is wrong. (I suspect I am updating the biases incorrectly, which is causing the incorrect dimensionality.) The while loop in the code below works for a training dataset, where the node products and biases have the …
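For reference, a numpy sketch (hypothetical layer sizes) of the usual shape bookkeeping: each layer keeps one bias per unit, broadcast across the batch in the forward pass, and its gradient is the incoming delta summed over the batch axis, so the shape never changes.

```python
import numpy as np

batch_size, n_in, n_units = 32, 10, 5
rng = np.random.default_rng(0)

X = rng.normal(size=(batch_size, n_in))
W = rng.normal(size=(n_in, n_units))
b = np.zeros((1, n_units))                 # one bias per unit, broadcast over the batch

Z = X @ W + b                              # (32, 5): b is broadcast across the 32 rows

# Suppose backprop has produced dZ, the gradient of the loss w.r.t. Z.
dZ = rng.normal(size=Z.shape)

dW = X.T @ dZ                              # (10, 5) - matches W
db = dZ.sum(axis=0, keepdims=True)         # (1, 5)  - matches b, summed over the batch
b = b - 0.01 * db                          # shapes stay consistent across updates
```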
Category: Data Science

How to plot the computational graph and derive the update procedure of parameters using the backpropagation algorithm?

Please help me solve this problem without code (PS: this is a written problem): Given the following loss function, plot the computational graph and derive the update procedure of the parameters using the backpropagation algorithm, where $W = \{W_1, W_2, W_3, W_4\}$ and $b = \{b_1, b_2, b_3, b_4\}$ denote the parameters, $x \in \mathbb{R}^d$ denotes the input features, and $y \in \mathbb{R}$ is the ground-truth label.
Category: Data Science

Keras Backpropagation when Later Layers are Frozen

I am working on a project on facial image translation with GANs and still have some conceptual misunderstandings. In my model definition, I extract a deep embedding of the generated image and of the input image using a state-of-the-art CNN which I mark as untrainable, calculate the distance between these embeddings, and use this distance itself as a loss. If the model the embeddings come from is untrainable, will the …
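A minimal TensorFlow/Keras sketch of the general behavior (tiny Dense stand-ins for the real generator and embedding CNN, hypothetical shapes): gradients still flow *through* a frozen model back to the trainable one; freezing only removes the frozen model's own weights from the set that gets updated.

```python
import tensorflow as tf

generator = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="relu"),
                                 tf.keras.layers.Dense(8)])
embedder = tf.keras.Sequential([tf.keras.layers.Dense(4)])  # stand-in for the embedding CNN

x = tf.random.normal((2, 8))
target_embedding = tf.random.normal((2, 4))

_ = embedder(generator(x))        # build both models so their weights exist
embedder.trainable = False        # freeze: the embedder's weights receive no updates

with tf.GradientTape() as tape:
    embedding = embedder(generator(x))
    loss = tf.reduce_mean(tf.square(embedding - target_embedding))

# The chain rule still passes through the frozen embedder to the generator.
grads = tape.gradient(loss, generator.trainable_variables)
print(all(g is not None for g in grads))   # True - the generator still gets gradients
```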
Category: Data Science

Hochreiter LSTM (p. 4): Maximal values of logistic sigmoid derivative times weight

My questions follow the below page 4 excerpt from Hochreiter's LSTM paper: If $f_{l_{m}}$ is the logistic sigmoid function, then the maximal value of $f^\prime_{l_{m}}$ is 0.25. If $y^{l_{m-1}}$ is constant and not equal to zero, then $|f^\prime_{l_{m}}(net_{l_{m}})w_{l_{m}l_{m-1}}|$ takes on maximal values where $w_{l_{m}l_{m-1}} = {1 \over y^{l_{m-1}}} \coth \left( {1 \over 2}net_{l_{m}} \right)$, goes to zero for $|w_{l_{m}l_{m-1}}| \rightarrow \infty$, and is less than 1.0 for $|w_{l_{m}l_{m-1}}| < 4.0$. The derivative of the sigmoid $f_{l_{m}} = f^\prime_{l_{m}} = \sigma$, …
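For anyone who wants to verify the 0.25 bound numerically, here is a small sketch (assuming the standard logistic sigmoid): the maximum of $f'$ is 0.25 at $net = 0$, so for any fixed weight with $|w| < 4$ the scaling factor applied to the backpropagated error stays below 1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

net = np.linspace(-10, 10, 100001)
print(sigmoid_prime(net).max())                  # ~0.25, attained at net = 0

# Hence for a fixed weight w with |w| < 4.0, the factor scaling the backpropagated
# error through this unit satisfies |sigmoid'(net) * w| <= 0.25 * |w| < 1,
# which is the vanishing-error argument in the excerpt.
w = 3.9
print(np.abs(sigmoid_prime(net) * w).max())      # < 1.0
```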
Category: Data Science

Question about grad() from Deep Learning by Chollet

On page 58 of the second edition of Deep Learning with Python, Chollet illustrates an example of a forward and backward pass through a computation graph. The computation graph is given by: $$ x\to w\cdot x := x_1 \to b + x_1 := x_2 \to \text{loss}:=|y_\text{true}-x_2|. $$ We are given that $x=2$, $w=3$, $b=1$, $y_{\text{true}}=4$. When running the backward pass, he calculates $$ grad(\text{loss},x_2) = grad(|4-x_2|,x_2) = 1. $$ Why is the following not true: $$ grad(\text{loss},x_2) = \begin{cases} …
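A quick numerical check using the book's values may help: the backward pass evaluates the local derivative at the value $x_2$ actually takes in this forward pass ($x_2 = 7 > y_\text{true}$), so only that branch of the piecewise derivative applies.

```python
# Finite-difference check of grad(loss, x2) at the concrete values from the example.
x, w, b, y_true = 2.0, 3.0, 1.0, 4.0
x2 = w * x + b                      # 7.0 in this particular forward pass

def loss(x2):
    return abs(y_true - x2)

eps = 1e-6
grad_x2 = (loss(x2 + eps) - loss(x2 - eps)) / (2 * eps)
print(grad_x2)   # ~1.0: on the branch x2 > y_true, |y_true - x2| = x2 - y_true, derivative 1
```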
Category: Data Science

Transferring the hidden state of a RNN to another RNN

I am using Reinforcement Learning to teach an AI an Austrian Card Game with imperfect information called Schnapsen. For different states of the game, I have different neural networks (which use different features) that calculate the value/policy. I would like to try using RNNs, as past actions may be important to navigate future decisions. However, as I use multiple neural networks, I somehow need to constantly transfer the hidden state from one RNN to another one. I am not quite …
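One possible pattern (a sketch only; the GRU sizes, the `bridge` layer, and the tensor shapes are all hypothetical) is to map the final hidden state of one RNN into the initial hidden state of the next with a small learned adapter, so the whole chain stays differentiable:

```python
import torch
import torch.nn as nn

# Two single-layer GRUs over different feature sets, plus a linear "bridge" that maps
# the first network's final hidden state into the second network's initial one.
rnn_a = nn.GRU(input_size=10, hidden_size=32, batch_first=True)
rnn_b = nn.GRU(input_size=6, hidden_size=16, batch_first=True)
bridge = nn.Linear(32, 16)

x_a = torch.randn(4, 7, 10)   # (batch, time, features) from the earlier game phase
x_b = torch.randn(4, 5, 6)    # features from the later phase

_, h_a = rnn_a(x_a)                    # h_a: (1, batch, 32)
h0_b = torch.tanh(bridge(h_a))         # (1, batch, 16): initial hidden state for rnn_b
out_b, h_b = rnn_b(x_b, h0_b)

# Because h0_b is part of the graph, a loss on out_b also trains rnn_a and bridge.
loss = out_b.mean()
loss.backward()
print(rnn_a.weight_hh_l0.grad is not None)   # True
```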
Category: Data Science

Understanding the text from the paper 'Efficient BackProp' by Yann LeCun

Sorry, I just started in Deep Learning, so I am trying my best not to assume anything unless I am absolutely sure. Going through comments here someone recommended this excellent paper on backpropagation Efficient BackProp by Yann LeCun. While reading I stuck at '4.5 Choosing Target Values'. I can't copy paste the text as pdf is not allowing it so posting the screenshot here. Most of the paper was clear to me but I couldn't understand exactly what the author …
Category: Data Science

How does backpropagation work in the case of 2 hidden layers?

Imagine the following structure (for simplicity, there is no bias and no activation function such as sigmoid or ReLU, just weights). The input has two neurons, the two hidden layers have 3 neurons each, and the output layer has two neurons, so there is a cost ($\sum C$) made up of two "subcosts" ($C^1$, $C^2$). (I'm new to machine learning and quite confused by the different notations, formatting and indexes, so to clarify: for activations, the upper index will show the index of it in the …
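In case a compact notation helps, here is one common way to write the backward pass for such a purely linear network (a sketch assuming column-vector activations $a^{(l)}$, weight matrices $W^{(l)}$, output layer $L$, and total cost $C = C^1 + C^2$):

$$
a^{(l)} = W^{(l)} a^{(l-1)},\qquad
\delta^{(L)} = \frac{\partial C}{\partial a^{(L)}},\qquad
\delta^{(l)} = \big(W^{(l+1)}\big)^{\!\top} \delta^{(l+1)},\qquad
\frac{\partial C}{\partial W^{(l)}} = \delta^{(l)} \big(a^{(l-1)}\big)^{\!\top}.
$$

If an activation function $\sigma$ is added later, each $\delta^{(l)}$ simply picks up an extra elementwise factor $\sigma'\!\big(z^{(l)}\big)$, where $z^{(l)}$ is the pre-activation.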
Category: Data Science

Gradient flow through concatenation operation

I need help in understanding the gradient flow through a concatenation operation. I'm implementing a network (mostly a CNN) which has a concatenation operation (in pytorch). The network is defined such that the responses of passing two different images through a CNN are concatenated and passed through another CNN and the training is done end to end. Since the first CNN is shared between both of the inputs to the concatenation, I was wondering how the gradients should be distributed …
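A tiny PyTorch check of what concatenation does to gradients (toy 1-D tensors standing in for the CNN responses): the gradient of the concatenated tensor is simply split back to the inputs, and when both inputs come from the same shared CNN, autograd accumulates (sums) the two contributions into that CNN's weight gradients automatically.

```python
import torch

a = torch.randn(4, requires_grad=True)
b = torch.randn(4, requires_grad=True)

c = torch.cat([a, b])            # concatenation just places the inputs side by side
loss = (c * torch.arange(8.0)).sum()
loss.backward()

print(a.grad)                    # first 4 coefficients: 0., 1., 2., 3.
print(b.grad)                    # last 4 coefficients:  4., 5., 6., 7.
# The gradient w.r.t. c is split back to a and b along the concatenation dimension;
# shared upstream weights receive the sum of both contributions.
```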
Category: Data Science

Back propagation on matrix of weights

I am trying to implement a neural network for binary classification using Python and numpy only. My network structure is as follows: input features: 2 ([1x2] matrix); hidden layer 1: 5 neurons ([2x5] matrix); hidden layer 2: 5 neurons ([5x5] matrix); output layer: 1 neuron ([5x1] matrix). I have used the sigmoid activation function in all the layers. Now let's say I use binary cross entropy as my loss function. How do I do the backpropagation on these matrices to update the weights? …
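A numpy sketch of one forward/backward pass with exactly these shapes (biases omitted to keep it short; the learning rate is arbitrary), showing that every weight gradient ends up with the same shape as the matrix it updates:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(1, 2))            # input:           [1x2]
y = np.array([[1.0]])                  # binary label
W1 = rng.normal(size=(2, 5))           # hidden layer 1:  [2x5]
W2 = rng.normal(size=(5, 5))           # hidden layer 2:  [5x5]
W3 = rng.normal(size=(5, 1))           # output layer:    [5x1]

# Forward pass (sigmoid everywhere, as in the question).
A1 = sigmoid(X @ W1)                   # [1x5]
A2 = sigmoid(A1 @ W2)                  # [1x5]
A3 = sigmoid(A2 @ W3)                  # [1x1]
bce = -(y * np.log(A3) + (1 - y) * np.log(1 - A3))

# Backward pass. With a sigmoid output and binary cross entropy, dZ3 simplifies to A3 - y.
dZ3 = A3 - y                           # [1x1]
dW3 = A2.T @ dZ3                       # [5x1] - same shape as W3
dZ2 = (dZ3 @ W3.T) * A2 * (1 - A2)     # [1x5]
dW2 = A1.T @ dZ2                       # [5x5] - same shape as W2
dZ1 = (dZ2 @ W2.T) * A1 * (1 - A1)     # [1x5]
dW1 = X.T @ dZ1                        # [2x5] - same shape as W1

lr = 0.1
W1, W2, W3 = W1 - lr * dW1, W2 - lr * dW2, W3 - lr * dW3
```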
Category: Data Science
