Hi all, I am turning to you to figure out where the gradient and loss updates for the q, k, v weights happen in Vision Transformers. I suspect it is the MLP/feed-forward part of the architecture, but I am not entirely sure. I attach some code from lucidrains:

import torch
from torch import nn
from einops import rearrange, repeat
from einops.layers.torch import Rearrange

# helpers

def pair(t):
    return t if isinstance(t, tuple) else (t, t)

# classes

class PreNorm(nn.Module):
    def __init__(self, dim, …
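For what it's worth, here is a minimal sketch (a toy single-head attention layer I made up, not the lucidrains model) showing that the packed q/k/v projection weights receive their own gradient directly from loss.backward(), independently of the MLP/feed-forward block:

import torch
from torch import nn

class ToyAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)  # packed q, k, v projections

    def forward(self, x):
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)
        return attn @ v

attn = ToyAttention(dim=8)
x = torch.randn(1, 4, 8)              # (batch, tokens, dim)
loss = attn(x).sum()                  # stand-in for a real loss
loss.backward()
print(attn.to_qkv.weight.grad.shape)  # the q/k/v weights get their own gradient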
During the backward pass, which gradients are kept and which are discarded? Why are some gradients discarded? I know that the forward pass computes the output of the network given the inputs and then computes the loss, and that the backward pass computes the gradient of the loss with respect to each weight.
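A small sketch of PyTorch's default behaviour, assuming that is the framework in question: gradients are kept on leaf tensors (the parameters the optimizer needs), while gradients of intermediate activations are discarded after backward() unless you explicitly retain them:

import torch

w = torch.randn(3, 3, requires_grad=True)   # leaf tensor (a "weight")
x = torch.randn(3)
h = w @ x                                   # intermediate activation
h.retain_grad()                             # ask PyTorch to keep its gradient too
loss = h.sum()
loss.backward()

print(w.grad is not None)   # True: parameter gradients are kept for the optimizer step
print(h.grad is not None)   # True only because retain_grad() was called above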
While approximating gradients numerically, using the actual machine epsilon to shift the weights results in wildly large gradient approximations, as the "width" of the approximation triangle is disproportionately small. In Andrew Ng's course he uses 0.01, but I suppose that is for example purposes only. This makes me wonder: is there a method to choose an appropriate epsilon value for gradient approximation based on, e.g., the current error value of the network?
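As a rough illustration (a made-up quadratic loss, nothing from the course), a central-difference check against the analytic gradient 2*w shows why a moderate epsilon such as 1e-4 behaves far better than the machine epsilon:

import numpy as np

def loss(w):
    return np.sum(w ** 2)

def central_difference(w, i, eps):
    # shift only coordinate i by +/- eps and take the symmetric difference
    e = np.zeros_like(w)
    e[i] = eps
    return (loss(w + e) - loss(w - e)) / (2 * eps)

w = np.array([0.3, -1.2, 2.0])
for eps in (np.finfo(np.float64).eps, 1e-8, 1e-4, 1e-2):
    approx = central_difference(w, 0, eps)
    print(f"eps={eps:.1e}  approx={approx:.6f}  analytic={2 * w[0]:.6f}")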
I have an array of arrival times and I want to convert it to count data using PyTorch in a differentiable way. Example arrival times: arrival_times = [2.1, 2.9, 5.1], and let's say the total range is 6 seconds. What I want to get is: counts = [0, 0, 2, 2, 2, 3]. For this task, a non-differentiable way works perfectly:

x = [1, 2, 3, 4, 5, 6]
counts = torch.sum(torch.Tensor(arrival_times)[:, None] < torch.Tensor(x), dim=0)

It turns out the < …
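One common workaround, sketched below under the assumption that a smooth approximation is acceptable, is to replace the hard < comparison with a steep sigmoid so the counts become differentiable:

import torch

arrival_times = torch.tensor([2.1, 2.9, 5.1], requires_grad=True)
x = torch.arange(1.0, 7.0)            # bin edges 1..6
temperature = 0.05                    # smaller -> closer to the hard comparison

soft_counts = torch.sigmoid((x - arrival_times[:, None]) / temperature).sum(dim=0)
print(soft_counts)                    # approximately [0, 0, 2, 2, 2, 3]

soft_counts.sum().backward()          # gradients now flow back to arrival_times
print(arrival_times.grad)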
I have a CNN architecture with two cross-entropy losses $\mathcal{L}_1$ and $\mathcal{L}_2$ summed in the total loss $\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2$. The task I want to solve is Unsupervised Domain Adaptation. I have observed the following behavior: the gradients coming from $\mathcal{L}_1$ have a different magnitude than those coming from $\mathcal{L}_2$, such that the supervision coming from the first loss is negligible. $\mathcal{L}_1$ has a positive constant value and does not decrease during training, while $\mathcal{L}_2$ does …
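If it helps, one way to confirm this is to measure the gradient norm each loss contributes to the shared parameters; the sketch below uses a stand-in linear model and artificially scaled losses purely for illustration:

import torch
from torch import nn

model = nn.Linear(10, 2)                      # stand-in for the shared backbone
x = torch.randn(4, 10)
y = torch.randint(0, 2, (4,))

def grad_norm(loss):
    # gradient of this loss alone, without touching .grad buffers
    grads = torch.autograd.grad(loss, model.parameters(), retain_graph=True)
    return torch.sqrt(sum(g.pow(2).sum() for g in grads))

logits = model(x)
loss1 = nn.functional.cross_entropy(logits, y)          # plays the role of L1
loss2 = 10.0 * nn.functional.cross_entropy(logits, y)   # plays the role of L2
print(grad_norm(loss1), grad_norm(loss2))               # compare contributions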
On page 58 of the second edition of Deep Learning with Python, Chollet illustrates an example of a forward and backward pass through a computation graph. The computation graph is given by: $$ x\to w\cdot x := x_1 \to b + x_1 := x_2 \to \text{loss}:=|y_\text{true}-x_2|. $$ We are given that $x=2$, $w=3$, $b=1$, $y_{\text{true}}=4$. When running the backward pass, he calculates $$ grad(\text{loss},x_2) = grad(|4-x_2|,x_2) = 1. $$ Why is the following not true: $$ grad(\text{loss},x_2) = \begin{cases} …
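A quick autograd check of the numbers in the example (x2 = 7, so the loss sits on the branch where the derivative of |y_true - x2| with respect to x2 is +1):

import torch

x, w, b, y_true = 2.0, 3.0, 1.0, 4.0
x2 = torch.tensor(w * x + b, requires_grad=True)   # x2 = 7
loss = (y_true - x2).abs()                         # |4 - 7| = 3
loss.backward()
print(x2.grad)                                     # tensor(1.)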
I am trying to understand how integrated gradients work in the NLP case. Let $F: \mathbb{R}^{n} \rightarrow[0,1]$ be a function representing a neural network, $x \in \mathbb{R}^{n}$ an input and $x' \in \mathbb{R}^{n}$ a reference. We consider the segment connecting $x'$ to $x$, and we compute the gradient at every point of this segment. The IG method simply accumulates these gradients. Thus, $IG$ in the $i$-th dimension is given by the following formula: $$ I G_{i}(x)=\left(x_{i}-x'_{i}\right) \frac{\int_{\alpha=0}^{1} d …
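For concreteness, here is a hedged sketch of the usual Riemann-sum approximation of that path integral, with a toy model standing in for F and a step count chosen arbitrarily:

import torch

def F(x):
    return torch.sigmoid(x.sum())            # toy network output in [0, 1]

def integrated_gradients(x, x_ref, steps=50):
    alphas = torch.linspace(0.0, 1.0, steps)
    total = torch.zeros_like(x)
    for alpha in alphas:
        # point on the segment from the reference x' to the input x
        point = (x_ref + alpha * (x - x_ref)).detach().requires_grad_(True)
        F(point).backward()
        total += point.grad
    return (x - x_ref) * total / steps        # (x_i - x'_i) * average gradient

x = torch.tensor([1.0, -2.0, 0.5])
x_ref = torch.zeros(3)
print(integrated_gradients(x, x_ref))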
Will there be differences between applying autograd to the loss function (using a Python library) and applying an explicit gradient (the gradient from the paper, or the update rule)? For example: numerical, runtime, mathematical, or stability differences.
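As a small illustration of the "mathematically equal, numerically near-identical" case, the sketch below compares autograd with the hand-derived MSE gradient for plain linear regression (the example is mine, not from any particular paper):

import torch

torch.manual_seed(0)
X = torch.randn(32, 5)
y = torch.randn(32)
w = torch.zeros(5, requires_grad=True)

# autograd gradient of the mean squared error
loss = ((X @ w - y) ** 2).mean()
loss.backward()
autograd_grad = w.grad.clone()

# explicit gradient from the update rule: 2/m * X^T (Xw - y)
manual_grad = 2.0 / X.shape[0] * X.T @ (X @ w.detach() - y)

print(torch.allclose(autograd_grad, manual_grad, atol=1e-6))   # True (up to rounding)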
I am trying to implement the REINFORCE policy gradient algorithm in C++, and in my case there is no "autograd" facility like in PyTorch, so I have to calculate the gradient manually. Let's imagine a scenario where the state space size is 4 and the action space size is 2 (CartPole). I also collected the following data for 3 steps: action probabilities (softmax): [0.21, 0.34, 0.45], [0.91, 0.01, 0.08], [0.50, 0.30, 0.20]; sampled action (one-hot encoded): …
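The key identity for hand-coding this is that the gradient of log pi(a|s) with respect to the softmax logits is one_hot(a) - probs; the sketch below (in Python/NumPy for brevity, even though the target is C++) checks that identity numerically:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([0.2, -0.5, 1.0])
probs = softmax(logits)
action = 2                                          # sampled action index
one_hot = np.eye(len(logits))[action]

grad_log_pi = one_hot - probs                       # d log pi(a|s) / d logits

# numerical check of the identity with central differences
eps = 1e-6
numeric = np.zeros_like(logits)
for i in range(len(logits)):
    e = np.zeros_like(logits); e[i] = eps
    numeric[i] = (np.log(softmax(logits + e)[action]) -
                  np.log(softmax(logits - e)[action])) / (2 * eps)
print(np.allclose(grad_log_pi, numeric, atol=1e-5))  # True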
I do not understand how complex networks with many parameters/dimensions can be represented in 3D space and form a standard cost surface, just like a simple network with, say, 2 parameters. For example, a network with 2 parameters that correspond to the X and Y axes, respectively, and a cost function that corresponds to the Z axis makes sense... but how can we have a network with 1000 dimensions represented in a 3D space, on a planar cost surface (not …
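For what it's worth, such 3D pictures are usually made by evaluating the loss on a 2D slice of the huge parameter space, spanned by two chosen (here random) direction vectors; a rough sketch with a made-up 101-parameter model:

import torch
from torch import nn

model = nn.Linear(100, 1)                         # 101 parameters, not just 2
x, y = torch.randn(64, 100), torch.randn(64, 1)
theta0 = torch.cat([p.detach().flatten() for p in model.parameters()])
d1, d2 = torch.randn_like(theta0), torch.randn_like(theta0)   # two slice directions

def loss_at(alpha, beta):
    theta = theta0 + alpha * d1 + beta * d2
    # unpack the flat vector back into the model's weight and bias
    w, b = theta[:100].view(1, 100), theta[100:]
    return nn.functional.mse_loss(x @ w.T + b, y).item()

grid = [[loss_at(a, b) for a in torch.linspace(-1, 1, 5)]
        for b in torch.linspace(-1, 1, 5)]        # a 5x5 patch of the slice
print(grid)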
I have a trained neural network (NN) with independent inputs $x_1, x_2, \dots, x_n$ and a scalar output $y$. Input $x_1$ is a scalar, and tf.gradients(y, x1) returns a negative value. However, calculating the approximate gradient via $\frac{NN(x_1 + \Delta) - NN(x_1-\Delta)}{2\Delta}$ with $\Delta > 0$ yields a positive value. The following is a visualization of my problem. In blue are the values $y = NN(\text{inputs})$ for all inputs seen as training data, plotted against $x_1$. Judging by these points, it is reasonable to …
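A sanity check along these lines (sketched with TF2's GradientTape and a toy model, rather than the TF1 tf.gradients call in the question) is to compare the analytic gradient with a central difference at the same point; for a smooth model they should agree, so a sign mismatch usually points at too large a Delta, a non-smooth region, or differentiating the wrong tensor:

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation="tanh"),
                             tf.keras.layers.Dense(1)])
x = tf.constant([[0.3, -1.0, 2.0]])

with tf.GradientTape() as tape:
    tape.watch(x)
    y = model(x)
analytic = tape.gradient(y, x)[0, 0]              # d y / d x1 at this point

# central finite difference on the first input only
step = tf.constant([[1e-3, 0.0, 0.0]])
finite = (model(x + step) - model(x - step)) / 2e-3
print(float(analytic), float(finite[0, 0]))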
I need to quantize the inputs, but the method I need for this (bucketize) is non-differentiable. I can of course detach the tensor, but then I lose the flow of gradients to earlier weights. I guess the question is quite simple: how do you keep gradients flowing when necessary? For example, using the following code ...

x = self.linear1(x)
min, max = int(x.min()), int(x.max())
bins = torch.linspace(min, max + 1, 16)
x = torch.bucketize(x.detach(), bins)  # forced to detach …
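One common workaround (a sketch, not necessarily the right fix here) is a straight-through estimator: the forward pass uses the hard bucketize result, while the backward pass pretends the quantization was the identity:

import torch

def straight_through_bucketize(x, bins):
    hard = torch.bucketize(x.detach(), bins).to(x.dtype)
    # forward value == hard; gradient flows as if the op were the identity on x
    return x + (hard - x).detach()

x = torch.randn(4, 8, requires_grad=True)
bins = torch.linspace(-3.0, 3.0, 16)
q = straight_through_bucketize(x, bins)
q.sum().backward()
print(x.grad)          # gradients reach x even though bucketize itself has none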
I am using a loss which requires sampling from probability distributions to do Monte Carlo integration. Sometimes legitimate training data can throw -inf/NaN. This is intended behaviour, since a data point may be far enough from the model that the probability is too small for float32. Needless to say, switching to float64 etc. is not a solution. The problem is that -inf turns into NaN when calculating the gradient when using logsumexp, sinh, and MultivariateNormal.logpdf, which then propagates all the …
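One possible mitigation, sketched below with a made-up example, is to detect the samples whose log-probability underflows to -inf and rebuild the Monte Carlo estimate only from the finite subset, so no -inf ever enters the graph that backward() sees (clamping the log-probabilities is a common alternative):

import torch
from torch.distributions import MultivariateNormal

mean = torch.zeros(2, requires_grad=True)
dist = MultivariateNormal(mean, torch.eye(2))
samples = torch.tensor([[0.5, -0.2], [1e20, 1e20], [0.1, 0.3]])  # middle one underflows

with torch.no_grad():
    finite = torch.isfinite(dist.log_prob(samples))   # detect offending samples

loss = -dist.log_prob(samples[finite]).mean()         # graph built on finite samples only
loss.backward()
print(mean.grad)                                      # finite gradient, no NaN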
#AM is the autograd function for argmax with backpropagation
x, preds = model(id, mask)
print(preds.retain_grad())
print(AM.apply(preds))

# compute the loss between actual and predicted values
loss = torch.mean(1 / (sample_score(x, AM.apply(preds)) + 1))
print(loss)
# loss.requires_grad = True

# add on to the total loss
x.retain_grad()
preds.retain_grad()
print(x.requires_grad)
print(preds.requires_grad)
loss.backward()
print(model.fc2.weight.grad)
total_loss = total_loss + loss.item()

Here the gradients are becoming None. How can I solve this?
I'm asked to compute a central finite difference scheme (f(i+1) - f(i-1)) on an image. My attempt is something like:

def gradient_x_diff(img):
    img = img.astype(float)
    return np.fabs(np.roll(img, 1, axis=0) - np.roll(img, -1, axis=0))

However, it's hinted that the solution is straightforward. It should be something like this:

def gradient_x_diff(img):
    img = img.astype(float)
    return img[*] - img[**]

What should I put instead of the * and **?
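For comparison, here is a hedged sketch of the central difference written both with plain slicing (dropping the two edge rows) and with np.roll (which wraps around at the edges); whether either matches the expected [*]/[**] indexing depends on the assignment's boundary convention:

import numpy as np

def gradient_x_slice(img):
    img = img.astype(float)
    return img[2:, :] - img[:-2, :]          # f(i+1) - f(i-1) for the interior rows

def gradient_x_roll(img):
    img = img.astype(float)
    return np.roll(img, -1, axis=0) - np.roll(img, 1, axis=0)   # wraps around at edges

img = np.arange(25).reshape(5, 5)
print(gradient_x_slice(img))
print(gradient_x_roll(img)[1:-1, :])          # interior rows agree with the slice version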
I need to implement a custom loss function. The function is relatively simple: $$-\sum \limits_{i=1}^m [O_{1,i} \cdot y_i-1] \ \cdot \ \operatorname{ReLU}(O_{1,i} \cdot \hat{y}_i - 1)$$ with $O$ being some external attribute specific to each case. I was initially working with LightGBM, but I only found tutorials that included calculating the Hessian and the gradient. If there is a way to add the function without this, please correct me. Otherwise I am open to using other libraries. PyTorch-Fastai, Tensorflow-keras, catboost, …
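Assuming a PyTorch route is acceptable, a sketch of the loss as written (tensor names are my own) would look like this; since everything is built from differentiable ops, autograd supplies the gradient without any hand-coded Hessian:

import torch

def custom_loss(o1, y_true, y_pred):
    # -sum_i [o1_i * y_i - 1] * ReLU(o1_i * y_hat_i - 1)
    return -torch.sum((o1 * y_true - 1) * torch.relu(o1 * y_pred - 1))

o1 = torch.tensor([1.5, 0.8, 2.0])                            # external attribute O
y_true = torch.tensor([1.0, 0.0, 1.0])
y_pred = torch.tensor([0.9, 0.2, 0.7], requires_grad=True)    # model output
loss = custom_loss(o1, y_true, y_pred)
loss.backward()
print(loss.item(), y_pred.grad)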
According to this article: https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484 the vanishing gradient problem occurs when using the sigmoid activation function, because sigmoid maps a large input space into a small output space, so the gradient for large values is close to zero. The article suggests using a batch normalization layer. I can't understand how that works. When using normalization, big values still become big values on another scale (instead of [-inf, inf] they get [0..1] or [-1..1]), so in the same cases the values …
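As a small numeric illustration of the saturation argument: sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)) is vanishingly small for large |x|, but after a batch-norm-style standardization the same batch lands where the derivative is far from zero (toy numbers, not from the article):

import torch

x = torch.tensor([-30.0, -10.0, 0.5, 10.0, 30.0])

def sigmoid_grad(v):
    s = torch.sigmoid(v)
    return s * (1 - s)

print(sigmoid_grad(x))                      # ~0 for the large-magnitude entries
x_norm = (x - x.mean()) / x.std()           # standardize the batch to roughly unit scale
print(sigmoid_grad(x_norm))                 # all entries comfortably away from 0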
ReLU is used as an activation function that serves two purposes: (1) breaking linearity in a DNN, and (2) helping to handle the vanishing gradient problem. For the exploding gradient problem, we use the gradient clipping approach, where we set a maximum threshold on the gradient, similarly to ReLU, which (as I understand it) sets the minimum gradient limit to 0. From what I have read so far, ReLU is considered an activation function. In a similar fashion, can we use gradient clipping also as an activation function? If yes, any pros …
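A short sketch of the mechanical difference, using standard PyTorch utilities: ReLU is applied to activations in the forward pass, whereas gradient clipping is applied to the already-computed gradients just before the optimizer step, so it operates on a different object entirely:

import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))  # ReLU acts on activations
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(16, 4), torch.randn(16, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)    # clipping acts on gradients
opt.step()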