Gradient and loss calculation localization in Vision Transformers

Hi all, I am turning to you to figure out where the gradient and the loss for the q, k, v weight updates are computed in Vision Transformers. I suspect it is the MLP/FF part of the architecture, but I am not sure. I attach some code from lucidrains: import torch from torch import nn from einops import rearrange, repeat from einops.layers.torch import Rearrange # helpers def pair(t): return t if isinstance(t, tuple) else (t, t) # classes class PreNorm(nn.Module): def __init__(self, dim, …
Category: Data Science

Backpropagation in NN

During the backward pass, which gradients are kept and which are discarded? Why are some gradients discarded? I know that the forward pass computes the output of the network given the inputs and then computes the loss. The backward pass computes the gradient of the loss with respect to each weight.
Category: Data Science

How to choose appropriate epsilon value while approximating gradients to check training?

When approximating gradients, using the actual epsilon to shift the weights results in wildly large gradient approximations, because the "width" of the approximation triangle is disproportionately small. In Andrew Ng's course, he uses 0.01, but I suppose that is for example purposes only. This makes me wonder: is there a method to choose an appropriate epsilon value for gradient approximation based on, e.g., the current error value of the network?
Topic: gradient
Category: Data Science
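For the epsilon question above, a small sweep can make the trade-off visible. The sketch below is only an illustration under stated assumptions: `f` is a made-up scalar function standing in for a loss, and the epsilon values are arbitrary. It compares a central difference against the analytic derivative and shows that the error grows both when epsilon is large (truncation error) and when it is tiny (floating-point round-off).

```python
import math

def f(w):
    # Hypothetical scalar "loss", used only for illustration.
    return math.sin(w) + w ** 2

def df(w):
    # Analytic derivative of f, used as the reference value.
    return math.cos(w) + 2 * w

def central_difference(func, w, eps):
    # Two-sided approximation: (f(w + eps) - f(w - eps)) / (2 * eps)
    return (func(w + eps) - func(w - eps)) / (2 * eps)

w0 = 1.5
errors = {}
for eps in (1e-1, 1e-2, 1e-4, 1e-6, 1e-10, 1e-14):
    approx = central_difference(f, w0, eps)
    # Relative error against the analytic derivative.
    errors[eps] = abs(approx - df(w0)) / max(abs(df(w0)), 1e-12)
    print(f"eps={eps:.0e}  relative error={errors[eps]:.2e}")
```

Running a sweep like this on a few weights of a real network is one practical way to pick epsilon empirically rather than from a fixed rule.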

Differentiable approximation for counting negative values in array

I have an array of arrival times and I want to convert it to count data using PyTorch in a differentiable way. Example arrival times: arrival_times = [2.1, 2.9, 5.1] and let's say the total range is 6 seconds. What I want to have is: counts = [0, 0, 2, 2, 2, 3] For this task, a non-differentiable way works perfectly: x = [1, 2, 3, 4, 5, 6] counts = torch.sum(torch.Tensor(arrival_times)[:, None] < torch.Tensor(x), dim=0) It turns out the < …
Category: Data Science
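One common way to relax a hard comparison like the one in the question above is to replace the step function with a steep sigmoid. The sketch below is written in plain Python so it runs without PyTorch; `soft_counts` and the `temperature` value are assumptions for illustration, not part of the original post. The same expression could be written with `torch.sigmoid` on broadcast tensors to obtain gradients.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_counts(arrival_times, grid, temperature=0.05):
    # Smooth stand-in for sum(t < g): each comparison becomes sigmoid((g - t) / T).
    # Smaller temperature -> closer to the hard count, but steeper gradients.
    return [sum(sigmoid((g - t) / temperature) for t in arrival_times) for g in grid]

arrival_times = [2.1, 2.9, 5.1]
grid = [1, 2, 3, 4, 5, 6]

hard = [sum(t < g for t in arrival_times) for g in grid]  # non-differentiable reference
soft = soft_counts(arrival_times, grid)                   # smooth approximation

print(hard)
print([round(s, 3) for s in soft])
```

The temperature is a tuning knob: values that are too small reproduce the hard count but give almost no gradient away from the thresholds.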

CNN gradients with different magnitude

I have a CNN architecture with two cross entropy losses $\mathcal{L}_1$ and $\mathcal{L}_2$ summed in the total loss $\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2$. The task I want to solve is Unsupervised Domain Adaptation. I have observed the following behavior: the gradients coming from $\mathcal{L}_1$ have a different magnitude than those coming from $\mathcal{L}_2$, such that the supervision coming from the first loss is negligible. $\mathcal{L}_1$ has a positive constant value and does not decrease during training, while $\mathcal{L}_2$ does …
Category: Data Science

Question about grad() from Deep Learning by Chollet

On page 58 of the second edition of Deep Learning with Python, Chollet is illustrating an example of a forward and backward pass of a computation graph. The computation graph is given by: $$ x\to w\cdot x := x_1 \to b + x_1 := x_2 \to \text{loss}:=|y_\text{true}-x_2|. $$ We are given that $x=2$, $w=3$, $b=1$, $y_{\text{true}}=4$. When running the backward pass, he calculates $$ grad(\text{loss},x_2) = grad(|4-x_2|,x_2) = 1. $$ Why is the following not true: $$ grad(\text{loss},x_2) = \begin{cases} …
Category: Data Science
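The numbers in the question above can be checked numerically. The sketch below recomputes the forward pass with the stated values and compares a central difference of the loss with respect to $x_2$ against the sign-based derivative of an absolute value, which is $\operatorname{sign}(x_2 - y_\text{true})$ wherever $x_2 \neq y_\text{true}$. The helper name `loss_fn` and the step size `h` are assumptions for illustration.

```python
def loss_fn(x2, y_true=4.0):
    # loss := |y_true - x2|
    return abs(y_true - x2)

# Forward pass with the values given in the question.
x, w, b = 2.0, 3.0, 1.0
x1 = w * x          # 6.0
x2 = b + x1         # 7.0
loss = loss_fn(x2)  # |4 - 7| = 3.0

# Central difference of the loss with respect to x2.
h = 1e-6
numeric_grad = (loss_fn(x2 + h) - loss_fn(x2 - h)) / (2 * h)

# d|y_true - x2| / dx2 = sign(x2 - y_true); not defined at x2 == y_true.
analytic_grad = 1.0 if x2 > 4.0 else -1.0

print(x2, loss, numeric_grad, analytic_grad)
```

At this particular point $x_2 = 7 > y_\text{true}$, so the derivative is a single number rather than a piecewise expression.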

How to interpret integrated gradients in an NLP toxic text classification use-case?

I am trying to understand how integrated gradients work in the NLP case. Let $F: \mathbb{R}^{n} \rightarrow[0,1]$ be a function representing a neural network, $x \in \mathbb{R}^{n}$ an input and $x' \in \mathbb{R}^{n}$ a reference. We consider the segment connecting $x$ to $x'$, and we compute the gradient at every point of this segment. The IG method simply sums these gradients. Thus, $IG$ in the $i$th dimension is given by the following formula: $$ I G_{i}(x)=\left(x_{i}-x'_{i}\right) \frac{\int_{\alpha=0}^{1} d …
Category: Data Science
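In practice the integral along the segment is usually approximated by a Riemann sum. The sketch below is a toy illustration, not a real text classifier: `F` is a hypothetical logistic model with made-up weights, and the baseline and step count are arbitrary. It averages the gradient at midpoints between the reference and the input, scales by $(x_i - x'_i)$, and checks that the attributions add up to $F(x) - F(x')$.

```python
import math

WEIGHTS = (1.5, -2.0, 0.5)  # hypothetical model parameters

def F(x):
    # Toy "network": logistic regression with output in [0, 1].
    z = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-z))

def grad_F(x):
    # Analytic gradient of F with respect to its input.
    p = F(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]

def integrated_gradients(x, baseline, steps=200):
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule on [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_F(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    # Scale the averaged gradient by the input difference, per dimension.
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg_grad)]

x = [1.0, -0.5, 1.0]
baseline = [0.0, 0.0, 0.0]
ig = integrated_gradients(x, baseline)

print(ig, sum(ig), F(x) - F(baseline))
```

For text, the inputs would typically be token embeddings and the attribution of a token the sum over its embedding dimensions; that mapping is outside this sketch.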

How to manually calculate the gradient that will propagate back over the network using the REINFORCE algorithm?

I am trying to implement the deep reinforcement learning policy gradient method REINFORCE in C++, and in my case there is no "autograd" facility like in PyTorch, so I have to calculate the gradient manually. Let's imagine that I have a scenario where the state space size is 4 and the action space size is 2 (Cartpole). I also collected the following data for 3 steps: action probability (softmax): [0.21, 0.34, 0.45], [0.91, 0.01, 0.08], [0.50, 0.30, 0.20] sampled action (one-hot encoded): …
Category: Data Science
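For a softmax policy, the gradient that enters the network at the output layer has a simple closed form: for the per-step loss $-G \log \pi(a \mid s)$, the derivative with respect to logit $i$ is $G\,(p_i - \mathbb{1}[i = a])$. The sketch below checks this against finite differences. It uses the first probability vector from the question; the sampled action index and the return `ret` are hypothetical, since those values are cut off in the excerpt. Python is used here for brevity, and the arithmetic carries over to C++ directly.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_loss(logits, action, ret):
    # Per-step loss: -G * log pi(a | s)
    return -ret * math.log(softmax(logits)[action])

def reinforce_logit_grad(probs, action, ret):
    # d(loss) / d(logit_i) = G * (p_i - 1[i == action])
    return [ret * (p - (1.0 if i == action else 0.0)) for i, p in enumerate(probs)]

probs = [0.21, 0.34, 0.45]  # first step from the question
action = 2                  # hypothetical sampled action index
ret = 1.0                   # hypothetical (discounted) return G

grad = reinforce_logit_grad(probs, action, ret)
print(grad)
```

This vector is what gets backpropagated through the layers below the softmax with the usual chain rule; summing it over the steps of an episode gives the batch gradient at the logits.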

Intuitive explanation for representing gradient in higher dimensions

I do not understand how complex networks with many parameters/dimensions can be represented in a 3D space, and form a standard cost surface just like a simple network with, say, 2 parameters. For example, a network with 2 parameters that correspond to the X and Y axis, respectively, and cost function that corresponds to the Z axis makes sense...but how can we have a network with 1000 dimensions being represented in a 3D space, on a planar cost surface (not …
Category: Data Science

Analytical gradients from tf.gradients don't match approximate gradients

I have a trained neural network (NN) with independent inputs x1, x2.. xn and a scalar output y. Input x1 is a scalar, and tf.gradients(y, x1) returns a negative value. However, calculating approximate gradients via $\frac{NN(x1 + \Delta) - NN(x1-\Delta)}{2\Delta}$ where $\Delta > 0$ yields a positive value. The following is a visualization of my problem. In blue are y = NN(inputs) for all inputs seen as training data plotted against x1. Judging by these points, it is reasonable to …
Category: Data Science

Gradient passthrough in PyTorch

I need to quantize the inputs, but the method I need for this (bucketize) is non-differentiable. I can of course detach the tensor, but then I lose the flow of gradients to earlier weights. I guess the question is quite simple: how do you continue the flow of gradients when necessary? For example, using the following code ... x = self.linear1(x) min, max = int(x.min()), int(x.max()) bins = torch.linspace(min, max+1, 16) x = torch.bucketize(x.detach(), bins) # forced to detach …
Category: Data Science
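A technique often used for this situation is the straight-through estimator: the forward pass uses the quantized values, while the backward pass treats quantization as the identity. The sketch below is one possible arrangement, not the asker's model: `linear` stands in for `self.linear1`, the bin edges are taken from the data range, and mapping bucket indices back to bin values is an assumption about what the quantized tensor should contain.

```python
import torch

torch.manual_seed(0)
linear = torch.nn.Linear(4, 4)  # stand-in for self.linear1 (assumption)
x = linear(torch.randn(8, 4))

# 16 bin edges spanning the observed range (choice made for this sketch).
bins = torch.linspace(float(x.min()), float(x.max()), 16)

# bucketize returns integer indices and carries no gradient.
idx = torch.bucketize(x.detach(), bins).clamp(max=len(bins) - 1)
x_quantized = bins[idx]  # quantized values, same shape as x

# Straight-through trick: value equals x_quantized, gradient flows as if it were x.
x_ste = x + (x_quantized - x).detach()

x_ste.sum().backward()
print(linear.weight.grad is not None)
```

The gradient reaching `linear` is the one an unquantized `x` would have produced, which is a biased but widely used approximation.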

Propagating -infs in pytorch and outliers in general

I am using a loss which requires sampling from probability distributions to do Monte Carlo integration. Sometimes legitimate training data can produce -inf/NaN. This is intended behaviour, since the data point may be far enough from the model that the probability is too small for float32. Needless to say, switching to float64 etc. is not a solution. The problem is that -inf turns into nan when calculating the gradient using logsumexp, sinh, and MultivariateNormal.logpdf, which then propagates all the …
Category: Data Science

Gradients are becoming None in PyTorch

#AM is the autograd function for argmax with backpropagation x,preds = model(id, mask) print(preds.retain_grad()) print(AM.apply(preds)) # compute the loss between actual and predicted values loss = torch.mean(1/(sample_score(x,AM.apply(preds))+1)) print(loss) #loss.requires_grad=Truecse # add on to the total loss x.retain_grad() preds.retain_grad() print(x.requires_grad) print(preds.requires_grad) loss.backward() print(model.fc2.weight.grad) total_loss = total_loss + loss.item() Here gradients are becoming None. How can I solve this?
Category: Data Science

Central finite distance gradient simplified

I'm asked to compute a central finite difference scheme (f(i+1)-f(i-1)) on an image. My attempt is something like: def gradient_x_diff(img): img = img.astype(float) return np.fabs(np.roll(imgf,1, axis = 0) - imgf(np.roll(imgf,1, axis = 0)) However, it's hinted that the solution is straightforward. It should be something like this: def gradient_x_diff(img): img = img.astype(float) return img[*]-img[**] What should I put in place of the * and **?
Category: Data Science
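One way to express f(i+1) - f(i-1) with plain slicing is to subtract two shifted views of the array. The sketch below assumes the difference runs along axis 0, as in the attempt above, and that the two border rows (which lack a neighbour on one side) are simply dropped; both are assumptions, and the intended exercise may treat borders or axes differently.

```python
import numpy as np

def gradient_x_diff(img):
    img = img.astype(float)
    # Row i+1 minus row i-1 for every interior row i (no division by 2,
    # matching the scheme as stated in the question).
    return img[2:, :] - img[:-2, :]

# Small made-up image for illustration.
img = np.array([[0, 1, 4],
                [1, 2, 5],
                [4, 5, 8],
                [9, 10, 13]])

result = gradient_x_diff(img)
print(result)
```

The output has two rows fewer than the input; padding the image first would keep the original shape.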

Which Neural Network or Gradient Boosting framework is the simplest for Custom Loss Functions?

I need to implement a custom loss function. The function is relatively simple: $$-\sum \limits_{i=1}^m [O_{1,i} \cdot y_i-1] \ \cdot \ \operatorname{ReLu}(O_{1,i} \cdot \hat{y_i} - 1)$$ With $O$ being some external attribute specific to each case. I was initially working with LightGBM, but I only found tutorials that included calculating the hessian and the gradient. If there is a way to add the function without this please correct me. Otherwise I am open to using other libraries. PyTorch-Fastai, Tensorflow-keras, catboost, …
Category: Data Science
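For the loss in the question above, the per-sample gradient and Hessian with respect to the prediction can be written down by hand. The sketch below is framework-agnostic plain Python with made-up data; the function names are hypothetical and no particular library's callback signature is implied. Because the loss is piecewise linear in $\hat{y}_i$, its second derivative is zero wherever it is defined, which is worth keeping in mind for boosting methods that rely on the Hessian.

```python
def relu(z):
    return z if z > 0 else 0.0

def custom_loss(o, y, y_hat):
    # -sum_i (O_i * y_i - 1) * ReLU(O_i * y_hat_i - 1)
    return -sum((oi * yi - 1.0) * relu(oi * yh - 1.0)
                for oi, yi, yh in zip(o, y, y_hat))

def custom_grad_hess(o, y, y_hat):
    # d loss / d y_hat_i = -(O_i * y_i - 1) * O_i  if O_i * y_hat_i > 1, else 0
    grad = [-(oi * yi - 1.0) * oi if oi * yh > 1.0 else 0.0
            for oi, yi, yh in zip(o, y, y_hat)]
    # Piecewise linear in y_hat, so the second derivative is 0 away from the kink.
    hess = [0.0] * len(o)
    return grad, hess

# Hypothetical data for illustration only.
o = [2.0, 1.5, 3.0]
y = [1.0, 0.0, 1.0]
y_hat = [0.8, 0.4, 0.2]

grad, hess = custom_grad_hess(o, y, y_hat)
print(custom_loss(o, y, y_hat), grad, hess)
```

In an autodiff framework only `custom_loss` would need to be written, since the gradient is derived automatically.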

How does a batch normalization layer resolve the vanishing gradient problem?

According to this article: https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484 the vanishing gradient problem occurs when using the sigmoid activation function because sigmoid maps a large input space into a small one, so the gradient for large values will be close to zero. The article suggests using a batch normalization layer. I can't understand how it can work. When using normalization, big values still become big values in another range (instead of [-inf, inf] they will be in [0..1] or [-1..1]), so in the same cases the values …
Category: Data Science
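A small numeric experiment may help with the question above. The sketch below uses made-up pre-activation values and a simplified normalization (zero mean, unit variance over the batch, ignoring the learnable scale and shift that a real batch normalization layer has). It compares the sigmoid derivative before and after standardizing the values.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical pre-activations for one unit across a batch of four samples.
pre_activations = [-12.0, -8.0, 9.0, 15.0]

mean = sum(pre_activations) / len(pre_activations)
var = sum((z - mean) ** 2 for z in pre_activations) / len(pre_activations)
standardized = [(z - mean) / math.sqrt(var + 1e-5) for z in pre_activations]

raw_grads = [sigmoid_grad(z) for z in pre_activations]
norm_grads = [sigmoid_grad(z) for z in standardized]

print([f"{g:.2e}" for g in raw_grads])
print([f"{g:.2e}" for g in norm_grads])
```

The standardized values sit near zero, where the sigmoid is steepest, which is why the derivatives in the second list are orders of magnitude larger than in the first.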

Vanishing Gradient vs Exploding Gradient as Activation function?

ReLU is used as an activation function that serves two purposes: breaking linearity in a DNN and helping to handle the vanishing gradient problem. For the exploding gradient problem, we use the gradient clipping approach, where we set a maximum threshold for the gradient, similar to ReLU, which sets the minimum gradient limit to 0. From what I have read, ReLU is considered an activation function. In a similar fashion, can we also use gradient clipping as an activation function? If yes, any pros …
Category: Data Science
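To make the two operations in the question above concrete, the sketch below places them side by side in plain Python: ReLU transforms activations in the forward pass, whereas norm-based clipping rescales an already computed gradient vector before the weight update. The gradient values and the threshold `max_norm` are made up for illustration.

```python
import math

def relu(z):
    # Applied to activations during the forward pass.
    return max(0.0, z)

def clip_by_norm(grads, max_norm):
    # Applied to gradients after the backward pass: rescale the whole vector
    # if its L2 norm exceeds max_norm, keeping its direction unchanged.
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

activations = [relu(z) for z in (-2.0, 0.5)]

grads = [3.0, -4.0, 12.0]  # hypothetical gradient with L2 norm 13
clipped = clip_by_norm(grads, max_norm=1.0)

print(activations)
print(clipped)
```

Clipping here acts on the optimizer's input rather than on the network's output, which is the main structural difference from an activation function.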

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.