Derivative of Loss wrt bias term

I read this and have an ambiguity: I am trying to understand how to calculate the derivative of the loss w.r.t. the bias. In this question, we have this definition: np.sum(dz2, axis=0, keepdims=True). Then in Casper's comment, he said that the derivative of L (loss) w.r.t. b is the sum of the rows $$ \frac{\partial L}{\partial Z} \times \mathbf{1} = \begin{bmatrix} \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot \end{bmatrix} \begin{bmatrix} 1\\ 1\\ 1 \end{bmatrix} $$ But actually, using axis=0, is it not …
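A minimal numpy sketch of the two operations being compared, under the assumed convention that dz2 has one row per sample and one column per unit (the layout is not stated in the excerpt); whether the two agree depends entirely on which dimension indexes the samples:

```python
import numpy as np

# dz2: assumed shape (n_samples, n_units); this layout is an assumption, not from the question.
dz2 = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])

# axis=0 sums over the sample dimension, one total per unit -> shape (1, n_units)
db_axis0 = np.sum(dz2, axis=0, keepdims=True)

# Multiplying by a column of ones instead sums across the columns of dz2 -> shape (n_samples, 1)
db_times_ones = dz2 @ np.ones((dz2.shape[1], 1))

print(db_axis0)        # [[5. 7. 9.]]
print(db_times_ones)   # [[ 6.] [15.]]
```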
Category: Data Science

How to compute the backpropagation gradient with the chain rule using vector/matrix differentials?

I have some problems computing the derivative of the sum-of-squares error in a backprop neural network. For example, we have a neural network as in the picture. For drawing simplicity, I've dropped the sample indexes. Conventions: x - data-set input. W - weight matrix. v - vector of the product W*x. F - activation function vector. y - vector of activated data. D - vector of answers. e - error signal. A lower index is a variable, (NxN) is the dimensionality, and a higher [index] …
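For a single layer with a sum-of-squares error, a small numpy sketch of the gradient being asked about might look like this (the shapes and the tanh activation are illustrative assumptions, not taken from the excerpt):

```python
import numpy as np

# Assumed single layer: v = W @ x, y = F(v) with F = tanh, error e = y - D,
# and sum-of-squares loss E = 0.5 * sum(e**2). All shapes are illustrative.
N = 3
rng = np.random.default_rng(0)
W = rng.normal(size=(N, N))
x = rng.normal(size=(N, 1))
D = rng.normal(size=(N, 1))

v = W @ x
y = np.tanh(v)
e = y - D

# Chain rule: dE/dW = (dE/dy * dy/dv) x^T, an outer product of the
# backpropagated error signal with the layer input.
delta = e * (1.0 - y**2)      # elementwise dE/dv
dE_dW = delta @ x.T           # shape (N, N), same as W

# Finite-difference check of one entry
eps = 1e-6
W2 = W.copy(); W2[0, 1] += eps
num = (0.5 * np.sum((np.tanh(W2 @ x) - D)**2) - 0.5 * np.sum(e**2)) / eps
print(dE_dW[0, 1], num)       # the two should agree to ~1e-5
```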
Category: Data Science

1st order Taylor Series derivative calculation for autoregressive model

I wrote a blog post where I calculated the Taylor series of an autoregressive function. It is not strictly the Taylor series, but some variant (I guess). I'm mostly concerned about whether the derivatives look okay. I noticed I made a mistake and fixed the issue. It seemed simple enough, but after finding an error, I started to doubt myself. $$f(t+1) = w_{t+1} \cdot f(t) $$ $$y^{*}_{t+1} = f(t+1)-{\frac {f'(t+1)}{1!}}(-t-1+t)$$ $$y^{*}_{t+1} = w_{t+1} f(t) + \dfrac{d}{df(t)}w_{t+1}f(t) + \dfrac{d}{dw_{t+1}}w_{t+1}f(t)$$ $$y'_{t+1} = w_{t+1} …
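The individual derivatives in the expansion can be sanity-checked symbolically; a quick sympy sketch (the symbol names mirror the post's $w_{t+1}$ and $f(t)$ and are otherwise arbitrary):

```python
import sympy as sp

# Treat w_{t+1} and f(t) as independent symbols and differentiate their product,
# which is what the partial-derivative terms in the expansion reduce to.
w_next, f_t = sp.symbols('w_next f_t')
f_next = w_next * f_t           # f(t+1) = w_{t+1} * f(t)

print(sp.diff(f_next, f_t))     # w_next : d/df(t)    [w_{t+1} f(t)]
print(sp.diff(f_next, w_next))  # f_t    : d/dw_{t+1} [w_{t+1} f(t)]
```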
Category: Data Science

Maximum Entropy Policy Gradient Derivation

I am reading through the paper Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review by Sergey Levine. I am having difficulty understanding this part of the derivation of Maximum Entropy Policy Gradients (Section 4.1). Note that in the above derivation, the term $\mathcal{H}(q_\theta(a_t|s_t))$ should have been $\log q_\theta(a_t|s_t)$, and that log refers to log base e (i.e. the natural logarithm). In the first line of the gradient, it should have been $r(s_t,a_t) - \log q_\theta(a_t|s_t)$. In particular, I …
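As a concrete reading of that term, a tiny numpy sketch of a Monte Carlo estimate of the entropy-regularized return $\sum_t \big(r(s_t,a_t) - \log q_\theta(a_t|s_t)\big)$; the arrays are made-up placeholders, not values from the paper:

```python
import numpy as np

# Made-up rollout of length T: rewards and the log-probabilities the policy q_theta
# assigned to the actions it actually took.
rewards   = np.array([1.0, 0.5, 2.0])
log_probs = np.array([-0.2, -1.1, -0.7])   # log q_theta(a_t | s_t)

# Soft (maximum-entropy) return: each step's reward plus the entropy bonus -log q.
soft_return = np.sum(rewards - log_probs)
print(soft_return)   # 5.5
```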
Category: Data Science

Adding a group specific penalty to binary cross-entropy

I want to implement a custom Keras loss function that consists of plain binary cross-entropy plus a penalty that increases the loss for false negatives from one class (each observation can belong to one of two classes, privileged and unprivileged) and decreases the loss for true positives from that same class. My implementation so far can be seen below. Unfortunately, it does not work yet because, as you can see, I simply add the penalty to the binary cross-entropy, and …
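One way such a loss is sometimes sketched is to add the group-specific terms per sample alongside the cross-entropy; a hedged sketch, where the group flag is assumed to be passed as an extra column of y_true (that convention, and every name below, is illustrative rather than taken from the question):

```python
from tensorflow import keras

def make_group_penalized_bce(fn_weight=1.0, tp_weight=1.0):
    # Hypothetical convention: y_true carries two columns, the 0/1 label and a 0/1
    # "privileged group" flag. All names and this convention are assumptions.
    def loss(y_true_with_group, y_pred):
        y_true = y_true_with_group[:, 0:1]
        group = y_true_with_group[:, 1:2]              # 1 for the privileged group
        bce = keras.backend.binary_crossentropy(y_true, y_pred)
        fn_term = group * y_true * (1.0 - y_pred)      # grows when a privileged positive is predicted low (false negative)
        tp_term = group * y_true * y_pred              # grows when a privileged positive is predicted high (true positive)
        return keras.backend.mean(bce + fn_weight * fn_term - tp_weight * tp_term, axis=-1)
    return loss

# model.compile(optimizer="adam", loss=make_group_penalized_bce(fn_weight=2.0, tp_weight=0.5))
```

Note that subtracting the true-positive term can push the total below zero, which is one reason a flat additive penalty needs the two weights tuned carefully.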
Category: Data Science

Understanding the step of SGD for binary classification

I cannot understand the SGD step for binary classification. For example, we have $y$ - the true labels $\in \{0,1\}$ - and $p=f_\theta(x)$ - the predicted labels $\in [0,1]$. Then the SGD update step is the following: $\Theta' \leftarrow \Theta - \nu \frac{\partial L(y,f_\theta(x))}{\partial \Theta}$, where $L$ is the loss function. Then follows the replacement that I cannot understand: $\Theta' \leftarrow \Theta - \nu \left.\frac{\partial L(y,p)}{\partial p}\right|_{p=f_\theta(x)} \frac{\partial f_\theta(x)}{\partial \Theta}$. Why do we need to take the derivative with respect to $p$? Why …
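A scalar numpy sketch of that factorization; the model ($p = \sigma(\theta \cdot x)$) and the cross-entropy loss are illustrative assumptions, and the point is only that $\partial L/\partial \theta$ splits into $\partial L/\partial p$ evaluated at $p=f_\theta(x)$ times $\partial f_\theta(x)/\partial \theta$:

```python
import numpy as np

# Illustrative choice: p = sigmoid(theta . x) and L = binary cross-entropy.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.0])
x, y, nu = np.array([2.0, 1.0]), 1.0, 0.1

p = sigmoid(theta @ x)                   # p = f_theta(x)
dL_dp = -(y / p - (1 - y) / (1 - p))     # derivative of the loss w.r.t. p, evaluated at p = f_theta(x)
dp_dtheta = p * (1 - p) * x              # derivative of the model output w.r.t. theta
grad = dL_dp * dp_dtheta                 # chain rule: dL/dtheta

theta_new = theta - nu * grad            # one SGD step
print(grad, theta_new)
```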
Category: Data Science

Loss function for points inside polygon

I am trying to optimize some parameters that are used to transform 2D points from one place to another (you may think of them as rotation and translation parameters, for simplicity). The parameters are considered optimal if the transformed points lie inside a pre-defined convex polygon. Otherwise, the parameters should be adjusted until all points lie inside that polygon. I do not care how the points are arranged inside the polygon; my only concern is that they are inside. How can I …
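One candidate loss (a sketch, not necessarily what the asker needs) describes the convex polygon as an intersection of half-planes $Ax \le b$ and penalizes only constraint violations, so the loss is zero exactly when every transformed point is inside:

```python
import numpy as np

# Sketch: penalty for points outside a convex polygon given as half-planes A x <= b.
# A and b are assumed inputs describing the polygon, not from the question.
def outside_polygon_loss(points, A, b):
    # points: (N, 2), A: (M, 2), b: (M,)
    violations = points @ A.T - b            # > 0 where a half-plane constraint is violated
    hinge = np.maximum(violations, 0.0)      # zero once every point satisfies every constraint
    return np.sum(hinge ** 2)                # squared hinge keeps the gradient smooth at the boundary

# Example: unit square as the intersection of four half-planes (0 <= x <= 1, 0 <= y <= 1)
A = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)
b = np.array([1, 0, 1, 0], dtype=float)
print(outside_polygon_loss(np.array([[0.5, 0.5], [1.5, 0.2]]), A, b))   # 0.25: only the second point is outside
```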
Category: Data Science

Batch normalization backpropagation doubts

I have recently studied the batch normalization layer and its backpropagation process, using as my main sources the original paper and this website showing part of the derivation process, but there is a step in the part that isn't covered there that I don't really understand. Namely, using the website's notation, it is the computation of: $$ \frac{\partial \widehat{x}_i}{\partial x_i} = \frac{\partial}{\partial x_i} \frac{x_i - \mu}{\sqrt{\sigma^2+\epsilon}} = \frac{1}{\sqrt{\sigma^2+\epsilon}} $$ Applying the quotient rule, I expected the following (since $\mu$ and …
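A quick finite-difference sketch of the distinction at stake, i.e. whether $\mu$ and $\sigma^2$ are held fixed when differentiating with respect to $x_i$ (the numbers below are made up):

```python
import numpy as np

# Compare d x_hat_i / d x_i when mu and sigma^2 are frozen versus recomputed from x.
eps = 1e-5
x = np.array([1.0, 2.0, 3.0, 4.0])
i, h = 0, 1e-6

def x_hat(x):
    mu, var = x.mean(), x.var()
    return (x - mu) / np.sqrt(var + eps)

# Full derivative: mu and var change when x_i changes
x_plus = x.copy(); x_plus[i] += h
full = (x_hat(x_plus)[i] - x_hat(x)[i]) / h

# "Constants" derivative: freeze mu and var at their original values
mu, var = x.mean(), x.var()
frozen = ((x_plus[i] - mu) / np.sqrt(var + eps) - (x[i] - mu) / np.sqrt(var + eps)) / h

print(full, frozen, 1 / np.sqrt(var + eps))  # frozen matches 1/sqrt(var+eps); full does not
```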
Category: Data Science

SVM - Making sense of distance derivation

I am studying the math behind SVM. The following question is about a small but important detail in the SVM derivation. The question: why can the distance between the hyperplane $w \cdot x+b=0$ and a data point $p$ (in vector form), $d = \frac{w \cdot p + b}{||w||}$, be simplified to $d = \frac{1}{||w||}$? My argument: since the data point $p$ is not on the hyperplane, we have $w \cdot p+b=k$ with $k \ne 0$. Then $d=\frac{k}{||w||}$, but $k$ is not a constant, as it depends …
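A small numpy check of the two formulas under the usual SVM rescaling convention, where $w$ and $b$ are scaled so that $w \cdot p + b = 1$ for the closest (support) point; the example values are made up:

```python
import numpy as np

# Made-up hyperplane and a point chosen so that w.p + b = 1 (the canonical SVM scaling
# for a support vector); under that convention k = 1 by construction.
w = np.array([3.0, 4.0])
b = -2.0
p = np.array([1.0, 0.0])          # w @ p + b = 3 - 2 = 1

k = w @ p + b
d_general = k / np.linalg.norm(w)         # k / ||w||
d_canonical = 1.0 / np.linalg.norm(w)     # 1 / ||w||, valid only after the rescaling

print(k, d_general, d_canonical)  # 1.0 0.2 0.2
```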
Category: Data Science

Is it valid to use numpy.gradient to find slope of line as well as slope of curve at any point?

What is the difference between the slope of a line and the slope of a curve? Is it valid to use numpy.gradient to find the slope of a line and the slope of a curve at any point? Slope of a line at any point: tanθ = (y2 - y1)/(x2 - x1). Slope of a curve at any point: tanθ = dy/dx. Is it valid to use numpy's np.gradient() to get both the slope of a curve and the slope of a line, or is it meant only to find the slope of a line? Reference slope …
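A short numpy check: np.gradient just applies finite differences to whatever samples it is given, so it estimates the slope in both cases.

```python
import numpy as np

# np.gradient applies finite differences (central in the interior, one-sided at the ends)
# to sampled values, so it estimates dy/dx for any sampled function, straight or curved.
x = np.linspace(0.0, 4.0, 9)

line = 3 * x + 1
print(np.gradient(line, x))    # ~3 everywhere: the slope of the line

curve = x ** 2
print(np.gradient(curve, x))   # ~2x at each sample: the slope of the curve at that point
```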
Category: Data Science

Deriving vectorized form of linear regression

We first have a D-dimensional weight vector $w$ and a D-dimensional predictor vector $x$, both indexed by $j$. There are $N$ observations, all D-dimensional. $t$ is our targets, i.e., the ground-truth values. We then derive the cost function as follows: We then compute the partial derivative of $\varepsilon$ with respect to $w_j$: I'm confused as to where the $j'$ is coming from and what it would represent. We then write it as: Then, …
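For context, here is a numpy sketch comparing the per-component and vectorized gradients under the usual squared-error cost $\varepsilon = \frac{1}{2N}\sum_n (w^\top x^{(n)} - t^{(n)})^2$ (an assumption, since the excerpt's cost is not shown); the $j'$ is just the dummy index of the sum inside the prediction, kept distinct from the $j$ being differentiated with respect to:

```python
import numpy as np

# Assumed cost (not shown in the excerpt): eps = 1/(2N) * sum_n (w . x_n - t_n)^2.
rng = np.random.default_rng(0)
N, D = 5, 3
X = rng.normal(size=(N, D))     # N observations, D features
t = rng.normal(size=N)
w = rng.normal(size=D)

# Component form: loop over j, with the inner sum over j' written out explicitly
grad_components = np.array([
    np.mean([(sum(w[jp] * X[n, jp] for jp in range(D)) - t[n]) * X[n, j] for n in range(N)])
    for j in range(D)
])

# Vectorized form of the same gradient
grad_vectorized = X.T @ (X @ w - t) / N

print(np.allclose(grad_components, grad_vectorized))   # True
```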
Category: Data Science

Why is it valid to remove a constant factor from the derivative of an error function?

I was reading the book 'Make Your Own Neural Network' by Tariq Rashid. In it, he said (note: he's talking about ordinary feed-forward neural networks): $t_k$ is the target value at node $k$, $O_k$ is the predicted output at node $k$, $W_{jk}$ is the weight connecting nodes $j$ and $k$, and $E$ is the error at node $k$. Then he says that we can remove the 2 because we only care about the …
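The reason dropping the constant is harmless can be checked numerically: scaling the error by a constant scales every component of the gradient by the same constant, so the descent direction is unchanged and the factor is absorbed by the learning rate. A hedged sketch with a toy squared error at one output node (the linear model below is illustrative):

```python
import numpy as np

# Toy squared error at one output node: E = (t - o)^2 with o = w * x (illustrative model).
t, x, w = 1.0, 2.0, 0.3
o = w * x

grad_with_2 = -2 * (t - o) * x     # d/dw of (t - o)^2
grad_without = -(t - o) * x        # same expression with the constant 2 dropped

print(grad_with_2, grad_without)                       # -1.6 -0.8: same sign and direction
print(np.sign(grad_with_2) == np.sign(grad_without))   # True: the factor only rescales the step size
```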
Category: Data Science

A Derivation in Combinatory Categorial Grammar

I am reading about CCG on page 23 of Speech and Language Processing. There is a derivation as follows: (VP/PP)/NP , VP\((VP/PP)/NP) => VP? Can anyone explain this, please? This would make sense if VP\((VP/PP)/NP) were equivalent to (VP\(VP/PP))/NP and (VP/PP)/NP were equivalent to VP/(PP/NP). But these equivalences seem at least non-trivial from the text! Any help would be greatly appreciated. CS
Topic: derivation nlp
Category: Data Science

back propagation through time derivation issue

I read several posts about BPTT for RNNs, but I am actually a bit confused about one step in the derivation. Given $$h_t=f(b+Wh_{t-1}+Ux_t)$$ when we compute $\frac{\partial h_t}{\partial W}$, does anyone know why it is simply $$\frac{\partial h_t}{\partial W}=\frac{\partial h_{t}}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial W}$$ and not $$\frac{\partial h_t}{\partial W}=\frac{\partial h_{t}}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial W}+\frac{\partial h_t}{\partial f}h_{t-1}$$ ? What I mean is, since $h_t$ depends on $W$ both directly and through $h_{t-1}$, why is the second (direct) term missing from the first expression? Thank you!
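A scalar finite-difference sketch of the two contributions the question distinguishes (the direct use of $W$ at step $t$ versus the indirect path through $h_{t-1}$); all numbers below are made up, and $f=\tanh$ is an assumed choice:

```python
import numpy as np

# Scalar RNN: h_t = tanh(b + W*h_{t-1} + U*x_t); f = tanh is an assumed choice.
b, W, U = 0.1, 0.5, 0.3
x1, x2, h0 = 1.0, -2.0, 0.7

def unroll(W):
    h1 = np.tanh(b + W * h0 + U * x1)
    h2 = np.tanh(b + W * h1 + U * x2)
    return h1, h2

h1, h2 = unroll(W)

# Numerical total derivative dh2/dW
eps = 1e-6
numeric = (unroll(W + eps)[1] - h2) / eps

# The two analytic contributions: direct (W used at step 2) and indirect (through h1)
direct   = (1 - h2**2) * h1
indirect = (1 - h2**2) * W * (1 - h1**2) * h0
print(numeric, direct + indirect)   # the total derivative needs both terms
```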
Category: Data Science

Doubt in Derivation of Backpropagation

I was going through the derivation of the backpropagation algorithm provided in this document (added just for reference). I have a doubt at one specific point in this derivation. The derivation goes as follows. Notation: the subscript $k$ denotes the output layer; the subscript $j$ denotes the hidden layer; the subscript $i$ denotes the input layer; $w_{kj}$ denotes a weight from the hidden to the output layer; $w_{ji}$ denotes a weight from the input to the hidden layer; $a$ denotes an activation …
Category: Data Science
