Batch normalization backpropagation doubts
I have recently been studying the batch normalization layer and its backpropagation, using as my main sources the original paper and this website, which shows part of the derivation. There is one step, not covered there, that I don't really understand. Using the website's notation, it is the computation of

$$ \frac{\partial \widehat{x}_i}{\partial x_i} = \frac{\partial}{\partial x_i} \frac{x_i - \mu}{\sqrt{\sigma^2+\epsilon}} = \frac{1}{\sqrt{\sigma^2+\epsilon}}. $$

Applying the quotient rule, I expected the following (since $\mu$ and $\sigma^2$ are functions of $x_i$):

$$ \frac{\partial \widehat{x}_i}{\partial x_i} = \frac{1}{\sigma^2+\epsilon} \left[\frac{\partial (x_i-\mu)}{\partial x_i}\,\sqrt{\sigma^2+\epsilon} - (x_i - \mu)\,\frac{\partial}{\partial x_i} \sqrt{\sigma^2+\epsilon} \right] = \frac{1}{\sigma^2+\epsilon}\left[\left(1-\frac{1}{N}\right)\sqrt{\sigma^2+\epsilon} - \frac{x_i - \mu}{2\sqrt{\sigma^2+\epsilon}}\,\frac{\partial \sigma^2}{\partial x_i} \right]. $$

Then, substituting

$$ \frac{\partial \sigma^2}{\partial x_i} = \frac{2(x_i - \mu)}{N}, $$

my result was

$$ \frac{\partial \widehat{x}_i}{\partial x_i} = \frac{1}{\sigma^2+\epsilon} \left[\frac{N-1}{N}\sqrt{\sigma^2+\epsilon} - \frac{(x_i-\mu)^2}{N\sqrt{\sigma^2+\epsilon}} \right]. $$

I have been stuck on this for a few days and can't see what is wrong with my derivation. I have also tried to manipulate the expression above algebraically, hoping it would just simplify, but it didn't work. Thanks in advance for the help.
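To make the discrepancy concrete, here is a small numerical sketch (assuming NumPy; the batch values, the index `i`, and the helper `normalize` are only for illustration) that evaluates the website's expression, my expression, and a central finite-difference estimate of $\partial \widehat{x}_i / \partial x_i$ on a tiny batch:

```python
import numpy as np

np.random.seed(0)
N = 5
eps = 1e-5
x = np.random.randn(N)  # one small 1-D batch

def normalize(x):
    mu = x.mean()
    var = x.var()  # biased (1/N) variance, as in the batch norm paper
    return (x - mu) / np.sqrt(var + eps)

mu = x.mean()
var = x.var()
i = 2  # input we differentiate with respect to

# Website's expression: mu and sigma^2 treated as constants
partial_fixed = 1.0 / np.sqrt(var + eps)

# My expression: quotient rule with mu and sigma^2 depending on x_i
my_expr = ((N - 1) / N * np.sqrt(var + eps)
           - (x[i] - mu) ** 2 / (N * np.sqrt(var + eps))) / (var + eps)

# Central finite-difference estimate of d x_hat_i / d x_i,
# recomputing mu and sigma^2 from the perturbed batch
h = 1e-6
x_plus = x.copy();  x_plus[i] += h
x_minus = x.copy(); x_minus[i] -= h
fd = (normalize(x_plus)[i] - normalize(x_minus)[i]) / (2 * h)

print(partial_fixed, my_expr, fd)
```

The three printed numbers show what I mean by the two results not matching.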
Topic derivation batch-normalization backpropagation
Category Data Science