To freeze or not: batch normalisation in ResNet when transfer learning

I'm using a ResNet50 model pretrained on ImageNet to do transfer learning, fitting an image classification task. The easy way of doing this is simply to freeze the conv layers (or really all layers except the final fully connected layer). However, I came across a paper where the authors mention that batch normalisation layers should be fine-tuned when fitting the new model: Few layers such as Batch Normalization (BN) layers shouldn't be frozen because the mean and variance of the …
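
A minimal Keras sketch of that suggestion, assuming a ResNet50 base and a new dense head (the number of classes and the optimizer are placeholders):

    import tensorflow as tf

    base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False, pooling="avg")

    # Freeze everything except the BatchNormalization layers, so their statistics
    # and scale/shift parameters can adapt to the new data distribution.
    for layer in base.layers:
        layer.trainable = isinstance(layer, tf.keras.layers.BatchNormalization)

    num_classes = 10  # placeholder for the new task
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

Note that in TF 2.x Keras, a BatchNormalization layer with trainable=False also runs in inference mode (it keeps using its ImageNet moving statistics), so the fully frozen variant and the BN-trainable variant genuinely behave differently.
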
Category: Data Science

Batch normalization

Part 1: I'm going through this article and wanted to try to calculate a forward and backward pass with batch normalization. When doing the steps after the first layer I get a batch norm output that is equal for all features. Here is the code (I have on purpose done it in very small steps): w = np.array([[0.3, 0.4],[0.5,0.1],[0.2,0.3]]) X = np.array([[0.7,0.1],[0.3,0.8],[0.4,0.6]]) def mu(x,axis=0): return np.mean(x,axis=axis) def sigma(z, mu): Ai = np.sum(z,axis=0) return np.sqrt((1/len(Ai)) * (Ai-mu)**2) def Ai(z): return np.sum(z,axis=0) …
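
For comparison, a minimal NumPy sketch of the standard forward pass, normalizing each feature (column) over the batch rather than summing the activations first; gamma, beta and eps are placeholder values:

    import numpy as np

    def batchnorm_forward(z, gamma, beta, eps=1e-5):
        # z has shape (batch_size, n_features); statistics are taken per feature (axis=0)
        mu = z.mean(axis=0)
        var = z.var(axis=0)
        z_hat = (z - mu) / np.sqrt(var + eps)
        return gamma * z_hat + beta

    # The question's X, treated as a (3, 2) batch of pre-activations
    X = np.array([[0.7, 0.1], [0.3, 0.8], [0.4, 0.6]])
    out = batchnorm_forward(X, gamma=np.ones(2), beta=np.zeros(2))
    print(out)  # each column now has (approximately) zero mean and unit variance
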
Category: Data Science

Should batch normalization make my eval inference so dependent on the batch size?

I am using PyTorch, and the relevant piece of code, from my .forward call, is below: class ModelDense(nn.Module): def __init__(self, raw_features, n, features): super(ModelDense, self).__init__() self.linear_pre = nn.Linear(raw_features, features) self.batchnorm_pre = nn.BatchNorm1d(features) self.tower = ResTowerDense(n, features) self.value_linear1 = nn.Linear(features, features) self.value_batchnorm = nn.BatchNorm1d(features) self.value_linear2 = nn.Linear(features, 1) def forward(self, x, mask0, mask1): y = self.tower(self.batchnorm_pre(self.linear_pre(x))) v = torch.sigmoid(self.value_linear2(self.value_batchnorm(F.relu(self.value_linear1(y))))) Here 'self.tower' is a tower of residual blocks. The output in question is 'v', which is just a sigmoid activation. After training …
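
One common culprit for batch-size-dependent inference (a guess, since the evaluation loop isn't shown) is evaluating while the model is still in training mode, so BatchNorm keeps normalizing with the current batch's statistics instead of its running averages. A minimal sketch of the difference:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    bn = nn.BatchNorm1d(4)

    bn.train()
    for _ in range(200):              # populate the running mean/var during "training"
        bn(torch.randn(32, 4))

    x = torch.randn(2, 4)
    bn.train()
    print(bn(x))                      # uses the statistics of this 2-sample batch
    bn.eval()
    print(bn(x))                      # uses the stored running statistics: independent of batch size
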
Category: Data Science

Why doesn't batch normalization 'zero out' a batch of size one?

I'm using Tensorflow. Consider the example below: >>> x <tf.Tensor: shape=(1,), dtype=float32, numpy=array([-0.22630838], dtype=float32)> >>> tf.keras.layers.BatchNormalization()(x) <tf.Tensor: shape=(1,), dtype=float32, numpy=array([-0.22619529], dtype=float32)> There doesn't seem to be any change at all, besides maybe some perturbation due to epsilon. Shouldn't a normalized sample of size one just be the zero tensor? I figured maybe there was some problem with the fact that the batch size is 1 (the variance is zero in this case, so how do you make the variance 1?). But …
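
A hedged explanation of what the example is hitting: called directly (outside fit()), the Keras layer defaults to training=False, so it normalizes with its freshly initialized moving statistics (mean 0, variance 1), giving roughly x / sqrt(1 + epsilon), which is exactly the tiny perturbation seen above. Passing training=True uses the batch statistics and does zero out a single sample:

    import tensorflow as tf

    x = tf.constant([[-0.22630838]])  # reshaped to (batch=1, features=1) to make the batch axis explicit
    bn = tf.keras.layers.BatchNormalization()

    print(bn(x, training=False))  # ~ x / sqrt(1 + eps): moving mean 0 and moving variance 1 are used
    print(bn(x, training=True))   # ~ [[0.]]: the batch mean/variance of the single sample are used
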
Category: Data Science

Compute gradients in parallel

Here is part of my code: class SimpleNet(nn.Module): def __init__(self): super().__init__() self.linear1 = nn.Linear(2, 1, bias=False) self.linear2 = nn.Linear(1, 2, bias=False) def forward(self, x): z = self.linear1(x) y_pred = self.linear2(z) return y_pred, z model = SimpleNet().cuda() for epoch in range(1): model.train() for i, dt in enumerate(data.trn_dl): optimizer.zero_grad() output = model(dt[0]) loss2 = 0 for j in range(0,len(output[0])): l1 = torch.autograd.grad(output[0][j][0], output[1], create_graph=True)[0][j] l2 = torch.autograd.grad(output[0][j][1], output[1], create_graph=True)[0][j] loss2 = loss2 + abs(torch.sqrt(l1**2+l2**2)-1) loss1 = F.mse_loss(output[0], dt[1]) loss = loss1+loss2 loss.backward() …
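
A sketch of one way to avoid the per-sample Python loop (the "sum trick"): because sample j's output depends only on that sample's intermediate z[j], differentiating the summed outputs with respect to z yields all per-sample gradients in a single autograd call. Names mirror the question; the random data is a placeholder:

    import torch
    import torch.nn as nn

    class SimpleNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.linear1 = nn.Linear(2, 1, bias=False)
            self.linear2 = nn.Linear(1, 2, bias=False)
        def forward(self, x):
            z = self.linear1(x)
            return self.linear2(z), z

    model = SimpleNet()
    y_pred, z = model(torch.randn(8, 2))

    # grad of a sum = batched per-sample grads, since the samples don't interact
    g1 = torch.autograd.grad(y_pred[:, 0].sum(), z, create_graph=True)[0]  # shape [8, 1]
    g2 = torch.autograd.grad(y_pred[:, 1].sum(), z, create_graph=True)[0]  # shape [8, 1]
    loss2 = (torch.sqrt(g1 ** 2 + g2 ** 2) - 1).abs().sum()
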
Category: Data Science

Normalization in production

I am currently writing a machine learning pipeline for my time series application. At the end of each month, I get the data gathered, normalize it to [0, 1], retrain the ML model with the new observation only, and predict future values. Question: Should I be reading the entire dataset each time I get a new observation, normalize the entire dataset, create the ML model, then predict? How I got stuck: let's say I have 1 feature and at t-1 all …
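
One common pattern (a sketch, not the only valid design): fit the scaler once on the training history, persist it next to the model, and only transform new observations with it, refitting scaler and model together on a schedule if the observed range drifts. The file name is a placeholder:

    import joblib
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    history = np.array([[10.0], [12.5], [11.2], [13.8]])    # toy training history, 1 feature

    scaler = MinMaxScaler(feature_range=(0, 1)).fit(history)
    joblib.dump(scaler, "scaler.joblib")                    # persist alongside the trained model

    # ...next month, at prediction time...
    scaler = joblib.load("scaler.joblib")
    new_obs = np.array([[14.1]])
    x = scaler.transform(new_obs)   # may fall outside [0, 1] if the range drifted, a signal to refit
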
Category: Data Science

Equations in "Batch normalization: theory and how to use it with Tensorflow"

I read the article Batch normalization: theory and how to use it with Tensorflow by Federico Peccia. The batch normalized activation is $$ \bar x_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} $$ where $\mu_B = \frac{1}{m} \sum_{i=1}^m x_i$ is the batch mean and $\sigma_B^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2$ is the batch variance. The scaled and shifted activation is $y_i = \gamma \bar x_i + \beta$ where $\gamma$ and $\beta$ are parameters that the neural network learns. After these …
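
The two formulas can be checked numerically with a short NumPy sketch (the batch, gamma and beta are arbitrary example values):

    import numpy as np

    x = np.array([0.2, 1.5, -0.7, 3.1])           # activations of one unit over a batch of m = 4
    gamma, beta, eps = 1.5, 0.3, 1e-5

    mu_B = x.mean()                                # batch mean
    sigma2_B = ((x - mu_B) ** 2).mean()            # batch variance
    x_bar = (x - mu_B) / np.sqrt(sigma2_B + eps)   # normalized activation
    y = gamma * x_bar + beta                       # scaled and shifted activation

    print(x_bar.mean(), x_bar.var())               # ~0 and ~1
    print(y.mean(), y.var())                       # ~beta and ~gamma**2
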
Category: Data Science

Batch normalization backpropagation doubts

I have recently studied the batch normalization layer and its backpropagation process, using as my main sources the original paper and this website showing part of the derivation process, but there is a step that isn't covered there which I don't really understand, namely (using the notation of the website) when computing: $$ \frac{\partial \widehat{x}_i}{\partial x_i} = \frac{\partial}{\partial x_i} \frac{x_i - \mu}{\sqrt{\sigma^2+\epsilon}} = \frac{1}{\sqrt{\sigma^2+\epsilon}} $$ Applying the quotient rule I expected the following (since $\mu$ and …
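
For reference, the quoted step is taken with $\mu$ and $\sigma^2$ held fixed; their own dependence on the inputs enters through separate branches of the chain rule. If instead the full derivative is taken in one go (with $m$ the batch size and $\delta_{ij}$ the Kronecker delta), the standard result is
$$ \frac{\partial \widehat{x}_i}{\partial x_j} = \frac{\delta_{ij} - \frac{1}{m}}{\sqrt{\sigma^2+\epsilon}} - \frac{(x_i - \mu)(x_j - \mu)}{m\,(\sigma^2+\epsilon)^{3/2}}, $$
and the branch-by-branch derivation recombines to the same expression.
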
Category: Data Science

Poor CNN performance after implementing BatchNormalization

I am training a CNN to classify malware images from a dataset named Malimg. Before implementing the BatchNormalization layer, I was getting an accuracy of 95.57% (see below for the graph of loss/accuracy and validation loss/accuracy): Epoch 1/10 6537/6537 [==============================] - 53s 8ms/step - loss: 1.7711 - accuracy: 0.4605 - val_loss: 1.0062 - val_accuracy: 0.6510 Epoch 2/10 6537/6537 [==============================] - 52s 8ms/step - loss: 0.8739 - accuracy: 0.7150 - val_loss: 0.4965 - val_accuracy: 0.8426 Epoch 3/10 6537/6537 [==============================] - 52s …
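
Without the full architecture it is hard to diagnose, but a useful baseline to compare against is the common Conv → BatchNorm → ReLU ordering, with the conv bias dropped since BatchNorm makes it redundant. A minimal Keras sketch (input shape and filter counts are placeholders, assuming the 25 Malimg families):

    import tensorflow as tf
    from tensorflow.keras import layers

    num_classes = 25  # Malimg malware families
    model = tf.keras.Sequential([
        layers.Conv2D(32, 3, padding="same", use_bias=False, input_shape=(64, 64, 1)),
        layers.BatchNormalization(),   # normalize the pre-activation, then apply the nonlinearity
        layers.ReLU(),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
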
Category: Data Science

Explanation of Karpathy tweet about common mistakes. #5: "you didn't use bias=False for your Linear/Conv2d layer when using BatchNorm"

I recently found this twitter thread from Andrej Karpathy. In it he states a few common mistakes made during the development of a neural network: you didn't try to overfit a single batch first. you forgot to toggle train/eval mode for the net. you forgot to .zero_grad() (in PyTorch) before .backward(). you passed softmaxed outputs to a loss that expects raw logits. you didn't use bias=False for your Linear/Conv2d layer when using BatchNorm, or conversely forget to include it for the output …
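
The reasoning behind mistake #5, with a sketch: BatchNorm subtracts the per-channel mean right after the layer, so any constant bias the Linear/Conv2d adds is immediately cancelled, and BatchNorm's own learnable beta already provides a shift. The bias is therefore wasted parameters rather than harmful:

    import torch.nn as nn

    block = nn.Sequential(
        # The conv bias would be subtracted away by BatchNorm's per-channel mean,
        # and BatchNorm's beta supplies the learnable shift instead, so drop it.
        nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
    )
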
Category: Data Science

Batch normalization for image CNN - Why not use the mean of the entire batch?

Question: For a CNN recognizing images, why not use the entire batch of data, instead of per-feature statistics, to calculate the mean in Batch Normalization? When each feature is independent, per-feature statistics are needed. However, the features (pixels) of images having RGB channels with 8-bit color for a CNN are related. If there are 256 pixels in the R channel of an image, 255 for pixel i and 255 for pixel j are both white, meaning the same intensity(?) in …
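
For context, what BatchNorm actually computes for images (a small check, not by itself an answer to the "why"): one mean and variance per channel, taken over the batch and both spatial dimensions, rather than per pixel or over the whole batch tensor:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.randn(8, 3, 32, 32)                  # (batch, channels, height, width)

    bn = nn.BatchNorm2d(3, affine=False)           # no gamma/beta, to expose the raw normalization
    out = bn(x)

    manual = (x - x.mean(dim=(0, 2, 3), keepdim=True)) / torch.sqrt(
        x.var(dim=(0, 2, 3), unbiased=False, keepdim=True) + bn.eps
    )
    print(torch.allclose(out, manual, atol=1e-5))  # True: one mean/var per channel over N, H, W
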
Category: Data Science

Sequential batch processing vs parallel batch processing?

In deep learning based model training, a batch of inputs is generally passed. For example, for training a deep learning model with a [512]-dimensional input feature vector, say for batch size = 4, we mainly pass a [4, 512]-dimensional input. I am curious what the logical significance is of passing the same input after flattening it across the batch and channel dimensions to [2048]. Logically the locality structure will be destroyed, but will it significantly speed up my implementation? And can it …
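
A small sketch of why the batch dimension is already processed in parallel, and why flattening [4, 512] into [2048] changes the problem rather than merely the speed (shapes follow the question):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    layer = nn.Linear(512, 128)
    x = torch.randn(4, 512)

    batched = layer(x)                                      # one vectorized call over the whole batch
    looped = torch.stack([layer(sample) for sample in x])   # logically the same, just slower
    print(torch.allclose(batched, looped, atol=1e-6))       # True

    # Flattening mixes the four samples into one 2048-dim vector: the layer would
    # need in_features=2048 and every output would depend on all four inputs.
    flat = x.reshape(1, 2048)
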
Category: Data Science

Batch normalization for multiple datasets?

I am working on a task of generating synthetic data to help the training of my model. This means that the training is performed on synthetic + real data, and tested on real data. I was told that batch normalization layers might be trying to find weights that are good for all while training, which is a problem since the distribution of my synthetic data is not exactly equal to the distribution of the real data. So, the idea would …
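
One way to make the idea concrete is "domain-specific" batch normalization: share all other weights but keep separate BN statistics for synthetic and real batches. A hypothetical sketch (the module name and flag are made up for illustration):

    import torch
    import torch.nn as nn

    class DomainBatchNorm(nn.Module):
        """Separate BN statistics per data source; everything around it stays shared."""
        def __init__(self, num_features):
            super().__init__()
            self.bn_real = nn.BatchNorm1d(num_features)
            self.bn_synth = nn.BatchNorm1d(num_features)

        def forward(self, x, is_synthetic):
            return self.bn_synth(x) if is_synthetic else self.bn_real(x)

    # Feed synthetic and real batches separately and pass the flag along;
    # at test time only bn_real and its running statistics are used.
    layer = DomainBatchNorm(16)
    out = layer(torch.randn(8, 16), is_synthetic=True)
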
Category: Data Science

Does Batch Normalization make sense for a ReLU activation function?

Batch Normalization is described in this paper as a normalization of the input to an activation function with scale and shift variables $\gamma$ and $\beta$. This paper mainly describes using the sigmoid activation function, which makes sense. However, it seems to me that feeding an input from the normalized distribution produced by the batch normalization into a ReLU activation function of $\max(0,x)$ is risky if $\beta$ does not learn to shift most of the inputs past 0 such that the …
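
A quick numerical illustration of the concern: with the default initialization ($\gamma=1$, $\beta=0$) roughly half of the normalized pre-activations are negative, so ReLU zeroes about half of the units; a learned positive $\beta$ shifts that fraction down:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.randn(10_000, 1) * 3 + 5              # arbitrary un-normalized pre-activations

    bn = nn.BatchNorm1d(1)                          # gamma initialized to 1, beta to 0
    print((torch.relu(bn(x)) == 0).float().mean())  # ~0.5: about half the units are zeroed

    with torch.no_grad():
        bn.bias.fill_(1.0)                          # pretend beta has learned a positive shift
    print((torch.relu(bn(x)) == 0).float().mean())  # ~0.16: far fewer zeroed units
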
Category: Data Science

How batch normalization layer resolve the vanishing gradient problem?

According to this article: https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484 The vanishing gradient problem occurs when using the sigmoid activation function, because sigmoid maps a large input space into a small space, so the gradient of big values will be close to zero. The article suggests using a batch normalization layer. I can't understand how that works. When using normalization, big values still get big values in another scope (instead of [-inf, inf] they will get [0..1] or [-1..1]), so in the same cases the values …
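
A tiny numerical illustration of the mechanism (not a complete answer): the sigmoid's derivative is $\sigma(x)(1-\sigma(x))$, which is near zero for large $|x|$, and batch normalization keeps the pre-activations in the region where that derivative is still appreciable (the learned $\gamma$ and $\beta$ can move them, but they start at 1 and 0):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)

    for v in [0.0, 1.0, 5.0, 15.0]:
        print(v, sigmoid_grad(v))
    # 0.0  -> 0.25
    # 1.0  -> ~0.20
    # 5.0  -> ~0.0066
    # 15.0 -> ~3e-7   (gradients this small vanish after a few layers)
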
Category: Data Science

Using batchnorm and dropout simultaneously?

I am a bit confused about the relation between the terms "Dropout" and "BatchNorm". As I understand it, Dropout is a regularization technique that is used only during training. BatchNorm is a technique used to accelerate training, improve accuracy, etc. But I have also seen some conflicting opinions on the question: is BatchNorm a regularization technique? So can somebody please answer some questions: Is BatchNorm a regularization technique? Why? Should we use BatchNorm only during the training process? Why? Can we use Dropout and BatchNorm simultaneously? …
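
Whatever position one takes on whether BatchNorm "counts" as regularization, in practice both layers stay in the model at test time and only change behaviour through train()/eval(); a minimal sketch of using them together:

    import torch
    import torch.nn as nn

    net = nn.Sequential(
        nn.Linear(16, 32),
        nn.BatchNorm1d(32),
        nn.ReLU(),
        nn.Dropout(p=0.5),
        nn.Linear(32, 2),
    )

    x = torch.randn(8, 16)
    net.train()   # Dropout masks units; BatchNorm uses batch statistics and updates running stats
    y_train = net(x)
    net.eval()    # Dropout is a no-op; BatchNorm uses its stored running statistics
    y_eval = net(x)
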
Category: Data Science

Can Batch Normalization replace tanh in RNN?

Question: Can Batch Normalization (BN) be inserted in an RNN after $x_t@W_{xh}$ and after $h_{t-1}@W_{hh}$, to remove $f=\tanh$ and the bias $b_h$? If possible, will this eliminate both the exploding and vanishing gradient problems? I believe the effect of tanh, which squashes values from $(-\infty, +\infty)$ into $(-1, 1)$, can be replaced with the standardization in BN, and that it makes the bias unnecessary at $x_t@W_{xh}$ and $h_{t-1}@W_{hh}$. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift The auto differentiation of …
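
A sketch of where the question proposes to insert BN in a vanilla RNN cell (this mirrors the proposal, not a claim that it works; note that published recurrent batch normalization keeps the tanh and only normalizes the two projections, since a BN layer in eval mode is just a fixed per-feature affine map, not a bounded nonlinearity):

    import torch
    import torch.nn as nn

    class BNRNNCellSketch(nn.Module):
        """Hypothetical cell: BN after x_t @ W_xh and after h_{t-1} @ W_hh, no biases."""
        def __init__(self, input_size, hidden_size):
            super().__init__()
            self.w_xh = nn.Linear(input_size, hidden_size, bias=False)
            self.w_hh = nn.Linear(hidden_size, hidden_size, bias=False)
            self.bn_x = nn.BatchNorm1d(hidden_size)
            self.bn_h = nn.BatchNorm1d(hidden_size)

        def forward(self, x_t, h_prev):
            # tanh kept here; dropping it would leave a purely affine recurrence at inference time
            return torch.tanh(self.bn_x(self.w_xh(x_t)) + self.bn_h(self.w_hh(h_prev)))

    cell = BNRNNCellSketch(input_size=10, hidden_size=20)
    h = cell(torch.randn(4, 10), torch.zeros(4, 20))
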
Category: Data Science

Why does batchnorm1d in Pytorch compute 0 with the following example (2 lines of code)?

Here is the code: import torch import torch.nn as nn x = torch.Tensor([[1, 2, 3], [1, 2, 3]]) print(x) batchnorm = nn.BatchNorm1d(3, eps=0, momentum=0) print(batchnorm(x)) Here is what is printed: tensor([[1., 2., 3.], [1., 2., 3.]]) tensor([[0., 0., 0.], [0., 0., 0.]], grad_fn=<NativeBatchNormBackward>) What I am expecting is the following: using a hand calculation, let $x = (1,2,3)$; then $E(x) = (1+2+3)/3 = 2$ and $Var(x) = (1^2 + 2^2 + 3^2)/3 - 2^2 = 0.666...$, so that the final …
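
The hand calculation normalizes across the three features of one row, but BatchNorm1d normalizes each feature across the batch (dim 0). Since the two rows are identical, every feature's batch mean equals its value and its batch variance is 0, so the output is 0 everywhere. The matching manual computation:

    import torch

    x = torch.Tensor([[1, 2, 3], [1, 2, 3]])

    mean = x.mean(dim=0)                        # per-feature mean over the batch: [1., 2., 3.]
    var = x.var(dim=0, unbiased=False)          # per-feature (biased) variance:   [0., 0., 0.]
    eps = 1e-5                                  # small eps used here, since eps=0 would give 0/0 by hand
    print((x - mean) / torch.sqrt(var + eps))   # all zeros, matching nn.BatchNorm1d
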
Category: Data Science
