Explanation of Karpathy tweet about common mistakes. #5: "you didn't use bias=False for your Linear/Conv2d layer when using BatchNorm"

I recently found this Twitter thread from Andrej Karpathy. In it he lists a few common mistakes made during the development of a neural network.

  1. you didn't try to overfit a single batch first.
  2. you forgot to toggle train/eval mode for the net.
  3. you forgot to .zero_grad() (in pytorch) before .backward().
  4. you passed softmaxed outputs to a loss that expects raw logits.
  5. you didn't use bias=False for your Linear/Conv2d layer when using BatchNorm, or conversely forget to include it for the output layer. This one won't make you silently fail, but they are spurious parameters
  6. thinking view() and permute() are the same thing (incorrectly using view)

I am specifically interested in an explanation or motivation for the fifth point, especially since I have a network built akin to

self.conv0_e1 = nn.Conv2d(f_in, f_ot, kernel_size=3, stride=1, padding=1)
self.conv1_e1 = nn.Conv2d(f_ot, f_ot, kernel_size=3, stride=1, padding=1)
self.norm_e1 = nn.BatchNorm2d(num_features=f_ot, eps=0.001, momentum=0.01)
self.actv_e1 = nn.ReLU()
self.pool_e1 = nn.MaxPool2d(kernel_size=2, stride=2)

where nn.Conv2d has an implicit bias=True in its constructor.

How would I go about implementing the fifth point in the code sample above? Though, based on the second sentence of that point, it doesn't seem like it matters much?



Love this question. First, note the last thing said in the tweet you quote: "This one won't make you silently fail, but they are spurious parameters." Basically, this is a sort of mathematical quibble, but it is still worth understanding what is going on.

To see what is going on here, compare what a BatchNorm2d layer computes with what a Conv2d layer computes (quoting the PyTorch docs):

BatchNorm2d: $y = \frac{x - \mathrm{E}[x]}{ \sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta$

Conv2d: $\text{out}(N_i, C_{\text{out}_j}) = \text{bias}(C_{\text{out}_j}) + \sum_{k = 0}^{C_{\text{in}} - 1} \text{weight}(C_{\text{out}_j}, k) \star \text{input}(N_i, k)$

Both operations add a bias along the channel dimension ($\beta$ in BatchNorm2d, $\text{bias}$ in Conv2d). The two biases are either redundant or in conflict. If they're redundant, your model is doing extra, unnecessary computation for no benefit. If they're in conflict, some parameters become useless; for example, if we set momentum=0 for BatchNorm2d, the preceding Conv2d layer is left with a set of trainable bias parameters that receive no useful gradient.
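To make the redundancy concrete, here is the algebra (writing the pre-normalization output of one channel loosely as $Wx + b$, with $b$ the Conv2d bias): a constant shift moves the mean by exactly that amount and leaves the variance untouched, so the normalization cancels it,

$\frac{(Wx + b) - \mathrm{E}[Wx + b]}{\sqrt{\mathrm{Var}[Wx + b] + \epsilon}} = \frac{Wx - \mathrm{E}[Wx]}{\sqrt{\mathrm{Var}[Wx] + \epsilon}}$

so the value of $b$ never reaches the output, and its gradient carries no useful signal.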

You can fix this with Conv2d(..., bias=False, ...). Again, this is unlikely to have a significant impact on most networks, but it can be helpful and is good to know.
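Applied to the block in the question, that just means dropping the bias from the conv whose output feeds straight into the BatchNorm2d. I'm assuming the forward pass runs conv0 -> conv1 -> norm -> actv -> pool, as the naming suggests; that ordering isn't shown in the snippet.

self.conv0_e1 = nn.Conv2d(f_in, f_ot, kernel_size=3, stride=1, padding=1)
self.conv1_e1 = nn.Conv2d(f_ot, f_ot, kernel_size=3, stride=1, padding=1, bias=False)  # output goes straight into BatchNorm2d, so the bias would be cancelled anyway
self.norm_e1 = nn.BatchNorm2d(num_features=f_ot, eps=0.001, momentum=0.01)
self.actv_e1 = nn.ReLU()
self.pool_e1 = nn.MaxPool2d(kernel_size=2, stride=2)

If there is no activation between the two convs, conv0_e1's bias is redundant for the same reason (the whole conv0 -> conv1 -> norm chain is affine) and could be dropped too; with a ReLU in between, only conv1_e1 needs the change.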

Follow-up edit:

The reason this is relevant is that a BatchNorm2d layer is a linear (affine) operation, and multiple linear operations can always be combined into a single one. So in Conv2d -> BatchNorm2d, the bias in Conv2d is redundant.
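Here is a quick numerical check of that claim, as a standalone sketch rather than your exact network: with identical conv weights, a conv with a bias and one without produce the same output after BatchNorm in training mode, and the bias parameter receives an essentially zero gradient.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Same conv weights, one layer with a bias and one without.
conv_bias = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=True)
conv_nobias = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    conv_nobias.weight.copy_(conv_bias.weight)

bn = nn.BatchNorm2d(8)
bn.train()  # normalize with batch statistics, as during training

x = torch.randn(4, 3, 16, 16)
out_bias = bn(conv_bias(x))
out_nobias = bn(conv_nobias(x))

# The per-channel mean subtraction cancels the constant bias, so the
# two outputs match up to floating-point error ...
print(torch.allclose(out_bias, out_nobias, atol=1e-5))  # True

# ... and the conv bias gets a (numerically) zero gradient.
out_bias.sum().backward()
print(conv_bias.bias.grad.abs().max())  # ~0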

However, the composition stops being linear as soon as an activation function sits in between, so with two conv layers separated by an activation, Conv2d -> ReLU -> Conv2d, the first layer's bias does not run into this problem.

So it's only the last layer before the batchnorm that matters. (That said, I've definitely seen examples where practitioners just turn off all biases for conv layers, and it doesn't seem to harm anything.)
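If you want to audit an existing model for this, a small sketch along these lines can flag the offending layers. The helper below is hypothetical and only inspects layer order inside containers such as nn.Sequential, so it won't see an ordering that is defined only in forward().

import torch.nn as nn

def flag_redundant_conv_biases(model):
    # Report conv layers that keep a bias even though a BatchNorm2d follows them directly.
    for module in model.modules():
        children = list(module.children())
        for layer, nxt in zip(children, children[1:]):
            if isinstance(layer, nn.Conv2d) and isinstance(nxt, nn.BatchNorm2d) and layer.bias is not None:
                print(f"redundant bias: {layer} is followed by {nxt}")

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),  # default bias=True -> gets flagged
    nn.BatchNorm2d(8),
    nn.ReLU(),
)
flag_redundant_conv_biases(model)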
