Should the weight distribution change more when fine-tuning a transformer-based classifier?
I'm using a pre-trained DistilBERT model from Hugging Face with a custom classification head, which is almost the same as the one in the reference implementation:
import torch.nn as nn
from transformers import DistilBertModel


class PretrainedTransformer(nn.Module):
    def __init__(self, target_classes):
        super().__init__()
        base_model_output_shape = 768
        self.base_model = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.classifier = nn.Sequential(
            nn.Linear(base_model_output_shape, out_features=base_model_output_shape),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(base_model_output_shape, out_features=target_classes),
        )

        # Initialize the linear layers of the head like the reference implementation
        for layer in self.classifier:
            if isinstance(layer, nn.Linear):
                layer.weight.data.normal_(mean=0.0, std=0.02)
                if layer.bias is not None:
                    layer.bias.data.zero_()

    def forward(self, input_, y=None):
        X, length, attention_mask = input_
        base_output = self.base_model(X, attention_mask=attention_mask)[0]
        # Hidden state of the [CLS] token from the last layer
        base_model_last_layer = base_output[:, 0]
        cls = self.classifier(base_model_last_layer)
        return cls
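For context, the model is used roughly like this (the tokenizer and example batch below are only an illustration of the (X, length, attention_mask) tuple that forward() expects, not my actual data pipeline):

from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = PretrainedTransformer(target_classes=2)

encoded = tokenizer(["an example sentence"], padding=True, return_tensors="pt")
X = encoded["input_ids"]
attention_mask = encoded["attention_mask"]
length = attention_mask.sum(dim=1)  # unpacked in forward() but not used there

logits = model((X, length, attention_mask))  # shape: (batch_size, target_classes)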
During training, I use a linear LR warmup schedule with a max LR of 5e-5 and cross-entropy loss.
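The optimizer and schedule are set up along these lines (a sketch: I use AdamW and transformers' get_linear_schedule_with_warmup here, and the step counts are placeholders; the exact values in my code differ):

import torch.nn as nn
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=10_000  # placeholder counts
)
criterion = nn.CrossEntropyLoss()

# Training step:
# loss = criterion(model(batch_input), labels)
# loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()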
In general, the model is able to learn on my dataset and reach high precision/recall metrics.
My question is:
Should the weight and bias distributions in the classification layers change more during training? It seems like the weights almost do not change at all, even when I do not initialize them as in the code above (mean=0.0, std=0.02). Is this a sign that something is wrong with my model, or is it just that the layers I've added are redundant and the model isn't learning anything new?
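For reference, this is roughly how I compare the weights before and after fine-tuning (a diagnostic sketch using simple summary statistics rather than full histograms; not part of my training code):

import copy

before = copy.deepcopy(model.classifier.state_dict())

# ... fine-tune the model ...

after = model.classifier.state_dict()
for name, w_before in before.items():
    w_after = after[name]
    delta = (w_after - w_before).abs()
    print(
        f"{name}: mean |delta| = {delta.mean():.6f}, "
        f"std before = {w_before.std():.4f}, std after = {w_after.std():.4f}"
    )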
Topic huggingface transformer weight-initialization pytorch histogram
Category Data Science