How to train a Task-Specific Knowledge Distillation model using a Hugging Face model

I was referring to this code:

https://github.com/philschmid/knowledge-distillation-transformers-pytorch-sagemaker/blob/master/knowledge-distillation.ipynb

From @philschmid

I could follow most of the code, but I have a few doubts. Please help me clarify them.

In the code below:

```
from transformers import Trainer


class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)

        self.teacher = teacher_model

        # place the teacher on the same device as the student
        self._move_model_to_device(self.teacher, self.model.device)

        # put the teacher in evaluation mode
        self.teacher.eval()
```

When I pass a fine-tuned teacher model, is it correct that the teacher is never fine-tuned further during task-specific distillation training, as the self.teacher.eval() line in the code suggests? And is only the teacher model's output used in the loss calculation?
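
For context, my understanding is that the teacher is only ever run in the forward pass inside the loss computation, roughly like the sketch below. The exact loss form and the temperature/alpha values here are my own illustration, not necessarily identical to what the notebook does:

```
import torch
import torch.nn.functional as F
from transformers import Trainer


class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher = teacher_model
        self._move_model_to_device(self.teacher, self.model.device)
        self.teacher.eval()

    def compute_loss(self, model, inputs, return_outputs=False):
        # Student forward pass: gradients flow through these outputs.
        outputs_student = model(**inputs)
        student_loss = outputs_student.loss

        # Teacher forward pass under no_grad: its logits are used only as
        # soft targets, and no gradients are computed for the teacher.
        with torch.no_grad():
            outputs_teacher = self.teacher(**inputs)

        # Soft-target loss: KL divergence between temperature-softened logits.
        temperature = 2.0  # illustrative value
        alpha = 0.5        # illustrative weighting between the two loss terms
        loss_kd = F.kl_div(
            F.log_softmax(outputs_student.logits / temperature, dim=-1),
            F.softmax(outputs_teacher.logits / temperature, dim=-1),
            reduction="batchmean",
        ) * (temperature ** 2)

        loss = alpha * student_loss + (1.0 - alpha) * loss_kd
        return (loss, outputs_student) if return_outputs else loss
```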

I couldn't follow the line self._move_model_to_device(self.teacher, self.model.device). What is it actually doing?
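
As far as I can tell, this Trainer helper just copies the teacher's parameters and buffers onto whatever device the student is already on (CPU or GPU), so that both forward passes run on the same device. My understanding is that it is roughly equivalent to the following (the checkpoint names are placeholders, just for illustration):

```
from transformers import AutoModelForSequenceClassification

# Placeholder checkpoints, just for illustration.
teacher_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
student_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# What self._move_model_to_device(self.teacher, self.model.device) effectively does:
# move the teacher onto the device the student lives on.
teacher_model = teacher_model.to(student_model.device)
```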

In task-specific distillation training I am fine-tuning my student model, but I pass both models to the DistillationTrainer. Where does it make sure that only the student model's weights are learned, and not the teacher's?

```
trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=teacher_model,
    train_dataset=train_data,
    eval_dataset=val_data,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
```
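
Regarding the last question, my understanding is that the Trainer builds its optimizer only from self.model, i.e. the first (student) model passed in, while the teacher is only called during the loss computation. A minimal sketch of how I picture it (freezing the teacher is optional and only makes the intent explicit; the optimizer line is a simplification of what the Trainer sets up internally):

```
import torch

# The teacher is only used for inference, so its parameters can be frozen explicitly
# (not strictly required if its forward pass runs under torch.no_grad()).
for param in teacher_model.parameters():
    param.requires_grad = False

# The Trainer's default optimizer is created from self.model (the student) only,
# roughly like this, so the teacher's weights are never part of the update step:
optimizer = torch.optim.AdamW(
    (p for p in student_model.parameters() if p.requires_grad),
    lr=5e-5,
)
```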

Topic: huggingface, pytorch, python-3.x, deep-learning
