Why is stop-gradient used in DeepMind's BYOL (Bootstrap Your Own Latent)?
I'm reading Grill et al.'s paper on their self-supervised approach. I don't understand why the output of the target network is written as $\operatorname{sg}(z'_\xi)$ rather than just $z'_\xi$, which is what the loss equations alone would seem to call for.
Is $\operatorname{sg}$ used simply to signify that the results of this network do not affect its parameters ($\xi$)?
That would seem redundant, given how $\xi$ is defined in the paper (as an exponential moving average of $\theta$, so it is never updated by gradients in the first place).
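To make my confusion concrete, here is a rough PyTorch-style sketch of how I understand the update. The `nn.Linear` stand-ins, module names, and the value of `tau` are my own placeholders, not the paper's actual encoder/projector/predictor architecture:

```python
# Rough sketch of my understanding of the BYOL update (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

online_encoder = nn.Linear(32, 16)   # stand-in for f_theta + g_theta
predictor      = nn.Linear(16, 16)   # stand-in for q_theta
target_encoder = nn.Linear(32, 16)   # stand-in for f_xi + g_xi
target_encoder.load_state_dict(online_encoder.state_dict())
for p in target_encoder.parameters():
    p.requires_grad = False          # target never receives gradients

opt = torch.optim.SGD(
    list(online_encoder.parameters()) + list(predictor.parameters()), lr=0.1
)
tau = 0.99                           # EMA decay for xi (placeholder value)

v1, v2 = torch.randn(8, 32), torch.randn(8, 32)   # two augmented views

# Online branch: q_theta(z_theta)
p1 = predictor(online_encoder(v1))
# Target branch: sg(z'_xi) -- detach() is what I take sg to mean
z2 = target_encoder(v2).detach()

# 2 - 2 * cosine similarity, i.e. the loss between normalized vectors
loss = 2 - 2 * F.cosine_similarity(p1, z2, dim=-1).mean()
opt.zero_grad()
loss.backward()                      # gradients flow only into theta
opt.step()

# EMA update of the target: xi <- tau * xi + (1 - tau) * theta
with torch.no_grad():
    for po, pt in zip(online_encoder.parameters(), target_encoder.parameters()):
        pt.mul_(tau).add_((1 - tau) * po)
```

In this sketch, `.detach()` plays the role of $\operatorname{sg}$, but since the target parameters never require gradients and are only ever updated through the EMA, the explicit $\operatorname{sg}$ annotation looks like it changes nothing.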
Am I missing anything?
Topic deepmind unsupervised-learning computer-vision deep-learning
Category Data Science