Computing the variance of an SGD update
It is known that the SGD update has high variance. Given the iteration $$ w^{k+1} := w^k - \underbrace{\alpha \, g_i(w^k)}_{p^k}, $$ where $w$ are the model weights and $g_i(w^k)$ is the gradient of the loss function evaluated on a randomly drawn sample $i$, how do I compute the variance of each update $p^k$? Since $i$ is random, $p^k$ is a random vector, so by "variance" I mean its variability over the choice of $i$ at a fixed $w^k$ (e.g. the trace of its covariance). I would like to plot this quantity for each iteration and study its behavior during the minimization process.
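For concreteness, here is a minimal sketch of the kind of measurement I have in mind, on a synthetic least-squares problem where per-sample gradients are cheap to compute in closed form (the data `X`, `y`, the step size `alpha`, and the helper `per_sample_grads` are all made up for illustration): at each iterate $w^k$ it forms the candidate update $p^k = -\alpha\, g_i(w^k)$ for every sample $i$ and records the total variance of $p^k$ over $i$.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical setup: synthetic least-squares problem.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def per_sample_grads(w):
    # Gradient of 0.5 * (x_i^T w - y_i)^2 w.r.t. w, one row per sample i.
    residuals = X @ w - y          # shape (n,)
    return residuals[:, None] * X  # shape (n, d)

alpha, n_iters = 0.01, 500
w = np.zeros(d)
update_var = []

for k in range(n_iters):
    grads = per_sample_grads(w)   # all g_i(w^k)
    updates = -alpha * grads      # all candidate p^k, one per sample i
    # Total variance of p^k over the random index i:
    # the trace of its covariance matrix.
    update_var.append(updates.var(axis=0).sum())
    i = rng.integers(n)           # SGD: draw one sample at random
    w = w + updates[i]            # w^{k+1} = w^k - alpha * g_i(w^k)

plt.plot(update_var)
plt.xlabel("iteration $k$")
plt.ylabel("total variance of $p^k$ over $i$")
plt.show()
```

Is this the right way to quantify the variance of the update, and is there a standard estimator for it when computing all $n$ per-sample gradients at every iteration is too expensive (e.g. using a mini-batch of gradients instead)?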
Tags: mathematics, variance, deep-learning, optimization, machine-learning
Category: Data Science