Understanding SGD for Binary Cross-Entropy loss
I'm trying to describe mathematically how stochastic gradient descent could be used to minimize the binary cross entropy loss.
The typical description of SGD that I can find online is: $\theta = \theta - \eta\,\nabla_{\theta}J(\theta, x^{(i)}, y^{(i)})$, where $\theta$ is the parameter to optimize the objective function $J$ over, and $x$ and $y$ come from the training set. Specifically, the superscript $(i)$ indicates the $i$-th observation from the training set.
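To make sure I'm reading this update rule correctly, here is a minimal sketch with a made-up squared-error objective (the function `grad_J`, the learning rate `eta = 0.1` and the toy data are all my own choices, purely for illustration):

```python
import numpy as np

# Hypothetical objective: J(theta, x, y) = 0.5 * (theta * x - y)**2 for one observation.
def grad_J(theta, x_i, y_i):
    # d/dtheta of 0.5 * (theta * x_i - y_i)^2
    return (theta * x_i - y_i) * x_i

theta = 0.0   # parameter to optimize
eta = 0.1     # learning rate
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

for epoch in range(100):
    for i in np.random.permutation(len(x)):              # visit observations one at a time
        theta = theta - eta * grad_J(theta, x[i], y[i])  # SGD step on the i-th observation

print(theta)  # approaches 2.0 for this toy data
```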
For binary cross entropy loss, I am using the following definition (following https://arxiv.org/abs/2009.14119): $$ L_{tot} = \sum_{k=1}^K L(\sigma(z_k),y_k)\\ L = -yL_+ - (1-y)L_- \\ L_+ = \log(p)\\ L_- = \log(1-p) $$ where $\sigma$ is the sigmoid function, $z_k$ is a single predicted value, $y_k$ is the true value, and $p = \sigma(z_k)$. To better explain this, I am training my model to predict a 0-1 vector like [0, 1, 1, 0, 1, 0], so it might predict something like [0.03, 0.90, 0.98, 0.02, 0.85, 0.1], which then means that e.g. $\sigma(z_3) = 0.98$.
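To check my understanding of these definitions, this is how I would compute $L_{tot}$ for the toy vectors above (a NumPy sketch; the listed predictions are the post-sigmoid probabilities, i.e. $p_k = \sigma(z_k)$):

```python
import numpy as np

# My example from above: true vector y and the model's post-sigmoid predictions p = sigma(z).
y = np.array([0, 1, 1, 0, 1, 0], dtype=float)
p = np.array([0.03, 0.90, 0.98, 0.02, 0.85, 0.10])

# L_+ = log(p), L_- = log(1 - p), L = -y*L_+ - (1 - y)*L_-, L_tot = sum over k
L_plus = np.log(p)
L_minus = np.log(1.0 - p)
L_tot = np.sum(-y * L_plus - (1.0 - y) * L_minus)
print(L_tot)  # small, since every prediction is close to its target
```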
To combine these definitions, I think the binary cross entropy loss is minimized with respect to the parameters $z_k$ (as these are what the model tries to learn), so that in my case $\theta = z$.
Then, combining the two equations, what I think makes sense is the following update: $z = z - \eta\,\nabla_z L_{tot}(z^{(i)}, y^{(i)})$
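To show concretely what I mean, here is a small sketch of that update for a single training example, treating the logits $z$ themselves as the parameters and using the fact that $\partial L_{tot}/\partial z_k = \sigma(z_k) - y_k$ for this loss (the toy target and the learning rate are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One training example: target vector y^(i) and current logits z (toy values).
y_i = np.array([0, 1, 1, 0, 1, 0], dtype=float)
z = np.zeros(6)   # treating the logits themselves as the parameters, theta = z
eta = 0.5

for step in range(200):
    p = sigmoid(z)
    grad = p - y_i       # d L_tot / d z_k = sigma(z_k) - y_k for this loss
    z = z - eta * grad   # the update z = z - eta * grad_z L_tot(z, y^(i))

print(sigmoid(z))  # moves towards [0, 1, 1, 0, 1, 0]
```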
However I am unsure about the following:
- One part of the formula contains $z$, and another part contains $z^{(i)}$; this doesn't make much sense to me. Should I use only $z$ everywhere? But then how would it be clear that the prediction $z$ corresponds to the true value $y^{(i)}$?
- In the original SGD formula there is also an $x^{(i)}$. Since this is not part of the binary cross entropy loss function, can I just omit this $x^{(i)}$?
Any help with the above two points and finding the correct equation for SGD for binary cross entropy loss would be greatly appreciated.