Understanding SGD for Binary Cross-Entropy loss

I'm trying to describe mathematically how stochastic gradient descent could be used to minimize the binary cross entropy loss.

The typical description of SGD that I can find online is: $\theta = \theta - \eta \, \nabla_{\theta}J(\theta, x^{(i)}, y^{(i)})$, where $\theta$ is the parameter to optimize the objective function $J$ over, and $x$ and $y$ come from the training set. Specifically, the $(i)$ indicates that it is the $i$-th observation from the training set.

For binary cross entropy loss, I am using the following definition (following https://arxiv.org/abs/2009.14119): $$\begin{align} L_{tot} &= \sum_{k=1}^K L(\sigma(z_k), y_k)\\ L &= -yL_+ - (1-y)L_-\\ L_+ &= \log(p)\\ L_- &= \log(1-p) \end{align}$$ where $\sigma$ is the sigmoid function, $p = \sigma(z_k)$, $z_k$ is a prediction (a single number) and $y_k$ is the true value. To make this concrete, I am training my model to predict a 0-1 vector like [0, 1, 1, 0, 1, 0], so it might predict something like [0.03, 0.90, 0.98, 0.02, 0.85, 0.1], which would mean that e.g. $z_3 = 0.98$.
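To show what I mean in code, here is a minimal NumPy sketch that evaluates $L_{tot}$ for this example, assuming the listed values are the probabilities $p_k = \sigma(z_k)$ rather than raw scores:

```python
import numpy as np

# True 0-1 targets and predicted probabilities p_k = sigmoid(z_k)
y = np.array([0, 1, 1, 0, 1, 0])
p = np.array([0.03, 0.90, 0.98, 0.02, 0.85, 0.10])

# Per-entry binary cross entropy: -y*log(p) - (1-y)*log(1-p)
per_entry = -y * np.log(p) - (1 - y) * np.log(1 - p)

# L_tot sums over the K entries of the vector
L_tot = per_entry.sum()
print(L_tot)  # small here, since the predictions are close to the targets
```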

To combine these definitions, I think that the binary cross entropy loss is minimized over the parameters $z_k$ (as this is what the model tries to learn), so that in my case $\theta = z$.

Then, in order to combine the equations, what I think makes sense is the following: $z = z - \eta \, \nabla_z L_{tot}(z^{(i)}, y^{(i)})$

However I am unsure about the following:

  1. One part of the formula contains $z$ and another part contains $z^{(i)}$, which doesn't make much sense to me. Should I use only $z$ everywhere? But then how would it be clear that the prediction $z$ corresponds to the true value $y^{(i)}$?
  2. In the original SGD formula there is also an $x^{(i)}$. Since this is not part of the binary cross entropy loss function, can I just omit this $x^{(i)}$?

Any help with the above two points and finding the correct equation for SGD for binary cross entropy loss would be greatly appreciated.

Topic sgd mathematics multilabel-classification gradient-descent machine-learning

Category Data Science


You are confusing a number of definitions. The loss definition you provided is correct, but the terms you used are not precise. I'll try to make the following concepts clearer for you: parameters, predictions, and logits. I want you to focus on the logit concept, which I believe is the issue here.

First, binary classification is a learning task where we want to predict which of two classes, 0 (the negative class) or 1 (the positive class), an example $x$ belongs to.

Binary cross entropy is a loss function that is frequently used for such tasks. To use this loss function, the model is expected to output one real number $\hat{y} \in [0,1]$ for each example $x$, where $\hat{y}$ represents the probability that the example belongs to the positive class 1. I'd rather write the loss as follows: $$\begin{align} L &= \sum_{i=1}^n l(\hat{y}_i, y_i)\\ l(\hat{y}_i, y_i) &= -y_i \log(\hat{y}_i) - (1-y_i) \log(1-\hat{y}_i) \end{align}$$
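As an illustration, the per-example loss $l(\hat{y}_i, y_i)$ translates directly into code (a minimal NumPy sketch):

```python
import numpy as np

def bce(y_hat, y):
    """Per-example binary cross entropy l(y_hat, y)."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# The loss is small when the predicted probability matches the label,
# and large when it does not.
print(bce(0.9, 1))  # ~0.105
print(bce(0.9, 0))  # ~2.303
```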

Now, the way our predictions $\hat{y}$ are computed depends on the family of models we choose to use.

For example, if you use a logistic regression model, the model computes predictions as follows: $\hat{y} = \sigma(z)$, where $z \in \mathbb{R}$ is called the logit (not the prediction) and $\sigma$ is the sigmoid function. In logistic regression, the logit is a linear function of your features, $z = \theta x$, where $\theta$ is the parameter vector (which is independent of your set of examples) and $x$ is the example vector. So, $$\hat{y}_i = \sigma(z_i) = \sigma(\theta x_i) $$
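In code, this chain from features to logit to prediction looks as follows for a single example (a sketch with hypothetical values for $\theta$ and $x_i$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.0, 2.0])   # parameter vector (hypothetical values)
x_i   = np.array([1.0,  0.3, 0.7])   # feature vector of one example

z_i = theta @ x_i        # logit: a linear function of the features
y_hat_i = sigmoid(z_i)   # prediction: probability of the positive class
print(z_i, y_hat_i)
```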

In this case, the loss becomes: $$\begin{align} L &= \sum_{i=1}^n -y_i \log(\hat{y}_i) - (1-y_i) \log(1-\hat{y}_i) \\ &= \sum_{i=1}^n -y_i \log(\sigma(\theta x_i)) - (1-y_i) \log(1-\sigma(\theta x_i)) \end{align}$$

Now, compute the gradient of $L$ with respect to $\theta$ and plug it into your SGD update rule.
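For this logistic regression case, the per-example gradient has the well-known closed form $\nabla_\theta \, l(\hat{y}_i, y_i) = (\sigma(\theta x_i) - y_i)\, x_i$, so one SGD step per example looks like the following sketch (learning rate and toy data are placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, eta=0.1, epochs=100):
    """Minimize binary cross entropy with SGD, one example at a time."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat_i = sigmoid(theta @ x_i)     # prediction from the logit
            grad = (y_hat_i - y_i) * x_i       # gradient of l w.r.t. theta
            theta = theta - eta * grad         # SGD update rule
    return theta

# Toy data: one feature plus a bias column
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])
print(sgd_logistic_regression(X, y))
```

Note that each update uses a single example $(x_i, y_i)$, which is exactly the role of the $x^{(i)}$ in the SGD formula you quoted.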

To summarize, predictions are related to logits by the sigmoid function, and logits are related to example features by model parameters.

I used logistic regression to simplify the discussion. With a neural network, the relationship between the logits and the model parameters becomes more complicated.

Last, I want to clarify that SGD can be used with a variety of models, so when you say the formula contains $x^{(i)}$, you need to specify which family of models you are talking about.
