In practice, what is the cost function of a neural network?

I want to ask what I think is a fairly simple question. I have a deep background in pure mathematics, so I don't have much trouble understanding the mathematics of the cost function, but I would like to clarify what exactly the cost function is in a neural network in practice (i.e. when implementing it on real datasets).

Given a fixed training sample, we can view the cost function as a function of the weights and biases, so optimizing it amounts to finding its minimum.

In practice, what is the cost function when you have thousands of training samples? Is it the sum of the cost functions over all the training examples?

Tags: cost-function, deep-learning, neural-network

Category: Data Science


There are many options, but two are especially common: crossentropy for classification and mean squared error (MSE) for regression.

$$ \text{Crossentropy}\\ L(y, \hat y) = -\dfrac{1}{N}\sum_{i=1}^N \bigg[y_i\log(\hat y_i) +(1 - y_i)\log(1 - \hat y_i)\bigg] $$

$$ \text{MSE}\\ L(y, \hat y) = \dfrac{1}{N}\sum_{i=1}^N \bigg(y_i - \hat y_i\bigg)^2 $$
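In code, both losses are one-liners. Here is a minimal NumPy sketch of the two formulas above; the clipping constant `eps` is an implementation detail I am adding to avoid `log(0)` when a prediction saturates, not part of the formulas themselves:

```python
import numpy as np

def binary_crossentropy(y, y_hat, eps=1e-12):
    # Clip predictions away from 0 and 1 so log() stays finite.
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def mse(y, y_hat):
    # Mean of squared residuals over the N training examples.
    return np.mean((y - y_hat) ** 2)
```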

In both cases, the predicted $\hat y_i$ is a function of the weights and biases in the model. Also, there is an extension of crossentropy for when there are multiple classes. It follows from maximum likelihood estimation with a multinomial $y_i$ (as opposed to the binomial $y_i$ that yields the equation I gave).
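That multiclass extension is usually called categorical crossentropy. A minimal sketch, assuming one-hot targets `Y` and predicted class probabilities `Y_hat` (e.g. softmax outputs), both of shape `(N, K)`; the names are illustrative:

```python
import numpy as np

def categorical_crossentropy(Y, Y_hat, eps=1e-12):
    # Y: one-hot targets (N, K); Y_hat: predicted probabilities (N, K).
    Y_hat = np.clip(Y_hat, eps, 1.0)
    # Per-example loss is -log of the probability assigned to the true class.
    return -np.mean(np.sum(Y * np.log(Y_hat), axis=1))
```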

However, you can pick lots of other loss functions with varying degrees of utility. There are analogues of quantile regression, generalized linear models, etc., much as crossentropy and MSE give neural network analogues of logistic and linear regression, respectively.
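For instance, the quantile-regression analogue replaces MSE with the pinball loss. A small sketch, where the function name and the `tau` parameter are illustrative rather than from any particular library:

```python
import numpy as np

def pinball_loss(y, y_hat, tau=0.5):
    # Pinball (quantile) loss: tau penalizes under-prediction,
    # (1 - tau) penalizes over-prediction. tau = 0.5 recovers
    # half the mean absolute error.
    diff = y - y_hat
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))
```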


The cost function is a guiding light for any ML/DL model: all the weights and biases are updated in order to minimize it. Optimization algorithms such as gradient descent, Adam, and mini-batch gradient descent are used to carry out this minimization.

When you have thousands of training examples, the cost function is usually the sum (or, equivalently up to a constant factor, the average) over all the training examples. But we do have algorithms like mini-batch gradient descent that do not compute weight updates on all the training examples at once; each update uses only a small batch, iterating over batches for a certain number of iterations.
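As a concrete sketch, here is mini-batch gradient descent for a linear model $\hat y = Xw + b$ under the MSE loss above. All names and hyperparameters (`batch_size`, `lr`, `epochs`) are illustrative assumptions, not from a specific library:

```python
import numpy as np

def minibatch_gd(X, y, batch_size=32, lr=0.01, epochs=100):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        # Shuffle so each epoch sees the batches in a different order.
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            err = Xb @ w + b - yb                    # residuals on this batch
            w -= lr * 2 * Xb.T @ err / len(batch)    # gradient of batch MSE w.r.t. w
            b -= lr * 2 * err.mean()                 # gradient of batch MSE w.r.t. b
    return w, b
```

Each update uses the gradient of the cost averaged over one batch rather than the full training set, which is much cheaper per step and, in expectation, points in the same direction as the full-batch gradient.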
