Let's explore the use case of binary classification. In binary classification the labels are drawn from a Bernoulli distribution. For each example $i$, with true label $y_i \in \{0, 1\}$ and predicted probability $p_i$, the likelihood under the Bernoulli distribution is

$p_i^{y_i} (1-p_i)^{1-y_i}$.

We want to maximize the likelihood of the entire dataset, which means we want to maximize the product of the per-example likelihoods, as in the sketch below.
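
As a concrete illustration, here is a minimal NumPy sketch (not from the original article) that evaluates the per-example Bernoulli likelihoods and then the dataset likelihood as their product; the labels and predicted probabilities are made-up values for demonstration only.

```python
import numpy as np

# Hypothetical labels y_i and predicted probabilities p_i (illustrative values only).
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Per-example Bernoulli likelihood: p_i^y_i * (1 - p_i)^(1 - y_i)
per_example = p ** y * (1 - p) ** (1 - y)

# Likelihood of the whole dataset: the product over all examples.
dataset_likelihood = np.prod(per_example)

print(per_example)         # e.g. [0.9 0.8 0.7 0.6 0.9]
print(dataset_likelihood)  # product of the values above, ~0.272
```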

To make the problem convenient for the optimizer we do two things:

  1. We want a minimization problem, so we minimize the negative likelihood instead of maximizing the likelihood. 
  2. We convert the product into a sum, whose derivatives are much easier to compute. To do this we apply a log transformation. Because the log is monotonic, the optimal parameters remain the same, i.e. minimizing the negative log likelihood is equivalent to minimizing the negative likelihood (see the sketch after this list). 
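
The following sketch, again with illustrative numbers, checks both points numerically: the log of the product over examples equals the sum of the per-example log terms, and because the log is monotonic the same candidate parameter maximizes the likelihood, maximizes the log likelihood, and minimizes the negative log likelihood.

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])             # illustrative labels
candidates = np.linspace(0.01, 0.99, 99)  # candidate values for a single shared probability p

# Dataset likelihood and log likelihood for each candidate p.
likelihood = np.array([np.prod(p ** y * (1 - p) ** (1 - y)) for p in candidates])
log_likelihood = np.log(likelihood)

# 1. The log turns the product over examples into a sum of per-example terms.
p = 0.7
assert np.isclose(np.log(np.prod(p ** y * (1 - p) ** (1 - y))),
                  np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

# 2. Monotonicity: the same candidate maximizes the likelihood and the log likelihood,
#    and minimizes the negative log likelihood.
best = candidates[np.argmax(likelihood)]
assert best == candidates[np.argmax(log_likelihood)]
assert best == candidates[np.argmin(-log_likelihood)]
```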

Log loss is the negative log likelihood. Taking the log of the likelihood of a single example, we get:

$y_i \log(p_i) + (1-y_i) \log (1-p_i)$.

The log loss for the whole dataset is just the negative sum of these per-example terms (often averaged over the number of examples):

$-\sum_{i=1}^{N} \left[ y_i \log(p_i) + (1-y_i) \log (1-p_i) \right]$.
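
Putting it together, here is a minimal sketch of the log loss computed directly from the formula above and checked against scikit-learn's `log_loss`, which averages over examples by default; the labels and probabilities are again made-up illustrative values.

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1, 0])            # illustrative labels
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # illustrative predicted probabilities

# Negative log likelihood per example, averaged over the dataset.
manual = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(manual)
print(log_loss(y, p))  # matches the manual computation
assert np.isclose(manual, log_loss(y, p))
```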
