This is the loss function that you aim to minimize by tuning the parameters theta given the data (x, y). The loss is the negative conditional log-likelihood of the output sequence y given the input sequence x, i.e. L(theta) = -log P(y | x; theta). What you want to find is a distribution P(y | x), parametrized by theta, that assigns high probability to the correct output sequence y for a given input sequence x. Minimizing the loss means shaping the distribution based on the examples in your training data, so that for every input sequence x in the training data the most probable predicted output y_predict agrees as closely as possible with the actually observed output y. You do this in the hope that the model generalizes well to unseen data: when you feed in a new sequence x that the model hasn't seen before, it should give you an accurate estimate of the output sequence y most likely to be associated with x.
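To make the idea concrete, here is a minimal sketch of computing this loss for one training pair. The function name `sequence_nll` and the toy per-step probabilities are my own illustration, not from the original; it assumes the model factorizes P(y | x) into one distribution per output step, as sequence models typically do.

```python
import math

def sequence_nll(step_probs, target):
    """Negative conditional log-likelihood of an observed output sequence.

    step_probs: list of dicts, one per output step, each mapping candidate
                tokens to the model's probability P(y_t | x, y_<t).
    target:     the actually observed output sequence y.
    """
    # Sum -log P(y_t | ...) over the steps; minimizing this total pushes
    # probability mass toward the observed sequence.
    return -sum(math.log(probs[y_t]) for probs, y_t in zip(step_probs, target))

# Toy example: a two-step output over a three-token vocabulary.
step_probs = [
    {"a": 0.7, "b": 0.2, "c": 0.1},
    {"a": 0.1, "b": 0.8, "c": 0.1},
]
loss = sequence_nll(step_probs, ["a", "b"])  # -(log 0.7 + log 0.8)
```

Note that the loss shrinks as the model assigns more probability to the observed tokens, which is exactly what "agreeing best with the observed output" means here.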
