Using BERT for coreference resolution, what's the loss function?

I'm working on using BERT for coreference resolution, following the highly-cited paper BERT for Coreference Resolution: Baselines and Analysis (https://arxiv.org/pdf/1908.09091.pdf). I have the following questions; the details can't easily be found in the paper, so I hope you can help me out.

What’s the input? Is it the antecedents plus the paragraph? What’s the output? Mention clusters, antecedents? More importantly, what’s the loss function?

For comparison, in another highly-cited paper by Clark et al. that uses reinforcement learning, it is very clear what the reward function is: https://cs.stanford.edu/people/kevclark/resources/clark-manning-emnlp2016-deep.pdf

Topic: bert, nlp

Category: Data Science


For comparison, NER is approached as a sequence-labeling problem: at the end of the network, there is a per-token categorical distribution over tags, estimated by a softmax and trained with cross-entropy loss.
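A minimal sketch of what that looks like in PyTorch (the shapes, tag count, and random tensors are illustrative assumptions, not the paper's setup):

```python
import torch
import torch.nn as nn

batch_size, seq_len, hidden_dim, num_tags = 2, 8, 768, 5

# Stand-in for contextual token embeddings from an encoder such as BERT.
token_embeddings = torch.randn(batch_size, seq_len, hidden_dim)

# Linear classification head producing per-token tag logits.
tag_head = nn.Linear(hidden_dim, num_tags)
logits = tag_head(token_embeddings)            # (batch, seq_len, num_tags)

# Gold tag indices for each token (random here, just for illustration).
gold_tags = torch.randint(0, num_tags, (batch_size, seq_len))

# Token-level cross-entropy over the flattened sequence; the softmax is
# applied internally by CrossEntropyLoss.
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits.view(-1, num_tags), gold_tags.view(-1))
print(loss.item())
```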

The paper you are specifically asking about builds on the End-to-end Neural Coreference Resolution paper (Lee et al., 2017), which does something trickier. They explicitly consider all spans as candidate entity mentions and, for each span, compute a probability distribution over its possible antecedents. Nevertheless, once they have these probabilities, the loss is in principle still a cross-entropy: the negative marginal log-likelihood of all correct antecedents (cf. their code).
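As a hedged sketch of that objective (the span-ranking loss of Lee et al., 2017, which the BERT paper reuses): for each span, score every earlier span as a candidate antecedent, prepend a fixed 0-score dummy antecedent ε meaning "no antecedent", take a softmax over these scores, and minimize the negative log of the probability mass assigned to the gold antecedents. The tensors and gold assignments below are toy stand-ins, not the paper's real scoring functions:

```python
import torch

num_spans = 4

# antecedent_scores[i, j]: score that span j is an antecedent of span i.
# Only j < i is valid; the rest is masked to -inf.
antecedent_scores = torch.randn(num_spans, num_spans)
mask = torch.tril(torch.ones(num_spans, num_spans), diagonal=-1).bool()
antecedent_scores = antecedent_scores.masked_fill(~mask, float("-inf"))

# Prepend the dummy antecedent epsilon with a fixed score of 0.
dummy = torch.zeros(num_spans, 1)
scores = torch.cat([dummy, antecedent_scores], dim=1)  # (num_spans, 1 + num_spans)

# gold[i, j] = True if column j is a correct antecedent of span i.
# Column 0 is the dummy; column j + 1 corresponds to span j.
gold = torch.zeros_like(scores, dtype=torch.bool)
gold[0, 0] = True   # span 0 has no antecedent
gold[1, 0] = True   # span 1 has no antecedent either
gold[2, 2] = True   # span 2 corefers with span 1
gold[3, 0] = True   # span 3 starts a new (or no) cluster

# Marginal log-likelihood of all correct antecedents:
#   loss = -sum_i log sum_{j in GOLD(i)} P(j | i)
log_probs = torch.log_softmax(scores, dim=1)
gold_log_probs = log_probs.masked_fill(~gold, float("-inf"))
loss = -torch.logsumexp(gold_log_probs, dim=1).sum()
print(loss.item())
```

Spans whose only correct choice is the dummy ε are how the model learns to reject non-mentions, which is why no separate mention-detection loss is needed.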
