Using softmax for multilabel classification (as per Facebook paper)
I came across a paper by some Facebook researchers in which they found that using softmax with a cross-entropy (CE) loss during training led to better multilabel classification results than sigmoid + BCE. They do this by normalizing the multi-hot label vector, dividing each '1' by the number of labels for the given image (e.g. [0, 1, 1, 0] becomes [0, 0.5, 0.5, 0]).
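For concreteness, here's a minimal sketch of how I understand the training setup, in PyTorch (the logits and labels are just illustrative placeholders, not from the paper):

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a batch of 2 images over 4 classes
logits = torch.randn(2, 4)

# Multi-hot ground-truth labels, e.g. image 0 has classes 1 and 2
multi_hot = torch.tensor([[0., 1., 1., 0.],
                          [1., 0., 0., 0.]])

# Normalize each row so the targets sum to 1
# (e.g. [0, 1, 1, 0] -> [0, 0.5, 0.5, 0])
soft_targets = multi_hot / multi_hot.sum(dim=1, keepdim=True)

# Softmax cross-entropy against the soft targets:
# -sum(target * log_softmax(logits)), averaged over the batch
loss = -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```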
However, they do not mention how this is meant to be used at inference time: since the softmax outputs sum to 1 across all classes, it isn't clear what threshold to apply when selecting the predicted labels.
Does anyone know how this would work?