Using softmax for multilabel classification (as per Facebook paper)

I came across this paper by some Facebook researchers in which they found that using a softmax activation with a cross-entropy (CE) loss during training led to better results than sigmoid + binary cross-entropy (BCE) for multi-label classification. They do this by changing the multi-hot label vector so that each '1' is divided by the number of labels for the given image (e.g. from [0, 1, 1, 0] to [0, 0.5, 0.5, 0]).
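For concreteness, here is my understanding of the training-time setup as a minimal PyTorch sketch (the function name `soft_target_ce` and the toy data are my own, not from the paper):

```python
import torch
import torch.nn.functional as F

def soft_target_ce(logits, multi_hot):
    # Normalize the multi-hot target so it sums to 1 per image,
    # e.g. [0, 1, 1, 0] -> [0, 0.5, 0.5, 0]
    targets = multi_hot / multi_hot.sum(dim=1, keepdim=True)
    log_probs = F.log_softmax(logits, dim=1)          # log of softmax scores
    return -(targets * log_probs).sum(dim=1).mean()   # mean cross-entropy over the batch

logits = torch.randn(8, 4)                       # 8 images, 4 candidate labels
multi_hot = torch.randint(0, 2, (8, 4)).float()  # random multi-hot ground truth
multi_hot[multi_hot.sum(dim=1) == 0, 0] = 1.0    # toy fix: avoid images with zero labels
loss = soft_target_ce(logits, multi_hot)
```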

However, they do not mention how this could then be used in the inference stage, because the required threshold for selecting the correct labels is not clear.

Does anyone know how this would work?

Topic convolutional-neural-network probability multilabel-classification deep-learning machine-learning

Category Data Science


Regarding your question:

However, they do not mention how this could then be used in the inference stage, because the required threshold for selecting the correct labels is not clear.

Does anyone know how this would work?

While this answer may be unsatisfying, I believe the answer is: you don't use it for inference.

The paper describes the multi-label softmax only for the pre-training stage, where the loss is computed against hashtag ground truths that were already known. The Facebook paper then either uses the features learned during pre-training on the hashtag data, or uses the hashtag-trained network merely as a point of weight initialization, not for actual inference on "live data."

The softmax function only gives a relative level of confidence in the labels; its probability values are more "ordinal" than "cardinal". So, to use the softmax scores at inference, one would need a separate way to decide how many labels to keep: a pre-determined constant number n (the paper notes that each image had about 2 canonical hashtags on average, so n = 2 is one natural choice), a separate algorithm/model that predicts how many labels an image should have, and so on. A simple top-k readout along those lines is sketched below.
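As an illustration, a hypothetical top-k readout (with k = 2, matching the roughly 2 hashtags per image mentioned in the paper) could look like this; the function name and tensor shapes are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def top_k_labels(logits, k=2):
    # Softmax scores are only relative confidences, so we simply
    # keep the indices of the k highest-scoring classes per image.
    probs = F.softmax(logits, dim=1)
    return probs.topk(k, dim=1).indices   # shape: (batch, k)

logits = torch.randn(8, 1000)             # e.g. 1000 candidate hashtags
labels = top_k_labels(logits, k=2)        # 2 predicted labels per image
```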

Sources:

D. Mahajan et al., “Exploring the Limits of Weakly Supervised Pretraining,” Sep. 2018.
