Difference in performance Sigmoid vs. Softmax

For the same Binary Image Classification task, if in the final layer I use 1 node with Sigmoid activation function and binary_crossentropy loss function, then the training process goes through pretty smoothly (92% accuracy after 3 epochs on validation data).

However, if I change the final layer to 2 nodes and use the Softmax activation function with the sparse_categorical_crossentropy loss function, then the model doesn't seem to learn at all and gets stuck at 55% accuracy (the proportion of the negative class).

Is this difference in performance normal? I thought that for a binary classification task, Sigmoid with Binary Crossentropy and Softmax with Sparse Categorical Crossentropy should give similar, if not identical, results. Or did I do something wrong?

Note: I use Adam optimizer and there is a single label column containing 0s and 1s.

Edit: Code for the 2 cases

Case 1: Sigmoid with binary_crossentropy

def addTopModelMobilNetV1(bottom_model, num_classes):
    top_model = bottom_model.output
    top_model = layers.GlobalAveragePooling2D()(top_model)
    top_model = layers.Dense(1024, activation='relu')(top_model)
    top_model = layers.Dense(1024, activation='relu')(top_model)
    top_model = layers.Dense(512, activation='relu')(top_model)
    top_model = layers.Dense(1, activation='sigmoid')(top_model)
    return top_model

fc_head = addTopModelMobilNetV1(mobilnet_model, num_classes)
model = Model(inputs=mobilnet_model.input, outputs=fc_head)
# print(model.summary())

earlystopping_cb = callbacks.EarlyStopping(patience=3, restore_best_weights=True)
model.compile(loss='binary_crossentropy', optimizer=optimizers.Adam(), metrics=['accuracy'])
history = model.fit_generator(generator=train_generator,
                              validation_data=val_generator,
                              epochs=10,
                              callbacks=[earlystopping_cb])

Case 2: Softmax with sparse_categorical_crossentropy

def addTopModelMobilNetV1(bottom_model, num_classes):
    top_model = bottom_model.output
    top_model = layers.GlobalAveragePooling2D()(top_model)
    top_model = layers.Dense(1024, activation='relu')(top_model)
    top_model = layers.Dense(1024, activation='relu')(top_model)
    top_model = layers.Dense(512, activation='relu')(top_model)
    top_model = layers.Dense(2, activation='softmax')(top_model)
    return top_model

fc_head = addTopModelMobilNetV1(mobilnet_model, num_classes)
model = Model(inputs=mobilnet_model.input, outputs=fc_head)

earlystopping_cb = callbacks.EarlyStopping(patience=3, restore_best_weights=True)

model.compile(loss='sparse_categorical_crossentropy', optimizer=optimizers.Adam(), metrics=['accuracy'])

history = model.fit_generator(generator=train_generator,
                              validation_data=val_generator,
                              epochs=10,
                              callbacks=[earlystopping_cb])

Topic sigmoid softmax training loss-function image-classification

Category Data Science

The choice depends on whether the output classes are mutually exclusive. In a multi-label classification problem, for example, we use a separate sigmoid for each output, because the task is treated as several independent binary classification problems.
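As a minimal sketch of the multi-label case (the label names and logit values below are made up for illustration, not taken from the question), each output gets its own sigmoid, so the probabilities are independent and do not have to sum to 1:

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued logit to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical logits for three labels that can all apply to the same image.
logits = {"outdoor": 2.0, "dog": 0.5, "night": -1.5}

# Each label is scored on its own, as a separate binary problem.
probs = {label: sigmoid(z) for label, z in logits.items()}

# A label is predicted when its own probability reaches 0.5.
predicted = [label for label, p in probs.items() if p >= 0.5]

print(probs)
print(predicted)
```

Several labels can be predicted at once here, which is the behaviour a multi-label problem needs.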

If the output classes are mutually exclusive, the best choice is softmax, because it gives a probability for each class and the probabilities sum to 1. For instance, if the image shows a dog, the output might be 90% dog and 10% cat.
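A small sketch of softmax over two mutually exclusive classes, assuming illustrative logits chosen so that the result comes out at roughly 90% dog and 10% cat:

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the classes [dog, cat].
logits = [math.log(9.0), 0.0]
probs = softmax(logits)

print(probs)       # probability of dog, probability of cat
print(sum(probs))  # the probabilities sum to 1 (up to rounding)
```

Raising one class's logit necessarily lowers the probability of the other, which is what mutual exclusivity means in practice.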

In binary classification there is only one output, so there are no other classes for mutual exclusivity to apply to, and we use the sigmoid function.
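For reference, here is a hedged numerical check (with arbitrary example logits, not values from the question's model) that a 2-node softmax head and a 1-node sigmoid head can express the same probability and the same cross-entropy loss: the softmax probability of class 1 equals the sigmoid of the logit difference. It only compares the two formulations on paper and says nothing about why training behaved differently in the question.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued logit to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax2(z0, z1):
    """Softmax over exactly two logits, returning (p_class0, p_class1)."""
    m = max(z0, z1)  # subtract the max for numerical stability
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e0 / (e0 + e1), e1 / (e0 + e1)

# Arbitrary example logits for class 0 and class 1.
z0, z1 = -0.3, 1.2

p_softmax = softmax2(z0, z1)[1]  # P(class 1) from the 2-node head
p_sigmoid = sigmoid(z1 - z0)     # P(class 1) from a 1-node head

# Cross-entropy for a sample whose integer label is 1.
label = 1
loss_sparse_cce = -math.log(softmax2(z0, z1)[label])
loss_bce = -(label * math.log(p_sigmoid)
             + (1 - label) * math.log(1.0 - p_sigmoid))

print(p_softmax, p_sigmoid)
print(loss_sparse_cce, loss_bce)
```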

You can find a summary here: https://stackoverflow.com/a/55936594/16310106

