Distilling the knowledge of a binary cross-entropy/sigmoid model into a softmax model
I have a complex CNN architecture that uses a sigmoid output trained with binary cross-entropy for classification. Due to hardware constraints I would like to compress it using knowledge distillation, but most papers deal with distillation between two models that both use softmax outputs with (sparse) categorical cross-entropy. Is it possible to distill a teacher that uses a sigmoid activation and binary cross-entropy into a smaller student that uses softmax for classification? And if so, what changes do I need to make to train both models?
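My understanding so far is that a sigmoid unit is mathematically a 2-way softmax over the logits [0, z], so the teacher's single probability p could be expanded into the two-class distribution [1 - p, p] without retraining the teacher. Here is a minimal NumPy sketch of that expansion (sigmoid_to_two_class is just a name I made up for illustration):

import numpy as np

# A sigmoid unit producing p for the positive class equals a 2-way softmax
# over the logits [0, z], so the teacher's single probability can be
# expanded into a two-class distribution without retraining the teacher.
def sigmoid_to_two_class(p):
    """Expand sigmoid probabilities of shape (N, 1) into soft targets
    of shape (N, 2), ordered as [P(class 0), P(class 1)]."""
    p = np.asarray(p, dtype=float).reshape(-1, 1)
    return np.concatenate([1.0 - p, p], axis=1)

# Example: a teacher 90% confident the image is a dog (class 1) yields
# the soft target [0.1, 0.9] for a softmax student.
print(sigmoid_to_two_class([0.9]))  # [[0.1 0.9]]

Is this the right way to produce soft targets for the student?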
As a concrete example, take the following dogs-vs-cats classification model: how can I convert this binary cross-entropy/sigmoid model into one with a softmax classification layer and the matching cross-entropy loss?
from tensorflow.keras import layers, Model
from tensorflow.keras.optimizers import Adam

# Teacher: CNN with a single sigmoid unit, trained with binary cross-entropy.
img_input = layers.Input(shape=(150, 150, 3))
x = layers.Conv2D(16, 3, activation='relu')(img_input)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(32, 3, activation='relu')(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(64, 3, activation='relu')(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Flatten()(x)
x = layers.Dense(512, activation='relu')(x)
output = layers.Dense(1, activation='sigmoid')(x)  # P(class == 1)
model = Model(img_input, output)

model.compile(loss='binary_crossentropy',
              optimizer=Adam(learning_rate=0.001),
              metrics=['acc'])
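For reference, here is my rough sketch of what the converted softmax student might look like, assuming TensorFlow 2.x Keras. The smaller backbone, the alpha weight, and the teacher/x_train/y_train names are placeholders for illustration, and the blended-target training is just my reading of the usual Hinton-style distillation objective (without a temperature), not a confirmed recipe:

from tensorflow.keras import layers, Model
from tensorflow.keras.optimizers import Adam

# Hypothetical student: a smaller backbone, with the single sigmoid unit
# replaced by a 2-unit softmax head (classes: [cat, dog]).
img_input = layers.Input(shape=(150, 150, 3))
x = layers.Conv2D(16, 3, activation='relu')(img_input)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(32, 3, activation='relu')(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Flatten()(x)
x = layers.Dense(128, activation='relu')(x)
output = layers.Dense(2, activation='softmax')(x)
student = Model(img_input, output)

# Because cross-entropy is linear in the target distribution, training on a
# blend of hard one-hot labels and the teacher's expanded soft targets is
# equivalent to the usual weighted sum of the two distillation losses:
#   teacher_p = teacher.predict(x_train)                         # shape (N, 1)
#   soft = np.concatenate([1.0 - teacher_p, teacher_p], axis=1)  # [1-p, p]
#   hard = np.eye(2)[y_train]                                    # one-hot 0/1 labels
#   targets = alpha * hard + (1.0 - alpha) * soft                # e.g. alpha = 0.5
student.compile(loss='categorical_crossentropy',
                optimizer=Adam(learning_rate=0.001),
                metrics=['acc'])
# student.fit(x_train, targets, epochs=..., batch_size=...)

Is this the correct set of changes, or do I also need to modify the teacher?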