Understanding a CNN by visualizing class activations with Grad-CAM

I followed the blog "Where CNN is looking?" to understand and visualize the class activations that drive a prediction. The example given there works very well.

I have developed a custom model, based on autoencoders, for image similarity. The model accepts two images and predicts a similarity score. It has the following layers:


Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 256, 256, 3)  0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 256, 256, 3)  0                                            
__________________________________________________________________________________________________
encoder (Sequential)            (None, 7, 7, 256)    3752704     input_1[0][0]                    
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
Merged_feature_map (Concatenate (None, 7, 7, 512)    0           encoder[1][0]                    
                                                                 encoder[2][0]                    
__________________________________________________________________________________________________
mnet_conv1 (Conv2D)             (None, 7, 7, 1024)   2098176     Merged_feature_map[0][0]         
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 7, 7, 1024)   4096        mnet_conv1[0][0]                 
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 7, 7, 1024)   0           batch_normalization_1[0][0]      
__________________________________________________________________________________________________
mnet_pool1 (MaxPooling2D)       (None, 3, 3, 1024)   0           activation_1[0][0]               
__________________________________________________________________________________________________
mnet_conv2 (Conv2D)             (None, 3, 3, 2048)   8390656     mnet_pool1[0][0]                 
__________________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, 3, 3, 2048)   8192        mnet_conv2[0][0]                 
__________________________________________________________________________________________________
activation_2 (Activation)       (None, 3, 3, 2048)   0           batch_normalization_2[0][0]      
__________________________________________________________________________________________________
mnet_pool2 (MaxPooling2D)       (None, 1, 1, 2048)   0           activation_2[0][0]               
__________________________________________________________________________________________________
reshape_1 (Reshape)             (None, 1, 2048)      0           mnet_pool2[0][0]                 
__________________________________________________________________________________________________
fc1 (Dense)                     (None, 1, 256)       524544      reshape_1[0][0]                  
__________________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, 1, 256)       1024        fc1[0][0]                        
__________________________________________________________________________________________________
activation_3 (Activation)       (None, 1, 256)       0           batch_normalization_3[0][0]      
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 1, 256)       0           activation_3[0][0]               
__________________________________________________________________________________________________
fc2 (Dense)                     (None, 1, 128)       32896       dropout_1[0][0]                  
__________________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, 1, 128)       512         fc2[0][0]                        
__________________________________________________________________________________________________
activation_4 (Activation)       (None, 1, 128)       0           batch_normalization_4[0][0]      
__________________________________________________________________________________________________
dropout_2 (Dropout)             (None, 1, 128)       0           activation_4[0][0]               
__________________________________________________________________________________________________
fc3 (Dense)                     (None, 1, 64)        8256        dropout_2[0][0]                  
__________________________________________________________________________________________________
batch_normalization_5 (BatchNor (None, 1, 64)        256         fc3[0][0]                        
__________________________________________________________________________________________________
activation_5 (Activation)       (None, 1, 64)        0           batch_normalization_5[0][0]      
__________________________________________________________________________________________________
dropout_3 (Dropout)             (None, 1, 64)        0           activation_5[0][0]               
__________________________________________________________________________________________________
fc4 (Dense)                     (None, 1, 1)         65          dropout_3[0][0]                  
__________________________________________________________________________________________________
batch_normalization_6 (BatchNor (None, 1, 1)         4           fc4[0][0]                        
__________________________________________________________________________________________________
activation_6 (Activation)       (None, 1, 1)         0           batch_normalization_6[0][0]      
__________________________________________________________________________________________________
dropout_4 (Dropout)             (None, 1, 1)         0           activation_6[0][0]               
__________________________________________________________________________________________________
reshape_2 (Reshape)             (None, 1)            0           dropout_4[0][0]                  
==================================================================================================

The encoder layer consists of the following layers:

conv2d_1
batch_normalization_1
activation_1
max_pooling2d_1
conv2d_2
batch_normalization_2
activation_2
max_pooling2d_2
conv2d_3
batch_normalization_3
activation_3
conv2d_4
batch_normalization_4
activation_4
conv2d_5
batch_normalization_5
activation_5
max_pooling2d_3

I want to change my custom network to accept one input instead of two, using only the encoder part, and generate heatmaps to understand what the encoder has learned.

The idea is: if the network predicts 'not similar', I can generate a heatmap for each image separately and compare them.
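What I am aiming for is roughly the following (a sketch of the goal, not existing code; encoder_model and the Input wrapping are my assumptions):

from keras.models import Model
from keras.layers import Input

encoder = model.get_layer('encoder')        # the shared Sequential encoder
single_input = Input(shape=(256, 256, 3))   # one image instead of two
encoder_model = Model(single_input, encoder(single_input))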

What I have done is the following:

First, I passed the two images to the network and got the prediction, as described in the blog:

import numpy as np
from keras import backend as K

preds = model.predict([x, y])
class_idx = np.argmax(preds[0])  # always 0 here, since the model has a single output unit
class_output = model.output[:, class_idx]
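Here x and y are the two preprocessed images; I load them along the lines of the blog (load_input and the file names are just for illustration):

from keras.preprocessing import image

def load_input(path):
    img = image.load_img(path, target_size=(256, 256))
    arr = image.img_to_array(img) / 255.0   # assuming the same [0, 1] scaling used in training
    return np.expand_dims(arr, axis=0)      # add the batch dimension

x = load_input('image_a.jpg')  # hypothetical file names
y = load_input('image_b.jpg')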

Next, I took the encoder (the last convolutional block) and computed the gradient of the class output with respect to its feature map:

last_conv_layer = model.get_layer('encoder')
# get_output_at(-1) selects the encoder's most recent call, i.e. the input_2 branch
grads = K.gradients(class_output, last_conv_layer.get_output_at(-1))[0]

Printing grads gives:

Tensor("gradients/Merged_feature_map/concat_grad/Slice_1:0", shape=(?, 7, 7, 256), dtype=float32)

Then I pooled the gradients as described in the blog:

pooled_grads = K.mean(grads, axis=(0, 1, 2))
input_img = model.inputs[0]  # the first image input (input_1)
iterate = K.function([input_img], [pooled_grads, last_conv_layer.get_output_at(-1)[0]])

At this point, inspecting the function's inputs and outputs shows:

iterate.inputs
[<tf.Tensor 'input_1:0' shape=(?, 256, 256, 3) dtype=float32>]

iterate.outputs
[<tf.Tensor 'Mean:0' shape=(256,) dtype=float32>, <tf.Tensor 'strided_slice_1:0' shape=(7, 7, 256) dtype=float32>]

But I now get an error on the following line:

pooled_grads_value, conv_layer_output_value = iterate([x])

The error is:

You must feed a value for placeholder tensor 'input_2' with dtype float and shape [?,256,256,3]
     [[{{node input_2}}]]

It seems the graph still requires the second image input, even though iterate.inputs lists only one tensor.
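My guess is that, because class_output depends on the concatenation of both encoder branches, the gradient still references both placeholders, so the function would need both inputs, something like this (untested sketch):

iterate = K.function(model.inputs,
                     [pooled_grads, last_conv_layer.get_output_at(-1)[0]])
pooled_grads_value, conv_layer_output_value = iterate([x, y])

But that seems to defeat the purpose of a single-input visualization.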

Where have I made a mistake? How can I restrict the computation to a single image? Or is there a better way to achieve this?
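For completeness, once iterate runs I plan to finish the heatmap exactly as in the blog (standard Grad-CAM post-processing; the file names are placeholders):

import cv2

heatmap = np.mean(conv_layer_output_value * pooled_grads_value, axis=-1)  # weight the 256 maps
heatmap = np.maximum(heatmap, 0)    # ReLU: keep only positive influence
heatmap /= (heatmap.max() + 1e-8)   # normalize to [0, 1]

img = cv2.imread('image_a.jpg')     # placeholder path
heatmap = cv2.resize(heatmap, (img.shape[1], img.shape[0]))
heatmap = np.uint8(255 * heatmap)
heatmap = cv2.applyColorMap(heatmap, cv2.COLORMAP_JET)
overlay = cv2.addWeighted(img, 0.6, heatmap, 0.4, 0)
cv2.imwrite('heatmap_a.jpg', overlay)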

