Convolutional neural network overfitting. Dropout not helping
I am playing a little with convnets. Specifically, I am using the Kaggle cats-vs-dogs dataset, which consists of 25000 images labeled as either cat or dog (12500 each).
I've managed to achieve around 85% classification accuracy on my test set; however, I have set myself a goal of 90% accuracy.
My main problem is overfitting. Somehow it always ends up happening (normally after epoch 8-10). The architecture of my network is loosely inspired by VGG-16; more specifically, my images are resized to $128 \times 128 \times 3$, and then I run:
Convolution 1: 128x128x32 (kernel size 3, stride 1)
Convolution 2: 128x128x32 (kernel size 3, stride 1)
Max pool 1: 64x64x32 (kernel size 2, stride 2)
Convolution 3: 64x64x64 (kernel size 3, stride 1)
Convolution 4: 64x64x64 (kernel size 3, stride 1)
Max pool 2: 32x32x64 (kernel size 2, stride 2)
Convolution 5: 16x16x128 (kernel size 3, stride 1)
Convolution 6: 16x16x128 (kernel size 3, stride 1)
Max pool 3: 8x8x128 (kernel size 2, stride 2)
Convolution 7: 8x8x256 (kernel size 3, stride 1)
Max pool 4: 4x4x256 (kernel size 2, stride 2)
Convolution 8: 4x4x512 (kernel size 3, stride 1)
Fully connected layer 1: 1024 units (dropout 0.5)
Fully connected layer 2: 1024 units (dropout 0.5)
All layers except the last one use ReLU activations. A rough code sketch of this stack is shown below.
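For reference, here is a sketch of the stack written with tf.keras for readability; this is not my exact code, just the layer structure. Note that the sizes listed above jump from 32x32 after max pool 2 to 16x16 at convolution 5, so the sketch inserts an extra 2x2 max pool there to make the shapes line up, and it assumes 'same' padding and a 2-unit softmax output.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, 3, padding='same', activation='relu',
                  input_shape=(128, 128, 3)),                  # conv 1: 128x128x32
    layers.Conv2D(32, 3, padding='same', activation='relu'),   # conv 2: 128x128x32
    layers.MaxPooling2D(2, 2),                                 # pool 1: 64x64x32
    layers.Conv2D(64, 3, padding='same', activation='relu'),   # conv 3: 64x64x64
    layers.Conv2D(64, 3, padding='same', activation='relu'),   # conv 4: 64x64x64
    layers.MaxPooling2D(2, 2),                                 # pool 2: 32x32x64
    layers.MaxPooling2D(2, 2),                                 # assumed extra pool: 16x16x64
    layers.Conv2D(128, 3, padding='same', activation='relu'),  # conv 5: 16x16x128
    layers.Conv2D(128, 3, padding='same', activation='relu'),  # conv 6: 16x16x128
    layers.MaxPooling2D(2, 2),                                 # pool 3: 8x8x128
    layers.Conv2D(256, 3, padding='same', activation='relu'),  # conv 7: 8x8x256
    layers.MaxPooling2D(2, 2),                                 # pool 4: 4x4x256
    layers.Conv2D(512, 3, padding='same', activation='relu'),  # conv 8: 4x4x512
    layers.Flatten(),
    layers.Dense(1024, activation='relu'),                     # FC 1
    layers.Dropout(0.5),
    layers.Dense(1024, activation='relu'),                     # FC 2
    layers.Dropout(0.5),
    layers.Dense(2, activation='softmax'),                     # assumed cat-vs-dog output
])
```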
Note that I have tried different combinations of convolutions (I started with simpler convolutions).
Also, I have augmented the dataset by mirroring the images, so that in total I have 50000 images.
I am also normalizing the images using min-max normalization, where $X$ is the image:
$$X = \frac{X - 0}{255 - 0}$$
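A minimal sketch of that preprocessing (the mirroring plus the rescaling to [0, 1]), assuming the data is already loaded into NumPy arrays named `images` and `labels` (hypothetical names):

```python
import numpy as np

# images: uint8 array of shape (25000, 128, 128, 3); labels: array of shape (25000,)
# Double the dataset with horizontal mirrors, as described above.
mirrored = images[:, :, ::-1, :]
images_aug = np.concatenate([images, mirrored], axis=0)
labels_aug = np.concatenate([labels, labels], axis=0)

# Min-max normalization: pixel values go from [0, 255] to [0, 1].
images_aug = images_aug.astype(np.float32) / 255.0
```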
The code is written in TensorFlow, and the batch size is 128.
The mini-batches of training data end up being fit perfectly, reaching 100% accuracy, while the validation accuracy plateaus at around 84-85%.
I have also tried to increase/decrease the dropout rate.
The optimizer is AdamOptimizer with a learning rate of 0.0001.
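In terms of the Keras-style sketch above, the training setup corresponds to something like the following (again an approximation, not my exact code; the loss, number of epochs, and validation split are illustrative):

```python
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # Adam, lr = 0.0001
    loss='sparse_categorical_crossentropy',                   # assumed loss for the 2-way softmax
    metrics=['accuracy'],
)

# Batch size 128, using the augmented/normalized arrays from the sketch above.
# Keras Dropout layers are active only during fit(), not during evaluation.
model.fit(images_aug, labels_aug, batch_size=128, epochs=30, validation_split=0.2)
```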
I have been working on this problem for the last 3 weeks, and 85% seems to be a barrier I cannot get past.
For the record, I know I could use transfer learning to achieve much better results, but I am interested in building this network as a self-learning experience.
Update:
I am running the SAME network with a much smaller batch size (16 instead of 128), and so far I am achieving 87.5% accuracy (instead of 85%). That said, the network still ends up overfitting. I do not understand how dropping out 50% of the units is not helping; obviously I am doing something wrong here. Any ideas?
Update 2:
It seems the problem had to do with the batch size: with the smaller batch size (16 instead of 128) I am now achieving 92.8% accuracy on my test set. The network still overfits (the mini-batches end up with 100% accuracy), but the loss (error) keeps decreasing and is in general more stable. The downside is a MUCH slower running time, but it is totally worth the wait.