Minibatch SGD performs better than Adam for region proposal network training

I am training a region proposal network (RPN) with both minibatch SGD (with momentum) and Adam. The library used is Keras. In both cases the batch size is 5 and the initial learning rate is 0.01. The learning rate decay schedule is also the same for both optimizers.
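For concreteness, the mechanical difference between the two updates can be sketched in plain NumPy (the hyperparameters mirror the question's lr = 0.01 and the usual Keras defaults for momentum and Adam's betas; the quadratic loss is just a toy stand-in, not the RPN loss):

```python
import numpy as np

def sgd_momentum_step(w, grad, v, lr=0.01, mu=0.9):
    """One SGD-with-momentum update: velocity accumulates past gradients."""
    v = mu * v - lr * grad
    return w + v, v

def adam_step(w, grad, m, s, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: the step is rescaled per-parameter by the square
    root of a running second-moment estimate, so the effective step size
    stays on the order of lr regardless of the gradient's magnitude."""
    m = b1 * m + (1 - b1) * grad        # first-moment (mean) estimate
    s = b2 * s + (1 - b2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - b1 ** t)           # bias correction
    s_hat = s / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s

# Toy quadratic loss f(w) = 0.5 * w**2, so grad = w; run both for 500 steps
w_sgd, v = 1.0, 0.0
w_adam, m, s = 1.0, 0.0, 0.0
for t in range(1, 501):
    w_sgd, v = sgd_momentum_step(w_sgd, w_sgd, v)
    w_adam, m, s = adam_step(w_adam, w_adam, m, s, t)
print(abs(w_sgd), abs(w_adam))
```

The point of the sketch is that with the same nominal lr = 0.01, the two optimizers take very differently scaled steps: Adam's per-parameter normalization keeps its step near lr even when gradients are small, which is part of why the same learning rate value is not directly comparable between the two.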

The RPN classification loss steadily decreases with SGD with momentum but diverges with Adam. The performance of SGD with momentum is noticeably better after about 500 epochs.

Given that everything else is the same for both optimizers, why does Adam perform worse? Any intuitive explanations would be great.

Topic object-detection mini-batch-gradient-descent keras deep-learning machine-learning

Category Data Science
