Why can distributed deep learning provide higher accuracy (lower error) than non-distributed deep learning in the following cases?

Based on some papers I have read, distributed deep learning can provide faster training times. In addition, it can also provide better accuracy (lower prediction error). What are the reasons for this?

Question edited:

I am using TensorFlow to run distributed deep learning (DL) and compare its performance with non-distributed DL. I use a dataset of 1000 samples and 10000 training steps. The distributed DL uses 2 workers and 1 parameter server. The following cases are considered when running the code:

  1. Each worker and the non-distributed DL use all 1000 samples as the training set, with the same mini-batch size of 200

  2. Each worker uses 500 samples as its training set (the first 500 samples for worker 1 and the remaining 500 for worker 2), the non-distributed DL uses all 1000 samples, and both use the same mini-batch size of 200

  3. Each worker uses 500 samples as its training set (the first 500 samples for worker 1 and the remaining 500 for worker 2) with a mini-batch size of 100, while the non-distributed DL uses all 1000 samples with a mini-batch size of 200 (a small code sketch of these data partitions follows this list)
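
For clarity, here is a minimal sketch of how the training sets are partitioned in each case. The array names and the random toy data are my own placeholders, chosen only to illustrate the splits:

```python
import numpy as np

# Hypothetical toy dataset standing in for the question's 1000 samples.
X = np.random.rand(1000, 8)
y = np.random.rand(1000, 1)

# Case 1: both workers (and the non-distributed run) train on all
# 1000 samples, with mini-batch size 200 everywhere.
case1_worker1, case1_worker2 = (X, y), (X, y)

# Cases 2 and 3: the data is split in half between the two workers;
# the only difference is the mini-batch size (200 in Case 2, 100 in
# Case 3, versus 200 for the non-distributed baseline).
worker1_data = (X[:500], y[:500])   # first 500 samples
worker2_data = (X[500:], y[500:])   # remaining 500 samples
```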

Based on the simulation, distributed DL has a lower RMSE than non-distributed DL in all cases. The RMSEs rank, from lowest to highest: distributed DL in Case 2 < distributed DL in Case 1 < distributed DL in Case 3 < non-distributed DL.

In addition, I also doubled the training for the non-distributed DL (i.e., 2 × 10000 steps), and its results are still not as good as those of the distributed DL.

One reason could be the mini-batch size; however, I wonder what other reasons explain why the distributed DL performs better in the aforementioned cases.

Topic tensorflow deep-learning distributed

Category Data Science


About the accuracy: going with the strongest reason, memory constraints diminish when the computation is distributed. That allows you to increase your training batch size, which reduces the gradient noise caused by small mini-batch sizes. With less noise, the gradient steps point more consistently toward the minima.
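
As a rough illustration of that batch-size effect, here is a minimal NumPy sketch. The toy quadratic loss and the noise model are my own assumptions, chosen only to make the gradient variance visible:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: the per-sample loss is (w - x)^2 / 2 with x ~ N(1, 1),
# so the true gradient at w = 0 is w - E[x] = -1.
def minibatch_gradient(w, batch_size):
    x = rng.normal(loc=1.0, scale=1.0, size=batch_size)
    return np.mean(w - x)  # noisy mini-batch gradient estimate

for batch_size in [10, 100, 200, 400]:
    grads = [minibatch_gradient(0.0, batch_size) for _ in range(1000)]
    print(f"batch={batch_size:4d}  mean={np.mean(grads):+.3f}  "
          f"std={np.std(grads):.3f}")
# The mean stays near the true gradient (-1) at every batch size,
# but the standard deviation (gradient noise) shrinks roughly like
# 1 / sqrt(batch_size), so larger batches give smoother descent steps.
```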

You can refer to this video for a deeper understanding: https://www.youtube.com/watch?v=-_4Zi8fCZO4&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=16

About the speed: this one is more obvious, I think. You distribute your gradient descent computations across multiple machines or CPUs/GPUs/TPUs, so you get faster training as a result.
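
As a concrete example of distributing those computations, here is a minimal sketch assuming TensorFlow 2 and tf.distribute.MirroredStrategy, a synchronous data-parallel strategy over local devices. The question's 2-worker/1-parameter-server setup would use a different strategy, but the principle of splitting each step's gradient computation is the same; the toy model and data below are hypothetical:

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy regression data on the question's scale.
X = np.random.rand(1000, 8).astype("float32")
y = np.random.rand(1000, 1).astype("float32")

# MirroredStrategy replicates the model on every local device and
# splits each global batch among them; the per-device gradients are
# averaged (all-reduced) before each update, so one step's work is
# done in parallel instead of sequentially.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(8,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# A global batch of 200 is divided across the available devices,
# e.g. 100 samples per device when two GPUs are present.
model.fit(X, y, batch_size=200, epochs=5, verbose=0)
```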
