What is the difference between Pytorch's DataParallel and DistributedDataParallel?

I am going through this imagenet example.

And, in line 88, the module DistributedDataParallel is used. When I searched for it in the docs, I couldn't find anything. However, I did find the documentation for DataParallel.

So, I would like to know what the difference is between the DataParallel and DistributedDataParallel modules.

Tags: pytorch, gpu, distributed


DataParallel is easier to debug, because your training script is contained in a single process. But DataParallel can also cause poor GPU utilization, because one master GPU must hold the model, the combined loss, and the combined gradients of all the GPUs. DistributedDataParallel, by contrast, runs one process per GPU, so no single GPU has to gather everything from the others.

For a more detailed explanation, see here.


As the distributed GPU functionality is only a couple of days old [in the v0.2 release of Pytorch], there is no documentation for it yet. So, I had to go through the source code's docstrings to figure out the difference. The docstring of the DistributedDataParallel module is as follows:

Implements distributed data parallelism at the module level. This container parallelizes the application of the given module by splitting the input across the specified devices by chunking in the batch dimension. The module is replicated on each machine and each device, and each such replica handles a portion of the input. During the backwards pass, gradients from each node are averaged. The batch size should be larger than the number of GPUs used locally. It should also be an integer multiple of the number of GPUs so that each chunk is the same size (so that each GPU processes the same number of samples).
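To make the "replicated on each machine and each device" part concrete, here is a minimal sketch of how DistributedDataParallel is typically set up: one process per GPU, each wrapping its own replica, with gradients averaged across processes during the backward pass. The model, master address/port, and batch shapes below are placeholder assumptions, not taken from the imagenet example.

```python
# Minimal DistributedDataParallel sketch: one process per GPU, single machine.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # rendezvous host (assumed)
    os.environ["MASTER_PORT"] = "29500"       # rendezvous port (assumed)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    torch.cuda.set_device(rank)
    model = nn.Linear(10, 10).cuda(rank)                    # placeholder model
    ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    inputs = torch.randn(32, 10).cuda(rank)                 # this process's shard of the batch
    loss = ddp_model(inputs).sum()
    loss.backward()        # gradients are all-reduced (averaged) across processes here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```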

And the docstring of the DataParallel module is as follows:

Implements data parallelism at the module level. This container parallelizes the application of the given module by splitting the input across the specified devices by chunking in the batch dimension. In the forward pass, the module is replicated on each device, and each replica handles a portion of the input. During the backwards pass, gradients from each replica are summed into the original module. The batch size should be larger than the number of GPUs used. It should also be an integer multiple of the number of GPUs so that each chunk is the same size (so that each GPU processes the same number of samples).
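And a corresponding sketch for DataParallel, which stays inside one process: the wrapper scatters the batch across the visible GPUs in the forward pass and gathers the outputs and gradients back onto the master GPU (the bottleneck mentioned in the other answer). The model and batch sizes are placeholder assumptions, not taken from the imagenet example.

```python
# Minimal DataParallel sketch: a single process drives all visible GPUs.
import torch
import torch.nn as nn

model = nn.Linear(10, 10).cuda()      # master copy lives on the default GPU
dp_model = nn.DataParallel(model)     # replicated onto the other GPUs each forward pass

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs = torch.randn(64, 10).cuda()   # the full batch; DataParallel chunks it across GPUs
loss = dp_model(inputs).sum()         # per-GPU outputs are gathered back to the master GPU
loss.backward()                       # gradients are summed into the original (master) module
optimizer.step()
```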

This reply in the Pytorch forums was also helpful in understanding the difference between the two.
