Training speed decreases when adding more GPUs

I am using distributed TensorFlow with MirroredStrategy to train VGG16 on a custom Estimator. However, increasing the number of GPUs increases the training time. As far as I can check, GPU utilization is about 100% and the input function seems able to feed data to the GPUs. All GPUs are in a single machine. Is there any clue to finding out the problem? This is the computation graph, and I am wondering whether the Groups_Deps nodes cause the problem.
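The question does not include the input pipeline, so as a rough sketch only, a tf.data-based input_fn that keeps the GPUs fed for an Estimator typically looks like the following (the file pattern and parse_example are placeholders, not from the question):

    import tensorflow as tf

    def input_fn():
        # Hypothetical file pattern and parse function; replace with your own.
        files = tf.io.gfile.glob("/data/train-*.tfrecord")
        dataset = tf.data.TFRecordDataset(files)
        dataset = dataset.map(parse_example,
                              num_parallel_calls=tf.data.experimental.AUTOTUNE)
        dataset = dataset.shuffle(10000).batch(64)
        # prefetch overlaps host-side preprocessing with GPU compute,
        # so the accelerators are not starved for input.
        return dataset.prefetch(tf.data.experimental.AUTOTUNE)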

Topic: gpu tensorflow distributed

Category: Data Science


Using GPUs can accelerate training, but as you increase the number of GPUs your training has to be distributed, which means your data has to be moved to several GPUs, and that costs bandwidth. I would profile the training and see how much time is spent moving data to the GPUs and getting results back. Plus, it is harder to synchronize training like this.
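One way to do that profiling with a custom Estimator, assuming TensorFlow 1.x, is to attach a tf.train.ProfilerHook so timeline traces are written out; the trace files can then be opened in chrome://tracing to see how much time goes into host-to-device copies and cross-GPU all-reduce ops. A minimal sketch (estimator and train_input_fn stand in for your own objects):

    import tensorflow as tf

    # Write a timeline trace every 100 steps; inspect the JSON files
    # in ./profile with chrome://tracing to see time spent in
    # HostToDevice copies and cross-GPU all-reduce ops.
    profiler_hook = tf.train.ProfilerHook(save_steps=100,
                                          output_dir="./profile",
                                          show_dataflow=True,
                                          show_memory=False)

    estimator.train(input_fn=train_input_fn,
                    hooks=[profiler_hook],
                    max_steps=1000)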

If you are using TensorFlow 1.14+, try changing the distribution method to "MirroredStrategy". I have found this tends to work better with multiple GPUs.
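With a custom Estimator, the strategy is usually passed in through the RunConfig. A sketch of that wiring, assuming TensorFlow 1.14 and a model_fn/input_fn you already have (vgg16_model_fn and train_input_fn are placeholder names):

    import tensorflow as tf

    # Mirror variables across all local GPUs. HierarchicalCopyAllReduce
    # is an alternative to the default NCCL all-reduce that may perform
    # better on some single-machine topologies; measure both.
    strategy = tf.distribute.MirroredStrategy(
        cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

    config = tf.estimator.RunConfig(train_distribute=strategy)

    estimator = tf.estimator.Estimator(model_fn=vgg16_model_fn,
                                       model_dir="./model",
                                       config=config)

    estimator.train(input_fn=train_input_fn, max_steps=1000)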
