Why does my model train slower on GCP than on my local machine?

I'm using tensorflow-cloud to train a 3D voxel CNN. My local machine: NVIDIA GeForce RTX 2080 Ti 11 GB, Intel Core i7 @ 3 GHz, 32 GB RAM. This is my machine config on tfc:

tfc.MachineConfig(cpu_cores=8, memory=30, accelerator_type=tfc.AcceleratorType.NVIDIA_TESLA_T4, accelerator_count=1),

To me this looks comparable. However, the training job takes 2-3 times as long as on my local machine. Am I sharing the cloud machine with other training jobs? The job might also be I/O limited; on my local machine the training set (12 GB) is stored on an SSD. Any ideas or suggestions?
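For reference, this is roughly how I launch the job; train.py and requirements.txt are just placeholders for my actual files:

import tensorflow_cloud as tfc

tfc.run(
    entry_point="train.py",              # placeholder for my training script
    requirements_txt="requirements.txt",
    chief_config=tfc.MachineConfig(
        cpu_cores=8,
        memory=30,
        accelerator_type=tfc.AcceleratorType.NVIDIA_TESLA_T4,
        accelerator_count=1,
    ),
    worker_count=0,                      # single machine, no extra workers
)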

Topic google-cloud-platform tensorflow

Category Data Science


The explanation turned out to be simple: our training was CPU-bound due to online augmentation code implemented in Python. It turns out the GCP machines have strong GPUs but, compared to our local machines, relatively weak CPUs.
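To illustrate the pattern (the function names and the augmentation steps below are illustrative, not our actual code): the augmentation was plain NumPy wrapped into the tf.data pipeline, so every sample goes through Python on the CPU while the GPU waits.

import numpy as np
import tensorflow as tf

def augment_np(volume):
    # Plain NumPy augmentation: runs in Python on the CPU, one sample at a
    # time, and is throttled by the GIL even with num_parallel_calls.
    if np.random.rand() < 0.5:
        volume = volume[::-1, :, :]          # random flip along the first axis
    noise = np.random.normal(0.0, 0.01, volume.shape).astype(volume.dtype)
    return volume + noise

def augment(volume, label):
    volume = tf.numpy_function(augment_np, [volume], tf.float32)
    return volume, label

# dataset = dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE)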

Increasing the number of cpu_cores (to 32 or 64) helps, but it also makes things very expensive, since the number of GPUs has to be increased as well (to 2 or 4).
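For example, something along these lines (the memory value and the exact core/GPU combination are assumptions; what is actually allowed depends on the machine types GCP offers):

import tensorflow_cloud as tfc

tfc.MachineConfig(
    cpu_cores=32,                        # more vCPUs to keep the input pipeline fed
    memory=120,                          # assumed value, scaled with the core count
    accelerator_type=tfc.AcceleratorType.NVIDIA_TESLA_T4,
    accelerator_count=2,                 # per the note above, more cores also means more GPUs
)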

The proper fix is probably to port the Python augmentation code to TensorFlow / CUDA.
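A rough sketch of what that could look like, assuming the augmentation is just random flips plus noise on a float32 voxel tensor (the actual augmentation code isn't shown in this post): the same steps expressed as TensorFlow ops run inside the tf.data graph in C++ instead of per-sample Python.

import tensorflow as tf

def augment_tf(volume, label):
    # Same idea as the Python version, but as TensorFlow ops: traced once and
    # executed in the tf.data graph, no Python in the per-sample hot path.
    flip = tf.random.uniform([]) < 0.5
    volume = tf.cond(flip,
                     lambda: tf.reverse(volume, axis=[0]),   # flip along the first axis
                     lambda: volume)
    volume = volume + tf.random.normal(tf.shape(volume), stddev=0.01)
    return volume, label

# dataset = dataset.map(augment_tf, num_parallel_calls=tf.data.AUTOTUNE)
# dataset = dataset.prefetch(tf.data.AUTOTUNE)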
