Why does my model train slower on GCP than on my local machine?

I'm using tensorflow-cloud to train a 3D voxel CNN. My local machine: NVIDIA GeForce RTX 2080 Ti 11 GB, Intel Core i7 @ 3 GHz, 32 GB RAM. This is my machine config on tfc:

tfc.MachineConfig(cpu_cores=8, memory=30, accelerator_type=tfc.AcceleratorType.NVIDIA_TESLA_T4, accelerator_count=1),

To me this looks comparable. However, the training job takes 2-3 times as long as on my local machine. Am I sharing the cloud machine with other training jobs? The job might also be I/O limited; on my local machine the training set (12 GB) is stored on an SSD. Any ideas or suggestions?
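For reference, this is roughly how I launch the job; train.py and requirements.txt are just placeholders for my actual files:

import tensorflow_cloud as tfc

tfc.run(
    entry_point="train.py",              # placeholder for my training script
    requirements_txt="requirements.txt",
    chief_config=tfc.MachineConfig(
        cpu_cores=8,
        memory=30,
        accelerator_type=tfc.AcceleratorType.NVIDIA_TESLA_T4,
        accelerator_count=1,
    ),
    worker_count=0,                      # single machine, no extra workers
)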

Topic google-cloud-platform tensorflow

Category Data Science


The explanation turned out to be simple: our training was CPU-bound due to online augmentation code implemented in Python. It turns out the GCP machines have strong GPUs but, compared to our local machines, relatively weak CPUs.
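To illustrate the pattern (the function names and the augmentation steps below are illustrative, not our actual code): the augmentation was plain NumPy wrapped into the tf.data pipeline, so every sample goes through Python on the CPU while the GPU waits.

import numpy as np
import tensorflow as tf

def augment_np(volume):
    # Plain NumPy augmentation: runs in Python on the CPU, one sample at a
    # time, and is throttled by the GIL even with num_parallel_calls.
    if np.random.rand() < 0.5:
        volume = volume[::-1, :, :]          # random flip along the first axis
    noise = np.random.normal(0.0, 0.01, volume.shape).astype(volume.dtype)
    return volume + noise

def augment(volume, label):
    volume = tf.numpy_function(augment_np, [volume], tf.float32)
    return volume, label

# dataset = dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE)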

Increasing the number of cpu_cores (to 32 or 64) helps, but it also makes things very expensive, since the number of GPUs has to be increased as well (to 2 or 4).
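For example, something along these lines (the memory value and the exact core/GPU combination are assumptions; what is actually allowed depends on the machine types GCP offers):

import tensorflow_cloud as tfc

tfc.MachineConfig(
    cpu_cores=32,                        # more vCPUs to keep the input pipeline fed
    memory=120,                          # assumed value, scaled with the core count
    accelerator_type=tfc.AcceleratorType.NVIDIA_TESLA_T4,
    accelerator_count=2,                 # per the note above, more cores also means more GPUs
)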

The proper fix is probably to port the Python augmentation code to TensorFlow / CUDA.
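A rough sketch of what that could look like, assuming the augmentation is just random flips plus noise on a float32 voxel tensor (the actual augmentation code isn't shown in this post): the same steps expressed as TensorFlow ops run inside the tf.data graph in C++ instead of per-sample Python.

import tensorflow as tf

def augment_tf(volume, label):
    # Same idea as the Python version, but as TensorFlow ops: traced once and
    # executed in the tf.data graph, no Python in the per-sample hot path.
    flip = tf.random.uniform([]) < 0.5
    volume = tf.cond(flip,
                     lambda: tf.reverse(volume, axis=[0]),   # flip along the first axis
                     lambda: volume)
    volume = volume + tf.random.normal(tf.shape(volume), stddev=0.01)
    return volume, label

# dataset = dataset.map(augment_tf, num_parallel_calls=tf.data.AUTOTUNE)
# dataset = dataset.prefetch(tf.data.AUTOTUNE)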
