TensorFlow Training Crashing

I have created a GCP VM with a Tesla K80 GPU attached to it. I have installed the NVIDIA 465 drivers for Ubuntu 20.04 along with CUDA 11.
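For anyone trying to reproduce this, the setup described above could be checked with a short script along these lines. This is only a sketch: the function name `describe_tf_environment` and its `sync_launches` parameter are hypothetical, and it assumes TensorFlow 2.3 or newer, where `tf.sysconfig.get_build_info()` is available. It reports which CUDA/cuDNN versions the installed TensorFlow build expects and which GPUs it can see, which may help spot a mismatch with the installed driver and toolkit.

```python
import importlib.util
import os


def describe_tf_environment(sync_launches=False):
    """Collect basic facts about the TensorFlow/CUDA setup (hypothetical helper)."""
    if sync_launches:
        # CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so an
        # illegal memory access tends to be reported closer to the failing op.
        # It must be set before CUDA is initialized, hence before the import.
        os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    info = {"tensorflow_installed": importlib.util.find_spec("tensorflow") is not None}
    if not info["tensorflow_installed"]:
        return info

    import tensorflow as tf

    # Build info describes what the TensorFlow binary was compiled against,
    # not what is installed on the machine; the keys are absent on CPU builds.
    build = tf.sysconfig.get_build_info()
    info["tf_version"] = tf.__version__
    info["built_with_cuda"] = build.get("cuda_version")
    info["built_with_cudnn"] = build.get("cudnn_version")
    info["gpus"] = [device.name for device in tf.config.list_physical_devices("GPU")]
    return info


if __name__ == "__main__":
    for key, value in describe_tf_environment().items():
        print(f"{key}: {value}")
```

Comparing `built_with_cuda` against the CUDA version actually installed on the VM is one possible first check; it does not by itself explain the crash.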

I am trying to use TensorFlow on this machine, but every time training starts, it crashes after a few epochs. Here is the log:

216/216 [==============================] - ETA: 0s - loss: 2.5774 - accuracy: 0.2203   
216/216 [==============================] - 173s 800ms/step - loss: 2.5774 - accuracy: 0.2203 - val_loss: 47.4114 - val_accuracy: 0.1372 - lr: 0.0100
Epoch 2/50
216/216 [==============================] - ETA: 0s - loss: 1.9055 - accuracy: 0.3265  
216/216 [==============================] - 137s 633ms/step - loss: 1.9055 - accuracy: 0.3265 - val_loss: 46.8945 - val_accuracy: 0.2023 - lr: 0.0100
Epoch 3/50
216/216 [==============================] - ETA: 0s - loss: 1.7601 - accuracy: 0.3899  
216/216 [==============================] - 137s 633ms/step - loss: 1.7601 - accuracy: 0.3899 - val_loss: 1.9010 - val_accuracy: 0.3895 - lr: 0.0100
Epoch 4/50
216/216 [==============================] - ETA: 0s - loss: 1.5993 - accuracy: 0.4417  
216/216 [==============================] - 137s 632ms/step - loss: 1.5993 - accuracy: 0.4417 - val_loss: 1.7880 - val_accuracy: 0.3919 - lr: 0.0100
Epoch 5/50
216/216 [==============================] - ETA: 0s - loss: 1.2965 - accuracy: 0.5580  
216/216 [==============================] - 134s 618ms/step - loss: 1.2965 - accuracy: 0.5580 - val_loss: 1.9468 - val_accuracy: 0.3919 - lr: 0.0100
Epoch 6/50
 60/216 [=======......................] - ETA: 1:20 - loss: 1.0874 - accuracy: 0.6354
2021-06-10 19:12:36.997237: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-10 19:12:36.997296: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
Aborted (core dumped)

Please advise if you have run into a similar error before.
