Monitor Model Training Progress on HPC Clusters
As part of my research in deep learning, I frequently have to train models that require a lot of computing power, so I submit my training jobs to my university's HPC cluster.
However, I run into one major issue: monitoring the training performance metrics.
I generally build my models with Keras, and when training locally it is convenient to check the console output from time to time to see how training is progressing.
There is also CometML, which I use when I train models on my own system. However, since the HPC cluster does not allow socket connections, it cannot be used there for monitoring.
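For context, this is roughly how I use CometML locally (the API key and project name below are placeholders). The Experiment object needs an outbound connection to Comet's servers, which is exactly the kind of socket connection the cluster blocks:

```python
# Rough sketch of my local CometML setup; api_key/project_name are placeholders.
from comet_ml import Experiment  # imported before keras so auto-logging can hook in

experiment = Experiment(
    api_key="YOUR_API_KEY",      # placeholder
    project_name="my-project",   # placeholder
)

# ... build and train the Keras model as usual ...
# Metrics can also be logged manually, e.g.:
experiment.log_metric("val_accuracy", 0.87, step=1)  # illustrative value
```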
Is there a way or tool that can be used to monitor the metrics? For now, I dump the logs from time to time, download them to my own machine, and check them there, but that is extremely time-consuming and inefficient.
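To make the current workaround concrete, here is a minimal sketch of the kind of logging I mean; one way to produce such logs is Keras' CSVLogger callback, which writes per-epoch metrics to a file that I then download and inspect. The toy model, data, and file name below are just placeholders:

```python
# Minimal sketch: write per-epoch metrics to a CSV file for later download.
# The model, data, and file path are placeholders for the real training setup.
import numpy as np
from tensorflow import keras

x_train = np.random.rand(256, 20)
y_train = np.random.randint(0, 2, size=(256,))

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# CSVLogger appends one row per epoch (loss, accuracy, ...) to the file.
csv_logger = keras.callbacks.CSVLogger("training_metrics.csv", append=True)

model.fit(x_train, y_train, epochs=10, callbacks=[csv_logger], verbose=2)
```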
If there is an efficient way or tool to do this, please let me know.
Thanks.
Topic machine-learning-model hpc deep-learning machine-learning
Category Data Science