Monitor Model Training Progress on HPC Clusters
As part of my research in deep learning, I frequently have to train models that require a lot of computing power, so I submit my training jobs to my university's HPC cluster.
However, I run into one major issue: monitoring the training performance metrics.
I generally build my models with Keras, and when training locally it is convenient to check the console output from time to time to see how training is progressing.
There is also CometML, which I use when I train models on my own system. However, since the HPC cluster does not allow socket connections, it cannot be used there for monitoring.
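For context, this is roughly how I use CometML locally (the API key and project name below are placeholders). The Experiment object needs an outbound connection to Comet's servers, which is exactly the kind of socket connection the cluster blocks:

```python
# Rough sketch of my local CometML setup; api_key/project_name are placeholders.
from comet_ml import Experiment  # imported before keras so auto-logging can hook in

experiment = Experiment(
    api_key="YOUR_API_KEY",      # placeholder
    project_name="my-project",   # placeholder
)

# ... build and train the Keras model as usual ...
# Metrics can also be logged manually, e.g.:
experiment.log_metric("val_accuracy", 0.87, step=1)  # illustrative value
```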
Is there a way or tool that can be used to monitor the metrics? For now, I dump the logs from time to time, download them to my own machine, and check them there, but that is extremely time-consuming and inefficient.
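To make the current workaround concrete, here is a minimal sketch of the kind of logging I mean; one way to produce such logs is Keras' CSVLogger callback, which writes per-epoch metrics to a file that I then download and inspect. The toy model, data, and file name below are just placeholders:

```python
# Minimal sketch: write per-epoch metrics to a CSV file for later download.
# The model, data, and file path are placeholders for the real training setup.
import numpy as np
from tensorflow import keras

x_train = np.random.rand(256, 20)
y_train = np.random.randint(0, 2, size=(256,))

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# CSVLogger appends one row per epoch (loss, accuracy, ...) to the file.
csv_logger = keras.callbacks.CSVLogger("training_metrics.csv", append=True)

model.fit(x_train, y_train, epochs=10, callbacks=[csv_logger], verbose=2)
```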
If there is an efficient way or tool to do this, please let me know.
Thanks.
Topic machine-learning-model hpc deep-learning machine-learning
Category Data Science