Does a GPU decrease training time for on-policy RL?

I was wondering whether using a GPU will be effective if I am using an on-policy RL algorithm (e.g. PPO) as the model.

In other words, how can we use a GPU to decrease training time for an on-policy RL model?

I recently trained a model and GPU utilization was around 2%.

Topic: policy-gradients, gpu, reinforcement-learning

Category: Data Science


GPUs/TPUs are used to increase processing speed when training deep learning models because of their parallel processing capability.

Reinforcement learning, on the other hand, is predominantly CPU-intensive because of the sequential interaction between the agent and the environment. Since you want to use on-policy RL algorithms, keeping a GPU busy is going to be difficult.
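To make this concrete, below is a minimal PyTorch/gymnasium sketch (not a full PPO implementation; the environment id, network sizes, and number of parallel environments are placeholder choices) of the usual way to raise GPU utilization in on-policy RL: step many environments in parallel on the CPU and push their observations through the policy as one batch on the GPU.

```python
# Minimal sketch, not a full PPO implementation: run many environments in
# parallel on the CPU and batch their observations through a policy network
# on the GPU. "CartPole-v1", the layer sizes and NUM_ENVS are placeholders.
import gymnasium as gym
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

NUM_ENVS = 64  # more parallel envs -> larger batches -> higher GPU utilization
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(NUM_ENVS)]
)

obs_dim = envs.single_observation_space.shape[0]
n_actions = envs.single_action_space.n

# The policy lives on the GPU; the environments stay on the CPU.
policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions)
).to(device)

obs, _ = envs.reset(seed=0)
for _ in range(128):  # a short rollout, just to show the data flow
    with torch.no_grad():
        logits = policy(torch.as_tensor(obs, dtype=torch.float32, device=device))
        actions = torch.distributions.Categorical(logits=logits).sample()
    # Stepping the environments is still sequential CPU work under the hood;
    # only the batched forward pass above runs on the GPU.
    obs, rewards, terminations, truncations, _ = envs.step(actions.cpu().numpy())
```

With a single environment and a tiny network, the batched forward pass is far too small to keep the GPU busy, which is consistent with the roughly 2% utilization observed in the question.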

Analyse your use case thoroughly; that will help you understand whether there is room to increase GPU utilization.

  • If you are only looking to reduce training time, you can go for a multi-threaded parallel reinforcement learning algorithm (MPRL).

  • If you are looking for GPU utilization, you should go for deep reinforcement learning, which combines RL with deep neural networks.
    A few such algorithms, namely Deep Q-Networks (DQN), Advantage Actor-Critic (A2C), and Asynchronous Advantage Actor-Critic (A3C), parallelise the environment across the available CPU cores to decrease overall training time.

    Deep RL models cannot simply be trained in large offline batches, because policy evaluation and policy updates alternate. Because of this, deep RL can be deployed as a CPU-only process. However, if the model has millions of parameters to update, the processing time for the policy update increases drastically. At the same time, the time taken to move tensors to and from the GPU becomes significant if the policy evaluation and update phases are relatively fast; this is usually the case for smaller models with hundreds to thousands of parameters, or in DQN, where a batch of episodic memories is trained on individually.

    In current state-of-the-art algorithms such as Deep Q-Networks (DQN) and Advantage Actor-Critic (A2C), the model is usually written in TensorFlow or PyTorch with GPU support, while the environment and the rest of the training script run on the CPU. With this implementation, the policy evaluation phase occurs on the CPU, while the policy update phase, where the model weights are updated, occurs on either the CPU or the GPU.

    An experiment was conducted to compare both cases: running the deep RL computation only on the CPU, and toggling between the CPU and GPU. A rough timing sketch of this kind of comparison follows below.
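As a rough illustration of that kind of comparison (this is not the original experiment; the network width, batch size, and squared-error stand-in loss are assumptions), the following PyTorch sketch times the same block of gradient updates once on the CPU and once on the GPU, including the host-to-device copies discussed above.

```python
# Rough timing sketch, assuming PyTorch: time a block of gradient updates on
# the CPU and (if available) on the GPU, including the host-to-device copies.
# The layer width, batch size and squared-error stand-in loss are placeholders.
import time
import torch
import torch.nn as nn

def time_updates(device, hidden=256, batch=2048, steps=50):
    torch.manual_seed(0)
    model = nn.Sequential(
        nn.Linear(32, hidden), nn.ReLU(), nn.Linear(hidden, 2)
    ).to(device)
    opt = torch.optim.Adam(model.parameters(), lr=3e-4)

    # Rollout data is produced on the CPU, so the .to(device) copies below are
    # part of what gets measured.
    obs = torch.randn(batch, 32)
    targets = torch.randn(batch, 2)

    start = time.perf_counter()
    for _ in range(steps):
        obs_d, tgt_d = obs.to(device), targets.to(device)
        loss = ((model(obs_d) - tgt_d) ** 2).mean()  # stand-in for a real RL loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return time.perf_counter() - start

print("CPU :", time_updates(torch.device("cpu")))
if torch.cuda.is_available():
    print("CUDA:", time_updates(torch.device("cuda")))
```

For small networks the transfer overhead tends to dominate and the CPU run can win, while larger networks benefit from the GPU, which matches the reasoning above.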
