Improving the Actor Critic algorithm proposed by Keras

On this page of the Keras website, a reinforcement learning algorithm based on an actor-critic scheme is described. It is a deep policy gradient algorithm (henceforth DPG). Keras functions are central to this code, so TensorFlow tries to access an NVIDIA GPU for acceleration; otherwise it falls back to the available CPU cores. I believe this code is not well optimized because it uses only one core. The main part of the code is the tf.GradientTape() block followed by the loop over the steps per episode:

    (...)
    with tf.GradientTape() as tape:
        for timestep in range(1, max_steps_per_episode):

            # convert the environment state into a batched tensor
            state = tf.convert_to_tensor(state)
            state = tf.expand_dims(state, 0)

            # forward pass through the shared actor-critic network
            action_probs, critic_value = model(state)
            (...)

When this code is running, I observe 100% CPU usage on just one core.
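
For context, that snippet sits inside a larger episode loop. Roughly (paraphrasing the structure of the Keras actor-critic page from memory, so names like env, rewards_history and loss_value come from there and not from the excerpt above), it looks like this:

    while True:  # run episodes until the task is considered solved
        state = env.reset()
        with tf.GradientTape() as tape:
            for timestep in range(1, max_steps_per_episode):
                state = tf.convert_to_tensor(state)
                state = tf.expand_dims(state, 0)
                action_probs, critic_value = model(state)
                critic_value_history.append(critic_value[0, 0])

                # sample an action and step the (pure Python/NumPy) environment
                action = np.random.choice(num_actions, p=np.squeeze(action_probs))
                action_probs_history.append(tf.math.log(action_probs[0, action]))
                state, reward, done, _ = env.step(action)
                rewards_history.append(reward)
                if done:
                    break

            # ... compute the returns and the actor/critic losses from the histories ...

        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

The environment stepping, the NumPy action sampling and the Python history lists all live inside the tape, which matters for what follows.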

On the other hand, the improved version, i.e. the deep deterministic policy gradient (henceforth DDPG), which is definitely more expensive computationally than DPG, is better optimized since it parallelizes all the operations. Just to give an idea: with three DDPG runs going, I observe about 66% usage across all the cores. I don't want to compare these two algorithms in terms of their learning capabilities and results, only in terms of their computational implementation. In the DDPG code, just before the optimization function, there is the following statement:

    # Eager execution is turned on by default in TensorFlow 2. Decorating with tf.function allows
    # TensorFlow to build a static graph out of the logic and computations in our function.
    # This provides a large speed up for blocks of code that contain many small TensorFlow
    # operations such as this one. (Keras)
    @tf.function
    def update(self, state_batch, action_batch, reward_batch, next_state_batch):
        (...)

So it is the @tf.function decorator that provides the speed-up.
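
To see what the decorator buys in isolation, here is a minimal, self-contained sketch; the tiny model and the squared-error loss are invented purely for illustration and are not taken from either example:

    import tensorflow as tf

    # toy network and optimizer, for illustration only
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(2),
    ])
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

    @tf.function  # trace the Python body once, then replay the resulting graph
    def train_step(states, targets):
        with tf.GradientTape() as tape:
            predictions = model(states, training=True)
            loss = tf.reduce_mean(tf.square(targets - predictions))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    states = tf.random.normal((32, 4))
    targets = tf.random.normal((32, 2))
    print(train_step(states, targets).numpy())  # the graph is built on this first call

This pattern is easy to apply in the DDPG code because update() only receives tensor batches and touches no Python-side state; the episode loop of the DPG example is not in that shape.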

My question is this: I want to speed up the DPG code. In my opinion, to do that I should rewrite the code so that the body of the episode loop becomes a function decorated with @tf.function. That would be a nightmare, since I would have to pass everything into that function, and there is also the problem of the global variables. Of course, if I naively try to add the TensorFlow decorator right above the tf.GradientTape() block, I get an error.
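
To illustrate one version of the refactor I have in mind, here is a rough sketch of my own (not code from the Keras page): pull only the per-timestep forward pass into a decorated function and keep the environment interaction eager.

    @tf.function  # only the per-step network call runs as a compiled graph
    def forward(state):
        state = tf.expand_dims(state, 0)           # add the batch dimension
        action_probs, critic_value = model(state)  # shared actor-critic network
        return action_probs, critic_value

    # inside the (still eager) episode loop:
    with tf.GradientTape() as tape:
        for timestep in range(1, max_steps_per_episode):
            action_probs, critic_value = forward(tf.convert_to_tensor(state))
            # ... sample the action, step the environment, store the histories ...

Even so, I don't know whether this alone would remove the single-core bottleneck, since most of the per-step work (env.step, the NumPy sampling) would still run as eager Python.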

Is there a more direct way to provide the speed-up without reorganizing the main part of the code into a function?

Topic policy-gradients keras-rl gpu tensorflow reinforcement-learning

Category Data Science
