Time horizon T in policy gradients (actor-critic)

I am currently going through the Berkeley lectures on Reinforcement Learning. Specifically, I am at slide 5 of this lecture.

At the bottom of that slide, the gradient of the expected sum of rewards is given as $$ \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t}) \left( Q(s_{i,t},a_{i,t}) - V(s_{i,t}) \right) $$ The Q-value function is defined as $$Q(s_t,a_t) = \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})\vert s_t,a_t]$$ At first glance this makes sense: I compare the value of taking the chosen action $a_{i,t}$ against the average value of being in that state, $V(s_{i,t})$, and can evaluate how good my action was.
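To make the estimator concrete, here is a minimal PyTorch-style sketch of what I think this update looks like in code (the tensor names and shapes are my own assumptions, not from the slide):

```python
import torch

def actor_loss(log_probs, q_values, state_values):
    # log_probs:    (N, T) tensor of log pi_theta(a_{i,t} | s_{i,t})
    # q_values:     (N, T) critic estimates Q(s_{i,t}, a_{i,t})
    # state_values: (N, T) critic estimates V(s_{i,t})
    advantages = (q_values - state_values).detach()  # don't backprop through the critic
    # Negative sign: minimizing this loss follows the gradient estimator above
    return -(log_probs * advantages).sum(dim=1).mean()
```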

My question is: a specific state $s_{spec}$ can occur at any timestep, for example, $s_1 = s_{spec} = s_{10}$. But isn't there a difference in value depending on whether I hit $s_{spec}$ at timestep 1 or 10 when $T$ is fixed? Does this mean that for every state there is a different Q-value for each possible $t \in \{1,\ldots,T\}$? I somehow doubt that this is the case, but I don't quite understand how the time horizon $T$ fits in.

Or is $T$ not fixed (perhaps it's defined as the time step in which the trajectory ends in a terminal state - but that'd mean that during trajectory sampling, each simulation would take a different number of timesteps)?

Topic policy-gradients actor-critic reinforcement-learning deep-learning machine-learning

Category Data Science


In this case, I think it doesn't matter when you reach $s_{spec}$, but rather how the Q-value gets updated as a result of taking an action in that state. So there shouldn't be a different Q-value for each possible $t \in \{1, \ldots, T\}$, only a Q-value for each possible action. I'm sure being in a state at a specific timestep does make a difference, but it's the agent's job to learn this through the RL algorithm (like the policy gradient method in the lecture).
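As a toy illustration of what I mean, a tabular critic keeps one entry per (state, action) pair rather than per (state, action, timestep). This is just a minimal sketch with a one-step TD update, not the exact actor-critic update from the lecture:

```python
from collections import defaultdict

# One Q-value per (state, action) pair, with no time index.
Q = defaultdict(float)

def td_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # One-step SARSA-style TD update for the (state, action) entry
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```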

As for whether $T$ is fixed or not: the horizon $T$ can be either infinite or a fixed finite number. For example, if $T$ is fixed to $10$, the agent should learn a policy that maximizes the total discounted reward within that finite amount of time, which may not be the optimal policy for the infinite-horizon problem. When $T$ is infinite, there is more time to explore and figure out the optimal policy.
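To see how this plays out during trajectory sampling, here is a minimal sketch of computing a discounted return. With a fixed horizon, every trajectory has exactly $T$ rewards; in the episodic case, the list length is whatever step the environment terminated at, so trajectories can indeed have different lengths:

```python
def discounted_return(rewards, gamma=0.99):
    # rewards: list of r(s_t, a_t) for one trajectory; length T if the
    # horizon is fixed, or the episode length if it ends at a terminal state
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```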

The closest method I know of that takes note of when a state-action pair was encountered is experience replay, as used in DQN.
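For reference, the replay buffer used with DQN usually looks something like this minimal sketch (class and method names are my own, not from any particular library):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        # Oldest transitions are dropped once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random minibatch of stored transitions for a DQN update
        return random.sample(self.buffer, batch_size)
```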

I'm also learning Reinforcement Learning right now! I recommend Deep RL Bootcamp since they give you labs in Python which are really intuitive.
