Time horizon T in policy gradients (actor-critic)
I am currently going through the Berkeley lectures on Reinforcement Learning. Specifically, I am at slide 5 of this lecture.
At the bottom of that slide, the gradient of the expected sum of rewards is given as $$ \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t}) \left( Q(s_{i,t},a_{i,t}) - V(s_{i,t}) \right) $$ where the Q-value function is defined as $$Q(s_t,a_t) = \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})\vert s_t,a_t].$$ At first glance, this makes sense: I compare the value of the chosen action $a_{i,t}$ to the average value of being in state $s_{i,t}$ under the policy, which tells me how good my action was.
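For concreteness, here is how I currently read that estimator for a single sampled trajectory. I am assuming that in practice $Q(s_{i,t},a_{i,t})$ is approximated by the sampled reward-to-go and $V(s_{i,t})$ comes from a separate critic; the function names and array shapes below are my own sketch, not code from the lecture:

```python
import numpy as np

def reward_to_go(rewards):
    """Q-hat(s_t, a_t) = sum_{t'=t}^{T} r_{t'} for one sampled trajectory."""
    return np.cumsum(rewards[::-1])[::-1]

def policy_gradient_estimate(grad_log_probs, rewards, values):
    """
    grad_log_probs: shape (T, d), rows are grad_theta log pi(a_t | s_t)
    rewards:        shape (T,),   entries are r(s_t, a_t)
    values:         shape (T,),   the critic's V(s_t)
    Returns this trajectory's contribution to the gradient estimate.
    """
    advantages = reward_to_go(rewards) - values          # Q-hat - V
    return (grad_log_probs * advantages[:, None]).sum(axis=0)

# Toy example: one trajectory of length T = 3 with d = 2 policy parameters
g = policy_gradient_estimate(
    grad_log_probs=np.array([[0.1, -0.2], [0.3, 0.0], [-0.1, 0.4]]),
    rewards=np.array([1.0, 0.0, 1.0]),
    values=np.array([1.5, 0.8, 0.9]),
)
print(g)  # averaging such terms over N trajectories estimates grad J(theta)
```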
My question is: a specific state $s_{\text{spec}}$ can occur at any timestep, for example $s_1 = s_{\text{spec}} = s_{10}$. But isn't there a difference in value depending on whether I hit $s_{\text{spec}}$ at timestep 1 or at timestep 10 when $T$ is fixed? Does this mean that for every state there is a different Q-value for each possible $t \in \{0,\ldots,T\}$? I somehow doubt that this is the case, but I don't quite understand how the time horizon $T$ fits in.
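To make my concern concrete, suppose (purely hypothetically) that every step gives reward $1$ and $T = 10$. Then the Q-value definition above seems to give different answers depending on when I reach $s_{\text{spec}}$: $$\sum_{t'=1}^{10} \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})] = 10 \qquad \text{vs.} \qquad \sum_{t'=10}^{10} \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})] = 1,$$ even though the state itself is the same in both cases.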
Or is $T$ not fixed (perhaps it is defined as the time step at which the trajectory ends in a terminal state, but that would mean that during trajectory sampling each simulation takes a different number of timesteps)?