Actor Network Target Value in A2C Reinforcement Learning
In DQN, we use the equation
$Target = r+\gamma v(s')$
to train (fit) our network. It is easy to understand, since we use the $Target$ value as the dependent variable, just as we do in supervised learning. That is, in Python we can train the model with
model.fit(state, target, verbose=0)
where $r$ is the observed reward and $v(s')$ is obtained from the model's own prediction.
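For concreteness, here is a minimal sketch of that update with a toy Keras value network and made-up numbers for the transition $(s, r, s')$; the layer sizes and $\gamma$ are my own illustrative assumptions, not part of the question:

    import numpy as np
    from tensorflow import keras

    state_dim = 4
    # value network v(s): state in, a single scalar out
    model = keras.Sequential([
        keras.Input(shape=(state_dim,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")

    gamma = 0.99
    state      = np.random.rand(1, state_dim)   # s
    next_state = np.random.rand(1, state_dim)   # s'
    reward     = 1.0                            # r

    # Target = r + gamma * v(s'), where v(s') is the model's own prediction
    target = reward + gamma * model.predict(next_state, verbose=0)

    # the target plays the role of the dependent variable, as in supervised learning
    model.fit(state, target, verbose=0)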
When it comes to the A2C network, things become more complicated. Now we have two networks: the Actor and the Critic. It is said that the Critic network is no different from what is done in DQN; the only difference is that now there is only one output neuron in the network. So, similarly, we calculate $Target = r+\gamma v(s')$ after acting via an action sampled from the $\pi(a|s)$ distribution, and we train the model in Python with model.fit(state, target, verbose=0) as well.
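Again just as a sketch, the Critic side might look like this (a single output neuron, with the action sampled from a placeholder $\pi(a|s)$ vector; all numbers are made up):

    import numpy as np
    from tensorflow import keras

    state_dim, n_actions = 4, 2
    # critic: same idea as in DQN, but with a single output neuron v(s)
    critic = keras.Sequential([
        keras.Input(shape=(state_dim,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1, activation="linear"),
    ])
    critic.compile(optimizer="adam", loss="mse")

    gamma = 0.99
    state      = np.random.rand(1, state_dim)
    pi_probs   = np.array([0.7, 0.3])                     # placeholder for the actor's pi(a|s)
    action     = np.random.choice(n_actions, p=pi_probs)  # act via a sampled action
    next_state = np.random.rand(1, state_dim)             # s' observed after taking `action`
    reward     = 1.0

    # same target and the same fit call as in DQN
    target = reward + gamma * critic.predict(next_state, verbose=0)
    critic.fit(state, target, verbose=0)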
However, the Actor case is confusing. Now we have another neural network which takes states as input and gives probabilities as output by using a softmax activation function. That is fine. But the point where I am stuck is the dependent variable used to adjust the Actor network. So in Python,
model2.fit(state,?,verbose=0)
What is the ? value that supervises the network to adjust the weights, and WHY?
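To make the setup explicit, here is a minimal sketch of the Actor network described above (the layer sizes are illustrative assumptions); the missing fit target is exactly what I am asking about:

    import numpy as np
    from tensorflow import keras

    state_dim, n_actions = 4, 2
    # actor: states in, action probabilities out via softmax
    model2 = keras.Sequential([
        keras.Input(shape=(state_dim,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(n_actions, activation="softmax"),   # pi(a|s)
    ])
    model2.compile(optimizer="adam", loss="categorical_crossentropy")

    state = np.random.rand(1, state_dim)
    pi = model2.predict(state, verbose=0)   # probabilities summing to 1
    # model2.fit(state, ?, verbose=0)       # <- what goes in place of ?, and why?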
In several resources I found the Advantage value, which is nothing but $Target - V(s)$. There is also something called the actor loss, which is calculated as $-\log \pi(a|s) \cdot Advantage$.
Why?
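For reference, this is the kind of computation those resources show, under the conventions I have seen (all numbers below are made up; this is only to state the formulas, not to claim it is the right answer):

    import numpy as np

    gamma = 0.99
    reward = 1.0
    v_s, v_next = 0.50, 0.60              # critic's estimates of V(s) and V(s')

    target    = reward + gamma * v_next   # same target as used for the critic
    advantage = target - v_s              # Advantage = Target - V(s)

    pi_probs = np.array([0.7, 0.3])       # actor's softmax output for state s
    action   = 0                          # the action actually taken
    actor_loss = -np.log(pi_probs[action]) * advantage
    print(advantage, actor_loss)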
Thanks in advance!