Dimensionality of the target for DQN agent training

From what I understand, a DQN agent has as many outputs as there are actions (one $Q$ value per action, for a given state). If we consider a scalar state and 4 actions, the DQN would therefore have a 4-dimensional output.
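For concreteness, here is a minimal sketch of such a network in PyTorch (the hidden-layer size is made up for illustration):

```python
import torch.nn as nn

n_actions = 4

# Hypothetical Q-network: a scalar state in, one Q value per action out.
q_net = nn.Sequential(
    nn.Linear(1, 32),          # scalar state as input
    nn.ReLU(),
    nn.Linear(32, n_actions),  # 4-dimensional output: Q(s, a_1), ..., Q(s, a_4)
)
```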

However, the target value for training the agent is usually described as a scalar: $y = r + \gamma \max_{a'} Q(s', a')$, i.e. reward + discount * best future $Q$ value.

How can a scalar value be used to train a neural network with a vector output?

For example, see the image in https://towardsdatascience.com/deep-q-learning-tutorial-mindqn-2a4c855abffc

Tags: dqn, q-learning, deep-learning, machine-learning

Category: Data Science


In my opinion, this architecture is only one of several that can solve the same problem (for example, one could use only two outputs: one for the chosen action and one for the $Q$ value of that action, but I will not elaborate further on this).

What this architecture does is output the whole function $Q(a)$, that is, the $Q$ value as a function of the action $a$. Each output node represents the $Q$ value for the action corresponding to that node (node 1 corresponds to the $Q$ value for action $a_1$, node 2 to the $Q$ value for action $a_2$, and so on).

BUT this is NOT a vector output in the usual sense of the term. It is a functional output that represents the whole function $Q(a)$, one value per action $a$.

As far as the learning/decision rule is concerned, things work as usual: for action selection you simply pick the output node with the highest $Q$ value, and for learning the scalar target $r + \gamma \max_{a'} Q(s', a')$ is applied only to the output node of the action that was actually taken, so the other output nodes receive no error signal for that transition. Hope this is clear.
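To make that concrete, here is a minimal sketch of one training step in PyTorch (the network size, hyperparameters, and transition values are made up for illustration). The key point is that the scalar target is compared only with the $Q$ value of the taken action:

```python
import torch
import torch.nn as nn

state_dim, n_actions = 1, 4
gamma = 0.99

# Hypothetical Q-network: scalar state in, one Q value per action out.
q_net = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# A single (state, action, reward, next_state) transition, made up for this example.
state      = torch.tensor([[0.5]])
action     = torch.tensor([2])     # index of the action that was taken
reward     = torch.tensor([1.0])
next_state = torch.tensor([[0.7]])

# Forward pass: one Q value per action, shape [1, 4].
q_values = q_net(state)

# Scalar TD target: r + gamma * max_a' Q(s', a').
with torch.no_grad():
    target = reward + gamma * q_net(next_state).max(dim=1).values

# Compare the scalar target only with the output for the taken action;
# the other three output nodes contribute nothing to the loss.
q_taken = q_values.gather(1, action.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(q_taken, target)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

An equivalent way to view this (used in some tutorials) is to build a full 4-dimensional target vector that copies the network's current predictions for the non-taken actions and replaces only the taken action's entry with the scalar target, which yields the same gradient.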
