Deep Q-learning: how to set the q-values of non-selected actions?
I am learning Deep Q-learning by applying it to a real-world problem. I have been through some tutorials and papers available online, but I couldn't figure out a solution to the following problem.
Let's say we have $N$ possible actions to select from in each state. When in state $s$ we make a move by selecting an action $a_i$, $i = 1\dots N$; as a result we receive a reward $r$ and end up in a new state $s^\prime$. To update the neural network with this experience $(s, a_i, r, s^\prime)$, the only ground-truth q-value we have is the one for action $a_i$. In other words, we have no ground-truth q-values for any of the other possible actions ($a_j$, $j = 1\dots N$, $j \neq i$). How, then, should we feed this training sample to the neural network?
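For reference, the target I can compute for the selected action is the usual one-step Q-learning target (assuming a discount factor $\gamma$ and a separate target network with parameters $\theta^-$, which are my notation, not part of the problem statement above):

$$y_i = r + \gamma \max_{a^\prime} Q(s^\prime, a^\prime; \theta^-),$$

so the question is only about what target to use for the remaining $N-1$ outputs.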
Following are the possibilities I was thinking about:
1. Set the other q-values to "don't care". In this scenario we would not update the weights of the last layer that connect to the outputs for the $a_j$'s. However, because the earlier layers are shared, any update to their weights will still affect the q-values of the $a_j$'s.
2. Set the other q-values to the network's current predictions. This makes the error for those outputs zero, but, as with the solution above, the change in the shared weights will still affect the $a_j$'s q-values eventually. (A sketch of what I mean by this is below the list.)
3. Use one neural network per action. This makes perfect sense to me as a solution, but Deep Q-learning uses a single network to predict the q-values of all possible actions (regardless of whether two networks are used, one as the policy network and the other as the target network).
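To make option 2 concrete, here is a minimal sketch of the kind of update I have in mind (in PyTorch; `q_net`, `target_net`, the optimizer and the batch tensors are placeholders I made up for illustration, not taken from any particular tutorial):

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    # Batch tensors from a replay buffer:
    #   states (B, ...), actions (B,) long, rewards (B,), next_states (B, ...), dones (B,)
    states, actions, rewards, next_states, dones = batch

    # Current predictions for all N actions in each state, shape (B, N).
    q_all = q_net(states)

    # One-step TD target, computed only for the action that was actually taken.
    with torch.no_grad():
        next_q_max = target_net(next_states).max(dim=1).values
        target_for_taken = rewards + gamma * next_q_max * (1.0 - dones)

    # Option 2: copy the network's own predictions as targets, then overwrite
    # only the entry of the selected action. The error on every other action
    # is exactly zero, so no gradient flows through those outputs.
    targets = q_all.detach().clone()
    targets[torch.arange(len(actions)), actions] = target_for_taken

    loss = F.mse_loss(q_all, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

As far as I can tell, this is equivalent to option 1 implemented as a mask on the loss, since only the selected output contributes a non-zero error, yet the shared earlier layers still get updated, which is exactly the part I am unsure about.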
If someone with experience and knowledge could help me understand this, I would greatly appreciate it.
Topic q-learning deep-learning neural-network
Category Data Science