How to Form Training Examples for a Deep Q-Network in Reinforcement Learning?
I am trying to pick up the basics of reinforcement learning by self-study from some blogs and texts. Forgive me if the question is too basic or the different bits of my understanding are a bit messy, but even after consulting a few references, I cannot really grasp how deep Q-learning with a neural network works.
I understand the Bellman equation as follows:
$$V^\pi(s)= R(s,\pi(s)) + \gamma \sum_{s'} P(s'|s,\pi(s)) V^\pi(s')$$
and the update rule for the Q-table:
$$Q_{n+1}(s_t, a_t)=Q_n(s_t, a_t)+\alpha\left(r_t+\gamma\max_{a\in\mathcal{A}}Q_n(s_{t+1}, a)-Q_n(s_t, a_t)\right)$$
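In the tabular case I can see how this update works in code; my mental model is roughly the following sketch (the state/action counts and hyperparameters are just placeholder values I chose):

```python
import numpy as np

n_states, n_actions = 16, 4       # placeholder sizes for a small discrete problem
alpha, gamma = 0.1, 0.99          # learning rate and discount factor (assumed values)
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One application of the tabular update rule above."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```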
But when training a neural network to represent this mapping, how exactly do I get the training samples? To make it more concrete, suppose the state $s\in\mathbb{R}^d$ is a $d$-dimensional vector and there are $|\mathcal{A}|$ possible actions in total, where $\mathcal{A}$ is the action space.
From my reading, I understand that the neural network will have $d$ input neurons, $|\mathcal{A}|$ output neurons, and hidden layers in between. After a sufficient number of training epochs, the forward pass for any $s\in\mathbb{R}^d$ should produce the $Q$-values of the different actions at the output layer. For this network, then, the training data for supervised learning should have shape $N\times(d+|\mathcal{A}|)$, where $N$ is the number of training samples.
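To make my mental picture of that architecture concrete, I imagine something like the following sketch (I am assuming PyTorch and an arbitrary hidden width here, since none of my references fixed these details):

```python
import torch
import torch.nn as nn

d = 8            # state dimension (placeholder value)
n_actions = 4    # |A|, number of actions (placeholder value)

# d input neurons -> hidden layers -> |A| output neurons, one Q-value per action
q_net = nn.Sequential(
    nn.Linear(d, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)

state = torch.randn(d)       # a single d-dimensional state
q_values = q_net(state)      # forward pass gives Q(s, a) for every action
```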
Now suppose I am using an environment from the gym library. According to its documentation, this is how you take an action in an environment and get back the new state, the reward, and other information:
state, reward, done, info = env.step(action)
So how do I generate $N\times(d+|\mathcal{A}|)$ training samples from the above line of code? Even if I execute it $N$ times, it only gives me the single-step rewards of those actions, not the $Q$-values that account for discounted future rewards.
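To illustrate what I mean, the only thing I know how to collect is one-step transitions, roughly like this (a sketch using the classic gym API; `CartPole-v1` and the random policy are just examples I picked):

```python
import gym

env = gym.make("CartPole-v1")    # example environment (d = 4, |A| = 2)
N = 1000                         # number of interaction steps I want to collect
transitions = []                 # all I can record: (s, a, r, s', done) tuples

state = env.reset()
for _ in range(N):
    action = env.action_space.sample()                 # e.g. an exploratory policy
    next_state, reward, done, info = env.step(action)
    # `reward` here is only the one-step reward, not a discounted Q-value target
    transitions.append((state, action, reward, next_state, done))
    state = env.reset() if done else next_state
```

So my question boils down to: how do these one-step tuples get turned into the $N\times(d+|\mathcal{A}|)$ supervised training examples with $Q$-value targets?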