How to Form Training Examples for a Deep Q-Network in Reinforcement Learning?
I am trying to pick up the basics of reinforcement learning by self-study from some blogs and texts. Forgive me if the question is too basic or the different bits of my understanding are a bit messy, but even after consulting a few references, I cannot really grasp how deep Q-learning with a neural network works.
I understand the Bellman equation as follows:
$$V^\pi(s)= R(s,\pi(s)) + \gamma \sum_{s'} P(s'|s,\pi(s)) V^\pi(s')$$
and the update rule for the Q-table:
$$Q_{n+1}(s_t, a_t)=Q_n(s_t, a_t)+\alpha\left(r_t+\gamma\max_{a\in\mathcal{A}}Q_n(s_{t+1}, a)-Q_n(s_t, a_t)\right)$$
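In the tabular case I can see how this update works in code; my mental model is roughly the following sketch (the state/action counts and hyperparameters are just placeholder values I chose):

```python
import numpy as np

n_states, n_actions = 16, 4       # placeholder sizes for a small discrete problem
alpha, gamma = 0.1, 0.99          # learning rate and discount factor (assumed values)
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One application of the tabular update rule above."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```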
But when training a neural network to represent this mapping, how exactly do I get the training samples? To make it more concrete, suppose the state $s\in\mathbb{R}^d$ is a $d$-dimensional vector and there are $|\mathcal{A}|$ possible actions in total, where $\mathcal{A}$ is the action space.
From my reading, I understand that the neural network will have $d$ input neurons, $|\mathcal{A}|$ output neurons, and hidden layers in between. After a sufficient number of training epochs, the forward pass for any $s\in\mathbb{R}^d$ should produce the $Q$-values of the different actions at the output layer. For this network, then, the training data for supervised learning should have shape $N\times(d+|\mathcal{A}|)$, where $N$ is the number of training samples.
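To make my mental picture of that architecture concrete, I imagine something like the following sketch (I am assuming PyTorch and an arbitrary hidden width here, since none of my references fixed these details):

```python
import torch
import torch.nn as nn

d = 8            # state dimension (placeholder value)
n_actions = 4    # |A|, number of actions (placeholder value)

# d input neurons -> hidden layers -> |A| output neurons, one Q-value per action
q_net = nn.Sequential(
    nn.Linear(d, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)

state = torch.randn(d)       # a single d-dimensional state
q_values = q_net(state)      # forward pass gives Q(s, a) for every action
```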
Now suppose I am using an environment from the gym library. According to its documentation, this is how you take an action in an environment and get back the new state, the reward, and other information:
state, reward, done, info = env.step(action)
So how do I generate $N\times(d+|\mathcal{A}|)$ training samples from the above line of code? Even if I execute it $N$ times, it only gives me the single-step rewards of those actions, not the $Q$-values that account for discounted future rewards.
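To illustrate what I mean, the only thing I know how to collect is one-step transitions, roughly like this (a sketch using the classic gym API; `CartPole-v1` and the random policy are just examples I picked):

```python
import gym

env = gym.make("CartPole-v1")    # example environment (d = 4, |A| = 2)
N = 1000                         # number of interaction steps I want to collect
transitions = []                 # all I can record: (s, a, r, s', done) tuples

state = env.reset()
for _ in range(N):
    action = env.action_space.sample()                 # e.g. an exploratory policy
    next_state, reward, done, info = env.step(action)
    # `reward` here is only the one-step reward, not a discounted Q-value target
    transitions.append((state, action, reward, next_state, done))
    state = env.reset() if done else next_state
```

So my question boils down to: how do these one-step tuples get turned into the $N\times(d+|\mathcal{A}|)$ supervised training examples with $Q$-value targets?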