Experience replay memory: is saving the next state required when the state does not depend on the action?

So, I am using an agent with a state-action policy and I am trying to understand the concept of experience replay memory (ERM). As far as I have learned so far, the ERM is basically a buffer that stores experiences of the form:

e_t = {s_t, a_t, r_t+1, s_t+1}

Where s is the state, a the action and r the reward, as usual. As I understand it, in order to use a network that learns to predict the correct action from such experiences, the network's input should have exactly the form of the experiences, i.e. the current state and the next state, as well as the predicted action and the received reward. The network would thus have four inputs.

First question, is it correct that the next state s_t+1 is fed into the network as input? Or is it a label?

Second question, how is this initialized? The network needs to be trained on experiences right away, so I assume that at first we generate action predictions for various example states using the initial parameters of our model until the ERM is completely filled for the first time, and only then start optimizing with SGD (or similar).

Third question, what if the action that we take just influences the reward we get, but does not have any actual influence on the next state? For instance, think of an agent that decides whether to buy/sell/idle (actions) a stock based on its past price (state), receiving a reward that depends on the taken action as well as the evolution of the price in the next timestep. If we assume that our action does not have any significant influence on where the stock price will go, the next state (price at next step) is independent of our action, yet, we receive a reward that is dependent on our action as well as the next price. How would an experience replay look in that case? In that case, would we also need to save the next state? Or rather just

e_t = {s_t,a_t,r_t+1}

because firstly s_t+1 is already encoded in the reward r_t+1 and secondly our action does not change s_t+1?

Thanks! Best, JZ


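For question one: the next state is neither an extra network input nor a label in the supervised sense. In Q-learning-style actor-critic methods, for example, the next state is sampled from the replay buffer and used to compute the target value for the critic's loss during training: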

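# Excerpt from an actor-critic update; state, action, reward, next_state are a batch sampled from the replay buffer.
# reward, not_done and discount are assumed names that the original excerpt did not show.
with torch.no_grad():
    next_action = self.actor_target(next_state) + noise
    # Bootstrapped target: reward plus the discounted value of the next state
    target_Q = reward + not_done * discount * self.critic_target(next_state, next_action)

current_Q = self.critic(state, action)
critic_loss = F.mse_loss(current_Q, target_Q)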

This may vary from algorithm to algorithm, so it would depend on your specific use-case.
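As a rough illustration of where each element of a stored experience ends up, here is a minimal tabular Q-learning sketch (a simplified analogue, not the actor-critic code above). The function name `q_learning_update`, the `done` flag, and the toy state labels are assumptions made for this example only.

```python
from collections import defaultdict


def q_learning_update(q, experience, actions, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update for a single stored experience."""
    state, action, reward, next_state, done = experience
    # next_state is used only to build the bootstrapped target value.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    # state and action select the estimate that gets corrected towards the target.
    q[(state, action)] += alpha * (target - q[(state, action)])
    return target


# Hypothetical toy data: the action names follow the stock example in the question.
actions = ["buy", "sell", "idle"]
q = defaultdict(float)
q[("s1", "buy")] = 2.0
target = q_learning_update(
    q, ("s0", "buy", 1.0, "s1", False), actions, alpha=0.5, gamma=0.5
)
```

In this sketch the state and action act as the lookup key (the "input"), while the reward and next state only enter through the target that the estimate is pulled towards.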

For question two you are correct. You need something in the experience replay buffer before you can use it for training, so you must at least partially fill it with the state-action transitions of some initial agent or agents. Whether these agents are randomly initialized, or partially pre-optimized (with random search, for example) is up to you.
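A minimal sketch of such a warm-up phase might look like the following. The `ReplayBuffer` class, the `warm_up` helper, and the toy environment `toy_env_step` are all hypothetical names invented for illustration; a real setup would plug in its own environment and initial policy.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest experience once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored experiences.
        return random.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def warm_up(buffer, env_step, initial_state, actions, n_steps, rng):
    """Fill the buffer with transitions generated by a uniformly random policy."""
    state = initial_state
    for _ in range(n_steps):
        action = rng.choice(actions)
        reward, next_state = env_step(state, action)
        buffer.add(state, action, reward, next_state)
        state = next_state


def toy_env_step(state, action):
    # Hypothetical toy dynamics: the state is a step counter, "buy" pays 1.0.
    reward = 1.0 if action == "buy" else 0.0
    return reward, state + 1


buffer = ReplayBuffer(capacity=100)
warm_up(buffer, toy_env_step, 0, ["buy", "sell", "idle"], n_steps=20, rng=random.Random(0))
batch = buffer.sample(8)  # training can start once enough experiences are stored
```

Here training could begin after 20 stored transitions even though the capacity is 100, which reflects the point that the buffer only needs to be partially filled.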

For question three: as with question one, the necessary content of the replay buffer depends on the specific algorithm that you are using. Algorithms that bootstrap from the value of the next state need s_t+1 to compute the target even when the action has no influence on it. Personally, I have not come across any instances where a replay buffer was implemented without the next_state, but someone with more experience might know more.
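To make the stock example concrete, here is a small sketch under stated assumptions: the reward is taken to be the position (+1 for buy, -1 for sell, 0 for idle) times the price change, and `value_fn` is a placeholder value estimate. Neither comes from the question; they only illustrate how a bootstrapped target uses the next state even though the action never affects it.

```python
def make_experience(prices, t, action):
    """Build (s_t, a_t, r_{t+1}, s_{t+1}) from a price series.

    The action never changes the prices, so the next state is independent of it.
    """
    position = {"buy": 1.0, "sell": -1.0, "idle": 0.0}[action]
    state, next_state = prices[t], prices[t + 1]
    reward = position * (next_state - state)  # assumed reward definition
    return state, action, reward, next_state


def td_target(experience, next_value_fn, gamma):
    """Bootstrapped target: reward plus discounted value estimate of the next state."""
    _, _, reward, next_state = experience
    return reward + gamma * next_value_fn(next_state)


def value_fn(price):
    # Hypothetical placeholder for a learned value estimate.
    return 0.5


prices = [100.0, 102.0, 101.0]
experience = make_experience(prices, 0, "buy")

bootstrapped = td_target(experience, value_fn, gamma=0.5)  # depends on next_state
myopic = td_target(experience, value_fn, gamma=0.0)        # reduces to the reward
```

With a discount of zero the target reduces to the reward alone, which is the only case in this sketch where storing just {s_t, a_t, r_t+1} would be enough; with any positive discount the next state still has to be available.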

Edit: also, this may help: state-action-reward-new state: confusion of terms

