How is the Q value estimated from the state value V and the action advantage A in DDQN?
How is the Q value estimated from the state value V(s) and the action advantage A(a)? In the DDQN algorithm below, the final layer of the deep network is split into two parts: the state value function V(s), which represents the reward value of being in the state, and the action advantage function A(a), which represents the extra reward value of choosing a particular action in that state.

DDQN Algorithm
Input: observation information obs_t = [St, At−1], the Q-network with parameters θ, and the target Q̂-network with parameters θ−
Output: weights θ∗
Initialize the experience replay repository D to capacity N, and initialize the historical observations repository D0 to capacity Tp
Initialize the Q-network with random weights θ, and initialize the target Q̂-network with θ− = θ
Initialize the parameters α, β of the two streams of fully-connected layers in the dueling deep Q-network; the Q-value of each action is obtained as Q = V + A
for episode = 1, M do
    for t = 1, T do
        Fetch the observations from D0 and form the input St = [st, st−1, ..., st−Tp] according to [St, At−1]
        With probability 1 − ε select At = argmax_a Q(St, a; θ); otherwise select a random action At
        Execute action At, observe reward Rt and compute St+1
        Store the transition (St, At, Rt, St+1) in D
        Sample a random minibatch of transitions (St, At, Rt, St+1) from D
        Set yt = Rt + γ max_a′ Q̂(St+1, a′; θ−) and train the network with the loss function L(θ) = E[(yt − Q(St, At; θ))²]
        Reset the target network Q̂ = Q every C steps
    end for
end for
Return weights θ∗ for the Q-network
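To check that I'm reading the inner loop correctly, this is roughly how I picture one update step, written as a Python/PyTorch-style sketch. All the function and variable names here are my own assumptions, not the author's code.

```python
import random
import torch
import torch.nn.functional as F

def select_action(q_net, state, num_actions, epsilon):
    # epsilon-greedy: random action with probability epsilon,
    # otherwise At = argmax_a Q(St, a; theta)
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())

def dqn_update(q_net, target_net, optimizer, replay, batch_size, gamma):
    # replay is assumed to be a list of (state, action, reward, next_state) tensors
    batch = random.sample(replay, batch_size)
    states, actions, rewards, next_states = map(torch.stack, zip(*batch))

    # Q(St, At; theta) for the actions that were actually taken
    q_taken = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # yt = Rt + gamma * max_a' Q_hat(St+1, a'; theta-)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q

    # L(theta) = E[(yt - Q(St, At; theta))^2]
    loss = F.mse_loss(q_taken, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every C steps the target network is reset to the online network:
# target_net.load_state_dict(q_net.state_dict())
```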
The picture below represents the architecture of the deep network.
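Since I can't include the picture here, this is my rough understanding of what that two-stream architecture looks like, as a PyTorch-style sketch. The layer sizes and names are my own guesses, not necessarily the author's exact network.

```python
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    def __init__(self, input_dim, num_actions, hidden_dim=128):
        super().__init__()
        # shared feature layers before the split into two streams
        self.features = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        # state-value stream: a single scalar V(s) per state
        self.value_stream = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))
        # advantage stream: one value A(s, a) per action
        self.advantage_stream = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, num_actions))

    def forward(self, state):
        x = self.features(state)
        v = self.value_stream(x)       # shape (batch, 1)
        a = self.advantage_stream(x)   # shape (batch, num_actions)
        # The listing says Q = V + A; many dueling DQN implementations also
        # subtract the mean advantage, q = v + a - a.mean(1, keepdim=True),
        # so that V and A are uniquely identifiable.
        return v + a                   # V is broadcast across all actions
```

If I understand it correctly, V and A are not trained against separate targets; both streams only receive gradients through the combined Q in the loss above.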
In the given algorithm, the author divides the Q value into two parts: the state value and the action advantage value. I don't understand how the state value and the action advantage value are determined. Can anyone help me understand this algorithm?
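Is this how the two parts are supposed to combine? For example, with made-up numbers for one state and three actions:

```python
# toy example with made-up numbers, following the listing's Q = V + A
V = 2.0                      # value of being in the state
A = [0.5, -0.2, 1.0]         # extra value of each of the 3 actions
Q = [V + a for a in A]       # -> [2.5, 1.8, 3.0]; the greedy action is the third one
```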
Topic data-science-model q-learning reinforcement-learning deep-learning
Category Data Science