How is the Q value estimated from the state value V and the action advantage A in DDQN?
How is the Q value estimated from the state value V(s) and the action advantage A(a)? In the DDQN algorithm below, the final layer of the deep network is split into two parts: the state value function V(s), which represents the reward value of being in the state, and the action advantage function A(a), which represents the extra reward value of choosing a particular action in that state.

DDQN Algorithm
Input: observation information obs_t = [St, At−1], the Q-network with parameters θ, and the target Q̂-network with parameters θ−
Output: weights θ∗
Initialize the experience replay repository D to capacity N, and initialize the historical observations repository D0 to capacity Tp
Initialize the Q-network with random weights θ, and initialize the target Q̂-network with θ− = θ
Initialize the parameters α, β of the two streams of fully-connected layers in the dueling deep Q-network; the Q-value of each action is obtained as Q = V + A
for episode = 1, M do
    for t = 1, T do
        Fetch the observations from D0 and form the input St = [st, st−1, ..., st−Tp] according to [St, At−1]
        With probability 1 − ε select At = argmax_a Q(St, a; θ); otherwise select a random action At
        Execute action At, observe reward Rt and compute St+1
        Store the transition (St, At, Rt, St+1) in D
        Sample a random minibatch of transitions (St, At, Rt, St+1) from D
        Set yt = Rt + γ max_a′ Q̂(St+1, a′; θ−) and train the network with the loss function L(θ) = E[(yt − Q(St, At; θ))²]
        Reset the target network Q̂ = Q every C steps
    end for
end for
Return weights θ∗ for the Q-network
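To check that I'm reading the inner loop correctly, this is roughly how I picture one update step, written as a Python/PyTorch-style sketch. All the function and variable names here are my own assumptions, not the author's code.

```python
import random
import torch
import torch.nn.functional as F

def select_action(q_net, state, num_actions, epsilon):
    # epsilon-greedy: random action with probability epsilon,
    # otherwise At = argmax_a Q(St, a; theta)
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())

def dqn_update(q_net, target_net, optimizer, replay, batch_size, gamma):
    # replay is assumed to be a list of (state, action, reward, next_state) tensors
    batch = random.sample(replay, batch_size)
    states, actions, rewards, next_states = map(torch.stack, zip(*batch))

    # Q(St, At; theta) for the actions that were actually taken
    q_taken = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # yt = Rt + gamma * max_a' Q_hat(St+1, a'; theta-)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q

    # L(theta) = E[(yt - Q(St, At; theta))^2]
    loss = F.mse_loss(q_taken, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every C steps the target network is reset to the online network:
# target_net.load_state_dict(q_net.state_dict())
```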
The picture below represents the architecture of the deep network.
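Since I can't include the picture here, this is my rough understanding of what that two-stream architecture looks like, as a PyTorch-style sketch. The layer sizes and names are my own guesses, not necessarily the author's exact network.

```python
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    def __init__(self, input_dim, num_actions, hidden_dim=128):
        super().__init__()
        # shared feature layers before the split into two streams
        self.features = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        # state-value stream: a single scalar V(s) per state
        self.value_stream = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))
        # advantage stream: one value A(s, a) per action
        self.advantage_stream = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, num_actions))

    def forward(self, state):
        x = self.features(state)
        v = self.value_stream(x)       # shape (batch, 1)
        a = self.advantage_stream(x)   # shape (batch, num_actions)
        # The listing says Q = V + A; many dueling DQN implementations also
        # subtract the mean advantage, q = v + a - a.mean(1, keepdim=True),
        # so that V and A are uniquely identifiable.
        return v + a                   # V is broadcast across all actions
```

If I understand it correctly, V and A are not trained against separate targets; both streams only receive gradients through the combined Q in the loss above.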
In the given algorithm, the author divides the Q value into two parts: the state value and the action advantage value. I don't understand how the state value and the action advantage value are determined. Can anyone help me understand this algorithm?
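Is this how the two parts are supposed to combine? For example, with made-up numbers for one state and three actions:

```python
# toy example with made-up numbers, following the listing's Q = V + A
V = 2.0                      # value of being in the state
A = [0.5, -0.2, 1.0]         # extra value of each of the 3 actions
Q = [V + a for a in A]       # -> [2.5, 1.8, 3.0]; the greedy action is the third one
```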
Topic data-science-model q-learning reinforcement-learning deep-learning
Category Data Science