I'm trying to understand how Q-learning deals with games where the optimal policy is a mixed strategy. The Bellman equation says that you should choose $\max_a Q(s,a)$, but this implies a single unique action for each $s$. Is Q-learning just not appropriate if you believe that the problem has a mixed strategy?
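To make the concern concrete (this is just my own restatement of the two kinds of policy): the greedy policy derived from a Q-function is deterministic, whereas a mixed strategy is a distribution over actions,
$$\pi(s) = \arg\max_a Q(s,a) \qquad \text{vs.} \qquad \pi(a \mid s) = p_a, \quad \sum_a p_a = 1.$$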
From what I understand, a DQN agent has as many outputs as there are actions (for each state). If we consider a scalar state with 4 actions, that would mean that the DQN would have a 4-dimensional output. However, when it comes to the target value for training the agent, it is usually described as a scalar value = reward + discount * best_future_Q. How could a scalar value be used to train a neural network having a vector output? For …
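For concreteness, here is roughly how I imagine building a per-sample target; this is my own sketch, assuming a Keras-style network called `model` that maps one state to a vector of 4 Q-values, not code from any tutorial:

```python
import numpy as np

def make_target_vector(model, s, a, r, s_next, done, gamma=0.99):
    # start from the network's current predictions so the error is zero
    # for the 3 actions that were not taken
    target_vec = model.predict(np.reshape(s, (1, -1)))[0]          # shape (4,)
    next_q = model.predict(np.reshape(s_next, (1, -1)))[0]
    # only the entry of the chosen action is replaced by the scalar
    # reward + discount * best_future_Q
    target_vec[a] = r if done else r + gamma * np.max(next_q)
    return target_vec                                              # train with MSE against model(s)
```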
If you train an agent using reinforcement learning (with a Q-function in this case), should you give a negative reward (punish) if the agent proposes illegal actions for the presented state? I guess that over time, if you only select from among the legal actions, the illegal ones would eventually drop out, but would punishing them cause them to drop out sooner, and possibly cause the agent to explore more of the legal actions sooner? To expand on this further; say you're training …
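For reference, this is the "only select among legal actions" alternative I have in mind; `legal_actions` is a hypothetical per-state helper for the environment, not something from any library:

```python
import numpy as np

def epsilon_greedy_legal(q_values, legal_actions, epsilon=0.1):
    # q_values: float array with one Q-value per action in the current state
    # legal_actions: indices of the actions that are legal in this state
    if np.random.rand() < epsilon:
        return int(np.random.choice(legal_actions))
    masked = np.full(q_values.shape, -np.inf)
    masked[legal_actions] = q_values[legal_actions]   # illegal actions can never win the argmax
    return int(np.argmax(masked))
```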
Based on a DeepMind publication, I've recreated the environment and I am trying to make the DQN find and converge to an optimal policy. The task of the agent is to learn how to sustainably collect apples (objects), with the regrowth of the apples depending on their spatial configuration (the more apples around, the higher the regrowth). So in short: the agent has to find out how to collect as many apples as it can (for collecting an apple it gets a …
Why is DQN used so frequently while there is hardly any occurrence of Deep Sarsa? I found this paper, https://arxiv.org/pdf/1702.03118.pdf, using it, but nothing else that might be relevant. I assume the cause could be the Ape-X architecture, which came up the year after the Deep Sarsa paper and allowed an immense amount of experience to be generated for off-policy algorithms. Does that make sense, or is there any other reason?
How can we have RF-Q-learning or SVR-Q-learning (i.e., combine these algorithms with Q-learning)? I want to replace the DNN part of deep Q-learning with an RF or SVR, but the problem is that there is no clear training data that I can put into my TensorFlow or Keras code! How can we do this?
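This is a sketch of what I imagine "RF-Q-learning" could look like: fitted Q-iteration, where the regression targets come from the Bellman backup. Here `transitions` is an assumed list of `(s, a, r, s_next)` tuples that I have already collected, and `n_actions` is the size of my action space:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fitted_q_iteration(transitions, n_actions, n_iters=20, gamma=0.99):
    X = np.array([np.append(s, a) for s, a, _, _ in transitions])   # features = state + action
    model = None
    for _ in range(n_iters):
        y = []
        for s, a, r, s_next in transitions:
            if model is None:
                y.append(r)                                          # first pass: Q ~= immediate reward
            else:
                q_next = [model.predict(np.append(s_next, b).reshape(1, -1))[0]
                          for b in range(n_actions)]
                y.append(r + gamma * max(q_next))                    # Bellman target = the "training data"
        model = RandomForestRegressor(n_estimators=100)
        model.fit(X, np.array(y))                                    # refit the forest on the new targets
    return model
```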
I am learning deep Q-learning by applying it to a real-world problem. I have been through some tutorials and papers available online, but I couldn't figure out the solution for the following problem statement. Let's say we have $N$ possible actions in each state to select from. When in state $s$ we make a move by selecting an action $a_i, i=1\dots N$; as a result we get a reward $r$ and end up in a new state $s^\prime$. In …
I'm trying to pick up the basics of reinforcement learning by self-study from some blogs and texts. Forgive me if the question is too basic and the different bits that I understand are a bit messy, but even after consulting a few references, I cannot really get how deep Q-learning with a neural network works. I understood the Bellman equation like this $$V^\pi(s)= R(s,\pi(s)) + \gamma \sum_{s'} P(s'|s,\pi(s)) V^\pi(s')$$ and the update rule for the Q-table: $$Q_{n+1}(s_t, a_t)=Q_n(s_t, a_t)+\alpha(r+\gamma\max_{a\in\mathcal{A}}Q(s_{t+1}, a)-Q_n(s_t, a_t))$$ But …
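To check my reading of the update rule, here is a tiny tabular example I wrote myself (the 10 states and 4 actions are arbitrary):

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def q_update(s, a, r, s_next):
    td_target = r + gamma * np.max(Q[s_next])   # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s, a) a step toward the target
```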
I installed a self-driving car project from the SuperDataScience site. When I open the map using the terminal, after a while the map window closes (or it closes directly after I maximize the map window) and it gives me this error:

```
[INFO ] [Base ] Leaving application in progress...
Traceback (most recent call last):
  File "map.py", line 235, in <module>
    CarApp().run()
  File "/usr/lib/python2.7/dist-packages/kivy/app.py", line 826, in run
    runTouchApp()
  File "/usr/lib/python2.7/dist-packages/kivy/base.py", line 502, in runTouchApp
    EventLoop.window.mainloop()
  File "/usr/lib/python2.7/dist-packages/kivy/core/window/window_sdl2.py", line …
```
I have a question related to an alternative Q-learning approach. I'd like to know if this already exists and I am not aware of it, or whether it doesn't exist because there are theoretical problems behind it. Traditional Q-learning: in traditional Q-learning, the update of the Q-value happens at every iteration. The agent is in state s, performs action a, reaches state s' and obtains reward r. The Q-value for that state-action pair is updated according to the Bellman equation. As …
As I understand it, in reinforcement learning, off-policy Monte Carlo control is when the state-action value function $Q(s,a)$ is estimated as a weighted average of the observed returns. However, in Q-learning the value of $Q(s, a)$ is estimated as the maximum expected return. Why is this not used in Monte Carlo control? Suppose I have a simple 2-dimensional bridge game, where the objective is to get from a to b. I can move left, right, up or down. Let's say …
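To be concrete about the two estimates I am contrasting (this is my own notation, with $W_k$ the importance-sampling weights and $G_k$ the observed returns):
$$Q_{\mathrm{MC}}(s,a) \approx \frac{\sum_k W_k G_k}{\sum_k W_k}, \qquad Q(s,a) \leftarrow Q(s,a) + \alpha\left(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right).$$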
I used Q-learning for routing, using the Bellman equation. I have certain other technical aspects in the code that add some novelty, but I have some doubts regarding what an episode is and the corresponding convergence in my case. I am unable to work out what an episode would be. E.g. a service comes, I assign a route to it and do some other stuff. I want service acceptance to be higher in the 'long' run (as more services come, some depart …
I am trying to create a Q-learning algorithm to control traffic light systems. I am representing the state with a matrix: state = [[no. of cars up, no. of cars down], [no. of cars left, no. of cars right]]. But it's stochastic, since after allowing cars to move through one road, there is a probability that cars will enter as well. I wrote the probability as follows: every 4 seconds, the probability that 0 cars enter on one …
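This is roughly how I am simulating the stochastic arrivals; the probabilities below are purely illustrative placeholders, not my real values:

```python
import numpy as np

ARRIVAL_PROBS = [0.4, 0.3, 0.2, 0.1]         # hypothetical P(0, 1, 2, 3 cars arrive)

def add_arrivals(state):
    # state: 2x2 matrix [[up, down], [left, right]] of car counts
    state = np.array(state)
    for i in range(2):
        for j in range(2):
            arrivals = np.random.choice(len(ARRIVAL_PROBS), p=ARRIVAL_PROBS)
            state[i, j] += arrivals          # cars may enter at every 4-second step
    return state
```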
I am working on the DDQN algorithm given in the following paper, and I am facing a problem with the Q-value. The author calculates the Q-value as $Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha)$. The Q-value is divided into two parts: the state value and the action-advantage value. The action-advantage value is independent of state and environment noise, which is a relative action value in each state relative to …
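Here is a small numerical check of how I read the equation, where `v` is the scalar output of the value head and `adv` is the advantage head's output (one entry per action); the numbers are made up:

```python
import numpy as np

v = 1.5                                  # example V(s)
adv = np.array([0.2, -0.1, 0.4])         # example A(s, a) for 3 actions

q = v + adv                              # Q(s, a) = V(s) + A(s, a), as written in the paper
print(q)                                 # [1.7 1.4 1.9]

# Many dueling implementations instead use V(s) + (A(s, a) - mean_a A(s, a))
# so that V and A are identifiable; I am not sure which variant this paper uses.
q_centered = v + (adv - adv.mean())
```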
How is the Q-value estimated from the state value V and the action advantage A? In the DDQN algorithm below, the deep network is divided into two parts at the final layer: a state value function V(s), which represents the reward value of the state, and an action advantage function A(a), which means the extra reward value of choosing an action. DDQN algorithm input: observation information $obs_t = [S_t, A_{t-1}]$, Q-network and its parameters $\theta$, target $\hat{Q}$-network and its parameters $\theta$ …
I'm studying the deep Q-learning algorithm; you can see it in the picture here: DQN. I have a few questions about the deep Q-learning algorithm. What do they mean with row 14: if $D_i = 0$, set $Y_i = \dots$? They want me to take an action $a'$ which maximizes the function Q, which means I have to insert every action $a$ for that state. If I have $a_1$ and $a_2$, I have to insert $a_1$ and then …
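This is my attempt to spell out that step in code, assuming $D_i$ is a terminal/done flag and that `q_net(s)` returns the vector of Q-values for every action in state s (so "inserting every action" is one forward pass followed by a max); it is only how I understand it, not the paper's code:

```python
import numpy as np

def compute_target(q_net, r_i, s_next_i, done_i, gamma=0.99):
    if done_i:                       # terminal transition: no future term
        return r_i
    q_values = q_net(s_next_i)       # e.g. np.array([Q(s', a1), Q(s', a2)])
    return r_i + gamma * np.max(q_values)
```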
I'm trying to solve an RL problem, the contextual bandit problem, using deep Q-learning. My data is all simulated. I have this environment:

```python
import numpy as np

class Environment():
    def __init__(self):
        self._observation = np.zeros((3,))

    def interact(self, action):
        self._observation = np.zeros((3,))
        c1, c2, c3 = np.random.randint(0, 90, 3)
        self._observation[0] = c1
        self._observation[1] = c2
        self._observation[2] = c3
        reward = -1.0
        condition = False
        if (c1 < 30) and (c2 < 30) and (c3 < 30) and action == 0:
            condition = True
        elif (30 <= c1 < 60) and (30 <= c2 < 60) and (30 <= c3 < 60) and action == 1:
            condition = True
        elif (60 <= c1 < 90) and (60 <= c2 < 90) and …
```
I am trying to construct a Q-table. I have a state space and an action space. The state space consists of a large, dynamic number of complex but discrete elements. Theoretically, I understand everything about the Q-table, and I can construct a Q-table if the state and action spaces are integers. But I am unable to implement it when the state and action spaces are complex in nature. Complex here refers to the complexity of the representation of state and action information as opposed to integer …
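This is the workaround I am considering, as a minimal sketch: a dictionary keyed by (state, action) instead of a 2-D array indexed by integers, which only requires the complex state/action descriptions to be hashable. The `encode` helper is hypothetical and would depend on my actual state structure:

```python
from collections import defaultdict

Q = defaultdict(float)                   # unseen (state, action) pairs default to 0.0
alpha, gamma = 0.1, 0.99

def encode(state):
    # turn the complex-but-discrete state description into a hashable key,
    # e.g. a tuple of tuples; this is the only domain-specific part
    return tuple(tuple(x) if isinstance(x, list) else x for x in state)

def q_update(state, action, reward, next_state, next_actions):
    s, s2 = encode(state), encode(next_state)
    best_next = max((Q[(s2, a2)] for a2 in next_actions), default=0.0)
    Q[(s, action)] += alpha * (reward + gamma * best_next - Q[(s, action)])
```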