Dimensionality of the target for DQN agent training

From what I understand, a DQN agent has as many outputs as there are actions (one Q-value per action for each state). If we consider a scalar state with 4 actions, the DQN would have a 4-dimensional output. However, the target value for training the agent is usually described as a scalar: reward + discount * best_future_Q. How can a scalar value be used to train a neural network with a vector output? For …
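
One common way to reconcile the scalar target with the vector output is to compute the loss only on the output corresponding to the action that was actually taken; the other outputs simply receive no gradient from that transition. A minimal numpy-style sketch under that assumption (the q_network itself and the optimizer are not shown):

    import numpy as np

    gamma = 0.99  # discount factor

    def td_target(reward, next_q_values, done):
        # Scalar target for the one action that was actually taken.
        return reward if done else reward + gamma * np.max(next_q_values)

    def td_loss(q_values, action, target):
        # q_values: the 4-dimensional network output for the current state.
        # Only the entry for the chosen action enters the loss.
        return 0.5 * (q_values[action] - target) ** 2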
Category: Data Science

Reinforcement learning: negative reward (punish) illegal actions?

If you train an agent using reinforcement learning (with a Q-function in this case), should you give a negative reward (punishment) when the agent proposes an illegal action for the presented state? I guess that over time, if you only select from among the legal actions, the illegal ones would eventually drop out, but would punishing them cause them to drop out sooner, and possibly lead the agent to explore more of the legal actions earlier? To expand on this further; say you're training …
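
A common alternative to punishing illegal actions is to mask them out when selecting the greedy action, so the agent never proposes them in the first place. A rough sketch of such a mask, assuming you can supply the list of legal action indices for each state:

    import numpy as np

    def greedy_legal_action(q_values, legal_actions):
        # q_values: 1-D array with one Q-value per action in this state.
        # legal_actions: indices of the actions allowed in this state.
        masked = np.full_like(q_values, -np.inf, dtype=float)
        masked[legal_actions] = q_values[legal_actions]
        return int(np.argmax(masked))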
Category: Data Science

DQN fails to find optimal policy

Based on a DeepMind publication, I've recreated the environment and I am trying to make the DQN find and converge to an optimal policy. The task of the agent is to learn how to sustainably collect apples (objects), with the regrowth of the apples depending on their spatial configuration (the more apples around, the higher the regrowth). So in short: the agent has to figure out how to collect as many apples as it can (for collecting an apple it gets a …
Category: Data Science

Why DQN but no Deep Sarsa?

Why is DQN used so frequently while there is hardly any occurrence of Deep Sarsa? I found this paper https://arxiv.org/pdf/1702.03118.pdf using it, but nothing else that seems relevant. I assume the cause could be the Ape-X architecture, which came up the year after the Deep Sarsa paper and made it possible to generate an immense amount of experience for off-policy algorithms. Does that make sense, or is there any other reason?
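
The practical difference between the two shows up in how the bootstrap target is built from a transition, which is also why off-policy replay suits DQN better. A toy sketch, assuming next_q_values is the vector of action values for the next state and next_action is whatever the behaviour policy actually chose:

    import numpy as np

    gamma = 0.99

    def q_learning_target(reward, next_q_values):
        # Off-policy: bootstrap from the best action, regardless of what was played.
        return reward + gamma * np.max(next_q_values)

    def sarsa_target(reward, next_q_values, next_action):
        # On-policy: bootstrap from the action the behaviour policy actually took,
        # which sits awkwardly with large replay buffers full of old-policy data.
        return reward + gamma * next_q_values[next_action]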
Category: Data Science

Deep Q-learning, how to set q-value of non-selected actions?

I am learning deep Q-learning by applying it to a real-world problem. I have been through some tutorials and papers available online, but I couldn't figure out the solution to the following problem statement. Let's say we have $N$ possible actions in each state to select from. When in state $s$ we make a move by selecting an action $a_i, i=1\dots N$; as a result we get a reward $r$ and end up in a new state $s^\prime$. In …
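
One widely used workaround when the framework expects a full-length target vector is to copy the network's current predictions for the non-selected actions (so their error, and hence gradient, is zero) and overwrite only the entry of the chosen action. A sketch under the assumption that q_pred and next_q_pred are numpy arrays of per-action values:

    import numpy as np

    def build_target_vector(q_pred, action, reward, next_q_pred, gamma=0.99, done=False):
        # q_pred: current predictions Q(s, .), one value per action.
        # next_q_pred: predictions Q(s', .), typically from the target network.
        target = q_pred.copy()               # non-selected actions keep their prediction
        bootstrap = 0.0 if done else gamma * np.max(next_q_pred)
        target[action] = reward + bootstrap  # only the chosen action is corrected
        return target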
Category: Data Science

How to Form the Training Examples for Deep Q Network in Reinforcement Learning?

I am trying to pick up the basics of reinforcement learning by self-study from some blogs and texts. Forgive me if the question is too basic and the different bits that I understand are a bit messy, but even after consulting a few references, I cannot really get how deep Q-learning with a neural network works. I understood the Bellman equation like this $$V^\pi(s)= R(s,\pi(s)) + \gamma \sum_{s'} P(s'|s,\pi(s)) V^\pi(s')$$ and the update rule for the Q-table, $$Q_{n+1}(s_t, a_t)=Q_n(s_t, a_t)+\alpha(r+\gamma\max_{a\in\mathcal{A}}Q(s_{t+1}, a)-Q_n(s_t, a_t)).$$ But …
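
In practice the training examples are just the transitions $(s, a, r, s')$ collected while acting: a mini-batch is sampled from a replay buffer and turned into network inputs and bootstrapped targets. A simplified sketch, with the Q-network and target network left as assumptions:

    import random
    from collections import deque

    replay_buffer = deque(maxlen=10000)

    def store(s, a, r, s_next, done):
        # One training example per environment step.
        replay_buffer.append((s, a, r, s_next, done))

    def sample_batch(batch_size=32):
        # Each sampled tuple becomes: input = s,
        # target = r + gamma * max_a' Q_target(s', a')  (or just r if the episode ended).
        return random.sample(replay_buffer, batch_size)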
Category: Data Science

IndexError: index 804 is out of bounds for axis 0 with size 800

I installed a self-driving car project from the SuperDataScience site. When I open the map using the terminal, the map window closes after a while (or it closes directly after I maximize it) and gives me this error:

    [INFO ] [Base ] Leaving application in progress...
    Traceback (most recent call last):
      File "map.py", line 235, in <module>
        CarApp().run()
      File "/usr/lib/python2.7/dist-packages/kivy/app.py", line 826, in run
        runTouchApp()
      File "/usr/lib/python2.7/dist-packages/kivy/base.py", line 502, in runTouchApp
        EventLoop.window.mainloop()
      File "/usr/lib/python2.7/dist-packages/kivy/core/window/window_sdl2.py", line …
Category: Data Science

Alternative approach for Q-Learning

I have a question about an alternative Q-learning approach. I'd like to know whether it already exists and I am simply not aware of it, or whether it doesn't exist because there are theoretical problems with it. Traditional Q-learning: in traditional Q-learning, the Q-value is updated at every iteration. The agent is in state s, performs action a, reaches state s' and obtains reward r. The Q-value for that state–action pair is then updated according to the Bellman equation. As …
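
For reference, the per-step tabular update described above looks roughly like this, assuming Q is a table indexed first by state and then by action (for example a 2-D array or dict of lists), with the usual step size alpha and discount gamma:

    def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        # One Bellman backup applied immediately after every transition.
        td_target = r + gamma * max(Q[s_next])
        Q[s][a] += alpha * (td_target - Q[s][a])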
Category: Data Science

Why not use max(returns) instead of average(returns) in off-policy Monte Carlo control?

As I understand it, in reinforcement learning, off-policy Monte Carlo control estimates the state-action value function $Q(s,a)$ as a weighted average of the observed returns. In Q-learning, however, $Q(s, a)$ is estimated using the maximum expected return. Why is the maximum not used in Monte Carlo control? Suppose I have a simple 2-dimensional bridge game, where the objective is to get from a to b. I can move left, right, up or down. Let's say …
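
For comparison, the standard off-policy MC control update keeps a weighted running average of the sampled returns rather than their maximum; averaging gives a consistent estimate of the expected return, whereas taking the max over noisy sampled returns is biased upward. A sketch of the incremental update from Sutton and Barto's formulation, where G is the return, W is the episode's importance-sampling weight, and Q and C are tables indexed by state and action:

    def mc_update(Q, C, s, a, G, W):
        # C[s][a] accumulates the total weight seen for this state-action pair.
        C[s][a] += W
        # Move Q towards the observed return G by a step proportional to W.
        Q[s][a] += (W / C[s][a]) * (G - Q[s][a])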
Category: Data Science

Q-learning episode and relation to convergence in MY scenario?

I used Q-learning for routing, with the Bellman equation for the updates. I have certain other technical aspects in the code that add some novelty. But I have doubts about what an episode would be in my case, and about the corresponding convergence. I am unable to work out what an episode is. E.g. a service comes, I assign a route to it and do some other stuff. I want service acceptance to be higher in the long run (as more services come, some depart …
Category: Data Science

How to create transition probability (state) for q-learning algorithm designed to control traffic light system using python?

I am trying to create a Q-learning algorithm to control traffic light systems. I am representing the state with a matrix: state = [[no. of cars up, no. of cars down], [no. of cars left, no. of cars right]]. But the environment is stochastic, since after allowing cars to move through one road there is a probability that new cars enter as well. I wrote the probabilities as follows: every 4 seconds, the probability that 0 cars enter on one …
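
One way to encode that kind of stochastic arrival process is to sample the number of arriving cars from an explicit probability table at every 4-second step. The probabilities below are placeholders for illustration only, not the ones from the question:

    import numpy as np

    # P(k cars arrive on a given road in one 4-second step); values are illustrative.
    arrival_probs = {0: 0.5, 1: 0.3, 2: 0.2}

    def sample_arrivals():
        ks = list(arrival_probs.keys())
        ps = list(arrival_probs.values())
        return np.random.choice(ks, p=ps)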
Category: Data Science

Deep Q-learning

I am working on the DDQN algorithm given in the following paper, and I am facing a problem with the Q-value. The author calculates the Q-value as $Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha)$, so the Q-value is divided into two parts: the state value and the action-advantage value. The action-advantage value is independent of the state value and environment noise; it is a relative action-value in each state relative to …
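
In implementations this decomposition is typically realised as two output heads recombined inside the forward pass, with the mean advantage subtracted so that V and A are identifiable. A minimal numpy sketch of just the combination step, assuming the feature extractor and the two heads already exist:

    import numpy as np

    def combine_dueling(v, a):
        # v: scalar state value V(s; theta, beta)
        # a: numpy vector of advantages A(s, .; theta, alpha), one entry per action
        # Subtracting the mean advantage is the identifiability trick from the
        # dueling-network paper; without it V and A are not uniquely determined.
        return v + (a - a.mean())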
Category: Data Science

Q value is estimated under state V value and action A value for DDQN

How is the Q-value estimated from the state value V and the action-advantage value A? In the DDQN algorithm below, the deep network is split into two parts at the final layer: the state value function V(s), which represents the reward value of the state, and the action-advantage function A(a), which is the extra reward value of choosing a particular action. DDQN algorithm. Input: observation information $obs_t = [S_t, A_{t-1}]$, Q-network and its parameters $\theta$, target $\hat{Q}^-$ network and its parameters $\theta$ …
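
For reference, the dueling architecture (Wang et al.) usually combines the two streams with the mean-centred form below, which makes the split between V and A identifiable; this is the standard formulation from that paper rather than something specific to the algorithm in the question:

$$Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + \Big(A(s,a;\theta,\alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a';\theta,\alpha)\Big)$$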
Category: Data Science

Understanding DQN Algorithm

I'm studying the deep Q-learning algorithm; you can see it in the picture here: DQN. I have a few questions about the deep Q-learning algorithm. What do they mean by row 14: if $D_i = 0$, set $Y_i = \dots$? They want me to take an action $a'$ which maximizes the function Q, which means I have to insert every action $a$ in that state. If I have $a_1$ and $a_2$, I have to insert $a_1$ and then …
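
That row is the terminal-state case of the target: the done flag decides whether the target bootstraps from the best next-state Q-value or is just the reward. A hedged sketch of that one line (the exact 0/1 convention depends on the pseudocode in the picture):

    import numpy as np

    def dqn_target(reward, next_q_values, episode_done, gamma=0.99):
        # next_q_values: Q_target(s', a') for every action a'; taking the max means
        # evaluating (or reading off) the output for each action, as you describe.
        if episode_done:
            return reward
        return reward + gamma * np.max(next_q_values)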
Category: Data Science

Simple Q-learning neural network using numpy

import numpy as np
from numpy import exp, array, random, dot

# Reward matrix: for a correct action the reward is 1, for a wrong action it is -1.
R = np.matrix([[-1, -1, -1, -1,  1, -1],
               [-1, -1, -1,  1, -1,  1],
               [-1, -1, -1,  1, -1, -1],
               [-1,  1,  1, -1,  1, -1],
               [-1,  1,  1, -1, -1,  1],
               [-1,  1, -1, -1,  1,  1]])

Q = np.matrix(np.zeros([6, 6]))  # Q matrix
gamma = 0.99                     # discount factor
lr = 0.1                         # …
Category: Data Science

Am I using this neural network in a wrong way?

I'm trying to solve an RL problem, the contextual bandit problem, using deep Q-learning. My data is all simulated. I have this environment:

import numpy as np

class Environment():
    def __init__(self):
        self._observation = np.zeros((3,))

    def interact(self, action):
        self._observation = np.zeros((3,))
        c1, c2, c3 = np.random.randint(0, 90, 3)
        self._observation[0] = c1
        self._observation[1] = c2
        self._observation[2] = c3
        reward = -1.0
        condition = False
        if (c1 < 30) and (c2 < 30) and (c3 < 30) and action == 0:
            condition = True
        elif (30 <= c1 < 60) and (30 <= c2 < 60) and (30 <= c3 < 60) and action == 1:
            condition = True
        elif (60 <= c1 < 90) and (60 <= c2 < 90) and …
Category: Data Science

How to construct Q-table for complex, large and dynamic spaces in python?

I am trying to construct a Q-table. I have a state space and an action space. The state space consists of a large, dynamic number of complex but discrete elements. Theoretically, I understand everything about Q-tables, and I can construct one if the state and action spaces are integers. But I am unable to implement it when the state and action spaces are complex in nature. Complex here refers to the complexity of representing the state and action information, as opposed to integer …
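
When the states are structured but still discrete, one common trick is to key the Q-table with a hashable canonical form of the state (for example nested tuples) instead of an integer index. A minimal sketch along those lines, with the encode function a placeholder that depends on your actual state representation:

    from collections import defaultdict

    def encode(state):
        # Turn the structured state into something hashable.
        # Placeholder: assumes the state is a nested list of numbers.
        return tuple(tuple(row) for row in state)

    # Q[(encoded_state, action)] -> value, created lazily with a default of 0.0.
    Q = defaultdict(float)

    def q_value(state, action):
        return Q[(encode(state), action)]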
Category: Data Science
