I am trying to solve a dynamic programming toy example. Here is the prompt: imagine you arrive in a new city for $N$ days, and every night you need to pick a restaurant to get dinner at. The qualities of the restaurants are i.i.d. according to a distribution $F$ (assume support $[0,1]$). The goal is to maximize the sum of the qualities of the restaurants you get dinner at over the $N$ days. Every day you need to choose whether you go …
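Since the prompt is cut off above, here is the recursion I have in mind, under my reading that each night the choice is between returning to the best restaurant found so far (quality $x$, a state variable I introduce) and trying a new, unvisited one with quality $Q \sim F$; with $k$ days remaining:
$$ V_0(x) = 0, \qquad V_k(x) = \max\Big\{\, x + V_{k-1}(x)\;,\;\; \mathbb{E}_{Q\sim F}\big[\,Q + V_{k-1}(\max(x, Q))\,\big] \Big\}. $$
Since the exploit branch never changes the state, it collapses to $k\,x$.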
I'm trying to code a CAM, or more simply a dictionary storing a pointer to the data, accessible by a key. I am trying to do it on a GPU, but all my attempts have been inefficient compared to using System.Collections.Generic.Dictionary. Does anybody know how to implement this with CUDA to obtain better performance than on the CPU?
Currently, I am learning about the Bellman operator in Dynamic Programming and Reinforcement Learning. I would like to know why the Bellman operator is a contraction with respect to the infinity norm. Why not another norm, e.g. the Euclidean norm?
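To make the question concrete, here is the contraction statement I am trying to understand, written with the Bellman optimality operator $T$ and my own notation $r(s,a)$, $p(s'\mid s,a)$ for rewards and transitions:
$$(TV)(s) = \max_{a}\Big[r(s,a) + \gamma \sum_{s'} p(s'\mid s,a)\,V(s')\Big],$$
and the claimed property is
$$\|TV - TU\|_\infty \le \gamma\,\|V - U\|_\infty \quad\text{for all } V, U,$$
which, as I understand it, follows from $|\max_a f(a) - \max_a g(a)| \le \max_a |f(a)-g(a)|$ together with $\sum_{s'} p(s'\mid s,a)\,|V(s')-U(s')| \le \|V-U\|_\infty$. What I do not see is why this argument singles out the infinity norm.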
The TD(0) algorithm is defined as the iterative application of the following update: $$ V(s) \leftarrow V(s) + \alpha\big(r + \gamma V(s') - V(s)\big) $$ Now, if we set $\alpha$ equal to 1, we get the traditional policy evaluation formula from Dynamic Programming. Is that correct?
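Spelling out the special case I mean: with $\alpha = 1$ the update collapses to
$$ V(s) \leftarrow r + \gamma V(s'), $$
whereas the iterative policy evaluation backup in DP averages over all actions and successors, $V(s) \leftarrow \sum_a \pi(a\mid s) \sum_{s',r} p(s',r\mid s,a)\big[r + \gamma V(s')\big]$, so the two only coincide if the single sampled $(r, s')$ stands in for that full expectation. Is that the right way to compare them?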
In some resources, the Bellman equation is shown as below: $$v_{\pi}(s) = \sum\limits_{a}\pi(a|s)\sum\limits_{s',r}p(s',r|s,a)\big[r+\gamma v_{\pi}(s')\big]$$ The thing that confuses me is the $\pi$ and $p$ parts on the right-hand side. Since the probability term $p(s',r|s,a)$ gives the probability of being at the next state $s'$, and since being at the next state $s'$ has to happen via following a specific action, the $p$ term seems to also include the probability of taking that specific action inside it. But then, …
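To pin down the factorization I am trying to reconcile this with: by the chain rule, the joint probability of the triple $(a, s', r)$ given $s$ is
$$\Pr(a, s', r \mid s) = \pi(a\mid s)\,p(s', r \mid s, a),$$
where $p(s',r\mid s,a)$ is conditioned on the action $a$ already having been chosen, so, as I read it, the action probability should appear only in the $\pi(a\mid s)$ factor.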
I am trying to develop an RL agent using the DQN algorithm. During training, the agent interacts with a simulated environment. Each episode takes around 10 minutes to run. At this rate, if I want my agent to train for some 1,000,000 episodes (to achieve convergence), it becomes computationally infeasible. Is anyone aware of a way to speed up my training process, like using parallel threading or CUDA? Or is it something caused by the algorithm itself? My episode here basically is …
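The direction I have been considering is to run several copies of the simulator at once so the slow episodes overlap instead of running serially. Below is a minimal sketch of what I mean, assuming the simulator can be instantiated once per process; `SlowSimEnv` and `run_episode` are hypothetical stand-ins for my actual setup, and the random action would be replaced by the DQN's epsilon-greedy policy, with gradient updates staying in the main process.

```python
# Minimal sketch: collect experience from several simulated environments in
# parallel with multiprocessing, so slow episodes overlap instead of running serially.
# SlowSimEnv is a hypothetical stand-in for the real 10-minute-per-episode simulator.
import random
import multiprocessing as mp

class SlowSimEnv:
    """Hypothetical placeholder for the slow simulator."""
    def reset(self):
        return 0.0                        # initial observation

    def step(self, action):
        obs = random.random()             # next observation
        reward = random.random()          # reward
        done = random.random() < 0.1      # episode ends after ~10 steps on average
        return obs, reward, done

def run_episode(seed):
    """Roll out one full episode and return its transitions."""
    random.seed(seed)
    env = SlowSimEnv()
    obs, done, transitions = env.reset(), False, []
    while not done:
        action = random.randint(0, 1)     # stand-in for the epsilon-greedy policy
        next_obs, reward, done = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        obs = next_obs
    return transitions

if __name__ == "__main__":
    # Run 8 episodes at a time, one per worker process.
    with mp.Pool(processes=8) as pool:
        episodes = pool.map(run_episode, range(8))
    replay = [t for ep in episodes for t in ep]   # flatten into a replay buffer
    print(f"collected {len(replay)} transitions from {len(episodes)} parallel episodes")
```

Would something along these lines help here, or is the bottleneck inherent to DQN itself?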
I am trying to grasp the fundamental mathematics behind Reinforcement Learning, and so far I have understood how the Value Iteration and Policy Iteration algorithms converge (contractions, etc.). I still have some problems understanding the Bellman equation. The value function for a state $s$ under a policy $\pi$ is the expected discounted cumulative reward: $$ V^\pi(s_0=s) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R(s_t,\pi(s_t)) \,\middle|\, s_0=s\right].$$ During the derivation of the Bellman equations, when the expected cumulative rewards are calculated on an infinite horizon, meaning …
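For concreteness, the recursive form I am trying to arrive at from this definition, writing $P(s'\mid s,a)$ for the transition probabilities (my notation) and keeping the policy deterministic as above, is
$$ V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} P\big(s'\mid s, \pi(s)\big)\, V^\pi(s'). $$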