Structured policies in dynamic programming: solving a toy example

I am trying to solve a dynamic programming toy example. Here is the prompt: imagine you arrive in a new city for $N$ days and every night need to pick a restaurant to get dinner at. The qualities of the restaurants are i.i.d. according to a distribution $F$ (assume it is supported on $[0,1]$). The goal is to maximize the sum of the qualities of the restaurants you get dinner at over the $N$ days. Every day you need to choose whether you go …
Category: Data Science
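
The prompt is cut off, so the exact daily decision is not stated. Assuming the common version of this exercise, where each night you either return to the best restaurant you have found so far (known quality $b$) or try a new one (a fresh draw from $F$), the finite-horizon Bellman recursion with $t$ nights remaining would be $$V_t(b) = \max\Big\{\, b + V_{t-1}(b),\ \mathbb{E}_{X\sim F}\big[X + V_{t-1}(\max(b,X))\big] \Big\}, \qquad V_0(b)=0,$$ and the structure one then looks for in the optimal policy is a threshold rule in $b$: exploit the best known restaurant once $b$ is high enough for the remaining horizon, and explore otherwise.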

Coding a Content Addressable Memory on a GPU

I'm trying to code a CAM, or more simply a dictionary that stores a pointer to the data, accessible by a key. I have tried to do it on a GPU, but all my attempts have been inefficient compared to using System.Collections.Generic.Dictionary. Does anybody know how to implement this with CUDA to obtain better performance than on the CPU?
Category: Data Science
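
Not the asker's code, but a minimal sketch of the usual GPU approach: a fixed-capacity open-addressing hash table in device memory, where one thread handles one key, inserts claim a slot with atomicCAS, and lookups probe linearly. The names, the 32-bit key/value types, and the hash function are illustrative choices. Note that a GPU table only pays off when inserts and lookups arrive in large batches; issuing single lookups from the host will always lose to Dictionary<TKey,TValue> because of kernel-launch and PCIe latency.

```cuda
// Minimal sketch (illustrative): a fixed-capacity open-addressing hash table
// in GPU memory. One thread per key; inserts claim slots with atomicCAS.
#include <cuda_runtime.h>

constexpr unsigned int EMPTY    = 0xFFFFFFFFu;  // sentinel: slot unused / key not found
constexpr unsigned int CAPACITY = 1u << 20;     // table size, must be a power of two

struct Slot {
    unsigned int key;
    unsigned int value;  // e.g. an index/offset into a separate data buffer
};

__device__ unsigned int hash32(unsigned int k) {
    // Murmur-style finalizer; any well-mixing hash works here.
    k ^= k >> 16; k *= 0x85ebca6bu;
    k ^= k >> 13; k *= 0xc2b2ae35u;
    k ^= k >> 16;
    return k & (CAPACITY - 1);
}

// Build phase: insert n (key, value) pairs. The table's key fields must be
// pre-filled with EMPTY (e.g. cudaMemset(table, 0xFF, ...)) before inserting.
__global__ void insertKernel(Slot* table, const unsigned int* keys,
                             const unsigned int* values, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned int slot = hash32(keys[i]);
    for (;;) {
        unsigned int prev = atomicCAS(&table[slot].key, EMPTY, keys[i]);
        if (prev == EMPTY || prev == keys[i]) {
            // Slot claimed (or key already present): store the value.
            // Not synchronized with concurrent readers; fine for a
            // build-then-query workflow.
            table[slot].value = values[i];
            return;
        }
        slot = (slot + 1) & (CAPACITY - 1);  // linear probing
    }
}

// Query phase: look up n keys in one launch; out[i] = value, or EMPTY if absent.
__global__ void lookupKernel(const Slot* table, const unsigned int* keys,
                             unsigned int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned int slot = hash32(keys[i]);
    for (;;) {
        unsigned int k = table[slot].key;
        if (k == keys[i]) { out[i] = table[slot].value; return; }
        if (k == EMPTY)   { out[i] = EMPTY;             return; }
        slot = (slot + 1) & (CAPACITY - 1);
    }
}
```

A production version would also handle the table filling up and control the load factor; ready-made GPU hash maps following this pattern exist (e.g. NVIDIA's cuCollections library).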

Confusion about the Bellman Equation

In some resources, the Bellman equation is shown as below: $v_{\pi}(s) = \sum\limits_{a}\pi(a|s)\sum\limits_{s',r}p(s',r|s,a)\big[r+\gamma v_{\pi}(s')\big]$ The thing that confuses me is the $\pi$ and $p$ parts on the right-hand side. Since the probability part, $p(s',r|s,a)$, is the probability of ending up in the next state $s'$, and since getting to the next state $s'$ happens by following a specific action, it seems to me that the $p$ part already includes the probability of taking that specific action. But then, …
Category: Data Science
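
One way to see why the two factors do not double-count the action probability: in a single step the policy picks $a$ first, and the environment then picks $(s',r)$ conditioned on that action, so the joint probability of everything that happens from state $s$ factorizes as $$\Pr(a,s',r \mid s) = \pi(a\mid s)\,p(s',r\mid s,a),$$ and taking the expectation of $r+\gamma v_{\pi}(s')$ under this joint distribution is exactly the double sum on the right-hand side above.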

Reducing the training time of an RL agent

I am trying to develop an RL agent using the DQN algorithm. During training, the agent interacts with a simulated environment. Each episode takes around 10 minutes to run. So if I want my agent to train for some 1,000,000 episodes (to achieve convergence), it becomes computationally infeasible. Is anyone aware of a way to speed up my training process, like using parallel threading or CUDA? Or is it something caused by the algorithm? My episode here basically is …
Category: Data Science
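
For scale, the numbers in the question already rule out running the simulator serially: $$10^{6}\ \text{episodes} \times 10\ \tfrac{\text{min}}{\text{episode}} = 10^{7}\ \text{min} \approx 1.7\times 10^{5}\ \text{hours} \approx 19\ \text{years},$$ which is why the usual remedies are a faster or parallelized simulator (many environment instances collecting experience at once) or a setup that needs far fewer episodes, rather than speeding up the DQN updates themselves.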

About the time differences in the Bellman equation

I am trying to grasp the fundamental mathematics behind Reinforcement Learning, and so far I have understood how the Value Iteration and Policy Iteration algorithms converge (contractions, etc.). I still have some problems understanding the Bellman equality. The value function of a state $s$ under a policy $\pi$ is the expected discounted cumulative reward: $$ V^\pi(s_0=s) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R(s_t,\pi(s_t)) \,\middle|\, s_0=s\right].$$ During the derivation of the Bellman equations, when the expected cumulative rewards are calculated on an infinite horizon, meaning …
Category: Data Science
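
For reference, peeling the $t=0$ term out of that sum and using the Markov property gives the one-step (Bellman) form the derivation is heading towards, written here for a deterministic policy $\pi$ and a transition kernel $P$ (the symbol $P$ is my notation, not the question's): $$V^\pi(s) = R(s,\pi(s)) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\,V^\pi(s').$$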

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.