Reinforcement learning policy gradient derivation

I was reading a document about reinforcement learning policy gradients (http://web.stanford.edu/class/cs234/CS234Win2019/slides/lnotes8.pdf) when I encountered the expression $ \nabla_{\theta} \mathbb{E}_{\pi_{\theta}}[r_{t'}] = \mathbb{E}_{\pi_{\theta}} \left[ r_{t'} \sum_{t = 0}^{t'} \nabla_{\theta} \log \pi_{\theta} (a_t|s_t) \right] $, which appears on page 6 just below (11). The problem is that I have no idea how this expression is derived. The document says that it can be derived the same way as (11), but I do not understand how. Any pointers or hints would be appreciated.
Category: Data Science
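
For the policy gradient question above, one way to see the identity is to apply the likelihood-ratio (log-derivative) trick to the part of the trajectory up to step $t'$. This is only a sketch. It assumes that (11) in the notes is the usual likelihood-ratio form of the policy gradient and that $r_{t'}$ depends only on the states and actions up to time $t'$. The symbols $\tau_{0:t'}$ (trajectory prefix $s_0, a_0, \dots, s_{t'}, a_{t'}$), $\mu$ (initial state distribution) and $P$ (transition dynamics) are notation introduced here and may differ from the notes.

$$
\begin{aligned}
\nabla_{\theta} \mathbb{E}_{\pi_{\theta}}[r_{t'}]
&= \nabla_{\theta} \int P(\tau_{0:t'}; \theta)\, r_{t'}\, d\tau_{0:t'} \\
&= \int P(\tau_{0:t'}; \theta)\, \nabla_{\theta} \log P(\tau_{0:t'}; \theta)\, r_{t'}\, d\tau_{0:t'} \\
&= \mathbb{E}_{\pi_{\theta}} \left[ r_{t'}\, \nabla_{\theta} \log P(\tau_{0:t'}; \theta) \right], \\
\log P(\tau_{0:t'}; \theta)
&= \log \mu(s_0) + \sum_{t = 0}^{t'} \log \pi_{\theta}(a_t|s_t) + \sum_{t = 0}^{t' - 1} \log P(s_{t+1}|s_t, a_t).
\end{aligned}
$$

Only the policy terms depend on $\theta$, so $\nabla_{\theta} \log P(\tau_{0:t'}; \theta) = \sum_{t = 0}^{t'} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)$, and substituting this into the third line gives the expression in the question.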

Q-learning episode and relation to convergence in MY scenario?

I used Q-learning for routing. I have used the Bellman equation. My code has certain other technical aspects that add some novelty. However, I am unsure what counts as an episode in my case and how that relates to convergence. E.g. a service comes, I assign a route to it and do some other stuff. I want the service acceptance to be higher in the 'long' run (as more services come, some depart …
Category: Data Science
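
For the Q-learning routing question above, here is a minimal sketch of a single tabular Q-learning step based on the Bellman equation. It is not the asker's code. Integer-indexed states and actions, the table shape, and the values of `alpha` and `gamma` are all assumptions made for illustration, as is any mapping from network conditions to states and from routes to actions.

```python
import numpy as np

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9, done=False):
    """Apply one tabular Q-learning update in place and return the table."""
    # Bellman target: immediate reward plus the discounted best next value.
    # If the episode ends at this step there is no next value to bootstrap from.
    target = reward if done else reward + gamma * np.max(q[next_state])
    # Move the current estimate a fraction alpha toward the target.
    q[state, action] += alpha * (target - q[state, action])
    return q

# Hypothetical example: 3 states, 2 actions (e.g. candidate routes).
q = np.zeros((3, 2))
q = q_update(q, state=0, action=1, reward=1.0, next_state=2)
```

The `done` flag is the only place where episode boundaries enter the update. In a continuing setting where services keep arriving and departing, it could simply stay `False`.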

How to write a reward function that optimizes for profit and revenue?

I want to write a reward function for a reinforcement learning model that picks products to display to a customer. Each product has a profit margin %. Higher-priced products have a higher profit margin but a lower probability of being purchased. Lower-priced products have a lower profit margin but a higher probability of being purchased. The goal is to maintain an AVERAGE margin of 5% for ALL products sold, while maximizing the total revenue. What's the best way …
Category: Data Science
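
For the profit and revenue question above, one possible shape for such a reward is total revenue minus a penalty when the average margin falls below the target. The sketch below is only an illustration. It assumes the average margin is revenue-weighted (total profit divided by total revenue), that "maintain 5%" means "at least 5%", and that the reward is computed over a batch of completed sales. The function name and the `penalty_weight` value are hypothetical.

```python
def margin_penalized_reward(prices, margins, target_margin=0.05, penalty_weight=10.0):
    """Revenue minus a penalty proportional to the average-margin shortfall.

    prices  -- sale price of each product sold
    margins -- profit margin of each product sold, as a fraction (0.05 == 5%)
    """
    revenue = sum(prices)
    if revenue <= 0:
        return 0.0  # nothing sold, nothing to reward or penalize
    profit = sum(p * m for p, m in zip(prices, margins))
    average_margin = profit / revenue  # assumption: revenue-weighted average
    shortfall = max(0.0, target_margin - average_margin)
    # Scale the penalty by revenue so it stays comparable to the revenue term.
    return revenue - penalty_weight * revenue * shortfall
```

For example, `margin_penalized_reward([100.0], [0.03])` is penalized for the 2-point margin shortfall, while a batch that meets the 5% target is rewarded with its full revenue.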

What is a good reward function when objective is to minimize the average along with the variance?

I am trying to formulate a problem in which we minimize the average resource allocated to different users. Due to some inherent properties of the environment, the allocation can be reduced easily for some users but is difficult to reduce for others, which raises a fairness issue. While the main objective is to minimize the average resource consumed by all the users, I also want to ensure that the allocation is fair, so the variance of the resource allocation …
Category: Data Science
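
For the average-and-variance question above, a common starting point is to penalize both terms in a single scalar reward. The sketch below is one such option under stated assumptions, not a recommended final design. The function name and the trade-off parameter `variance_weight` are hypothetical, and the population variance is used as the fairness measure.

```python
import numpy as np

def mean_variance_reward(allocations, variance_weight=1.0):
    """Negative of (mean allocation + weighted variance across users)."""
    allocations = np.asarray(allocations, dtype=float)
    # Lower average use and a more even spread both increase the reward.
    return -(allocations.mean() + variance_weight * allocations.var())

even = mean_variance_reward([2.0, 2.0, 2.0])    # same mean, no spread
uneven = mean_variance_reward([1.0, 2.0, 3.0])  # same mean, unequal shares
```

With the same mean, the even allocation scores higher than the uneven one. Raising `variance_weight` trades average resource use for fairness.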

Reinforcement Learning End Effector Moving To Camera and Stops Learning

I am working on training a 3-finger jaw gripper. The environment I set up is this:
UR10 3-finger robot
PyBullet for simulation
Stable Baselines and DDPG
Observation space is an RGB image stacked with depth and segmentation mask
Action space is dx, dy, dz added to the current position of the end effector (wrist of robot); alpha, beta, gamma as orientation angles of the end effector; and joint positions of the fingers.
Reward 1: (1 - ((end effector distance from object)/(some max distance)))*10
Reward 2: When …
Category: Data Science
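
Reward 1 from the gripper question above can be written as a small function, as sketched below. The formula is the one quoted in the question. Using Euclidean distance between 3D positions is an assumption, and the function and argument names are hypothetical.

```python
import numpy as np

def distance_reward(effector_pos, object_pos, max_distance):
    """Reward 1: (1 - distance / max_distance) * 10."""
    # Assumption: "distance" is the Euclidean distance between the two positions.
    diff = np.asarray(effector_pos, dtype=float) - np.asarray(object_pos, dtype=float)
    distance = np.linalg.norm(diff)
    return (1.0 - distance / max_distance) * 10.0

at_object = distance_reward([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], max_distance=1.0)
halfway = distance_reward([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], max_distance=2.0)
```

As written, the reward is 10 when the end effector is at the object and goes negative once the distance exceeds `max_distance`, since the quoted formula applies no clipping.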

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.