Multi-task reinforcement learning with different action spaces

I'm currently working on a project in which I need to apply multi-task reinforcement learning. The agents share the same state space, but each one aims at a separate task and has a different action space. At first glance I thought IMPALA would be a good choice, but it requires the actions to be shared in some way, which is not applicable in my case. Can someone please give me an idea if there is an appropriate multi-task reinforcement learning …
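One common way to handle per-task action spaces (my own sketch, not something the question mentions) is a shared state encoder with one policy head per task, each sized to that task's action space; all names and layer sizes below are placeholders.

    import torch
    import torch.nn as nn

    class MultiHeadPolicy(nn.Module):
        """Hypothetical sketch: shared encoder, one policy head per task."""
        def __init__(self, state_dim, action_dims):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
            # One output layer per task, sized to that task's action space.
            self.heads = nn.ModuleList([nn.Linear(128, a_dim) for a_dim in action_dims])

        def forward(self, state, task_id):
            logits = self.heads[task_id](self.encoder(state))
            return torch.distributions.Categorical(logits=logits)

    # Usage: task 0 has 4 actions, task 1 has 7 (arbitrary example sizes).
    policy = MultiHeadPolicy(state_dim=10, action_dims=[4, 7])
    dist = policy(torch.zeros(1, 10), task_id=1)
    action = dist.sample()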
Category: Data Science

Time horizon T in policy gradients (actor-critic)

I am currently going through the Berkeley lectures on Reinforcement Learning. Specifically, I am at slide 5 of this lecture. At the bottom of that slide, the gradient of the expected sum of rewards function is given by $$ \nabla J(\theta) = \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t}) \left(Q(s_{i,t},a_{i,t}) - V(s_{i,t})\right) $$ The Q-value function is defined as $$Q(s_t,a_t) = \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})\vert s_t,a_t]$$ At first glance, this makes sense, because I compare the value of taking the chosen action …
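As a concrete reading of the $\sum_{t'=t}^T$ definition (my own illustration, not from the slides), the Monte-Carlo estimate of that Q-value along one sampled trajectory is simply the reward-to-go from step t to the horizon T:

    import numpy as np

    def reward_to_go(rewards):
        """Monte-Carlo estimate of Q(s_t, a_t): sum of rewards from t to T (no discounting)."""
        q = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running += rewards[t]
            q[t] = running
        return q

    # One sampled trajectory of rewards r(s_t, a_t):
    print(reward_to_go([1.0, 0.0, 2.0]))  # -> [3. 2. 2.]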
Category: Data Science

Reinforcement Learning - PPO: Why do so many implementations calculate the returns using the GAE? (Mathematical reason)

There are so many PPO implementations that use GAE and do the following:

    def compute_gae(next_value, rewards, masks, values, gamma=0.99, tau=0.95):
        values = values + [next_value]
        gae = 0
        returns = []
        for step in reversed(range(len(rewards))):
            delta = rewards[step] + gamma * values[step + 1] * masks[step] - values[step]
            gae = delta + gamma * tau * masks[step] * gae
            returns.insert(0, gae + values[step])
        return returns

    ...
    advantage = returns - values
    ...
    critic_loss = (returns - value).pow(2).mean()

Source: https://github.com/higgsfield/RL-Adventure-2/blob/master/3.ppo.ipynb, I …
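For reference, here is a minimal way to call the compute_gae shown above on dummy data (all numbers below are made up for illustration; masks are 0.0 where an episode ended):

    # Toy rollout of 3 steps.
    rewards = [1.0, 0.0, 1.0]
    masks = [1.0, 1.0, 0.0]
    values = [0.5, 0.4, 0.3]       # V(s_t) predicted by the critic
    next_value = 0.2               # bootstrap value for the state after the last step

    returns = compute_gae(next_value, rewards, masks, values)
    advantages = [ret - v for ret, v in zip(returns, values)]
    print(returns, advantages)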
Category: Data Science

Keras on-policy "Advantage Actor Critic" implementation

I want to understand and implement an on-policy "Advantage Actor-Critic". The Keras RL example is straightforward and simple: it uses the Keras functional API to create an actor-critic and calculates the loss and gradient after each episode (episodic, or off-policy). Because it only calculates the gradient at the end of each episode, it seems to be an off-policy implementation (which takes random actions to try to explore the environment). What I want to do is implement an on-policy Advantage Actor-Critic that calculates and updates the loss and gradient at each step …
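A minimal sketch of the per-step (online) update the question is after, written in PyTorch rather than Keras for brevity; the one-step TD error stands in for the advantage, and every name here (policy, value_fn, the environment transition) is a placeholder, not the example's actual code.

    import torch
    import torch.nn as nn

    def online_ac_step(policy, value_fn, optimizer, state, action, reward,
                       next_state, done, gamma=0.99):
        """One-step advantage actor-critic update, applied after every environment step."""
        v_s = value_fn(state)
        with torch.no_grad():
            v_next = torch.zeros_like(v_s) if done else value_fn(next_state)
            td_target = reward + gamma * v_next
        advantage = (td_target - v_s).detach()

        dist = torch.distributions.Categorical(logits=policy(state))
        actor_loss = -dist.log_prob(action) * advantage      # policy gradient term
        critic_loss = (td_target - v_s).pow(2)                # value regression term

        optimizer.zero_grad()
        (actor_loss + critic_loss).mean().backward()
        optimizer.step()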
Category: Data Science

Soft actor-critic reinforcement learning for 100x100 maze environment

I am doing a project which requires a soft actor-critic reinforcement learning agent to learn how to reach a goal in a 100x100 maze environment like the one below: The state space is discrete and only the agent's current position is passed as the state. For example, the state is (50, 4) in the image. The action space is also discrete and just includes [left, right, up, down, up-left, up-right, down-left, down-right]. The reward function is just 100 for reaching the …
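Since only the raw (x, y) position is passed as the state, one thing worth checking (a sketch of my own, not part of the question) is how that discrete position is encoded before it reaches the networks, e.g. normalized coordinates or a one-hot grid:

    import numpy as np

    GRID = 100  # assumed 100x100 maze

    def encode_normalized(pos):
        """Scale (x, y) into [0, 1] so the network inputs are well-conditioned."""
        x, y = pos
        return np.array([x / (GRID - 1), y / (GRID - 1)], dtype=np.float32)

    def encode_one_hot(pos):
        """One-hot over all 100*100 cells; large but unambiguous for a discrete state."""
        x, y = pos
        vec = np.zeros(GRID * GRID, dtype=np.float32)
        vec[x * GRID + y] = 1.0
        return vec

    print(encode_normalized((50, 4)))   # example state from the question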
Category: Data Science

How to enhance A3C entropy?

I'm trying to implement this A3C code in my custom environment, and I have a basic understanding of the algorithm. The algorithm worked, but it did not give me good performance. I looked into multiple implementations, and each one seemed different to me, like this one, for example. The algorithm I wrote is as follows:

a3c

    class ActorCritics(nn.Module):
        def __init__(self, input, n_actions, env, gamma=0.99):
            super(ActorCritics, self).__init__()
            self.gamma = gamma
            self.env = env
            self.n_actions = n_actions
            self.pi1 = nn.Linear(input, 128)
            self.v1 = nn.Linear(input, 128)
            self.pi2 = nn.Linear(128, 64)
            self.v2 = nn.Linear(128, 64)
            self.pi3 = nn.Linear(64, 32)
            self.v3 = nn.Linear(64, 32)
            self.pi4 = nn.Linear(32, 16)
            self.v4 = nn.Linear(32, 16)
            self.pi5 = nn.Linear(16, 8)
            self.v5 = nn.Linear(16, 8)
            self.pi6 = nn.Linear(8, 4)
            self.v6 = nn.Linear(8, 4)
            self.pi7 = nn.Linear(4, 2) …
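Since the question is about the entropy term specifically, here is a minimal sketch (my own, not the author's code) of how A3C-style implementations typically add an entropy bonus to the policy loss; the `entropy_beta` coefficient is a hypothetical hyper-parameter.

    import torch

    def a3c_policy_loss(logits, actions, advantages, entropy_beta=0.01):
        """Policy-gradient loss with an entropy bonus that encourages exploration."""
        dist = torch.distributions.Categorical(logits=logits)
        log_probs = dist.log_prob(actions)
        pg_loss = -(log_probs * advantages.detach()).mean()
        entropy = dist.entropy().mean()
        # Subtracting beta * entropy means higher policy entropy lowers the loss.
        return pg_loss - entropy_beta * entropy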
Category: Data Science

Actor Network Target Value in A2C Reinforcement Learning

In DQN, we use the equation $Target = r+\gamma v(s')$ to train (fit) our network. It is easy to understand, since we use the $Target$ value as the dependent variable, just like in supervised learning. I.e. in Python we can train the model with model.fit(state, target, verbose=0), where $r$ and $v(s')$ come from model predictions. When it comes to an A2C network, things become more complicated. Now we have two networks, Actor and Critic. …
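A minimal sketch (my own, not from the question) of how the two A2C updates are usually written: the critic is regressed toward $r + \gamma V(s')$ just like the DQN target above, while the actor has no fitted target at all; it maximizes $\log \pi(a|s)$ weighted by the advantage.

    import torch

    def a2c_losses(logits, value, action, reward, next_value, done, gamma=0.99):
        """Critic: MSE toward the bootstrapped target. Actor: advantage-weighted log-prob."""
        td_target = reward + gamma * next_value * (1.0 - done)
        advantage = (td_target - value).detach()

        critic_loss = (td_target.detach() - value).pow(2).mean()

        dist = torch.distributions.Categorical(logits=logits)
        actor_loss = -(dist.log_prob(action) * advantage).mean()
        return actor_loss, critic_loss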
Category: Data Science

Evaluating a trained Reinforcement Learning Agent?

I am new to training reinforcement learning agents. I have read about the PPO algorithm and used the stable-baselines library to train an agent with PPO. My question is: how do I evaluate a trained RL agent? For a regression or classification problem I have metrics like r2_score or accuracy, etc. Are there any such metrics for RL, and how do I test the agent and conclude whether it is trained well or badly? Thanks
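The usual metric is the mean (and spread of) episode return over a number of evaluation episodes with exploration turned off. A minimal sketch with a gym environment and a stable-baselines-style model.predict; the env and model names are placeholders for whatever the question actually uses.

    import numpy as np

    def evaluate(model, env, n_episodes=20):
        """Run the trained policy deterministically and report episode returns."""
        returns = []
        for _ in range(n_episodes):
            obs, done, total = env.reset(), False, 0.0
            while not done:
                action, _ = model.predict(obs, deterministic=True)
                obs, reward, done, _ = env.step(action)
                total += reward
            returns.append(total)
        return np.mean(returns), np.std(returns)

    # mean_ret, std_ret = evaluate(model, env)   # higher and more stable is better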
Category: Data Science

A2C learning very slowly when I try to make it learn on batches as compared to making it learn on each step

I tried this on the OpenAI Gym environment LunarLander-v2. I wrote two algorithms with just one difference:

1. Made it learn at each step.
2. Made it learn at the end of each episode.

There is a significant difference between the performance of the two. The first one started reaching around 200 reward after 3000 episodes, whereas the second is getting only -50 reward after 8000 episodes. By the way, 3000 episodes of the first algorithm took the same run time as 8000 of the second. I …
Category: Data Science

Pytorch XLA to solve the spawn problems in a Colab Env

For reference only, here is my code. It seems that torch.multiprocessing.set_start_method("spawn") can't be used in a Colab environment; only 'fork' is allowed. I have implemented A3C with data parallelism to solve the Breakout Atari game. As I use multiple agents, I need to spawn several processes. This represents a single agent:

    TotalReward = namedtuple("TotalReward", field_names="reward")

    def data_func(net, device, train_queue, batch_size, entropy_beta,
                  env_name, n_envs, gamma, reward_steps, **kwargs):
        env = GymEnvVec(env_name, n_envs)
        agent = Agent(net, batch_size, entropy_beta)
        exp_source = ExperienceSourceFirstLast(env, agent, …
Category: Data Science

Actor Critic Model implementation

I am going to work on a project which requires implementing an A2C model using TensorFlow 2.0. I am new to the Machine Learning field and also to Python. These are the topics I have covered theoretically:

- Different methods of machine learning (supervised, unsupervised, reinforcement)
- Linear and logistic regression
- Required knowledge of statistics and probability
- Neural networks
- Policy gradient
- Gradient descent
- Basics of TensorFlow 2.0 (basic operations, preprocessing of data)

Now I am a bit confused about what structure should …
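Since the question is about structure, here is a minimal sketch (my own, with arbitrary layer sizes) of the usual A2C layout in TensorFlow 2.0: a shared trunk with a softmax policy head and a scalar value head.

    import tensorflow as tf

    def build_a2c(state_dim, n_actions):
        """Shared trunk, one softmax policy head, one scalar value head."""
        inputs = tf.keras.Input(shape=(state_dim,))
        hidden = tf.keras.layers.Dense(128, activation="relu")(inputs)
        hidden = tf.keras.layers.Dense(64, activation="relu")(hidden)
        policy = tf.keras.layers.Dense(n_actions, activation="softmax", name="policy")(hidden)
        value = tf.keras.layers.Dense(1, name="value")(hidden)
        return tf.keras.Model(inputs=inputs, outputs=[policy, value])

    model = build_a2c(state_dim=8, n_actions=4)   # placeholder sizes
    probs, v = model(tf.zeros((1, 8)))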
Category: Data Science

Why can't Policy Gradient Algorithm be seen as an Actor-Critic Method?

When deriving the policy gradient algorithm (e.g., REINFORCE), we are actually using an expectation of the total reward, which we try to maximize. $$\overline{R_\theta}=E_{\tau\sim\pi_\theta}[R(\tau)]$$ Can't it be seen as an Actor-Critic method, since we are using V(s) as a Critic to guide the update of the Actor π? (Here we've already introduced an approximation.) $$\nabla \overline{R_\theta} \approx \frac{1}{N}\sum_{n=1}^N R(\tau^{(n)}) \nabla_\theta \log p_\theta(\tau^{(n)})$$ If not, what is the clear definition of Actor and Critic in an Actor-Critic algorithm?
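For contrast (my own summary, not part of the question): REINFORCE weights each whole trajectory by its sampled return, $$\nabla_\theta \overline{R_\theta} \approx \frac{1}{N}\sum_{n=1}^{N} R(\tau^{(n)})\,\nabla_\theta \log p_\theta(\tau^{(n)}),$$ whereas an actor-critic method weights each step by a bootstrapped estimate from a learned critic $V_w$: $$\nabla_\theta \overline{R_\theta} \approx \frac{1}{N}\sum_{n,t} \nabla_\theta \log \pi_\theta(a^{(n)}_t \mid s^{(n)}_t)\,\big(r^{(n)}_t + \gamma V_w(s^{(n)}_{t+1}) - V_w(s^{(n)}_t)\big).$$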
Category: Data Science

Action selection in an actor-critic algorithm

I have an action space that is just a list of values given by acts = [i for i in range(10, 100, 10)]. According to the PyTorch documentation, the loss is calculated as below. Could someone explain to me how I can modify this procedure to sample actions from my action space?

    m = Categorical(probs)
    action = m.sample()
    next_state, reward = env.step(action)
    loss = -m.log_prob(action) * reward
    loss.backward()
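One common pattern (a sketch, not necessarily the only way) is to keep sampling an index from the Categorical distribution and only translate that index into the environment's action value when calling env.step; log_prob still takes the index. The logits and reward below are placeholders for the policy output and environment feedback.

    import torch
    from torch.distributions import Categorical

    acts = [i for i in range(10, 100, 10)]      # the action values from the question

    logits = torch.randn(len(acts), requires_grad=True)   # stands in for the policy network output
    m = Categorical(logits=logits)
    idx = m.sample()                            # index into acts, used for log_prob
    env_action = acts[idx.item()]               # value actually passed to env.step(...)
    reward = 1.0                                # placeholder reward from the environment
    loss = -m.log_prob(idx) * reward
    loss.backward()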
Category: Data Science

Agent always takes a same action in DQN - Reinforcement Learning

I have trained an RL agent using the DQN algorithm. After 20000 episodes my rewards have converged. Now when I test this agent, it always takes the same action, irrespective of the state. I find this very weird. Can someone help me with this? Is there a reason anyone can think of why the agent is behaving this way?

Reward plot

When I test the agent:

    state = env.reset()
    print('State: ', state)
    state_encod = np.reshape(state, [1, state_size])
    q_values = …
Category: Data Science

Actions taken by agent / agent performance not improving

Hi, I am trying to develop an RL agent using the PPO algorithm. My agent takes an action (CFM) to maintain a state variable called RAT between 24 and 24.5. I am using the PPO algorithm of the stable-baselines library to train my agent, and I have trained it for 2M steps. Hyper-parameters in the code:

    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           net_arch=[dict(pi=[64, 64], vf=[64, 64])],
                                           feature_extraction="mlp")

    model = PPO2(CustomPolicy, env, gamma=0.8, n_steps=132, ent_coef=0.01,
                 learning_rate=1e-3, vf_coef=0.5, max_grad_norm=0.5, lam=0.95,
                 nminibatches=4, noptepochs=4, cliprange=0.2, cliprange_vf=None,
                 verbose=0, tensorboard_log="./20_01_2020_logs/", _init_setup_model=True, …
Category: Data Science

Rewards have converged but with a lot of variation

I am training a reinforcement learning agent on an episodic task of fixed episode length. I am tracking the training process by plotting the cumulative rewards over an episode, using TensorBoard for the plots. I have trained my agent for 20M steps, so I believe the agent has been given enough time to train. The cumulative rewards for an episode can range from +132 to around -60.

My plot, with a smoothing of 0.999

Over the episodes, …
Category: Data Science

Having a reward structure which gives high positive rewards compared to the negative rewards

I am training an RL agent using the PPO algorithm for a control problem. The objective of the agent is to maintain the temperature in a room. It is an episodic task with an episode length of 9 hrs and a step size (an action being taken) of 15 mins. During training, from a given state the agent takes an action. Then I check the temperature of the room after 15 mins (the step size), and if this temperature is within limits, I give the action …
Category: Data Science

Proof that subtracting a baseline doesn't influence the gradient can be used to show no gradient exists at all?

I am using David Silver's course in RL to help me write my thesis. However, I am baffled by the proof given in lecture 7, slide 29: slideshow \begin{align} \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta (s,a)B(s)] &= \sum_{s \in S}d^{\pi_\theta} (s) \sum_a \nabla_\theta \pi_\theta(s,a)B(s)\\ &=\sum_{s \in S} d^{\pi_\theta}(s) B(s) \nabla_\theta\sum_{a \in A} \pi_\theta(s,a)\\ &=0 \end{align} Consider in this proof replacing $B(s)$ with the critic's quality estimate $Q_w(s,a)$ (see previous slide(s)). How does this proof not also show that the gradient of the objective function …
Category: Data Science

Formulation of a reward structure

I am new to reinforcement learning and am experimenting with training RL agents. I have a doubt about reward formulation: from a given state, if the agent takes a good action I give a positive reward, and if the action is bad, I give a negative reward. So if I give the agent very high positive rewards when it takes a good action, say 100 times the magnitude of the negative rewards, will it help the agent during training? …
Category: Data Science

How to handle differences between training and deployment of an RL agent

Hi, I am training an RL agent for a control problem. The objective of the agent is to maintain the temperature in a zone. It is an episodic task with an episode length of 10 hrs and actions being taken every 15 mins. Ambient weather is one of the state variables during training. For the training process, a profile of ambient temperature has been generated for each hour of the day and used for training. I have trained the agent using PPO …
Category: Data Science
