What enables transformers or very deep models to "plan" ahead for sequential decision making?

I was watching this amazing lecture by Oriol Vinyals. On one slide, there is a question asking whether very deep models plan. Transformer models, or models employed in applications like dialogue generation, do not have an explicit planning component, yet they behave as if they already have the dialogue planned. Dr. Vinyals mentioned that there are papers on "how transformers are building up knowledge to answer questions or do all sorts of very interesting analyses". Can anyone please refer me to a few …
Category: Data Science

Computing probabilities in Plackett-Luce model

I am trying to implement a Plackett-Luce model for learning to rank from click data. Specifically, I am following the paper Doubly-Robust Estimation for Correcting Position-Bias in Click Feedback for Unbiased Learning to Rank. The objective function is a reward function similar to the one used in reinforcement learning: $R_d$ is the reward for document $d$, $\pi(k \vert d)$ is the probability of document $d$ being placed at position $k$ for a given query $q$, and $w_k$ is the weight of position …
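As a point of reference (the standard Plackett-Luce factorization, not necessarily the paper's exact formulation): the probability of an ordering is a product of softmaxes over the not-yet-placed documents. A minimal sketch, assuming per-document scores and a ranking given as a list of document indices:

    import numpy as np

    def plackett_luce_log_prob(scores, ranking):
        """Log-probability of `ranking` (doc indices, best first)
        under a Plackett-Luce model with the given per-document scores."""
        scores = np.asarray(scores, dtype=float)
        remaining = list(ranking)
        log_prob = 0.0
        for doc in ranking:
            s = scores[remaining]
            # log-softmax over the documents that have not been placed yet
            log_prob += scores[doc] - s.max() - np.log(np.exp(s - s.max()).sum())
            remaining.remove(doc)
        return log_prob

    # Example: 3 documents, ranking places doc 2 first, then 0, then 1.
    print(plackett_luce_log_prob([0.5, -1.0, 2.0], [2, 0, 1]))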
Category: Data Science

Find highest reward for epsilon-greedy bandit program

I started to learn reinforcement learning, and the first example is a bandit problem handled with the epsilon-greedy method. In this example, three bandit machines are used; the output is the estimated mean value for each bandit machine and the cumulative average reward with respect to the epsilon value. The code:

    import numpy as np

    class Bandit:
        def __init__(self, m):
            self.m = m        # true mean reward of this machine
            self.mean = 0     # running estimate of the mean
            self.N = 0        # number of pulls so far

        def pull(self):
            # reward drawn from a unit-variance Gaussian centred at the true mean
            return np.random.randn() + self.m

        def update(self, x):
            # incremental update of the sample mean
            self.N += 1
            self.mean = (1 - 1.0/self.N)*self.mean + 1.0/self.N*x
…
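The part that is cut off presumably runs the experiment; a minimal epsilon-greedy loop over such Bandit objects (my own sketch, not the tutorial's exact code) could look like:

    def run_experiment(means, eps, n_steps):
        bandits = [Bandit(m) for m in means]
        rewards = np.empty(n_steps)
        for t in range(n_steps):
            if np.random.random() < eps:
                j = np.random.randint(len(bandits))        # explore: random machine
            else:
                j = np.argmax([b.mean for b in bandits])   # exploit: best estimate so far
            x = bandits[j].pull()
            bandits[j].update(x)
            rewards[t] = x
        # cumulative average reward, useful for comparing different epsilon values
        return np.cumsum(rewards) / (np.arange(n_steps) + 1)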
Category: Data Science

Can the input to a neural network be a set of 2D coordinates if I run them through a convolution layer?

I asked this question a few days ago with no response and still don't have an answer, so I will ask again. I am training a reinforcement learning agent on a 2D grid. It is fed its own position and the target positions as x,y coordinates. An example input would be [[1,3],[2,2],[5,1]]. I thought that if I just fed the input through a flatten layer (it would become 1,3,2,2,5,1), there would not be a strong enough association between …
Category: Data Science

Can I use a 1D convolution on a set of coordinates?

So I am training a reinforcement learning agent. It is fed its own position and the target positions as x,y coordinates. An example input would be [[1,3],[2,2],[5,1]]. I thought that if I just fed the input through a flatten layer (it would become 1,3,2,2,5,1), there would not be a strong enough association between each coordinate pair. Because of this, I used a 1D convolution layer with 5 filters and a kernel size and stride of 2, which I hoped …
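For reference, a 1D convolution with kernel size 2 and stride 2 over the flattened sequence 1,3,2,2,5,1 does make each window cover exactly one (x, y) pair. A minimal Keras sketch of that reading of the setup (layer sizes and the output head are my own assumptions, not the asker's exact model):

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Input(shape=(6, 1)),   # flattened coordinates, e.g. 1,3,2,2,5,1, as a length-6 sequence
        layers.Conv1D(filters=5, kernel_size=2, strides=2, activation="relu"),  # each window sees one (x, y) pair
        layers.Flatten(),
        layers.Dense(4),              # e.g. one output per action (assumed)
    ])
    model.summary()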
Category: Data Science

Problem when cherry picking actions - Proximal Policy Optimization

I am using the implementation of PPO2 in stable-baselines (a fork of OpenAI's baselines) for a Reinforcement Learning problem. My observation space is $9 \times 9 \times 191$ and my action space is $144$. Given a state, only some actions are "legal". If an "illegal" action is taken, the environment will return the same state. Think of it as the game of Go, where you try to place a stone on an intersection that is already occupied. When a legal action is taken, it …
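A common workaround in this situation (not something PPO2 provides out of the box, as far as I know) is to mask the logits of illegal actions before sampling, so the policy can never choose them. A minimal sketch of the masking step, assuming a logits vector and a boolean legality mask:

    import numpy as np

    def masked_action_probabilities(logits, legal_mask):
        """Softmax over logits with illegal actions forced to ~zero probability."""
        masked = np.where(legal_mask, logits, -1e9)   # large negative logit for illegal actions
        z = masked - masked.max()
        p = np.exp(z)
        return p / p.sum()

    logits = np.random.randn(144)            # one logit per action
    legal_mask = np.zeros(144, dtype=bool)
    legal_mask[[3, 17, 42]] = True           # suppose only these actions are legal
    probs = masked_action_probabilities(logits, legal_mask)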
Category: Data Science

inverted pendulum REINFORCE

I am learning reinforcement learning, and as practice I am trying to stabilize an inverted pendulum (gym: Pendulum-v0) in an upright position using a policy gradient method: REINFORCE. I have some questions; please help me, I tried a lot but couldn't understand. An answer to any of the questions would help me. Thanks in advance. 1. Why are the observations in the pendulum code cos(theta), sin(theta) and theta_dot, rather than theta and theta_dot only? 2. The action which I should send to the environment …
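Regarding the first question: representing the angle as (cos(theta), sin(theta)) keeps the observation continuous where theta itself would jump between -pi and pi, and the angle can always be recovered with atan2. A tiny sketch of that recovery (my own illustration, not part of the gym code):

    import numpy as np

    obs = np.array([np.cos(3.1), np.sin(3.1), 0.5])   # cos(theta), sin(theta), theta_dot
    theta = np.arctan2(obs[1], obs[0])                # recovers theta in (-pi, pi]
    print(theta)                                      # ~3.1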
Category: Data Science

Cartpole - Number of layers and neurons - model hyperparameters

Can anyone please suggest how to arrive at optimal values for the number of layers and the number of neurons in the deep learning model of the DDQN algorithm for the CartPole problem? As the input and output sizes are 4 and 2 respectively for CartPole, is there any scientific reasoning or maths behind choosing the number of hidden layers and the number of neurons in them? I have followed this link to build the reinforcement learning algorithm: https://pylessons.com/CartPole-reinforcement-learning/
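For scale, CartPole is usually solvable with a very small network; a common starting point (a heuristic, not a scientifically derived optimum) is one or two hidden layers of a few dozen units, then tune from there. A minimal Keras sketch under that assumption:

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Input(shape=(4,)),             # cart position, cart velocity, pole angle, pole angular velocity
        layers.Dense(64, activation="relu"),  # hidden sizes are a starting guess, not a derived optimum
        layers.Dense(64, activation="relu"),
        layers.Dense(2, activation="linear"), # one Q-value per action (left, right)
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mse")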
Category: Data Science

Reinforcement Learning: why does acting greedily with the optimal value function give you the optimal policy?

David Silver's course about Reinforcement Learning explains how you get the optimal policy from the optimal value function. It seems to be very simple: you just have to act greedily, maximizing the value function at each step. In the case of a small grid world, once you have applied the Policy Evaluation algorithm, you get, for example, the following matrix for the value function. You start from the upper-left corner and the only actions are the …
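For the question itself, the greedy policy picks, in each state, the action maximizing a one-step look-ahead on the optimal value function; in the notation used in the Silver lectures, $\pi_*(s) = \arg\max_{a} \big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \, v_*(s') \big) = \arg\max_{a} q_*(s,a)$, so acting greedily with respect to $v_*$ is the same as acting greedily with respect to $q_*$.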
Category: Data Science

Reinforcement learning: negative reward (punishment) for illegal actions?

If you train an agent using reinforcement learning (with a Q-function in this case), should you give a negative reward (punishment) if the agent proposes illegal actions for the presented state? I guess that over time, if you only select from among the legal actions, the illegal ones would eventually drop out, but would punishing them cause them to drop out sooner and possibly cause the agent to explore more possible legal actions sooner? To expand on this further, say you're training …
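One way to make the trade-off concrete: when an illegal action is penalized, its Q-value is pushed down by an ordinary update with a negative reward, so the greedy policy stops proposing it sooner than if it were merely never selected. A minimal tabular sketch, with the penalty magnitude being an arbitrary assumption:

    import numpy as np

    n_states, n_actions = 10, 4
    Q = np.zeros((n_states, n_actions))
    alpha, gamma = 0.1, 0.99
    ILLEGAL_PENALTY = -1.0        # assumed value; how harsh to be is part of the question

    def update(s, a, r, s_next, illegal):
        # an illegal action leaves the state unchanged and earns the penalty instead of the env reward
        reward = ILLEGAL_PENALTY if illegal else r
        target = reward + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])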
Category: Data Science

Policy gradient/REINFORCE algorithm with RNN: why does this converge with SGD but not Adam?

I am working on training an RNN model for caption generation with the REINFORCE algorithm. I adopt the self-critic strategy (see the paper Self-critical Sequence Training for Image Captioning) to reduce the variance. I initialize the model with a pre-trained RNN model (a.k.a. warm start). This pre-trained model (trained with a log-likelihood objective) got a 0.6 F1 score on my task. When I use the Adam optimizer to train this policy gradient objective, the performance of my model drops to 0 after a few epochs. However, if …
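For context, the self-critical baseline from that paper is the reward of the model's own greedy decode, so the policy-gradient loss for a sampled caption looks roughly like the sketch below (a PyTorch-style sketch of the objective, not the asker's code):

    import torch

    def self_critical_loss(sampled_log_probs, sampled_reward, greedy_reward):
        """REINFORCE with a self-critical baseline (the greedy decode's reward).
        sampled_log_probs: (seq_len,) log-probs of the sampled tokens."""
        advantage = sampled_reward - greedy_reward       # positive if sampling beat greedy decoding
        return -(advantage * sampled_log_probs.sum())    # minimize negative expected reward

    # usage sketch: loss = self_critical_loss(log_probs, score_of_sample, score_of_greedy); loss.backward()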
Category: Data Science

How can I train a model to modify a vector by rewarding the model based on the modified vector's nearest neighbors?

I am experimenting with a document retrieval system in which I have documents represented as vectors. When queries come in, they are turned into vectors by the same method used for the documents. The query vector's k nearest neighbors are retrieved as the results. Each query has a known answer string. In order to improve performance, I am now looking to create a model that modifies the query vector. What I was looking to do was use a model …
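One way to phrase such a reward (a sketch of the general idea only; the brute-force distance computation and substring check stand in for whatever index and answer-matching is actually used):

    import numpy as np

    def reward(modified_query_vec, doc_vectors, doc_texts, answer, k=10):
        """Reward the modification by whether the known answer appears among
        the k nearest neighbours of the modified query vector."""
        dists = np.linalg.norm(doc_vectors - modified_query_vec, axis=1)
        top_k = np.argsort(dists)[:k]
        # +1 if any retrieved document contains the answer string, else 0
        return float(any(answer in doc_texts[i] for i in top_k))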
Category: Data Science

DQN fails to find optimal policy

Based on a DeepMind publication, I've recreated the environment and I am trying to make the DQN find and converge to an optimal policy. The task of the agent is to learn how to sustainably collect apples (objects), with the regrowth of the apples depending on their spatial configuration (the more apples around, the higher the regrowth). So in short: the agent has to find out how to collect as many apples as it can (for collecting an apple it gets a …
Category: Data Science

experience replay memory: is saving the next state required when the state does not depend on the action?

So, I am using an agent with a state-action policy and I am trying to understand the concept of experience replay memory (ERM). As far as I have learned so far, the ERM is basically a buffer that stores sets of experiences: $e_t = (s_t, a_t, r_{t+1}, s_{t+1})$, where $s$ is the state, $a$ the action and $r$ the reward, as usual. Basically, in order to use a network that learns to predict the correct action from such experiences, the network's input should …
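For reference, a replay memory in its simplest form is just a fixed-size buffer of such tuples that can be sampled uniformly; a minimal sketch (not tied to any particular library):

    import random
    from collections import deque

    class ReplayMemory:
        def __init__(self, capacity):
            self.buffer = deque(maxlen=capacity)   # old experiences are dropped automatically

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            # uniform random minibatch, which breaks correlation between consecutive steps
            return random.sample(self.buffer, batch_size)

        def __len__(self):
            return len(self.buffer)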
Category: Data Science

Training a model that has both 2D and 1D features using a CNN

I'm looking to pre-train a model for an RL agent, but I'm having some trouble figuring some stuff out. Dataset: MineRL MineRLNavigateDense-v0. The observation space includes: a 2D screen input (64, 64) with 3 color channels, a 1D (scalar) compass angle, and a 1D (scalar) number of dirt blocks, all over time. I am also given the reward based on the action the human took. When training a model using a CNN for time series classification, my understanding is that …
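A common way to combine the two kinds of inputs (a sketch of the general pattern, not specific to MineRL's API; the layer sizes and the 8-way output are assumptions) is to run the image through a small CNN and concatenate the resulting features with the scalar inputs before the dense head:

    from tensorflow import keras
    from tensorflow.keras import layers

    image_in = keras.Input(shape=(64, 64, 3))          # screen frames
    scalars_in = keras.Input(shape=(2,))               # compass angle, dirt-block count

    x = layers.Conv2D(32, 3, strides=2, activation="relu")(image_in)
    x = layers.Conv2D(64, 3, strides=2, activation="relu")(x)
    x = layers.Flatten()(x)

    merged = layers.Concatenate()([x, scalars_in])     # join image features with the scalar features
    out = layers.Dense(128, activation="relu")(merged)
    out = layers.Dense(8)(out)                         # e.g. one logit per discrete action (assumed size)

    model = keras.Model(inputs=[image_in, scalars_in], outputs=out)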
Category: Data Science

Why DQN but no Deep Sarsa?

Why is DQN frequently used while there is hardly any occurrence of Deep Sarsa? I found this paper, https://arxiv.org/pdf/1702.03118.pdf, using it, but nothing else that might be relevant. I assume the cause could be the Ape-X architecture, which came out the year after the Deep Sarsa paper and made it possible to generate an immense amount of experience for off-policy algorithms. Does that make sense, or is there any other reason?
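For readers comparing the two, the core difference is the bootstrapping target, which is also why DQN can learn off-policy from a replay buffer while Sarsa is on-policy: Q-learning/DQN uses $r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')$, whereas Sarsa uses $r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})$ with $a_{t+1}$ the action actually taken by the current policy.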
Category: Data Science

Custom Simulator for Deep Reinforcement Learning

I am trying to develop a control method for a specific process in industry. I have a time series of data for the process and want to develop a prediction model based on an attention mechanism to estimate the output of the system. After developing the prediction model, I want to design a controller based on deep reinforcement learning to learn policies for process optimization. But I need a simulated environment to train and test my DRL algorithm on. How …
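One common pattern is to wrap the learned prediction model in a Gym-style environment, so any standard DRL library can interact with it. A minimal skeleton (names like `predictor`, the space sizes, and the reward are placeholders, not part of the question):

    import gym
    import numpy as np
    from gym import spaces

    class ProcessEnv(gym.Env):
        """Simulated environment driven by a learned dynamics/prediction model."""
        def __init__(self, predictor):
            self.predictor = predictor                       # e.g. the attention-based model
            self.observation_space = spaces.Box(-np.inf, np.inf, shape=(10,), dtype=np.float32)
            self.action_space = spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)
            self.state = None

        def reset(self):
            self.state = np.zeros(10, dtype=np.float32)      # or a sampled historical state
            return self.state

        def step(self, action):
            self.state = self.predictor(self.state, action)  # model predicts the next state
            reward = -np.abs(self.state[0])                  # placeholder objective
            done = False
            return self.state, reward, done, {}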
Category: Data Science

Reward function to avoid illegal actions, minimize legal actions and learn to win - Reinforcement Learning

I'm currently implementing PPO for a game with the following characteristics: observation space 9x9x(>150); action space 144; in a given state, only a handful of actions (~1-10) are legal; the state at time-step t can vary a lot from the state at t+1; the environment is episodic (~25 steps, depending on the level) and ends with a win or a loss. In some levels, a random policy (if only legal actions are made) can result in a win; in some levels strategy is …
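As an illustration of one possible shaping scheme (the specific magnitudes are arbitrary assumptions, and tuning them is exactly the hard part of the question): a small penalty per step to encourage short episodes, a larger penalty for attempting an illegal action, and a terminal bonus or penalty for the outcome.

    def shaped_reward(illegal_action, won, lost):
        """One candidate reward scheme; the magnitudes are assumptions, not recommendations."""
        r = -0.01                 # small per-step cost: prefer shorter episodes
        if illegal_action:
            r += -0.1             # discourage proposing illegal actions
        if won:
            r += 1.0              # terminal bonus for winning
        elif lost:
            r += -1.0             # terminal penalty for losing
        return r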
Category: Data Science
