What enables transformers or very deep models to "plan" ahead for sequential decision making?

I was watching this amazing lecture by Oriol Vinyals. On one slide, there is a question asking whether very deep models plan. Transformer models, or models employed in applications like dialogue generation, do not have an explicit planning component, yet they behave as if they already have the dialogue planned. Dr. Vinyals mentioned that there are papers on "how transformers are building up knowledge to answer questions or do all sorts of very interesting analyses". Can anyone please refer me to a few …
Category: Data Science

Computing probabilities in Plackett-Luce model

I am trying to implement a Plackett-Luce model for learning to rank from click data. Specifically, I am following the paper Doubly-Robust Estimation for Correcting Position-Bias in Click Feedback for Unbiased Learning to Rank. The objective function is a reward function similar to the one used in reinforcement learning: $R_d$ is the reward for document $d$, $\pi(k \vert d)$ is the probability of document $d$ being placed at position $k$ for a given query $q$, and $w_k$ is the weight of position …
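As a point of reference (the standard Plackett-Luce factorization, not necessarily the paper's exact formulation): the probability of an ordering is a product of softmaxes over the not-yet-placed documents. A minimal sketch, assuming per-document scores and a ranking given as a list of document indices:

    import numpy as np

    def plackett_luce_log_prob(scores, ranking):
        """Log-probability of `ranking` (doc indices, best first)
        under a Plackett-Luce model with the given per-document scores."""
        scores = np.asarray(scores, dtype=float)
        remaining = list(ranking)
        log_prob = 0.0
        for doc in ranking:
            s = scores[remaining]
            # log-softmax over the documents that have not been placed yet
            log_prob += scores[doc] - s.max() - np.log(np.exp(s - s.max()).sum())
            remaining.remove(doc)
        return log_prob

    # Example: 3 documents, ranking places doc 2 first, then 0, then 1.
    print(plackett_luce_log_prob([0.5, -1.0, 2.0], [2, 0, 1]))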
Category: Data Science

Find highest reward for epsilon-greedy bandit program

I started to learn reinforcement learning, and the first example is a bandit problem handled with the epsilon-greedy method. In this example, three bandit machines are used; the output is the estimated mean value for each bandit machine and the cumulative average reward with respect to the epsilon value. The code:

    import numpy as np

    class Bandit:
        def __init__(self, m):
            self.m = m        # true mean reward of this machine
            self.mean = 0     # running estimate of the mean
            self.N = 0        # number of pulls so far

        def pull(self):
            # reward drawn from a unit-variance Gaussian centred at the true mean
            return np.random.randn() + self.m

        def update(self, x):
            # incremental update of the sample mean
            self.N += 1
            self.mean = (1 - 1.0/self.N)*self.mean + 1.0/self.N*x
…
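The part that is cut off presumably runs the experiment; a minimal epsilon-greedy loop over such Bandit objects (my own sketch, not the tutorial's exact code) could look like:

    def run_experiment(means, eps, n_steps):
        bandits = [Bandit(m) for m in means]
        rewards = np.empty(n_steps)
        for t in range(n_steps):
            if np.random.random() < eps:
                j = np.random.randint(len(bandits))        # explore: random machine
            else:
                j = np.argmax([b.mean for b in bandits])   # exploit: best estimate so far
            x = bandits[j].pull()
            bandits[j].update(x)
            rewards[t] = x
        # cumulative average reward, useful for comparing different epsilon values
        return np.cumsum(rewards) / (np.arange(n_steps) + 1)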
Category: Data Science

Can the input to a neural network be a set of 2D coordinates if I run them through a convolution layer?

I asked this question a few days ago with no response and still don't have an answer, so I will ask again. I am training a reinforcement learning agent on a 2D grid. It is fed its own position and the target positions as x,y coordinates. An example input would be [[1,3],[2,2],[5,1]]. I thought that if I just fed the input through a flatten layer (it would become 1,3,2,2,5,1), there would not be a strong enough association between …
Category: Data Science

Can I use a 1D convolution on a set of coordinates?

So I am training a reinforcement learning agent. It is fed its own position and the target positions as x,y coordinates. An example input would be [[1,3],[2,2],[5,1]]. I thought that if I just fed the input through a flatten layer (it would become 1,3,2,2,5,1), there would not be a strong enough association between each coordinate pair. Because of this, I used a 1D convolution layer with 5 filters and a kernel size and stride of 2, which I hoped …
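For reference, a 1D convolution with kernel size 2 and stride 2 over the flattened sequence 1,3,2,2,5,1 does make each window cover exactly one (x, y) pair. A minimal Keras sketch of that reading of the setup (layer sizes and the output head are my own assumptions, not the asker's exact model):

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Input(shape=(6, 1)),   # flattened coordinates, e.g. 1,3,2,2,5,1, as a length-6 sequence
        layers.Conv1D(filters=5, kernel_size=2, strides=2, activation="relu"),  # each window sees one (x, y) pair
        layers.Flatten(),
        layers.Dense(4),              # e.g. one output per action (assumed)
    ])
    model.summary()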
Category: Data Science

Problem when cherry picking actions - Proximal Policy Optimization

I am using the implementation of PPO2 in stable-baselines (a fork of OpenAI's baselines) for a Reinforcement Learning problem. My observation space is $9 \times 9 \times 191$ and my action space is $144$. Given a state, only some actions are "legal". If an "illegal" action is taken, the environment will return the same state. Think of it as the game of Go, where you try to place a stone on an intersection that is already occupied. When a legal action is taken, it …
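A common workaround in this situation (not something PPO2 provides out of the box, as far as I know) is to mask the logits of illegal actions before sampling, so the policy can never choose them. A minimal sketch of the masking step, assuming a logits vector and a boolean legality mask:

    import numpy as np

    def masked_action_probabilities(logits, legal_mask):
        """Softmax over logits with illegal actions forced to ~zero probability."""
        masked = np.where(legal_mask, logits, -1e9)   # large negative logit for illegal actions
        z = masked - masked.max()
        p = np.exp(z)
        return p / p.sum()

    logits = np.random.randn(144)            # one logit per action
    legal_mask = np.zeros(144, dtype=bool)
    legal_mask[[3, 17, 42]] = True           # suppose only these actions are legal
    probs = masked_action_probabilities(logits, legal_mask)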
Category: Data Science

inverted pendulum REINFORCE

I am learning reinforcement learning, and as practice I am trying to stabilize an inverted pendulum (gym: Pendulum-v0) in an upright position using a policy gradient method: REINFORCE. I have some questions; please help me, I tried a lot but couldn't understand. An answer to any of the questions would help me. Thanks in advance. 1. Why are the observations in the pendulum code cos(theta), sin(theta) and theta_dot, rather than theta and theta_dot only? 2. The action which I should send to the environment …
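Regarding the first question: representing the angle as (cos(theta), sin(theta)) keeps the observation continuous where theta itself would jump between -pi and pi, and the angle can always be recovered with atan2. A tiny sketch of that recovery (my own illustration, not part of the gym code):

    import numpy as np

    obs = np.array([np.cos(3.1), np.sin(3.1), 0.5])   # cos(theta), sin(theta), theta_dot
    theta = np.arctan2(obs[1], obs[0])                # recovers theta in (-pi, pi]
    print(theta)                                      # ~3.1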
Category: Data Science

Cartpole - Number of layers and neurons - model hyperparameters

Can anyone please suggest how to arrive at optimal values for the number of layers and the number of neurons in the deep learning model of the DDQN algorithm for the CartPole problem? As the input and output sizes are 4 and 2 respectively for CartPole, is there any scientific reasoning or maths behind choosing the number of hidden layers and the number of neurons in them? I have followed this link to build the reinforcement learning algorithm: https://pylessons.com/CartPole-reinforcement-learning/
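For scale, CartPole is usually solvable with a very small network; a common starting point (a heuristic, not a scientifically derived optimum) is one or two hidden layers of a few dozen units, then tune from there. A minimal Keras sketch under that assumption:

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Input(shape=(4,)),             # cart position, cart velocity, pole angle, pole angular velocity
        layers.Dense(64, activation="relu"),  # hidden sizes are a starting guess, not a derived optimum
        layers.Dense(64, activation="relu"),
        layers.Dense(2, activation="linear"), # one Q-value per action (left, right)
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mse")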
Category: Data Science

Reinforcement Learning: why does acting greedily with the optimal value function give you the optimal policy?

David Silver's course about Reinforcement Learning explains how you get the optimal policy from the optimal value function. It seems to be very simple: you just have to act greedily, maximizing the value function at each step. In the case of a small grid world, once you have applied the Policy Evaluation algorithm, you get, for example, the following matrix for the value function. You start from the upper-left corner and the only actions are the …
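For the question itself, the greedy policy picks, in each state, the action maximizing a one-step look-ahead on the optimal value function; in the notation used in the Silver lectures, $\pi_*(s) = \arg\max_{a} \big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \, v_*(s') \big) = \arg\max_{a} q_*(s,a)$, so acting greedily with respect to $v_*$ is the same as acting greedily with respect to $q_*$.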
Category: Data Science

Reinforcement learning: negative reward (punishment) for illegal actions?

If you train an agent using reinforcement learning (with a Q-function in this case), should you give a negative reward (punishment) if the agent proposes illegal actions for the presented state? I guess that over time, if you only select from among the legal actions, the illegal ones would eventually drop out, but would punishing them cause them to drop out sooner and possibly cause the agent to explore more possible legal actions sooner? To expand on this further, say you're training …
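One way to make the trade-off concrete: when an illegal action is penalized, its Q-value is pushed down by an ordinary update with a negative reward, so the greedy policy stops proposing it sooner than if it were merely never selected. A minimal tabular sketch, with the penalty magnitude being an arbitrary assumption:

    import numpy as np

    n_states, n_actions = 10, 4
    Q = np.zeros((n_states, n_actions))
    alpha, gamma = 0.1, 0.99
    ILLEGAL_PENALTY = -1.0        # assumed value; how harsh to be is part of the question

    def update(s, a, r, s_next, illegal):
        # an illegal action leaves the state unchanged and earns the penalty instead of the env reward
        reward = ILLEGAL_PENALTY if illegal else r
        target = reward + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])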
Category: Data Science

Policy gradient/REINFORCE algorithm with RNN: why does this converge with SGD but not Adam?

I am working on training an RNN model for caption generation with the REINFORCE algorithm. I adopt the self-critic strategy (see the paper Self-critical Sequence Training for Image Captioning) to reduce the variance. I initialize the model with a pre-trained RNN model (a.k.a. warm start). This pre-trained model (trained with a log-likelihood objective) got a 0.6 F1 score on my task. When I use the Adam optimizer to train this policy gradient objective, the performance of my model drops to 0 after a few epochs. However, if …
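For context, the self-critical baseline from that paper is the reward of the model's own greedy decode, so the policy-gradient loss for a sampled caption looks roughly like the sketch below (a PyTorch-style sketch of the objective, not the asker's code):

    import torch

    def self_critical_loss(sampled_log_probs, sampled_reward, greedy_reward):
        """REINFORCE with a self-critical baseline (the greedy decode's reward).
        sampled_log_probs: (seq_len,) log-probs of the sampled tokens."""
        advantage = sampled_reward - greedy_reward       # positive if sampling beat greedy decoding
        return -(advantage * sampled_log_probs.sum())    # minimize negative expected reward

    # usage sketch: loss = self_critical_loss(log_probs, score_of_sample, score_of_greedy); loss.backward()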
Category: Data Science

How can I train a model to modify a vector by rewarding the model based on the modified vector's nearest neighbors?

I am experimenting with a document retrieval system in which I have documents represented as vectors. When queries come in, they are turned into vectors by the same method used for the documents. The query vector's k nearest neighbors are retrieved as the results. Each query has a known answer string. In order to improve performance, I am now looking to create a model that modifies the query vector. What I was looking to do was use a model …
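One way to phrase such a reward (a sketch of the general idea only; the brute-force distance computation and substring check stand in for whatever index and answer-matching is actually used):

    import numpy as np

    def reward(modified_query_vec, doc_vectors, doc_texts, answer, k=10):
        """Reward the modification by whether the known answer appears among
        the k nearest neighbours of the modified query vector."""
        dists = np.linalg.norm(doc_vectors - modified_query_vec, axis=1)
        top_k = np.argsort(dists)[:k]
        # +1 if any retrieved document contains the answer string, else 0
        return float(any(answer in doc_texts[i] for i in top_k))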
Category: Data Science

DQN fails to find optimal policy

Based on a DeepMind publication, I've recreated the environment and I am trying to make the DQN find and converge to an optimal policy. The task of the agent is to learn how to sustainably collect apples (objects), with the regrowth of the apples depending on their spatial configuration (the more apples around, the higher the regrowth). So in short: the agent has to find out how to collect as many apples as it can (for collecting an apple it gets a …
Category: Data Science

experience replay memory: is saving the next state required when the state does not depend on the action?

So, I am using an agent with a state-action policy and I am trying to understand the concept of experience replay memory (ERM). As far as I have learned so far, the ERM is basically a buffer that stores sets of experiences: $e_t = (s_t, a_t, r_{t+1}, s_{t+1})$, where $s$ is the state, $a$ the action and $r$ the reward, as usual. Basically, in order to use a network that learns to predict the correct action from such experiences, the network's input should …
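For reference, a replay memory in its simplest form is just a fixed-size buffer of such tuples that can be sampled uniformly; a minimal sketch (not tied to any particular library):

    import random
    from collections import deque

    class ReplayMemory:
        def __init__(self, capacity):
            self.buffer = deque(maxlen=capacity)   # old experiences are dropped automatically

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            # uniform random minibatch, which breaks correlation between consecutive steps
            return random.sample(self.buffer, batch_size)

        def __len__(self):
            return len(self.buffer)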
Category: Data Science

Training a model that has both 2D and 1D features using a CNN

I'm looking to pre-train a model for an RL agent, but I'm having some trouble figuring some stuff out. Dataset: MineRL MineRLNavigateDense-v0. The observation space includes: a 2D screen input (64, 64) with 3 color channels, a 1D (scalar) compass angle, and a 1D (scalar) number of dirt blocks, all over time. I am also given the reward based on the action the human took. When training a model using a CNN for time series classification, my understanding is that …
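A common way to combine the two kinds of inputs (a sketch of the general pattern, not specific to MineRL's API; the layer sizes and the 8-way output are assumptions) is to run the image through a small CNN and concatenate the resulting features with the scalar inputs before the dense head:

    from tensorflow import keras
    from tensorflow.keras import layers

    image_in = keras.Input(shape=(64, 64, 3))          # screen frames
    scalars_in = keras.Input(shape=(2,))               # compass angle, dirt-block count

    x = layers.Conv2D(32, 3, strides=2, activation="relu")(image_in)
    x = layers.Conv2D(64, 3, strides=2, activation="relu")(x)
    x = layers.Flatten()(x)

    merged = layers.Concatenate()([x, scalars_in])     # join image features with the scalar features
    out = layers.Dense(128, activation="relu")(merged)
    out = layers.Dense(8)(out)                         # e.g. one logit per discrete action (assumed size)

    model = keras.Model(inputs=[image_in, scalars_in], outputs=out)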
Category: Data Science

Why DQN but no Deep Sarsa?

Why is DQN frequently used while there is hardly any occurrence of Deep Sarsa? I found this paper, https://arxiv.org/pdf/1702.03118.pdf, using it, but nothing else that might be relevant. I assume the cause could be the Ape-X architecture, which came out the year after the Deep Sarsa paper and made it possible to generate an immense amount of experience for off-policy algorithms. Does that make sense, or is there any other reason?
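For readers comparing the two, the core difference is the bootstrapping target, which is also why DQN can learn off-policy from a replay buffer while Sarsa is on-policy: Q-learning/DQN uses $r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')$, whereas Sarsa uses $r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})$ with $a_{t+1}$ the action actually taken by the current policy.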
Category: Data Science

Custom Simulator for Deep Reinforcement Learning

I am trying to develop a control method for a specific process in industry. I have a time series of data for the process and want to develop a prediction model based on an attention mechanism to estimate the output of the system. After developing the prediction model, I want to design a controller based on deep reinforcement learning to learn policies for process optimization. But I need a simulated environment to train and test my DRL algorithm on. How …
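One common pattern is to wrap the learned prediction model in a Gym-style environment, so any standard DRL library can interact with it. A minimal skeleton (names like `predictor`, the space sizes, and the reward are placeholders, not part of the question):

    import gym
    import numpy as np
    from gym import spaces

    class ProcessEnv(gym.Env):
        """Simulated environment driven by a learned dynamics/prediction model."""
        def __init__(self, predictor):
            self.predictor = predictor                       # e.g. the attention-based model
            self.observation_space = spaces.Box(-np.inf, np.inf, shape=(10,), dtype=np.float32)
            self.action_space = spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)
            self.state = None

        def reset(self):
            self.state = np.zeros(10, dtype=np.float32)      # or a sampled historical state
            return self.state

        def step(self, action):
            self.state = self.predictor(self.state, action)  # model predicts the next state
            reward = -np.abs(self.state[0])                  # placeholder objective
            done = False
            return self.state, reward, done, {}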
Category: Data Science

Reward function to avoid illegal actions, minimize legal actions and learn to win - Reinforcement Learning

I'm currently implementing PPO for a game with the following characteristics: observation space 9x9x(>150); action space 144; in a given state, only a handful of actions (~1-10) are legal; the state at time-step t can vary a lot from the state at t+1; the environment is episodic (~25 steps, depending on the level) and ends with a win or a loss. In some levels, a random policy (if only legal actions are made) can result in a win; in some levels strategy is …
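As an illustration of one possible shaping scheme (the specific magnitudes are arbitrary assumptions, and tuning them is exactly the hard part of the question): a small penalty per step to encourage short episodes, a larger penalty for attempting an illegal action, and a terminal bonus or penalty for the outcome.

    def shaped_reward(illegal_action, won, lost):
        """One candidate reward scheme; the magnitudes are assumptions, not recommendations."""
        r = -0.01                 # small per-step cost: prefer shorter episodes
        if illegal_action:
            r += -0.1             # discourage proposing illegal actions
        if won:
            r += 1.0              # terminal bonus for winning
        elif lost:
            r += -1.0             # terminal penalty for losing
        return r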
Category: Data Science
