I want to understand and implement an on-policy "Advantage Actor-Critic". The Keras RL example is straightforward and simple: it uses the Keras functional API to create an actor-critic and, after each episode, calculates the loss and gradients (episodic, or off-policy). Because it calculates the gradients at the end of each episode, it seems to be an off-policy implementation (which takes random actions to try to explore the environment). What I want to do is implement an on-policy Advantage Actor-Critic that calculates and updates the loss and gradients at each step …
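For reference, a minimal sketch of a per-step (one-step TD) advantage actor-critic update with tf.GradientTape, assuming a CartPole-style environment and the classic Gym API; the layer sizes and variable names are illustrative and this is not the official Keras example:

```python
import numpy as np
import gym
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Shared trunk with separate actor (softmax) and critic (value) heads.
env = gym.make("CartPole-v1")
inputs = layers.Input(shape=(4,))
common = layers.Dense(128, activation="relu")(inputs)
action_probs = layers.Dense(2, activation="softmax")(common)
critic_value = layers.Dense(1)(common)
model = keras.Model(inputs=inputs, outputs=[action_probs, critic_value])
optimizer = keras.optimizers.Adam(learning_rate=1e-3)
gamma = 0.99

state = env.reset()          # classic Gym API: reset() returns the observation
done = False
while not done:
    state_t = tf.convert_to_tensor(state[None, :], dtype=tf.float32)
    with tf.GradientTape() as tape:
        probs, value = model(state_t)
        action = np.random.choice(2, p=np.squeeze(probs))
        next_state, reward, done, _ = env.step(action)
        next_value = 0.0 if done else model(
            tf.convert_to_tensor(next_state[None, :], dtype=tf.float32))[1]
        # One-step TD advantage: r + gamma * V(s') - V(s)
        advantage = reward + gamma * tf.stop_gradient(next_value) - value
        actor_loss = -tf.math.log(probs[0, action]) * tf.stop_gradient(advantage)
        critic_loss = tf.square(advantage)
        loss = actor_loss + critic_loss
    # Apply the gradients at every environment step, not at episode end.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    state = next_state
```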
On this page of Keras's website, a reinforcement learning algorithm based on an actor-critic scheme is described. It is a deep policy gradient algorithm (hence DPG). Of course Keras functions are central in this code, so TensorFlow tries to access an NVIDIA GPU for acceleration; otherwise it uses the available CPU cores. I believe that this code is not optimized because it uses only one core; the main part of the code …
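For what it's worth, TensorFlow does expose CPU threading knobs that must be set before any ops are created; whether they help depends on where the time is actually spent, since small models and Python-side environment stepping often keep a single core busy regardless. The thread counts below are illustrative:

```python
import tensorflow as tf

# Must be called before TensorFlow creates any ops.
tf.config.threading.set_intra_op_parallelism_threads(8)  # threads used inside a single op
tf.config.threading.set_inter_op_parallelism_threads(8)  # threads used across independent ops
```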
I want to build an agent for binary classification. I have a large dataset with two labels (0 and 1), and I want to build an agent to predict the labels. I have built a deep model and now I want to build an agent. I use keras-rl2, but there is a problem: for the DQN agent, the fit function has an env argument, and I don't know how to define an environment for my problem. Note that my problem has a similarity function …
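One way to picture such an environment (a sketch only, with made-up names and an arbitrary reward scheme) is a Gym Env that serves one sample per step and rewards a correct label prediction; keras-rl2's fit expects this classic reset/step interface:

```python
import gym
import numpy as np
from gym import spaces

class ClassifyEnv(gym.Env):
    """Hypothetical environment: each step shows one sample, the action is the predicted label."""
    def __init__(self, X, y):
        super().__init__()
        self.X, self.y = X, y
        self.idx = 0
        self.action_space = spaces.Discrete(2)            # predict label 0 or 1
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf,
                                            shape=(X.shape[1],), dtype=np.float32)

    def reset(self):
        self.idx = 0
        return self.X[self.idx].astype(np.float32)

    def step(self, action):
        # Illustrative reward scheme: +1 for a correct label, -1 for a wrong one.
        reward = 1.0 if action == self.y[self.idx] else -1.0
        self.idx += 1
        done = self.idx >= len(self.X)
        obs = self.X[min(self.idx, len(self.X) - 1)].astype(np.float32)
        return obs, reward, done, {}
```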
I am new to reinforcement learning agent training. I have read about the PPO algorithm and used the stable-baselines library to train an agent with PPO. So my question is: how do I evaluate a trained RL agent? For a regression or classification problem I have metrics like r2_score or accuracy, etc. Are there any such metrics here, or how do I test the agent and conclude whether it is trained well or badly? Thanks.
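The usual stand-in for accuracy is the mean (and spread of) episode return over a number of evaluation episodes. A rough sketch, assuming the stable-baselines model.predict / Gym env API; the function name and episode count are placeholders:

```python
import numpy as np

def evaluate(model, env, n_episodes=100):
    """Average undiscounted return over n_episodes, using the greedy policy."""
    returns = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)  # no exploration at test time
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return np.mean(returns), np.std(returns)
```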
When I configure a DQN agent, nb_steps_warmup can be set. Is this parameter set per episode or once globally? What I am trying to ask is: imagine I have a game environment which takes about 3000 steps max. per episode. The DQN is fitted as follows: dqn.fit(env, nb_steps=30000, visualize=True, verbose=2). So, as I understand it, the fitting will run approximately 10 episodes (nb_steps / max. steps per episode). If I set nb_steps_warmup = 5000, what actually happens? A) nb_steps_warmup=5000, …
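For context, a sketch of where the parameter sits in keras-rl2 (CartPole and the layer sizes are stand-ins for the actual environment and model): nb_steps_warmup is passed to the agent's constructor and is compared against the same global step counter that fit() increments across episodes.

```python
import gym
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.memory import SequentialMemory
from rl.policy import EpsGreedyQPolicy

env = gym.make("CartPole-v1")                        # stand-in for the game environment
nb_actions = env.action_space.n

model = Sequential([
    Flatten(input_shape=(1,) + env.observation_space.shape),
    Dense(64, activation="relu"),
    Dense(nb_actions, activation="linear"),
])

memory = SequentialMemory(limit=50000, window_length=1)
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory,
               nb_steps_warmup=5000,                 # counted on the global step counter of fit()
               target_model_update=1e-2, policy=EpsGreedyQPolicy())
dqn.compile(Adam(learning_rate=1e-3), metrics=["mae"])
dqn.fit(env, nb_steps=30000, visualize=True, verbose=2)
```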
I'm creating the model for a DDPG agent (keras-rl version) but I'm having some trouble with errors whenever I try adding batch normalization to the first of the two networks. Here is the creation function as I'd like it to be:

```python
def buildDDPGNets(actNum, obsSpace):
    actorObsInput = Input(shape=(1,) + obsSpace, name="actor_obs_input")
    a = Flatten()(actorObsInput)
    a = Dense(600, use_bias=False)(a)
    a = BatchNormalization()(a)
    a = Activation("relu")(a)
    a = Dense(300, use_bias=False)(a)
    a = BatchNormalization()(a)
    a = …
```
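For reference, one possible completion of such a builder (a sketch only: the tanh output, layer sizes, and critic structure follow the usual keras-rl DDPG example pattern and are not the asker's actual code beyond what is shown above):

```python
from tensorflow.keras.layers import (Input, Flatten, Dense, BatchNormalization,
                                     Activation, Concatenate)
from tensorflow.keras.models import Model

def build_ddpg_nets(act_num, obs_space):
    # Actor: observation -> action, with batch norm before each activation.
    actor_obs_input = Input(shape=(1,) + obs_space, name="actor_obs_input")
    a = Flatten()(actor_obs_input)
    a = Dense(600, use_bias=False)(a)
    a = BatchNormalization()(a)
    a = Activation("relu")(a)
    a = Dense(300, use_bias=False)(a)
    a = BatchNormalization()(a)
    a = Activation("relu")(a)
    a = Dense(act_num, activation="tanh")(a)
    actor = Model(inputs=actor_obs_input, outputs=a)

    # Critic: (action, observation) -> Q-value.
    action_input = Input(shape=(act_num,), name="action_input")
    critic_obs_input = Input(shape=(1,) + obs_space, name="critic_obs_input")
    c = Concatenate()([action_input, Flatten()(critic_obs_input)])
    c = Dense(600, activation="relu")(c)
    c = Dense(300, activation="relu")(c)
    c = Dense(1, activation="linear")(c)
    critic = Model(inputs=[action_input, critic_obs_input], outputs=c)

    # keras-rl's DDPGAgent also needs the critic's action input tensor.
    return actor, critic, action_input
```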
I am looking to stabilize my DQN results, and I found that clipping is one technique to do it, but I did not understand it completely!
1- What are the effects of clipping the reward, clipping the gradient, and clipping the error on stability, and how do they make the results more stable?
2- In the Nature DQN paper it is written that they clip the reward. Would you please explain this more?
3- Which of them is most effective for stability?
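For concreteness, the three kinds of clipping mentioned above typically look like this in Keras/NumPy terms (the clip norm and Huber delta values are illustrative):

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# 1) Reward clipping (as in the Nature DQN paper): keep only the sign of the reward.
def clip_reward(reward):
    return float(np.sign(reward))  # -1, 0, or +1 regardless of magnitude

# 2) Gradient clipping: cap the norm of the gradients inside the optimizer.
optimizer = keras.optimizers.Adam(learning_rate=1e-4, clipnorm=1.0)

# 3) Error clipping: Huber loss is quadratic for small TD errors and linear
#    beyond `delta`, so large errors do not produce huge gradients.
huber = keras.losses.Huber(delta=1.0)
```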
Hi, I am trying to develop an RL agent using the PPO algorithm. My agent takes an action (CFM) to maintain a state variable called RAT between 24 and 24.5. I am using the PPO algorithm from the stable-baselines library to train my agent, and I have trained it for 2M steps. Hyper-parameters in the code:

```python
def __init__(self, *args, **kwargs):
    super(CustomPolicy, self).__init__(*args, **kwargs,
                                       net_arch=[dict(pi=[64, 64], vf=[64, 64])],
                                       feature_extraction="mlp")

model = PPO2(CustomPolicy, env, gamma=0.8, n_steps=132, ent_coef=0.01,
             learning_rate=1e-3, vf_coef=0.5, max_grad_norm=0.5, lam=0.95,
             nminibatches=4, noptepochs=4, cliprange=0.2, cliprange_vf=None,
             verbose=0, tensorboard_log="./20_01_2020_logs/",
             _init_setup_model=True, …
```
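Continuing from the model above, the 2M-step training run and a checkpoint save would typically be launched like this in stable-baselines (the log name and file name below are placeholders):

```python
# Train for the stated 2M steps and keep a checkpoint for later evaluation.
model.learn(total_timesteps=2_000_000, tb_log_name="ppo2_rat_run")
model.save("ppo2_rat")
```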
I solved CartPole-v0 with a CEM agent pretty easily (experiments and code), but I struggle to find a setup which works with DQN. Do you know which parameters should be adjusted so that the mean reward is about 200 for this problem?
What I tried:
- Adjustments in the model: deeper / less deep, neurons per layer
- Memory size (how many steps are stored for replay)
What I'm unsure about:
- How should I choose the memory size? Is higher always better? …
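For reference, these are the keras-rl knobs the list above refers to; the numbers below are illustrative, not a known-good recipe for CartPole:

```python
from rl.memory import SequentialMemory
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy

memory = SequentialMemory(limit=50000, window_length=1)        # replay buffer size
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr="eps",  # anneal exploration over time
                              value_max=1.0, value_min=0.02,
                              value_test=0.0, nb_steps=10000)
```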
I am new to reinforcement learning and experimenting with training RL agents. I have a doubt about reward formulation: from a given state, if an agent takes a good action I give a positive reward, and if the action is bad, I give a negative reward. So if I give the agent very high positive rewards when it takes a good action, like 100 times the value of the negative rewards, will it help the agent during training? …
I'm trying to replicate the DQN Atari experiment. My DQN isn't performing well; looking at other people's code, I saw something about experience replay which I don't understand. First, when you define your CNN, in the first layer you have to specify the input size (I'm using Keras + TensorFlow, so in my case it's something like (105, 80, 4), which corresponds to the height, width, and number of images I feed my CNN). In the code I reviewed, when they get …
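The usual pattern behind that (105, 80, 4) input is to keep a rolling buffer of the last four preprocessed frames and stack them into a single observation. A self-contained sketch (the class and method names are mine):

```python
from collections import deque
import numpy as np

class FrameStacker:
    """Keep the last n_frames preprocessed frames and stack them into one observation."""
    def __init__(self, n_frames=4, frame_shape=(105, 80)):
        self.frames = deque(maxlen=n_frames)
        self.n_frames = n_frames
        self.frame_shape = frame_shape

    def reset(self, first_frame):
        # At the start of an episode, fill the buffer with copies of the first frame.
        for _ in range(self.n_frames):
            self.frames.append(first_frame)
        return self.observation()

    def push(self, frame):
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        # Stack along the last axis -> shape (105, 80, 4), matching the CNN input.
        return np.stack(self.frames, axis=-1)
```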
How do I implement reward clipping for DQN in Keras, and in particular, how do I implement the clipping of the reward itself? Is this pseudocode correct:

```python
if reward < -threshold:
    reward = -1
elif reward > threshold:
    reward = 1
else:  # -threshold <= reward <= threshold
    reward = reward / threshold
```

And if the reward is always positive, how can the clipping be changed?
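For what it's worth, the branching above is equivalent to a single clipped rescaling; a compact sketch of that equivalence:

```python
import numpy as np

def clip_reward_scaled(reward, threshold):
    # Same behaviour as the pseudocode above: rescale by threshold, then cap at [-1, 1].
    return float(np.clip(reward / threshold, -1.0, 1.0))
```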