Policy Gradient custom loss function not working

I was experimenting with my policy gradient reinforcement learning algorithm, and I was wondering if I could use a method similar to supervised learning with a cross-entropy loss. So, instead of using existing labels, I would generate a label for every step in the trajectory.

Depending on the value of the action, I would shift the stochastic policy's (neural network's) output towards a better action distribution and use that shifted output as the label in a cross-entropy loss function.

Example of one step: policy output: [0.2, 0.8]; value of the action: -0.5; action taken: 1 (probability 0.8). Created label: [0.3, 0.7] (the second action was not that great, so reduce its probability a little bit).
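Roughly, the idea is something like the following sketch (names such as `policy_net` and the `shift` amount are just for illustration, not an exact copy of my code):

```python
import torch

# Sketch of the "shifted label" idea: build a target distribution by
# nudging probability mass towards/away from the taken action, then
# minimise cross-entropy against that hand-made label.
def shifted_label_loss(policy_net, state, action, value, shift=0.1):
    probs = policy_net(state)            # e.g. tensor([0.2, 0.8])
    target = probs.detach().clone()
    if value < 0:
        # Action turned out badly: reduce its probability a little.
        target[action] -= shift
    else:
        # Action turned out well: increase its probability a little.
        target[action] += shift
    target = target.clamp(1e-6, 1.0)
    target = target / target.sum()       # renormalise, e.g. [0.3, 0.7]
    # Cross-entropy between the created label and the current policy.
    return -(target * torch.log(probs + 1e-8)).sum()
```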

My method didn't work, and I am really curious to know why.



There could be many reasons. Custom loss functions are difficult to get right.

One conceptual problem is that the agent's own policy should not be used as the label. The training signal should come from the reward given by the environment. If the reward signal drives the update, the agent learns which policy maximizes reward. If the target is instead derived from the agent's own policy output, the agent "chases its own tail" and never learns anything about environmental rewards.
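For contrast, a standard REINFORCE-style loss weights the log-probability of the taken action by the return, so the reward signal, not a hand-made label, drives the update. A minimal PyTorch sketch (names like `policy_net` and `returns` are only illustrative):

```python
import torch
from torch.distributions import Categorical

# REINFORCE loss: no constructed labels; the environment's returns
# scale how strongly each taken action is reinforced or discouraged.
def reinforce_loss(policy_net, states, actions, returns):
    probs = policy_net(states)             # (T, n_actions) action probabilities
    dist = Categorical(probs=probs)
    log_probs = dist.log_prob(actions)     # (T,) log pi(a_t | s_t)
    # Negative sign: gradient descent on this loss is gradient ascent
    # on the expected return.
    return -(log_probs * returns).sum()
```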
