Policy Gradient custom loss function not working

I was experimenting with my policy gradient reinforcement learning algorithm, and I was wondering if I could use a method similar to supervised learning with a cross-entropy loss. So, instead of using existing labels, I would generate a label for every step in the trajectory.

Depending on the value of the action, I would shift the stochastic policy's (neural network's) output towards a better action distribution and use that shifted output as the label in a cross-entropy loss function.

Example of one step: policy output: [0.2, 0.8]; value of the action: -0.5; action taken: 1 (probability 0.8). Created label: [0.3, 0.7] (the second action was not that great, so reduce its probability a little bit).
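Roughly, the idea is something like the following sketch (names such as `policy_net` and the `shift` amount are just for illustration, not an exact copy of my code):

```python
import torch

# Sketch of the "shifted label" idea: build a target distribution by
# nudging probability mass towards/away from the taken action, then
# minimise cross-entropy against that hand-made label.
def shifted_label_loss(policy_net, state, action, value, shift=0.1):
    probs = policy_net(state)            # e.g. tensor([0.2, 0.8])
    target = probs.detach().clone()
    if value < 0:
        # Action turned out badly: reduce its probability a little.
        target[action] -= shift
    else:
        # Action turned out well: increase its probability a little.
        target[action] += shift
    target = target.clamp(1e-6, 1.0)
    target = target / target.sum()       # renormalise, e.g. [0.3, 0.7]
    # Cross-entropy between the created label and the current policy.
    return -(target * torch.log(probs + 1e-8)).sum()
```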

My method didn't work, and I am really curious to know why.



There could be many reasons. Custom loss functions are difficult to get right.

One conceptual problem is that the agent's own policy should not be used as the label. The training signal should come from the reward given by the environment. If the reward signal drives the update, the agent learns which policy maximizes reward. If the target is instead derived from the agent's own policy output, the agent "chases its own tail" and never learns anything about environmental rewards.
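For contrast, a standard REINFORCE-style loss weights the log-probability of the taken action by the return, so the reward signal, not a hand-made label, drives the update. A minimal PyTorch sketch (names like `policy_net` and `returns` are only illustrative):

```python
import torch
from torch.distributions import Categorical

# REINFORCE loss: no constructed labels; the environment's returns
# scale how strongly each taken action is reinforced or discouraged.
def reinforce_loss(policy_net, states, actions, returns):
    probs = policy_net(states)             # (T, n_actions) action probabilities
    dist = Categorical(probs=probs)
    log_probs = dist.log_prob(actions)     # (T,) log pi(a_t | s_t)
    # Negative sign: gradient descent on this loss is gradient ascent
    # on the expected return.
    return -(log_probs * returns).sum()
```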
