Policy Gradient custom loss function not working
I was experimenting with my policy gradient reinforcement learning algorithm, and I was wondering if I could use a method similar to supervised cross-entropy training. Instead of using existing labels, I would generate a label for every step in the trajectory.
Depending on the value of the action, I would shift the stochastic policy's output (a neural network) toward a better action distribution, then use that shifted distribution as the label in a cross-entropy loss.
Example of a single step:

- Policy output: [0.2, 0.8]
- Value of the action: -0.5
- Action taken: 1 (probability 0.8)
- Created label: [0.3, 0.7] (the second action had negative value, so reduce its probability by a little bit)
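To make this concrete, here is a minimal sketch of the idea in PyTorch. The network shape, the `make_label` helper, and the 0.1 shift size are illustrative assumptions, not my exact code:

```python
import torch
import torch.nn as nn

# Policy network: outputs a probability distribution over 2 actions.
policy = nn.Sequential(
    nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2), nn.Softmax(dim=-1)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def make_label(probs, action, value, shift=0.1):
    """Build a pseudo-label by nudging the taken action's probability
    up (positive value) or down (negative value), then renormalizing."""
    label = probs.detach().clone()
    delta = shift if value > 0 else -shift
    label[action] += delta
    label[1 - action] -= delta        # two-action case, as in the example
    label = label.clamp(1e-6, 1.0)
    return label / label.sum()

state = torch.randn(4)                    # dummy state
probs = policy(state)                     # e.g. [0.2, 0.8]
action = torch.multinomial(probs, 1).item()
value = -0.5                              # value of the action taken
label = make_label(probs, action, value)  # e.g. [0.3, 0.7]

# Cross-entropy between the created label and the policy output.
loss = -(label * probs.clamp_min(1e-9).log()).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```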
My method didn't work (the policy never improved), and I am really curious to know why.