Is my exploration scheme in reinforcement learning done correctly?

So, I am training a deterministic policy, represented basically by a convolutional network. My action space is a vector of weights / probabilities output by the network. The actions encoded in that vector then determine the value of my reward function, which is to be minimized. I train sequentially on time series data and, at time t, always provide the last action a(t-1) as input to the CNN, in addition to the state s(t). Thus:

a(t) = model(a(t-1),s(t))
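
For concreteness, this is roughly what I mean (a PyTorch-style sketch; the layer sizes and names are just placeholders, not my actual architecture):

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    # Deterministic policy: maps (previous action, current state) to a probability vector.
    def __init__(self, n_actions, state_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(state_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(16 + n_actions, n_actions)

    def forward(self, prev_action, state):
        # state: (batch, state_channels, length); prev_action: (batch, n_actions)
        features = self.conv(state).squeeze(-1)          # (batch, 16)
        x = torch.cat([features, prev_action], dim=-1)   # concatenate a(t-1) with state features
        return torch.softmax(self.head(x), dim=-1)       # a(t): a valid probability vector
```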

Now, I am relatively new to RL; so far I have often read that an important element of RL is exploration, i.e. the tendency to choose random actions that explore the space of possible outcomes of the reward function.

In my model, in order to introduce exploration, my approach is to add noise to the policy-derived, optimal action a(t):

a(t) += noise

The noise I choose is uniformly distributed in [0, 1]. Since the vector a is supposed to represent probabilities, after adding noise to each entry of a, I renormalize a (by dividing by the sum of its entries) to obtain a valid, noisy probability vector. This vector is then fed to the reward function. Would that be a valid approach?
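
To make the step concrete, here is roughly what I do (a NumPy sketch; the function name and shapes are just placeholders):

```python
import numpy as np

rng = np.random.default_rng()

def explore(action):
    # action: the policy output a(t), a probability vector (entries >= 0, summing to 1)
    noisy = action + rng.uniform(0.0, 1.0, size=action.shape)  # add U[0,1] noise to each entry
    return noisy / noisy.sum()                                 # renormalize so it sums to 1 again
    # the returned noisy vector is what I feed to the reward function
```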

Thanks! Best, JZ

Tags: policy-gradients, cnn, reinforcement-learning, deep-learning

Category: Data Science
