How to calculate the temperature variable in softmax (Boltzmann) exploration

Hi, I am developing a reinforcement learning agent for a continuous state / discrete action space. I am trying to use Boltzmann/softmax exploration as the action selection strategy. My action space is of size 5000.

My implementation of Boltzmann exploration:

import numpy as np

def get_action(state, episode, temperature=1):
    # Encode the state and predict a Q-value for each of the 5000 actions
    state_encod = np.reshape(state, [1, state_size])
    q_values = model.predict(state_encod)

    # Boltzmann/softmax probabilities: exp(Q/T) divided by the sum of numerators
    prob_act = np.exp(q_values[0] / temperature)
    prob_act = prob_act / np.sum(prob_act)

    # Sample an action index directly from the distribution (looking the
    # sampled Q-value up with np.where is ambiguous when two actions tie)
    action_key = np.random.choice(len(prob_act), p=prob_act)
    action = index_to_action_mapping[action_key]
    return action

If my temperature variable is 200, I get the following error after 100 episodes:

ValueError: probabilities contain NaN

If my temperature is 1, I get the NaN error within very few episodes.

Why is this happening? Am I doing something wrong here? How should I select the temperature variable? Can someone help me with this?

Thanks.

Topic: dqn softmax ai reinforcement-learning deep-learning

Category: Data Science


As Liuyang said in the other answer, the cause of that error is that q_values[0][i]/temperature eventually becomes large enough for the exponential to overflow to inf, and the normalization step then turns the result into NaN.
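For concreteness, here is a minimal standalone demonstration of the failure mode (the numbers are illustrative, not taken from your agent):

    import numpy as np

    # A large Q / temperature ratio overflows np.exp to inf, and the
    # normalization step then produces inf / inf = nan.
    logits = np.array([1000.0, 1.0])         # stand-in for q_values[0] / temperature
    exp_logits = np.exp(logits)              # [inf, 2.718...], with an overflow RuntimeWarning
    probs = exp_logits / exp_logits.sum()    # [nan, 0.] -> "probabilities contain NaN"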

There are several ways to deal with this. For example, you can normalize the values in q_values by dividing by np.max(q_values). In this way, the maximum of q_values becomes 1, and it is much easier to control overflows. Another alternative is to clip q_values[0][i]/temperature before exponentiation, e.g. with np.minimum(q_values[0][i]/temperature, 10), or another cap you see fit. However, clipping makes you lose important information when ranking the actions: if all values of q_values[0][i]/temperature are above 10, you get the same result you would get if temperature were infinite. I prefer the normalization alternative.
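As an illustration, here is a minimal sketch of the standard numerically stable softmax. Instead of dividing by the maximum as above, it subtracts the maximum before exponentiating (the log-sum-exp shift): softmax is invariant to a constant shift, so the resulting probabilities are identical to the original formula, but the largest exponent becomes 0 and np.exp can no longer overflow.

    import numpy as np

    def boltzmann_probs(q, temperature=1.0):
        # Shift the Q-values so the largest is 0 before exponentiating.
        # Softmax is unchanged by a constant shift, and exp() of a
        # non-positive number cannot overflow.
        z = (np.asarray(q, dtype=np.float64) - np.max(q)) / temperature
        exp_z = np.exp(z)
        return exp_z / exp_z.sum()

Inside get_action, prob_act = boltzmann_probs(q_values[0], temperature) would then stay finite no matter how large the Q-values grow during training.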


The reason you get the error is that as the temperature is annealed, q_values/temperature becomes a value so large that the exponential cannot compute it; the softmax then produces NaN (or collapses to 0 or 1), which causes the error. I got the same problem and am still fixing it. In my case, I just make sure the reward is not too big, which keeps the Q-values small enough for the softmax.
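A minimal sketch of that reward-scaling idea, assuming DQN-style clipping to [-1, 1] (the range is an assumption, not something stated in this thread):

    import numpy as np

    def clip_reward(raw_reward, low=-1.0, high=1.0):
        # Bound rewards before storing transitions. Q-values are discounted
        # sums of rewards, so this keeps q_values / temperature small and
        # np.exp away from overflow. The [-1, 1] default follows the DQN
        # convention and is an assumption here.
        return float(np.clip(raw_reward, low, high))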
