Am I using this neural network in the wrong way?

I'm trying to solve an RL problem, the Contextual Bandit problem, using Deep Q-Learning. My data is all simulated. I have this environment:

import numpy as np


class Environment():

    def __init__(self):
        self._observation = np.zeros((3,))

    def interact(self, action):
        # draw a new random observation of three integers in [0, 90)
        self._observation = np.zeros((3,))
        c1, c2, c3 = np.random.randint(0, 90, 3)
        self._observation[0] = c1
        self._observation[1] = c2
        self._observation[2] = c3
        reward = -1.0
        condition = False
        if (c1 < 30) and (c2 < 30) and (c3 < 30) and action == 0:
            condition = True
        elif (30 <= c1 < 60) and (30 <= c2 < 60) and (30 <= c3 < 60) and action == 1:
            condition = True
        elif (60 <= c1 < 90) and (60 <= c2 < 90) and (60 <= c3 < 90) and action == 2:
            condition = True
        else:
            if action == 4:
                condition = True
        if condition:
            reward = 0.0

        return {"Observation": self._observation,
                "Reward": reward}
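
For illustration, a quick sanity check of the reward logic could look like this (not part of my training code; it just calls the environment once):

env = Environment()
result = env.interact(0)
print(result["Observation"], result["Reward"])
# since action 0 was taken, the reward is 0.0 only when all three
# components happen to be below 30, otherwise it is -1.0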

I tried many different neural architectures, all fully connected, so I'm going with this one for illustration purposes:

import tensorflow as tf
from tensorflow import keras

n_inputs = 3
n_outputs = 4

model = keras.models.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=[n_inputs]),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(n_outputs)])

loss_fn = keras.losses.mean_squared_error
optimizer = keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss=loss_fn, optimizer=optimizer)

It takes three inputs, which are the observations returned from the environment (three integers). As you can see, the observations are normalized.
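
Concretely, by "normalized" I mean the three components are scaled into [0, 1] before reaching the network; a minimal sketch (assuming simple min-max scaling by the known upper bound of 90) would be:

def normalize(observation):
    # assumption: the components are drawn from [0, 90),
    # so dividing by 90 maps them into [0, 1)
    return np.asarray(observation, dtype=np.float32) / 90.0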

And to get experiences I use the following code:

def epsilon_greedy_policy(observation, epsilon=0):
    # with probability epsilon take a random action, otherwise the greedy one
    if np.random.rand() < epsilon:
        return np.random.randint(4)
    else:
        Q_values = model.predict(observation[np.newaxis])
        return np.argmax(Q_values[0])

def sample_experiences():
    # turn the whole replay buffer into arrays of observations, rewards and actions
    batch = [replay_buffer[index] for index in range(len(replay_buffer))]
    observations, rewards, actions = [np.array([experience[field_index] for experience in batch]) for field_index in range(3)]
    return observations, rewards, actions

def play_one_step(env, observation, epsilon):
    action = epsilon_greedy_policy(observation, epsilon)
    observation, reward = env.interact(action).values()
    replay_buffer.append((observation, reward, action))
    return observation, reward
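
(replay_buffer itself is not shown above; a plain Python list works here, since the code only appends to it, indexes it, and clears it:)

replay_buffer = []  # list of (observation, reward, action) tuples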

Now, to update the weights of the model so that its predicted Q-values converge to the real ones, I execute the following snippet of code:

from tqdm import tqdm

env = Environment()
epsilon = 0.01
obs = np.random.randint(0, 90, 3)

for train_step in tqdm(range(1000)):

    # collect 128 experiences with the epsilon-greedy policy
    for i in range(128):
        obs, reward = play_one_step(env, obs, epsilon)

    observations, rewards, actions = sample_experiences()
    target_Q_values = rewards
    mask = tf.one_hot(actions, n_outputs)
    with tf.GradientTape() as tape:
        all_Q_values = model(observations)
        Q_values = tf.reduce_sum(all_Q_values * mask, axis=1, keepdims=True)
        loss = tf.reduce_mean(loss_fn(target_Q_values, Q_values))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

    replay_buffer.clear()

I took this code from a book. At first I wanted to use something like model.fit(X, y), but my inputs are observations, and what we want is for the model's predictions to approach the rewards / real Q-values. So we're not trying to estimate the relationship between $X$ and $y$, but to learn a function $y = f(X)$ such that $f(X)$ is as close as possible to the rewards.
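
For reference, the fit-style version I had in mind would look roughly like this sketch (assuming the targets are the model's own predictions, with only the taken action's entry overwritten by the observed reward, so the MSE only penalizes the chosen action):

def fit_on_batch(observations, rewards, actions):
    # predicted Q-values serve as their own targets, except for the
    # taken actions, whose entries are replaced by the observed rewards
    targets = model.predict(observations, verbose=0)
    targets[np.arange(len(actions)), actions] = rewards
    model.fit(observations, targets, epochs=1, verbose=0)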

The problem is that no matter how I tune the model (which is fairly simple in this case), I get catastrophic results. By catastrophic I mean totally random results. I check how the model is doing with the following code:

# 31 random observations whose components are all in [0, 30),
# so the correct action for every one of them is 0
check0 = np.random.randint(0, 30, 3)

for i in range(30):
    arr = np.random.randint(0, 30, 3)
    check0 = np.vstack((check0, arr))

predictions = model.predict(check0)

# percentage of observations for which the greedy action is 0
c = 0
for i in range(predictions.shape[0]):
    if np.argmax(predictions[i]) == 0:
        c += 1

(c / predictions.shape[0]) * 100

If the model's predictions were close to the rewards, I would get 100%. Instead I sometimes get 9%; I rerun and get 45%, then 65%...
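
A slightly broader check covers all three ranges at once; a sketch along the same lines as the snippet above:

# check the greedy action on each of the three ranges
# (if the network inputs are normalized, the same scaling must be applied here)
for low, high, expected_action in [(0, 30, 0), (30, 60, 1), (60, 90, 2)]:
    batch = np.random.randint(low, high, size=(100, 3))
    preds = model.predict(batch, verbose=0)
    accuracy = np.mean(np.argmax(preds, axis=1) == expected_action)
    print(f"[{low}, {high}): {accuracy * 100:.1f}% of greedy actions equal {expected_action}")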

I first asked on ai.stackexchange.com, and the first piece of advice was to use normalization, which I did, but I still get the same results. The person helping me told me that I needed to look into the details of each block of code, which I did, and everything seems to work fine. I suspect the training block, but I don't have the necessary knowledge to analyze it in every detail: I know what the functions do, but not the details of their implementations.

I saw a similar problem on this forum, but it seems that author had a continuous problem. With Contextual Bandits, each step is an episode, so I think we can run as many steps as we want without worrying about a terminal state or anything like that.

I hope someone can help me figure out where the issue arises. Thank you.

Topic dqn q-learning keras reinforcement-learning deep-learning

Category Data Science
