Am I using this neural network in the wrong way?
I'm trying to solve an RL problem, the Contextual Bandit problem, using Deep Q-Learning. My data is all simulated. I have this environment:
import numpy as np

class Environment():
    def __init__(self):
        self._observation = np.zeros((3,))

    def interact(self, action):
        # Draw a fresh random context of three integers in [0, 90)
        self._observation = np.zeros((3,))
        c1, c2, c3 = np.random.randint(0, 90, 3)
        self._observation[0] = c1
        self._observation[1] = c2
        self._observation[2] = c3
        reward = -1.0
        condition = False
        # The reward is 0.0 only when the action matches the context's bucket
        if (c1 < 30) and (c2 < 30) and (c3 < 30) and action == 0:
            condition = True
        elif (30 <= c1 < 60) and (30 <= c2 < 60) and (30 <= c3 < 60) and action == 1:
            condition = True
        elif (60 <= c1 < 90) and (60 <= c2 < 90) and (60 <= c3 < 90) and action == 2:
            condition = True
        else:
            if action == 4:
                condition = True
        if condition:
            reward = 0.0
        return {'Observation': self._observation,
                'Reward': reward}
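For reference, a single interaction looks like this (just a quick sanity check, not part of the training code):

env = Environment()
out = env.interact(0)
print(out['Observation'], out['Reward'])   # e.g. [12. 25.  7.] 0.0 when all three values are below 30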
I tried many different neural architectures, all fully-connected, so I'll use this one as a representative example:
from tensorflow import keras

n_inputs = 3
n_outputs = 4

model = keras.models.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=[n_inputs]),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(n_outputs)])

loss_fn = keras.losses.mean_squared_error
optimizer = keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss=loss_fn, optimizer=optimizer)
It takes three inputs, which are the observations returned from the environment (three integers), and the observations are normalized before being fed to the network.
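The normalization itself is not shown in the snippets above; a minimal sketch of the kind of scaling I mean (dividing by the maximum possible context value, 90, is just one option):

def normalize(observation):
    # One possible choice of scaling: map raw contexts from [0, 90) down to [0, 1)
    return observation / 90.0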
To collect experiences, I use the following code:
def epsilon_greedy_policy(observation, epsilon=0):
    # With probability epsilon take a random action, otherwise act greedily
    if np.random.rand() < epsilon:
        return np.random.randint(4)
    else:
        Q_values = model.predict(observation[np.newaxis])
        return np.argmax(Q_values[0])

def sample_experiences():
    # I simply use the whole replay buffer as one batch
    batch = [replay_buffer[index] for index in range(len(replay_buffer))]
    observations, rewards, actions = [
        np.array([experience[field_index] for experience in batch])
        for field_index in range(3)]
    return observations, rewards, actions

def play_one_step(env, observation, epsilon):
    action = epsilon_greedy_policy(observation, epsilon)
    observation, reward = env.interact(action).values()
    replay_buffer.append((observation, reward, action))
    return observation, reward
Now, to update the model's weights so that the predicted Q-values converge to the true ones, I run the following snippet:
import tensorflow as tf
from tqdm import tqdm

env = Environment()
replay_buffer = []

epsilon = 0.01
obs = np.random.randint(0, 90, 3)
for train_step in tqdm(range(1000)):
    # Collect a batch of 128 experiences
    for i in range(128):
        obs, reward = play_one_step(env, obs, epsilon)
    observations, rewards, actions = sample_experiences()
    # With a contextual bandit there is no next state, so the target is just the reward
    target_Q_values = rewards
    mask = tf.one_hot(actions, n_outputs)
    with tf.GradientTape() as tape:
        all_Q_values = model(observations)
        # Keep only the Q-value of the action that was actually taken
        Q_values = tf.reduce_sum(all_Q_values * mask, axis=1, keepdims=True)
        loss = tf.reduce_mean(loss_fn(target_Q_values, Q_values))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    replay_buffer.clear()
I took this code from a book. At first I wanted to use something like model.fit(X, y), but my inputs are observations, and what we want is for the model's predictions to approach the rewards/true Q-values. So we are not trying to estimate a relationship between $X$ and $y$ that is already in the data, but rather to learn a function $y = f(X)$ such that $f(X)$ is as close as possible to the rewards.
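If I understand correctly, the closest model.fit equivalent would be to build full Q-value targets where only the taken action's entry is replaced by the observed reward, roughly along these lines (just a sketch reusing sample_experiences from above):

# Sketch of a fit-based alternative: keep the model's own predictions as targets
# for the actions that were not taken, so only the taken action contributes error.
observations, rewards, actions = sample_experiences()
target_Q = model.predict(observations)                  # current estimates, shape (N, 4)
target_Q[np.arange(len(actions)), actions] = rewards    # overwrite only the taken actions
model.fit(observations, target_Q, epochs=1, verbose=0)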
The problem is that no matter how I tune the model (which is fairly simple in this case), I get catastrophic results. By catastrophic I mean totally random results. I check how the model is doing with the following code:
# Build 31 contexts whose components are all below 30, so action 0 is always correct
check0 = np.random.randint(0, 30, 3)
for i in range(30):
    arr = np.random.randint(0, 30, 3)
    check0 = np.vstack((check0, arr))

predictions = model.predict(check0)
c = 0
for i in range(predictions.shape[0]):
    if np.argmax(predictions[i]) == 0:
        c += 1
print((c / predictions.shape[0]) * 100)
If the model's predictions were close to the rewards, I'd get 100%. But sometimes I get 9%; when I rerun I get 45%, 65%...
I first asked on ai.stackexchange.com, and one piece of advice was to normalize the inputs, which I did, but I still get the same results. The helper also told me to look into the details of each block of code, which I did, and everything seems to work fine. I suspect the training block, but I don't have the knowledge to analyze it in full detail: I know what the functions do, but not the details of their implementations.
I saw a similar problem on this forum, but it seems that author had a continuing (non-episodic) problem. With contextual bandits each step is its own episode, so I think we can run as many steps as we want without worrying about a terminal state.
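In other words, since there is no next state, the usual one-step target $r + \gamma \max_{a'} Q(s', a')$ should reduce to just the immediate reward $r$, which is why target_Q_values is simply rewards in the training loop above (at least that's my understanding).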
I hope someone can help me figure out where the issue arises. Thank you.
Topic dqn q-learning keras reinforcement-learning deep-learning
Category Data Science