Inverted pendulum with REINFORCE
I am learning reinforcement learning, and as practice I am trying to stabilize an inverted pendulum (gym: Pendulum-v0) in the upright position using a policy gradient method: REINFORCE.
I have some questions; I have tried a lot but could not figure these out. An answer to any of them would help me. Thanks in advance.
1- Why are the observations in the pendulum code cos(theta), sin(theta) and theta_dot, rather than just theta and theta_dot?
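For context, this is how I currently read the observation back into an angle (a minimal sketch; the [cos(theta), sin(theta), theta_dot] layout is what the Pendulum-v0 docs describe):

import numpy as np
import gym

env = gym.make('Pendulum-v0')
obs = env.reset()  # obs = [cos(theta), sin(theta), theta_dot]

# recover the angle from its sine and cosine; np.arctan2 returns a value in (-pi, pi]
theta = np.arctan2(obs[1], obs[0])
theta_dot = obs[2]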
2- Is the action I send to the environment (env.step(action)) correct like this (in Python)?
import numpy as np

# output of the neural network: probability of choosing action 1
prob = output.eval(arguments={observations: state})[0][0][0]
# Bernoulli sample: 1 if a uniform random draw falls below prob, else 0
action = 1 if np.random.uniform() < prob else 0
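For completeness, this is how I then pass it to the environment. The mapping of the binary action to a torque is my own assumption (bang-bang control), since Pendulum-v0 expects a continuous torque in [-2, 2] with shape (1,):

# map the Bernoulli sample to full torque in either direction (my assumption)
torque = 2.0 if action == 1 else -2.0
state, reward, done, info = env.step(np.array([torque]))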
3- The reward function is defined in the pendulum code as follows, but I couldn't understand why. Shouldn't it be something like: if the pendulum is upright (within some tolerance), the reward is high, otherwise zero?
costs = angle_normalize(th)**2 + .1*thdot**2 + .001*(u**2)
# angle_normalize wraps the angle into [-pi, pi)
reward = -costs
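For reference, the environment defines angle_normalize roughly as follows (as I read it in the gym source):

import numpy as np

def angle_normalize(x):
    # wrap an angle into [-pi, pi)
    return ((x + np.pi) % (2 * np.pi)) - np.pi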
4- The pendulum equation of motion is different from the familiar one (like here); the environment code uses the following update. Where does it come from?
newthdot = thdot + (-3*g/(2*l) * np.sin(th + np.pi) + 3./(m*l**2)*u) * dt
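My working guess, in case it helps frame the question (a sketch, assuming the environment models the pendulum as a uniform rod of mass $m$ and length $l$ pivoted at one end, with $\theta = 0$ pointing up):

$$I\ddot{\theta} = mg\frac{l}{2}\sin\theta + u, \qquad I = \frac{1}{3}ml^2 \quad\Rightarrow\quad \ddot{\theta} = \frac{3g}{2l}\sin\theta + \frac{3}{ml^2}u$$

Since $\sin(\theta + \pi) = -\sin\theta$, the $-\frac{3g}{2l}\sin(\theta + \pi)$ term in the code would be the same gravity term $+\frac{3g}{2l}\sin\theta$, and the update above would then be one explicit Euler step of size dt. I am not sure this is the intended reading, though.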
Topic policy-gradients reinforcement-learning python
Category Data Science