inverted pendulum REINFORCE

I am learning reinforcement learning, and as practice I am trying to stabilize an inverted pendulum (gym: Pendulum-v0) in the upright position using a policy gradient method: REINFORCE.

I have some questions; please help me. I have tried a lot but couldn't figure these out, and an answer to any of the questions would help. Thanks in advance.

1- Why are the observations in the pendulum code cos(theta), sin(theta), and theta_dot, and not just theta and theta_dot?

2- Is the action that I send to the environment (env.step(action)) correct like this (in Python)?

# output of the neural network: the probability of taking action 1
prob = output.eval(arguments={observations: state})[0][0][0]

# Bernoulli sample: 1 or 0 depending on a random threshold
action = 1 if np.random.uniform() < prob else 0

3- The reward function is defined in the pendulum code as follows, but I couldn't understand why. Shouldn't it be something like: if the pendulum is upright (within some tolerance) the reward is high, and otherwise zero?

costs = angle_normalize(th)**2 + .1*thdot**2 + .001*(action**2)
# angle_normalize wraps the angle into [-pi, pi]
reward = -costs

4- The pendulum equation of motion is different from the standard one (like here); the environment code uses the following equation. Where does it come from?

 newthdot = thdot + (-3*g/(2*l) * np.sin(th + np.pi) + 3./(m*l**2)*u) * dt



I can answer some of your questions.

First of all, I wrote REINFORCE code for this problem (the hyperparameters are not tuned properly). This link is for CartPole solved with REINFORCE.

  1. $\cos(\theta)$ gives the height of the pendulum and $\sin(\theta)$ gives how far to the left or right it is. If you used $\theta$ itself, the same information would be encoded in a much less convenient way: $\theta$ jumps discontinuously at $\pm\pi$ even though the physical position is the same, while $(\cos\theta, \sin\theta)$ is a continuous, unambiguous representation of the angle. That is probably the reason for this choice of inputs.

  2. You can refer to the link to make sure, but I think your action is wrong: if you create the environment with env = gym.make("Pendulum-v0"), the action is a single continuous torque between -2 and 2, not a Bernoulli 0/1 choice. You can check the range with print(env.action_space): it reports one action dimension whose domain is between -2 and 2 (see the first code sketch after this list).

  3. As I understand it from this link, the reward makes the agent minimize the following cost (see the second sketch after this list):

    $\theta^2 + 0.1\left(\frac{d\theta}{dt}\right)^2 + 0.001\,\text{action}^2$

It has three terms:

  1. $\theta^2$: minimizing this is the main goal, because driving $\theta$ to 0 means the pendulum points straight up.

  2. $\left(\frac{d\theta}{dt}\right)^2$: minimizing this term means staying steady and not swinging around.

  3. $\text{action}^2$: I think this term penalizes large torques, i.e. it keeps the control effort (energy) small.

  4. I do not know that exactly, but it seems that they simulate the dynamics with a simple state-space model integrated with Euler steps; one possible derivation is sketched at the end, after the code examples.
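
For question 2, here is a minimal sketch of how to inspect the action space and send a valid continuous action. The Gaussian policy is just the usual choice for REINFORCE with a continuous action; mu and sigma are placeholders for whatever your network outputs, and this assumes the old gym API used by Pendulum-v0:

import gym
import numpy as np

env = gym.make("Pendulum-v0")
print(env.action_space)                              # a Box with one dimension, bounded by -2 and 2
print(env.action_space.low, env.action_space.high)   # the exact bounds

state = env.reset()
for _ in range(200):
    # sample a torque from a Gaussian policy and clip it into the valid range
    mu, sigma = 0.0, 1.0                              # placeholders for the network outputs
    action = np.clip(np.random.normal(mu, sigma), -2.0, 2.0)
    state, reward, done, _ = env.step([action])       # the action must be a length-1 list/array
    if done:
        break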
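
For question 3, this is how I read the cost in the Pendulum-v0 source; note that angle_normalize wraps only the angle into [-pi, pi], it is not applied to the whole expression:

import numpy as np

def angle_normalize(x):
    # wrap an angle into [-pi, pi)
    return ((x + np.pi) % (2 * np.pi)) - np.pi

def pendulum_reward(th, thdot, u):
    # squared (wrapped) angle + small velocity penalty + tiny torque penalty
    costs = angle_normalize(th) ** 2 + 0.1 * thdot ** 2 + 0.001 * (u ** 2)
    return -costs

Because the cost is never negative, the reward is always at most 0, and it reaches 0 only when the pendulum is exactly upright, not moving, and no torque is applied. This dense, shaped reward gives the agent a learning signal in every state, which is much easier for REINFORCE to learn from than a sparse "high reward when upright, zero otherwise" reward.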
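
For question 4, one way to recover that update, assuming the environment models the pendulum as a uniform rod of mass $m$ and length $l$ pivoted at one end, with $\theta = 0$ pointing straight up: the moment of inertia about the pivot is $I = \frac{1}{3} m l^2$ and gravity acts at the centre of mass at distance $l/2$, so

$$I \ddot{\theta} = m g \frac{l}{2} \sin\theta + u \quad\Rightarrow\quad \ddot{\theta} = \frac{3g}{2l}\sin\theta + \frac{3}{m l^2} u = -\frac{3g}{2l}\sin(\theta + \pi) + \frac{3}{m l^2} u,$$

since $\sin(\theta + \pi) = -\sin\theta$. One explicit Euler step, $\dot{\theta}_{new} = \dot{\theta} + \ddot{\theta}\, dt$, then gives exactly the newthdot line in the environment code.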
