inverted pendulum REINFORCE

I am learning reinforcement learning, and as practice I am trying to stabilize an inverted pendulum (gym: Pendulum-v0) in the upright position using a policy gradient method: REINFORCE.

I have some questions; please help me. I have tried a lot but couldn't figure these out, and an answer to any of the questions would help. Thanks in advance.

1- Why are the observations in the pendulum code cos(theta), sin(theta), and theta_dot, and not just theta and theta_dot?

2- Is the action that I send to the environment (env.step(action)) correct like this (in Python)?

# output of the neural network: the probability of taking action 1
prob = output.eval(arguments={observations: state})[0][0][0]

# Bernoulli sample: 1 or 0 depending on a random threshold
action = 1 if np.random.uniform() < prob else 0

3- The reward function is defined in the pendulum code as follows, but I couldn't understand why. Shouldn't it be something like: if the pendulum is upright (within some tolerance) the reward is high, and otherwise zero?

costs = angle_normalize(th)**2 + .1*thdot**2 + .001*(action**2)
# angle_normalize wraps the angle into [-pi, pi]
reward = -costs

4- The pendulum equation of motion is different from the standard one (like here); the environment code uses the following equation. Where does it come from?

 newthdot = thdot + (-3*g/(2*l) * np.sin(th + np.pi) + 3./(m*l**2)*u) * dt



I can answer some of your questions.

First of all, I wrote REINFORCE code for this problem (the hyperparameters are not tuned properly). This link is for CartPole solved with REINFORCE.

  1. $\cos(\theta)$ gives the height of the pendulum and $\sin(\theta)$ gives how far to the left or right it is. If you used $\theta$ itself, the same information would be encoded in a much less convenient way: $\theta$ jumps discontinuously at $\pm\pi$ even though the physical position is the same, while $(\cos\theta, \sin\theta)$ is a continuous, unambiguous representation of the angle. That is probably the reason for this choice of inputs.

  2. You can refer to the link to make sure, but I think your action is wrong: if you create the environment with env = gym.make("Pendulum-v0"), the action is a single continuous torque between -2 and 2, not a Bernoulli 0/1 choice. You can check the range with print(env.action_space): it reports one action dimension whose domain is between -2 and 2 (see the first code sketch after this list).

  3. As I understand it from this link, the reward makes the agent minimize the following cost (see the second sketch after this list):

    $\theta^2 + 0.1\left(\frac{d\theta}{dt}\right)^2 + 0.001\,\text{action}^2$

It has three terms:

  1. $\theta^2$: minimizing this is the main goal, because driving $\theta$ to 0 means the pendulum points straight up.

  2. $\left(\frac{d\theta}{dt}\right)^2$: minimizing this term means staying steady and not swinging around.

  3. $\text{action}^2$: I think this term penalizes large torques, i.e. it keeps the control effort (energy) small.

  4. I do not know that exactly, but it seems that they simulate the dynamics with a simple state-space model integrated with Euler steps; one possible derivation is sketched at the end, after the code examples.
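
For question 2, here is a minimal sketch of how to inspect the action space and send a valid continuous action. The Gaussian policy is just the usual choice for REINFORCE with a continuous action; mu and sigma are placeholders for whatever your network outputs, and this assumes the old gym API used by Pendulum-v0:

import gym
import numpy as np

env = gym.make("Pendulum-v0")
print(env.action_space)                              # a Box with one dimension, bounded by -2 and 2
print(env.action_space.low, env.action_space.high)   # the exact bounds

state = env.reset()
for _ in range(200):
    # sample a torque from a Gaussian policy and clip it into the valid range
    mu, sigma = 0.0, 1.0                              # placeholders for the network outputs
    action = np.clip(np.random.normal(mu, sigma), -2.0, 2.0)
    state, reward, done, _ = env.step([action])       # the action must be a length-1 list/array
    if done:
        break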
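
For question 3, this is how I read the cost in the Pendulum-v0 source; note that angle_normalize wraps only the angle into [-pi, pi], it is not applied to the whole expression:

import numpy as np

def angle_normalize(x):
    # wrap an angle into [-pi, pi)
    return ((x + np.pi) % (2 * np.pi)) - np.pi

def pendulum_reward(th, thdot, u):
    # squared (wrapped) angle + small velocity penalty + tiny torque penalty
    costs = angle_normalize(th) ** 2 + 0.1 * thdot ** 2 + 0.001 * (u ** 2)
    return -costs

Because the cost is never negative, the reward is always at most 0, and it reaches 0 only when the pendulum is exactly upright, not moving, and no torque is applied. This dense, shaped reward gives the agent a learning signal in every state, which is much easier for REINFORCE to learn from than a sparse "high reward when upright, zero otherwise" reward.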
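
For question 4, one way to recover that update, assuming the environment models the pendulum as a uniform rod of mass $m$ and length $l$ pivoted at one end, with $\theta = 0$ pointing straight up: the moment of inertia about the pivot is $I = \frac{1}{3} m l^2$ and gravity acts at the centre of mass at distance $l/2$, so

$$I \ddot{\theta} = m g \frac{l}{2} \sin\theta + u \quad\Rightarrow\quad \ddot{\theta} = \frac{3g}{2l}\sin\theta + \frac{3}{m l^2} u = -\frac{3g}{2l}\sin(\theta + \pi) + \frac{3}{m l^2} u,$$

since $\sin(\theta + \pi) = -\sin\theta$. One explicit Euler step, $\dot{\theta}_{new} = \dot{\theta} + \ddot{\theta}\, dt$, then gives exactly the newthdot line in the environment code.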
