Building a simulator for continuous-state, discrete-action reinforcement learning

I am trying to build a simulator that optimizes the performance and temperature of a device. I want the device to perform well, but without making it too hot. If the device becomes too hot, I want the internal circuitry to push the performance down to reduce the temperature. It is hard to run repeated ground-truth experiments on the device, so I need to build a simulator in which to train the agent. I am new to RL, but I believe it should work for this problem, so I am starting to learn it.

I see the action space as a discrete list of actions on the internal circuitry. I believe the state space is a tuple (performance, temperature). However, I get confused about assigning rewards. I started by giving a discrete reward value to each action and multiplying it by the distance from the critical temperature to compute the reward the agent uses on the next iteration. But I get confused about how the simulator should compute the next state.
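
To make the reward idea concrete, here is a minimal sketch of what I currently have in mind. All the names, the critical temperature, and the per-action reward values are placeholders I made up, not a working design:

```python
CRITICAL_TEMP = 100.0  # assumed temperature (degrees C) the device must stay below

# Made-up "base reward" for each discrete circuitry action.
ACTION_BASE_REWARD = {
    0: 1.0,  # e.g. raise clock/voltage -> higher performance
    1: 0.5,  # hold the current setting
    2: 0.2,  # lower clock/voltage -> cooler but slower
}

def reward(action: int, temperature: float) -> float:
    """Reward = per-action value scaled by the distance from the critical temperature."""
    headroom = max(CRITICAL_TEMP - temperature, 0.0)
    return ACTION_BASE_REWARD[action] * headroom
```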

Is the new state computed from the reward earned by the current action? Is it computed from some state-transition matrix? Is it computed from the action taken (if so, what does the state-action relationship look like in the simulator)? Is it even right to assign a reward to a given action setting, or should it be assigned to a state setting? My underlying assumption is that a given action produces a specific performance level, which is part of the state, but there seem to be holes in that reasoning.
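
For context, this is roughly the simulator skeleton I am picturing, written in a Gym-like step/reset style. The transition model (marked below) is exactly the part I do not know how to fill in; the action-to-performance mapping and the temperature dynamics shown are just stand-in assumptions:

```python
import random

class DeviceSim:
    """Toy device simulator sketch; the dynamics below are assumed, not measured."""

    CRITICAL_TEMP = 100.0
    AMBIENT_TEMP = 30.0

    # Assumed mapping: each discrete action sets a performance level in [0, 1].
    ACTION_TO_PERF = {0: 1.0, 1: 0.6, 2: 0.3}

    def reset(self):
        self.perf = 0.6
        self.temp = self.AMBIENT_TEMP
        return (self.perf, self.temp)

    def step(self, action: int):
        # --- transition model: this is the part I am unsure about ---
        # Stand-in assumption: performance follows the action directly, and
        # temperature relaxes toward a level proportional to performance,
        # with a little noise.
        self.perf = self.ACTION_TO_PERF[action]
        target_temp = self.AMBIENT_TEMP + 60.0 * self.perf
        self.temp += 0.1 * (target_temp - self.temp) + random.gauss(0.0, 0.5)

        # Reward as described above: per-action value scaled by thermal headroom.
        headroom = max(self.CRITICAL_TEMP - self.temp, 0.0)
        reward = self.ACTION_TO_PERF[action] * headroom

        done = self.temp >= self.CRITICAL_TEMP
        return (self.perf, self.temp), reward, done
```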

Any ideas on how to build the simulator? Any ideas for the above questions?

Tags: simulation, reinforcement-learning

