How to optimize for a time-correlated hidden function - the magical candy machine

Let's assume we have a magical candy machine that takes candies as input and returns candies as output.

  1. For any given time t, it picks a random function which is strictly increasing up to a point (f(2) = 6 in the example below) and then strictly decreasing. It likes to be challenging, but if you become greedy it punishes you, like most things in life.

    f(1) = 5

    f(2) = 6

    f(3) = 4

    f(4) = 2

    f(100) = 0

  2. The tricky part is that this function changes all the time, but it is still highly correlated with time. So f() will be similar between t(1) and t(2), but very different between t(1) and t(100).

I want to write a program to maximize my candies using this magical candy machine. I know the fundamentals of ML but I'm not sure which approach would fit best here. Any ideas?

Note: You can only play once every minute.



It sounds like a use case for stochastic optimization algorithms: since your reward function is not directly observable but is highly correlated in time, I would expect stochastic optimization algorithms to converge quickly to the maximum of the reward function and follow it closely as it slowly changes.

Because of this, I expect that you would obtain pretty good results with simple stochastic optimization algorithms without resorting to complex models such as Reinforcement Learning. Also, it's always good to start simple.

Note: whether stochastic optimization algorithms can converge to the maximum of the reward function depends on how fast the reward function changes.


Edit, after request for example:

I would start with Random Search (a minimal sketch follows the list below). Basically, the idea is to

  1. pick a position at random
  2. pick one new position at random
  3. keep whichever of the two positions has the higher reward
  4. repeat steps 2-3
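Here is a minimal Python sketch of that loop. The play() function, the position bounds and the number of rounds are hypothetical stand-ins for however you actually interact with the machine (one play per minute):

import random
import time

def random_search(play, low, high, n_rounds=60):
    # 'play(position)' is assumed to pull the machine once and return the candy obtained.
    best_pos = random.uniform(low, high)        # step 1: pick a position at random
    best_reward = play(best_pos)
    for _ in range(n_rounds):
        time.sleep(60)                          # the machine only allows one play per minute
        candidate = random.uniform(low, high)   # step 2: pick a new position at random
        reward = play(candidate)
        if reward >= best_reward:               # step 3: keep the position with the higher reward
            best_pos, best_reward = candidate, reward
    return best_pos, best_reward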

The trick is in generating good candidate points (step 2) in order to speed up convergence to the maximum. I have a few basic ideas that can help:

  1. pick a new candidate by drawing from a random distribution (say, a Gaussian) with mean equal to the current position and standard deviation roughly inversely proportional to the current reward $F_t$ (something like $\alpha e^{-\beta F_t}$). That way, if we are close to the maximum of the reward function we pick a new candidate close to the current position, and if the reward is low we explore farther away from the current position. These two strategies are usually called "exploitation" and "exploration"

  2. the idea above has the disadvantage of being "memory-less": it does not take the gradient into account. In fact, drawing from a Gaussian distribution can push you either closer to or farther from the maximum with equal probability, since the Gaussian is symmetric around its mean. A possible workaround is to draw the new candidate position from a skew normal distribution, with the skewness parameter dependent on the sign of the gradient of the reward function at the previous step (see the sketch after this list).
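A rough sketch of both ideas combined, assuming a one-dimensional position and using SciPy's skew-normal distribution; the constants alpha and beta, and the previous position/reward arguments, are hypothetical knobs you would have to tune:

import numpy as np
from scipy.stats import skewnorm

def propose(current_pos, prev_pos, current_reward, prev_reward, alpha=5.0, beta=0.5):
    # Idea 1: spread shrinks when the reward is high (exploitation), widens when it is low (exploration).
    scale = alpha * np.exp(-beta * current_reward)
    # Idea 2: skew the proposal in the direction suggested by the last observed change in reward.
    grad_sign = np.sign((current_reward - prev_reward) * (current_pos - prev_pos))
    return skewnorm.rvs(a=4.0 * grad_sign, loc=current_pos, scale=scale)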

I think that in your case you don't need fancier algorithms since - if I understood correctly - your reward function is not multimodal, so we do not risk getting stuck in a local maximum.


Thought #1

From the short description you give, and assuming you have some data (or can synthesise it), I would have a go at training a Hidden Markov Model. Have a look here for an awesome visual primer. Intuitively, HMMs do what you describe.

There are some observables:

  1. how much candy you put in the machine,
  2. how much comes out, and
  3. which timestep we are in

There are certain states (here just discrete states):

  1. amount of dispensed candy goes up (strictly increasing)
  2. amount of dispensed candy goes down (strictly decreasing)

Lastly, there are transitions between these states, given the inputs. These are the observables, as far as you are concerned.

With HMMs, we are however assuming that there is more than meets the eye; there are some hidden states (latent states), which are unobservable to us. The model keeps track of these and uses them in conjunction with the observables listed above to decide how to act - it has the full mapping function. Below is a schematic diagram of such a model:

[Schematic diagram of a Hidden Markov Model: several states connected by transition probabilities]

This just shows several states, along with the probabilities of transitioning to the other states, or staying in the same state. Check out the source for a detailed description of the diagram. One can already imagine sketching this drawing for your problem.

For more on the theory of HMMs, I recommend this text-based walkthrough by Jeff Bilmes, and this YouTube series by mathematicalmonk (chapter 14 of the playlist). If you go this route, you might consider trying out the Pomegranate library (in Python).
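As a rough illustration, here is what fitting a two-state Gaussian HMM could look like using hmmlearn, another Python library you could use instead of Pomegranate; the feature layout and the toy numbers are my own assumptions:

import numpy as np
from hmmlearn.hmm import GaussianHMM

# Each row is one play, ordered in time: [candy_put_in, candy_received].
observations = np.array([[5, 5], [5, 6], [5, 4], [5, 2], [5, 1], [5, 3]], dtype=float)

# Two hidden states, roughly "dispensed candy going up" vs. "going down".
model = GaussianHMM(n_components=2, covariance_type="diag", n_iter=100)
model.fit(observations)

states = model.predict(observations)   # most likely hidden state at each timestep
print(states)                          # e.g. [0 0 1 1 1 1]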

The final logic of deciding what to do - what action to take - is something you could either hard-code or let a model decide for you. For example, a hard-coded approach would be something like:

# Decide how much candy to take based on the model's predicted state.
if current_state == 'normal':
    take_lots_of_candy()
elif current_state == 'unsure':
    take_some_candy()
elif current_state == 'dangerous':
    pause_taking_candy()
else:
    catch_other_states()

For a model-based approach, you could expand further on the ideas outlined below.


Thought #2

Of course, you are in essence just mapping some inputs to some outputs. Almost any machine learning algorithm will have a good go at it and - depending on the complexity of the internal workings of your magical machine - will perform better or worse. Algorithms such as gradient boosting generally require you to put some kind of information into the model, e.g. by specifying a regression equation mapping inputs to outputs. A good implementation is the mboost package in R. There is a great tutorial to help decide what might be important.
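If you would rather stay in Python than use R, scikit-learn's GradientBoostingRegressor would be a rough equivalent; the feature layout below (time of play, candy put in, previous output) is just an assumption about what might be informative:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy data: each row is [minute_of_play, candy_put_in, previous_candy_out],
# and the target is the candy dispensed on that play.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = rng.random(200)

model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, max_depth=3)
model.fit(X, y)
predicted_candy = model.predict(X[:5])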


Thought #3

Coming back to the point of just mapping inputs to outputs: a simple Multi-Layer Perceptron (MLP) from Scikit-Learn - a.k.a. a feed-forward neural network - should theoretically offer the flexibility to approximate any function you might throw at it (under a few assumptions). You could also implement this using one of the various deep learning frameworks, should your MLP just not cut the mustard.
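A minimal sketch with scikit-learn's MLPRegressor, reusing the same assumed feature layout as above; scaling the inputs first tends to matter for neural networks:

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((200, 3))   # assumed features: [minute_of_play, candy_put_in, previous_candy_out]
y = rng.random(200)        # target: candy dispensed on that play

mlp = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000))
mlp.fit(X, y)
predicted_candy = mlp.predict(X[:5])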


Final thoughts

I would lastly like to highlight that your final model will likely only be as good as the data you feed it, and that while it may work well on your synthesised training data, there are no guarantees of generalisation without putting more thought into the specific problem and the specific distributions from which your data are sampled.

If you go the way of Markov models, you could perhaps then use their state predictions as inputs to further models; thus creating an ensemble.

If you want to get really fancy and move towards current state-of-the-art models in sequence analysis, you could venture into the world of Recurrent Neural Networks, e.g. utilising LSTM cells, which maintain an internal state representation, incorporating new information and selectively discarding old information. This might be a good way to take all your points into consideration in one large model. The catch here is that you generally require vast amounts of data to train such a model.
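For completeness, a bare-bones sketch of such a recurrent model in Keras; the window length, feature count and architecture are assumptions, and with little data this is unlikely to beat the simpler approaches above:

import numpy as np
import tensorflow as tf

# Toy sequences: 500 windows of 10 past plays, each play described by 2 features
# (candy put in, candy received); the target is the next reward.
X = np.random.rand(500, 10, 2).astype("float32")
y = np.random.rand(500, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10, 2)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)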
