Q-learning: what is an episode, and how does it relate to convergence in my scenario?

I am using Q-learning for routing, with the standard Bellman update. The code has some other technical aspects that add some novelty, but I have two related doubts about what an episode is and how convergence shows up in my case.
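For reference, this is roughly the update rule I am using (a simplified sketch, not my actual code; the dict-based Q-table, alpha and gamma values are just illustrative):

```python
def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Standard Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))"""
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

# e.g. after routing one service:
Q = {}
q_update(Q, state="s0", action="route_A", reward=1.0,
         next_state="s1", actions=["route_A", "route_B"])
```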

I am unable to understand what an episode would be in my setting. For example: a service arrives, I assign a route to it, and do some other bookkeeping. I want the service acceptance to be higher in the long run (as more services arrive, some also depart), which I expect Q-learning routing to handle better than a shortest-path type of policy. So should my episode consist of a set of services? Is that some kind of training? I thought RL did not need training examples. See the sketch below for what I mean.
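To make the doubt concrete, here is a minimal sketch of what I mean by "an episode = a batch of service arrivals". Everything here (K, ROUTES, the fake acceptance test, the placeholder network state) is made up for illustration and is not my actual code:

```python
import random

K = 50                       # services per "episode" (arbitrary choice; this is my doubt)
ROUTES = ["r1", "r2", "r3"]  # placeholder candidate routes

def choose_route(Q, state, epsilon=0.1):
    # epsilon-greedy selection over the candidate routes
    if random.random() < epsilon:
        return random.choice(ROUTES)
    return max(ROUTES, key=lambda a: Q.get((state, a), 0.0))

def run_episode(Q, alpha=0.1, gamma=0.9):
    accepted = 0
    state = "net_state_0"               # placeholder for my real network state
    for _ in range(K):
        action = choose_route(Q, state)
        ok = random.random() < 0.7      # placeholder for "was the service accepted on this route?"
        reward = 1.0 if ok else -1.0
        next_state = "net_state_0"      # placeholder: state after arrivals/departures
        best_next = max(Q.get((next_state, a), 0.0) for a in ROUTES)
        old = Q.get((state, action), 0.0)
        Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
        state = next_state
        accepted += ok
    return accepted / K                 # acceptance ratio for this batch of services

# My question: is each call to run_episode() one "episode",
# or should each single service arrival count as its own episode?
```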

Stemming from the same doubt: even if I treat one service arrival as one episode, how can I know that the Q-learning is actually working and is the cause of the higher acceptance in the long run? Specifically, is there some kind of 'average reward' measure for Bellman-equation-style learning that I could use? Currently I give a reward of +1 if a service is accepted and -1 if it is rejected, but my question is how to confirm that the Q-table is doing its job.
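For the "how do I confirm it is working" part, this is the kind of bookkeeping I had in mind (again, just an illustrative sketch; the window size and the comparison against a fixed policy are arbitrary choices on my part):

```python
def moving_average(values, window=20):
    """Smooth a noisy per-episode metric so a long-run trend is visible."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# acceptance_per_episode would be filled by whatever loop I end up calling an
# "episode", e.g. acceptance_per_episode.append(run_episode(Q)).
acceptance_per_episode = []
smoothed = moving_average(acceptance_per_episode)
# If 'smoothed' rises and then flattens out, and stays above what a fixed
# shortest-path policy achieves on the same arrival sequence, I would take
# that as evidence that the Q-table is the cause of the higher acceptance.
```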

Tags: reward, q-learning, reinforcement-learning

Category: Data Science
