Action-value estimation of a deterministic policy with the Monte Carlo method

In the Monte Carlo-based action-value estimation problem for a deterministic policy (estimating $q_{\pi}(s,a)$), the problem does not seem well defined. By definition, $q_{\pi}(s,a)$ is the expected return when an arbitrary action $a$ is taken in state $s$ and the policy $\pi$ is followed for all subsequent states. But in a real application under a given deterministic policy $\pi$, how can you choose the initial action $a$ arbitrarily at state $s$, when it is already fixed by the policy: $a=\pi(s)$?
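For concreteness, here is a minimal sketch (in Python) of what the definition of $q_{\pi}(s,a)$ asks for: the very first action is forced to $a$, and every later action comes from $\pi$. It assumes a hypothetical episodic environment with `reset_to(state)` and `step(action)` methods; both names are illustrative, not from any specific library. The question above is precisely whether this forced first step is legitimate when the data-collecting policy is the deterministic $\pi$ itself.

```python
from collections import deque

def mc_action_value(env, pi, state, action, episodes=1000, gamma=0.99):
    """Monte Carlo estimate of q_pi(state, action) for a deterministic policy pi.

    Assumes a hypothetical env with:
      env.reset_to(state) -> state
      env.step(action)    -> (next_state, reward, done)
    """
    returns = []
    for _ in range(episodes):
        s = env.reset_to(state)      # start each episode in the queried state
        a = action                   # force the first action (an "exploring start")
        rewards = deque()
        done = False
        while not done:
            s, r, done = env.step(a)
            rewards.append(r)
            a = pi(s)                # all subsequent actions follow the policy
        # accumulate the discounted return G for this episode
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
        returns.append(g)
    return sum(returns) / len(returns)   # sample mean estimates q_pi(state, action)
```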

Topic: monte-carlo reinforcement-learning

Category: Data Science
