It's better to start with understanding what the state-value and action-value functions are, and then move on to advantage. The explanation below is based on Reinforcement Learning by Sutton and Barto.
As you take more and more steps in the environment, you collect more and more rewards, and you can denote the discounted sum of future rewards by:
$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T
$$
The $R_i$ are random variables representing the future rewards, $\gamma \in [0, 1]$ is the discount factor, and $T$ is the final time step of the episode. The term $G_t$ is called the return.
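As a quick illustration (a minimal sketch, not something from the book), here is how you could compute the return from one sampled reward sequence; the function name and example numbers are made up:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + ... for one sampled episode.

    `rewards` holds the rewards observed after time t, i.e. [R_{t+1}, ..., R_T].
    """
    g = 0.0
    # Accumulate from the end of the episode so each earlier reward
    # picks up one more factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three remaining rewards with gamma = 0.9
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```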
The state-value function is the expected return given that you start from a specific state and you follow the policy $\pi$ afterwards.
$$
V_\pi(s_t) = \mathbb{E}_\pi [G_t|S_t = s_t]
$$
The action-value function is the expected return given that you start from a specific state and take a specific action, and follow the policy $\pi$ afterwards.
$$
Q_\pi(s_t, a_t) = \mathbb{E}_\pi [G_t|S_t = s_t, A_t = a_t]
$$
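To make these expectations concrete, here is a hedged Monte Carlo sketch for a tiny, made-up tabular MDP (the `step` function, the policy, and all numbers below are assumptions for illustration only): $V_\pi(s)$ is estimated by averaging returns of rollouts that start in $s$ and follow $\pi$, while $Q_\pi(s, a)$ forces the first action to be $a$ and follows $\pi$ afterwards.

```python
import random

# A made-up 2-state, 2-action MDP, purely for illustration.
# step(s, a) -> (next_state, reward, done)
def step(s, a):
    if s == 0:
        return (1, 1.0, False) if a == 0 else (0, 0.0, False)
    return (0, 0.0, True) if a == 0 else (1, 2.0, False)

def policy(s):
    # A fixed, non-optimal stochastic policy pi(a|s): uniform over both actions.
    return random.choice([0, 1])

def rollout_return(s, gamma, first_action=None, max_steps=50):
    g, discount, done, t = 0.0, 1.0, False, 0
    a = first_action if first_action is not None else policy(s)
    while not done and t < max_steps:
        s, r, done = step(s, a)
        g += discount * r
        discount *= gamma
        a = policy(s)          # after the first action we always follow pi
        t += 1
    return g

def mc_estimate(s, gamma=0.9, a=None, episodes=5000):
    return sum(rollout_return(s, gamma, a) for _ in range(episodes)) / episodes

V_s0 = mc_estimate(0)          # estimate of V_pi(s=0)
Q_s0_a0 = mc_estimate(0, a=0)  # estimate of Q_pi(s=0, a=0)
print(V_s0, Q_s0_a0)
```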
In simple words, if you have a policy (a way of acting), then the state-value function, $V_\pi$, tells you, from any state, what return you can expect to get by following that policy. The action-value function, $Q_\pi$, tells you something very similar, except that at the current timestep you may take an action your policy wouldn't choose. The reason you would want to do that is that your policy is usually not the optimal policy, so you want to understand how your expected return changes if you deviate from it slightly. However, even in the action-value function, you follow your policy after that first action.
The advantage is the difference between the action-value function and the state-value function:
$$
A_\pi(s_t, a_t) = Q_\pi(s_t, a_t) - V_\pi(s_t)
$$
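Numerically, if you have tabular estimates of $Q_\pi$ and $V_\pi$, the advantage is just their element-wise difference; the probabilities and values below are made-up placeholders. Since $V_\pi(s) = \sum_a \pi(a|s) Q_\pi(s, a)$, the advantages in a state average to zero under the policy:

```python
import numpy as np

pi = np.array([0.5, 0.5])      # hypothetical pi(a|s) for a single state s
q = np.array([1.2, 0.8])       # hypothetical Q_pi(s, a) for the two actions
v = np.dot(pi, q)              # V_pi(s) = sum_a pi(a|s) * Q_pi(s, a)

advantage = q - v              # A_pi(s, a) = Q_pi(s, a) - V_pi(s)
print(advantage)               # [ 0.2 -0.2]
print(np.dot(pi, advantage))   # 0.0: the expected advantage under pi is zero
```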
If you understand the definitions above, then it's easy to see what the advantage function represents. Let's keep the state fixed for now, so $s_t$ is constant and $V_\pi(s_t)$ is the return you can expect by following the policy, while the first term, $Q_\pi(s_t, a_t)$, allows a one-step deviation from the policy. Let's consider three scenarios:
- If the action is one your policy would have taken anyway (i.e. you don't deviate), then your action-value and state-value functions are the same, and your advantage is zero.
- If you take an action that is not part of your policy and you end up with smaller returns, then the action-value function will reflect that and be smaller than the state-value function, so your advantage is negative.
- If you take an action that is not part of your policy and you end up with larger returns, then the action-value function will be larger than the state-value function, so your advantage is positive.
The reason the advantage function is useful is that it separates your actions: those with positive advantages can be used to improve your policy.
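As a sketch of how that is used (this is the standard greedy policy-improvement idea, not something specific to the text above, and the tables below are made-up numbers): whenever some action has a positive advantage in a state, shifting probability onto that action gives a policy that is at least as good, e.g. by acting greedily with respect to the advantage (equivalently, with respect to $Q_\pi$).

```python
import numpy as np

# Hypothetical tabular estimates: 3 states, 2 actions.
Q = np.array([[1.0, 1.5],
              [0.2, 0.1],
              [0.0, 0.4]])
pi = np.full((3, 2), 0.5)        # current (uniform) policy pi(a|s)

V = (pi * Q).sum(axis=1)         # V_pi(s) = sum_a pi(a|s) Q_pi(s, a)
A = Q - V[:, None]               # A_pi(s, a) = Q_pi(s, a) - V_pi(s)

# Greedy improvement: in each state, pick an action with the largest advantage
# (the same as taking the argmax over Q).
improved = np.argmax(A, axis=1)
print(A)
print(improved)                  # [1 0 1]
```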