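Q-learning is guaranteed to converge to the optimal action-value function $Q^*$, from which an optimal policy $\pi^*$ can be read off greedily, under certain conditions (in the tabular case: every state-action pair is visited infinitely often and the step sizes decay appropriately).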
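
As a rough illustration of the convergence claim, here is a minimal sketch of tabular Q-learning on a made-up three-state chain. The environment, the discount factor, the step-size schedule, and every name in the code (`step`, `q_learning`, `value_iteration`, `greedy_policy`, `max_abs_error`) are assumptions chosen for this example and do not come from the original text. The learned table is compared against $Q^*$ computed by value iteration.

```python
import random

# Hypothetical toy MDP (an assumption for illustration only):
# a 3-state chain with actions 0 = left, 1 = right.
N_STATES = 3
ACTIONS = (0, 1)
GAMMA = 0.5


def step(s, a):
    """Deterministic dynamics: reward 1 only for moving right in the last state."""
    reward = 1.0 if (s == N_STATES - 1 and a == 1) else 0.0
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s_next, reward


def q_learning(n_steps=50_000, seed=0):
    """Tabular Q-learning driven by a uniformly random behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    visits = [[0 for _ in ACTIONS] for _ in range(N_STATES)]
    s = 0
    for _ in range(n_steps):
        a = rng.choice(ACTIONS)  # random actions keep every (s, a) pair visited
        s_next, r = step(s, a)
        visits[s][a] += 1
        # Decaying step size n^(-0.6): the sum diverges, the sum of squares converges.
        alpha = 1.0 / visits[s][a] ** 0.6
        td_target = r + GAMMA * max(q[s_next])
        q[s][a] += alpha * (td_target - q[s][a])
        s = s_next
    return q


def value_iteration(n_sweeps=100):
    """Reference Q* obtained by repeatedly applying the Bellman optimality backup."""
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(n_sweeps):
        new_q = []
        for s in range(N_STATES):
            row = []
            for a in ACTIONS:
                s_next, r = step(s, a)
                row.append(r + GAMMA * max(q[s_next]))
            new_q.append(row)
        q = new_q
    return q


def greedy_policy(q):
    """Greedy action in each state with respect to the table q."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES)]


def max_abs_error(q, q_ref):
    """Largest entrywise gap between two Q-tables."""
    return max(abs(q[s][a] - q_ref[s][a]) for s in range(N_STATES) for a in ACTIONS)


if __name__ == "__main__":
    q_hat, q_star = q_learning(), value_iteration()
    print("greedy policy:", greedy_policy(q_hat))
    print("max |Q - Q*|:", max_abs_error(q_hat, q_star))
```

In this sketch the greedy policy should come out as "always move right", and the gap to $Q^*$ should be small. The example is deterministic and tiny, so it only illustrates the statement; it is not evidence for the general guarantee, which depends on the exploration and step-size conditions.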