Q-Learning In Machine Learning, Vol. 8, No. 3-4. (1992), pp. 279-292, doi:10.1023/a%3a1022676722315 by ChristopherJ Watkins, Peter Dayan

@article{watkins-and-dayan-1992, abstract = {\(\mathcal{Q}\) -learning (Watkins, 1989) is a simple way for agents to learn how to act optimally in controlled Markovian domains. It amounts to an incremental method for dynamic programming which imposes limited computational demands. It works by successively improving its evaluations of the quality of particular actions at particular states. This paper presents and proves in detail a convergence theorem for \(\mathcal{Q}\) -learning based on that outlined in Watkins (1989). We show that \(\mathcal{Q}\) -learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely. We also sketch extensions to the cases of non-discounted, but absorbing, Markov environments, and where many \(\mathcal{Q}\) values can be changed each iteration, rather than just one.}, author = {Watkins, Christopher J. C. H. and Dayan, Peter}, booktitle = {Machine Learning}, citeulike-article-id = {13223300}, citeulike-linkout-0 = {http://dx.doi.org/10.1023/a\%253a1022676722315}, citeulike-linkout-1 = {http://link.springer.com/article/10.1023/A\%3A1022676722315}, doi = {10.1023/a\%253a1022676722315}, keywords = {learning, q-learning, reinforcement-learning}, number = {3-4}, pages = {279--292}, posted-at = {2014-06-12 13:56:57}, priority = {2}, publisher = {Kluwer Academic Publishers-Plenum Publishers}, title = {{Q-Learning}}, url = {http://dx.doi.org/10.1023/a\%253a1022676722315}, volume = {8}, year = {1992} }

See the CiteULike entry for more info, PDF links, BibTex etc.

Q-Learning learns the function $\mathcal{Q}$ which maps a state $s$ and an action $a$ to the reward $r$ which is the long-term discounted reward expected for taking action $a$ in state $s$.

`Long-term discounted' means that it is the expected value of $$\sum^I_{i=0} \gamma^{n_i} r_i,$$ where $r_i$ and $n_i$ are rewards and steps to states in which the rewards are received when always taking the most promising action in each step, and $\gamma\leq 1$ is the discount factor.⇒

Q-Learning assumes a world in which one state $s$ can be reached from another stochastically by taking an action $a$. In that world, taking certain actions in certain states stochastically incurs a reward $r$.⇒

Q-learning starts with a random function $\mathcal{Q}$ and repeatedly takes actions and then updates $\mathcal{Q}$ with the observed reward. Actions are taken stochastically. The preference given to actions promising a high reward (according to the current state of $\mathcal{Q}$) is equivalent to the preference of exploitation over exploration. Another parameter of Q-learning is the learning rate which determines how strongly each observed reward changes the $\mathcal{Q}$ function in the next step.⇒

Q-learning is guaranteed to converge to an optimal policy $V^*$ (under certain conditions).⇒

The function $\mathcal{Q}$ induces a strategy $V$ which always takes the action $a$ with the highest expected reward.⇒