Neural responses in parietal cortex have been suggested to reflect expected reward.

Q-learning learns a function $\mathcal{Q}$ that maps a state $s$ and an action $a$ to the long-term discounted reward expected for taking action $a$ in state $s$.

`Long-term discounted' means that it is the expected value of $$\sum^I_{i=0} \gamma^{n_i} r_i,$$ where $r_i$ is the $i$-th reward and $n_i$ is the number of steps taken to reach the state in which it is received, assuming that the most promising action is always taken in each step, and $0\leq\gamma\leq 1$ is the discount factor.
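As a small worked example of the sum above, the following sketch evaluates the discounted sum for one fixed sequence of rewards. The function name `discounted_return` and the sample numbers are illustrative assumptions; the expectation over many such sequences is not computed here.

```python
def discounted_return(rewards, steps, gamma):
    """Sum gamma**n_i * r_i over paired rewards r_i and step counts n_i."""
    return sum(gamma ** n * r for r, n in zip(rewards, steps))

# Hypothetical sample: rewards 1.0, 0.0 and 2.0 received after 0, 1 and 2 steps.
value = discounted_return([1.0, 0.0, 2.0], [0, 1, 2], gamma=0.5)
print(value)  # 1.0 * 1 + 0.0 * 0.5 + 2.0 * 0.25 = 1.5
```

With $\gamma$ close to 1, late rewards count almost as much as early ones; with $\gamma$ close to 0, only the earliest rewards matter.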

Q-learning assumes a world in which taking an action $a$ in one state $s$ leads stochastically to another state. In that world, taking certain actions in certain states stochastically yields a reward $r$.

Q-learning starts with an arbitrary (for example random) function $\mathcal{Q}$ and repeatedly takes an action and then updates $\mathcal{Q}$ with the observed reward. Actions are chosen stochastically. How strongly the choice favors actions promising a high reward (according to the current state of $\mathcal{Q}$) sets the preference of exploitation over exploration. Another parameter of Q-learning is the learning rate, which determines how strongly each observed reward changes the $\mathcal{Q}$ function in an update.
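The loop described above can be sketched in tabular form as follows. This is a minimal illustration, not a reference implementation: the four-state chain world, the `step` function, the slip probability, and the parameter names `alpha` (learning rate), `gamma` (discount factor) and `epsilon` (exploration probability) are all assumptions made for the example, and epsilon-greedy is only one possible way to choose actions stochastically.

```python
import random

N_STATES, ACTIONS = 4, (0, 1)  # hypothetical chain world; action 0 = left, 1 = right
GOAL = N_STATES - 1            # reaching the last state gives reward 1 and ends the episode

def step(state, action, rng, slip=0.1):
    """Move along the chain; with probability `slip` the move goes the other way."""
    if rng.random() < slip:
        action = 1 - action
    nxt = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

def q_learning(episodes=3000, alpha=0.05, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # Start from a random Q table, one entry per (state, action) pair.
    q = {(s, a): rng.random() for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy choice: explore with probability epsilon, else exploit.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt, reward, done = step(state, action, rng)
            # Target: observed reward plus discounted value of the best next action.
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = nxt
    return q

q = q_learning()
for s in range(GOAL):
    print(s, round(q[(s, 0)], 2), round(q[(s, 1)], 2))
```

A larger `epsilon` shifts the balance toward exploration, and a larger `alpha` lets each observed reward change the table more strongly.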

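Q-learning is guaranteed to converge to an optimal policy $V^*$ under certain conditions (for example, when every action is tried in every state infinitely often and the learning rate decays appropriately).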

The function $\mathcal{Q}$ induces a strategy $V$ which always takes the action $a$ with the highest expected reward, that is, $V(s) = \arg\max_a \mathcal{Q}(s, a)$.
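Reading the strategy off a table of values can be sketched as below. The dictionary layout, the function name `greedy_policy`, and the state and action labels are assumptions chosen for illustration.

```python
def greedy_policy(q, states, actions):
    """For each state, pick the action with the highest Q value."""
    return {s: max(actions, key=lambda a: q[(s, a)]) for s in states}

# Hypothetical Q table over two states and two actions.
q_table = {
    ("s0", "left"): 0.2, ("s0", "right"): 0.7,
    ("s1", "left"): 0.5, ("s1", "right"): 0.1,
}
policy = greedy_policy(q_table, ["s0", "s1"], ["left", "right"])
print(policy)  # {'s0': 'right', 's1': 'left'}
```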