Show Tag: reward

The reward prediction error theory of dopamine function holds that dopamine neurons encode the difference between actual and expected reward.

To calculate reward prediction error, dopamine neurons need to receive inputs coding for both experienced and expected reward.
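
As a minimal illustration (not a model of the dopaminergic circuit itself), the prediction error is just the signed difference between received and expected reward; the temporal-difference variant below, which folds the discounted value of the next state into the "actual" outcome, is an added assumption:

```python
def reward_prediction_error(actual_reward, expected_reward):
    """Basic reward prediction error: positive when the outcome is better
    than expected, negative when it is worse."""
    return actual_reward - expected_reward

def td_error(reward, value_next, value_current, gamma=0.9):
    """Temporal-difference form of the prediction error, in which the
    'actual' outcome also includes the discounted value of the next state
    (an assumption; the note above only states the basic difference)."""
    return reward + gamma * value_next - value_current
```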

Rucci et al. present an algorithm which performs auditory localization and combines auditory and visual localization in a common SC map. The mapping between the representations is learned using value-dependent learning.

Rucci et al. model learning of audio-visual map alignment in the barn owl SC. In their model, projections from the retina to the SC are fixed (and visual RFs are therefore static) and connections from ICx are adapted through value-dependent learning.
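
As a rough sketch, value-dependent learning can be read as a Hebbian weight change gated by a scalar value signal; the array shapes, learning rate, and weight clipping below are assumptions for illustration, not details of Rucci et al.'s model:

```python
import numpy as np

def value_dependent_update(W, pre, post, value, eta=0.01):
    """Hebbian weight change gated by a scalar value signal.

    W     : (n_post, n_pre) connection weights (e.g. ICx -> SC)
    pre   : (n_pre,)  presynaptic activity (auditory map)
    post  : (n_post,) postsynaptic activity (SC map)
    value : scalar reward/value signal (e.g. success of the orienting response)
    """
    W = W + eta * value * np.outer(post, pre)
    # keep weights bounded (an assumption, not taken from the original model)
    return np.clip(W, 0.0, 1.0)
```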

Chen et al. presented a system which uses a SOM to cluster states. After learning, each SOM unit is extended with a histogram over a set of known classes $$C=\{c_1,c_2,\dots,c_n\}$$, counting how often the unit was the BMU while the input belonged to each class.

The system is used in robot soccer. Each class is connected to an action. Actions are chosen by finding the BMU in the network and selecting the action connected to the BMU's most likely class.

In an unsupervised, online phase, these histograms are updated in a reinforcement-learning fashion: whenever the selected action led to success, the bin of the BMU's histogram corresponding to its most likely class is increased; otherwise it is decreased.
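
A compact sketch of this mechanism, assuming the SOM is already trained and given as a weight matrix; the function names, the fixed update step, and the clipping at zero are illustrative assumptions rather than details from Chen et al.:

```python
import numpy as np

def bmu(som_weights, x):
    """Index of the best-matching unit for input x.
    som_weights: (n_units, dim), x: (dim,)"""
    return int(np.argmin(np.linalg.norm(som_weights - x, axis=1)))

def select_action(som_weights, histograms, actions, x):
    """Pick the action linked to the BMU's most likely class.
    histograms: (n_units, n_classes), actions: class index -> action."""
    u = bmu(som_weights, x)
    c = int(np.argmax(histograms[u]))   # most likely class of this unit
    return u, c, actions[c]

def update_histogram(histograms, unit, cls, success, step=1.0):
    """Reinforcement-style update: increase the winning class bin on
    success, decrease it otherwise (never below zero)."""
    histograms[unit, cls] += step if success else -step
    histograms[unit, cls] = max(histograms[unit, cls], 0.0)
```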

Neural responses in parietal cortex have been suggested to reflect expected reward.

Sensorimotor processing may be interpreted as decision making. It therefore makes sense to look for representations of expected reward in neural activity.

Anderson suggests that it would make sense to attend to whatever promises the greatest reward when attended to.

Saliency of a stimulus might say something about its likelihood of offering reward if attended to.

The probability of reward when attending to a stimulus is influenced by two factors (combined into one expression below this list):

  • the probability of selecting the right thing;
    highly distinctive things therefore have great bottom-up saliency,
  • the probability of reward given that the right thing is selected;
    • features that are associated with high reward become salient,
    • goals can affect which features promise reward in a situation.
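
One way to write this decomposition explicitly (assuming reward can only follow a correct selection): $$P(\text{reward}\mid\text{attend }s) = P(\text{correct selection}\mid\text{attend }s)\cdot P(\text{reward}\mid\text{correct selection}).$$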

Visual feature combinations become more salient once they have been learned to be associated with reward.

Targets which are selected in one trial tend to be more salient in subsequent trials: they are selected faster and rejected more slowly.

The extent of this effect is modulated by whether or not the selection was rewarded.

It is possible that learning of saccade target selection is influenced by reward.

The question is whether this happens on the saliency side or the selection side.

Anderson argues that it is not the selection process that is influenced by reward but the evaluation of saliency (i.e., the attentional priority of a stimulus).

Q-learning learns the function $\mathcal{Q}$ which maps a state $s$ and an action $a$ to the long-term discounted reward expected for taking action $a$ in state $s$.

`Long-term discounted' means the expected value of $$\sum^I_{i=0} \gamma^{n_i} r_i,$$ where $r_i$ is a reward received $n_i$ steps in the future when the most promising action is taken at each step, and $\gamma\leq 1$ is the discount factor.
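
For example, with $\gamma = 0.9$, a reward of $r_i = 1$ received $n_i = 3$ steps in the future contributes $0.9^3 \cdot 1 = 0.729$ to the sum.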

Q-learning assumes a world in which one state $s$ can be reached from another stochastically by taking an action $a$, and in which taking certain actions in certain states stochastically incurs a reward $r$.

Q-learning starts with a random function $\mathcal{Q}$ and then repeatedly takes actions and updates $\mathcal{Q}$ with the observed reward. Actions are selected stochastically; the degree to which actions promising a high reward (according to the current $\mathcal{Q}$) are preferred corresponds to the trade-off between exploitation and exploration. Another parameter of Q-learning is the learning rate, which determines how strongly each observed reward changes the $\mathcal{Q}$ function in the next step.
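
A minimal tabular sketch of this procedure, using an $\varepsilon$-greedy rule as one concrete form of stochastic, exploitation-biased action selection; the environment interface (`reset`, `step`) and all names are assumptions for illustration. The last function extracts the induced greedy strategy mentioned below.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning.

    env is assumed to provide reset() -> state and
    step(a) -> (next_state, reward, done)."""
    Q = np.random.rand(n_states, n_actions)   # start with a random Q function
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # stochastic action selection: mostly exploit, sometimes explore
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # move Q(s, a) toward the observed reward plus the discounted
            # value of the best action in the next state
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

def greedy_strategy(Q):
    """The strategy induced by Q: in each state, the action with the
    highest expected long-term reward."""
    return np.argmax(Q, axis=1)
```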

The function $\mathcal{Q}$ induces a strategy $V$ which, in each state, takes the action $a$ with the highest expected long-term reward.