Show Tag: reinforcement-learning


Behrens et al. modeled learning of reward probabilities using the model of a Bayesian learner.

Behrens et al. found that humans take into account the volatility of reward probabilities in a reinforcement learning task.

The way they took the volatility into account was qualitatively modelled by a Bayesian learner.
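
As a rough illustration, a grid-based Bayesian learner that tracks both the reward probability and its volatility could be sketched as follows. The grid sizes, the Gaussian random-walk transition kernel, and the treatment of volatility as static are simplifying assumptions made for the sketch, not details of Behrens et al.'s model.

```python
import numpy as np

# Grids over reward probability p and volatility v (std. of p's random walk).
P = np.linspace(0.01, 0.99, 50)
V = np.linspace(0.01, 0.30, 20)

# Joint posterior over (p, v), initialised flat.
post = np.ones((len(P), len(V)))
post /= post.sum()

def update(post, outcome):
    """One trial: diffuse p according to each candidate volatility, then apply Bayes' rule."""
    new = np.zeros_like(post)
    for j, v in enumerate(V):
        # Transition kernel: p performs a Gaussian random walk with std v.
        K = np.exp(-0.5 * ((P[:, None] - P[None, :]) / v) ** 2)
        K /= K.sum(axis=0, keepdims=True)
        new[:, j] = K @ post[:, j]
    # Likelihood of a binary reward outcome under each candidate p.
    lik = P if outcome == 1 else 1.0 - P
    new *= lik[:, None]
    return new / new.sum()

# After a run of outcomes, the marginal over V reflects how changeable the
# reward probability appears; a volatile environment shifts it upward.
for outcome in [1, 1, 1, 0, 0, 1, 0, 0]:
    post = update(post, outcome)
p_hat = P @ post.sum(axis=1)   # posterior mean reward probability
v_hat = V @ post.sum(axis=0)   # posterior mean volatility
```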

Reward-mediated learning has been demonstrated in the adaptation of orienting behavior.

Possible neurological correlates of reward-mediated learning have been found.

Reward-mediated learning is said to be biologically plausible.

Rucci et al. present an algorithm which performs auditory localization and combines auditory and visual localization in a common SC map. The mapping between the representations is learned using value-dependent learning.

Soltani and Wang propose a learning algorithm in which neurons predict the rewards for actions based on individual cues. The winning neuron then receives reward stochastically, depending on the action taken.

One of the benefits of Soltani and Wang's model is that it does not require its neurons to perform complex computations: by simply counting active synapses, the neurons compute log probabilities of reward. The learning rule is what ensures that the correct number of neurons is active given the input.
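
A toy version of such a reward-dependent rule on fractions of potentiated synapses might look like the following; the two-action setting, the logistic choice rule, and the learning rate are assumptions made for the sketch, not Soltani and Wang's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_cues = 4
# Fraction of potentiated synapses for each (action, cue) pair; 0.5 = uninformative.
c = np.full((2, n_cues), 0.5)
alpha = 0.1  # assumed learning rate

def choose(cues):
    # A neuron's drive is simply the summed potentiated-synapse fraction over
    # active cues, which after learning tracks the reward odds for its action.
    drive = c[:, cues == 1].sum(axis=1)
    p_action0 = 1.0 / (1.0 + np.exp(-(drive[0] - drive[1])))
    return 0 if rng.random() < p_action0 else 1

def learn(cues, action, reward):
    # Potentiate synapses from active cues onto the chosen action's neuron when
    # rewarded, depress them otherwise.
    active = cues == 1
    if reward:
        c[action, active] += alpha * (1.0 - c[action, active])
    else:
        c[action, active] -= alpha * c[action, active]
```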

Rucci et al. explain audio-visual map registration and the learning of orienting responses to audio-visual stimuli by what they call value-dependent learning: after each motor response, a modulatory system evaluates whether that response was good (bringing the target into the center of the system's visual field) or bad. The learning rule strengthens connections between neurons from the network's different subpopulations if they are highly correlated while the modulatory response is strong, and weakens them otherwise.
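
A minimal sketch of such a value-gated Hebbian update could look as follows; the exact form of the rule and the bipolar value signal are assumptions made for illustration, not Rucci et al.'s implementation.

```python
import numpy as np

def value_dependent_update(w, pre, post, value, eta=0.01):
    """Value-gated Hebbian step: co-active pre/post pairs are strengthened when
    the modulatory value signal is positive and weakened when it is negative."""
    return w + eta * value * np.outer(post, pre)

# Usage: after a motor response, value = +1 if the target was brought into the
# center of the visual field, -1 otherwise.
w = np.zeros((5, 8))
pre = np.random.rand(8)    # activity of the source subpopulation
post = np.random.rand(5)   # activity of the target subpopulation
w = value_dependent_update(w, pre, post, value=+1.0)
```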

Chen et al. presented a system which uses a SOM to cluster states. After learning, each SOM unit is extended with a histogram that counts how often the unit was the BMU while the input belonged to each of a number of known states $$C=\{c_1,c_2,\dots,c_n\}$$.

The system is used in robot soccer. Each class is connected to an action. Actions are chosen by finding the BMU in the net and selecting the action connected to its most likely class.

In an unsupervised, online phase, these histograms are updated in a reinforcement-learning fashion: whenever the selected action led to success, the bin of the BMU's histogram corresponding to its most likely class is increased; otherwise it is decreased.
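
Assuming an already trained SOM codebook, the histogram extension and its reinforcement-style update could be sketched like this (class and method names are made up for illustration):

```python
import numpy as np

class HistogramSOM:
    """SOM units extended with per-class histograms, as described above."""

    def __init__(self, codebook, n_classes):
        self.w = codebook                       # (n_units, dim) trained SOM weights
        self.hist = np.zeros((len(codebook), n_classes))

    def bmu(self, x):
        return int(((self.w - x) ** 2).sum(axis=1).argmin())

    def classify(self, x):
        # Most likely class of the best-matching unit; each class maps to an action.
        return int(self.hist[self.bmu(x)].argmax())

    def reinforce(self, x, success, delta=1.0):
        # Online phase: bump the winning unit's most likely bin up on success,
        # down otherwise.
        u = self.bmu(x)
        k = self.hist[u].argmax()
        self.hist[u, k] += delta if success else -delta
```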

Classical models assume that learning in cortical regions is well described in an unsupervised learning framework while learning in the basal ganglia can be modeled by reinforcement learning.

Unsupervised learning models have been extended with aspects of reinforcement learning.

The algorithm presented by Weber and Triesch borrows from SARSA.

SOMs can be used for preprocessing in reinforcement learning, simplifying high-dimensional input through their winner-take-all characteristics.

However, since standard SOMs do not get any goal-dependent input, they focus on globally strongest features (statistically most predictive latent variables) and under-emphasize features which would be relevant for the task.

The model due to Weber and Triesch combines SOM- or K-Means-like learning of features with prediction-error feedback as in reinforcement learning. The model is thus able to learn relevant features and disregard irrelevant ones.
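
One way to couple winner-take-all feature learning to a reward prediction error, loosely in the spirit of this model, is sketched below; gating the feature update by the absolute SARSA error and all constants are assumptions, not the authors' exact rule.

```python
import numpy as np

n_features, dim, n_actions = 16, 32, 4
W = np.random.randn(n_features, dim) * 0.1   # SOM-/K-Means-like feature weights
Q = np.zeros((n_features, n_actions))        # action values per winning feature
eta_w, eta_q, gamma = 0.05, 0.1, 0.9

def sarsa_feature_step(x, a, r, x_next, a_next):
    f = int((W @ x).argmax())                 # winner-take-all feature for the state
    f_next = int((W @ x_next).argmax())
    delta = r + gamma * Q[f_next, a_next] - Q[f, a]   # SARSA prediction error
    Q[f, a] += eta_q * delta
    # Feature learning is gated by the prediction error, so inputs that matter
    # for reward are emphasised over merely statistically dominant ones.
    W[f] += eta_w * abs(delta) * (x - W[f])
```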

Saeb et al. extend their model by a short-term memory which encodes the last action. This action memory is used to make up for noise and missing information.
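
A generic way to add such an action memory is to append a one-hot encoding of the previous action to the sensory input; the representation chosen here is an assumption, not Saeb et al.'s.

```python
import numpy as np

def with_action_memory(x, last_action, n_actions):
    """Concatenate the input with a one-hot code of the last action taken."""
    memory = np.zeros(n_actions)
    memory[last_action] = 1.0
    return np.concatenate([x, memory])
```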

Weisswange et al. model learning of multisensory integration using reward-mediated / reward-dependent learning in an ANN, a form of reinforcement learning.

They model a situation similar to the experiments due to Neil et al. and Körding et al., in which a learner is presented with visual, auditory, or audio-visual stimuli.

In each trial, the learner is given reward depending on the accuracy of its response.

In an experiment where stimuli could be caused by the same or by different sources, Weisswange et al. found that their model behaves similarly to both model averaging and model selection, although slightly more like the former.
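
The model averaging / model selection distinction can be made concrete with a small cue-combination sketch; the Gaussian cue model and the externally supplied probability of a common cause are generic assumptions, not Weisswange et al.'s implementation.

```python
def audio_visual_estimates(x_v, x_a, sigma_v, sigma_a, p_common):
    """Location estimates under model averaging vs. model selection."""
    # One common source: precision-weighted fusion of the two cues.
    w_v = sigma_a**2 / (sigma_v**2 + sigma_a**2)
    fused = w_v * x_v + (1.0 - w_v) * x_a
    # Independent sources: for a visual localization response, trust vision alone.
    segregated = x_v
    # Model averaging: mix the two estimates by the posterior of each causal model.
    averaged = p_common * fused + (1.0 - p_common) * segregated
    # Model selection: commit fully to the more probable causal model.
    selected = fused if p_common > 0.5 else segregated
    return averaged, selected
```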

Q-Learning learns the function $\mathcal{Q}$ which maps a state $s$ and an action $a$ to the long-term discounted reward $r$ expected for taking action $a$ in state $s$.

"Long-term discounted" means that it is the expected value of $$\sum_{i=0}^{I} \gamma^{n_i} r_i,$$ where the $r_i$ are the rewards received along the way when the most promising action is always taken in each step, the $n_i$ are the numbers of steps after which those rewards are received, and $\gamma\leq 1$ is the discount factor.

Q-Learning assumes a world in which one state $s$ can be reached from another stochastically by taking an action $a$. In that world, taking certain actions in certain states stochastically incurs a reward $r$.

Q-learning starts with a random function $\mathcal{Q}$ and repeatedly takes actions, updating $\mathcal{Q}$ after each observed reward. Actions are chosen stochastically; how strongly actions promising a high reward (according to the current state of $\mathcal{Q}$) are preferred corresponds to the preference of exploitation over exploration. Another parameter of Q-learning is the learning rate, which determines how strongly each observed reward changes the $\mathcal{Q}$ function in the next step.

Q-learning is guaranteed to converge to an optimal policy $V^*$ (under certain conditions).

The function $\mathcal{Q}$ induces a strategy $V$ which always takes the action $a$ with the highest expected reward.
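
A standard tabular Q-learning sketch with ε-greedy action selection; the environment interface (`reset`, `step`) is a placeholder assumption.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    rng = np.random.default_rng(0)
    Q = rng.random((n_states, n_actions)) * 0.01   # start from a near-random Q
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy: prefer the action promising the highest reward
            # (exploitation) but explore with probability epsilon.
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # Move Q(s, a) toward the observed reward plus the discounted value
            # of the best next action; alpha is the learning rate.
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q
```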

The motmap algorithm uses reinforcement learning to organize behavior in a two-dimensional map.

Rucci et al. model multi-sensory integration in the barn owl OT using leaky integrator firing-rate neurons and reinforcement learning.
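
A leaky integrator firing-rate unit of the kind mentioned here can be written in a few lines; the time constant, nonlinearity, and Euler integration are generic choices, not Rucci et al.'s parameters.

```python
import numpy as np

def leaky_integrator_step(u, external_input, dt=1.0, tau=20.0):
    """One Euler step of tau * du/dt = -u + input; the firing rate is a
    non-negative squashed version of the membrane variable u."""
    u = u + (dt / tau) * (-u + external_input)
    rate = np.tanh(np.clip(u, 0.0, None))
    return u, rate
```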

Rucci et al. suggest that high saliency in the center of the visual field can act as a reward signal for pre-saccadic neural activation.