# Show Tag: multisensory-integration


Cuppini et al. use mutually inhibitive, modality-specific inhibition (inhibitory interneurons that receive input from one modality and inhibit inhibitory interneurons receiving input from other modalities) to implement a winner-take-all mechanism between modalities; this leads to a visual (or auditory) capture effect without functional multi-sensory integration.

Their network model builds upon their earlier single-neuron model.

Not sure about the biological motivation of this. Also: it would be interesting to know if functional integration still occurs.

Optimal multi-sensory integration is learned (for many tasks).

In many audio-visual localization tasks, humans integrate information optimally.

In some audio-visual time discrimination tasks, humans do not integrate information optimally.

Although no (actually: hardly any) projections from multisensory or non-visual areas to V1 have been found, auditory input seems to influence neural activity in V1.

There are significant projections from auditory cortex as well as from polysensory areas in the temporal lobe to parts of V1 where receptive fields are peripheral.

There doesn't seem to be any region in the brain that is truly and only uni-sensory.

If sensory maps of uni-modal space are brought into register, then cues from different modalities can access shared maps of motor space.

The theoretical accounts of multi-sensory integration due to Beck et al. and Ma et al. do not learn and leave little room for learning.

Thus, they fail to explain an important aspect of multi-sensory integration in humans.

Weisswange et al.'s model does not reproduce population coding.

Bauer and Wermter show how probabilistic population codes and near-optimal integration can develop.

Person tracking can combine cues from single modalities (like motion and color cues), or from different modalities (like auditory and visual cues).

Kalman filters and particle filters have been used in uni- and multi-sensory person tracking.

One theory of the function of consciousness is that it is needed to integrate information from different modalities and processing centers in the brain and coordinate their activity.

There are quite a number of different definitions of multi-sensory integration.

According to Palmer and Ramsey,

"Multisensory integration refers to the process by which information from different sensory modalities (e.g., vision, audition, touch) is combined to yield a rich, coherent representation of an object or event in the environment."

Stimuli in one modality can guide attention in another.

Humans can learn to use stimuli in one modality to guide attention in another.

Palmer and Ramsey show that lack of awareness of a visual lip stream does not inhibit learning of its relevance for a visual localization task: the subliminal lip stream influences visual attention and affects the subjects' performance.

They also showed that similar subliminal lip streams did not affect the occurrence of the McGurk effect.

Together, this suggests that awareness of a visual stimulus is not always needed for it to guide visual attention, but that it is sometimes needed for multisensory integration to occur (following Palmer and Ramsey's definition).

Cats, if raised in an environment in which the spatio-temporal relationship of audio-visual stimuli is artificially different from natural conditions, develop spatio-temporal integration of audio-visual stimuli accordingly. Their SC neurons develop preference to audio-visual stimuli with the kind of spatio-temporal relationship encountered in the environment in which they were raised.

Reactions to cross-sensory stimuli can be faster than the fastest reaction to any one of the constituent uni-sensory stimuli, i.e. faster than the race model would predict.

Frassinetti et al. showed that humans detect near-threshold visual stimuli with greater reliability if these stimuli are connected with spatially congruent auditory stimuli (and vice versa).

Two stimuli in different modalities are perceived as one multi-sensory stimulus if the positions in space and points in time at which they are presented are not too far apart.

Laurenti et al. found in an audio-visual color identification task that redundant, congruent, semantic auditory information (the utterance of a color word) can decrease the latency of the response to a stimulus (the color of a circle displayed to the subject). Incongruent semantic visual or auditory information (a written or uttered color word) can increase response latency. However, congruent semantic visual information (a written color word) does not decrease response latency.

The enhancements in response latencies in Laurenti et al.'s audio-visual color discrimination experiments were greater (response latencies were shorter) than predicted by the race model.

Integrating information from multiple stimuli can have advantages:

• shorter reaction times,
• lower thresholds of stimulus detection,
• improved detection and identification,
• greater precision of orienting behavior.
Improved performance on the behavioral side due to cross-sensory integration is connected to effects on the neurophysiological side.

Rucci et al. present a robotic system based on their neural model of audiovisual localization.

Rucci et al. present an algorithm which performs auditory localization and combines auditory and visual localization in a common SC map. The mapping between the representations is learned using value-dependent learning.

SOMs and SOM-like algorithms have been used to model natural multi-sensory integration in the SC.

Anastasio and Patton model the deep SC using SOM learning.

Anastasio and Patton present a model of multi-sensory integration in the superior colliculus which takes into account modulation by uni-sensory projections from cortical areas.

In the model due to Anastasio and Patton, deep SC neurons combine cortical input multiplicatively with primary input.

Anastasio and Patton's model is trained in two steps:

First, connections from primary input to deep SC neurons are adapted in a SOM-like fashion.

Then, connections from uni-sensory, parietal inputs are trained, following an anti-Hebbian regime.

The latter phase enforces the principles of modality-matching and cross-modality.

Modulatory input from uni-sensory, parietal regions to SC follows the principles of modality-matching and cross-modality:

A deep SC neuron (generally) only receives modulatory input related to some modality if it also receives primary input from that modality.

Modulatory input related to some modality only affects responses to primary input from the other modalities.

SOM learning produces clusters of neurons with similar modality responsiveness in the SC model due to Anastasio and Patton.
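A generic SOM update illustrates how such modality clusters can emerge. This is only a sketch, not Anastasio and Patton's actual architecture: grid size, learning rate, neighborhood radius, and the two-dimensional (auditory drive, visual drive) input encoding are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10x10 grid of units, each with one weight per modality (auditory, visual).
weights = rng.random((10, 10, 2))
grid = np.stack(np.meshgrid(np.arange(10), np.arange(10), indexing="ij"), axis=-1)

def som_step(x, eta=0.1, radius=2.0):
    # Best-matching unit (BMU): the unit whose weight vector is closest to x.
    dist = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dist), dist.shape)
    # Gaussian neighborhood on the grid around the BMU.
    g = np.exp(-np.linalg.norm(grid - np.array(bmu), axis=-1) ** 2 / (2 * radius ** 2))
    # Pull the BMU and its grid neighbors towards the input.
    weights[...] += eta * g[..., None] * (x - weights)

# Mostly unimodal inputs: regions of auditory-dominated and
# visually-dominated units emerge on the grid.
for i in range(500):
    x = np.array([1.0, 0.0]) if i % 2 == 0 else np.array([0.0, 1.0])
    som_step(x)
```

Note that, as in Anastasio and Patton's model, the grid position here ends up encoding modality sensitivity, not spatial location in the world.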

The model due to Anastasio and Patton reproduces multi-sensory enhancement.

Deactivating modulatory, cortical input also deactivates multi-sensory enhancement.

Localization of audiovisual targets is usually determined more by the location of the visual sub-target than by that of the auditory sub-target.

Especially in situations where visual stimuli are seen clearly and thus localized very easily, this can lead to the so-called ventriloquism effect (aka "visual capture"), in which a sound source seems to be localized at the location of the visual target although it is in fact a few degrees away from it.

Cross-modal integration used to be thought of as a feed-forward process. Nowadays, we acknowledge lateral and even cyclic feed-back streams of information.

Synchronized oscillations have been hypothesized to be a potential mechanism for crossmodal integration.

In Anastasio and Patton's SC model, the spatial organization of the SOM is not used to represent the spatial organization of the outside world, but to distribute different sensitivities to the input modalities in different neurons.

It's a bit strange that Anastasio and Patton's and Martin et al.'s SC models do not use the spatial organization of the SOM to represent the spatial organization of the outside world, but to distribute different sensitivities to the input modalities in different neurons.

KNN (or sparse coding) seems to be more appropriate for that.

Denéve et al. use basis function networks with multidimensional attractors for

• function approximation
• cue integration.

They reduce both to maximum likelihood estimation and show that their network performs close to a maximum likelihood estimator.

Ursino et al. divide models of multisensory integration into three categories:

1. Bayesian models (optimal integration etc.),
2. neuron and network models,
3. models on the semantic level (symbolic models).

Multisensory integration in cortex has been studied less than in the midbrain, but there is work on that.

According to Ursino et al., there are two theories about the benefit of multisensory convergence at lower levels of cortical processing: One is that convergence helps resolve ambiguity and improves reliability. The other theory is that it helps predict perceptions.

I believe that multisensory convergence, in early cortex and in sub-cortical regions, is useful because responses often depend not on the modality but on the content. The SC, for example, initiates orienting actions towards salient stimuli. It does not matter whether these are salient visual or auditory stimuli; it is always a good idea to orient towards them.

By combining information from different senses, one can sometimes make inferences that are not possible with information from one modality alone.

Some modalities can yield low-latency, unreliable information and others high-latency, reliable information.

Combining both can produce fast information which improves over time.

Ghahramani et al. infer the cost function presumably guiding natural multisensory integration from behavioral data.

Ghahramani et al. model multisensory integration as a process minimizing uncertainty.

MLE has been a successful model in many sensory cue integration tasks.
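Under Gaussian assumptions, the MLE model reduces to inverse-variance weighting. A minimal sketch (cue positions and variances are made up for illustration):

```python
def mle_fuse(x_v, var_v, x_a, var_a):
    """Fuse two Gaussian cue estimates by inverse-variance weighting.
    The fused variance is always below either unisensory variance."""
    w_v = (1.0 / var_v) / (1.0 / var_v + 1.0 / var_a)
    x_hat = w_v * x_v + (1.0 - w_v) * x_a
    var_hat = 1.0 / (1.0 / var_v + 1.0 / var_a)
    return x_hat, var_hat

# A precise visual cue at 0 deg and a noisy auditory cue at 10 deg:
# the fused estimate is pulled almost all the way to the visual cue.
x_hat, var_hat = mle_fuse(x_v=0.0, var_v=1.0, x_a=10.0, var_a=9.0)
```

The near-complete pull towards the more reliable cue is the same mechanism behind visual capture in the ventriloquism effect.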

Usually, rate perception is influenced more strongly by auditory information than by visual information.

By modulating the reliability of auditory information, visual information can be given greater weight in rate perception.

Roach et al. present a Bayesian model of multisensory integration which takes into account the fact that information from different modalities is only integrated up to a certain amount of incongruence. That model incorporates a Gaussian prior on distances between actual components in cross-sensory stimuli.

With appropriate parameterization, the model proposed by Roach et al. should produce results much like model selection. It is mathematically a little simpler because no explicit decision needs to be made. However, the motivation of a Gaussian function for modeling the actual distance between components in cross-sensory stimuli is a bit unclear: Either the two components belong to a common source or they do not. Why should independent auditory and visual stimuli have a tendency to be close together?
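The effect of such a coupling prior can be written in closed form for two Gaussian cues. This is my own derivation under a flat prior on position, not Roach et al.'s exact model: the visual cue then acts like a second observation of the auditory source with variance inflated by the coupling variance.

```python
def coupled_estimate(x_v, x_a, var_v, var_a, var_c):
    """Estimate the auditory source position when the prior on the
    audio-visual distance is N(0, var_c): the visual cue behaves like
    an extra observation with variance var_v + var_c, so its pull on
    the auditory estimate shrinks as var_c grows."""
    w = var_a / (var_a + var_v + var_c)
    return x_a + w * (x_v - x_a)

# Tight coupling prior: strong pull towards the visual cue.
tight = coupled_estimate(x_v=0.0, x_a=10.0, var_v=1.0, var_a=4.0, var_c=0.0)
# Broad coupling prior: the pull weakens; the cues stay largely separate.
loose = coupled_estimate(x_v=0.0, x_a=10.0, var_v=1.0, var_a=4.0, var_c=45.0)
```

With var_c = 0 this reduces exactly to MLE fusion; as var_c grows, integration fades out gradually rather than by an explicit decision.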

Anastasio et al. drop the strong probabilistic interpretation of SC neurons' firing patterns in their learning model.

The first SC model presented by Rowland et al. is a single-neuron model in which sensory and cortical input is simply summed and passed through a sigmoid squashing function.

The sigmoid squashing function used in Rowland et al.'s first model leads to inverse effectiveness: The sum of weak inputs generally falls into the supra-linear part of the sigmoid and thus produces a superadditive response.
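The mechanism can be illustrated with a logistic squashing function (gain and threshold values here are arbitrary, not Rowland et al.'s parameters):

```python
import math

def response(drive, gain=1.0, theta=6.0):
    # Sigmoid squashing of the summed input drive.
    return 1.0 / (1.0 + math.exp(-gain * (drive - theta)))

# Weak inputs land on the accelerating part of the sigmoid:
# the combined response exceeds the sum of the unisensory responses.
superadditive = response(2.0 + 2.0) > response(2.0) + response(2.0)
# Strong inputs land on the saturating part: the combined response
# falls short of the sum (inverse effectiveness).
subadditive = response(6.0 + 6.0) < response(6.0) + response(6.0)
```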

Rucci et al. model learning of audio-visual map alignment in the barn owl SC. In their model, projections from the retina to the SC are fixed (and visual RFs are therefore static) and connections from ICx are adapted through value-dependent learning.

Enhancement, depression, multisensory interaction on the neural level are mathematically defined by Wallace and Stein as

$$100\times\frac{r_{mm}-\max(r_a,r_v)}{\max(r_a,r_v)},$$ where $r_a$ and $r_v$ are the mean responses to only an auditory or a visual stimulus, and $r_{mm}$ is the response to the combination of the two.
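As a sanity check, Wallace and Stein's definition as code (the spike rates in the example are made up):

```python
def enhancement(r_mm, r_a, r_v):
    """Multisensory interaction in percent: positive values indicate
    enhancement, negative values depression, relative to the best
    unisensory response."""
    best = max(r_a, r_v)
    return 100.0 * (r_mm - best) / best

# A multisensory response of 15 spikes/s against a best unisensory
# response of 10 spikes/s is a 50 % enhancement.
e = enhancement(r_mm=15.0, r_a=10.0, r_v=4.0)
```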

Irrelevant auditory stimuli can dramatically improve or degrade orientation performance in visual orientation tasks:

In Wilkinson et al.'s experiments, cats' performance in orienting towards near-threshold, medial visual stimuli was much improved by irrelevant auditory stimuli close to the visual stimuli and drastically degraded by irrelevant auditory stimuli far from the visual stimuli.

If visual stimuli were further to the edge of the visual field, then lateral auditory stimuli improved their detection rate even if they were disparate.

Chemical deactivation of AES degrades both the improvement and the degradation of performance in orienting towards visual stimuli due to auditory stimuli.

Visuo-vestibular cells in MST perform multisensory integration in the sense that their response to multisensory stimuli is different from their response to either of the uni-sensory cues.

Visuo-vestibular cells tend to be selective for visual and vestibular self-motion cues which indicate motion in the same direction.

The responses of some visuo-vestibular cells were enhanced, those of others depressed, by combined visuo-vestibular cues.

Visual information seems to override vestibular information in estimating heading direction.

The Kalman filter is a good method in many (robotic) multisensory integration problems in dynamic domains.

At the most general level, multisensory integration (or multisensor data fusion) in application contexts is best described in terms of Bayesian theory, its specializations, and approximations to it.

Multisensory integration is used in social signal processing.

Neurons in the deep SC which show an enhancement in response to multisensory stimuli peak earlier.

The response profiles have superadditive, additive, and subadditive phases: Even for cross-sensory stimuli whose unisensory components are strong enough to elicit only an additive enhancement of the cumulated response, the response is superadditive over parts of the time course.

The unity assumption is influenced by exogenous and endogenous factors:

• degree of discrepancy between cross-modal percepts,
• active involvement (in proprioception),
• awareness of discrepancy,
• expectation,
• compellingness.

The probability that two stimuli in different modalities are perceived as one multisensory stimulus generally decreases with increasing temporal or spatial disparity between them.

The probability that two stimuli in different modalities are perceived as one multisensory stimulus generally increases with increasing semantic congruency.

In a sensorimotor synchronization task, Aschersleben and Bertelson found that an auditory distractor biased the temporal perception of a visual target stimulus more strongly than the other way around.

Yan et al. explicitly do not integrate auditory and visual localization. Given multiple visual localizations and an auditory localization, they associate the auditory localization with the closest visual localization and use that visual localization as the localization of the audio-visual object.

In determining the position of the audio-visual object, Yan et al. handle the possibility that the actual source of the stimulus has only been heard, not seen. They decide whether that is the case by estimating the probability that the auditory localization belongs to any of the detected visual targets and comparing it to the baseline probability that the auditory target has not been detected visually.
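The association step might look roughly like this. This is a much-simplified 1-D sketch of the idea, not Yan et al.'s algorithm; in particular, the fixed distance threshold is my stand-in for their probabilistic heard-only test.

```python
def associate(auditory_pos, visual_positions, max_dist):
    """Snap the auditory localization to the nearest visual
    localization, unless no visual target is a plausible source."""
    if not visual_positions:
        return auditory_pos
    nearest = min(visual_positions, key=lambda v: abs(v - auditory_pos))
    # Crude stand-in for the probabilistic test: if even the nearest
    # visual target is too far away, assume the source was only heard.
    return nearest if abs(nearest - auditory_pos) <= max_dist else auditory_pos
```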

Voges et al. use the strength of the visual detection signal (the peak value of the column-wise sum of the difference image) as a proxy for the confidence of visual detection.

They use visual localization whenever this signal strength is above a certain threshold, and auditory localization if it is below that threshold.

In Voges et al.'s system, auditory localization serves as a backup in case visual localization fails, and for disambiguation in case more than one visual target is detected.

Voges et al. do not evaluate the accuracy of audio-visual localization.

Kushal et al. do not evaluate the accuracy of audio-visual localization quantitatively. They do show a graph for visual-only, audio-visual, and audio-visual and temporal localization during one test run. That graph seems to indicate that multisensory and temporal integration prevent misdetections—they do not seem to improve localization much.

Kushal et al. use an EM algorithm to integrate audio-visual information for active speaker localization statically and over time.

Studies on audio-visual active speaker localization usually do not report on in-depth evaluations of audio-visual localization accuracy. The reason is, probably, that auditory information is only used as a backup for cases when visual localization fails or for disambiguation in case visual information is not sufficient to tell which of the visual targets is the active speaker.

When visual detection succeeds, it is usually precise enough.

Therefore, active speaker localization is probably a misnomer. It should be called active speaker identification.

Mühling et al. present an audio-visual video concept detection system. Their system extracts visual and auditory bags of words from video data. Visual words are based on SIFT features, auditory words are formed by applying the K-Means algorithm to a Mel-Frequency Cepstral Coefficients analysis of the auditory data. Support vector machines are used for classification.

Integrating information from different modalities can improve

• detection,
• identification,
• precision of orienting behavior,
• reaction time.

Very few perceptions are truly affected only by sensation through one sensory modality.

The first systematic studies of neural multisensory integration started in the 1970s.

Cats, being an altricial species, are born with little to no capability of multi-sensory integration and develop first multi-sensory SC neurons, then neurons exhibiting multi-sensory integration on the neural level only after birth.

Multisensory experience is necessary to develop normal multisensory integration.

Multisensory integration in the SC is similar in anesthetized and alert animals (cats).

Multisensory input can provide redundant information on the same thing.

Redundancy reduces uncertainty and increases reliability.

The redundancy provided by multisensory input can facilitate or even enable learning.

Integrating information is a good thing.

A simple MLP would probably be able to learn optimal multi-sensory integration via backprop.

Using a space-coded approach instead of an MLP for learning multi-sensory integration has benefits:

• learning is unsupervised
• can work with missing data

In Anastasio et al.'s model of multi-sensory integration in the SC, an SC neuron is connected to one neuron from each modality whose spiking behavior is a (Poisson) probabilistic function of whether there is a target in that modality or not.

Their single SC neuron then computes the posterior probability of there being a target given its inputs (evidence) and the prior.

Under the assumption that neural noise is independent between neurons, Anastasio et al.'s approach can be extended by making each input neuron its own modality.

Bayesian integration becomes more complex, however, because receptive fields are not sharp. The formulae still hold, but the neurons cannot simply use Poisson statistics to integrate.

Anastasio et al. use their model to explain enhancement and the principle of inverse effectiveness.

Multisensory integration is a way to reduce uncertainty. This is both a normative argument and it states the evolutionary advantage of using multisensory integration.

Fetsch et al. define cue combination as the combination of multiple sensory cues arising from the same event or object.

There are two strands in multi-sensory research: mathematical modeling and modeling of neurophysiology.

Yay! I'm bridging that gulf as well!

According to Ma et al.'s work, computations in neurons performing multi-sensory integration should be additive or sub-additive. This is at odds with observed neurophysiology.

My model is normative, performs optimally and it shows super-additivity (to be shown).

Stanford et al. studied single-neuron responses to cross-modal stimuli in their receptive fields. In contrast to previous studies, they systematically tried out different combinations of intensity levels in different modalities.

Morgan et al. studied the neural responses to visual and vestibular self-motion cues in the dorsal portion of the medial superior temporal area (MSTd).

They presented congruent and incongruent stimuli at different levels of reliability and found that at any given level of reliability, the neural computation underlying multi-sensory integration could be described well by a linear addition rule.

However, the weights used in combining the uni-sensory responses changed with cue reliability.

Fetsch et al. explain the discrepancy between observed neurophysiology—superadditivity—and the normative solution to single-neuron cue integration proposed by Ma et al. using divisive normalization:

They propose that the network activity is normalized in order to keep neurons' activities within their dynamic range. This would lead to the apparent reliability-dependent weighting of responses found by Morgan et al. and superadditivity as described by Stanford et al.
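The core of divisive normalization can be sketched as follows; the pooling rule (mean of the population drive) and the semisaturation constant are illustrative choices, not the specific model of Fetsch et al. or Ohshiro et al.

```python
import numpy as np

def normalize(drive, sigma=1.0):
    """Divide each neuron's drive by the pooled population activity
    plus a semisaturation constant sigma."""
    drive = np.asarray(drive, dtype=float)
    return drive / (sigma + drive.mean())

# Weak drives: the pool term is dominated by sigma, responses pass
# nearly unscaled, so superadditive combinations can survive.
weak = normalize([0.1, 0.2, 0.1])
# Strong drives: the pool grows and compresses the responses,
# yielding subadditive, apparently reliability-weighted combination.
strong = normalize([5.0, 6.0, 5.0])
```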

Fetsch et al. acknowledge the similarity of their model with that of Ohshiro et al.

Fetsch et al. provide some sort of normative motivation to the model due to Ohshiro et al.

Studies of single-neuron responses to multisensory stimuli have usually not explored the full dynamic range of inputs: they often used near- or sub-threshold stimulus intensities and thus usually found superadditive effects.

Studies of single-neuron responses to multisensory stimuli have over-emphasized the prevalence of superadditivity over that of subadditivity.

Weisswange et al. distinguish between two strategies for Bayesian multisensory integration: model averaging and model selection.

The model averaging strategy computes the posterior probability for the position of the signal source, taking into account the possibility that the stimuli had the same source and the possibility that they had two distinct sources.

The model selection strategy computes the most likely of these two possibilities. This has been called causal inference.
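Both strategies can be sketched numerically in the spirit of causal-inference models such as Körding et al.'s (cited below). Grid bounds, variances, and the prior probability of a common cause are illustrative assumptions; the Gaussian normalization constants cancel between the two evidences, so they are omitted.

```python
import numpy as np

def causal_inference(x_v, x_a, var_v, var_a, var_s=100.0, p_common=0.5):
    """Posterior probability of a common cause, plus the position
    estimates under model averaging and under model selection."""
    s = np.linspace(-50.0, 50.0, 2001)        # candidate source positions
    prior = np.exp(-s ** 2 / (2 * var_s))
    prior /= prior.sum()                      # Gaussian prior over positions
    lik_v = np.exp(-(x_v - s) ** 2 / (2 * var_v))
    lik_a = np.exp(-(x_a - s) ** 2 / (2 * var_a))
    ev_common = (lik_v * lik_a * prior).sum()                   # one shared source
    ev_separate = (lik_v * prior).sum() * (lik_a * prior).sum() # two sources
    post = p_common * ev_common / (p_common * ev_common
                                   + (1 - p_common) * ev_separate)
    s_fused = (s * lik_v * lik_a * prior).sum() / ev_common
    s_visual = (s * lik_v * prior).sum() / (lik_v * prior).sum()
    averaged = post * s_fused + (1 - post) * s_visual   # model averaging
    selected = s_fused if post > 0.5 else s_visual      # model selection
    return post, averaged, selected

# Nearby cues are judged to share a cause; distant cues are not.
p_near, _, _ = causal_inference(x_v=0.0, x_a=2.0, var_v=1.0, var_a=4.0)
p_far, _, _ = causal_inference(x_v=0.0, x_a=20.0, var_v=1.0, var_a=4.0)
```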

Human children often react to multi-sensory stimuli faster than they do to uni-sensory stimuli. However, the latencies they exhibit up to a certain age do not violate the race model as they do in adult humans.

Multisensory integration develops after birth in many ways.

The race model of multi-sensory integration assumes that the reaction to a multi-sensory stimulus is as fast as the fastest reaction to any of the individual stimuli.
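The race model's prediction can be simulated directly; the exponential RT distributions and their means below are arbitrary illustrative choices.

```python
import random

rng = random.Random(1)
n = 10000

# Unisensory reaction times, drawn independently for each trial.
rt_a = [rng.expovariate(1.0 / 300.0) for _ in range(n)]
rt_v = [rng.expovariate(1.0 / 250.0) for _ in range(n)]

# Race model: the multisensory RT is whichever unisensory process
# finishes first. Its mean lies below both unisensory means even
# though no integration takes place; only RTs faster than this race
# prediction count as evidence of true integration.
rt_race = [min(a, v) for a, v in zip(rt_a, rt_v)]
mean_race = sum(rt_race) / n
```

This is why multisensory speed-ups alone do not demonstrate integration: statistical facilitation from the race already shortens mean RTs.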

Weisswange et al. model learning of multisensory integration using reward-mediated / reward-dependent learning in an ANN, a form of reinforcement learning.

They model a situation similar to the experiments due to Neil et al. and Körding et al. in which a learner is presented with visual, auditory, or audio-visual stimuli.

In each trial, the learner is given reward depending on the accuracy of its response.

In an experiment where stimuli could be caused by the same or different sources, Weisswange et al. found that their model behaves similarly to both model averaging and model selection, though slightly more similarly to the former.

Do the parts of the sensory map in the deeper SC corresponding to peripheral visual space have better representation than in the visual superficial SC because they integrate more information; does auditory or tactile localization play a more important part in multisensory localization there?

Stein offers an operational definition of multisensory integration as

"...the process by which stimuli from different senses combine ... to produce a response that differs from those produced by the component stimuli individually."

Kao et al. did not find visually responsive neurons in the deep layers of the cat SC within the first three postnatal weeks.

Some animals are born with deep-SC neurons responsive to more than one modality.

However, these neurons don't integrate according to Stein's single-neuron definition of multisensory integration. This kind of multisensory integration develops with experience with cross-modal stimuli.

Task-irrelevant visual cues do not affect visual orienting (visual spatial attention). Task-irrelevant auditory cues, however, seem to do so.

Santangelo and Macaluso suggest that whether or not the effects of endogenous attention dominate the ones of bottom-up processing (automatic processing) depends on semantic association, be it linguistic or learned association (like dogs and barking, cows and mooing).

Santangelo and Macaluso state that "the same frontoparietal attention control systems are ... activated in spatial orienting tasks for both the visual and auditory modality..."

Colonius' and Diederich's explanation for uni-sensory neurons in the deep SC has a few weaknesses: First, they model the input spiking activity for both the target and the non-target case as Poisson distributed. This is a problem, because the input spiking activity is really a function of the target distance from the center of the RF. Second, they explicitly model the probability of the visibility of a target to be independent of the probability of its audibility.

When asked to ignore stimuli in the visual modality and attend to the auditory modality, increased activity in the auditory temporal cortex and decreased activity in the visual occipital cortex can be observed (and vice versa).

Semantic multisensory congruence can

• shorten reaction times,
• lower detection thresholds,
• facilitate visual perceptual learning.

Jack and Thurlow found that the degree to which a puppet resembled an actual speaker (whether it had eyes and a nose, whether it had a lower jaw moving with the speech etc.) and whether the lips of an actual speaker moved in synch with heard speech influenced the strength of the ventriloquism effect.

The "unity assumption" is the hypothesized unconscious assumption (or belief) of an observer that stimuli in different modalities represent a single cross-sensory object.

In one of their experiments, Warren et al. had their subjects localize visual or auditory components of visual-auditory stimuli (videos of people speaking and the corresponding sound). Stimuli were made "compelling" by playing video and audio in sync and "uncompelling" by introducing a temporal offset.

They found that their subjects performed as under a "unity assumption" when they were told they would perceive cross-sensory stimuli and the stimuli were "compelling", and under a low "unity assumption" when they were told there could be separate auditory or visual stimuli and/or the stimuli were made "uncompelling".

Vatakis and Spence found support for the concept of a "unity assumption" in an experiment in which participants were to judge whether a visual lip stream or an auditory utterance was presented first: participants found this task easier when the visual and auditory streams did not match in gender of voice or in content, suggesting that the unity assumption was weak in these cases, causing them not to integrate the streams.

Kleesiek et al. use a recurrent neural network with parametric bias (RNNPB) to classify objects from the multisensory percepts induced by interacting with them.

Wozny et al. distinguish between three strategies for multisensory integration: model averaging, model selection, and probability matching.

Wozny et al. found in an audio-visual localization experiment that the performance of a majority of their participants was best explained by the statistically sub-optimal probability-matching strategy.
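Probability matching can be sketched as stochastic model choice; the posterior value and the two estimates below are made up for illustration.

```python
import random

def probability_matching(est_common, est_separate, p_common, rng):
    """Pick the common-cause estimate with a frequency equal to its
    posterior probability, rather than always blending the estimates
    (model averaging) or always taking the likelier model (selection)."""
    return est_common if rng.random() < p_common else est_separate

rng = random.Random(0)
# Over many trials, the common-cause estimate is chosen on roughly
# p_common of them.
picks = [probability_matching(1.0, 0.0, 0.7, rng) for _ in range(10000)]
frac_common = sum(picks) / len(picks)
```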

Weisswange et al.'s results seem at odds with those of Wozny et al. However, Wozny et al. state that different strategies may be used in different settings.

If it is not given that an auditory and a visual stimulus belong together, then integrating them (binding) unconditionally is not a good idea. In that case, causal inference and model selection are better.

The a priori belief that there is one stimulus (the "unity assumption") can then be seen as a prior for one model: the one that assumes a single, cross-modal stimulus.

With increasing distance between stimuli in different modalities, the likelihood of perceiving them as in one location decreases.

With increasing distance between stimuli in different modalities, the likelihood of perceiving them as one cross-modal stimulus decreases.

In other words, the unity assumption depends on the distance between stimuli.

In an audio-visual localization task, Wallace et al. found that their subjects' localizations of the auditory stimulus were usually biased towards the visual stimulus whenever the two stimuli were perceived as one, and vice versa.

Details of instructions and quality of stimuli can influence the strength of the spatial ventriloquism effect.

Sato et al. modeled multisensory integration with adaptation purely computationally. In their model, two localizations (one from each modality) were bound or not bound and localized according to a maximum a-posteriori decision rule.

The unity assumption can be interpreted as a prior (if interpreted as an expectation of a forthcoming uni- or cross-sensory stimulus) or a mediator variable in a Bayesian inference model of multisensory integration.

Martin et al. model multisensory integration in the SC using a SOM algorithm.

Input in Martin et al.'s model of multisensory integration in the SC is an $m$-dimensional vector for every data point, where $m$ is the number of modalities. Data points are uni-modal, bi-modal, or tri-modal. Each dimension of the data point codes stochastically for the combination of modalities of the data point. The SOM learns to map different modality combinations to different regions of its two-dimensional grid.

Martin et al.'s model of multisensory integration in the SC replicates enhancement and, through its non-linear transfer function, superadditivity.

Bell et al. found that playing a sound before a visual target stimulus did not increase activity in the neurons they monitored for long enough to lead to (neuron-level) multisensory integration.

Antonelli et al. use Bayesian and Monte Carlo methods to integrate optic flow and proprioceptive cues to estimate distances between a robot and objects in its visual field.

The leaky-integrate-and-fire model due to Rowland and Stein models a single multisensory SC neuron receiving input from a number of sensory, cortical, and sub-cortical sources.

Each of the sources is modeled as a single input to the SC neuron.

Local inhibitory interaction between neurons in multi-sensory trials is modeled by a single time-variant subtractive term which sets in shortly after the actual sensory input, thus not influencing the first phase of the response after stimulus onset.

The model due to Rowland and Stein does not consider the spatial properties of input or output. In reality, the same source of input (retina, LGN, association cortex) may convey information about stimulus conditions from different regions in space, and neurons at different positions in the SC react to different stimuli.

Rowland and Stein focus on the temporal dynamics of multisensory integration.

Rowland and Stein's goal is only to generate neural responses like those observed in real SC neurons with realistic biological constraints. The model does not give any explanation of neural responses on the functional level.

The network characteristics of the SC are modeled only very roughly by Rowland and Stein's model.

The model due to Rowland and Stein manages to reproduce the nonlinear time course of neural responses as well as enhancement in response magnitude and inverse effectiveness in multisensory integration in the SC.

Since the model does not include spatial properties, it does not reproduce the spatial principle (i.e. no depression).

Hearing someone say 'ba' while seeing them say 'ga' can make one perceive them as saying 'da'. This is called the McGurk effect.

Children do not integrate information the same way adults do in some tasks. Specifically, they sometimes do not integrate information optimally where adults do.

In an adapted version of Ernst and Banks' visuo-haptic height estimation paradigm, Gori et al. found that children under the age of 8 do not integrate visual and haptic information optimally where adults do.

Ernst and Banks show that humans combine visual and haptic information optimally in a height estimation task.
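The optimal combination rule behind this and similar findings can be stated compactly. The following is the standard maximum-likelihood (MLE) formulation for two unbiased Gaussian cues, not notation taken from Ernst and Banks' paper:

```latex
\hat{s} = w_v \hat{s}_v + w_h \hat{s}_h, \qquad
w_v = \frac{1/\sigma_v^2}{1/\sigma_v^2 + 1/\sigma_h^2}, \qquad
w_h = \frac{1/\sigma_h^2}{1/\sigma_v^2 + 1/\sigma_h^2},
```

with combined variance $\sigma^2 = \frac{\sigma_v^2 \sigma_h^2}{\sigma_v^2 + \sigma_h^2} \leq \min(\sigma_v^2, \sigma_h^2)$, i.e. the combined estimate is always at least as reliable as the better single cue.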

Neural responses in the SC to spatially and temporally coincident cross-sensory stimuli can be much stronger than responses to uni-sensory stimuli.

In fact, they can be much greater than the sum of the responses to either stimulus alone.

Neural responses (in multi-sensory neurons) in the SC to spatially disparate cross-sensory stimuli are usually weaker than responses to uni-sensory stimuli.

Responses in multi-sensory neurons in the SC follow the so-called spatial principle.

Moving eyes, ears, or body changes the receptive fields (in external space) of SC neurons with respect to stimuli in the respective modality.

Stanford et al. state that superadditivity seems quite common in cases of multi-sensory enhancement.

Alais and Burr found in an audio-visual localization experiment that the ventriloquism effect can be interpreted by a simple cue weighting model of human multi-sensory integration:

Their subjects weighted visual and auditory cues depending on their reliability. The weights they used were consistent with MLE. In most situations, visual cues are much more reliable for localization than are auditory cues. Therefore, a visual cue is given so much greater weight that it captures the auditory cue.
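This reliability weighting can be sketched numerically. The variances below are illustrative values, not Alais and Burr's data:

```python
# Reliability-weighted (MLE) cue combination: each cue is weighted by its
# inverse variance. With a much more reliable visual cue, the combined
# estimate lands near the visual position ("visual capture").

def combine(estimates, variances):
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    weights = [w / total for w in weights]
    s = sum(w * e for w, e in zip(weights, estimates))
    var = 1.0 / total  # combined variance is below every single-cue variance
    return s, var

# Illustrative numbers: visual estimate at 0 deg (variance 1), auditory
# estimate at 10 deg (variance 25) -- the combined estimate is pulled
# almost fully toward the visual cue.
s, var = combine([0.0, 10.0], [1.0, 25.0])
```

With these numbers the combined estimate sits well below 1 degree, i.e. the visual cue has captured the auditory one.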

According to Landy et al., humans often combine cues (intra- or cross-sensory) optimally, consistent with MLE.

Multiplying probabilities is equivalent to adding their logs. Thus, working with log likelihoods, one can circumvent the necessity of neural multiplication when combining probabilities.
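A one-line illustration of this identity (generic, not from any of the cited models):

```python
import math

p = [0.2, 0.5, 0.3]  # independent cue likelihoods (arbitrary values)

product = math.prod(p)                  # multiplying probabilities...
log_sum = sum(math.log(x) for x in p)   # ...equals adding log likelihoods

assert abs(math.exp(log_sum) - product) < 1e-12
```

So a population that adds log likelihoods (a linear operation on log-coded inputs) implicitly multiplies the underlying probabilities.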

Multisensory integration, however, has been viewed as integration of information in exactly that sense, and it is well known that multisensory neurons respond super-additively to stimuli from different modalities.

In Jazayeri and Movshon's model decoding (or output) neurons calculate the logarithm of the input neurons' tuning functions.

This is not biologically plausible because it would give them transfer functions which are non-linear and non-sigmoid (and biologically plausible transfer functions are typically taken to be sigmoid).
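The decoding scheme can be sketched as follows. The tuning curves and spike counts are invented for illustration; the non-linearity at issue is the `log` applied to the tuning functions:

```python
import math

def gauss_tuning(s, pref, width=1.0, gain=10.0):
    # Firing rate of a neuron with Gaussian tuning around its preferred
    # stimulus; a small baseline keeps the logarithm finite.
    return gain * math.exp(-0.5 * ((s - pref) / width) ** 2) + 0.1

prefs = [-2.0, -1.0, 0.0, 1.0, 2.0]   # preferred stimuli of the input neurons
counts = [1, 3, 9, 3, 1]              # hypothetical spike counts, peaked at s = 0

def log_likelihood(s):
    # Poisson spiking: log L(s) = sum_i n_i * log f_i(s), up to terms that do
    # not depend on s if the tuning curves tile the stimulus space densely.
    return sum(n * math.log(gauss_tuning(s, p)) for n, p in zip(counts, prefs))

# Maximum-likelihood readout over a grid of candidate stimulus values.
candidates = [x / 10.0 for x in range(-30, 31)]
s_hat = max(candidates, key=log_likelihood)
```

The readout itself is a weighted sum, but each decoding neuron would have to apply `log` to its input tuning function, which is the step Jazayeri and Movshon's critics point to.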

Most of the multi-sensory neurons in the (cat) SC are audio-visual, followed by visual-somatosensory, but all other combinations can be found.

One reason for specifically studying multi-sensory integration in the (cat) SC is that there is a well-understood connection between input stimuli and overt behavior.

Stein defines multi-sensory integration on the single-neuron level as

``a statistically significant difference between the number of impulses evoked by a cross-modal combination of stimuli and the number evoked by the most effective of these stimuli individually.''

What we find in the SC we can use as a guide when studying other multi-sensory brain regions.

Multisensory integration is present in neonates to some degree depending on species (more in precocial than in altricial species), but it is subject to postnatal development and then influenced by experience.

An experiment by Burr et al. showed auditory dominance in a temporal bisection task (studying the temporal ventriloquism effect). The results were qualitatively but not quantitatively predicted by an optimal-integration model.

There are two possibilities explaining the latter result:

• audio-visual integration is not optimal in this case, or
• the model is incorrect. Specifically, the assumption of Gaussian noise in timing estimation may not reflect actual noise.

Multisensory enhancement and depression are an increased and decreased response of a multisensory neuron to congruent and incongruent stimuli, respectively.

Multisensory enhancement and depression are very different across neurons.

In many instances of multi-sensory perception, humans integrate information optimally.

AES integrates audio-visual inputs similar to SC.

AES has multisensory neurons, but they do not project to SC.

Non-spatial stimulus properties influence if and how cross-sensory stimuli are integrated.

Multisensory integration in cortical VLPFC was more commonly observed for face-vocalization combinations than for general audio-visual cues.

The idea that neural activity does not primarily represent the world but 'action pointers', as put by Engel et al., speaks to the deep SC which is both 'multi-modal' and 'motor'.

If there is a close connection between the state of the world and the required actions, then it is easy to confuse internal representations of the world with 'action pointers'.

The ANN model of multi-sensory integration in the SC due to Ohshiro et al. manages to replicate a number of physiological findings about the SC:

• inverse effectiveness,
• long-range inhibition and short-range activation,
• multisensory integration,
• different tuning to modalities between neurons,
• weighting of stimuli from different modalities.

It does not learn and it has no probabilistic motivation.

The ANN model of multi-sensory integration in the SC due to Ohshiro et al. uses divisive normalization to model multisensory integration in the SC.
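Divisive normalization in its generic form (not Ohshiro et al.'s exact parameterization) divides a neuron's exponentiated drive by a semi-saturation term plus pooled activity. A minimal single-neuron sketch, with invented parameters, shows how this alone yields inverse effectiveness:

```python
def normalized_response(drive, sigma=1.0, n=2.0):
    # Divisive normalization collapsed to one neuron: the response is the
    # exponentiated drive divided by a semi-saturation constant plus itself.
    return drive ** n / (sigma ** n + drive ** n)

def additivity(v, a):
    # Ratio of the bimodal response (summed drive) to the sum of the two
    # unimodal responses: > 1 means superadditive, < 1 subadditive.
    return normalized_response(v + a) / (normalized_response(v) + normalized_response(a))

weak = additivity(0.5, 0.5)    # superadditive for weak stimuli
strong = additivity(5.0, 5.0)  # subadditive for strong stimuli
```

Weak drives sit in the expansive part of the response function, so combining them is superadditive; strong drives are near saturation, so combining them is subadditive.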

Deactivation of AES and rLS leads to a complete lack of cross-modal enhancement while leaving intact the ability of multi-sensory SC neurons to respond to uni-sensory input and even to add input from different sensory modalities.

Rowland et al. derive a model of cortico-collicular multi-sensory integration from findings concerning the influence of deactivation or ablation of the cortical regions anterior ectosylvian cortex (AES) and rostral lateral suprasylvian cortex (rLS).

It is a single-neuron model.

Ghahramani et al. discuss computational models of sensorimotor integration.

Need to look at models of multi-sensory integration as well; they are not necessarily models of the SC, but relevant.

Weisswange et al. apply the idea of Bayesian inference to multi-modal integration and action selection. They show that online reinforcement learning can effectively train a neural network to approximate a Q-function predicting the reward in a multi-modal cue integration task.

Yamashita et al. modify Deneve et al.'s network by weakening divisive normalization and lateral inhibition. Their network then integrates localizations if the disparity between localizations in simulated modalities is low, and maintains multiple hills of activation if disparity is high, thus accounting for the ventriloquism effect.

Yamashita et al. argue that, since whether or not two stimuli in different modalities with a certain disparity are integrated depends on the weight profiles in their network, a Bayesian prior is somehow encoded in these weights.

The model due to Cuppini et al. develops low-level multisensory integration (spatial principle) such that integration happens only with higher-level input.

In their model, Hebbian learning leads to sharpening of receptive fields, overlap of receptive fields, and integration through higher-cognitive input.

Schroeder names two general definitions of multisensory integration: One includes any kind of interaction between stimuli from different senses, the other only integration of information about the same object of the real world from different sensory modalities.

These definitions both are definitions on the functional level as opposed to the biological level with which Stein's definition is concerned.

Multisensory integration can be thought of as a special case of integration of information from different sources---be they from one physical modality or from many.

Studying multisensory integration instead of the integration of information from different channels from the same modality tends to be easier because the stimuli can be more reliably separated in experiments.

Schroeder argues that multisensory integration is not separate from general cue integration and that information gleaned about the former can help understand the latter.

Anastasio et al. have come up with a Bayesian interpretation of neural responses to multi-sensory stimuli in the SC. According to their view, enhancement, depression and inverse effectiveness phenomena are due to neurons integrating uncertain information from different sensory modalities.
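Anastasio et al.'s idea can be illustrated with a toy Bayesian computation. The Poisson rates and prior below are invented for illustration, not taken from their paper:

```python
import math

def poisson(k, lam):
    # Poisson probability of observing k spikes at mean rate lam.
    return lam ** k * math.exp(-lam) / math.factorial(k)

def p_target(counts, lam_on=3.0, lam_off=1.0, prior=0.1):
    # Posterior probability that a target is present, given independent
    # Poisson spike counts from one or more sensory channels.
    like_on = math.prod(poisson(k, lam_on) for k in counts)
    like_off = math.prod(poisson(k, lam_off) for k in counts)
    return prior * like_on / (prior * like_on + (1 - prior) * like_off)

# A second modality raises the posterior (enhancement); for strong inputs
# the unimodal posterior is already near 1, so the proportional gain of
# adding a modality shrinks (inverse effectiveness).
uni = p_target([3])
bi = p_target([3, 3])
```

If neural responses are proportional to this posterior, enhancement and inverse effectiveness fall out of Bayes' rule rather than requiring a dedicated mechanism.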

There is multisensory integration in areas typically considered unisensory, e.g. primary and secondary auditory cortex.

There are feedforward and feedback connections between visual cortex and auditory cortex.

High white matter coherence between the parietal lobe and modality-specific brain regions is correlated with high temporal multi-sensory enhancement (shorter reaction times in multi-sensory trials than in uni-sensory trials).

The ventriloquism aftereffect occurs when an auditory stimulus is initially presented together with a visual stimulus with a certain spatial offset.

The auditory stimulus is typically localized by subjects at the same position as the visual stimulus, and this mis-localization prevails even after the visual stimulus disappears.