Tag: cue-combination

Li et al. present a purely engineering-based approach to active speaker localization. Their system uses Viola and Jones' object detection algorithm for face detection and cross-correlation for auditory speaker localization.
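
As a reminder of how the auditory part of such a pipeline works (a minimal sketch under an assumed two-microphone geometry, not Li et al.'s implementation): cross-correlating the two microphone signals yields the time difference of arrival, which maps to an azimuth estimate via the far-field approximation.

import numpy as np

fs = 16000                        # sampling rate (Hz), assumed
mic_distance = 0.2                # microphone spacing (m), assumed
c = 343.0                         # speed of sound (m/s)

rng = np.random.default_rng(1)
source = rng.standard_normal(2048)              # a short noise-like source frame
true_delay = 5                                  # true inter-microphone delay (samples)

left = source
right = np.concatenate([np.zeros(true_delay), source[:-true_delay]])

corr = np.correlate(right, left, mode="full")   # cross-correlation over all lags
lag = np.argmax(corr) - (len(left) - 1)         # lag (samples) with maximum correlation
tdoa = lag / fs                                 # time difference of arrival (s)

# Far-field approximation: sin(azimuth) = c * TDOA / mic_distance.
azimuth = np.degrees(np.arcsin(np.clip(c * tdoa / mic_distance, -1.0, 1.0)))
print(f"estimated TDOA: {tdoa * 1e3:.2f} ms, azimuth: {azimuth:.1f} deg")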

Ravulakollu et al. loosely use the superior colliculus as a metaphor for their robotic visual-auditory localization.

Ravulakollu et al. argue against self-organizing maps (SOMs) and for radial basis functions (RBFs) for combining stimuli (for reasons I don't quite understand).

Optimal multi-sensory integration is learned (for many tasks).

In many audio-visual localization tasks, humans integrate information optimally.

In some audio-visual time discrimination tasks, humans do not integrate information optimally.

Individual neurons in the inferior colliculus (IC) combine localization cues.

The theoretical accounts of multi-sensory integration due to Beck et al. and Ma et al. do not learn and leave little room for learning.

Thus, they fail to explain an important aspect of multi-sensory integration in humans.

Person tracking can combine cues from a single modality (like motion and color cues) or from different modalities (like auditory and visual cues).

Kalman filters and particle filters have been used in uni- and multi-sensory person tracking.
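
A minimal sketch of the basic mechanism (illustrative only, not any of the cited tracking systems): a 1-D Kalman filter with a constant-velocity motion model that sequentially fuses a reliable visual position measurement and a noisier auditory one at each time step. All parameters below are assumptions.

import numpy as np

dt = 0.1                                   # time step (s), assumed
F = np.array([[1.0, dt], [0.0, 1.0]])      # constant-velocity motion model
Q = 0.01 * np.eye(2)                       # process noise covariance, assumed
H = np.array([[1.0, 0.0]])                 # both sensors measure position only
R_vis, R_aud = 0.05, 1.0                   # visual more reliable than auditory (assumed)

x = np.array([0.0, 0.0])                   # initial state estimate [position, velocity]
P = np.eye(2)                              # initial state covariance

def kalman_update(x, P, z, R):
    # Standard Kalman measurement update with scalar measurement z and variance R.
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T / S                        # Kalman gain
    x = x + (K * (z - H @ x)).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P

rng = np.random.default_rng(2)
for t in range(50):
    true_pos = 0.5 * t * dt                # person walking at 0.5 m/s
    # Predict with the motion model.
    x, P = F @ x, F @ P @ F.T + Q
    # Sequentially fuse the visual and the auditory measurement.
    x, P = kalman_update(x, P, true_pos + rng.normal(0, np.sqrt(R_vis)), R_vis)
    x, P = kalman_update(x, P, true_pos + rng.normal(0, np.sqrt(R_aud)), R_aud)

print(f"final position estimate: {x[0]:.2f} m (true: {0.5 * 49 * dt:.2f} m)")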

Reactions to cross-sensory stimuli can be faster than the fastest reaction to any one of the constituent uni-sensory stimuli (as the race model would predict).

Frassinetti et al. showed that humans detect near-threshold visual stimuli more reliably if these stimuli are accompanied by spatially congruent auditory stimuli (and vice versa).

Laurenti et al. found in an audio-visual color identification task that redundant, congruent, semantic auditory information (the utterance of a color word) can decrease the latency of the response to a stimulus (the color of a circle displayed to the subject). Incongruent semantic visual or auditory information (a written or uttered color word) can increase response latency. However, congruent semantic visual information (a written color word) does not decrease response latency.

The enhancements in response latencies in Laurenti et al.'s audio-visual color discrimination experiments were greater (response latencies were shorter) than predicted by the race model.
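
The race-model benchmark is usually made concrete via Miller's inequality: under a race model, the cumulative RT distribution for the redundant audio-visual stimulus cannot exceed the sum of the two uni-sensory cumulative distributions at any time point. A minimal sketch of the test with synthetic reaction times (not Laurenti et al.'s data):

import numpy as np

rng = np.random.default_rng(3)
rt_a = rng.normal(320, 40, 1000)          # auditory-only RTs (ms), assumed
rt_v = rng.normal(350, 40, 1000)          # visual-only RTs (ms), assumed
rt_av = rng.normal(260, 30, 1000)         # audio-visual RTs (ms), assumed faster

ts = np.linspace(150, 500, 200)           # time points at which to evaluate the CDFs

def ecdf(samples, t):
    # Empirical CDF of the RT samples evaluated at each time point in t.
    return np.mean(samples[:, None] <= t[None, :], axis=0)

# Miller's inequality: P(RT_AV <= t) <= P(RT_A <= t) + P(RT_V <= t) for all t.
violation = ecdf(rt_av, ts) > ecdf(rt_a, ts) + ecdf(rt_v, ts)
print("race model violated at some t:", bool(violation.any()))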

Integrating information from multiple stimuli can have advantages:

  • shorter reaction times
  • lower thresholds of stimulus detection
  • improved detection
  • improved identification
  • greater precision of orienting behavior

In a cue combination task with correlated errors, some subjects combined cues according to a linear cue combination rule that would have been appropriate for uncorrelated errors, some combined them suboptimally altogether, and some combined them correctly, according to the linear cue combination rule for correlated errors.
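
For reference: in both cases the optimal weight vector is proportional to the inverse error covariance matrix applied to the all-ones vector, and it reduces to inverse-variance weighting only when the cue errors are uncorrelated. A minimal sketch with illustrative numbers (not the study's parameters):

import numpy as np

def optimal_weights(cov):
    # Optimal weights for an unbiased linear combination of cue estimates:
    # w is proportional to cov^-1 applied to the all-ones vector, normalized to sum to 1.
    ones = np.ones(cov.shape[0])
    w = np.linalg.solve(cov, ones)
    return w / w.sum()

var1, var2, rho = 1.0, 4.0, 0.8            # cue error variances and correlation (assumed)

cov_uncorr = np.diag([var1, var2])
cov_corr = np.array([[var1, rho * np.sqrt(var1 * var2)],
                     [rho * np.sqrt(var1 * var2), var2]])

print("weights, uncorrelated errors:", optimal_weights(cov_uncorr))   # [0.8, 0.2]
print("weights, correlated errors:  ", optimal_weights(cov_corr))

With strongly correlated errors the optimal rule can even assign a negative weight to the less reliable cue, which is why applying the uncorrelated rule in the correlated condition is suboptimal.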

Localization of audio-visual targets is usually determined more by the location of the visual sub-target than by that of the auditory sub-target.

Especially in situations where the visual stimulus is seen clearly and thus localized very easily, this can lead to the so-called ventriloquism effect (also known as 'visual capture'), in which a sound source seems to be located at the position of the visual target although it is in fact a few degrees away from it.

Magosso et al. present a recurrent ANN model which replicates the ventriloquism effect and the ventriloquism aftereffect.

Denève et al. use basis function networks with multidimensional attractors for

  • function approximation
  • cue integration.

They reduce both to maximum likelihood estimation and show that their network performs close to a maximum likelihood estimator.

A deep SC neuron which receives enough information from one modality to reliably determine whether a stimulus is in its receptive field does not improve its performance much by integrating information from another modality.

Patton et al. use this insight to explain the diversity of uni-sensory and multisensory neurons in the deep SC.

Humans adapt to an auditory scene's reverberation and noise conditions. They use visual scene recognition to recall reverberation and noise conditions of familiar environments.

A system that stores multiple trained speech recognition models for different environments and retrieves them guided by visual scene recognition achieves improved speech recognition in reverberant and noisy environments.

Sub-threshold multisensory neurons respond directly to only one modality; however, the strength of the response is strongly influenced by input from another modality.

My theory on sub-threshold multisensory neurons: they receive only inhibitory input from the modality to which they do not respond directly when the stimulus in that modality is outside their receptive field, and they receive no excitatory input from that modality when the stimulus is inside their RF.

Ma, Beck, Latham and Pouget argue that optimal integration of population-coded probabilistic information can be achieved by simply adding the activities of neurons with identical receptive fields. The preconditions for this to hold are

  • independent Poisson (or other "Poisson-like") noise in the input
  • identically-shaped tuning curves in input neurons
  • a point-to-point connection from neurons in different populations with identical receptive fields to the same output neuron.
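
A minimal sketch of why this can work (my illustration, not code from Ma et al.): with independent Poisson noise and identical, densely tiling tuning curves, the log-posterior over the stimulus is linear in the spike counts, so decoding the neuron-by-neuron sum of two population responses is equivalent to multiplying the two single-population likelihoods.

import numpy as np

rng = np.random.default_rng(0)

prefs = np.linspace(-40, 40, 81)      # preferred locations (deg), shared by both populations
sigma_tc = 10.0                       # tuning-curve width (deg), assumed
s_grid = np.linspace(-40, 40, 401)    # hypothesis grid for decoding

def tuning(s, gain):
    # Expected spike count of each neuron for a stimulus at s (Gaussian tuning).
    return gain * np.exp(-0.5 * ((s - prefs) / sigma_tc) ** 2)

def log_posterior(r):
    # log p(s | r) up to an additive constant, assuming independent Poisson noise,
    # a flat prior, and tuning curves that tile the space (so sum_i f_i(s) is ~constant).
    log_tuning = -0.5 * ((s_grid[:, None] - prefs[None, :]) / sigma_tc) ** 2
    return r @ log_tuning.T

s_true = 5.0
r_vis = rng.poisson(tuning(s_true, gain=20.0))   # reliable cue: high gain
r_aud = rng.poisson(tuning(s_true, gain=5.0))    # unreliable cue: low gain

lp_sum = log_posterior(r_vis + r_aud)                  # decode the summed activities
lp_opt = log_posterior(r_vis) + log_posterior(r_aud)   # optimal: product of likelihoods

print(np.allclose(lp_sum, lp_opt))                     # True: summation is optimal here
print(s_grid[np.argmax(lp_sum)])                       # MAP estimate of the location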

Himmelbach et al. studied one patient in a visuo-haptic grasping task and found that she adapted her grip online to changes of object size like healthy subjects when the object was in her central visual field. This indicates that the problem for patients with lesions of the parieto-occipital junction (POJ) is not an inability to adapt online, but more likely lies in the connection between visuomotor pathways and the pathways necessary for grasping.

Multisensory integration is a way to reduce uncertainty. This is both a normative argument and a statement of the evolutionary advantage of using multisensory integration.

Fetsch et al. define cue combination as the 'combination of multiple sensory cues' arising from the same event or object.

The model due to Ma et al. is simple and it requires no learning.

Is it possible for a system to learn the reliability of its sensory modalities from how well they agree with the consensus between the modalities, at least under certain conditions?

Possible conditions:

  • many modalities (what my 2013 model does)
  • similar reliability
  • enough noise
  • enough remaining entropy at the end of learning (worked in early versions of my SOM)

Task-irrelevant visual cues do not affect visual orienting (visual spatial attention). Task-irrelevant auditory cues, however, seem to do so.

Santangelo and Macaluso suggest that whether or not the effects of endogenous attention dominate the ones of bottom-up processing (automatic processing) depends on semantic association, be it linguistic or learned association (like dogs and barking, cows and mooing).

Santangelo and Macaluso state that "the same frontoparietal attention control systems are ... activated in spatial orienting tasks for both the visual and auditory modality..."

Colonius and Diederich argue that deep-SC neurons' spiking behavior can be interpreted as a vote for a target rather than a non-target being in their receptive field.

This is similar to Anastasio et al.'s previous approach.

There are a number of problems with Colonius' and Diederich's idea that deep-SC neurons' binary spiking behavior can be interpreted as a vote for a target rather than a non-target being in their RF. First, these neurons' RFs can be very broad, and the strength of their response is a function of how far away the stimulus is from the center of their RFs. Second, the response strength is also a function of stimulus strength. It needs some arguing, but to me it seems more likely that the response encodes the probability of a stimulus being in the center of the RF.

Colonius and Diederich argue that, given their Bayesian, normative model of neurons' response behavior, neurons responding to only one sensory modality outperform neurons responding to multiple sensory modalities.

Children do not integrate information the same way adults do in some tasks. Specifically, they sometimes do not integrate information optimally, where adults do integrate it optimally.

In an adapted version of Ernst and Banks' visuo-haptic height estimation paradigm, Gori et al. found that children under the age of 8 do not integrate visual and haptic information optimally, whereas adults do.

Ernst and Banks show that humans combine visual and haptic information optimally in a height estimation task.

Human performance in combining slant and disparity cues for slant estimation can be explained by (optimal) maximum-likelihood estimation.

According to Landy et al., humans often combine cues (intra- or cross-sensory) optimally, consistent with MLE.
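
For reference, 'optimal' here means reliability-weighted averaging: with independent Gaussian cue noise, the maximum-likelihood estimate weights each cue by its inverse variance, and the variance of the combined estimate is below either single-cue variance. A minimal numeric sketch (made-up values, not data from the cited studies):

# MLE combination of two cues with independent Gaussian noise:
#   s_hat = (s1/var1 + s2/var2) / (1/var1 + 1/var2),  var_hat = 1 / (1/var1 + 1/var2)
s_vis, var_vis = 10.0, 1.0     # visual estimate and its variance (assumed)
s_hap, var_hap = 12.0, 4.0     # haptic estimate and its variance (assumed)

w_vis = (1 / var_vis) / (1 / var_vis + 1 / var_hap)
w_hap = 1 - w_vis

s_mle = w_vis * s_vis + w_hap * s_hap
var_mle = 1 / (1 / var_vis + 1 / var_hap)

print(f"combined estimate: {s_mle:.2f}, combined variance: {var_mle:.2f}")
# -> 10.40 and 0.80: the estimate is pulled toward the more reliable (visual) cue,
#    and the combined variance is below both single-cue variances.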

An experiment by Burr et al. showed auditory dominance in a temporal bisection task (studying the temporal ventriloquism effect). The results were qualitatively but not quantitatively predicted by an optimal-integration model.

There are two possible explanations for the latter result:

  • audio-visual integration is not optimal in this case, or
  • the model is incorrect. Specifically, the assumption of Gaussian noise in timing estimation may not reflect actual noise.

In many instances of multi-sensory perception, humans integrate information optimally.

Multisensory stimuli can be integrated within a certain time window: auditory or somatosensory stimuli can be integrated with visual stimuli even if they arrive with a delay relative to the visual stimuli.

Enhancement is greatest for weak stimuli and least for strong stimuli. This is called inverse effectiveness.

Descending inputs from association cortex to SC are uni-sensory.

AES integrates audio-visual inputs similarly to the SC.

AES has multisensory neurons, but they do not project to SC.

AES is a brain region in the cat. We do not know if there is a homologue in humans.

Deactivation of AES and rLS leads to a complete lack of cross-modal enhancement while leaving intact the ability of multi-sensory SC neurons to respond to uni-sensory input and even to add input from different sensory modalities.

Rowland et al. derive a model of cortico-collicular multi-sensory integration from findings concerning the influence of deactivation or ablation of the cortical regions anterior ectosylvian cortex (AES) and rostral lateral suprasylvian cortex.

It is a single-neuron model.

Schroeder argues that multisensory integration is not separate from general cue integration and that information gleaned about the former can help understand the latter.

Judging by the abstract, von der Malsburg's Democratic Integration does what I believe is impossible: it lets a system learn the reliability of its sensory modalities from how well they agree with the consensus between the modalities.

The ventriloquism aftereffect occurs when an auditory stimulus is initially presented together with a visual stimulus with a certain spatial offset.

The auditory stimulus is typically localized by subjects at the same position as the visual stimulus, and this mis-localization prevails even after the visual stimulus disappears.