Frassinetti et al. showed that humans detect near-threshold visual stimuli with greater reliability if these stimuli are connected with spatially congruent auditory stimuli (and vice versa).⇒
Rucci et al. present a robotic system based on their neural model of audiovisual localization.⇒
There are a number of approaches to audio-visual localization: some are implemented on actual robots, others exist only as theoretical ANN or algorithmic models.⇒
Rucci et al. present an algorithm which performs auditory localization and combines auditory and visual localization in a common SC map. The mapping between the representations is learned using value-dependent learning.⇒
Rucci et al.'s neural network learns how to align ICx and SC (OT) maps by means of value-dependent learning: The value signal depends on whether the target was in the fovea after a saccade.⇒
Rucci et al.'s model of learning to combine ICx and SC maps does not take into account the point-to-point projections from SC to ICx reported later by Knudsen et al.⇒
Rucci et al.'s plots of ICc activation look very similar to Jorge's IPD matrices.⇒
Rucci et al. model learning of audio-visual map alignment in the barn owl SC. In their model, projections from the retina to the SC are fixed (and visual RFs are therefore static) and connections from ICx are adapted through value-dependent learning.⇒
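The idea of value-dependent learning described above can be sketched as a Hebbian update gated by a scalar value signal. This is a minimal illustration and not Rucci et al.'s actual equations: the rate-based units, the binary value signal, the fovea radius, the learning rate, and all function names are assumptions made for this example.

```python
def value_signal(target_offset_deg, fovea_radius_deg=2.0):
    """Hypothetical value signal: 1 if the target landed in the fovea
    after the saccade, 0 otherwise (fovea radius is an assumption)."""
    return 1.0 if abs(target_offset_deg) <= fovea_radius_deg else 0.0


def value_dependent_update(weights, pre, post, value, lr=0.1):
    """Return updated ICx->SC weights.

    weights[j][i] connects (hypothetical) ICx unit i to SC unit j.
    The Hebbian term post[j] * pre[i] is only applied in proportion
    to the value signal, so unrewarded saccades change nothing here.
    """
    return [
        [w + lr * value * post[j] * pre[i] for i, w in enumerate(row)]
        for j, row in enumerate(weights)
    ]


# Example: one ICx unit and one SC unit are co-active.
w = [[0.0, 0.0], [0.0, 0.0]]
pre, post = [1.0, 0.0], [0.0, 1.0]
w_hit = value_dependent_update(w, pre, post, value_signal(0.5), lr=0.5)
w_miss = value_dependent_update(w, pre, post, value_signal(10.0), lr=0.5)
```

In this sketch the retina-to-SC projections are not represented at all, which mirrors the note that they are fixed in the model; only the auditory weights change.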
fAES is not tonotopic. Instead, its neurons are responsive to spatial features of sounds. No spatial map had been found in fAES as of at least 2004.⇒
Casey et al. use their ANN in a robotic system for audio-visual localization.⇒
Casey et al. focus on making their system work in real time and with complex stimuli and compromise on biological realism.⇒
In Casey et al.'s system, ILD alone is used for SSL.⇒
In Casey et al.'s experiments, the two microphones are one meter apart and the stimulus is one meter away from the center between the two microphones. There is no damping body between the microphones, but at that interaural distance and distance to the stimulus, ILD should still be a good localization cue.⇒
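A rough check of why ILD can still work without a damping body in this geometry: with the source only one meter away, the path lengths to the two microphones differ noticeably, and the level falls off with distance. The sketch below assumes a free-field point source whose sound pressure is proportional to $1/d$, no reverberation, and identical microphones; the function name and the azimuth convention (0 degrees straight ahead, positive toward the right microphone) are assumptions for this example, not taken from Casey et al.

```python
import math


def ild_db(azimuth_deg, source_dist_m=1.0, mic_spacing_m=1.0):
    """Level difference (right minus left, in dB) caused by distance alone.

    Assumes a free-field point source (pressure ~ 1/distance).
    Microphones sit at (-spacing/2, 0) and (+spacing/2, 0); the source
    is source_dist_m away from the midpoint between them.
    """
    theta = math.radians(azimuth_deg)
    sx = source_dist_m * math.sin(theta)
    sy = source_dist_m * math.cos(theta)
    half = mic_spacing_m / 2.0
    d_left = math.hypot(sx + half, sy)
    d_right = math.hypot(sx - half, sy)
    # Pressure ratio p_right / p_left equals d_left / d_right.
    return 20.0 * math.log10(d_left / d_right)


for az in (0, 30, 60, 90):
    print(az, round(ild_db(az), 2))
```

Under these assumptions the ILD grows monotonically from 0 dB straight ahead to $20 \log_{10} 3 \approx 9.5$ dB at $90^\circ$, which is consistent with the claim that ILD remains usable in this setup.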
An auditory and a visual stimulus, separated in time, may be perceived as one audio-visual stimulus, seemingly occurring at the same point in time.⇒
If an auditory and a visual stimulus are close together, spatially, then they are more likely perceived as one cross-modal stimulus than if they are far apart—even if they are separated temporally.⇒
In a sensorimotor synchronization task, Aschersleben and Bertelson found that an auditory distractor biased the temporal perception of a visual target stimulus more strongly than the other way around.⇒
Sanchez-Riera et al. use a probabilistic model for audio-visual active speaker localization on a humanoid robot (the Nao robot).⇒
Sanchez-Riera et al. use the Bayesian information criterion to choose the number of speakers in their audio-visual active speaker localization system.⇒
Sanchez-Riera et al. use the Waldboost face detection system for visual processing.⇒
Yan et al. present a system which uses auditory and visual information to learn an audio-motor map (in a functional sense) and orient a robot towards a speaker. Learning is online.⇒
Sanchez-Riera et al. do not report on localization accuracy, but on correct speaker detections.⇒
Li et al. report that, in their experiment, audio-visual active speaker localization is as good as visual active-speaker localization ($\sim 1^\circ$) as long as speakers are within the visual field.
Outside of the visual field, localization varies between $1^\circ$ and $10^\circ$. The authors do not provide a detailed quantitative evaluation of localization accuracy.⇒
Yan et al. explicitly do not integrate auditory and visual localization. Given multiple visual localizations and one auditory localization, they associate the auditory localization with the closest visual localization and use that visual localization as the position of the audio-visual object.
In determining the position of the audio-visual object, Yan et al. handle the possibility that the actual source of the stimulus has only been heard, not seen. They decide whether that is the case by estimating the probability that the auditory localization belongs to any of the detected visual targets and comparing to the baseline probability that the auditory target has not been detected, visually.⇒
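The association step can be sketched as follows. The notes do not specify Yan et al.'s probability model, so this example assumes a Gaussian likelihood of the auditory azimuth around each visual target and a fixed baseline for the "heard but not seen" case; `sigma_deg`, `baseline`, and the function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Density of a normal distribution N(mean, sigma^2) at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def locate_av_object(audio_az, visual_azs, sigma_deg=5.0, baseline=0.01):
    """Return (azimuth, seen).

    The auditory localization is associated with the visual target that
    explains it best. If no target beats the baseline for an unseen
    source, the auditory localization itself is returned.
    """
    best_az, best_score = None, baseline
    for v in visual_azs:
        score = gaussian_pdf(audio_az, v, sigma_deg)
        if score > best_score:
            best_az, best_score = v, score
    if best_az is None:
        return audio_az, False
    return best_az, True


print(locate_av_object(10.0, [-20.0, 8.0, 30.0]))  # close to the target at 8
print(locate_av_object(60.0, [-20.0, 8.0, 30.0]))  # no plausible visual target
```

Note that, as in the description above, the visual position replaces the auditory one rather than being averaged with it, so this is association and not integration.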
Yan et al. do not evaluate the accuracy of audio-visual localization.⇒
Yan et al. report an accuracy of auditory localization of $3.4^\circ$ for online learning and $0.9^\circ$ for offline calibration.⇒
Yan et al. perform sound source localization using both ITD and ILD. Some of their auditory processing is bio-inspired.⇒
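For reference, a generic way to obtain an ITD-based azimuth is to find the lag that maximizes the cross-correlation of the two channels and convert it with a far-field approximation. This is a textbook-style sketch, not Yan et al.'s bio-inspired processing; the function names, the sign convention (positive azimuth toward the right microphone), and the default speed of sound are assumptions.

```python
import math


def estimate_lag(left, right, max_lag):
    """Delay of `left` relative to `right` in samples.

    Positive when the sound reaches the right microphone first.
    Chosen as the lag maximizing the cross-correlation.
    """
    def xcorr(lag):
        return sum(
            right[n] * left[n + lag]
            for n in range(len(right))
            if 0 <= n + lag < len(left)
        )
    return max(range(-max_lag, max_lag + 1), key=xcorr)


def itd_to_azimuth_deg(lag, sample_rate, mic_distance_m, speed_of_sound=343.0):
    """Far-field approximation: sin(azimuth) = c * ITD / d."""
    itd = lag / sample_rate
    s = max(-1.0, min(1.0, speed_of_sound * itd / mic_distance_m))
    return math.degrees(math.asin(s))


right = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
left = [0, 0, 0, 0, 1, 2, 1, 0, 0, 0]  # same pulse, two samples later
lag = estimate_lag(left, right, max_lag=4)
```

An ILD cue could be combined with this estimate, but how Yan et al. weight the two cues is not stated in these notes.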
Voges et al. present an engineering approach to audio-visual active speaker localization.⇒
Voges et al. use the strength of the visual detection signal (the peak value of the column-wise sum of the difference image) as a proxy for the confidence of visual detection.
They use visual localization whenever this signal strength is above a certain threshold, and auditory localization if it is below that threshold.⇒
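The switching rule can be sketched as below, assuming the difference image is given as a list of rows of non-negative pixel differences. The function names, the return format, and the choice to report the visual position as a column index are assumptions for this illustration; the threshold value would have to be tuned for a real system.

```python
def column_sum_peak(diff_image):
    """Return (peak_column, peak_value) of the column-wise sum."""
    sums = [sum(col) for col in zip(*diff_image)]
    peak_col = max(range(len(sums)), key=sums.__getitem__)
    return peak_col, sums[peak_col]


def select_localization(diff_image, audio_estimate, threshold):
    """Use the visual peak if it is strong enough, else fall back to audio."""
    col, strength = column_sum_peak(diff_image)
    if strength > threshold:
        return ("visual", col)
    return ("audio", audio_estimate)


diff = [
    [0, 5, 1],
    [0, 7, 0],
]
print(select_localization(diff, audio_estimate=42, threshold=10))
print(select_localization(diff, audio_estimate=42, threshold=20))
```

In this sketch the two modalities are never fused: exactly one of them determines the output for each frame.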
In Voges et al.'s system, auditory localization serves as a backup in case visual localization fails, and for disambiguation in case more than one visual target is detected.⇒
Voges et al. do not evaluate the accuracy of audio-visual localization.⇒
Aarabi presents a system for audio-visual localization in azimuth and depth, which is demonstrated in an active-speaker localization task.⇒
Kushal et al. present an engineering approach to audio-visual active speaker localization.⇒
Kushal et al. do not evaluate the accuracy of audio-visual localization quantitatively. They do show a graph of visual-only, audio-visual, and audio-visual plus temporal localization during one test run. That graph seems to indicate that multisensory and temporal integration prevent misdetections; they do not seem to improve localization much.⇒
Kushal et al. use an EM algorithm to integrate audio-visual information for active speaker localization statically and over time.⇒
Studies on audio-visual active speaker localization usually do not report in-depth evaluations of audio-visual localization accuracy. The probable reason is that auditory information is only used as a backup for cases in which visual localization fails, or for disambiguation when visual information is not sufficient to tell which of the visual targets is the active speaker.
When visual detection succeeds, it is usually precise enough.
Therefore, active speaker localization is probably a misnomer. It should be called active speaker identification.⇒
In one of their experiments, Warren et al. had their subjects localize visual or auditory components of visual-auditory stimuli (videos of people speaking and the corresponding sound).
Stimuli were made `compelling' by playing video and audio in sync and `uncompelling' by introducing a temporal offset.
They found that their subjects performed as if under a ``unity assumption'' when told they would perceive cross-sensory stimuli and when the stimuli were `compelling', and as if under a low ``unity assumption'' when they were told there could be separate auditory or visual stimuli and/or the stimuli were made `uncompelling'.⇒
Bell et al. found that playing a sound before a visual target stimulus did not increase activity in the neurons they monitored for long enough to lead to (neuron-level) multisensory integration.⇒