Show Tag: localization

Li et al. present a purely engineering-based approach to active speaker localization. Their system uses Viola and Jones' object detection algorithm for face detection and cross-correlation for auditory speaker localization.

Ravulakollu et al. loosely use the superior colliculus as a metaphor for their robotic visual-auditory localization.

Ravulakollu et al. argue against SOMs and for radial basis functions (RBF) for combining stimuli (for reasons I don't quite understand).

In many audio-visual localization tasks, humans integrate information optimally.

Reward-mediated learning has been demonstrated in the adaptation of orienting behavior.

Horizontal localizations are population-coded in the central nucleus of the mustache bat inferior colliculus.

There are a number of approaches to audio-visual localization, some implemented on actual robots and some only as theoretical ANN or algorithmic models.

Rucci et al. present an algorithm which performs auditory localization and combines auditory and visual localization in a common SC map. The mapping between the representations is learned using value-dependent learning.

Rucci et al.'s plots of ICc activation look very similar to Jorge's IPD matrices.

Magosso et al. present a recurrent ANN model which replicates the ventriloquism effect and the ventriloquism aftereffect.

Irrelevant auditory stimuli can dramatically improve or degrade performance in visual orienting tasks:

In Wilkinson et al.'s experiments, cats' performance in orienting towards near-threshold, medial visual stimuli was much improved by irrelevant auditory stimuli close to the visual stimuli and drastically degraded by irrelevant auditory stimuli far from the visual stimuli.

If visual stimuli were further toward the edge of the visual field, then lateral auditory stimuli improved their detection rate even when the two stimuli were spatially disparate.

Chemical deactivation of AES reduces both the auditory-induced improvement and the auditory-induced degradation of performance in orienting towards visual stimuli.

ITD and ILD are most useful for auditory localization in different frequency ranges:

  • In the low frequency ranges, ITD is most informative for auditory localization.
  • In the high frequency ranges, ILD is most informative for auditory localization.

Auditory localization is different from visual or haptic localization in that stimulus location is not encoded by which neural receptors are stimulated but by the differential timing and intensity of the stimulation of receptors in the two ears.

It's easier to separate a target sound from a blanket of background noise if target sound and background noise have different ITDs.

Interaural time and level difference do not help (much) in localizing sounds in the vertical plane. Spectral cues—cues in the change of the frequencies in the sound due to differential reflection from various body parts—help us do that.

There seem to be significant differences in SOC organization between higher mammals and rodents.

Sound-source localization using head-related impulse response functions is precise, but computationally expensive.

Wan et al. use simple cross-correlation (which is computationally cheap, but not very precise) to localize sounds roughly. They then use the rough estimate to speed up MacDonald's cross-channel algorithm which uses head-related impulse response functions.

MacDonald proposes two methods for sound source localization based on head-related transfer functions (actually the HRIR, their representation in the time domain).

The first method for SSL proposed by MacDonald applies, for each microphone $i$ and every candidate angle $\theta$, the inverse of the HRIR $F^{(i,\theta)}$ to the signal recorded by microphone $i$. It then uses the Pearson correlation coefficient to compare the resultant signals. Only for the correct angle $\theta$ should the signals match.
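
A minimal sketch of this first method, assuming the candidate HRIRs are available as a dictionary keyed by angle; the regularized frequency-domain deconvolution and the function names are my own choices, not necessarily MacDonald's exact procedure:

```python
import numpy as np

def deconvolve(signal, hrir, eps=1e-6):
    """Apply the (regularized) inverse of an HRIR in the frequency domain."""
    n = len(signal) + len(hrir) - 1
    S = np.fft.rfft(signal, n)
    H = np.fft.rfft(hrir, n)
    # Tikhonov-style regularization avoids dividing by near-zero bins.
    return np.fft.irfft(S * np.conj(H) / (np.abs(H) ** 2 + eps), n)

def pearson(a, b):
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def localize_by_inverse_filtering(left, right, hrirs):
    """hrirs: dict mapping candidate angle theta -> (hrir_left, hrir_right).
    For the true angle, inverse filtering both channels should recover
    (approximately) the same source signal, so their correlation peaks."""
    best_angle, best_score = None, -np.inf
    for theta, (h_l, h_r) in hrirs.items():
        score = pearson(deconvolve(left, h_l), deconvolve(right, h_r))
        if score > best_score:
            best_angle, best_score = theta, score
    return best_angle
```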

The second method for (binaural) SSL proposed by MacDonald applies, for every candidate angle $\theta$, the HRIR $F^{(o,\theta)}$ of the respective opposite microphone $o$ to the signals recorded by the left and right microphones. It then uses the Pearson correlation coefficient to compare the resultant signals. Only for the correct angle $\theta$ should the signals match.
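
This second method avoids inverse filtering: for the true angle, the left recording is roughly the source convolved with the left-ear HRIR and the right recording the source convolved with the right-ear HRIR, so filtering each channel with the opposite ear's HRIR should yield two nearly identical signals. A sketch under the same assumptions as above (dictionary of candidate HRIRs, equal-length channels):

```python
import numpy as np

def localize_by_cross_channel_filtering(left, right, hrirs):
    """hrirs: dict mapping candidate angle theta -> (hrir_left, hrir_right).
    Convolution commutes, so for the correct angle
    left * h_r ~ source * h_l * h_r ~ right * h_l."""
    best_angle, best_score = None, -np.inf
    for theta, (h_l, h_r) in hrirs.items():
        a = np.convolve(left, h_r)   # left channel through the right-ear HRIR
        b = np.convolve(right, h_l)  # right channel through the left-ear HRIR
        score = np.corrcoef(a, b)[0, 1]
        if score > best_score:
            best_angle, best_score = theta, score
    return best_angle
```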

The binaural sound-source localization methods proposed by MacDonald can be extended to larger arrays of microphones.

Cross-correlation can be used to estimate the ITD of a sound perceived in two ears.
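
A minimal sketch of ITD estimation via the peak of the cross-correlation; the sampling rate, signal lengths, and sign convention below are my own choices:

```python
import numpy as np

def estimate_itd(left, right, fs):
    """ITD in seconds from the peak of the cross-correlation of two
    equal-length channels.  With this convention a positive value means
    the left channel lags, i.e. the sound reached the right ear first."""
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)    # lag in samples
    return lag / fs

# Toy usage: white noise reaching the right microphone 10 samples earlier.
fs = 16000
s = np.random.default_rng(0).standard_normal(1024)
left = np.concatenate([np.zeros(10), s])
right = np.concatenate([s, np.zeros(10)])
print(estimate_itd(left, right, fs))            # ~ 10 / 16000 = 0.000625 s
```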

Rucci et al. claim a mean localization error of $1.54^\circ \pm 1.01^\circ$ (± presumably meaning standard error) for their system's auditory localization of white-noise stimuli at directions within $[-60^\circ, 60^\circ]$.

Casey et al. use their ANN in a robotic system for audio-visual localization.

Casey et al. focus on making their system work in real time and with complex stimuli and compromise on biological realism.

In Casey et al.'s system, ILD alone is used for SSL.

In Casey et al.'s experiments, the two microphones are one meter apart and the stimulus is one meter away from the midpoint between them. There is no damping body between the microphones, but at that interaural distance and that distance to the stimulus, ILD should still be a good localization cue.
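
A minimal sketch of an ILD computation over one frame of a two-channel recording; the decibel formulation and the free-field assumption are mine, and mapping the resulting ILD to an azimuth would still require a model or calibration (in Casey et al.'s case presumably their ANN):

```python
import numpy as np

def interaural_level_difference(left, right, eps=1e-12):
    """ILD in dB for one frame of a two-channel recording.
    Positive values: the left channel is louder (source towards the left),
    assuming no damping body between the two microphones."""
    rms_left = np.sqrt(np.mean(np.square(left))) + eps
    rms_right = np.sqrt(np.mean(np.square(right))) + eps
    return 20.0 * np.log10(rms_left / rms_right)
```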

The way many sensory organs work naturally provides a homomorphic mapping from the location of a stimulus into the population of peripheral sensory neurons:

The location of a visual stimulus determines which part of the retina is stimulated.

The identity of a peripheral somesthetic neuron immediately identifies the location of sensory stimulation on the body surface.

The identity of peripheral auditory neurons responding to an auditory stimulus is not dependent on the location of that stimulus.

Instead, localization cues must be extracted from the temporal dynamics and spectral properties of binaural auditory signals.

This is in contrast with visual and somesthetic localization.

Sanchez-Riera et al. use a probabilistic model for audio-visual active speaker localization on a humanoid robot (the Nao robot).

Sanchez-Riera et al. use the Bayesian information criterion to choose the number of speakers in their audio-visual active speaker localization system.

Sanchez-Riera et al. use the Waldboost face detection system for visual processing.

Yan et al. present a system which uses auditory and visual information to learn an audio-motor map (in a functional sense) and orient a robot towards a speaker. Learning is online.

Yan et al. use the standard Viola-Jones face detection algorithm for visual processing.

Yan et al. explicitly do not integrate auditory and visual localization. Given multiple visual localizations and one auditory localization, they associate the auditory localization with the closest visual localization and take that visual localization as the location of the audio-visual object.

In determining the position of the audio-visual object, Yan et al. handle the possibility that the actual source of the stimulus has only been heard, not seen. They decide whether that is the case by estimating the probability that the auditory localization belongs to any of the detected visual targets and comparing it to the baseline probability that the auditory target was not detected visually.
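
A sketch of that kind of decision rule, with a hypothetical Gaussian model of auditory localization error and made-up parameters (`sigma_a`, `p_unseen`, `fov_deg`); Yan et al.'s actual probability model may differ:

```python
import numpy as np

def associate_audio_with_visual(theta_audio, visual_thetas,
                                sigma_a=5.0, p_unseen=0.1, fov_deg=120.0):
    """Return the index of the visual detection the auditory localization
    most probably belongs to, or None if the 'heard but not seen'
    hypothesis is more probable.  All parameters are illustrative."""
    visual_thetas = np.asarray(visual_thetas, dtype=float)
    if len(visual_thetas) == 0:
        return None
    # Gaussian likelihood of the auditory estimate given each visual target.
    lik = np.exp(-0.5 * ((theta_audio - visual_thetas) / sigma_a) ** 2)
    lik /= sigma_a * np.sqrt(2.0 * np.pi)
    prior_per_target = (1.0 - p_unseen) / len(visual_thetas)
    scores = prior_per_target * lik
    # Baseline: the source was not detected visually (uniform over the field of view).
    baseline = p_unseen / fov_deg
    best = int(np.argmax(scores))
    return best if scores[best] > baseline else None
```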

Yan et al. do not evaluate the accuracy of audio-visual localization.

Yan et al. report an accuracy of auditory localization of $3.4^\circ$ for online learning and $0.9^\circ$ for offline calibration.

Yan et al. perform sound source localization using both ITD and ILD. Some of their auditory processing is bio-inspired.

Voges et al. use ITDs (computed via generalized cross-correlation) for sound-source localization.
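
A common form of generalized cross-correlation is GCC-PHAT; the PHAT weighting below is an assumption on my part (Voges et al. may use a different weighting):

```python
import numpy as np

def gcc_phat(left, right, fs):
    """Time delay (seconds) between two equal-length channels using
    generalized cross-correlation with PHAT weighting.
    Positive values: the left channel lags the right."""
    n = len(left) + len(right)
    L = np.fft.rfft(left, n)
    R = np.fft.rfft(right, n)
    cross = L * np.conj(R)
    cross /= np.abs(cross) + 1e-12      # PHAT: keep only the phase
    cc = np.fft.irfft(cross, n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = np.argmax(np.abs(cc)) - max_shift
    return lag / fs
```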

Voges et al. present an engineering approach to audio-visual active speaker localization.

Voges et al. use a difference image to detect and localize moving objects (humans).

Voges et al. use the strength of the visual detection signal (the peak value of the column-wise sum of the difference image) as a proxy for the confidence of visual detection.

They use visual localization whenever this signal strength is above a certain threshold, and auditory localization if it is below that threshold.
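
A minimal sketch of this scheme; the pixel-to-azimuth mapping, the field of view, and the threshold are placeholders, not values from Voges et al.:

```python
import numpy as np

def visual_detection(frame, prev_frame, fov_deg=60.0):
    """Column-wise sum of the absolute difference image: the peak column
    gives a rough horizontal position, its height a detection strength."""
    diff = np.abs(frame.astype(float) - prev_frame.astype(float))
    column_energy = diff.sum(axis=0)
    col = int(np.argmax(column_energy))
    strength = float(column_energy[col])
    azimuth = (col / (frame.shape[1] - 1) - 0.5) * fov_deg   # linear mapping
    return azimuth, strength

def fused_azimuth(visual_azimuth, visual_strength, auditory_azimuth, threshold):
    """Use the visual estimate when the detection is strong enough,
    otherwise fall back on the auditory estimate."""
    return visual_azimuth if visual_strength > threshold else auditory_azimuth
```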

In Voges et al.'s system, auditory localization serves as a backup in case visual localization fails, and for disambiguation in case more than one visual target is detected.

Voges et al. suggest using a Kalman or particle filter to integrate information about the speaker's position over time.
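
As a concrete illustration of that suggestion, a minimal constant-velocity Kalman filter for tracking the speaker's azimuth over time; the state model and noise parameters are illustrative, not taken from the paper:

```python
import numpy as np

class AzimuthKalmanFilter:
    """Minimal constant-velocity Kalman filter for tracking a speaker's
    azimuth (degrees) over time.  Noise parameters are illustrative."""

    def __init__(self, q=1.0, r=4.0):
        self.x = np.zeros(2)            # state: [azimuth, azimuth rate]
        self.P = np.eye(2) * 100.0      # large initial uncertainty
        self.q, self.r = q, r           # process / measurement noise levels

    def step(self, z, dt=0.1):
        """Predict one step of length dt, then update with the
        (audio-visual) azimuth measurement z.  Returns the filtered azimuth."""
        F = np.array([[1.0, dt], [0.0, 1.0]])
        Q = self.q * np.array([[dt**3 / 3, dt**2 / 2], [dt**2 / 2, dt]])
        H = np.array([[1.0, 0.0]])      # only the azimuth is measured
        # Predict.
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + Q
        # Update.
        S = H @ self.P @ H.T + self.r
        K = self.P @ H.T / S
        self.x = self.x + (K * (z - H @ self.x)).ravel()
        self.P = (np.eye(2) - K @ H) @ self.P
        return float(self.x[0])
```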

Voges et al. do not evaluate the accuracy of audio-visual localization.

Aarabi present a system for audio-visual localization in azimuth and depth which they demonstrate in an active-speaker localization task.

Aarabi choose (adaptive) difference images for visual localization to avoid relying on domain knowledge.

Aarabi use ITD (computed using cross-correlation) and ILD in an array of 3 microphones for auditory localization.

Kushal et al. present an engineering approach to audio-visual active speaker localization.

Kushal et al. do not evaluate the accuracy of audio-visual localization quantitatively. They do show a graph for visual-only, audio-visual, and audio-visual and temporal localization during one test run. That graph seems to indicate that multisensory and temporal integration prevent misdetections—they do not seem to improve localization much.

Kushal et al. use an EM algorithm to integrate audio-visual information for active speaker localization statically and over time.
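
A toy illustration (not Kushal et al.'s actual model) of how an EM procedure over a window of pooled audio and visual azimuth detections can both fuse the modalities and suppress misdetections, by treating each detection as coming either from the speaker or from uniform clutter:

```python
import numpy as np

def em_speaker_azimuth(detections, sigma=5.0, clutter_density=1.0 / 120.0,
                       p_clutter=0.3, iters=20):
    """Toy EM: pooled audio and visual azimuth detections (degrees) are
    modeled as either Gaussian around the speaker or uniform clutter.
    EM re-estimates the speaker azimuth while down-weighting misdetections.
    All parameters are illustrative."""
    detections = np.asarray(detections, dtype=float)
    mu = float(np.median(detections))                 # robust initialization
    for _ in range(iters):
        # E-step: responsibility that each detection came from the speaker.
        gauss = np.exp(-0.5 * ((detections - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        num = (1.0 - p_clutter) * gauss
        resp = num / (num + p_clutter * clutter_density)
        # M-step: responsibility-weighted mean of the detections.
        mu = float(np.sum(resp * detections) / np.sum(resp))
    return mu

# Example: detections clustered around 12 degrees plus two misdetections.
print(em_speaker_azimuth([11.0, 12.5, 13.0, 11.8, -40.0, 55.0]))   # ~ 12
```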

Acoustic localization cues change from far-field conditions (distance to stimulus $>1\,\mathrm{m}$) to near-field conditions ($\leq 1\,\mathrm{m}$).

There are fine-structure and envelope ITDs. Humans are sensitive to both, but do not weight envelope ITDs very strongly when localizing sound sources.

Some congenitally unilaterally deaf people develop close-to-normal auditory localization capabilities. These people probably learn to use spectral SSL cues.

Humans use a variety of cues to estimate the distance to a sound source. This estimate is much less precise than estimates of the direction towards the sound source.

The SC localizes events.

SC receives tactile localization-related inputs from the trigeminal nucleus.

SC receives auditory localization-related inputs from the IC.

The optic tectum (OT) receives information on sound source localization from ICx.

Bertelson et al. did not find a shift of sound source localization due to manipulated endogenous visual spatial attention—localization was shifted only due to (the salience of) light flashes which would induce (automatic, mandatory) exogenous attention.

Wozny et al. found in an audio-visual localization experiment that a majority of their participants' performance was best explained by the statistically sub-optimal probability matching strategy.
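
The difference between the two response strategies, sketched for a posterior over candidate interpretations (the numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
posterior = np.array([0.7, 0.3])   # e.g. P(common cause), P(separate causes)

def respond_optimal(posterior):
    """Statistically optimal selection: always report the most probable interpretation."""
    return int(np.argmax(posterior))

def respond_probability_matching(posterior):
    """Probability matching: choose each interpretation with a frequency
    equal to its posterior probability (sub-optimal on average)."""
    return int(rng.choice(len(posterior), p=posterior))
```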

In an audio-visual localization task, Wallace et al. found that their subjects' localizations of the auditory stimulus were usually biased towards the visual stimulus whenever the two stimuli were perceived as one, and vice versa.

Ernst and Banks show that humans combine visual and haptic information optimally in a height estimation task.

Alais and Burr found in an audio-visual localization experiment that the ventriloquism effect can be interpreted by a simple cue weighting model of human multi-sensory integration:

Their subjects weighted visual and auditory cues depending on their reliability. The weights they used were consistent with MLE. In most situations, visual cues are much more reliable for localization than are auditory cues. Therefore, a visual cue is given so much greater weight that it captures the auditory cue.
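
A minimal sketch of this reliability-weighted (MLE) combination rule for independent Gaussian cues, with made-up numbers illustrating visual capture:

```python
import numpy as np

def mle_combine(estimates, variances):
    """Reliability-weighted (maximum-likelihood) cue combination for
    independent Gaussian cues: each weight is the cue's inverse variance."""
    estimates = np.asarray(estimates, dtype=float)
    precisions = 1.0 / np.asarray(variances, dtype=float)
    weights = precisions / precisions.sum()
    combined = float(np.sum(weights * estimates))
    combined_var = float(1.0 / precisions.sum())
    return combined, combined_var

# Example: a reliable visual cue (sigma = 1 deg) and a noisy auditory cue
# (sigma = 8 deg) 10 degrees apart; the combined estimate sits close to
# the visual cue, i.e. the ventriloquism effect in MLE terms.
print(mle_combine([0.0, 10.0], [1.0**2, 8.0**2]))   # ~ (0.15, 0.98)
```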

People look where they point and point where they look.

Reasons why pointing and gazing are so closely connected may be

  • that gaze guides pointing,
  • that gazing and pointing use the same information,
  • or that a common motor command guides both.

Individually, auditory cues are highly ambiguous with respect to auditory localization.

Combining cues across auditory cue types and frequency channels is needed to arrive at a meaningful localization.

Auditory localization within the so-called cone of confusion can be disambiguated using spectral cues: changes in the spectral shape of a sound due to how it reflects off, bounces around, and passes through features of an animal's body. Such changes can only be detected for known sounds.

Auditory sound source localization is made effective through the combination of different types of cues across frequency channels. It is thus most reliable for familiar broad-band sounds.

If visual cues were absolutely necessary for the formation of an auditory space map, then no auditory space map should develop without visual cues. Since an auditory space map develops also in blind(ed) animals, visual cues cannot be strictly necessary.

Many localized perceptual events are either only visual or only auditory. It is therefore not plausible that only audio-visual percepts contribute to the formation of an auditory space map.

Visual information plays a role, but does not seem to be necessary for the formation of an auditory space map.

The auditory space maps developed by animals without patterned visual experience seem to be degraded only in some species (in guinea pigs and barn owls, but not in ferrets or cats).

Self-organization may play a role in organizing auditory localization independent of visual input.

Visual input does, however, seem to be necessary to bring the auditory and visual space maps into register.