Show Tag: auditory


Li et al. present a purely engineering-based approach to active speaker localization. Their system uses Viola and Jones' object detection algorithm for face detection and cross-correlation for auditory speaker localization.
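
The auditory part of such a pipeline can be sketched in a few lines. The following minimal Python illustration (not Li et al.'s actual implementation; the signal names, microphone spacing, and far-field geometry are assumptions) estimates the delay between two microphone signals from the peak of their cross-correlation and converts it into an azimuth.

    import numpy as np

    def crosscorr_azimuth(left, right, fs, mic_distance=0.2, c=343.0):
        """Estimate source azimuth from two microphone signals.

        left, right  -- 1-D arrays of equal length (example inputs)
        fs           -- sampling rate in Hz
        mic_distance -- microphone spacing in metres (assumed value)
        c            -- speed of sound in m/s
        """
        # Full cross-correlation; the position of its maximum gives the lag
        # (in samples) at which the two signals are most similar.
        cc = np.correlate(left, right, mode="full")
        lag = np.argmax(cc) - (len(right) - 1)   # lag > 0: sound hit the right mic first
        itd = lag / fs                           # inter-microphone time difference in s
        # Far-field approximation: itd = (mic_distance / c) * sin(azimuth).
        sin_az = np.clip(itd * c / mic_distance, -1.0, 1.0)
        return np.degrees(np.arcsin(sin_az))     # degrees, positive toward the right mic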

In many audio-visual localization tasks, humans integrate information optimally.

In some audio-visual time discrimination tasks, humans do not integrate information optimally.

Activity in LIP is influenced by auditory stimuli.

Congenital blindness leads to tactile and auditory stimuli activating early dorsal cortical visual areas.

There are significant projections from auditory cortex as well as from polysensory areas in the temporal lobe to parts of V1 where receptive fields are peripheral.

Activity in the auditory cortex is modulated by visual stimuli.

Auditory cortex is influenced by visual stimuli.

Early levels of the auditory pathways are tonotopic.

Visual capture is weaker for stimuli in the periphery, where visual localization is less reliable relative to auditory localization than it is at the center of the visual field.

Horizontal sound-source locations are population-coded in the central nucleus of the mustache bat's inferior colliculus.

Person tracking can combine cues from single modalities (like motion and color cues), or from different modalities (like auditory and visual cues).

Palmer and Ramsey show that lack of awareness of a visual lip stream does not inhibit learning of its relevance for a visual localization task: the subliminal lip stream influences visual attention and affects the subjects' performance.

They also showed that similar subliminal lip streams did not affect the occurrence of the McGurk effect.

Together, this suggests that awareness of a visual stimulus is not always needed for it to guide visual attention, but is sometimes needed for multisensory integration to occur (following Palmer and Ramsey's definition).

Laurenti et al. found in an audio-visual color identification task that redundant, congruent, semantic auditory information (the utterance of a color word) can decrease the latency of the response to a stimulus (the color of a circle displayed to the subject). Incongruent semantic visual or auditory information (a written or uttered color word) can increase response latency. However, congruent semantic visual information (a written color word) does not decrease response latency.

The latency enhancements in Laurenti et al.'s audio-visual color discrimination experiments were greater (i.e. response latencies were shorter) than predicted by the race model.
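
For reference, the bound usually meant by `the race model' is Miller's inequality; redundancy gains that exceed it cannot be explained by a race between independent unisensory processes:

\[
P(\mathit{RT}_{AV} \le t) \;\le\; P(\mathit{RT}_{A} \le t) + P(\mathit{RT}_{V} \le t) \quad \text{for all } t.
\]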

Localization of audiovisual targets is usually determined more by the location of the visual sub-target than by that of the auditory sub-target.

Especially in situations where visual stimuli are seen clearly and thus localized very easily, this can lead to the so-called ventriloquism effect (aka `visual capture') in which a sound source seems to be localized at the location of the visual target although it is in fact a few degrees away from it.

Both visual and auditory neurons in the deep SC usually prefer moving stimuli and are direction selective.

The range of directions deep SC neurons are selective for is usually wide.

Some cells in FEF respond to auditory stimuli.

Stimulating cells in FEF whose activity is elevated before a saccade of a given direction and amplitude usually generates a saccade of that direction and amplitude.

FAES is not exclusively auditory.

AEV is not exclusively (but mostly) visual.

ITD and ILD are most useful for auditory localization in different frequency ranges (a rough geometric sketch follows the list):

  • In the low frequency ranges, ITD is most informative for auditory localization.
  • In the high frequency ranges, ILD is most informative for auditory localization.
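
A rough geometric sketch (using Woodworth's spherical-head approximation, an assumption added here, not part of the cited material): for a head of radius $a$, speed of sound $c$, and azimuth $\theta$,

\[
\mathrm{ITD}(\theta) \approx \frac{a}{c}\,(\theta + \sin\theta).
\]

At high frequencies the period of the sound becomes comparable to or shorter than this interaural delay, so the phase-based ITD cue becomes ambiguous; the ILD, in turn, relies on head shadowing, which is only pronounced when the wavelength is small relative to the head.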

There seem to be significant differences in SOC organization between higher mammals and rodents.

Rearing barn owls in darkness results in mis-alignment of auditory and visual receptive fields in the owls' optic tectum.

Rearing barn owls in darkness results in discontinuities in the map of auditory space of the owls' optic tectum.

The way sound is shaped by the head and body before reaching the ears of a listener is described by a head-related transfer function (HRTF). There is a different HRTF for every angle of incidence.

A head-related transfer function summarizes ITD, ILD, and spectral cues for sound-source localization.
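
As a minimal computational sketch of what `shaping by an HRTF' amounts to (not taken from any of the cited works; `hrir_left' and `hrir_right' are hypothetical arrays, e.g. from some measured HRTF set), rendering a source at a given angle is a convolution of the dry signal with that angle's pair of head-related impulse responses:

    import numpy as np
    from scipy.signal import fftconvolve

    def render_binaural(mono, hrir_left, hrir_right):
        """Simulate a source at the angle the HRIR pair was measured for.

        mono                  -- 1-D array, dry (anechoic) source signal
        hrir_left, hrir_right -- 1-D arrays of equal length, head-related
                                 impulse responses for the two ears
                                 (hypothetical inputs)
        Returns an (N, 2) array holding the left and right ear signals.
        """
        left = fftconvolve(mono, hrir_left)     # the ear signals inherit the ITD,
        right = fftconvolve(mono, hrir_right)   # ILD and spectral cues encoded in
        return np.stack([left, right], axis=1)  # the impulse responses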

Sound-source localization based only on binaural cues (like ITD or ILD) suffers from an ambiguity due to the approximate point symmetry of the head: ITD and ILD identify only a `cone of confusion', i.e. a virtual cone whose tip is at the center of the head and whose axis is the interaural axis, not strictly a single angle of incidence.

Spectral cues provide disambiguation: due to the asymmetry of the head, the sound is shaped differently depending on where on a cone of confusion a sound source is.

Talagala et al. measured the head-related transfer function (HRTF) of a dummy head and body in a semi-anechoic chamber and used this HRTF for sound-source localization experiments.

Talagala et al.'s system can reliably localize sounds in all directions around the dummy head.

Sound-source localization using head-related impulse response functions is precise, but computationally expensive.

Wan et al. use simple cross-correlation (which is computationally cheap, but not very precise) to localize sounds roughly. They then use the rough estimate to speed up MacDonald's cross-channel algorithm which uses head-related impulse response functions.

MacDonald proposes two methods for sound-source localization based on head-related transfer functions (more precisely, on HRIRs, their time-domain representations).

The first method for SSL proposed by MacDonald applies, for each microphone $i$ and every candidate angle $\theta$, the inverse of the HRIR $F^{(i,\theta)}$ to the signal recorded by microphone $i$. It then uses the Pearson correlation coefficient to compare the resulting signals. Only for the correct angle $\theta$ should the signals match.

The second method for (binaural) SSL proposed by MacDonald applies, for every candidate angle $\theta$, the HRIR $F^{(o,\theta)}$ to the signal recorded by the left and by the right microphone, where $o$ denotes the respective opposite microphone. It then uses the Pearson correlation coefficient to compare the resulting signals. Only for the correct angle $\theta$ should the signals match.
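
A minimal Python sketch of this second method, assuming that the recorded ear signals have equal length and that `hrirs' is a hypothetical dictionary mapping each candidate angle to its (left, right) HRIR pair (the data layout and names are illustrative, not MacDonald's):

    import numpy as np
    from scipy.signal import fftconvolve

    def localize_cross_channel(left, right, hrirs):
        """Binaural SSL by cross-applying the opposite ear's HRIR.

        left, right -- recorded ear signals (1-D arrays of equal length)
        hrirs       -- {angle: (hrir_left, hrir_right)} for all candidate
                       angles (hypothetical data structure)
        """
        best_angle, best_score = None, -np.inf
        for angle, (h_left, h_right) in hrirs.items():
            # Filter each ear's signal with the *opposite* ear's HRIR for
            # this candidate angle; at the true angle both products equal
            # the source convolved with both HRIRs and therefore match.
            a = fftconvolve(left, h_right)[:len(left)]
            b = fftconvolve(right, h_left)[:len(right)]
            score = np.corrcoef(a, b)[0, 1]      # Pearson correlation
            if score > best_score:
                best_angle, best_score = angle, score
        return best_angle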

The binaural sound-source localization methods proposed by MacDonald can be extended to larger arrays of microphones.

The superficial cat SC is not responsive to auditory stimuli.

Some neurons in the dSC respond to an auditory stimulus with a single spike at its onset, some with sustained activity over the duration of the stimulus.

Middlebrooks and Knudsen report sharply delineated auditory receptive fields in some neurons of the deep cat SC, each containing a best area from which stimuli elicit a stronger response than from other places in the RF.

A minority of deep SC neurons are omnidirectional, responding to sounds anywhere, albeit with a defined best area.

There is a map of auditory space in the deep superior colliculus.

There is considerable variability in the sharpness of spatial tuning in the responses to auditory stimuli of deep SC neurons.

The visual and auditory maps in the deep SC are in spatial register.

In the deep SC of the owl, auditory receptive fields tend to be larger than, and to contain, the corresponding visual receptive fields.

The superficial SC of the owl is strongly audio-visual.

The superficial mouse SC is not responsive to auditory or tactile stimuli.

The receptive fields of certain neurons in the cat's deep SC shift when the eye position is changed. Thus, the map of auditory space in the deep SC is temporarily realigned to stay in register with the retinotopic map.

There are projections from auditory cortex to SC (from anterior ectosylvian gyrus).

Voges et al. use ITDs (computed via generalized cross-correlation) for sound-source localization.
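
The exact weighting Voges et al. use is not recorded here; a common choice for generalized cross-correlation is the PHAT weighting, sketched below in Python under that assumption (the function name and the small epsilon are illustrative):

    import numpy as np

    def gcc_phat(x, y, fs):
        """Delay of x relative to y via GCC-PHAT (positive if x arrives later)."""
        n = len(x) + len(y) - 1          # pad so the correlation is linear, not circular
        X = np.fft.rfft(x, n)
        Y = np.fft.rfft(y, n)
        R = X * np.conj(Y)
        R /= np.abs(R) + 1e-12           # PHAT weighting: keep only the phase
        cc = np.fft.irfft(R, n)
        lag = int(np.argmax(np.abs(cc)))
        if lag > n // 2:                 # map indices in the upper half to negative lags
            lag -= n
        return lag / fs                  # delay in seconds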

There are fine-structure and envelope ITDs. Humans are sensitive to both, but do not weight envelope ITDs very strongly when localizing sound sources.

Some congenitally unilaterally deaf people develop close-to-normal auditory localization capabilities. These people probably learn to use spectral SSL cues.

Humans use a variety of cues to estimate the distance to a sound source. This estimate is much less precise than estimates of the direction towards the sound source.

Kadunce et al. found that two auditory stimuli placed at opposite edges of a neuron's receptive field, in its suppressive zone, elicited some activity in the neuron (although less than they expected).

Reverberation and noise can degrade speech recognition.

Speech recognition models are usually trained with noise- and reverberation-free speech samples.

Humans adapt to an auditory scene's reverberation and noise conditions. They use visual scene recognition to recall reverberation and noise conditions of familiar environments.

A system that stores multiple speech recognition models trained for different environments and retrieves them guided by visual scene recognition improves speech recognition in reverberant and noisy environments.

ICx projects to intermediate and deep layers of SC.

The shift in the auditory map in ICx comes with changed projections from ICc to ICx.

There appears to be plasticity with respect to the auditory space map in the SC.

SC receives input from and represents all sensory modalities used in phasic orienting: vision, audition, somesthesis (haptic), nociception, infrared sensing, electroreception, magnetoreception, and echolocation.

The nucleus of the brachium of the inferior colliculus (nbic) projects to intermediate and deep layers of SC.

SC receives auditory localization-related inputs from the IC.

Sub-threshold multisensory neurons respond directly to only one modality; however, the strength of the response is strongly influenced by input from another modality.

Ideal observer models of cue integration were introduced in vision research but are now used in other uni-sensory tasks (auditory, somatosensory, proprioceptive and vestibular).

Fixating some point in space enhances spoken-language understanding if the words come from that point in space. The effect is strongest when fixating a visual stream showing lips consistent with the utterances, but it also occurs if the visual display is random. The effect is further enhanced if fixation is combined with a visual task that is sufficiently complex.

Fixating some point in space can impede language understanding if the utterances do not emanate from the focus of visual attention and there are auditory distractors which do.

The SC is multisensory: it reacts to visual, auditory, and somatosensory stimuli. It does not only initiate gaze shifts, but also other motor behaviour.

The deeper levels of the SC are the targets of projections from cortex as well as from auditory, somatosensory, and motor systems in the brain.

Moving the eyes shifts the auditory and somatosensory maps in the SC.

(Some) SC neurons in the newborn cat are sensitive to tactile stimuli at birth, to auditory stimuli a few days postnatally, and to visual stimuli last.

We do not know whether sensory maps in the SC other than the visual map are initially set up through chemical markers, but it is likely.

If deep SC neurons are sensitive to tactile stimuli before there are any visually sensitive neurons, then it makes sense that their retinotopic organization be guided by chemical markers.

Santangelo and Macaluso provide a review of the recent literature on visual and auditory attention.

Frontoparietal regions play a key role in spatial orienting in unisensory studies of visual and auditory attention.

There also seems to be modality-specific attention, which globally de-activates attention in one modality and activates it in another.

As a computer scientist I would call de-activating one modality completely a special case of selective attention in that modality.

Localized auditory cues can exogenously orient visual attention.

Santangelo and Macaluso state that multisensory integration and attention are probably separate processes.

Maybe attention controls whether or not multi-sensory integration (MSI) happens at all (at least in SC)? That would be in line with findings that without input from AES and rLS, there's no MSI.

Are AES and rLS cat homologues of the regions cited by Santangelo and Macaluso as responsible for auditory and visual attention?

Task-irrelevant visual cues do not affect visual orienting (visual spatial attention). Task-irrelevant auditory cues, however, seem to do so.

Santangelo and Macaluso suggest that whether or not the effects of endogenous attention dominate the ones of bottom-up processing (automatic processing) depends on semantic association, be it linguistic or learned association (like dogs and barking, cows and mooing).

Santangelo and Macaluso state that "the same frontoparietal attention control systems are ... activated in spatial orienting tasks for both the visual and auditory modality..."

The external nucleus of the inferior colliculus (ICx) of the barn owl represents a map of auditory space.

The map of auditory space in the external nucleus of the inferior colliculus (ICx) is calibrated by visual experience.

The optic tectum (OT) receives information on sound source localization from ICx.

Hyde and Knudsen found that there is a point-to-point projection from OT to IC.

A faithful model of the SC should probably adapt the mapping of auditory space in the SC and in another model representing ICx.

Mammals seem to have SC-IC connectivity analogous to that of the barn owl.

Jack and Thurlow found that the degree to which a puppet resembled an actual speaker (whether it had eyes and a nose, whether it had a lower jaw moving with the speech etc.) and whether the lips of an actual speaker moved in synch with heard speech influenced the strength of the ventriloquism effect.

Bertelson et al. did not find a shift of sound source localization due to manipulated endogenous visual spatial attention—localization was shifted only due to (the salience of) light flashes which would induce (automatic, mandatory) exogenous attention.

Alais and Burr found in an audio-visual localization experiment that the ventriloquism effect can be interpreted by a simple cue weighting model of human multi-sensory integration:

Their subjects weighted visual and auditory cues depending on their reliability. The weights they used were consistent with MLE. In most situations, visual cues are much more reliable for localization than are auditory cues. Therefore, a visual cue is given so much greater weight that it captures the auditory cue.
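
In standard MLE terms (the general scheme, not Alais and Burr's specific numbers): if the unisensory location estimates $\hat{s}_V$ and $\hat{s}_A$ are unbiased with variances $\sigma_V^2$ and $\sigma_A^2$, the reliability-weighted combination is

\[
\hat{s}_{AV} = w_V\,\hat{s}_V + w_A\,\hat{s}_A,
\qquad
w_V = \frac{1/\sigma_V^2}{1/\sigma_V^2 + 1/\sigma_A^2},
\qquad
w_A = 1 - w_V,
\]

with combined variance $\sigma_{AV}^2 = \sigma_V^2\sigma_A^2/(\sigma_V^2 + \sigma_A^2)$. When $\sigma_V \ll \sigma_A$, $w_V \approx 1$ and the combined percept lies almost exactly at the visual location, which is the visual-capture regime.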

Auditory signals gain relevance during saccades as visual perception is unreliable during saccades.

Auditory input would therefore be a good candidate for feedback if saccade control is closed-loop.

Individually, auditory cues are highly ambiguous with respect to auditory localization.

Combination across auditory cue types and channels (frequencies) is needed to turn auditory cues into a meaningful localization.

Auditory localization within the so-called cone of confusion can be disambiguated using spectral cues: changes in the spectral shape of a sound due to how the sound reflects off, bounces around, and passes through features of an animal's body. Such changes can only be detected for known sounds.

Auditory sound source localization is made effective through the combination of different types of cues across frequency channels. It is thus most reliable for familiar broad-band sounds.

If visual cues were absolutely necessary for the formation of an auditory space map, then no auditory space map should develop without visual cues. Since an auditory space map develops also in blind(ed) animals, visual cues cannot be strictly necessary.

Many localized perceptual events are either only visual or only auditory. It is therefore not plausible that only audio-visual percepts contribute to the formation of an auditory space map.

Visual information plays a role, but does not seem to be necessary for the formation of an auditory space map.

The auditory space maps developed by animals without patterned visual experience seem to be degraded only in some species (in guinea pigs and barn owls, but not in ferrets or cats).

Self-organization may play a role in organizing auditory localization independent of visual input.

Visual input does seem to be necessary to ensure spatial audio-visual map-register.

Audio-visual map registration has its limits: strong distortions of natural perception can only partially be compensated through adaptation.

Register between sensory maps is necessary for proper integration of multi-sensory stimuli.

Visual localization has much greater precision and reliability than auditory localization. This seems to be one reason for vision guiding hearing (in this particular context) and not the other way around.

It is unclear and disputed whether visual dominance in adaptation is hard-wired or a result of the quality of respective stimuli.

Most of the multi-sensory neurons in the (cat) SC are audio-visual followed by visual-somatosensory, but all other combinations can be found.

One reason for specifically studying multi-sensory integration in the (cat) SC is that there is a well-understood connection between input stimuli and overt behavior.

The auditory field of the anterior ectosylvian sulcus (fAES) has strong corticotectal projections (in cats).

Some cortical areas are involved in orienting towards auditory stimuli:

  • primary auditory cortex (A1)
  • posterior auditory field (PAF)
  • dorsal zone of auditory cortex (DZ)
  • auditory field of the anterior ectosylvian sulcus (fAES)

Only fAES has strong cortico-tectal projections.

The ventriloquism aftereffect occurs when an auditory stimulus is initially presented together with a visual stimulus at a certain spatial offset.

The auditory stimulus is typically localized by subjects at the same position as the visual stimulus, and this mis-localization prevails even after the visual stimulus disappears.