Show Tag: active-speaker-localization


Li et al. present a purely engineering-based approach to active speaker localization. Their system uses Viola and Jones' object detection algorithm for face detection and cross-correlation for auditory speaker localization.
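
As a rough illustration of the auditory half of such a pipeline, the sketch below estimates the inter-microphone delay by picking the peak of the cross-correlation and converts it to an azimuth under a far-field assumption. Function names, microphone spacing, and parameter values are my own illustrative choices, not Li et al.'s implementation.

```python
import numpy as np

def estimate_itd(left, right, fs, max_delay_s=1e-3):
    """Estimate the inter-channel delay (seconds) via cross-correlation."""
    corr = np.correlate(left, right, mode="full")
    lags = np.arange(-len(right) + 1, len(left))     # lag (samples) of each correlation bin
    valid = np.abs(lags) <= int(max_delay_s * fs)    # keep only physically plausible delays
    return lags[valid][np.argmax(corr[valid])] / fs

def itd_to_azimuth(itd, mic_distance=0.2, c=343.0):
    """Map a delay to an azimuth (degrees) using the far-field approximation
    itd = mic_distance * sin(azimuth) / c."""
    return np.degrees(np.arcsin(np.clip(itd * c / mic_distance, -1.0, 1.0)))
```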

Sanchez-Riera et al. use a probabilistic model for audio-visual active speaker localization on a humanoid robot (the Nao robot).

Sanchez-Riera et al. use the Bayesian information criterion to choose the number of speakers in their audio-visual active speaker localization system.
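
A minimal sketch of BIC-based selection of the number of speakers, assuming the observations are pooled localization estimates modelled as a Gaussian mixture; the scikit-learn API and all parameter values are my own choices, not Sanchez-Riera et al.'s model.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def choose_num_speakers(observations, max_speakers=5, seed=0):
    """Pick the number of mixture components (candidate speakers) that
    minimizes the Bayesian information criterion.

    observations: (n_samples, n_features) array of localization observations,
    e.g. azimuth estimates pooled over a short time window.
    """
    best_k, best_bic, best_model = 1, np.inf, None
    for k in range(1, max_speakers + 1):
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(observations)
        bic = gmm.bic(observations)          # lower BIC = better penalized fit
        if bic < best_bic:
            best_k, best_bic, best_model = k, bic, gmm
    return best_k, best_model
```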

Sanchez-Riera et al. use the Waldboost face detection system for visual processing.

Yan et al. present a system which uses auditory and visual information to learn an audio-motor map (in a functional sense) and orient a robot towards a speaker. Learning is online.
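
A very reduced sketch of what online learning of an audio-motor map can look like: a linear map from an auditory cue to a pan angle, nudged by a least-mean-squares update whenever vision supplies a reliable target angle. Class and parameter names are hypothetical; Yan et al.'s map is richer than this.

```python
class OnlineAudioMotorMap:
    """Minimal online regression from an auditory cue (e.g. a normalized ITD)
    to a motor pan angle. Illustrative only; not Yan et al.'s actual model."""

    def __init__(self, learning_rate=0.05):
        self.w = 0.0      # slope: degrees of pan per unit of auditory cue
        self.b = 0.0      # offset in degrees
        self.lr = learning_rate

    def predict(self, cue):
        return self.w * cue + self.b

    def update(self, cue, visual_angle):
        """One online step: move the map towards the visually observed angle."""
        error = visual_angle - self.predict(cue)
        self.w += self.lr * error * cue
        self.b += self.lr * error
        return error
```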

Yan et al. use the standard Viola-Jones face detection algorithm for visual processing.
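
For reference, the standard Viola-Jones detector is available in OpenCV as a Haar cascade; the snippet below shows typical usage (the scale and neighbour parameters are common defaults, not values from the paper).

```python
import cv2

# Standard Viola-Jones (Haar cascade) frontal face detector shipped with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_detector = cv2.CascadeClassifier(cascade_path)

def detect_faces(frame_bgr):
    """Return a list of (x, y, w, h) face boxes in a BGR frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)                      # improves robustness to lighting
    return face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
```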

Sanchez-Riera et al. do not report on localization accuracy, but on correct speaker detections.

Li et al. report that, in their experiment, audio-visual active speaker localization is as good as visual active-speaker localization ($\sim 1^\circ$) as long as speakers are within the visual field.

Outside of the visual field, localization accuracy varies between $1^\circ$ and $10^\circ$. The authors do not provide a detailed quantitative evaluation of localization accuracy.

Yan et al. do not evaluate the accuracy of audio-visual localization.

Yan et al. report an auditory localization accuracy of $3.4^\circ$ for online learning and $0.9^\circ$ for offline calibration.

Yan et al. perform sound source localization using both ITD and ILD. Some of their auditory processing is bio-inspired.
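
A broadband sketch of these two cues (Yan et al.'s bio-inspired front-end is omitted; variable names and the level measure are my own choices):

```python
import numpy as np

def binaural_cues(left, right, fs, eps=1e-12):
    """Compute two simple binaural cues from a stereo frame:
    ITD (seconds) via the peak of the cross-correlation, and
    ILD (dB) as the level ratio between the two channels."""
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)           # lag in samples at the peak
    itd = lag / fs
    rms_l = np.sqrt(np.mean(left ** 2) + eps)
    rms_r = np.sqrt(np.mean(right ** 2) + eps)
    ild = 20.0 * np.log10(rms_l / rms_r)
    return itd, ild
```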

Voges et al. present an engineering approach to audio-visual active speaker localization.

In Voges et al.'s system, auditory localization serves as a backup in case visual localization fails, and for disambiguation in case more than one visual target is detected.

Voges et al. suggest using a Kalman or particle filter to integrate information about the speaker's position over time.
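
Voges et al. only suggest such a filter, so the sketch below is one plausible reading: a constant-velocity Kalman filter over azimuth in which visual measurements are trusted more than auditory ones. The state model, time step, and noise variances are illustrative assumptions.

```python
import numpy as np

class AzimuthKalmanFilter:
    """Constant-velocity Kalman filter over speaker azimuth.

    State: [azimuth (deg), azimuth rate (deg/s)]. Audio and visual estimates
    enter as scalar measurements with different noise levels."""

    def __init__(self, dt=0.05, process_var=5.0, visual_var=1.0, audio_var=25.0):
        self.x = np.zeros(2)                      # state estimate
        self.P = np.eye(2) * 1e3                  # large initial uncertainty
        self.F = np.array([[1.0, dt], [0.0, 1.0]])
        self.Q = process_var * np.array([[dt**4 / 4, dt**3 / 2],
                                         [dt**3 / 2, dt**2]])
        self.H = np.array([[1.0, 0.0]])
        self.R_visual = np.array([[visual_var]])
        self.R_audio = np.array([[audio_var]])

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[0]

    def update(self, measurement_deg, source="visual"):
        R = self.R_visual if source == "visual" else self.R_audio
        y = np.array([measurement_deg]) - self.H @ self.x    # innovation
        S = self.H @ self.P @ self.H.T + R
        K = self.P @ self.H.T @ np.linalg.inv(S)             # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return self.x[0]
```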

Voges et al. do not evaluate the accuracy of audio-visual localization.

Aarabi present a system for audio-visual localization in azimuth and depth which they demonstrate in an active-speaker localization task.

Aarabi choose (adaptive) difference images for visual localization to avoid relying on domain knowledge.
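
A minimal sketch of an adaptive difference image: the background is a running average, so it slowly adapts to the scene while moving speakers remain salient. The adaptation rate and threshold are arbitrary illustrative values, not Aarabi's.

```python
import cv2
import numpy as np

class AdaptiveDifferenceDetector:
    """Adaptive difference image: the background is a running average of past
    frames, so slow scene changes are absorbed while moving speakers stand out."""

    def __init__(self, adaptation_rate=0.05, threshold=25):
        self.background = None
        self.alpha = adaptation_rate
        self.threshold = threshold

    def process(self, frame_bgr):
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if self.background is None:
            self.background = gray.copy()
        diff = cv2.absdiff(gray, self.background)
        # Update the background towards the current frame (adaptation).
        cv2.accumulateWeighted(gray, self.background, self.alpha)
        _, mask = cv2.threshold(diff.astype(np.uint8), self.threshold,
                                255, cv2.THRESH_BINARY)
        return mask            # non-zero pixels mark candidate moving regions
```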

Aarabi use ITD (computed using cross-correlation) and ILD in an array of 3 microphones for auditory localization.
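
With three microphones, the pairwise delays constrain both azimuth and depth. The brute-force sketch below searches a 2-D grid for the source position whose predicted pairwise TDOAs best match the measured ones; ILD is omitted for brevity, and the grid and geometry are my own illustrative choices, not Aarabi's estimator.

```python
import numpy as np

def localize_from_tdoas(mic_positions, measured_tdoas, pairs,
                        c=343.0, grid_range=3.0, grid_step=0.05):
    """Grid search for the 2-D source position (azimuth and depth) that best
    explains the measured pairwise time differences of arrival.

    mic_positions: (3, 2) array of microphone coordinates in metres.
    measured_tdoas: list of delays (s), one per microphone pair in `pairs`.
    pairs: list of (i, j) index pairs, e.g. [(0, 1), (0, 2), (1, 2)].
    """
    xs = np.arange(-grid_range, grid_range, grid_step)
    ys = np.arange(0.1, grid_range, grid_step)          # assume the source is in front
    best_pos, best_err = None, np.inf
    for x in xs:
        for y in ys:
            src = np.array([x, y])
            err = 0.0
            for (i, j), tdoa in zip(pairs, measured_tdoas):
                predicted = (np.linalg.norm(src - mic_positions[i])
                             - np.linalg.norm(src - mic_positions[j])) / c
                err += (predicted - tdoa) ** 2
            if err < best_err:
                best_pos, best_err = src, err
    azimuth = np.degrees(np.arctan2(best_pos[0], best_pos[1]))  # from the forward (y) axis
    depth = np.linalg.norm(best_pos)
    return azimuth, depth
```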

Kushal et al. present an engineering approach to audio-visual active speaker localization.

Kushal et al. do not evaluate the accuracy of audio-visual localization quantitatively. They do, however, show a graph of visual-only, audio-visual, and audio-visual plus temporal localization for one test run. That graph suggests that multisensory and temporal integration prevent misdetections, but they do not seem to improve localization accuracy much.

Kushal et al. use an EM algorithm to integrate audio-visual information for active speaker localization statically and over time.
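
One way to read this (an illustrative sketch, not Kushal et al.'s actual formulation): treat each audio or visual azimuth observation as coming either from the speaker (a Gaussian around the unknown azimuth) or from clutter (uniform), and let EM alternate between computing responsibilities and re-estimating the azimuth.

```python
import numpy as np

def em_fuse_azimuths(observations, clutter_range=180.0, n_iters=50, init_sigma=10.0):
    """Fuse noisy audio and visual azimuth observations (degrees) with EM.

    Model (illustrative): each observation is either from the speaker
    (Gaussian around the unknown azimuth) or from clutter (uniform)."""
    obs = np.asarray(observations, dtype=float)
    mu, sigma, prior_speaker = np.median(obs), init_sigma, 0.7
    for _ in range(n_iters):
        # E-step: responsibility that each observation belongs to the speaker.
        speaker_lik = (prior_speaker / (np.sqrt(2 * np.pi) * sigma)
                       * np.exp(-0.5 * ((obs - mu) / sigma) ** 2))
        clutter_lik = (1.0 - prior_speaker) / clutter_range
        resp = speaker_lik / (speaker_lik + clutter_lik)
        # M-step: re-estimate azimuth, spread, and speaker prior.
        mu = np.sum(resp * obs) / np.sum(resp)
        sigma = np.sqrt(np.sum(resp * (obs - mu) ** 2) / np.sum(resp)) + 1e-6
        prior_speaker = np.mean(resp)
    return mu, resp
```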

Studies on audio-visual active speaker localization usually do not report in-depth evaluations of audio-visual localization accuracy. The reason is probably that auditory information is only used as a backup for cases in which visual localization fails, or for disambiguation when visual information alone cannot determine which visual target is the active speaker.

When visual detection succeeds, it is usually precise enough.

Therefore, active speaker localization is probably a misnomer. It should be called active speaker identification.