# Show Reference: "Multiple Active Speaker Localization based on Audio-visual Fusion in two Stages"

Multiple Active Speaker Localization based on Audio-visual Fusion in two Stages. In 2012 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI) (13-15 September 2012), pp. 262-268, by Zhao Li, Thorsten Herfet, Thorsten Thormählen.
@inproceedings{li-et-al-2012,
address = {Hamburg, Germany},
author = {Li, Zhao and Herfet, Thorsten and Thorm\"{a}hlen, Thorsten},
booktitle = {2012 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI)},
day = {13-15},
keywords = {auditory, calibration, computational, cue-combination, face-detection, localization, visual, visual-processing},
month = sep,
pages = {262--268},
title = {Multiple Active Speaker Localization based on Audio-visual Fusion in two Stages},
year = {2012}
}


See the CiteULike entry for more info, PDF links, BibTeX etc.

Li et al. present a purely engineering-based approach to active speaker localization. Their system uses Viola and Jones' object detection algorithm for face detection and cross-correlation for auditory speaker localization.
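The auditory stage of such a system typically estimates the time difference of arrival (TDOA) between a microphone pair from the cross-correlation peak and converts it to an azimuth angle. The sketch below is not the authors' implementation; it is a minimal illustration of cross-correlation-based azimuth estimation, assuming a free-field source, a two-microphone array of known spacing, and the hypothetical function name `estimate_azimuth`:

```python
import numpy as np

def estimate_azimuth(sig_l, sig_r, fs, mic_distance, c=343.0):
    """Estimate source azimuth (degrees) from a stereo pair via cross-correlation TDOA.

    sig_l, sig_r: left/right microphone signals (1-D arrays of equal length)
    fs: sampling rate in Hz; mic_distance: microphone spacing in metres
    c: speed of sound in m/s. Positive azimuth points toward the right mic.
    """
    # Full cross-correlation; zero lag sits at index len(sig_r) - 1.
    corr = np.correlate(sig_l, sig_r, mode="full")
    lag = np.argmax(corr) - (len(sig_r) - 1)  # in samples; positive => left signal delayed
    tdoa = lag / fs
    # Clamp to the physically possible range before taking the arcsine.
    sin_theta = np.clip(tdoa * c / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))
```

Plain cross-correlation as above is sensitive to reverberation; practical systems often whiten the spectra first (GCC-PHAT) before picking the peak.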

Li et al. report that, in their experiment, audio-visual active speaker localization is as accurate as purely visual active speaker localization ($\sim 1^\circ$) as long as speakers are within the visual field.

Outside the visual field, localization error varies between $1^\circ$ and $10^\circ$. The authors do not provide a detailed quantitative evaluation of localization accuracy.