Multiple Active Speaker Localization based on Audio-visual Fusion in two Stages. In 2012 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), 13-15 September 2012, pp. 262-268, by Zhao Li, Thorsten Herfet, and Thorsten Thormählen.
@inproceedings{li-et-al-2012,
    address = {Hamburg, Germany},
    author = {Li, Zhao and Herfet, Thorsten and Thorm\"{a}hlen, Thorsten},
    booktitle = {2012 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI)},
    day = {13-15},
    keywords = {auditory, calibration, computational, cue-combination, face-detection, localization, visual, visual-processing},
    month = sep,
    pages = {262--268},
    title = {Multiple Active Speaker Localization based on Audio-visual Fusion in two Stages},
    year = {2012}
}


Li et al. present a purely engineering-based approach to active speaker localization. Their system uses Viola and Jones' object detection algorithm for face detection and cross-correlation of the microphone signals for auditory speaker localization.
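The auditory part of such a system typically estimates the time difference of arrival (TDOA) between two microphones from the peak of their cross-correlation and converts it to an azimuth angle. A minimal sketch of that idea (not the authors' implementation; microphone spacing, sample rate, and the far-field model are illustrative assumptions):

```python
import numpy as np

def tdoa_azimuth(sig_l, sig_r, fs, mic_dist, c=343.0):
    """Estimate source azimuth from the cross-correlation of two
    microphone signals. Sign convention of this sketch: a negative
    azimuth means the source is on the left-microphone side."""
    # Full cross-correlation; the peak index gives the sample lag
    # by which the right signal trails the left one.
    corr = np.correlate(sig_l, sig_r, mode="full")
    lag = np.argmax(corr) - (len(sig_r) - 1)
    tau = lag / fs                          # delay in seconds
    # Far-field model: path difference = mic_dist * sin(azimuth).
    sin_az = np.clip(c * tau / mic_dist, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_az))

# Synthetic check: a noise burst reaching the right microphone
# 5 samples later than the left one, i.e. a source off to the left.
rng = np.random.default_rng(0)
fs, d = 48000, 0.2                          # 48 kHz, 20 cm spacing
burst = rng.standard_normal(1024)
left = np.concatenate([burst, np.zeros(5)])
right = np.concatenate([np.zeros(5), burst])
az = tdoa_azimuth(left, right, fs, d)       # roughly -10 degrees
```

Real systems usually replace the plain cross-correlation with a generalized cross-correlation such as GCC-PHAT, which is more robust to reverberation.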

Li et al. report that, in their experiment, audio-visual active-speaker localization is as accurate as purely visual active-speaker localization ($\sim 1^\circ$) as long as speakers are within the visual field.

Outside the visual field, localization error varies between $1^\circ$ and $10^\circ$. The authors do not provide a detailed quantitative evaluation of localization accuracy.