Show Reference: "Algorithms for Audiovisual Speaker Localisation in Reverberant Acoustic Environments"

Algorithms for Audiovisual Speaker Localisation in Reverberant Acoustic Environments. In The 3rd Workshop on Positioning, Navigation and Communication (WPNC'06), March 2006, pp. 75–80. By Christoph Voges, Volker Märgner, and Rainer Martin.
@inproceedings{voges-et-al-2006,
    address = {Aachen},
    author = {Voges, Christoph and M\"{a}rgner, Volker and Martin, Rainer},
    booktitle = {The 3rd Workshop on Positioning, Navigation and Communication (WPNC'06)},
    isbn = {3-8322-4862-5},
    keywords = {active-speaker-localization, audio, localization, multisensory-integration, visual},
    location = {Hannover, Germany},
    month = mar,
    pages = {75--80},
    publisher = {Shaker Verlag},
    title = {Algorithms for Audiovisual Speaker Localisation in Reverberant Acoustic Environments},
    year = {2006}
}


Voges et al. use interaural time differences (ITDs), computed via generalized cross-correlation, for sound-source localization.
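A minimal sketch of GCC-based delay estimation, assuming the PHAT weighting that is common in reverberant environments (the paper's exact weighting function is not reproduced here; all names and parameters below are mine):

import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    # Estimate the time delay between two microphone signals via
    # generalized cross-correlation with PHAT weighting.
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)            # cross-power spectrum
    R /= np.abs(R) + 1e-12        # PHAT: whiten by the magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Put lag 0 in the middle of the correlation window.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs   # delay in seconds

The estimated delay then maps to an azimuth via the microphone spacing and the speed of sound.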

Voges et al. present an engineering approach to audio-visual active speaker localization.

Voges et al. use a difference image to detect and localize moving objects (humans).
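A simple frame-differencing sketch of that idea (hypothetical; the paper's preprocessing and threshold value are not specified here):

import numpy as np

def difference_image(frame, prev_frame, threshold=25):
    # Binary motion mask from the absolute difference of two
    # consecutive grayscale frames.
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff > threshold).astype(np.uint8)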

Voges et al. use the strength of the visual detection signal (the peak value of the column-wise sum of the difference image) as a proxy for the confidence of visual detection.

They use visual localization whenever this signal strength is above a certain threshold, and auditory localization if it is below that threshold.
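The switching rule could look like the following sketch (the threshold value and the function and variable names are illustrative assumptions, not taken from the paper):

import numpy as np

def localize(diff_img, audio_estimate, conf_threshold=50.0):
    # Column-wise sum of the difference image; its peak value serves as
    # the visual confidence, its peak position as the visual estimate.
    col_profile = diff_img.sum(axis=0)
    confidence = col_profile.max()
    if confidence > conf_threshold:
        return int(np.argmax(col_profile)), 'visual'   # image column
    return audio_estimate, 'audio'                     # fall back to audio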

In Voges et al.'s system, auditory localization serves as a backup in case visual localization fails, and for disambiguation in case more than one visual target is detected.

Voges et al. suggest using a Kalman or particle filter to integrate information about the speaker's position over time.
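One possible realization of that suggestion is a constant-velocity Kalman filter over the azimuth estimate (all model parameters below are illustrative, not taken from the paper):

import numpy as np

class AzimuthKalman:
    # Constant-velocity Kalman filter over speaker azimuth; dt is the
    # frame period, q and r set the process and measurement noise.
    def __init__(self, dt=0.04, q=1.0, r=4.0):
        self.x = np.zeros(2)                        # [azimuth, azimuth rate]
        self.P = np.eye(2) * 100.0                  # initial uncertainty
        self.F = np.array([[1.0, dt], [0.0, 1.0]])  # motion model
        self.H = np.array([[1.0, 0.0]])             # we observe azimuth only
        self.Q = np.eye(2) * q
        self.R = np.array([[r]])

    def step(self, z):
        # Predict, then update with the new audio or visual azimuth z.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        y = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + (K @ y).ravel()
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return self.x[0]                            # smoothed azimuth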

Voges et al. do not evaluate the accuracy of audio-visual localization.