Show Reference: "Audio-Visual Speaker Localization Using Graphical Models"

Akash Kushal, Mandar Rahurkar, Fei-Fei Li, Jean Ponce, and Thomas Huang. "Audio-Visual Speaker Localization Using Graphical Models." In Proceedings of the 18th International Conference on Pattern Recognition (ICPR '06), Vol. 1, pp. 291-294, 2006. doi:10.1109/icpr.2006.284. Edited by Yuan Y. Tang, Wang, Guy Lorette, Daniel S. Yeung, and Hong Yan.
@inproceedings{kushal2006audiovisual,
    abstract = {In this work we propose an approach to combine audio and video modalities for person tracking using graphical models. We demonstrate a principled and intuitive framework for combining these modalities to obtain robustness against occlusion and change in appearance. We further exploit the temporal correlations that exist for a moving object between adjacent frames to account for the cases where having both modalities might still not be enough, e.g., when the person being tracked is occluded and not speaking. Improvement in tracking results is shown at each step and compared with manually annotated ground truth.},
    address = {Washington, DC, USA},
    author = {Kushal, Akash and Rahurkar, Mandar and Li, Fei-Fei and Ponce, Jean and Huang, Thomas},
    booktitle = {The 18th International Conference on Pattern Recognition},
    doi = {10.1109/icpr.2006.284},
    editor = {Tang, Yuan Y. and Wang and Lorette, Guy and Yeung, Daniel S. and Yan, Hong},
    isbn = {0-7695-2521-0},
    keywords = {active-speaker-localization, audio, localization, multisensory-integration, visual},
    location = {Hong Kong},
    pages = {291--294},
    publisher = {IEEE Computer Society},
    series = {ICPR '06},
    title = {{Audio-Visual} Speaker Localization Using Graphical Models},
    volume = {1},
    year = {2006}
}

See the CiteULike entry for more info, PDF links, BibTeX, etc.

Kushal et al. present an engineering approach to audio-visual active speaker localization.

Kushal et al. do not evaluate the accuracy of audio-visual localization quantitatively. They do, however, show a graph comparing visual-only, audio-visual, and audio-visual-plus-temporal localization during one test run. That graph suggests that multisensory and temporal integration mainly prevent misdetections; they do not appear to improve localization accuracy much.

Kushal et al. use an EM algorithm to integrate audio and visual information for active speaker localization, both statically within each frame and over time across frames.
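The general idea of such an integration can be sketched without the paper's specific graphical model: fuse per-frame audio and visual likelihoods over candidate positions (assuming the modalities are conditionally independent given the speaker location), then smooth the fused beliefs over time with a transition model that favors small motions. The sketch below is a hypothetical illustration, not the authors' EM formulation; the grid size, noise levels, and Gaussian likelihood shapes are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_frames, n_pos = 5, 10  # number of frames, candidate positions (assumed)
true_pos = 7             # assumed ground-truth speaker position

def noisy_likelihood(center, scale, noise):
    """Gaussian-shaped likelihood over the position grid, plus uniform noise."""
    x = np.arange(n_pos)
    lik = np.exp(-0.5 * ((x - center) / scale) ** 2) + noise * rng.random(n_pos)
    return lik / lik.sum()

# Per-frame modality likelihoods: audio is broader and noisier than video,
# mimicking coarse sound-source localization vs. sharper visual detection.
audio = np.stack([noisy_likelihood(true_pos, 3.0, 0.5) for _ in range(n_frames)])
video = np.stack([noisy_likelihood(true_pos, 1.0, 0.2) for _ in range(n_frames)])

# Static fusion: multiply the likelihoods (conditional-independence assumption).
fused = audio * video
fused /= fused.sum(axis=1, keepdims=True)

# Temporal smoothing: HMM-style forward pass with a transition kernel
# that penalizes large position jumps between adjacent frames.
diff = np.subtract.outer(np.arange(n_pos), np.arange(n_pos))
trans = np.exp(-0.5 * (diff / 1.5) ** 2)
trans /= trans.sum(axis=1, keepdims=True)

belief = fused[0].copy()
estimates = [int(belief.argmax())]
for t in range(1, n_frames):
    belief = (belief @ trans) * fused[t]   # predict, then update with new evidence
    belief /= belief.sum()
    estimates.append(int(belief.argmax()))

print(estimates)  # per-frame MAP position estimates
```

The multiplication step is the "static" fusion, and the forward pass is the "over time" part; in the paper these roles are played by the graphical model's potentials and its temporal links, with EM used to fit the model rather than the fixed parameters assumed here.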