Show Reference: "Multimodal Video Concept Detection via Bag of Auditory Words and Multiple Kernel Learning"

Multimodal Video Concept Detection via Bag of Auditory Words and Multiple Kernel Learning. In Advances in Multimedia Modeling, Vol. 7131 (2012), pp. 40-50, doi:10.1007/978-3-642-27355-1_7, by Markus Mühling, Ralph Ewerth, Jun Zhou, Bernd Freisleben; edited by Klaus Schoeffmann, Bernard Merialdo, Alexander G. Hauptmann, et al.
@incollection{muehling-et-al-2012,
    abstract = {State-of-the-art systems for video concept detection mainly rely on visual features. Some previous approaches have also included audio features, either using low-level features such as mel-frequency cepstral coefficients ({MFCC}) or exploiting the detection of specific audio concepts. In this paper, we investigate a bag of auditory words ({BoAW}) approach that models {MFCC} features in an auditory vocabulary. The resulting {BoAW} features are combined with state-of-the-art visual features via multiple kernel learning ({MKL}). Experiments on a large set of 101 video concepts from the {MediaMill} Challenge show the effectiveness of using {BoAW} features: The system using {BoAW} features and a support vector machine with a $\chi^2$-kernel is superior to a state-of-the-art audio approach relying on probabilistic latent semantic indexing. Furthermore, it is shown that an early fusion approach degrades detection performance, whereas the combination of auditory and visual bag of words features via {MKL} yields a relative performance improvement of 9\%.},
    author = {M\"{u}hling, Markus and Ewerth, Ralph and Zhou, Jun and Freisleben, Bernd},
    booktitle = {Advances in Multimedia Modeling},
    citeulike-article-id = {13550715},
    citeulike-linkout-0 = {http://dx.doi.org/10.1007/978-3-642-27355-1\_7},
    citeulike-linkout-1 = {http://link.springer.com/chapter/10.1007/978-3-642-27355-1\_7},
    doi = {10.1007/978-3-642-27355-1\_7},
    editor = {Schoeffmann, Klaus and Merialdo, Bernard and Hauptmann, Alexander G. and Ngo, Chong-Wah and Andreopoulos, Yiannis and Breiteneder, Christian},
    keywords = {bag-of-words, learning, multi-modality, objects},
    pages = {40--50},
    posted-at = {2015-03-17 08:11:15},
    priority = {2},
    publisher = {Springer Berlin Heidelberg},
    series = {Lecture Notes in Computer Science},
    title = {Multimodal Video Concept Detection via Bag of Auditory Words and Multiple Kernel Learning},
    url = {http://dx.doi.org/10.1007/978-3-642-27355-1\_7},
    volume = {7131},
    year = {2012}
}


There are specialized and general approaches to object detection. General approaches are more popular nowadays because it is infeasible to design a specialized approach for every one of the many visual object categories one may want to detect.

Most current visual object detection methods (as of 2012) are 'bag-of-visual-words' approaches: local features are detected in an image and quantized into a 'bag of visual words'. Learning algorithms are then trained to classify such bags of words.
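The quantization step can be sketched as follows. This is a minimal illustration with toy data, not any particular system's implementation: each local descriptor is assigned to its nearest word in a precomputed vocabulary, and the image is represented by the normalized histogram of word counts.

```python
import numpy as np

def bag_of_words_histogram(descriptors, vocabulary):
    """Quantize local feature descriptors against a visual vocabulary
    and return an L1-normalized word-count histogram."""
    # Squared Euclidean distance from each descriptor to each vocabulary word.
    dists = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = dists.argmin(axis=1)  # nearest visual word per descriptor
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy example: 5 random 8-D "descriptors" (stand-ins for SIFT vectors),
# a vocabulary of 3 words.
rng = np.random.default_rng(0)
desc = rng.normal(size=(5, 8))
vocab = rng.normal(size=(3, 8))
h = bag_of_words_histogram(desc, vocab)
```

The resulting fixed-length histogram is what the classifier actually sees, regardless of how many features the image produced.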

Mühling et al. present an audio-visual video concept detection system. Their system extracts visual and auditory bags of words from video data. Visual words are based on SIFT features; auditory words are formed by applying k-means clustering to mel-frequency cepstral coefficient (MFCC) features extracted from the audio track. Support vector machines are used for classification.
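The auditory side of the pipeline can be sketched in the same spirit: cluster MFCC frames into an auditory vocabulary with k-means, build one BoAW histogram per clip, and compare histograms with a χ²-kernel (the paper reports an SVM with a χ²-kernel working well on these features). This is a toy sketch on random stand-in data, with illustrative cluster counts and parameters; it is not the authors' implementation.

```python
import numpy as np

def chi_square_kernel(X, Y, gamma=1.0):
    """Exponential chi-square kernel between histogram rows:
    k(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i))."""
    num = (X[:, None, :] - Y[None, :, :]) ** 2
    den = X[:, None, :] + Y[None, :, :] + 1e-10  # avoid division by zero
    return np.exp(-gamma * (num / den).sum(-1))

rng = np.random.default_rng(1)
frames = rng.random((200, 13))  # stand-in for 13-D MFCC frames

# Plain k-means (Lloyd's algorithm) to learn an auditory vocabulary.
k = 8
centers = frames[rng.choice(len(frames), k, replace=False)]
for _ in range(10):
    labels = ((frames[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
    for j in range(k):
        if (labels == j).any():
            centers[j] = frames[labels == j].mean(0)

def boaw(clip_frames):
    """BoAW histogram of a clip: assign each frame to its nearest
    auditory word and L1-normalize the counts."""
    w = ((clip_frames[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
    h = np.bincount(w, minlength=k).astype(float)
    return h / h.sum()

# Treat each block of 20 frames as one "clip" and build the Gram matrix.
hists = np.stack([boaw(frames[i * 20:(i + 1) * 20]) for i in range(10)])
K = chi_square_kernel(hists, hists)
```

A Gram matrix like `K` is also the natural input for the late-fusion step: multiple kernel learning combines it with the kernel computed on the visual bag-of-words features, which is the combination the paper reports as yielding the 9% relative improvement.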