Mühling et al. present an audio-visual video concept detection system. Their system extracts visual and auditory bags of words from video data. Visual words are based on SIFT features, while auditory words are formed by applying the K-Means algorithm to a Mel-Frequency Cepstral Coefficient (MFCC) analysis of the audio data. Support vector machines are used for classification.
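The auditory bag-of-words idea can be illustrated with a minimal sketch. It is not the implementation of Mühling et al.: the two-dimensional toy vectors merely stand in for MFCC frames, the function names (`kmeans`, `bag_of_words`) and all parameters are hypothetical, and the SIFT and SVM stages are omitted. The sketch only shows how K-Means yields a codebook of "words" and how a clip is then represented as a normalized histogram over that codebook.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vector, centroids):
    """Index of the centroid closest to the given vector."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vector, centroids[i]))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd-style K-Means; returns k centroids (the codebook)."""
    rng = random.Random(seed)
    # Assumption: initialise with k distinct vectors drawn from the data.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(v, centroids)].append(v)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids


def bag_of_words(vectors, centroids):
    """Normalized histogram of codebook assignments for one clip."""
    counts = [0] * len(centroids)
    for v in vectors:
        counts[nearest(v, centroids)] += 1
    total = sum(counts)
    return [c / total for c in counts]


# Toy stand-ins for MFCC frames: two well-separated groups of vectors.
frames = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1],
          [5.0, 5.1], [5.2, 5.0], [5.1, 4.9]]

codebook = kmeans(frames, k=2)               # two "auditory words"
histogram = bag_of_words(frames, codebook)   # feature vector for a classifier
```

In a full pipeline, one such histogram per video shot would serve as the input vector for a classifier such as an SVM. A real codebook would be trained on frames from many clips and would contain far more than two words.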
Hinton states that when SVMs are used, the actual features (of an image, an article, ...) are extracted by a hand-crafted algorithm, and only the discrimination between objects based on these features is learned.
He sees this as a modern version of what he calls a strategy of denial toward learning feature extractors and hidden units.