Slide 1: Feature selection for audio-visual speech recognition
Mihai Gurban
Signal Processing Institute, Swiss Federal Institute of Technology, Lausanne
Slide 2: Outline
Feature selection and extraction
– Why select features?
– Information theoretic criteria
Our approach
– The audio-visual recognizer
– Audio-visual integration
– Features and selection methods
Experimental results
Conclusion
Slide 3: Feature selection
Features and classification
– Features (or attributes, properties, characteristics): different types of measurements that can be taken on the same physical phenomenon
– An instance (or pattern, sample, example): a collection of feature values representing simultaneous measurements
– For classification, each sample has an associated class label
Feature selection
– Finding, within the original feature set, a subset that retains most of the information relevant to a classification task
– This is needed because of the curse of dimensionality
Why dimensionality reduction?
– The number of samples required to obtain accurate models of the data grows exponentially with the dimensionality
– The computing resources required also grow with the dimensionality of the data
– Irrelevant information can decrease performance
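The exponential growth mentioned on this slide can be made concrete with a quick back-of-the-envelope computation. The snippet below is an illustration added to this transcript, not part of the slides: it counts how many cells a joint histogram would need when each feature is quantized into 10 levels.

```python
# Illustration of the curse of dimensionality: with `bins` quantization levels
# per feature, the number of cells in the joint histogram grows exponentially
# with the number of features, and so does the number of samples needed to
# populate those cells.
bins = 10
for dim in (1, 2, 5, 10, 20):
    print(f"{dim:2d} features -> {bins ** dim:.2e} cells in the joint histogram")
```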
Slide 4: Feature selection
Entropy and mutual information
– H(X), the entropy of X: the amount of uncertainty about the value of X
– I(X;Y), the mutual information between X and Y: the reduction in the uncertainty of X due to the knowledge of Y (or vice versa)
Maximum dependency
– A frequently used criterion is mutual information
– Pick Y_S1, ..., Y_Sm from the set Y_1, ..., Y_n of features such that I(Y_S1, Y_S2, ..., Y_Sm; C) is maximal
How many subsets?
– Impossible to check all subsets: there are n! / (m!(n-m)!) combinations of m features out of n
– As an approximate solution, greedy algorithms are used
– The number of subsets to evaluate is then reduced to the order of n·m (one pass over the remaining features at each of the m steps)
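As a concrete sketch of the maximum-dependency criterion and its greedy approximation, the snippet below estimates I(Y_S; C) from co-occurrence counts and, at each step, adds the feature that increases it most. This is an illustration added to the transcript; it assumes the features have already been quantized to a few discrete levels, and the function names (joint_mi, greedy_max_dependency) are illustrative rather than taken from the slides.

```python
import numpy as np
from collections import Counter

def joint_mi(class_labels, feature_columns):
    """Estimate I(Y_S; C): treat each row of the (discretized) selected
    features as a single symbol and count its co-occurrences with the class."""
    n = len(class_labels)
    symbols = [tuple(row) for row in feature_columns]
    p_c = {c: k / n for c, k in Counter(class_labels).items()}
    p_s = {s: k / n for s, k in Counter(symbols).items()}
    p_cs = Counter(zip(class_labels, symbols))
    return sum((k / n) * np.log2((k / n) / (p_c[c] * p_s[s]))
               for (c, s), k in p_cs.items())

def greedy_max_dependency(X, C, m):
    """Greedy forward selection: at each step add the feature that maximizes
    the estimated joint mutual information with the class label C.
    X is an (n_samples, n_features) array of already-quantized features."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(m):
        best = max(remaining, key=lambda j: joint_mi(C, X[:, selected + [j]]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Note that estimating the joint mutual information directly becomes unreliable as the selected subset grows, which is why the later slides turn to criteria built from pairwise terms.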
Slide 5: A simple example
Entropies and mutual information can be represented by Venn diagrams
We are searching for the features Y_Si with maximum mutual information with the class label
Assume a complete set of features is given (shown on the slide)
Slides 6-9: A simple example (continued; Venn diagram figures only)
Slide 10: Which criterion to penalize redundancy?
Many different criteria have been proposed in the literature
Our criterion penalizes only relevant redundancy
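The slide does not spell the criterion out, so the sketch below only illustrates the general idea under one possible reading: whereas an mRMR-style penalty uses the full redundancy I(Yi; Ys) between a candidate feature and an already selected one, a "relevant-only" penalty keeps just the part of that redundancy that also concerns the class, I(Yi; Ys) - I(Yi; Ys | C). The helper names and this exact formula are assumptions added here, not taken from the presentation.

```python
import numpy as np
from collections import Counter

def mi(a, b):
    """Discrete mutual information I(A; B) estimated from empirical counts."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((k / n) * np.log2((k / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), k in pab.items())

def cmi(a, b, c):
    """Conditional mutual information I(A; B | C): average the within-class MI."""
    n = len(c)
    total = 0.0
    for cls in set(c):
        idx = [k for k in range(n) if c[k] == cls]
        total += (len(idx) / n) * mi([a[k] for k in idx], [b[k] for k in idx])
    return total

def relevant_redundancy(y_i, y_s, labels):
    """Assumed reading of 'relevant redundancy': the part of the redundancy
    between candidate y_i and selected y_s that also carries class
    information, I(Y_i; Y_s) - I(Y_i; Y_s | C)."""
    return mi(y_i, y_s) - cmi(y_i, y_s, labels)
```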
Slide 11: Solutions from the literature
"Natural" DCT ordering
– Zigzag scanning, as used in compression (JPEG/MPEG)
Maximum mutual information
– Typically, redundancy is not taken into account
Linear Discriminant Analysis
– A transform is applied to the features
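For reference, the "natural" DCT ordering mentioned here is the JPEG-style zigzag scan of the 2-D coefficient block, which lists low spatial frequencies first. A minimal sketch added to this transcript (the helper name is illustrative):

```python
def zigzag_order(n):
    """JPEG-style zigzag ordering of an n x n block of 2-D DCT coefficients:
    coefficients are visited anti-diagonal by anti-diagonal, alternating
    direction, so low-frequency coefficients come first."""
    cells = [(r, c) for r in range(n) for c in range(n)]
    return sorted(cells, key=lambda rc: (rc[0] + rc[1],
                                         rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

# Keeping the first k coefficients in this order gives the "natural" baseline
# feature ranking against which the selection methods are compared.
print(zigzag_order(4)[:6])   # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
```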
Slide 12: Our application: AVSR
Experiments on the CUAVE database
– 36 speakers, 10 words, 5 repetitions per speaker
– Leave-one-out cross-validation
– Audio features: MFCC coefficients
– Visual features: DCT with first and second temporal derivatives
– Different levels of noise added to the audio
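As a rough sketch of how such features can be computed: the slides do not name a toolkit, so librosa and scipy below are assumptions, and the function names are illustrative rather than the presenters' pipeline.

```python
import numpy as np
import librosa                  # assumed toolkit for the audio features
from scipy.fft import dctn      # assumed toolkit for the 2-D DCT

def audio_features(wav_path, sr=16000, n_mfcc=13):
    """MFCCs with first and second temporal derivatives (39 dims per frame)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    return feats.T              # one row per frame

def video_features(mouth_frames, n_coeffs=64):
    """2-D DCT of each grayscale mouth ROI, keeping the first n_coeffs
    coefficients (row-major here; the presentation ranks them by zigzag
    order or by mutual information), plus temporal derivatives."""
    coeffs = np.array([dctn(frame, norm='ortho').ravel()[:n_coeffs]
                       for frame in mouth_frames])
    d1 = np.gradient(coeffs, axis=0)    # first temporal derivative
    d2 = np.gradient(d1, axis=0)        # second temporal derivative
    return np.hstack([coeffs, d1, d2])
```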
Slide 13: The multi-stream HMM
Audio stream (39 MFCCs), video stream (DCT features)
Audio-visual integration with multi-stream HMMs
– States are modeled with Gaussian mixtures
– Each modality is modeled separately
– The emission likelihood is a weighted product
– The optimal weights are chosen for each SNR
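The weighted product of per-stream likelihoods amounts to a weighted sum in the log domain. A minimal sketch added here; the function name, and the convention that the two stream weights sum to one, are assumptions rather than details given on the slide.

```python
def multistream_log_likelihood(log_b_audio, log_b_video, lambda_audio):
    """Multi-stream emission score for one HMM state: the weighted product of
    the per-stream GMM likelihoods becomes a weighted sum in the log domain.
    lambda_audio is the audio stream weight (tuned per SNR); the video weight
    is taken here as 1 - lambda_audio."""
    return lambda_audio * log_b_audio + (1.0 - lambda_audio) * log_b_video

# Example: at low SNR the audio stream is down-weighted.
print(multistream_log_likelihood(-42.0, -17.5, lambda_audio=0.3))
```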
Slide 14: Information content of different types of features
Slide 15: Visual-only recognition rate
Slide 16: Audio-visual performance
Slide 17: AV performance with clean audio
Slide 18: AV performance at 10 dB SNR
Slide 19: Noisy AV and visual-only comparison
Slide 20: Conclusion and future work
Feature selection for audio-visual speech recognition
– The visual-only recognition rate is not a good predictor of audio-visual performance, because of dimensionality
– Maximum audio-visual performance is obtained for small video dimensionalities
– Algorithms that improve performance at small dimensionalities are needed
Future work
– Better methods to compute the amount of redundancy between features