Database and Visual Front End (Makis Potamianos)
Active Appearance Model Visual Features (Iain Matthews)
Acknowledgments: Cootes, Edwards, Taylor (Manchester); Sclaroff (Boston)
AAM Overview: shape and appearance. Shape is defined by landmark points; appearance is the region of interest warped to the reference shape.
Relationship to DCT Features: an external feature detector vs. model-based learned tracking. The DCT path runs a face detector and extracts features from an ROI 'box'; the AAM path runs an AAM tracker with explicit shape + appearance modelling.
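To make the contrast concrete, here is a minimal sketch of the ROI-box DCT path; the function name, ROI handling, and coefficient count are illustrative assumptions, not the workshop configuration.

```python
import numpy as np
from scipy.fftpack import dct

def extract_dct_features(roi, num_coeffs=24):
    """Illustrative ROI-box DCT feature extraction (the 'external detector' path).

    roi: 2-D grayscale mouth-region image cropped by a face/mouth detector.
    Returns the lowest-frequency DCT coefficients as a feature vector.  The ROI
    size, coefficient count, and scanning order are assumptions for illustration.
    """
    roi = roi.astype(float)
    coeffs = dct(dct(roi, axis=0, norm='ortho'), axis=1, norm='ortho')  # 2-D DCT
    # Keep a low-frequency block and flatten it (a simple stand-in for zig-zag scanning).
    k = int(np.ceil(np.sqrt(num_coeffs)))
    return coeffs[:k, :k].flatten()[:num_coeffs]
```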
Training Data: 4,072 hand-labelled images = 2 min 13 s of video (out of 50 h)
Final Model (figure: 33 and 33 modes, plus the mean)
Fitting Algorithm: warp the image under the model back to the reference frame, take the difference between it and the current model projection (appearance), use that error to predict a weighted update to c, where c is all model parameters, and iterate until convergence.
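A minimal sketch of this fitting loop, assuming hypothetical helpers `model.warp_to_reference`, `model.appearance`, and a learned update matrix `model.R` (these names are not from any particular library):

```python
import numpy as np

def fit_aam(image, model, c0, max_iters=30, tol=1e-6):
    """Minimal sketch of the iterative AAM fitting loop described above.

    c holds all model parameters (shape, appearance, pose).  Appearances are
    handled as vectors sampled in the reference frame.
    """
    c = c0.copy()
    prev_error = np.inf
    for _ in range(max_iters):
        sampled = model.warp_to_reference(image, c)   # image under model, warped to the reference frame
        projected = model.appearance(c)               # current model projection (appearance)
        diff = sampled - projected                    # difference image (as a vector)
        error = float(diff @ diff)
        if abs(prev_error - error) < tol:             # iterate until convergence
            return c
        delta_c = model.R @ diff                      # predicted parameter update
        best_c, best_err = c, error
        for w in (1.0, 0.5, 0.25):                    # try a few update weights, keep the best
            cand = c - w * delta_c
            d = model.warp_to_reference(image, cand) - model.appearance(cand)
            e = float(d @ d)
            if e < best_err:
                best_c, best_err = cand, e
        c, prev_error = best_c, error
    return c
```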
Tracking Results: worst sequence, mean mean-square error = 548.87; best sequence, mean mean-square error = 89.11
Tracking Results: full-face AAM tracker on a subset of the VVAV database: 4,952 sequences, 1,119,256 images @ 30 fps = 10 h 22 m. Mean of the per-sentence mean MSE = 254.21. Tracking rate (m2p decode): 4 fps. Beard-area and lips-only models will not track; do these regions lack the sharp texture gradients needed to locate the model?
Features: use the AAM full-face features directly (86-dimensional)
Audio Lattice Rescoring Results (word error rate): lattice random path = 78.14%; DCT with LM = 51.08%; DCT without LM = 61.06%
Audio Lattice Rescoring Results: AAM vs. DCT vs. noise features (figure)
Tracking Error Analysis: AAM performance vs. tracking error (figure)
Analysis and Future Work:
Models are under-trained: little more than face detection on 2 minutes of training data.
Project the face through a more compact model and reproject: retain only the useful articulation information?
Improve the reference shape: minimal information loss through the warping?
Asynchronous Stream Modelling (Juergen Luettin)
The Recognition Problem: M = word (phoneme) sequence; M* = most likely word sequence; O_A = acoustic observation sequence; O_V = visual observation sequence.
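In the usual MAP formulation (the standard criterion, not quoted from the slide), the most likely word sequence is

$$ M^{*} = \arg\max_{M} P(M \mid O_A, O_V) = \arg\max_{M} P(O_A, O_V \mid M)\,P(M) $$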
Integration at the Feature Level. Assumption: conditional dependence between the modalities; integration at the feature level.
Integration at the Decision Level. Assumption: conditional independence between the modalities; integration at the unit level.
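The two assumptions correspond to two factorizations of the audio-visual likelihood (written in a standard way; the stream-weighted form appears later with the multi-stream models):

$$ \text{feature level:}\quad P(O_A, O_V \mid M) \ \text{modelled directly on the joint (concatenated) observations} $$
$$ \text{decision level:}\quad P(O_A, O_V \mid M) \approx P(O_A \mid M)\,P(O_V \mid M) $$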
Multiple Synchronous Streams. Assumption: conditional independence; integration at the state level. Two streams in each state. Notation: X is the state sequence; a_ij the transition probability from state i to j; b_j the observation probability density; c_jm the m-th mixture weight of the multivariate Gaussian N.
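With the notation above, the state-synchronous two-stream output density is commonly written with stream exponents λ_s acting as stream weights; this is the standard form, and the exact variant used in the workshop system is not shown on the slide:

$$ b_j\big(o^A_t, o^V_t\big) = \prod_{s \in \{A,V\}} \Bigg[ \sum_{m=1}^{M_s} c_{jsm}\, \mathcal{N}\big(o^s_t;\, \mu_{jsm}, \Sigma_{jsm}\big) \Bigg]^{\lambda_s} $$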
Multiple Asynchronous Streams. Assumption: conditional independence; integration at the unit level. Decoding finds individual best state sequences for audio and video.
Composite HMM definition (state diagram, states 1-9). Speech-noise decomposition (Varga & Moore, 1993); audio-visual decomposition (Dupont & Luettin, 1998).
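As a rough illustration of the composite-state idea, the sketch below pairs the states of an audio HMM and a video HMM into product states, optionally bounding how far the two streams may drift apart within a unit; the state counts and asynchrony bound are assumptions, not the workshop topology.

```python
from itertools import product

def composite_states(n_audio, n_video, max_async=None):
    """Build composite (audio, video) states for one unit.

    Pairs every audio state with every video state, optionally limiting how far
    the two streams may drift apart within the unit.
    """
    states = []
    for a, v in product(range(n_audio), range(n_video)):
        if max_async is None or abs(a - v) <= max_async:
            states.append((a, v))
    return states

# Two 3-state streams give up to 9 composite states, matching the 3x3 grid
# usually drawn for audio-visual decomposition models.
print(composite_states(3, 3))               # full product: 9 states
print(composite_states(3, 3, max_async=1))  # restricted asynchrony: 7 states
```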
Stream Clustering
AVSR System: 3-state HMMs with 12 mixture components; 7-state HMM for the composite model. Context-dependent phone models (plus silence and short pause) with tree-based state clustering. Cross-word context-dependent decoding, using lattices computed at IBM; trigram language model. Global stream weights in the multi-stream models, estimated on a held-out set.
Speaker-independent word recognition
Conclusions: the AV two-stream asynchronous model beats the other models in noisy conditions. Future directions: Transition matrices: context dependence, pruning transitions with low probability, cross-unit asynchrony. Stream weights: model-based, discriminative. Clustering: taking stream tying into account.
Phone-Dependent Weighting (Dimitra Vergyri)
Weight Estimation (Hervé Glotin)
Visual Clustering (June Sison)
Outline: motivation for using visemes in triphone classification; definition of visemes; goals of viseme usage; inspection of phone trees (validity check).
Equivalence Classification. Combats the problem of data sparseness. Must be sufficiently refined so that the equivalence classification can serve as a basis for prediction. Decision trees are used to achieve the equivalence classification [co-articulation]. To derive an EC: 1) collect speech data realizing each phone; 2) classify (cluster) this speech into appropriately distinct categories.
Definition of visemes: canonical mouth shapes that accompany speech utterances; the viseme stream complements the phonetic stream [examples].
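The slide's bracketed examples did not survive extraction; the toy mapping below shows the idea of several phonemes collapsing to one mouth shape. The class names and memberships follow common viseme groupings and are purely illustrative, not the definition used in this work.

```python
# Illustrative viseme classes: phonemes that share a mouth shape map to one class.
VISEME_CLASSES = {
    "bilabial":    ["p", "b", "m"],
    "labiodental": ["f", "v"],
    "rounded":     ["w", "uw", "ao"],
    "open":        ["aa", "ae", "ah"],
}

PHONE_TO_VISEME = {ph: vis for vis, phones in VISEME_CLASSES.items() for ph in phones}
print(PHONE_TO_VISEME["b"])   # -> "bilabial"
```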
Visual vs. Audio Contexts. Question sets (QS): 276 total = 84 single-phoneme QS + 116 audio QS + 76 visual QS. Root nodes: 123 = 33 visual + 74 audio + 16 single-phoneme.
Visual Models (Azad Mashari)
Visual Speech Recognition. The Model Trinity: the Audio-Clustered Model (question set 1), the Self-Clustered Model (question set 1), and the Self-Clustered Model (question set 2). The "Results" (from which we learn what not to do). The Analysis. Places to Go, Things to Do...
The Questions. Set 1, original audio questions: 202 questions, based primarily on voicing and manner. Set 2, audio-visual questions: 274 (includes Set 1), adding questions regarding place of articulation.
The Trinity. Audio-Clustered model: decision trees generated from the audio data using question set 1; the visual triphone models are clustered using those trees. Self-Clustered (old): decision trees generated from the visual data using question set 1. Self-Clustered (new): decision trees generated from the visual data using question set 2.
Experiment I. Three major factors: independence/complementarity of the two streams; quality of the representation; generalization. Speaker-independent test: noisy audio lattices rescored using the visual models.
Experiment I: rescoring noisy audio lattices using the visual models.
Experiment I
Speaker variability of the visual models follows the variability of the audio models (we don't know why; the lattices?). This does not mean that they are not "complementary". Viseme clustering gives better results for some speakers only, with no overall gain (we don't know why). Are the new questions being used? Over-training? There are ~7,000 clusters in the audio models for ~40 phonemes, and the same number in the visual models even though there are only ~12 "visemes" -> experiments with fewer clusters. Is the greedy clustering algorithm making a less optimal tree with the new questions?
Experiment II. Several ways to get fewer clusters: increase the minimum cluster size; increase the likelihood-gain threshold; remove questions (especially those frequently used at higher depths, as well as unused ones); any combination of the above. Tripling the minimum likelihood-gain threshold (single-mixture models) gives an insignificant increase in error: ~7,000 clusters -> 54.24%; ~2,500 clusters -> 54.57%. Even fewer clusters (~150-200)? A different reduction strategy?
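For reference, here is a sketch of the greedy tree-growing loop whose controls are listed above (minimum cluster size, likelihood-gain threshold, question set); `loglik` and the question representation are assumed helpers, and this is not HTK's implementation.

```python
def grow_tree(states, questions, loglik, min_cluster_size, min_gain):
    """Greedy top-down state clustering for tree-based tying (sketch).

    `states` is a list of triphone states, `questions` a list of yes/no membership
    tests, and `loglik(cluster)` an assumed helper returning the log likelihood of
    pooling the cluster into a single Gaussian.
    """
    def best_split(cluster):
        base = loglik(cluster)
        best = None
        for q in questions:
            yes = [s for s in cluster if q(s)]
            no = [s for s in cluster if not q(s)]
            if len(yes) < min_cluster_size or len(no) < min_cluster_size:
                continue                      # respect the minimum cluster size
            gain = loglik(yes) + loglik(no) - base
            if best is None or gain > best[0]:
                best = (gain, yes, no)
        return best

    leaves, clusters = [states], []
    while leaves:
        cluster = leaves.pop()
        split = best_split(cluster)
        if split is None or split[0] < min_gain:
            clusters.append(cluster)          # stop: this becomes one tied cluster
        else:
            leaves.extend([split[1], split[2]])
    return clusters
```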
Places to Go, Things to See... Finding optimal clustering parameters: the current values are optimized for MFCC-based audio models. Clustering with viseme-based questions only. Looking at errors in the recognition of particular phones/classes.
Visual Model Adaptation (Jie Zhou)
Visual Model Adaptation. Problem: the speaker-independent system is not sufficient to accurately model each new speaker. Solution: use adaptation to make the speaker-independent system better fit the characteristics of each new speaker.
HMM Adaptation. To get a new estimate of the adapted mean µ, we use the transformation µ = Wε, where W is the (n x n) transformation matrix, n is the dimensionality of the data, and ε is the original mean vector.
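Applying this global transform to a trained model is just a matrix product per Gaussian; a minimal sketch with illustrative names (W itself would be estimated on the adaptation data, e.g. by MLLR):

```python
import numpy as np

def adapt_means(means, W):
    """Apply the global transform mu = W @ epsilon to every Gaussian mean.

    means : (num_gaussians, n) array of original means (epsilon on the slide)
    W     : (n, n) transformation matrix estimated on the adaptation data
    Returns the adapted means, one row per Gaussian.
    """
    return means @ W.T   # row-wise W @ epsilon
```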
(Figure: speaker-independent data with original mean ε contrasted with speaker-specific data with adapted mean µ.)
(Block diagram: HEAdapt transforms the speaker-independent VVAV HMM models (ε, σ) into a transformed speaker-independent model (µ = Wε), which is used for recognition on the speaker-adapted test data.)
Procedure. Speaker adaptation of the visual models was performed using MLLR with a global transform on single-mixture triphones. Adaptation data: 5 minutes per speaker on average. Test data: 6 minutes per speaker on average.
Results (word error, %)
Speaker   Speaker Independent   Speaker Adapted
AXK       44.05                 41.92
JFM       61.41                 59.23
JXC       62.28                 60.48
LCY       31.23                 29.32
MBG       83.73                 83.56
MDP       30.16                 29.89
RTG       57.44                 55.73
BAE       36.81                 36.17
CNM       84.73                 83.89
DJF       71.96                 71.15
Average   58.98                 55.49
Future. Better adaptation could be achieved by: employing multiple transforms instead of a single transform; attempting other methods of adaptation, such as MAP, with more data; using mixture Gaussians in the model.
Summary and Conclusions (Chalapathy Neti)
The End.
Extra Slides…
State-based Clustering
Error rate on DCT features (word error rate on a small multi-speaker test set)
                                Language Model   No Language Model
Lattice depth 1, clean audio    24.79            27.79
Lattice depth 3, clean audio    25.55            34.58
Lattice, noisy audio            49.79            55.00
Audio Lattice Rescoring Results
Visual Feature                  Word Error Rate, %
AAM - 86 features               65.69
AAM - 30 features               65.66
AAM - 30 + Δ + ΔΔ               69.50
AAM - 86, LDA 24, WiLDA ±7      64.00
DCT - 18 + Δ + ΔΔ               61.80
DCT - 24, WiLDA ±7              58.14
Noise - 30                      61.37
DCT WiLDA, no LM = 65.14
Lattice random path = 78.32
Overview: shape and appearance (figure)