Database and Visual Front End Makis Potamianos
Active Appearance Model Visual Features Iain Matthews
Acknowledgments: Cootes, Edwards, Taylor (Manchester); Sclaroff (Boston)
AAM Overview: Shape & Appearance. Shape: landmarks. Appearance: region of interest, warped to the reference shape.
Relationship to DCT Features: external feature detector vs. model-based learned tracking. Face Detector -> ROI 'box' -> DCT features, vs. AAM Tracker -> explicit shape + appearance modeling -> AAM features.
Training Data: 4072 hand-labelled images = 2 m 13 s (of 50 h)
Final Model [figure: model mean and modes; 33, 33]
Fitting Algorithm: take the image under the model and warp it to the reference frame; the difference between the image and the current model projection (appearance) gives the error; compute the predicted update to c from that error (c is all model parameters), apply it with a step weight, and iterate until convergence.
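The fitting loop described above can be sketched as an iterative parameter update driven by the appearance residual. A minimal sketch, assuming a hypothetical `model` object with `warp_to_reference` and `synthesize` methods and a pre-trained linear update matrix `R` (the standard additive-update AAM scheme); not the workshop's actual implementation:

```python
import numpy as np

def fit_aam(image, model, R, c0, max_iters=30, tol=1e-6):
    """Iterative AAM fitting sketch.

    model.warp_to_reference(image, c): image sampled under the current
        shape, warped into the reference frame (assumed helper).
    model.synthesize(c): current model projection (appearance) in the
        reference frame (assumed helper).
    R: learned linear predictor mapping the residual to a parameter update.
    c0: initial vector of all model parameters c.
    """
    c = c0.copy()
    prev_error = np.inf
    for _ in range(max_iters):
        img_ref = model.warp_to_reference(image, c)   # image under model
        app = model.synthesize(c)                     # current model projection
        residual = img_ref - app                      # difference image
        error = float(residual @ residual)            # squared error
        if prev_error - error < tol:                  # iterate until convergence
            break
        prev_error = error
        c = c - R @ residual                          # predicted update of c
    return c
```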
Tracking Results: worst sequence, mean mean-square error = [value missing]; best sequence, mean mean-square error = 89.11
Tracking Results: full-face AAM tracker on a subset of the VVAV database: 4,952 sequences, 1,119,256 frames at 30 fps = 10 h 22 m. Mean mean-MSE per sentence = [value missing]. Tracking rate (m2p decode): 4 fps. Beard-area and lips-only models will not track; do these regions lack the sharp texture gradients needed to locate the model?
Features: use the AAM full-face features directly (86-dimensional)
Audio Lattice Rescoring Results: lattice random path = 78.14%; DCT with LM = 51.08%; DCT no LM = 61.06%
Audio Lattice Rescoring Results AAM vs. DCT vs. Noise
Tracking Error Analysis [plot: AAM performance vs. tracking error]
Analysis and Future Work: the models are under-trained (little more than face detection on 2 m of training data). Project the face through a more compact model and reproject: retain only the useful articulation information? Improve the reference shape: minimal information loss through the warping?
Asynchronous Stream Modelling Juergen Luettin
The Recognition Problem: M = word (phoneme) sequence; M* = most likely word sequence; O_A = acoustic observation sequence; O_V = visual observation sequence
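A standard MAP formulation consistent with these definitions (an assumed reconstruction; the exact equation on the original slide may differ):

```latex
M^{*} \;=\; \arg\max_{M} \, P(M \mid O_A, O_V)
      \;=\; \arg\max_{M} \, P(O_A, O_V \mid M)\, P(M)
```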
Integration at the Feature Level: assumption of conditional dependence between modalities; integration at the feature level
Integration at the Decision Level: assumption of conditional independence between modalities; integration at the unit level
Multiple Synchronous Streams: assumption of conditional independence; integration at the state level; two streams in each state. Notation: X = state sequence; a_ij = transition probability from state i to j; b_j = probability density; c_jm = m-th mixture weight of the multivariate Gaussian N
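A standard way to write the synchronous two-stream emission density with the symbols above combines per-state, per-stream Gaussian mixtures under exponent stream weights gamma_s (an assumed, HTK-style formulation, not necessarily the slide's exact form):

```latex
b_j(o_t) \;=\; \prod_{s \in \{A,\,V\}}
  \left[ \sum_{m=1}^{M_s} c_{jsm}\,
         \mathcal{N}\!\left(o_{st};\, \mu_{jsm}, \Sigma_{jsm}\right) \right]^{\gamma_s}
```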
Multiple Asynchronous Streams: assumption of conditional independence; integration at the unit level. Decoding: individual best state sequences for audio and video
Composite HMM Definition: speech-noise decomposition (Varga & Moore, 1993); audio-visual decomposition (Dupont & Luettin, 1998)
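One way to picture the composite model is as a product of the per-stream unit HMMs, so audio and video can take different state paths inside a unit while resynchronizing at unit boundaries. A minimal sketch under that reading (illustrative only, not the workshop tool chain):

```python
import numpy as np

def composite_transitions(A_audio, A_video):
    """Transition matrix of a product-state composite HMM.

    A_audio: (Na, Na) transitions of the audio unit HMM.
    A_video: (Nv, Nv) transitions of the video unit HMM.
    Composite state (i, j) pairs audio state i with video state j; with the
    conditional-independence assumption the streams move independently inside
    the unit, so the composite transition matrix is the Kronecker product.
    """
    return np.kron(A_audio, A_video)

# Two 3-state unit HMMs give a 9-state product; the 7-state composite model
# mentioned in the system description presumably keeps only a constrained
# subset of such product states (an assumption here).
```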
Stream Clustering
AVSR System: 3-state HMMs with 12 mixture components; 7-state HMMs for the composite model; context-dependent phone models (plus silence and short pause) with tree-based state clustering; cross-word context-dependent decoding, using lattices computed at IBM; trigram language model; global stream weights in the multi-stream models, estimated on a held-out set
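A minimal sketch of estimating the global stream weights on a held-out set, assuming a hypothetical `decode(w_audio, w_video)` helper that rescores the held-out data with the given weight pair and returns its word error rate (a simple grid search; the workshop's actual estimation procedure may differ):

```python
def estimate_stream_weights(decode, step=0.05):
    """Grid-search a global audio/visual stream weight pair on held-out data.

    decode(w_audio, w_video) -> word error rate, % (assumed helper).
    Weights are constrained to sum to one.
    """
    best_w, best_wer = 0.5, float("inf")
    w = 0.0
    while w <= 1.0 + 1e-9:
        wer = decode(w_audio=w, w_video=1.0 - w)
        if wer < best_wer:
            best_w, best_wer = w, wer
        w += step
    return best_w, best_wer
```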
Speaker independent word recognition
Conclusions: the AV 2-stream asynchronous model beats the other models in noisy conditions. Future directions: transition matrices (context dependent; pruning transitions with low probability; cross-unit asynchrony); stream weights (model based, discriminative); clustering (taking stream-tying into account)
Phone Dependent Weighting Dimitra Vergyri
Weight Estimation Hervé Glotin
Visual Clustering June Sison
Outline: motivation for the use of visemes in triphone classification; definition of visemes; goals of viseme usage; inspection of phone trees (validity check)
Equivalence Classification: combats the problem of data sparseness; must be sufficiently refined so that the equivalence classification can serve as a basis for prediction; decision trees are used to achieve the equivalence classification [co-articulation]. To derive an EC: 1) collect speech data realizing each phone; 2) classify [cluster] this speech into appropriately distinct categories
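Tree-based equivalence classification is usually implemented as greedy splitting by likelihood gain. A simplified sketch assuming single-Gaussian state statistics (occupancy, mean, diagonal variance) and yes/no context questions; the `min_gain` and `min_occ` thresholds correspond to the likelihood-gain and cluster-size controls discussed later, but the code is illustrative, not the toolkit used in the workshop:

```python
import numpy as np

def cluster_loglike(occ, mean, var):
    """Approximate log-likelihood of a cluster's data under one diagonal Gaussian."""
    return -0.5 * occ * float(np.sum(np.log(2.0 * np.pi * var) + 1.0))

def pool(states):
    """Pool sufficient statistics (occupancy, mean, variance) of a set of states."""
    occ = sum(s["occ"] for s in states)
    mean = sum(s["occ"] * s["mean"] for s in states) / occ
    second = sum(s["occ"] * (s["var"] + s["mean"] ** 2) for s in states) / occ
    return occ, mean, second - mean ** 2

def best_split(states, questions, min_gain, min_occ):
    """Return the (gain, question, yes, no) split with the largest likelihood
    gain, or None if no question clears the gain and occupancy thresholds."""
    occ, mean, var = pool(states)
    parent_ll = cluster_loglike(occ, mean, var)
    best = None
    for q in questions:                      # q(context) -> True/False (assumed)
        yes = [s for s in states if q(s["context"])]
        no = [s for s in states if not q(s["context"])]
        if not yes or not no:
            continue
        o_y, m_y, v_y = pool(yes)
        o_n, m_n, v_n = pool(no)
        if o_y < min_occ or o_n < min_occ:   # minimum cluster size
            continue
        gain = (cluster_loglike(o_y, m_y, v_y)
                + cluster_loglike(o_n, m_n, v_n) - parent_ll)
        if gain > min_gain and (best is None or gain > best[0]):
            best = (gain, q, yes, no)
    return best
```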
Definition of Visemes: canonical mouth shapes that accompany speech utterances; they complement the phonetic stream [examples]
Visual vs. Audio Contexts: 276 question sets (QS) total: 84 single-phoneme QS, 116 audio QS, 76 visual QS. Number of root nodes: visual 74, audio 16, single phoneme [value missing]
Visual Models Azad Mashari
Visual Speech Recognition: The Model Trinity (Audio-Clustered Model, Question Set 1; Self-Clustered Model, Question Set 1; Self-Clustered Model, Question Set 2); The "Results" (from which we learn what not to do); The Analysis; Places to Go, Things to Do...
The Questions. Set 1: original audio questions; 202 questions based primarily on voicing and manner. Set 2: audio-visual questions; 274 questions (includes Set 1); includes questions regarding place of articulation
The Trinity. Audio-clustered model: decision trees generated from the audio data using question set 1; visual triphone models clustered using those trees. Self-clustered (old): decision trees generated from the visual data using question set 1. Self-clustered (new): decision trees generated from the visual data using question set 2
Experiment I: 3 major factors: independence/complementarity of the two streams; quality of the representation; generalization (speaker-independent test). Noisy audio lattices rescored using the visual models
Experiment I: rescoring noisy audio lattices using the visual models
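Rescoring amounts to re-ranking lattice paths with a combined score. A toy sketch, assuming each path already carries an audio acoustic score and an LM score, and that a hypothetical `visual_score(words)` helper returns the visual log-likelihood of a word sequence; the weights are illustrative, not the values used in the experiments:

```python
def rescore_lattice(paths, visual_score, w_audio=1.0, w_visual=1.0, w_lm=1.0):
    """Pick the best lattice path under a weighted audio + visual + LM score.

    paths: iterable of dicts with 'words', 'audio_score', 'lm_score'.
    visual_score(words) -> visual log-likelihood of that word sequence
    (assumed helper).
    """
    def combined(path):
        return (w_audio * path["audio_score"]
                + w_visual * visual_score(path["words"])
                + w_lm * path["lm_score"])

    return max(paths, key=combined)
```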
Experiment I [results chart]
Speaker variability of the visual models follows the variability of the audio models (we don't know why; lattices?). This does not mean that they are not "complementary". Viseme clustering gives better results for some speakers only; no overall gain (we don't know why). Are the new questions being used? Over-training? ~7000 clusters in the audio models for ~40 phonemes; the same number in the visual models, but there are only ~12 "visemes" -> experiments with fewer clusters. Is the greedy clustering algorithm making a less optimal tree with the new questions?
Experiment II. Several ways to get fewer clusters: increase the minimum cluster size; increase the likelihood gain threshold; remove questions (especially those frequently used at higher depths, as well as unused ones); any combination of the above. Tripling the minimum likelihood gain threshold (single-mixture models) -> insignificant increase in error: ~7000 clusters -> 54.24%, ~2500 clusters -> 54.57%. Even fewer clusters (~ )? A different reduction strategy?
Places to Go, Things to See... Finding optimal clustering parameters (current values are optimized for MFCC-based audio models); clustering with viseme-based questions only; looking at errors in recognition of particular phones/classes
Visual Model Adaptation Jie Zhou
Visual Model Adaptation. Problem: the speaker-independent system is not sufficient to accurately model each new speaker. Solution: use adaptation to make the speaker-independent system better fit the characteristics of each new speaker
HMM Adaptation: to get a new estimate of the adapted mean µ, we use the transformation µ = Wε, where W is the (n x n) transformation matrix, n is the dimensionality of the data, and ε is the original mean vector
[Figure: speaker-independent data with mean ε vs. speaker-specific data with mean µ]
[Diagram: speaker-independent VVAV HMM models (ε, σ) and speaker data passed through HEAdapt to give the transformed speaker-independent model (µ = Wε); recognition is run on the speaker-adapted test data]
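Under the slide's simplified µ = Wε relation (square W, no bias term), a global transform can be estimated by occupancy-weighted least squares from adaptation-data statistics and then applied to every speaker-independent mean. A sketch of that simplification only, not HTK's HEAdapt/MLLR implementation:

```python
import numpy as np

def estimate_global_transform(occ, obs_mean, si_mean):
    """Occupancy-weighted least-squares estimate of a global (n x n) transform W
    such that adapted means are mu_i = W @ eps_i.

    occ      : (G,)   Gaussian occupation counts from the adaptation data
    obs_mean : (G, n) mean adaptation observation assigned to each Gaussian
    si_mean  : (G, n) speaker-independent means eps_i
    Simplifications (assumed): unit covariances, no bias term.
    """
    g = occ[:, None, None]
    num = np.sum(g * obs_mean[:, :, None] * si_mean[:, None, :], axis=0)  # sum g * o e^T
    den = np.sum(g * si_mean[:, :, None] * si_mean[:, None, :], axis=0)   # sum g * e e^T
    return num @ np.linalg.inv(den)

def adapt_means(W, si_means):
    """Apply the global transform to all speaker-independent means (mu = W eps)."""
    return si_means @ W.T
```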
Procedure: speaker adaptation on the visual models was performed using MLLR (method of adaptation), a global transform, and single-mixture triphones. Adaptation data: 5 minutes per speaker on average. Test data: 6 minutes per speaker on average
Results (word error, %):
Speaker   Speaker Independent   Speaker Adapted
AXK       44.05%                41.92%
JFM       61.41%                59.23%
JXC       62.28%                60.48%
LCY       31.23%                29.32%
MBG       83.73%                83.56%
MDP       30.16%                29.89%
RTG       57.44%                55.73%
BAE       36.81%                36.17%
CNM       84.73%                83.89%
DJF       71.96%                71.15%
Average   58.98%                55.49%
Future: better adaptation can be achieved by employing multiple transforms instead of a single transform; attempting other methods of adaptation, such as MAP, with more data; and using mixture Gaussians in the model
Summary and Conclusions Chalapathy Neti
The End.
Extra Slides…
State based Clustering
Error Rate on DCT Features: word error rate on a small multi-speaker test set. [Table: rows are depth-1 clean-audio lattice, depth-3 clean-audio lattice, and noisy-audio lattice; columns are with language model and no language model; values missing]
Audio Lattice Rescoring Results (visual feature / word error rate, %):
AAM, 86 features: 65.69
AAM, 30 features: 65.66
AAM + AAM, 86, LDA 24, WiLDA ±: [value missing]
DCT + DCT, 24, WiLDA ±: [value missing]
Noise, DCT WiLDA, no LM: [value missing]
Lattice random path: 78.32
Overview [figure: shape and appearance]