Database and Visual Front End Makis Potamianos
Active Appearance Model Visual Features Iain Matthews
Acknowledgments: Cootes, Edwards, Taylor (Manchester); Sclaroff (Boston)
AAM Overview: Shape & Appearance. Shape: landmarks. Appearance: region of interest, warped to the reference shape.
Relationship to DCT Features: external feature detector vs. model-based learned tracking. Face Detector -> ROI 'box' -> DCT features, vs. AAM Tracker -> explicit shape + appearance modeling -> AAM features.
Training Data: 4072 hand-labelled images = 2 m 13 s (of 50 h)
Final Model [figure: model mean and modes; 33, 33]
Fitting Algorithm: take the image under the model and warp it to the reference frame; the difference between the image and the current model projection (appearance) gives the error; compute the predicted update to c from that error (c is all model parameters), apply it with a step weight, and iterate until convergence.
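The fitting loop described above can be sketched as an iterative parameter update driven by the appearance residual. A minimal sketch, assuming a hypothetical `model` object with `warp_to_reference` and `synthesize` methods and a pre-trained linear update matrix `R` (the standard additive-update AAM scheme); not the workshop's actual implementation:

```python
import numpy as np

def fit_aam(image, model, R, c0, max_iters=30, tol=1e-6):
    """Iterative AAM fitting sketch.

    model.warp_to_reference(image, c): image sampled under the current
        shape, warped into the reference frame (assumed helper).
    model.synthesize(c): current model projection (appearance) in the
        reference frame (assumed helper).
    R: learned linear predictor mapping the residual to a parameter update.
    c0: initial vector of all model parameters c.
    """
    c = c0.copy()
    prev_error = np.inf
    for _ in range(max_iters):
        img_ref = model.warp_to_reference(image, c)   # image under model
        app = model.synthesize(c)                     # current model projection
        residual = img_ref - app                      # difference image
        error = float(residual @ residual)            # squared error
        if prev_error - error < tol:                  # iterate until convergence
            break
        prev_error = error
        c = c - R @ residual                          # predicted update of c
    return c
```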
Tracking Results: worst sequence, mean mean-square error = [value missing]; best sequence, mean mean-square error = 89.11
Tracking Results: full-face AAM tracker on a subset of the VVAV database: 4,952 sequences, 1,119,256 frames at 30 fps = 10 h 22 m. Mean mean-MSE per sentence = [value missing]. Tracking rate (m2p decode): 4 fps. Beard-area and lips-only models will not track; do these regions lack the sharp texture gradients needed to locate the model?
Features: use the AAM full-face features directly (86-dimensional)
Audio Lattice Rescoring Results: lattice random path = 78.14%; DCT with LM = 51.08%; DCT no LM = 61.06%
Audio Lattice Rescoring Results AAM vs. DCT vs. Noise
Tracking Error Analysis [plot: AAM performance vs. tracking error]
Analysis and Future Work: the models are under-trained (little more than face detection on 2 m of training data). Project the face through a more compact model and reproject: retain only the useful articulation information? Improve the reference shape: minimal information loss through the warping?
Asynchronous Stream Modelling Juergen Luettin
The Recognition Problem: M = word (phoneme) sequence; M* = most likely word sequence; O_A = acoustic observation sequence; O_V = visual observation sequence
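A standard MAP formulation consistent with these definitions (an assumed reconstruction; the exact equation on the original slide may differ):

```latex
M^{*} \;=\; \arg\max_{M} \, P(M \mid O_A, O_V)
      \;=\; \arg\max_{M} \, P(O_A, O_V \mid M)\, P(M)
```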
Integration at the Feature Level: assumption of conditional dependence between modalities; integration at the feature level
Integration at the Decision Level: assumption of conditional independence between modalities; integration at the unit level
Multiple Synchronous Streams: assumption of conditional independence; integration at the state level; two streams in each state. Notation: X = state sequence; a_ij = transition probability from state i to j; b_j = probability density; c_jm = m-th mixture weight of the multivariate Gaussian N
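A standard way to write the synchronous two-stream emission density with the symbols above combines per-state, per-stream Gaussian mixtures under exponent stream weights gamma_s (an assumed, HTK-style formulation, not necessarily the slide's exact form):

```latex
b_j(o_t) \;=\; \prod_{s \in \{A,\,V\}}
  \left[ \sum_{m=1}^{M_s} c_{jsm}\,
         \mathcal{N}\!\left(o_{st};\, \mu_{jsm}, \Sigma_{jsm}\right) \right]^{\gamma_s}
```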
Multiple Asynchronous Streams: assumption of conditional independence; integration at the unit level. Decoding: individual best state sequences for audio and video
Composite HMM Definition: speech-noise decomposition (Varga & Moore, 1993); audio-visual decomposition (Dupont & Luettin, 1998)
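One way to picture the composite model is as a product of the per-stream unit HMMs, so audio and video can take different state paths inside a unit while resynchronizing at unit boundaries. A minimal sketch under that reading (illustrative only, not the workshop tool chain):

```python
import numpy as np

def composite_transitions(A_audio, A_video):
    """Transition matrix of a product-state composite HMM.

    A_audio: (Na, Na) transitions of the audio unit HMM.
    A_video: (Nv, Nv) transitions of the video unit HMM.
    Composite state (i, j) pairs audio state i with video state j; with the
    conditional-independence assumption the streams move independently inside
    the unit, so the composite transition matrix is the Kronecker product.
    """
    return np.kron(A_audio, A_video)

# Two 3-state unit HMMs give a 9-state product; the 7-state composite model
# mentioned in the system description presumably keeps only a constrained
# subset of such product states (an assumption here).
```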
Stream Clustering
AVSR System: 3-state HMMs with 12 mixture components; 7-state HMMs for the composite model; context-dependent phone models (plus silence and short pause) with tree-based state clustering; cross-word context-dependent decoding, using lattices computed at IBM; trigram language model; global stream weights in the multi-stream models, estimated on a held-out set
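A minimal sketch of estimating the global stream weights on a held-out set, assuming a hypothetical `decode(w_audio, w_video)` helper that rescores the held-out data with the given weight pair and returns its word error rate (a simple grid search; the workshop's actual estimation procedure may differ):

```python
def estimate_stream_weights(decode, step=0.05):
    """Grid-search a global audio/visual stream weight pair on held-out data.

    decode(w_audio, w_video) -> word error rate, % (assumed helper).
    Weights are constrained to sum to one.
    """
    best_w, best_wer = 0.5, float("inf")
    w = 0.0
    while w <= 1.0 + 1e-9:
        wer = decode(w_audio=w, w_video=1.0 - w)
        if wer < best_wer:
            best_w, best_wer = w, wer
        w += step
    return best_w, best_wer
```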
Speaker independent word recognition
Conclusions: the AV 2-stream asynchronous model beats the other models in noisy conditions. Future directions: transition matrices (context dependent; pruning transitions with low probability; cross-unit asynchrony); stream weights (model based, discriminative); clustering (taking stream-tying into account)
Phone Dependent Weighting Dimitra Vergyri
Weight Estimation Hervé Glotin
Visual Clustering June Sison
Outline: motivation for the use of visemes in triphone classification; definition of visemes; goals of viseme usage; inspection of phone trees (validity check)
Equivalence Classification: combats the problem of data sparseness; must be sufficiently refined so that the equivalence classification can serve as a basis for prediction; decision trees are used to achieve the equivalence classification [co-articulation]. To derive an EC: 1) collect speech data realizing each phone; 2) classify [cluster] this speech into appropriately distinct categories
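Tree-based equivalence classification is usually implemented as greedy splitting by likelihood gain. A simplified sketch assuming single-Gaussian state statistics (occupancy, mean, diagonal variance) and yes/no context questions; the `min_gain` and `min_occ` thresholds correspond to the likelihood-gain and cluster-size controls discussed later, but the code is illustrative, not the toolkit used in the workshop:

```python
import numpy as np

def cluster_loglike(occ, mean, var):
    """Approximate log-likelihood of a cluster's data under one diagonal Gaussian."""
    return -0.5 * occ * float(np.sum(np.log(2.0 * np.pi * var) + 1.0))

def pool(states):
    """Pool sufficient statistics (occupancy, mean, variance) of a set of states."""
    occ = sum(s["occ"] for s in states)
    mean = sum(s["occ"] * s["mean"] for s in states) / occ
    second = sum(s["occ"] * (s["var"] + s["mean"] ** 2) for s in states) / occ
    return occ, mean, second - mean ** 2

def best_split(states, questions, min_gain, min_occ):
    """Return the (gain, question, yes, no) split with the largest likelihood
    gain, or None if no question clears the gain and occupancy thresholds."""
    occ, mean, var = pool(states)
    parent_ll = cluster_loglike(occ, mean, var)
    best = None
    for q in questions:                      # q(context) -> True/False (assumed)
        yes = [s for s in states if q(s["context"])]
        no = [s for s in states if not q(s["context"])]
        if not yes or not no:
            continue
        o_y, m_y, v_y = pool(yes)
        o_n, m_n, v_n = pool(no)
        if o_y < min_occ or o_n < min_occ:   # minimum cluster size
            continue
        gain = (cluster_loglike(o_y, m_y, v_y)
                + cluster_loglike(o_n, m_n, v_n) - parent_ll)
        if gain > min_gain and (best is None or gain > best[0]):
            best = (gain, q, yes, no)
    return best
```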
Definition of Visemes: canonical mouth shapes that accompany speech utterances; they complement the phonetic stream [examples]
Visual vs. Audio Contexts: 276 question sets (QS) total: 84 single-phoneme QS, 116 audio QS, 76 visual QS. Number of root nodes: visual 74, audio 16, single phoneme [value missing]
Visual Models Azad Mashari
Visual Speech Recognition: The Model Trinity (Audio-Clustered Model, Question Set 1; Self-Clustered Model, Question Set 1; Self-Clustered Model, Question Set 2); The "Results" (from which we learn what not to do); The Analysis; Places to Go, Things to Do...
The Questions. Set 1: original audio questions; 202 questions based primarily on voicing and manner. Set 2: audio-visual questions; 274 questions (includes Set 1); includes questions regarding place of articulation
The Trinity. Audio-clustered model: decision trees generated from the audio data using question set 1; visual triphone models clustered using those trees. Self-clustered (old): decision trees generated from the visual data using question set 1. Self-clustered (new): decision trees generated from the visual data using question set 2
Experiment I: 3 major factors: independence/complementarity of the two streams; quality of the representation; generalization (speaker-independent test). Noisy audio lattices rescored using the visual models
Experiment I: rescoring noisy audio lattices using the visual models
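Rescoring amounts to re-ranking lattice paths with a combined score. A toy sketch, assuming each path already carries an audio acoustic score and an LM score, and that a hypothetical `visual_score(words)` helper returns the visual log-likelihood of a word sequence; the weights are illustrative, not the values used in the experiments:

```python
def rescore_lattice(paths, visual_score, w_audio=1.0, w_visual=1.0, w_lm=1.0):
    """Pick the best lattice path under a weighted audio + visual + LM score.

    paths: iterable of dicts with 'words', 'audio_score', 'lm_score'.
    visual_score(words) -> visual log-likelihood of that word sequence
    (assumed helper).
    """
    def combined(path):
        return (w_audio * path["audio_score"]
                + w_visual * visual_score(path["words"])
                + w_lm * path["lm_score"])

    return max(paths, key=combined)
```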
Experiment I [results chart]
Speaker variability of the visual models follows the variability of the audio models (we don't know why; lattices?). This does not mean that they are not "complementary". Viseme clustering gives better results for some speakers only; no overall gain (we don't know why). Are the new questions being used? Over-training? ~7000 clusters in the audio models for ~40 phonemes; the same number in the visual models, but there are only ~12 "visemes" -> experiments with fewer clusters. Is the greedy clustering algorithm making a less optimal tree with the new questions?
Experiment II. Several ways to get fewer clusters: increase the minimum cluster size; increase the likelihood gain threshold; remove questions (especially those frequently used at higher depths, as well as unused ones); any combination of the above. Tripling the minimum likelihood gain threshold (single-mixture models) -> insignificant increase in error: ~7000 clusters -> 54.24%, ~2500 clusters -> 54.57%. Even fewer clusters (~ )? A different reduction strategy?
Places to Go, Things to See... Finding optimal clustering parameters (current values are optimized for MFCC-based audio models); clustering with viseme-based questions only; looking at errors in recognition of particular phones/classes
Visual Model Adaptation Jie Zhou
Visual Model Adaptation. Problem: the speaker-independent system is not sufficient to accurately model each new speaker. Solution: use adaptation to make the speaker-independent system better fit the characteristics of each new speaker
HMM Adaptation: to get a new estimate of the adapted mean µ, we use the transformation µ = Wε, where W is the (n x n) transformation matrix, n is the dimensionality of the data, and ε is the original mean vector
[Figure: speaker-independent data with mean ε vs. speaker-specific data with mean µ]
[Diagram: speaker-independent VVAV HMM models (ε, σ) and speaker data passed through HEAdapt to give the transformed speaker-independent model (µ = Wε); recognition is run on the speaker-adapted test data]
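Under the slide's simplified µ = Wε relation (square W, no bias term), a global transform can be estimated by occupancy-weighted least squares from adaptation-data statistics and then applied to every speaker-independent mean. A sketch of that simplification only, not HTK's HEAdapt/MLLR implementation:

```python
import numpy as np

def estimate_global_transform(occ, obs_mean, si_mean):
    """Occupancy-weighted least-squares estimate of a global (n x n) transform W
    such that adapted means are mu_i = W @ eps_i.

    occ      : (G,)   Gaussian occupation counts from the adaptation data
    obs_mean : (G, n) mean adaptation observation assigned to each Gaussian
    si_mean  : (G, n) speaker-independent means eps_i
    Simplifications (assumed): unit covariances, no bias term.
    """
    g = occ[:, None, None]
    num = np.sum(g * obs_mean[:, :, None] * si_mean[:, None, :], axis=0)  # sum g * o e^T
    den = np.sum(g * si_mean[:, :, None] * si_mean[:, None, :], axis=0)   # sum g * e e^T
    return num @ np.linalg.inv(den)

def adapt_means(W, si_means):
    """Apply the global transform to all speaker-independent means (mu = W eps)."""
    return si_means @ W.T
```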
Procedure: speaker adaptation on the visual models was performed using MLLR (method of adaptation), a global transform, and single-mixture triphones. Adaptation data: 5 minutes per speaker on average. Test data: 6 minutes per speaker on average
Results (word error, %):
Speaker   Speaker Independent   Speaker Adapted
AXK       44.05%                41.92%
JFM       61.41%                59.23%
JXC       62.28%                60.48%
LCY       31.23%                29.32%
MBG       83.73%                83.56%
MDP       30.16%                29.89%
RTG       57.44%                55.73%
BAE       36.81%                36.17%
CNM       84.73%                83.89%
DJF       71.96%                71.15%
Average   58.98%                55.49%
Future: better adaptation can be achieved by employing multiple transforms instead of a single transform; attempting other methods of adaptation, such as MAP, with more data; and using mixture Gaussians in the model
Summary and Conclusions Chalapathy Neti
The End.
Extra Slides…
State based Clustering
Error Rate on DCT Features: word error rate on a small multi-speaker test set. [Table: rows are depth-1 clean-audio lattice, depth-3 clean-audio lattice, and noisy-audio lattice; columns are with language model and no language model; values missing]
Audio Lattice Rescoring Results (visual feature / word error rate, %):
AAM, 86 features: 65.69
AAM, 30 features: 65.66
AAM + AAM, 86, LDA 24, WiLDA ±: [value missing]
DCT + DCT, 24, WiLDA ±: [value missing]
Noise, DCT WiLDA, no LM: [value missing]
Lattice random path: 78.32
Overview [figure: shape and appearance]