Audio-Visual Speaker Verification using Continuous Fused HMMs

Presentation transcript:

Slide 1: Audio-Visual Speaker Verification using Continuous Fused HMMs
David Dean*, Sridha Sridharan* and Tim Wark*†
*Speech, Audio, Image and Video Research Laboratory; †CSIRO ICT Centre
Presented by David Dean. Slides will be available at

Slide 2: Why audio-visual speaker recognition?
"Bimodal recognition exploits the synergy between acoustic speech and visual speech, particularly under adverse conditions. It is motivated by the need—in many potential applications of speech-based recognition—for robustness to speech variability, high recognition accuracy, and protection against impersonation." (Chibelushi, Deravi and Mason 2002)

Slide 3: Early and late fusion
Most early approaches to audio-visual speaker recognition (AVSPR) used either early (feature) or late (output) fusion.
Problems:
– Output fusion cannot model temporal dependencies between the modalities
– Feature fusion suffers from problems with noise
[Diagrams: late fusion (separate acoustic and visual speaker models combined at the decision level) vs. early fusion (a single set of speaker models over combined audio-visual features)]

Slide 4: Middle fusion – coupled HMMs
Middle fusion models can accept two streams of input, and the combination is done within the classifier.
Most middle fusion is performed using coupled HMMs (shown here):
– Can be difficult to train
– Dependencies between hidden states are not strong (Brand 1999)

Slide 5: Middle fusion – fused HMMs
Pan et al. (2004) used probabilistic models to investigate the optimal multi-stream HMM design:
– Maximise the mutual information between audio and video
They found that linking the observations of one modality to the hidden states of the other was better than linking just the hidden states (as in the coupled HMM).
[Diagram: acoustic-biased FHMM]
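As a rough formalisation (this equation does not appear on the slide, but follows the fused HMM of Pan et al. 2004), the acoustic-biased FHMM scores a pair of audio and video observation sequences as approximately

\hat{P}(O^A, O^V) \approx P(O^A \mid \lambda^A) \prod_{t=1}^{T} p\!\left(o^V_t \mid \hat{q}^A_t\right)

where \lambda^A is the acoustic HMM, \hat{q}^A_t is its Viterbi state at frame t, and p(o^V_t \mid \hat{q}^A_t) is the subordinate video model attached to that state. The video-biased form swaps the roles of the two streams.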

Slide 6: Choosing the dominant modality
The fused HMM design results in two variants: acoustic-biased and video-biased.
The choice of dominant modality (the one biased towards) should be based on which individual HMM can more reliably estimate the hidden state sequence for a particular application:
– Generally audio
Alternatively, both versions can be used concurrently and decision-fused (as in Pan et al. 2004).

Slide 7: Continuous fused HMMs
The original fused HMM implementation treated the secondary (subordinate) modality as discrete (Pan et al. 2004).
This caused problems with within-speaker variation:
– Works fine on a single session (CUAVE – Dean et al. 2006)
– Fails on multi-session data (XM2VTS)
Continuous FHMMs model both modalities with GMMs.
[Diagrams: continuous FHMM vs. discrete FHMM]

Slide 8: Training FHMMs
Both biased FHMMs (if needed) are trained independently:
1. Train the dominant HMM (audio for acoustic-biased, video for video-biased) independently on the training observation sequences for that modality.
2. Find the best hidden state sequence of the trained HMM for each training observation sequence using the Viterbi algorithm.
3. Model the relationship between the dominant hidden state sequence and the training observation sequences of the subordinate modality:
– i.e. model the probability of seeing a particular subordinate observation while within a particular dominant hidden state.
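A minimal sketch of steps 1–3, assuming Python with hmmlearn and scikit-learn (no toolkit is named on the slides; model sizes and variable names are illustrative, and the subordinate GMMs are simply fit by EM here rather than MAP-adapted from speaker GMMs as described on Slide 17):

```python
import numpy as np
from hmmlearn import hmm
from sklearn.mixture import GaussianMixture

def train_acoustic_biased_fhmm(audio_seqs, video_seqs, n_states=8, n_mix=4):
    """audio_seqs/video_seqs: lists of (T_i, D) arrays, already frame-aligned."""
    # 1. Train the dominant (acoustic) HMM on the audio sequences alone.
    X = np.vstack(audio_seqs)
    lengths = [len(s) for s in audio_seqs]
    dominant = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                               n_iter=20)
    dominant.fit(X, lengths)

    # 2. Viterbi-decode each training sequence to get dominant state sequences.
    state_seqs = [dominant.predict(a) for a in audio_seqs]

    # 3. For each dominant state, model the video frames that fall in it with a GMM.
    video_by_state = {s: [] for s in range(n_states)}
    for states, v in zip(state_seqs, video_seqs):
        for s, frame in zip(states, v):
            video_by_state[s].append(frame)
    subordinate = {
        s: GaussianMixture(n_components=n_mix, covariance_type="diag").fit(np.array(f))
        for s, f in video_by_state.items() if len(f) >= n_mix
    }
    return dominant, subordinate
```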

Slide 9: Decoding FHMMs
The dominant FHMM can be viewed as a special type of HMM that outputs observations in two streams.
This does not affect the decoding lattice, so the Viterbi algorithm can still be used to decode.
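Continuing the sketch above (again an illustration, not the authors' implementation), the two-stream emission can be folded into a standard Viterbi pass by adding the subordinate GMM's frame log-likelihood to the dominant state's emission score; the stream weight sub_weight is an illustrative parameter, not something specified on the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fhmm_viterbi_score(dominant, subordinate, audio, video, sub_weight=1.0):
    """Log-likelihood of the best path, scoring both streams at each frame."""
    n_states = dominant.n_components
    T = len(audio)
    # Per-frame emission log-likelihoods of the dominant (acoustic) stream.
    # hmmlearn exposes full covariance matrices via covars_; adapt if that changes.
    log_b_audio = np.column_stack([
        multivariate_normal(dominant.means_[s], dominant.covars_[s]).logpdf(audio)
        for s in range(n_states)
    ])
    # Subordinate (video) stream: states without a GMM contribute nothing extra.
    log_b_video = np.zeros((T, n_states))
    for s, gmm in subordinate.items():
        log_b_video[:, s] = gmm.score_samples(video)
    log_b = log_b_audio + sub_weight * log_b_video

    log_pi = np.log(dominant.startprob_ + 1e-300)
    log_A = np.log(dominant.transmat_ + 1e-300)

    # Standard Viterbi recursion over the combined emissions.
    delta = log_pi + log_b[0]
    for t in range(1, T):
        delta = np.max(delta[:, None] + log_A, axis=0) + log_b[t]
    return np.max(delta)
```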

Slide 10: Experimental setup
[System diagram: manual lip tracking feeds visual feature extraction, alongside acoustic feature extraction; the resulting features drive the three systems compared in the experiments – acoustic/visual HMM/GMM output fusion, an acoustic-biased FHMM, and a video-biased FHMM – each producing a speaker score.]

Slide 11: Training and testing datasets
The training and testing configuration was based on the XM2VTSDB protocol (Messer et al. 1999).
12 configurations were generated based on the single XM2VTSDB configuration.
For each configuration:
– 400 client tests
– 8000 impostor tests

Slide 12: Feature extraction
Audio:
– MFCC – energy, + deltas and accelerations = 43 features
Video:
– Lip ROI manually tracked every 50 frames
– 120x80 pixels, grayscale, down-sampled to 24x16
– DCT – 20 coefficients + deltas and accelerations = 60 features
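A rough sketch of comparable feature extraction, assuming librosa, OpenCV and SciPy (none of which are named on the slides); the MFCC order and the raster-order DCT coefficient selection are illustrative placeholders for the 43 audio and 60 video features listed above:

```python
import numpy as np
import librosa
import cv2
from scipy.fft import dctn

def audio_features(wav_path, n_mfcc=13):
    """MFCCs plus deltas and accelerations (coefficient count is illustrative)."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)            # (n_mfcc, T)
    feats = np.vstack([mfcc, librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    return feats.T                                                     # (T, 3*n_mfcc)

def video_features(lip_frames, n_dct=20):
    """2-D DCT of each grayscale lip ROI, keeping the first n_dct coefficients."""
    static = []
    for frame in lip_frames:                  # frame: 120x80 grayscale lip ROI
        small = cv2.resize(frame, (24, 16))   # down-sample as on the slide
        coeffs = dctn(small.astype(np.float32), norm="ortho").ravel()
        static.append(coeffs[:n_dct])         # raster order; the slide does not
                                              # say whether zig-zag order was used
    static = np.array(static)
    delta = np.gradient(static, axis=0)       # simple delta / acceleration
    accel = np.gradient(delta, axis=0)
    return np.hstack([static, delta, accel])  # (T, 3*n_dct) = 60 features
```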

Slide 13: Output fusion training
Two classifier types for each modality:
– Gaussian mixture models (GMMs), trained over entire sequences
– Hidden Markov models (HMMs), trained for each word
Speaker models are adapted from background models using maximum a posteriori (MAP) adaptation (Lee & Gauvain 1993).
The topology of the HMMs and GMMs was determined by testing on the evaluation partition of the first configuration.
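As an illustration of the MAP adaptation step, here is the common relevance-factor, mean-only GMM update used in speaker verification; the slide cites Lee & Gauvain (1993) but does not give the exact update, so treat the relevance factor and the mean-only choice as assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm: GaussianMixture, X: np.ndarray, relevance: float = 16.0):
    """Return speaker-adapted means; weights and covariances are kept from the UBM."""
    post = ubm.predict_proba(X)                  # (T, K) component responsibilities
    n_k = post.sum(axis=0)                       # soft counts per component
    # Data-driven mean estimate per component (first-order statistics / counts).
    ex_k = (post.T @ X) / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + relevance))[:, None]   # adaptation coefficient
    return alpha * ex_k + (1.0 - alpha) * ubm.means_
```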

Slide 14: Output fusion testing
Fused HMM performance is compared to output fusion of normal HMMs and GMMs in each modality:
– Audio HMM + Video HMM
– Audio HMM + Video GMM
– Audio GMM + Video HMM
– Audio GMM + Video GMM
The evaluation session is used to estimate each modality's output score distribution, to normalise scores within each modality.
The background model score is subtracted from the speaker scores to normalise for environment and utterance length.
[Diagram: per-modality normalisation followed by output fusion of the two scores]
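A sketch of the resulting scoring chain; the z-normalisation and equal fusion weights are assumptions, since the slide only states that evaluation-set score statistics are used to normalise each modality and that the background score is subtracted:

```python
import numpy as np

def fused_output_score(audio_llr, video_llr, eval_stats, weights=(0.5, 0.5)):
    """audio_llr / video_llr: speaker log-likelihood minus background log-likelihood.
    eval_stats: {'audio': (mean, std), 'video': (mean, std)} estimated on the
    evaluation session."""
    za = (audio_llr - eval_stats['audio'][0]) / eval_stats['audio'][1]
    zv = (video_llr - eval_stats['video'][0]) / eval_stats['video'][1]
    return weights[0] * za + weights[1] * zv   # accept if above a tuned threshold
```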

Slide 15: Output fusion results
HMM-based models take advantage of temporal information to improve performance over GMM-based models in both modalities.
The audio GMM approaches the audio HMM, but only with a large number of Gaussians.
The video GMM does not improve with more Gaussians.

Slide 16: Output fusion results (continued)
The advantage of the video HMM does not carry over to output fusion; there is little difference between the video HMM and GMM in output fusion.
Output fusion performance is affected mostly by the choice of audio classifier.
Output fusion doesn't take advantage of video temporal information.
[Plot: fusion results grouped by audio classifier (Audio GMM vs. Audio HMM)]

Slide 17: Fused HMM training
Both acoustic- and video-biased FHMMs are examined.
The individual audio and video HMMs are used as the basis for the FHMMs.
Secondary models are adapted from the individual speaker's GMMs for each state of the underlying HMM.
The background FHMM is formed similarly.

Slide 18: Fused HMM testing
Subordinate observations are up/down-sampled to the same rate as the dominant HMM's observations.
The evaluation session is used to estimate each modality's frame-score distribution, to normalise scores within each modality:
– Similar to output fusion, but on a frame-by-frame basis rather than using the final output score
As well as using subordinate models adapted to the states of the dominant HMM, testing is performed with:
– Word subordinate models (the same secondary model for an entire word)
– Global subordinate models (the same secondary model for all words)
Finally, the background FHMM score is subtracted to normalise for environment and utterance length.
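A minimal sketch of the up/down-sampling step, assuming simple nearest-neighbour index mapping (the slides do not specify which interpolation was used):

```python
import numpy as np

def resample_to_dominant(subordinate, n_dominant_frames):
    """Repeat or drop subordinate frames so both streams have one frame per
    dominant frame."""
    idx = np.round(np.linspace(0, len(subordinate) - 1, n_dominant_frames)).astype(int)
    return subordinate[idx]
```

For example, 25 fps lip features scored against acoustic features extracted at a typical 100 frames per second would repeat each video frame roughly four times.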

Slide 19: Comparison with output fusion
If the same subordinate model is used for every dominant state, the FHMM can be viewed as functionally equivalent to HMM-GMM output fusion:
– Although in practice this is not quite the case, because of the resampling of the subordinate observations and where the modality normalisation occurs
Word and state subordinate models can also be viewed as functionally equivalent to HMM-GMM output fusion:
– Choose the subordinate GMM based on the dominant state for each frame
– Provided that the FHMM design doesn't affect the best path through the lattice

Slide 20: Acoustic-biased FHMM results
There is some benefit in using state-based FHMM models for audio.
State- or word-based FHMM models are better than global models over most of the plot.
[Plot: acoustic-biased FHMM results, with the best-performing output fusion system marked for comparison]

Slide 21: Video-biased FHMM results
There is no benefit in using word- or state-based FHMM models for video.
Therefore, there is no point in using FHMM models at all in the video-biased case.
[Plot: video-biased FHMM results, with the best-performing output fusion system marked for comparison]

Slide 22: Acoustic- vs. video-biased FHMM
The acoustic-biased FHMM shows that the audio can be used to segment the video into visually-similar sequences.
However, the video-biased FHMM cannot use video to segment the audio into acoustically-similar sequences.
Whilst the performance increase is small, it appears that the acoustic FHMM is benefiting from a temporal relationship between the acoustic states and the visual observations.

Slide 23: Conclusion
The video HMM improves performance over the video GMM, but not when used in output fusion.
Output fusion performance depends mainly on the acoustic classifier chosen.
Audio-biased continuous FHMMs can take advantage of the temporal relationship between audio states and video observations.
However, the video-biased continuous FHMM appears to show no corresponding relationship between video states and audio observations.

Slide 24: Continuing/Future Research
The secondary GMMs are recognising a large amount of static video information:
– Skin or lip colour, facial hair, etc.
This information has no temporal relationship with the audio states, and may be swamping the more dynamic information available in facial movements.
A more efficient structure may be realised by using more dynamic video features (mean-removed DCT, contour-based or optical flow) and output fusion with a face GMM:
– This would take advantage of the temporal audio-visual relationship, in addition to static face recognition.

Slide 25: Speech recognition with fused HMMs
FHMMs improve single-modal speech processing in two ways:
1. The second modality improves the scores within states
2. The second modality improves the state sequence
Text-dependent speaker recognition only benefits from the first improvement:
– The state sequence is fixed
However, speech recognition can take advantage of both improvements.

Slide 26: Speech recognition with fused HMMs (continued)
Uses the first XM2VTS configuration (XM2VTSDB).
Speaker-independent, continuous-speech digit recognition.
PLP-based audio features; hierarchical-LDA video features (derived from mean-removed DCT coefficients).
We believe this performance is comparable to coupled and asynchronous HMMs:
– But simpler to train and decode

Slide 27: FHMMs, synchronous HMMs and feature fusion
The FHMM structure can be implemented as a multi-modal synchronous HMM, and therefore, with minor simplification, as a feature-fusion HMM.
The difference is in how the structure is trained:
– In synchronous HMMs and feature fusion, both modalities are used to train the HMMs
– FHMMs can be viewed as adapting a multi-modal synchronous HMM from the dominant single-modal HMM
If the same number of Gaussians is used for both modalities, an FHMM can be implemented within a single-modal HMM decoder:
– Decoding is exactly the same as with feature fusion

Slide 28: References
Brand, M. (1999), A Bayesian computer vision system for modeling human interactions, in 'ICVS '99', Gran Canaria, Spain.
Chibelushi, C., Deravi, F. & Mason, J. (2002), 'A review of speech-based bimodal recognition', IEEE Transactions on Multimedia 4(1).
Dean, D., Wark, T. & Sridharan, S. (2006), An examination of audio-visual fused HMMs for speaker recognition, in 'MMUA 2006', Toulouse, France.
Lee, C.-H. & Gauvain, J.-L. (1993), Speaker adaptation based on MAP estimation of HMM parameters, in '1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-93)', Vol. 2.
Luettin, J. & Maitre, G. (1998), Evaluation protocol for the extended M2VTS database (XM2VTSDB), Technical report, IDIAP.
Messer, K., Matas, J., Kittler, J., Luettin, J. & Maitre, G. (1999), XM2VTSDB: The extended M2VTS database, in 'Second International Conference on Audio and Video-based Biometric Person Authentication (AVBPA '99)', Washington D.C.
Pan, H., Levinson, S., Huang, T. & Liang, Z.-P. (2004), 'A fused hidden Markov model with application to bimodal speech processing', IEEE Transactions on Signal Processing 52(3).

Slide 29: Questions?