An Examination of Audio-Visual Fused HMMs for Speaker Recognition
David Dean*, Tim Wark*† and Sridha Sridharan*
* Speech, Audio, Image and Video Research Laboratory   † CSIRO ICT Centre
Presented by David Dean

Slide 2: Why audio-visual speaker recognition?
"Bimodal recognition exploits the synergy between acoustic speech and visual speech, particularly under adverse conditions. It is motivated by the need, in many potential applications of speech-based recognition, for robustness to speech variability, high recognition accuracy, and protection against impersonation." (Chibelushi, Deravi and Mason, 2002)

Slide 3: Early and late fusion
[Diagrams: early (feature) fusion combines the acoustic and visual observations (A1-A4, V1-V4) before a single set of speaker models; late (decision) fusion scores separate acoustic and visual speaker models and fuses their decisions.]
Most early approaches to audio-visual speaker recognition (AVSPR) used either early (feature) or late (decision) fusion.
Problems:
- Decision fusion cannot model temporal dependencies between the modalities.
- Feature fusion suffers from problems with noise, and has difficulty modelling the asynchrony of audio-visual speech (Chibelushi et al., 2002).

Slide 4: Middle fusion - coupled HMMs
[Diagram: a coupled HMM linking the hidden states of the audio and video chains.]
Middle-fusion models can accept two streams of input, and the combination is done within the classifier.
Most middle fusion is performed using coupled HMMs (shown here):
- They can be difficult to train.
- The dependencies between the hidden states are not strong (Brand, 1999).

Slide 5: Middle fusion - fused HMMs
[Diagram: acoustic-biased fused HMM, with the video observations linked to the audio hidden states.]
Pan et al. (2004) used probabilistic models to investigate the optimal multi-stream HMM design, maximising the mutual information between audio and video.
They found that linking the observations of one modality to the hidden states of the other was better than linking only the hidden states (as in the coupled HMM).
The fused HMM approach therefore yields two designs: acoustic-biased and video-biased.

Slide 6: Choosing the dominant modality
The choice of dominant modality (the one biased towards) should be based on which individual HMM can more reliably estimate the hidden state sequence for a particular application; this is generally audio.
Alternatively, both versions can be used concurrently and decision fused (as in Pan et al., 2004).
This research looks at the relative performance of each biased FHMM design individually:
- If recognition can be performed using only one FHMM, decoding can be done in half the time compared to decision fusion of both FHMMs.

Slide 7: Training FHMMs
Both biased FHMMs (if needed) are trained independently:
1. Train the dominant HMM (audio for the acoustic-biased FHMM, video for the video-biased FHMM) independently on the training observation sequences for that modality.
2. Find the best hidden state sequence of the trained HMM for each training observation using the Viterbi algorithm.
3. Calculate the coupling parameters between the dominant hidden state sequences and the training observation sequences of the subordinate modality, i.e. estimate the probability of seeing a particular subordinate observation while in a particular dominant hidden state (a sketch of this step is given below).
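To make step 3 concrete, here is a minimal sketch of the coupling-parameter estimation. It is an illustration only, not the authors' implementation: it assumes the dominant HMM is trained with hmmlearn (a stand-in for the HTK models used in the paper), that the subordinate stream has already been vector-quantised into discrete codebook indices, and that add-one smoothing is acceptable.

```python
# Hedged sketch of FHMM coupling-parameter estimation (step 3 above).
# hmmlearn stands in for HTK; the smoothing choice is an assumption.
import numpy as np
from hmmlearn import hmm

def estimate_coupling(dominant_obs, subordinate_vq, n_states=5, codebook_size=100):
    """dominant_obs: list of (T_i, D) feature arrays for the dominant modality.
    subordinate_vq: list of length-T_i integer arrays of VQ indices for the
    subordinate modality, frame-aligned with the dominant stream."""
    # 1. Train the dominant HMM on its own modality.
    lengths = [len(o) for o in dominant_obs]
    dominant_hmm = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
    dominant_hmm.fit(np.vstack(dominant_obs), lengths)

    # 2. Viterbi-align each training sequence to get dominant state sequences,
    # 3. then count subordinate VQ symbols within each dominant state.
    counts = np.ones((n_states, codebook_size))   # add-one smoothing (assumption)
    for obs, vq in zip(dominant_obs, subordinate_vq):
        states = dominant_hmm.predict(obs)        # Viterbi state sequence
        for s, v in zip(states, vq):
            counts[s, v] += 1

    # Normalise rows to estimate P(subordinate VQ symbol | dominant state).
    coupling = counts / counts.sum(axis=1, keepdims=True)
    return dominant_hmm, coupling
```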

Slide 8: Decoding FHMMs
The dominant FHMM can be viewed as a special type of HMM that outputs observations in two streams.
This does not affect the decoding lattice, and the Viterbi algorithm can be used to decode, provided it has access to the observations in both streams (a sketch follows below).
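The sketch below is a hedged reconstruction of what "two streams, one lattice" means in practice, not the authors' decoder: a standard log-domain Viterbi pass in which each state's emission score is the sum of the dominant-stream log-likelihood and the subordinate-stream coupling log-probability estimated above.

```python
# Hedged sketch: Viterbi scoring of a two-stream (fused) HMM.
# The lattice is the ordinary dominant-HMM lattice; only the per-frame
# emission scores change, because both streams are scored in each state.
import numpy as np

def fhmm_viterbi_score(log_startprob, log_transmat, log_emit_dominant,
                       coupling, subordinate_vq):
    """log_emit_dominant: (T, N) per-frame log-likelihoods of the dominant
    stream under each of the N dominant states (computed by whatever HMM
    implementation is in use). coupling: (N, K) table estimating
    P(subordinate VQ symbol | dominant state). Returns the best-path score."""
    log_emit_sub = np.log(coupling[:, subordinate_vq].T + 1e-12)   # (T, N)
    log_emit = log_emit_dominant + log_emit_sub                    # both streams

    delta = log_startprob + log_emit[0]
    for t in range(1, len(log_emit)):
        delta = np.max(delta[:, None] + log_transmat, axis=0) + log_emit[t]
    return delta.max()
```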

Slide 9: Experimental setup
[System diagram: lip location & tracking with visual feature extraction, together with acoustic feature extraction, feed the classifiers under comparison, namely an acoustic HMM, a visual HMM, HMM decision fusion, an acoustic-biased FHMM and a video-biased FHMM, each producing a speaker decision.]

Slide 10: Lip location and tracking
Lip tracking is performed as in Dean et al. (2005).

Slide 11: Feature extraction and datasets
Audio: MFCC and energy features, plus deltas and accelerations (43 features).
Video: 20 DCT coefficients of the lip region, plus deltas and accelerations (60 features).
Isolated speech from CUAVE (Patterson et al., 2002):
- 4 sequences for training and 1 for testing, for each of 36 speakers.
- Each sequence is 'zero one two ... nine'.
- Testing was also performed on noisy data: speech-babble corrupted audio, and poorly-tracked lip region-of-interest video features.
[Images: examples of well-tracked and poorly-tracked lip regions.]
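For illustration only, here is a rough sketch of how features of this general kind can be computed. The original work used HTK and its own lip-tracking pipeline; the libraries below (librosa and scipy), the exact coefficient counts, and the coefficient ordering are stand-in assumptions, not the authors' configuration.

```python
# Hedged sketch of MFCC-style audio features and DCT-style lip-ROI video
# features, each with deltas and accelerations. Library choices are assumptions.
import numpy as np
import librosa
from scipy.fft import dctn

def audio_features(wav_path, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)        # (n_mfcc, T)
    energy = librosa.feature.rms(y=y)                             # (1, T)
    T = min(mfcc.shape[1], energy.shape[1])                       # crude alignment
    static = np.vstack([mfcc[:, :T], energy[:, :T]])
    return np.vstack([static,
                      librosa.feature.delta(static),              # deltas
                      librosa.feature.delta(static, order=2)]).T  # accelerations

def video_features(lip_rois, n_coeffs=20):
    """lip_rois: (T, H, W) array of grayscale, mouth-centred images."""
    # Low-order 2D DCT coefficients (row-major order here, not a true zigzag).
    static = np.array([dctn(roi, norm="ortho").ravel()[:n_coeffs]
                       for roi in lip_rois])
    delta = np.gradient(static, axis=0)                           # simple deltas
    accel = np.gradient(delta, axis=0)                            # accelerations
    return np.hstack([static, delta, accel])                      # (T, 3*n_coeffs)
```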

Slide 12: Fused HMM design
Both acoustic-biased and video-biased FHMMs are examined.
The underlying HMMs are speaker-dependent word models for each digit:
- MLLR-adapted from speaker-independent background word models.
- Trained using the HTK Toolkit (Young et al., 2002).
The secondary models are based on discrete vector-quantisation (VQ) codebooks:
- The codebook is generated from the secondary (subordinate) data.
- The number of occurrences of each discrete VQ value within each state is recorded to arrive at an estimate of the coupling probabilities.
- A codebook size of 100 was found to work best for both modalities.
(A sketch of the codebook construction is given below.)
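The slides do not say how the codebook was generated, so the sketch below simply assumes k-means clustering via scikit-learn; the resulting indices would play the role of the subordinate VQ sequences used in the earlier coupling and decoding sketches.

```python
# Hedged sketch: building a size-100 VQ codebook for the subordinate stream
# and quantising its features. k-means via scikit-learn is an assumption.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(subordinate_feats, codebook_size=100):
    """subordinate_feats: list of (T_i, D) feature arrays (e.g. video DCTs)."""
    kmeans = KMeans(n_clusters=codebook_size, n_init=10, random_state=0)
    kmeans.fit(np.vstack(subordinate_feats))
    return kmeans

def quantise(kmeans, feats):
    """Map a (T, D) feature array to a length-T array of codebook indices."""
    return kmeans.predict(feats)
```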

Slide 13: Decision fusion
Fused HMM performance is compared to decision fusion of normal HMMs in each stream.
The weight of each stream is set by an audio weight parameter α, which can range from 0 (video only) to 1 (audio only); the fused score is α times the acoustic score plus (1 - α) times the visual score.
Two decision-fusion configurations were used:
- α = 0.5
- Simulated adaptive fusion: the best α for each noise level.
(A small sketch of this weighting follows below.)
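Here is a tiny sketch of the weighted fusion and of the oracle-style selection behind "simulated adaptive fusion"; per-utterance log-likelihood scores and the α grid are assumptions, not details from the slides.

```python
# Hedged sketch of late (decision) fusion of per-utterance scores and of the
# "simulated adaptive fusion" oracle that picks the best alpha per noise level.
import numpy as np

def fuse_scores(audio_scores, video_scores, alpha):
    """audio_scores, video_scores: (n_utterances, n_speakers) score matrices."""
    return alpha * audio_scores + (1.0 - alpha) * video_scores

def simulated_adaptive_alpha(audio_scores, video_scores, true_speakers,
                             alphas=np.linspace(0.0, 1.0, 21)):
    """Return the alpha giving the highest identification rate at this noise
    level (an oracle, since it looks at the true labels)."""
    def accuracy(a):
        fused = fuse_scores(audio_scores, video_scores, a)
        return np.mean(fused.argmax(axis=1) == true_speakers)
    return max(alphas, key=accuracy)
```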

Slide 14: Speaker recognition, well-tracked video
[Plot: tier-1 recognition rate versus audio noise level.]
The video HMM, the video-biased FHMM and decision fusion all perform at 100%.
The audio-biased FHMM performs much better than the audio HMM alone, but not as well as video at low noise levels.

Slide 15: Speaker recognition, poorly tracked video
The video is degraded through poor tracking.
The video-biased FHMM shows no real improvement over the video HMM.
The audio-biased FHMM is better than all other configurations for most audio noise levels, even better than simulated adaptive fusion.

Slide 16: Video- vs. audio-biased FHMM
Adding video to audio HMMs to create an acoustic-biased FHMM provides a clear improvement over the audio HMM alone.
However, adding audio to video HMMs provides negligible improvement:
- The video HMM provides poor state alignment.

Slide 17: Acoustic-biased FHMM vs. decision fusion
FHMMs can take advantage of the relationship between the modalities on a frame-by-frame basis.
Decision fusion can only compare two scores over an entire utterance.
The FHMM even works better than simulated adaptive fusion for most noise levels:
- Actual adaptive fusion would require estimation of the noise levels.
- The FHMM runs with no knowledge of the noise.

Slide 18: Conclusion
Acoustic-biased FHMMs provide a clear improvement over acoustic HMMs.
Video-biased FHMMs do not improve upon video HMMs: video HMMs are unreliable at estimating state sequences.
The acoustic-biased FHMM performs better than simulated adaptive decision fusion at most noise levels, with around half the decoding cost (more once the cost of real adaptive fusion is included).

Slide 19: Future and continuing work
As the CUAVE database is quite small for speaker recognition experiments (only 36 subjects), research has continued on the XM2VTS database (Messer et al., 1999), which has 295 subjects.
Continuous GMM secondary models have replaced the VQ secondary models, as the video DCT VQ could not handle session variability.
Verification (rather than identification) allows system performance to be examined more easily.
The system is still under development.

Slide 20: References
M. Brand, "A Bayesian computer vision system for modeling human interactions," in ICVS '99, Gran Canaria, Spain, 1999.
C. Chibelushi, F. Deravi, and J. Mason, "A review of speech-based bimodal recognition," IEEE Transactions on Multimedia, vol. 4, no. 1, pp. 23–37, 2002.
D. Dean, P. Lucey, S. Sridharan, and T. Wark, "Comparing audio and visual information for speech processing," in ISSPA 2005, Sydney, Australia, 2005, pp. 58–61.
K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, "XM2VTSDB: The extended M2VTS database," in Second International Conference on Audio and Video-based Biometric Person Authentication (AVBPA '99), Washington D.C., 1999, pp. 72–77.
H. Pan, S. Levinson, T. Huang, and Z.-P. Liang, "A fused hidden Markov model with application to bimodal speech processing," IEEE Transactions on Signal Processing, vol. 52, no. 3, pp. 573–581, 2004.
E. Patterson, S. Gurbuz, Z. Tufekci, and J. Gowdy, "CUAVE: A new audio-visual database for multimodal human-computer interface research," in Proceedings of ICASSP '02, vol. 2, 2002, pp. 2017–2020.
S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book, 3rd ed. Cambridge, UK: Cambridge University Engineering Department, 2002.

Slide 21: Questions?