Comparing Audio and Visual Information for Speech Processing

Presentation transcript:

Comparing Audio and Visual Information for Speech Processing
David Dean*, Patrick Lucey*, Sridha Sridharan* and Tim Wark*†
* Speech, Audio, Image and Video Research Laboratory
† e-Health Research Centre / CSIRO ICT Centre
Presented by David Dean

2 Audio-Visual Speech Processing - Overview
Speech or speaker recognition has traditionally been audio only
– Mature area of research
Significant problems in real-world environments (Wark2001)
– High acoustic noise
– Variation of speech
Audio-visual speech processing adds an additional modality to help alleviate these problems

3 Audio-Visual Speech Processing - Overview
Speech and speaker recognition tasks have many overlapping areas
The same configuration can be used for both text-dependent speaker recognition and speaker-dependent speech recognition:
– Train speaker-dependent word (or sub-word) models
– Speaker recognition chooses amongst speakers for a particular word, or
– Word recognition chooses amongst words for a particular speaker

4 Audio-Visual Speech Processing - Overview
Little research has been done into how the two applications (speaker vs. speech recognition) differ in areas other than the set of models chosen for recognition
One area of interest in this research is the reliance on each modality:
– Acoustic features typically work equally well in either application (Young2002)
– Little consensus has been reached on the suitability of visual features for each application

5 Experimental Setup
[Block diagram: lip location & tracking feeds visual feature extraction and the visual speech/speaker models; acoustic feature extraction feeds the acoustic speech/speaker models; the two model streams are combined by decision fusion to produce the speech/speaker decision]

6 Lip location and tracking

7 Finding Faces
Manual red, green and blue skin thresholds were trained for each speaker
Faces were located by applying these thresholds to the video frames
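As an illustration of this face-finding step, here is a minimal Python/OpenCV sketch of per-speaker RGB skin thresholding. The threshold ranges, the use of the largest connected component as the face, and the function names are assumptions for illustration, not the original implementation.

```python
import cv2
import numpy as np

def skin_mask(frame_bgr, r_range, g_range, b_range):
    """Apply per-speaker manual RGB skin thresholds to one video frame.

    frame_bgr: HxWx3 uint8 image (OpenCV BGR order).
    *_range:   (low, high) tuples trained by hand for this speaker.
    Returns a binary mask of candidate skin pixels.
    """
    b, g, r = cv2.split(frame_bgr)
    mask = ((r_range[0] <= r) & (r <= r_range[1]) &
            (g_range[0] <= g) & (g <= g_range[1]) &
            (b_range[0] <= b) & (b <= b_range[1]))
    return mask.astype(np.uint8) * 255

def largest_face_box(mask):
    """Take the bounding box of the largest connected skin region as the face."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    if n < 2:
        return None
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    x, y, w, h = stats[largest, :4]
    return x, y, w, h

# Example (threshold values are illustrative only):
# frame = cv2.imread("frame.png")
# mask = skin_mask(frame, r_range=(95, 220), g_range=(40, 170), b_range=(20, 140))
# face_box = largest_face_box(mask)
```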

8 Finding and tracking eyes
The top half of the face region is searched for eyes
A shifted version of Cr-Cb thresholding is performed to locate possible eye regions (Butler2003)
Invalid eye candidate regions are removed, and the most likely pair of candidates is chosen as the eyes
The new eye locations are compared to the old ones, and ignored if they have moved too far
About 40% of sequences had to be manually eye-tracked every 50 frames
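The Cr-Cb eye-detection details follow Butler2003 and are not reproduced here; the sketch below only illustrates the temporal sanity check on this slide (keep the previous eye estimate when the new detection jumps too far). The `max_shift` value and the function name are assumptions.

```python
import numpy as np

def update_eye_positions(prev_eyes, candidate_eyes, max_shift=15.0):
    """Accept newly detected eye locations only if they stay near the
    previous frame's locations; otherwise keep the old estimate.

    prev_eyes, candidate_eyes: ((x_left, y_left), (x_right, y_right)) pixel coordinates.
    max_shift: maximum allowed per-eye movement in pixels (illustrative value).
    """
    if candidate_eyes is None:
        return prev_eyes
    prev = np.asarray(prev_eyes, dtype=float)
    cand = np.asarray(candidate_eyes, dtype=float)
    shifts = np.linalg.norm(cand - prev, axis=1)
    return candidate_eyes if np.all(shifts <= max_shift) else prev_eyes
```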

9 Finding and tracking lips
Eye locations are used to define a rotation-normalised lip search region (LSR)
The LSR is converted to red/green colour space and thresholded
Unlikely lip candidates are removed
The rectangular area containing the largest amount of lip-candidate area is taken as the lip ROI
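A hedged sketch of the lip-ROI step: rotate the frame so the eye line is horizontal, search a region below the eyes, and keep the bounding box of the largest red-dominant blob. The search-region proportions and the red/green ratio test are illustrative stand-ins for the red/green colour-space thresholding used in the original work.

```python
import cv2
import numpy as np

def lip_roi(frame_bgr, left_eye, right_eye, rg_thresh=1.15):
    """Locate a lip region of interest below the eyes.

    1. Rotate the frame so the eye line is horizontal (rotation normalisation).
    2. Search a box below the eye midpoint for red-dominant pixels, using a
       simple red/green ratio as the lip-likelihood test (illustrative choice).
    3. Return the bounding box of the largest lip-candidate region.
    """
    (lx, ly), (rx, ry) = left_eye, right_eye
    angle = np.degrees(np.arctan2(ry - ly, rx - lx))
    centre = ((lx + rx) / 2.0, (ly + ry) / 2.0)
    rot = cv2.getRotationMatrix2D(centre, angle, 1.0)
    norm = cv2.warpAffine(frame_bgr, rot, (frame_bgr.shape[1], frame_bgr.shape[0]))

    eye_dist = np.hypot(rx - lx, ry - ly)
    x0 = max(int(centre[0] - eye_dist), 0)          # search-region proportions
    x1 = int(centre[0] + eye_dist)                  # are assumptions
    y0 = max(int(centre[1] + 0.8 * eye_dist), 0)
    y1 = int(centre[1] + 2.0 * eye_dist)
    search = norm[y0:y1, x0:x1].astype(float)

    red, green = search[:, :, 2], search[:, :, 1] + 1e-6
    mask = ((red / green) > rg_thresh).astype(np.uint8) * 255

    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    if n < 2:
        return None
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    x, y, w, h = stats[largest, :4]
    return (x + x0, y + y0, w, h)                   # box in rotated-frame coordinates
```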

10 Feature Extraction and Datasets
MFCC (with energy), plus deltas and accelerations = 48 features
PCA – 20 eigenlip coefficients, plus deltas and accelerations = 60 features
– Eigenlip space trained on the entire data set of lip images
Stationary speech from CUAVE (Patterson2002)
– 5 sequences for training, 2 for testing (per speaker)
– Testing was also performed on speech-babble-corrupted noisy versions
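A rough sketch of how the two feature streams could be computed with librosa and scikit-learn. The exact MFCC settings, the 15 + 1 (energy) split behind the 48-dimensional acoustic vector, and the lip-image preprocessing are assumptions consistent with the counts on the slide, not the original feature pipeline.

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA

def add_deltas(static):
    """Append delta and acceleration (delta-delta) coefficients along time."""
    delta = librosa.feature.delta(static)
    accel = librosa.feature.delta(static, order=2)
    return np.vstack([static, delta, accel])

def acoustic_features(wav_path, sr=16000):
    """15 MFCCs + log energy, with deltas and accelerations -> 48 dims per frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=15)
    energy = np.log(librosa.feature.rms(y=y) + 1e-10)
    n = min(mfcc.shape[1], energy.shape[1])
    static = np.vstack([mfcc[:, :n], energy[:, :n]])
    return add_deltas(static).T                    # shape (frames, 48)

def visual_features(lip_images, pca=None):
    """20 eigenlip coefficients with deltas and accelerations -> 60 dims per frame.

    lip_images: array of shape (frames, H, W), grey-level lip ROIs of one size.
    """
    flat = lip_images.reshape(len(lip_images), -1).astype(float)
    if pca is None:            # eigenlip space trained on the full set of lip images
        pca = PCA(n_components=20).fit(flat)
    coeffs = pca.transform(flat).T                 # shape (20, frames)
    return add_deltas(coeffs).T, pca               # shape (frames, 60)
```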

11 Training
Phone transcriptions obtained from earlier research (Lucey2004) were used to train speaker-independent HMM phone models in both the audio and visual domains
Speaker-dependent models were adapted from the speaker-independent models using MLLR adaptation
The HMM Toolkit (HTK) was used (Young2002)
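The original models were trained with HTK and adapted with MLLR, which is not shown here. As a rough stand-in for the speaker-independent phone-model training step only, the sketch below fits one Gaussian HMM per phone with hmmlearn; the data layout and function names are assumptions.

```python
import numpy as np
from hmmlearn import hmm

def train_phone_models(features_by_phone, n_states=3):
    """Train one speaker-independent HMM per phone label.

    features_by_phone: dict mapping a phone label (e.g. "ah") to a list of
    (frames, dims) feature arrays cut from the training data using the phone
    transcriptions.  This is only an illustrative stand-in for HTK training;
    MLLR speaker adaptation is not shown.
    """
    models = {}
    for phone, segments in features_by_phone.items():
        X = np.vstack(segments)
        lengths = [len(seg) for seg in segments]
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag", n_iter=20)
        model.fit(X, lengths)
        models[phone] = model
    return models

def recognise(segment, models):
    """Score one feature segment against every phone model; return the best label."""
    return max(models, key=lambda phone: models[phone].score(segment))
```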

12 Comparing acoustic and visual information for speech processing
Investigated using the identification rates of speaker-dependent acoustic and visual phoneme models
Test segments were freely transcribed using all speaker-dependent phoneme models
– No restriction to a specified user or word
Confusion tables for speech (phoneme) and speaker recognition were examined to obtain identification rates
Example:
– Correct: s02m /w/, s02m /ah/, s02m /n/
– Audio:   s10m /w/, s02m /ah/, s02m /n/
– Video:   s02m /sp/, s02m /n/
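A minimal sketch of how confusion tables and identification rates can be tallied from (actual, recognised) label pairs, whether the labels are phonemes or speaker IDs; the function names are illustrative.

```python
from collections import Counter, defaultdict

def confusion_table(pairs):
    """Build a confusion table from (actual, recognised) label pairs,
    e.g. phoneme labels or speaker IDs taken from the free transcriptions."""
    table = defaultdict(Counter)
    for actual, recognised in pairs:
        table[actual][recognised] += 1
    return table

def identification_rate(table):
    """Fraction of tokens whose recognised label matches the actual label."""
    correct = sum(row[label] for label, row in table.items())
    total = sum(sum(row.values()) for row in table.values())
    return correct / total if total else 0.0

# Example: speaker identification rate from (actual, recognised) speaker pairs
# pairs = [("s02m", "s02m"), ("s02m", "s10m"), ("s04f", "s04f")]
# print(identification_rate(confusion_table(pairs)))   # 2/3
```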

13 Example Confusion Table (Phonemes in Clean Acoustic Speech)
[Confusion table figure: actual phonemes vs. recognised phonemes]

14 Example Confusion Table (Phonemes in Clean Visual Speech)
[Confusion table figure: actual phonemes vs. recognised phonemes]

15 Likelihood of speaker and phone identification using phoneme models

16 Fusion
Because of the differing performance of each modality at speech and speaker recognition, the fusion configuration for each task must be adjusted with these performances in mind
For these experiments:
– Weighted-sum fusion of the top 10 normalised scores in each modality
– The audio weighting ranges from 0 (video only) to 1 (audio only)
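A hedged sketch of the weighted-sum fusion rule described above, assuming both modalities score the same candidate set; min-max score normalisation and the function name are illustrative choices, not the original implementation.

```python
import numpy as np

def fuse_scores(audio_scores, video_scores, alpha, top_n=10):
    """Weighted-sum fusion of normalised scores from the two modalities.

    audio_scores, video_scores: dicts mapping each candidate (word or speaker)
    to its score in that modality.  Only candidates in the top `top_n` of
    either modality are fused.  alpha = 1.0 is audio only, alpha = 0.0 is
    video only.
    """
    def top(scores):
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

    def normalise(scores, keys):
        vals = np.array([scores[k] for k in keys], dtype=float)
        lo, hi = vals.min(), vals.max()
        return {k: (scores[k] - lo) / (hi - lo + 1e-12) for k in keys}

    candidates = sorted(set(top(audio_scores)) | set(top(video_scores)))
    a = normalise(audio_scores, candidates)
    v = normalise(video_scores, candidates)
    fused = {c: alpha * a[c] + (1 - alpha) * v[c] for c in candidates}
    best = max(fused, key=fused.get)
    return best, fused
```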

17 Speech vs Speaker
The response of each system to speech-babble noise over a selected range of fusion weights was compared.
[Figure: word identification rate vs. noise level]

18 Speech vs Speaker
The response of each system to speech-babble noise over a selected range of fusion weights was compared.
[Figure: speaker identification rate vs. noise level]

19 Speech vs Speaker
Acoustic performance is essentially equal for both tasks
Visual performance is clearly better for speaker recognition
Speech recognition fusion is catastrophic at nearly all noise levels
Speaker recognition fusion is only catastrophic at high noise levels
We can also get an idea of the dominance of each modality by looking at the fusion weights that produce the 'best' lines (ideal adaptive fusion)
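The 'best' lines can be read as an oracle search over the fusion weight: for each noise level, pick the weight that maximises the identification rate. A minimal sketch, assuming a `classify(audio_scores, video_scores, alpha)` function such as a wrapper around the fuse_scores() sketch on the Fusion slide:

```python
import numpy as np

def best_alpha(trials, classify, alphas=np.linspace(0.0, 1.0, 21)):
    """Oracle search for the fusion weight that maximises identification rate.

    trials:   list of (audio_scores, video_scores, true_label) tuples for one
              noise level.
    classify: function (audio_scores, video_scores, alpha) -> predicted label.
    Returns (best_weight, best_accuracy).
    """
    best_weight, best_acc = None, -1.0
    for alpha in alphas:
        hits = sum(classify(a, v, alpha) == truth for a, v, truth in trials)
        acc = hits / len(trials)
        if acc > best_acc:
            best_weight, best_acc = float(alpha), acc
    return best_weight, best_acc
```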

20 ‘Best’ Fusion

21 Conclusion and Further Work
PCA-based visual features are mostly person-dependent
– They should be used with care in visual speech recognition tasks
It is believed that this dependency stems from the large amount of static person-specific information captured along with the dynamic lip configuration
– Skin colour, facial hair, etc.
Visual information for speech recognition is only useful in high-noise situations

22 Conclusion and Further Work
Even at very low levels of acoustic noise, visual speech information can provide similar performance to acoustic information for speaker recognition
Adaptive fusion for speaker recognition should therefore be biased towards visual features for best performance
Further study is needed into methods of improving the visual modality for speech recognition by focusing more on the dynamic, speech-related information
– Mean-image removal, optical flow, contour representations

23 References
(Butler2003) D. Butler, C. McCool, M. McKay, S. Lowther, V. Chandran, and S. Sridharan, "Robust Face Localisation Using Motion, Colour and Fusion," in Proceedings of the Seventh International Conference on Digital Image Computing: Techniques and Applications (DICTA 2003), Macquarie University, Sydney, Australia, 2003.
(Lucey2004) P. Lucey, T. Martin, and S. Sridharan, "Confusability of Phonemes Grouped According to their Viseme Classes in Noisy Environments," presented at SST 2004, Sydney, Australia, 2004.
(Patterson2002) E. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy, "CUAVE: a new audio-visual database for multimodal human-computer interface research," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), 2002.
(Young2002) S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book, 3.2 ed. Cambridge, UK: Cambridge University Engineering Department, 2002.
(Wark2001) T. Wark and S. Sridharan, "Adaptive fusion of speech and lip information for robust speaker identification," Digital Signal Processing, vol. 11, 2001.

24 Questions?