ICCS-NTUA : WP1+WP2 Prof. Petros Maragos NTUA, School of ECE URL: Computer Vision, Speech Communication and.

Presentation transcript:

ICCS-NTUA : WP1+WP2 Prof. Petros Maragos NTUA, School of ECE URL: Computer Vision, Speech Communication and Signal Processing Research Group HIWIRE

HIWIRE Involved CVSP Members
ICCS - NTUA HIWIRE Meeting, Granada June 2005
Group Leader: Prof. Petros Maragos
Ph.D. Students / Graduate Research Assistants:
● D. Dimitriadis (speech: recognition, modulations)
● V. Pitsikalis (speech: recognition, fractals/chaos, NLP)
● A. Katsamanis (speech: modulations, statistical processing, recognition)
● G. Papandreou (vision: PDEs, active contours, level sets, AV-ASR)
● G. Evangelopoulos (vision/speech: texture, modulations, fractals)
● S. Leykimiatis (speech: statistical processing, microphone arrays)

ICCS-NTUA in HIWIRE: 1st Year Evaluation
● Databases: Completed
● Baseline: Completed
WP1
● Noise Robust Features: Results 1st Year
● Audio-Visual ASR: Baseline + Visual Features
● Multi-microphone array: Exploratory Phase
● VAD: Prelim. Results
WP2
● Speaker Normalization: Baseline
● Non-native Speech Database: Completed

ICCS-NTUA in HIWIRE: 1st Year Evaluation
● Databases: Completed
● Baseline: Completed
WP1
● Noise Robust Features: Results 1st Year
  ● Modulation Features: Results 1st Year
  ● Fractal Features: Results 1st Year
● Audio-Visual ASR: Baseline + Visual Features
● Multi-microphone array: Exploratory Phase
● VAD: Prelim. Results
WP2
● Speaker Normalization: Baseline
● Non-native Speech Database: Completed

WP1: Noise Robustness
Platform: HTK
Baseline + Evaluation:
● Aurora 2, Aurora 3, TIMIT+NOISE
Modulation Features
● AM-FM Modulations
● Teager Energy Cepstrum
Fractal Features
● Dynamical Denoising
● Correlation Dimension
● Multiscale Fractal Dimension
Hybrid-Merged Features
Reported improvements (slide callouts): up to +62% (Aurora 3), up to +36% (Aurora 2), up to +61% (Aurora 2)

Speech Modulation Features
Filterbank Design
Short-Term AM-FM Modulation Features
● Short-Term Mean Inst. Amplitude (IA-Mean)
● Short-Term Mean Inst. Frequency (IF-Mean)
● Frequency Modulation Percentages (FMP)
Short-Term Energy Modulation Features
● Average Teager Energy, Cepstrum Coef. (TECC)

Modulation Acoustic Features
[Block diagram; block labels: Speech, Nonlinear Processing, Demodulation, Robust Feature Transformation/Selection, Regularization + Multiband Filtering, Statistical Processing, V.A.D.]
Energy Features: Teager Energy Cepstrum Coeff. (TECC)
AM-FM Modulation Features: Mean Inst. Ampl. (IA-Mean), Mean Inst. Freq. (IF-Mean), Freq. Mod. Percent. (FMP)
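The nonlinear processing and demodulation steps named on this slide can be illustrated with the discrete Teager-Kaiser energy operator and an energy separation step. The sketch below is a minimal illustration, not the group's implementation: the slides do not say which demodulation algorithm was used, so the DESA-2 variant is an assumption, and the multiband filterbank, regularization and VAD stages are omitted. The function names `teager_energy` and `desa2` and the test tone are hypothetical.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x):
    """Estimate instantaneous amplitude and frequency (rad/sample) with DESA-2.

    Assumption: DESA-2 is used here only as one standard energy separation
    algorithm; the slides do not name the algorithm.
    """
    x = np.asarray(x, dtype=float)
    y = x[2:] - x[:-2]                  # symmetric difference, defined for n = 1..N-2
    psi_x = teager_energy(x)[1:-1]      # trimmed to align with psi_y (n = 2..N-3)
    psi_y = teager_energy(y)
    omega = 0.5 * np.arccos(np.clip(1.0 - psi_y / (2.0 * psi_x), -1.0, 1.0))
    amp = 2.0 * psi_x / np.sqrt(psi_y)
    return amp, omega

# Toy check on a pure tone (hypothetical values, not speech data).
fs = 8000.0
n = np.arange(400)
x = 0.7 * np.cos(2 * np.pi * 500.0 / fs * n + 0.3)

amp, omega = desa2(x)
ia_mean = amp.mean()                          # short-term mean inst. amplitude
if_mean_hz = omega.mean() * fs / (2 * np.pi)  # short-term mean inst. frequency
```

For a pure cosine the operator output is constant, so the frame means recover the tone's amplitude and frequency; on band-passed speech the same means would vary from frame to frame.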

TIMIT-based Speech Databases
TIMIT Database:
● Training Set: 3696 sentences, ~35 phonemes/utterance
● Testing Set: 1344 utterances, phonemes
● Sampling Frequency: 16 kHz
Feature Vectors:
● MFCC+C0+AM-FM+1st+2nd Time Derivatives
Stream Weights: (1) for MFCC and (2) for AM-FM
3-state left-right HMMs, 16 mixtures
All-pair, Unweighted grammar
Performance Criterion: Phone Accuracy Rates (%)
Back-end System: HTK v3.2.0

Results: TIMIT+Noise
Up to +106%

Aurora 3 - Spanish
Connected-Digits, Sampling Frequency 8 kHz
Training Set:
● WM (Well-Matched): 3392 utterances (quiet 532, low 1668 and high noise 1192)
● MM (Medium-Mismatch): 1607 utterances (quiet 396 and low noise 1211)
● HM (High-Mismatch): 1696 utterances (quiet 266, low 834 and high noise 596)
Testing Set:
● WM: 1522 utterances (quiet 260, low 754 and high noise 508), 8056 digits
● MM: 850 utterances (quiet 0, low 0 and high noise 850), 4543 digits
● HM: 631 utterances (quiet 0, low 377 and high noise 254), 3325 digits
2 Back-end ASR Systems (HTK and BLasr)
Feature Vectors: MFCC+AM-FM (or Auditory+AM-FM), TECC
All-Pair, Unweighted Grammar (or Word-Pair Grammar)
Performance Criterion: Word (digit) Accuracy Rates

Results: Aurora 3 (HTK)
Up to +62%

Databases: Aurora 2
Task: Speaker Independent Recognition of Digit Sequences
TI-Digits at 8 kHz
Training (8440 Utterances per scenario, 55M/55F)
● Clean (8 kHz, G712)
● Multi-Condition (8 kHz, G712)
  ● 4 Noises (artificial): subway, babble, car, exhibition
  ● 5 SNRs: 5, 10, 15, 20 dB, clean
Testing, artificially added noise
● 7 SNRs: [-5, 0, 5, 10, 15, 20 dB, clean]
● A: noises as in multi-cond. train., G712 (28028 Utters)
● B: restaurant, street, airport, train station, G712 (28028 Utters)
● C: subway, street (MIRS) (14014 Utters)

Results: Aurora 2
Up to +12%

Work To Be Done on Modulation Features
● Refinements w.r.t. AM-FM Features
● Fusion w. Other Features

Fractal Features
[Block diagram; labels: speech signal, N-d Signal, Noisy Embedding, Local SVD, Geometrical Filtering, Filtered Embedding, N-d Cleaned Embedding]
● Filtered Dynamics - Correlation Dimension (8): FDCD
● Multiscale Fractal Dimension (6): MFD
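To make the embedding and correlation-dimension labels concrete, here is a rough sketch of a time-delay embedding followed by a correlation-sum slope estimate. It is only an illustration under stated assumptions: the local-SVD geometrical filtering shown on the slide is omitted, the embedding dimension, delay and radii are arbitrary choices, and the helper names (`delay_embed`, `correlation_sum`, `correlation_dimension`) are hypothetical.

```python
import numpy as np

def delay_embed(x, dim, delay):
    """Stack delayed copies of x into vectors [x[n], x[n+delay], ..., x[n+(dim-1)*delay]]."""
    x = np.asarray(x, dtype=float)
    n_vectors = len(x) - (dim - 1) * delay
    return np.stack([x[i * delay : i * delay + n_vectors] for i in range(dim)], axis=1)

def correlation_sum(points, radius):
    """Fraction of distinct point pairs closer than `radius`."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = np.triu_indices(len(points), k=1)
    return np.mean(dists[upper] < radius)

def correlation_dimension(points, radii):
    """Slope of log C(r) against log r, fitted by least squares."""
    c = np.array([correlation_sum(points, r) for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(c), 1)
    return slope

# Toy signal (assumption): a sinusoid, whose embedding is a closed curve,
# so the estimated dimension should come out close to 1.
signal = np.sin(0.2 * np.arange(600))
embedded = delay_embed(signal, dim=3, delay=10)
radii = np.logspace(np.log10(0.1), np.log10(0.5), 8)
dimension = correlation_dimension(embedded, radii)
```

The pairwise distance matrix is quadratic in the number of vectors, so this form only suits short analysis frames.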

Results: Aurora 2
Up to +40%

Results: Aurora 2
Up to +27%

Results: Aurora 2
Up to +61%

Future Directions on Fractal Features
● Refine Fractal Feature Extraction.
● Application to Aurora 3.
● Fusion with other features.

Visual Front-End
Aim: Extract low-dimensional visual speech feature vector from video
Visual front-end modules:
● Speaker's face detection
● ROI tracking
● Facial Model Fitting
● Visual feature extraction
Challenges:
● Very high dimensional signal - which features are proper?
● Robustness
● Computational Efficiency

Face Modeling
● A well studied problem in Computer Vision:
  ● Active Appearance Models, Morphable Models, Active Blobs
● Both Shape & Appearance can enhance lipreading
● The shape and appearance of human faces “live” in low dimensional manifolds

Image Fitting Example
[Figure panels: step 2, step 6, step 10, step 14, step 18]

Example: Face Interpretation Using AAM
[Video panels: original video; shape track superimposed on original video; reconstructed face]
This is what the visual-only speech recognizer “sees”!
● Generative models like AAM allow us to evaluate the output of the visual front-end

Evaluation on the CUAVE Database

Audio-Visual ASR: Database
Subset of CUAVE database used:
● 36 speakers (30 training, 6 testing)
● 5 sequences of 10 connected digits per speaker
● Training set: 1500 digits (30x5x10)
● Test set: 300 digits (6x5x10)
CUAVE database also contains more complex data sets: speaker moving around, speaker shows profile, continuous digits, two speakers (to be used in future evaluations)
CUAVE was kindly provided by Clemson University

Recognition Results (Word Accuracy)
Data
● Training: ~500 digits (29 speakers)
● Testing: ~100 digits (4 speakers)
|                | Audio | Visual | Audiovisual |
| Classification | 99%   | 46%    | 85%         |
| Recognition    | 98%   | 26%    | 78%         |

Future Work
Visual Front-end
● Better trained AAM
● Temporal tracking
Feature fusion
● Experimentation with alternative DBN architectures
● Automatic stream weight determination
Integration with non-linear acoustic features
Experiments on other audio-visual databases
Systematic evaluation of visual features

User Robustness, Speaker Adaptation
VTLN Baseline
● Platform: HTK
● Database: AURORA 4
● Fs = 8 kHz
● Scenarios: Training, Testing
● Comparison with MLLR
Collection of non-Native Speech Data Completed
● 10 Speakers
● 100 Utterances/Speaker

Vocal Tract Length Normalization
Implementation: HTK
Warping Factor Estimation
● Maximum Likelihood (ML) criterion
Frequency Warping
Figures from Hain99, Lee96

VTLN
Training
● AURORA 4 Baseline Setup
● Clean (SIC), Multi-Condition (SIM), Noisy (SIN)
Testing
● Estimate warping factor using adaptation utterances (Supervised VTLN)
  ● Per speaker warping factor (1, 2, 10, 20 Utterances)
● 2-pass Decoding
  ● 1st pass: Get a hypothetical transcription
  ● Alignment and ML to estimate per utterance warping factor
  ● 2nd pass: Decode properly normalized utterance
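The warping-factor estimation described here can be sketched as a grid search that keeps the factor with the highest likelihood, combined with a piecewise-linear frequency warp. This is a simplified illustration, not HTK's implementation: the single cut-off at 85% of the band, the grid of candidate factors and the toy scoring function are all assumptions, and `warp_frequency`, `estimate_warp_factor` and `toy_score` are hypothetical names. In a real system the score would be the acoustic log-likelihood of the warped features given the first-pass alignment.

```python
import numpy as np

def warp_frequency(freq_hz, alpha, f_max, cut_ratio=0.85):
    """Piecewise-linear warp: slope alpha below a cut-off, then a straight
    segment chosen so that f_max still maps to f_max."""
    freq_hz = np.asarray(freq_hz, dtype=float)
    f_cut = cut_ratio * f_max / max(alpha, 1.0)
    upper_slope = (f_max - alpha * f_cut) / (f_max - f_cut)
    return np.where(
        freq_hz <= f_cut,
        alpha * freq_hz,
        alpha * f_cut + upper_slope * (freq_hz - f_cut),
    )

def estimate_warp_factor(score_fn, alphas):
    """ML-style grid search: return the candidate with the highest score."""
    scores = [score_fn(a) for a in alphas]
    return float(alphas[int(np.argmax(scores))])

# Toy stand-in for the likelihood (assumption): peaks at alpha = 1.04.
def toy_score(alpha):
    return -(alpha - 1.04) ** 2

alphas = np.linspace(0.88, 1.12, 13)   # candidate factors in steps of 0.02
best_alpha = estimate_warp_factor(toy_score, alphas)

freqs = np.linspace(0.0, 4000.0, 9)    # 4 kHz band for 8 kHz sampling
warped = warp_frequency(freqs, best_alpha, f_max=4000.0)
```

With a warping factor of 1 the function is the identity, and for any factor in the grid both band edges stay fixed, which keeps the filterbank range unchanged.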

Databases: Aurora 4
Task: 5000 Word, Continuous Speech Recognition
WSJ0: (16 / 8 kHz) + Artificially Added Noise
2 microphones: Sennheiser, Other
Filtering: G712, P341
Noises: Car, Babble, Restaurant, Street, Airport, Train Station
Training (7138 Utterances per scenario)
● Clean: Sennheiser mic.
● Multi-Condition: Sennheiser and Other mic., 75% w. artificially added noise, SNR: 10-20 dB
● Noisy: Sennheiser, artificially added noise, SNR: 10-20 dB
Testing (330 Utterances - 166 Utterances each. Speaker # = 8)
● SNR: 5-15 dB
● 1-7: Sennheiser microphone
● 8-14: Other microphone

VTLN Results, Clean Training

VTLN Results, Multi-Condition Training

VTLN Results, Noisy Training

Future Directions for Speaker Normalization
Estimate warping transforms at signal level
● Exploit instantaneous amplitude or frequency signals to estimate the warping parameters, Normalize the signal
Effective integration with model-based adaptation techniques (collaboration with TSI)

WP1: Appendix Slides Aurora 3

ASR Results I
TIMIT-Based Speech Databases (Correct Phone Accuracies (%))
[Table; columns: Features, TIMIT, NTIMIT, TIMIT+Babble, TIMIT+White, TIMIT+Pink, TIMIT+Car, Av. Rel. Improv.]
[Table rows: MFCC*, TEner. CC, MFCC*+IA-Mean, MFCC*+IF-Mean, MFCC*+FMP]
* MFCC+C0+D+DD, # states=3, # mixtures=

Experimental Results IIa (HTK)
Aurora 3 (Spanish Task) (Correct Word Accuracies (%))
[Table; columns: Features, Scenario (WM, MM, HM), Average, Av. Rel. Improv.]
[Table rows: Aurora Front-End (WI007), MFCC*, TEnerCC+log(Ener)+CMS, MFCC*+IA-Mean, MFCC*+IF-Mean, MFCC*+FMP]
* MFCC+log(Ener)+D+DD+CMS, # states=14, # mixtures=

Aurora 3 Configs
● HM: States 14, Mix’s 12
● MM: States 16, Mix’s 6
● WM: States 16, Mix’s 16

WP1: Appendix Slides Aurora 2

Baseline: Aurora 2
Database Structure:
● 2 Training Scenarios, 3 Test Sets, [4+4+2] Conditions, 7 SNRs per Condition: Total of 2x70 Tests
Presentation of Selected Results:
● Average over SNR.
● Average over Condition.
● Training Scenarios: Clean- vs. Multi-Train.
● Noise Level: Low vs. High SNR.
● Condition: Worst vs. Easy Conditions.
● Features: MFCC+D+A vs. MFCC+D+A+CMS
Set up: # states 18 [10-22], # mix’s [3-32], MFCC+D+A+CMS
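The "2x70 Tests" figure follows directly from the database structure. The short sketch below redoes the arithmetic using the noise lists and utterance counts given on the Aurora 2 database slide; the variable names are illustrative only.

```python
# Noise conditions per test set, as listed on the Aurora 2 database slide.
test_sets = {
    "A": ["subway", "babble", "car", "exhibition"],
    "B": ["restaurant", "street", "airport", "train station"],
    "C": ["subway (MIRS)", "street (MIRS)"],
}
snr_levels = ["clean", 20, 15, 10, 5, 0, -5]        # 7 SNRs per condition
training_scenarios = ["clean", "multi-condition"]
utterances_per_set = {"A": 28028, "B": 28028, "C": 14014}

conditions = sum(len(noises) for noises in test_sets.values())   # 4 + 4 + 2
tests_per_scenario = conditions * len(snr_levels)
total_tests = tests_per_scenario * len(training_scenarios)

# Utterances in each (noise, SNR) cell of a test set.
utterances_per_cell = {
    name: utterances_per_set[name] // (len(noises) * len(snr_levels))
    for name, noises in test_sets.items()
}
```

Every (noise, SNR) cell works out to the same number of utterances, which is why the per-set totals are 28028 for four noises and 14014 for two.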

Average Baseline Results: Aurora
[Table; columns: Training Scenario, Developed Baseline (PLAIN, CMS, Best PLAIN, Best CMS), Baseline*]
[Table rows: Clean, Multi]
* Average HTK results as reported with the database.
Average over all SNR’s and all Conditions.
Plain: MFCC+D+A, CMS: MFCC+D+A+CMS.
Mixture #: Clean train (both Plain, CMS): 3; Multi train Plain: 22, CMS: 32.
Best: Select for each condition/noise the # mix’s with the best result.

Aurora 2 Distributed, Multicondition Training
Multicondition Training - Full
| SNR     | A: Subway | A: Babble | A: Car | A: Exhibition | A: Average | B: Restaurant | B: Street | B: Airport | B: Station | B: Average | C: Subway M | C: Street M | C: Average | Average |
| Clean   | 98,68 | 98,52 | 98,39 | 98,49 | 98,52 | 98,68 | 98,52 | 98,39 | 98,49 | 98,52 | 98,50 | 98,58 | 98,54 | 98,52 |
| 20 dB   | 97,61 | 97,73 | 98,03 | 97,41 | 97,70 | 96,87 | 97,58 | 97,44 | 97,01 | 97,23 | 97,30 | 96,55 | 96,93 | 97,35 |
| 15 dB   | 96,47 | 97,04 | 97,61 | 96,67 | 96,95 | 95,30 | 96,31 | 96,12 | 95,53 | 95,82 | 96,35 | 95,53 | 95,94 | 96,29 |
| 10 dB   | 94,44 | 95,28 | 95,74 | 94,11 | 94,89 | 91,96 | 94,35 | 93,29 | 92,87 | 93,12 | 93,34 | 92,50 | 92,92 | 93,79 |
| 5 dB    | 88,36 | 87,55 | 87,80 | 87,60 | 87,83 | 83,54 | 85,61 | 86,25 | 83,52 | 84,73 | 82,41 | 82,53 | 82,47 | 85,52 |
| 0 dB    | 66,90 | 62,15 | 53,44 | 64,36 | 61,71 | 59,29 | 61,34 | 65,11 | 56,12 | 60,47 | 46,82 | 54,44 | 50,63 | 59,00 |
| -5 dB   | 26,13 | 27,18 | 20,58 | 24,34 | 24,56 | 25,51 | 27,60 | 29,41 | 21,07 | 25,90 | 18,91 | 24,24 | 21,58 | 24,50 |
| Average | 88,76 | 87,95 | 86,52 | 88,03 | 87,82 | 85,39 | 87,04 | 87,64 | 85,01 | 86,27 | 83,24 | 84,31 | 83,78 | 86,39 |

Aurora 2 Distributed, Clean Training
Clean Training - Full
| SNR     | A: Subway | A: Babble | A: Car | A: Exhibition | A: Average | B: Restaurant | B: Street | B: Airport | B: Station | B: Average | C: Subway M | C: Street M | C: Average | Average |
| Clean   | 98,93 | 99,00 | 98,96 | 99,20 | 99,02 | 98,93 | 99,00 | 98,96 | 99,20 | 99,02 | 99,14 | 98,97 | 99,06 | 99,03 |
| 20 dB   | 97,05 | 90,15 | 97,41 | 96,39 | 95,25 | 89,99 | 95,74 | 90,64 | 94,72 | 92,77 | 93,46 | 95,13 | 94,30 | 94,07 |
| 15 dB   | 93,49 | 73,76 | 90,04 | 92,04 | 87,33 | 76,24 | 88,45 | 77,01 | 83,65 | 81,34 | 86,77 | 88,91 | 87,84 | 85,04 |
| 10 dB   | 78,72 | 49,43 | 67,01 | 75,66 | 67,71 | 54,77 | 67,11 | 53,86 | 60,29 | 59,01 | 73,90 | 74,43 | 74,17 | 65,52 |
| 5 dB    | 52,16 | 26,81 | 34,09 | 44,83 | 39,47 | 31,01 | 38,45 | 30,33 | 27,92 | 31,93 | 51,27 | 49,21 | 50,24 | 38,61 |
| 0 dB    | 26,01 | 9,28  | 14,46 | 18,05 | 16,95 | 10,96 | 17,84 | 14,41 | 11,57 | 13,70 | 25,42 | 22,91 | 24,17 | 17,09 |
| -5 dB   | 11,18 | 1,57  | 9,39  | 9,60  | 7,94  | 3,47  | 10,46 | 8,23  | 8,45  | 7,65  | 11,82 | 11,15 | 11,49 | 8,53  |
| Average | 69,49 | 49,89 | 60,60 | 65,39 | 61,34 | 52,59 | 61,52 | 53,25 | 55,63 | 55,75 | 66,16 | 66,12 | 66,14 | 60,06 |
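The "Average" row in these two tables appears to be the mean over the 0 to 20 dB rows only, leaving out the clean and -5 dB rows. The sketch below checks that reading on the Set A Subway column of each table; the numbers are copied from the slides, with decimal commas written as points, and the function name is hypothetical.

```python
# Word accuracies (%) for the Set A "Subway" column, keyed by SNR in dB.
subway_multicondition = {20: 97.61, 15: 96.47, 10: 94.44, 5: 88.36, 0: 66.90, -5: 26.13}
subway_clean_training = {20: 97.05, 15: 93.49, 10: 78.72, 5: 52.16, 0: 26.01, -5: 11.18}

def average_0_to_20_db(accuracy_by_snr):
    """Mean accuracy over the 0, 5, 10, 15 and 20 dB conditions."""
    selected = [acc for snr, acc in accuracy_by_snr.items() if 0 <= snr <= 20]
    return round(sum(selected) / len(selected), 2)

multi_average = average_0_to_20_db(subway_multicondition)
clean_average = average_0_to_20_db(subway_clean_training)
```

Both values agree with the Subway entries of the "Average" rows (88,76 and 69,49), which supports the 0 to 20 dB interpretation.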

WP1: Appendix Slides Audio Visual: Details

Introduction: Motivations for AV-ASR
Audio-only ASR does not work reliably in many scenarios:
● Noisy background (e.g. car's cabin, cockpit)
● Interference between talkers
Need to enhance the auditory signal when it is not reliable
Human speech perception is multimodal:
● Different modalities are weighed according to their reliability
● Hearing impaired people can lipread
● McGurk Effect (McGurk & MacDonald, 1976)
Machines should also be able to exploit multimodal information

Audio-Visual Feature Fusion
Audio-visual feature integration is highly non-trivial:
● Audio & visual speech asynchrony (~100 ms)
● Relative reliability of streams can vary wildly
Many approaches to feature fusion in the literature:
● Early integration
● Intermediate integration
● Late integration
Highly active research area (mainly machine learning)
The class of Dynamic Bayesian Networks (DBNs) seems particularly suited for the problem:
● Stream interaction explicitly modeled
● Model parameter inference is more difficult than in HMM

Visual Front-End AAM Parameters
● First frame of the 36 videos manually annotated
● 68 points on the whole face as shape landmarks
● Color appearance sampled at pixels
● Eigenvectors retained explain 70% variance: 5 eigenshapes & 10 eigenfaces
● Initial condition at each new frame: the converged solution at the previous frame
● Inverse-compositional gradient descent algorithm
● Coarse-to-fine refinement (Gaussian pyramid - 3 scales)
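Retaining the eigenvectors that explain 70% of the variance is a standard PCA truncation. The following sketch shows that step on synthetic data shaped like the annotated landmarks (36 shapes of 68 (x, y) points); the data are random placeholders, shape alignment is skipped, and `pca_retain` is a hypothetical helper rather than the group's AAM training code.

```python
import numpy as np

def pca_retain(data, variance_fraction):
    """Return the mean, the leading principal directions, and their explained
    variance ratios, keeping the fewest components whose cumulative explained
    variance reaches `variance_fraction`."""
    mean = data.mean(axis=0)
    centered = data - mean
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    k = int(np.searchsorted(np.cumsum(explained), variance_fraction)) + 1
    return mean, vt[:k], explained[:k]

# Synthetic stand-in for the training shapes (assumption): a few strong
# modes of variation plus small noise, 36 samples x 136 coordinates.
rng = np.random.default_rng(0)
modes = rng.normal(size=(6, 136))
weights = rng.normal(size=(36, 6)) * np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5])
shapes = weights @ modes + 0.1 * rng.normal(size=(36, 136))

mean_shape, eigenshapes, ratios = pca_retain(shapes, 0.70)
shape_params = (shapes - mean_shape) @ eigenshapes.T   # low-dimensional shape features
```

Each row of `shape_params` is the kind of low-dimensional vector that the slides feed to the recognizer as shape features; the same truncation applied to sampled appearance would give the eigenfaces.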

AV-ASR Experiment Setup
Features:
● Audio: 39 features (MFCC_D_A)
● Visual (upsampled from 30 Hz to 100 Hz):
  ● 5 shape features (Sh)
  ● 10 appearance features (App)
● Audio-Visual: feats (MFCC_D_A+SHAPP_D_A)
Two-stream HMM
● 8 state, left-to-right HMM whole-digit models with no state skipping
● Single Gaussian observation probability densities
● Separate audio & video feature streams with equal weights (1,1)
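Two steps of this setup lend themselves to a short sketch: bringing the 30 Hz visual features up to the 100 Hz audio frame rate, and combining the per-stream observation scores with stream weights. The code assumes linear interpolation for the upsampling, which the slides do not specify, and uses the usual multi-stream form in which each stream's log-likelihood is scaled by its weight; the function names and toy values are hypothetical.

```python
import numpy as np

def upsample_features(feats, src_rate, dst_rate):
    """Linearly interpolate each feature dimension from src_rate to dst_rate (Hz).
    Assumption: linear interpolation; the slides only say 'upsampled'."""
    t_src = np.arange(feats.shape[0]) / src_rate
    n_dst = int(np.floor(t_src[-1] * dst_rate)) + 1
    t_dst = np.arange(n_dst) / dst_rate
    columns = [np.interp(t_dst, t_src, feats[:, d]) for d in range(feats.shape[1])]
    return np.stack(columns, axis=1)

def stream_log_likelihood(log_b_audio, log_b_video, w_audio=1.0, w_video=1.0):
    """Two-stream state score: log b(o) = w_a * log b_a(o_a) + w_v * log b_v(o_v)."""
    return w_audio * log_b_audio + w_video * log_b_video

# One second of visual features: 30 frames of 5 shape + 10 appearance values.
rng = np.random.default_rng(1)
visual_30hz = rng.normal(size=(30, 15))
visual_100hz = upsample_features(visual_30hz, src_rate=30.0, dst_rate=100.0)

# Equal stream weights (1, 1), as on the slide; log-likelihoods are toy values.
combined = stream_log_likelihood(-10.0, -4.0)
```

With equal weights the combined score is simply the sum of the two stream log-likelihoods; lowering the weight of a stream reduces its influence on the state score when that stream is unreliable.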

WP1: Appendix Slides Aurora 4

Aurora 4, Multi-Condition Training
Multicondition training: 7138 utterances
● 3569 utterances (Sennheiser mic)
  ● 893 (no noise added)
  ● 2676 (1 out of 6 noises added at SNRs between 10 and 20 dB)
● 3569 utterances (2nd mic)
  ● 893 (no noise added)
  ● 2676 (1 out of 6 noises added at SNRs between 10 and 20 dB)
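The split above can be checked against the "75% w. artificially added" figure quoted on the Aurora 4 database slide. The snippet is plain arithmetic on the counts given here; the variable names are illustrative.

```python
# Utterance counts per microphone for the multi-condition training set.
no_noise_per_mic = 893
noisy_per_mic = 2676
microphones = 2

per_mic_total = no_noise_per_mic + noisy_per_mic    # utterances per microphone
training_total = per_mic_total * microphones        # utterances in the scenario
noisy_fraction = noisy_per_mic / per_mic_total      # share with added noise
```

The noisy share comes out just under three quarters, consistent with the 75% stated for the multi-condition scenario.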

Aurora 4, Noisy Training
Noisy training: 7138 utterances
● 3569 utterances (Sennheiser mic)
  ● 893 (no noise added)
  ● 2676 (1 out of 6 noises added at SNRs between 10 and 20 dB)
● 3569 utterances (Sennheiser mic)
  ● 893 (no noise added)
  ● 2676 (1 out of 6 noises added at SNRs between 10 and 20 dB)
