HIWIRE Progress Report Technical University of Crete Speech Processing and Dialog Systems Group Presenter: Alex Potamianos (WP1) Vassilis Diakoloukas (WP2)



Outline
- Work package 1
  - Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices)
  - Audio-Visual ASR: Baseline
  - Feature extraction and combination
  - Segment models for ASR
  - Blind Source Separation for multi-microphone ASR
- Work package 2
  - Adaptation
  - Data collection


Baseline
- Baseline performance, completed
  - Aurora 2 on HTK
  - Aurora 3 on HTK
  - Aurora 4 on HTK
- Lattices for Aurora 4
- Baseline performance, ongoing
  - WSJ1 (Decipher)
  - DMHMMs (Decipher)

Aurora 2 Database
- Based on TIdigits downsampled to 8 kHz
- Noise artificially added at several SNRs
- 3 sets of noises
  - A: subway, babble, car, exhibition hall
  - B: restaurant, street, airport, train station
  - C: subway, street (with different frequency characteristics)
- Two training conditions
  - Training on clean data
  - Multi-condition training on noisy data

Aurora 2 Database
- 8440 training sentences
- 1001 test sentences per test set
- Three front-end configurations
  - HTK default
  - WI007 (Aurora 2 distribution)
  - WI008 (thanks to Prof. Segura)
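The "noise artificially added at several SNRs" step can be illustrated with a minimal sketch. It assumes plain average-power scaling over equal-length sample lists; the function names `mix_at_snr` and `measured_snr_db` are hypothetical, and the actual Aurora 2 tooling (filtering, level measurement) is not reproduced here.

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mix has the requested SNR in dB.

    Assumes both inputs are equal-length lists of samples and that SNR is
    defined on average power over the whole signal (an assumption).
    """
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Choose gain g so that p_speech / (g^2 * p_noise) == 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

def measured_snr_db(speech, noisy):
    """Recover the SNR of a mix when the clean speech is known."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_resid = sum((y - s) ** 2 for s, y in zip(speech, noisy)) / len(speech)
    return 10 * math.log10(p_speech / p_resid)

# Illustrative signals only: a sine wave as "speech", Gaussian samples as "noise".
random.seed(0)
speech = [math.sin(0.05 * i) for i in range(2000)]
noise = [random.gauss(0.0, 1.0) for _ in range(2000)]
noisy_10db = mix_at_snr(speech, noise, 10.0)
```

Calling `measured_snr_db(speech, noisy_10db)` on the mix returns the target value up to floating-point error, which is a quick way to check the scaling.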

Aurora 2: Clean training  HTK default Front-End

Aurora 2: Multi-Condition training  HTK default Front-End

Aurora 2: Clean vs Multi-Condition Training

Aurora 2 Front End Comparison: Clean Training

Front End Comparison: Multi-Condition Training

Aurora 3 Database
- 5 languages: Finnish, German, Italian, Spanish, Danish
- 3 noise conditions
  - quiet
  - low noise (low)
  - high noise (high)
- 2 recording modes
  - close-talking microphone (ch0)
  - hands-free microphone (ch1)

Aurora 3 Database
- 3 experimental setups
  - Well-Matched (WM)
    - 70% of all utterances in the "quiet", "low" and "high" conditions were used for training
    - the remaining 30% were used for testing
  - Medium Mismatched (MM)
    - 100% of hands-free recordings from "quiet" and "low" for training
    - 100% of hands-free recordings from "high" for testing
  - High Mismatched (HM)
    - 70% of close-talking recordings from all noise conditions for training
    - 30% of hands-free recordings from "low" and "high" for testing

Baseline Aurora 3 performance

Baseline Aurora 3 with WI007 FE (TUC - UGR comparison)

                FINNISH                 SPANISH                 GERMAN
FRONT-END       WM      MM      HM      WM      MM      HM      WM      MM      HM
WI007-TUC       90.53   72.5    30.35   86.88   73.72   42.23   90.58   79.06   74.24
WI007-UGR       92.74   80.51   40.53   92.94   80.31   51.55   91.2    81.04   73.17
TRAIN (#sent.)
TEST (#sent.)

                DANISH                  ITALIAN
FRONT-END       WM      MM      HM      WM      MM      HM
WI007-TUC       79.62   49.29   33.15   93.64   82.02   39.84
WI007-UGR       87.28   67.32   39.37   93.64   82.02   39.84
TRAIN (#sent.)
TEST (#sent.)

Baseline Aurora 3 with WI007 FE ( TUC - UGR comparison )

Baseline Aurora 3 with WI008 FE ( TUC - UGR comparison )

Aurora 4 Database
- Based on the WSJ phase 0 collection
- 5000-word vocabulary
- 7138 training utterances (ARPA evaluation)
- 2 recording microphones
- 6 different noises artificially added: car, babble, restaurant, street, airport, train station

Aurora 4 Training Data Sets
- 3 training conditions (Clean, Multi-Condition, Noisy)
- Clean training: 7138 utterances (as in the ARPA evaluation)
- Multi-condition training: 7138 utterances
  - 3569 utterances (Sennheiser)
    - 893 (no noise added)
    - 2676 (1 out of 6 noises added at SNRs between 10 and 20 dB)
  - 3569 utterances (2nd mic)
    - 893 (no noise added)
    - 2676 (1 out of 6 noises added at SNRs between 10 and 20 dB)

Aurora 4 Test Sets
- 14 test sets
- 2 sizes: small (166 utts) and large (330 utts)
- Large test sets
  - SET 1: 330 utt. (Sennheiser microphone)
  - SET 2: 330 utt. (Sennheiser mic; Noise 1 added at SNRs between 5 and 15 dB)
  - SET 3: 330 utt. (Sennheiser mic; Noise 2 added at SNRs between 5 and 15 dB)
  - …
  - SET 7: 330 utt. (Sennheiser mic; Noise 6 added at SNRs between 5 and 15 dB)
  - SET 8: 330 utt. (2nd microphone)
  - SET 9: 330 utt. (2nd mic; Noise 1 added at SNRs between 5 and 15 dB)
  - SET 10: 330 utt. (2nd mic; Noise 2 added at SNRs between 5 and 15 dB)
  - …
  - SET 14: 330 utt. (2nd mic; Noise 6 added at SNRs between 5 and 15 dB)

Lattices
- Obtained from the SONIC recognizer: real-time decoding for the WSJ 5k task
  - State-of-the-art performance (8% WERR)
- Lattices obtained from clean models
- Three lattice sizes: small, medium, large
- Fixed branching factor for each lattice size (small=2.5, medium=4, large=5.5)
- Speed-up factor compared to HTK decoding: x100, x50, x10

Baseline Aurora 4 with Lattices

Baseline Aurora 4 (Comparing Lattices)

Aurora 4 Baseline: Conclusions on Lattices
- Lattices speed up recognition
  - Medium-size lattice is ~60 times faster
  - Small-size lattice is ~108 times faster
- Problem: improved performance in noisy test
- Be careful when using lattices in mismatched conditions (clean training, noisy data)!
- Solution: two sets of lattices, matched and mismatched

Audio-Visual ASR: Database
- Subset of the CUAVE database used:
  - 36 speakers (30 training, 6 testing)
  - 5 sequences of 10 connected digits per speaker
  - Training set: 1500 digits (30x5x10)
  - Test set: 300 digits (6x5x10)
- The CUAVE database also contains more complex data sets: speaker moving around, speaker shows profile, continuous digits, two speakers (to be used in future evaluations)

CUAVE Database Speakers

Audio-Visual ASR: Feature Extraction
- Lip region of interest (ROI) tracking
  - A fixed-size ROI is detected using template matching
  - The ROI minimizes the RGB Euclidean distance to a given ROI template
  - The ROI template is selected from the 1st frame of each speaker
  - Continuity constraint: search within a 20x20 pixel window of the previous frame's ROI (does not work for rapid speaker movements)
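The tracking steps above can be sketched as an exhaustive template search. This is an illustration under stated assumptions, not the authors' implementation: frames and templates are nested lists of `(R, G, B)` tuples, the "20x20 pixel window" is read as plus or minus 10 pixels around the previous ROI position, and the names `patch_distance` and `track_roi` are hypothetical.

```python
def patch_distance(frame, template, top, left):
    """Squared RGB Euclidean distance between the template and a frame patch.

    Minimising the squared distance selects the same position as minimising
    the Euclidean distance itself.
    """
    total = 0
    for r, row in enumerate(template):
        for c, (tr, tg, tb) in enumerate(row):
            fr, fg, fb = frame[top + r][left + c]
            total += (fr - tr) ** 2 + (fg - tg) ** 2 + (fb - tb) ** 2
    return total

def track_roi(frame, template, prev_top, prev_left, window=20):
    """Return (top, left) of the best match near the previous ROI position.

    The search is limited to +/- window // 2 pixels (assumed reading of the
    continuity constraint) and clipped to the frame borders.
    """
    height, width = len(template), len(template[0])
    max_top = len(frame) - height
    max_left = len(frame[0]) - width
    half = window // 2
    best = None
    for top in range(max(0, prev_top - half), min(max_top, prev_top + half) + 1):
        for left in range(max(0, prev_left - half), min(max_left, prev_left + half) + 1):
            dist = patch_distance(frame, template, top, left)
            if best is None or dist < best[0]:
                best = (dist, top, left)
    return best[1], best[2]
```

Because the search never looks beyond the window, a lip region that moves farther than the window between two frames is lost, which matches the limitation noted on the slide for rapid speaker movements.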

Audio-Visual ASR: Feature Extraction
- Features extracted from the ROI
  - The ROI is transformed to grayscale
  - The ROI is decimated to a 16x16 pixel region
  - A 2D separable DCT is applied to the 16x16 pixel region
  - The upper-left 6x6 region is kept (excluding the first coefficient)
  - The resulting 35-dimensional feature vector is resampled in time from the NTSC frame rate to 100 fps
  - First and second time derivatives are computed using a 6-frame window (feature size 105)
- Sanity check: unsupervised k-means clustering of ROI results in …
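The DCT step can be sketched in a few lines. The sketch assumes an unnormalised DCT-II and row-major ordering of the kept coefficients; the slides do not state the normalisation or ordering, and the names `dct_1d`, `dct_2d` and `roi_features` are hypothetical.

```python
import math

def dct_1d(values):
    """Unnormalised DCT-II of a sequence (normalisation is an assumption)."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values))
        for k in range(n)
    ]

def dct_2d(block):
    """Separable 2D DCT: 1D DCT along every row, then along every column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(col) for col in zip(*rows)]
    # cols is indexed [horizontal frequency][vertical frequency]; transpose back.
    return [list(row) for row in zip(*cols)]

def roi_features(block, keep=6):
    """Keep the upper-left keep x keep coefficients and drop the first (DC) one."""
    coeffs = dct_2d(block)
    feats = [coeffs[u][v] for u in range(keep) for v in range(keep)]
    return feats[1:]

# A 16x16 grayscale ROI yields 6 * 6 - 1 = 35 static features per frame;
# appending first and second derivatives gives 3 * 35 = 105.
roi = [[(r * 16 + c) % 256 for c in range(16)] for r in range(16)]
static_features = roi_features(roi)
```

Dropping the first coefficient removes the mean brightness of the ROI, so a uniformly lit patch maps to a feature vector of zeros.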

Experiments
- Recognition experiment: open-loop digit grammar (50 digits per utterance, no endpointing)
- Classification experiment: single-digit grammar (endpointed digits based on the provided segmentation)

Models
- Features
  - Audio: 39 features (MFCC_D_A)
  - Visual: 105 features (ROIDCT_D_A)
  - Audio-Visual: combined audio and visual features (MFCC_D_A+ROIDCT)
- HMM models
  - 8-state, left-to-right, whole-digit HMMs with no state skipping
  - Single Gaussian mixture
  - The audio-visual HMM uses separate audio and video feature streams with equal weights (1,1)
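How two feature streams with weights (1,1) combine into one state score can be sketched as follows. In HTK-style multi-stream HMMs the state likelihood is the product of the per-stream likelihoods, each raised to its stream weight, so the log-likelihoods add with the weights as coefficients. The diagonal-covariance Gaussians, the dictionary layout of `state`, and the function names are assumptions made for illustration.

```python
import math

def diag_gaussian_log_pdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian (assumed model form)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def two_stream_log_likelihood(audio, visual, state, w_audio=1.0, w_visual=1.0):
    """Weighted sum of per-stream log-likelihoods for one HMM state.

    `state` is a hypothetical dict holding one Gaussian per stream.
    """
    ll_audio = diag_gaussian_log_pdf(audio, state["audio_mean"], state["audio_var"])
    ll_visual = diag_gaussian_log_pdf(visual, state["visual_mean"], state["visual_var"])
    return w_audio * ll_audio + w_visual * ll_visual
```

With equal weights the visual stream contributes as strongly as the audio stream, so feature weighting amounts to tuning `w_audio` and `w_visual`.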

Results (Word Accuracy)
- Data
  - Training: 1500 digits (30 speakers)
  - Testing: 300 digits (6 speakers)

                 Audio   Visual   Audio-Visual
Recognition      98%     26%      78%
Classification   99%     46%      85%
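The slides do not define word accuracy. A common definition, used for example by HTK scoring, is (N - S - D - I) / N, where N is the number of reference words and S, D, I count substitutions, deletions and insertions. The sketch below assumes that definition and obtains S + D + I as the word-level edit distance; the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions (Levenshtein)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]

def word_accuracy(ref, hyp):
    """Word accuracy in percent, assuming Acc = (N - S - D - I) / N."""
    return 100.0 * (len(ref) - edit_distance(ref, hyp)) / len(ref)
```

Because insertions are penalised, this measure can fall below zero in an open-loop recognition task without endpointing, whereas classification of endpointed digits cannot produce insertions.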

Future Work
- Multi-mixture models
- Front-end (NTUA)
  - Tracking algorithms
  - Feature extraction
- Feature combination
  - Feature integration
  - Feature weighting


Feature extraction and combination
- Noise-robust features (NTUA): m12
- AM-FM features (NTUA): m12
- Feature combination: m12
- Supra-segmental features (see also segment models): m18


Segment Models
- Baseline system
- Supra-segmental features
  - Phone transition modeling: m12
  - Prosody modeling: m18
  - Stress modeling: m18
- Parametric modeling of feature trajectories
- Dynamical system modeling
- Combine with HMMs


Blind Source Separation (Mokios, Sidiropoulos)
- Based on PARAllel FACtor (PARAFAC) analysis, i.e., low-rank decomposition of multi-dimensional tensorial data
- Collects spatial covariance matrix estimates that are sufficiently separated in time
- Assumptions
  - uncorrelated speaker signals and noise
  - D(t) is a diagonal matrix of speaker powers for measurement period t
  - the noise power is estimated from silence intervals
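The covariance model itself did not survive in the transcript. A form commonly used in PARAFAC-based separation, and consistent with the assumptions listed on the slide, is sketched below. The symbols $\mathbf{A}$ (mixing or array response matrix), $\sigma^2$ (noise power) and $K$ (number of measurement periods) are assumptions introduced here, not taken from the slides.

```latex
\mathbf{R}(t_k) \;=\; \mathbf{A}\,\mathbf{D}(t_k)\,\mathbf{A}^{H} \;+\; \sigma^{2}\mathbf{I},
\qquad k = 1, \dots, K
```

Under this reading, subtracting the estimated noise term and stacking the $K$ matrices gives a three-way array whose low-rank decomposition yields $\mathbf{A}$ up to permutation and scaling of its columns.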


Acoustic Model Adaptation
- Adaptation method: Bayes optimal classification
- Acoustic models: Discrete-Mixture HMMs

Bayes optimal classification
- The classifier scores a test data vector x_test under every class
- It chooses the class that results in the highest value
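The decision rule was an equation on the original slide and is missing from the transcript. In its textbook form, given here as a sketch with assumed notation ($\mathcal{D}_c$ for the training data of class $c$, $\theta$ for the model parameters, equal class priors), it reads:

```latex
\hat{c} \;=\; \arg\max_{c}\; p(x_{\text{test}} \mid \mathcal{D}_c),
\qquad
p(x_{\text{test}} \mid \mathcal{D}_c) \;=\; \int p(x_{\text{test}} \mid \theta)\, p(\theta \mid \mathcal{D}_c)\, d\theta
```

The integral averages the prediction of every parameter hypothesis, weighted by its posterior probability.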

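Bayes optimal versus MAP
- Assumption: the posterior is sufficiently peaked around the most probable point
- MAP approximation: the full parameter posterior is replaced by the single most probable parameter set θ_MAP
- θ_MAP is the set of parameters that maximizes the posterior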
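The equations for this slide are missing from the transcript. The standard relation between the two criteria, written as a sketch with assumed notation ($\mathcal{D}$ for the training data, $\theta$ for the model parameters), is:

```latex
p(x_{\text{test}} \mid \mathcal{D})
\;=\; \int p(x_{\text{test}} \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta
\;\approx\; p(x_{\text{test}} \mid \theta_{\text{MAP}}),
\qquad
\theta_{\text{MAP}} \;=\; \arg\max_{\theta}\; p(\mathcal{D} \mid \theta)\, p(\theta)
```

The approximation is exact in the limit where the posterior $p(\theta \mid \mathcal{D})$ collapses to a point mass at $\theta_{\text{MAP}}$, which is the "sufficiently peaked" assumption.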

Why Bayes optimal classification
- Optimal classification criterion
- The predictions of all parameter hypotheses are combined
- Better discrimination
- Less training data
- Faster asymptotic convergence to the ML estimate

Why Bayes optimal classification
- However:
  - Computationally more expensive
  - Difficult to find analytical solutions
- Hence some approximations should still be considered

Approximate Bayesian Decision Rule (Merhav, Ephraim 1991)
- Having
  - training data y
  - test sequence x
  - M, the number of source models
  - H_λ, the parameter set of each source
- However
  - Still difficult to implement
  - Strong assumptions

Discrete-Mixture HMMs (Digalakis et al. 2000)
- Based on sub-vector quantization
- Introduces a new form of observation distributions
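The observation distribution itself is not shown in the transcript. One reading of a discrete-mixture model over quantized sub-vectors is a weighted sum over mixture components of a product of discrete probabilities, one per sub-vector. The sketch below implements that reading; the data layout and the names `nearest_centroid` and `dmhmm_observation_prob` are assumptions made for illustration.

```python
def nearest_centroid(subvector, codebook):
    """Index of the codebook centroid closest to the sub-vector."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((x - c) ** 2 for x, c in zip(subvector, codebook[i])),
    )

def dmhmm_observation_prob(obs, partition, codebooks, weights, tables):
    """Observation probability of one state under the assumed DMHMM form.

    obs        feature vector (list of floats)
    partition  list of index lists, one per sub-vector
    codebooks  codebooks[k] is the list of centroids for sub-vector k
    weights    weights[m] is the weight of mixture component m
    tables     tables[m][k][q] is the probability of codeword q of
               sub-vector k under mixture component m
    """
    # Quantize every sub-vector independently.
    codes = [
        nearest_centroid([obs[i] for i in idx], codebooks[k])
        for k, idx in enumerate(partition)
    ]
    total = 0.0
    for weight, table in zip(weights, tables):
        prob = weight
        for k, q in enumerate(codes):
            prob *= table[k][q]  # sub-vectors are independent within a component
        total += prob
    return total
```

Within one component the sub-vectors are treated as independent, so any correlation between sub-vectors has to be captured by mixing several components.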

DMHMM benefits (Digalakis et al. 2000)
- Quantization scheme driven by speech recognition performance
- Quantization of the acoustic space in sufficient detail
- Mixtures capture the correlation between sub-vectors
- Well suited to client-server applications
- Performance comparable to continuous HMMs
- Faster decoding speeds

DMHMM parameters that could be adapted
- Partitioning into sub-vectors
  - How many sub-vectors
  - Which MFCCs form each sub-vector
- Bit allocation
  - Optimize bit allocation based on adaptation data
- Discrete mixture weights
- Centroids of codebooks
- Centroid observation probabilities

Adaptation on DMHMMs
- Goal: re-estimate the centroid observation distributions
- Transformation-based adaptation?
  - Maybe not enough training data for the number of centroids
- Bayesian adaptation?
  - Could benefit from its convergence property
- Optimal Bayes classification?
  - Easier to find approximate forms for DMHMMs


TUC Non-Native Recordings
- 10 speakers (6 male, 4 female)
- Fluency in English
  - 4 excellent
  - 5 good to very good
  - 1 satisfactory
- Speaker pronunciation
  - 1 from Cyprus
  - 3 from Northern Greece
  - 1 from the Ionian Islands
  - 2 from the Athens area
  - 1 from Crete
  - 1 from Central Greece

EXTRA SLIDES

Prior Work Overview
- MLST
- Constr. Est. Adapt.
- MAP (Bayes) Adapt.
- Genones
- Segment Models
- VTLN
- Combinations
- Robust Features

HIWIRE Work Proposal
- Adaptation: Bayes optimal classification
- Audio-Visual ASR: baseline experiments
- Microphone Arrays: speech/noise separation
- Feature Selection: AM-FM features
- Acoustic Modeling: segment models

Aurora 2 Performance with HTK FE (Clean Training)

        A                                       B                                       C
        Subway  Babble  Car     Exhibit Avg.    Restr   Street  Airport Station Avg.    Sub.M.  Str.M.  Avg.    Overall
Clean   98.83   98.97   98.81   99.14   98.94   98.83   98.97   98.81   99.14   98.94   99.02   98.97   99      98.95
20 dB   96.96   89.96   96.84   96.2    94.99   89.19   95.77   90.07   94.38   92.35   94.47   95.19   94.83   93.9
15 dB   92.91   73.43   89.53   91.85   86.93   74.39   88.27   76.89   83.62   80.79   87.63   89.69   88.66   84.82
10 dB   78.72   49.06   66.24   75.1    67.28   52.72   66.75   53.15   59.61   58.06   75.19   75.27   75.23   65.18
5 dB    53.39   27.03   32.8    43.51   39.18   29.57   38.15   30.69   29.71   32.03   52.84   48.85   50.85   38.65
0 dB    27.3    11.73   13.27   15.98   17.07   11.7    18.68   15.84   12.25   14.62   26.01   21.64   23.83   17.44
-5 dB   12.62   4.96    8.35    7.65    8.4     5.04    10.07   8.08    8.49    7.92    12.1    10.7    11.4    8.81
Avg.    65.82   50.73   57.98   61.35   58.97   51.63   59.52   53.36   55.31   54.96   63.89   62.9    63.4    58.25
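The per-set "Avg." columns in this table appear to be plain arithmetic means of the per-noise accuracies in the same row. A small check on the 20 dB row of set A illustrates this; the helper name `set_average` is hypothetical and the rounding to two decimals is an assumption about how the table was produced.

```python
def set_average(accuracies):
    """Mean accuracy over the noise types of one test set, rounded to 2 decimals."""
    return round(sum(accuracies) / len(accuracies), 2)

# Set A at 20 dB with the HTK front-end, clean training:
# subway, babble, car, exhibition hall.
set_a_20db = [96.96, 89.96, 96.84, 96.2]
average_a_20db = set_average(set_a_20db)
```

The result matches the 94.99 reported in the "Avg." column for set A at 20 dB.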

Aurora 2 Performance with HTK FE (Multi-Condition Training)

        A                                       B                                       C
        Subway  Babble  Car     Exhibit Avg.    Restr   Street  Airport Station Avg.    Sub.M.  Str.M.  Avg.    Overall
Clean   98.59   98.52   98.48   98.55   98.54   98.59   98.52   98.48   98.55   98.54   98.65   98.52   98.59   98.55
20 dB   97.64   97.61   97.85   96.98   97.52   96.56   97.46   97.17   96.64   96.96   97.05   96.43   96.74   97.14
15 dB   96.75   96.8    97.64   96.58   96.94   94.72   95.92   95.62   95.25   95.38   95.46   95.5    95.48   96.02
10 dB   94.38   95.22   95.65   93.12   94.59   90.97   94.2    92.78   92.35   92.58   92.35   91.9    92.13   93.29
5 dB    88.42   87.67   86.17   86.95   87.3    81.85   85.34   84.91   82.91   83.75   81.46   81.86   81.66   84.75
0 dB    65.67   61.03   50.82   61.8    59.83   56.83   60.22   64.36   54.21   58.91   45.16   54.05   49.61   57.42
-5 dB   26.01   26.18   19.15   22.49   23.46   22.6    26.3    27.65   18.88   23.86   18.61   25.54   22.08   23.34
Avg.    88.57   87.67   85.63   87.09   87.24   84.19   86.63   86.97   84.27   85.52   82.3    83.95   83.12   85.72

Aurora 2 Performance with WI008 FE (Clean Training)

        A                                       B                                       C
        Subway  Babble  Car     Exhibit Avg.    Restr   Street  Airport Station Avg.    Sub.M.  Str.M.  Avg.    Overall
Clean   99.08   99.03   99.05   99.23   99.1    99.08   99.03   99.05   99.23   99.1    99.02   99.03   -       99.08
20 dB   97.88   98.25   98.36   97.81   98.08   98.07   97.64   98.42   98.43   98.14   97.36   97.67   97.52   97.99
15 dB   96.38   96.74   97.52   96.7    96.84   95.33   96.58   97.05   96.76   96.43   95.3    95.74   95.52   96.41
10 dB   92.26   91.99   95.29   92.59   93.03   89.87   92.74   93.26   93.86   92.43   90.33   90.75   90.54   92.29
5 dB    83.88   80.68   86.01   84.05   83.66   76.05   83.25   83.54   84.2    81.76   78.88   78.48   78.68   81.9
0 dB    61.93   51.12   66.06   63.5    60.65   50.26   59.7    60.24   62.23   58.11   52.59   52.12   52.36   57.98
-5 dB   31.07   18.95   29.82   33.2    28.26   18.39   29.23   27.32   29.56   26.13   25.15   26.12   25.64   26.88
Avg.    86.47   83.76   88.65   86.93   86.45   81.92   85.98   86.5    87.1    85.37   82.89   82.95   82.92   85.31

Aurora 2 Performance with WI008 FE (Multi-Condition Training)

        A                                       B                                       C
        Subway  Babble  Car     Exhibit Avg.    Restr   Street  Airport Station Avg.    Sub.M.  Str.M.  Avg.    Overall
Clean   99.02   98.82   98.99   99.14   98.99   99.02   98.82   98.99   99.14   98.99   -       98.85   98.92   98.98
20 dB   98.62   98.58   98.54   98.24   98.5    98.1    98.13   98.63   98.8    98.42   98.07   97.94   98.01   98.37
15 dB   97.54   97.91   98.42   97.56   97.86   96.93   97.85   98.03   97.69   97.63   97.54   97.73   97.64   97.72
10 dB   95.33   96.07   97.38   95.34   96.03   94.84   95.59   95.91   96.05   95.6    95.58   95.31   95.45   95.74
5 dB    91.43   90.21   90.93   90.1    90.67   87.14   90.39   91.44   90.16   89.78   88.92   87.52   88.22   89.82
0 dB    75.28   68.71   80.7    76      75.17   65.55   73.85   75.78   74.08   72.32   66.99   65.63   66.31   72.26
-5 dB   39.85   30.05   40.41   44.99   38.83   28.52   38.88   40.95   41.75   37.53   30.43   30.59   30.51   36.64
Avg.    91.64   90.3    93.19   91.45   91.65   88.51   91.16   91.96   91.36   90.75   89.42   88.83   89.13   90.78

Aurora 3 HTK Settings: Spanish
- Parametrize.csh
    Set Options = "-F RAW -fs 8 -q -noc0 -swap"
- Config_tr
    TARGETKIND = MFCC_E_D_A
    DELTAWINDOW = 3
    ACCWINDOW = 2
    ENORMALISE = F
    HNET:TRACE = 2
    NATURALREADORDER = T
    NATURALWRITEORDER = T

Aurora 3 HTK Settings: Italian
- Sdc_it.conf
    $FE_OPTIONS = "-q -F RAW -fs 8"
- Config
    TARGETKIND = MFCC_D_A_E
    HNET:TRACE = 2
    ACCWINDOW = 2
    DELTAWINDOW = 3
    ENORMALISE = F
    NATURALREADORDER = T
    NATURALWRITEORDER = T

Baseline Aurora 3 Performance

                FINNISH                 SPANISH                 GERMAN
FRONT-END       WM      MM      HM      WM      MM      HM      WM      MM      HM
WI007           90.53   72.5    30.35   86.88   73.72   42.23   90.58   79.06   74.24
WI008           95.62   76.68   86.11   93.47   85.41   81.02   94.49   88.73   89.55
TRAIN (#sent.)
TEST (#sent.)

                DANISH                  ITALIAN
FRONT-END       WM      MM      HM      WM      MM      HM
WI007           79.62   49.29   33.15   93.64   82.02   39.84
WI008           84.99   65.68   63.91   96.58   88.53   88.22
TRAIN (#sent.)
TEST (#sent.)


Baseline Aurora 3 with WI008 FE (TUC - UGR comparison)

                FINNISH                 SPANISH                 GERMAN
FRONT-END       WM      MM      HM      WM      MM      HM      WM      MM      HM
WI008-TUC       95.62   76.68   86.11   93.47   85.41   81.02   94.49   88.73   89.55
WI008-UGR       96.09   80.92   86.61   96.64   93.92   91.55   95.11   90.84   91.25
TRAIN (#sent.)
TEST (#sent.)

                DANISH                  ITALIAN
FRONT-END       WM      MM      HM      WM      MM      HM
WI008-TUC       84.99   65.68   63.91   96.58   88.53   88.22
WI008-UGR       93.37   81.49   79.59   96.71   92.53   89
TRAIN (#sent.)
TEST (#sent.)

Baseline Aurora 4 with Lattices: Small Lattice Size

                                                                  Avg.
Clean     86.56   80.77   68.43   64.75   55.31   70.98   59.7    72.37
Multi     86.85   86.52   83.98   82.5    81.33   84.64   81.84   84.88
Noisy     87      85.97   81.58   80.48   76.51   82.65   77.48   83.3
Average   86.8    84.42   78      75.91   71.05   79.42   73.01   80.…

Clean     88.36   85.67   74.36   73.44   66.41   74.59   63.87
Multi     86.81   86.85   85.78   85.34   85.56   85.89   84.42
Noisy     87.81   86.96   85.71   83.61   83.09   85.6    81.8
Average   87.66   86.49   81.95   80.8    78.35   82.03   76.7

Baseline Aurora 4 with Lattices: Medium Lattice Size

Clean     87.92   84.71   72.89   72.78   65.12   73.78   62.91
Multi     85.97   85.52   84.79   83.83   83.9    84.24   83.24
Noisy     87.33   85.78   84.42   82.28   81.58   84.16   80.88
Average   87.07   85.34   80.7    79.63   78.87   80.73   75.…

                                                                  Avg.
Clean     85.67   79.45   66.08   63.68   53.86   69.07   58.31   71.16
Multi     86.19   84.97   82.65   81.18   80.63   82.84   80.29   83.59
Noisy     86.7    85.45   81.14   78.67   74.22   82.21   76.65   82.25
Average   86.19   83.29   76.62   74.51   69.57   78.04   71.75   79.14
