Towards speaker and environmental robustness in ASR: the HIWIRE project

Presentation transcript:
Towards speaker and environmental robustness in ASR: the HIWIRE project
A. Potamianos 1, G. Bouselmi 2, D. Dimitriadis 3, D. Fohr 2, R. Gemello 4, I. Illina 2, F. Mana 4, P. Maragos 3, M. Matassoni 5, V. Pitsikalis 3, J. Ramírez 6, E. Sanchez-Soto 1, J. Segura 6, and P. Svaizer 5
1 Dept. of E.C.E., Tech. Univ. of Crete, Chania, Greece; 2 Speech Group, LORIA, Nancy, France; 3 School of E.C.E., Natl. Tech. Univ. of Athens, Athens, Greece; 4 Loquendo, via Valdellatorre, Torino, Italy; 5 ITC-irst, via Sommarive 18, Povo (TN), Italy; 6 Dept. of Signal Theory, Univ. of Granada, Spain

Outline
 Introduction: the HIWIRE project
 Goals and objectives
 Research areas:
   Environmental robustness
   Speaker robustness
 Experimental results
 Ongoing work

HIWIRE project
 Goals: environment- and speaker-robust ASR
 Showcase: fixed cockpit platform, PDA platform
 Industrial partners: Thales Avionics, Loquendo
 Research partners: LORIA, TUC, NTUA, UGR, ITC-irst, Thales Research
 FP6 project: 6/2004 to 5/2007

Research areas
 Environmental robustness:
   Multi-microphone ASR
   Robust feature extraction
   Feature fusion and audio-visual ASR
   Feature equalization
   Voice-activity detection
   Speech enhancement
 Speaker robustness:
   Model transformation
   Acoustic modeling for non-native speech

Multi-microphone ASR: Outline  Beamforming and Adaptive Noise Cancellation  Environmental Acoustics Estimation

Beamforming: D&S (delay-and-sum)
 Availability of multi-channel signals makes it possible to selectively capture the desired source
 Issues: estimation of reliable TDOAs
 Method: CSP (cross-power spectrum phase) analysis over multiple frames (see the sketch below)
 Advantages: robustness, reduced computational power
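For concreteness, here is a minimal numpy sketch of the two steps just listed: TDOA estimation via CSP (phase-transform cross-correlation, also known as GCC-PHAT) and frequency-domain delay-and-sum alignment. Function names and the single-frame formulation are assumptions, not the project's implementation; the slide's CSP analysis additionally averages over multiple frames.

```python
import numpy as np

def csp_tdoa(ref, ch, fs, max_tau=None):
    """Estimate the TDOA of `ch` relative to `ref` via CSP (GCC-PHAT):
    normalize the cross-power spectrum to unit magnitude so only phase
    remains; the peak of its inverse transform gives the lag."""
    n = len(ref) + len(ch)
    X = np.fft.rfft(ref, n=n) * np.conj(np.fft.rfft(ch, n=n))
    cc = np.fft.irfft(X / (np.abs(X) + 1e-12), n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def delay_and_sum(channels, tdoas, fs):
    """Delay-and-sum beamformer: advance each channel by its TDOA
    (fractional delays applied as a linear phase in the frequency
    domain) and average. channels: (n_mics, n_samples) array."""
    n_mics, n_samples = channels.shape
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    acc = np.zeros(freqs.shape, dtype=complex)
    for x, tau in zip(channels, tdoas):
        acc += np.fft.rfft(x) * np.exp(2j * np.pi * freqs * tau)
    return np.fft.irfft(acc / n_mics, n=n_samples)
```

In use, `csp_tdoa` would be called once per microphone against a reference channel, and the resulting delays fed to `delay_and_sum`; the frequency-domain phase shift handles fractional-sample delays without explicit interpolation.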

D&S with MarkIII
 Test set: N1_SNR0 of MC-TIDIGITS (cockpit noise), MarkIII channels
 Clean models, trained on original TIDIGITS
 Results (WERR [%]): C_1: 38.5; DS_C8: 79.9; DS_C64: 85.4

Robust Features for ASR
 Modulation features: AM-FM modulations, Teager energy cepstrum
 Fractal features: dynamical denoising, correlation dimension, multiscale fractal dimension
 Hybrid/merged features
 Relative improvements: up to +62% (Aurora 3), up to +36% (Aurora 2), up to +61% (Aurora 2)

Speech Modulation Features
 Filterbank design
 Short-term AM-FM modulation features:
   Short-term mean instantaneous amplitude (IA-Mean)
   Short-term mean instantaneous frequency (IF-Mean)
   Frequency modulation percentages (FMP)
 Short-term energy modulation features:
   Average Teager energy, cepstrum coefficients (TECC)

Modulation acoustic features [block diagram]: speech → nonlinear processing and demodulation → regularization + multiband filtering → statistical processing (with V.A.D.) → robust feature transformation/selection. Outputs: energy features (Teager energy cepstrum coefficients, TECC) and AM-FM modulation features (IA-Mean, IF-Mean, FMP).
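To make the demodulation stage concrete, here is a minimal numpy sketch of the Teager-Kaiser energy operator and the DESA-1 energy separation algorithm that such AM-FM front-ends build on. This is the textbook formulation, not the project's code; the edge handling and the regularization constant are assumptions.

```python
import numpy as np

def teager_energy(x):
    """Teager-Kaiser energy operator: Psi[x](n) = x(n)^2 - x(n-1)*x(n+1)."""
    x = np.asarray(x, dtype=float)
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]        # replicate edge values
    return psi

def desa1(x, fs):
    """DESA-1 energy separation: instantaneous amplitude and frequency
    (Hz) of a band-passed signal x sampled at fs."""
    x = np.asarray(x, dtype=float)
    y = np.diff(x, prepend=x[0])             # y(n) = x(n) - x(n-1)
    psi_x = teager_energy(x)
    psi_y = teager_energy(y)
    eps = 1e-10                               # regularization (assumed)
    cos_om = 1.0 - (psi_y[:-1] + psi_y[1:]) / (4.0 * psi_x[:-1] + eps)
    omega = np.arccos(np.clip(cos_om, -1.0, 1.0))
    inst_freq = omega * fs / (2.0 * np.pi)
    inst_amp = np.sqrt(np.abs(psi_x[:-1]) / (1.0 - cos_om ** 2 + eps))
    return inst_amp, inst_freq
```

IA-Mean and IF-Mean would then be short-term frame averages of `inst_amp` and `inst_freq`, computed per filterbank band.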

TIMIT-based Speech Databases
 TIMIT database:
   Training set: 3696 sentences, ~35 phonemes/utterance
   Testing set: 1344 utterances
   Sampling frequency: 16 kHz
 Feature vectors: MFCC + C0 + AM-FM + 1st and 2nd time derivatives
 Stream weights: (1) for MFCC and (2) for AM-FM
 3-state left-right HMMs, 16 mixtures
 All-pair, unweighted grammar
 Performance criterion: phone accuracy rate (%)
 Back-end system: HTK v3.2.0

Results: TIMIT + noise: up to +106%

Aurora 3 - Spanish
 Connected digits, sampling frequency 8 kHz
 Training set:
   WM (well-matched): 3392 utterances (quiet 532, low 1668, and high noise 1192)
   MM (medium-mismatch): 1607 utterances (quiet 396 and low noise 1211)
   HM (high-mismatch): 1696 utterances (quiet 266, low 834, and high noise 596)
 Testing set:
   WM: 1522 utterances (quiet 260, low 754, and high noise 508), 8056 digits
   MM: 850 utterances (all high noise), 4543 digits
   HM: 631 utterances (low 377 and high noise 254), 3325 digits
 2 back-end ASR systems (HTK and BLasr)
 Feature vectors: MFCC + AM-FM (or auditory + AM-FM), TECC
 All-pair, unweighted grammar (or word-pair grammar)
 Performance criterion: word (digit) accuracy rate

Results: Aurora 3: up to +62%

Fractal features [block diagram]: speech signal → N-d (noisy) embedding → local SVD geometrical filtering → cleaned (filtered) embedding → filtered dynamics correlation dimension (FDCD), plus the multiscale fractal dimension (MFD).
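For orientation, a minimal Grassberger-Procaccia correlation-sum sketch over a delay embedding, the quantity underlying the correlation dimension feature. This is a generic textbook version under assumed parameters; the slide's local-SVD geometrical filtering step is not included.

```python
import numpy as np

def correlation_sum(x, dim=5, tau=2, radii=np.logspace(-2, 0, 10)):
    """Grassberger-Procaccia correlation sum C(r) over a delay embedding.
    The correlation dimension is the slope of log C(r) vs. log r in the
    scaling region. O(n^2) memory: intended for short frames only."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * tau
    emb = np.stack([x[i * tau : i * tau + n] for i in range(dim)], axis=1)
    dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    iu = np.triu_indices(n, k=1)              # distinct pairs only
    pair_d = dists[iu]
    return np.array([(pair_d < r).mean() for r in radii])
```

The dimension estimate is then the slope of `np.log(C)` against `np.log(radii)` over the linear region.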

Databases: Aurora 2
 Task: speaker-independent recognition of digit sequences
 TI-Digits at 8 kHz
 Training (8440 utterances per scenario, 55M/55F):
   Clean (8 kHz, G712)
   Multi-condition (8 kHz, G712)
   4 noises (artificial): subway, babble, car, exhibition
   5 SNRs: 5, 10, 15, 20 dB, clean
 Testing, artificially added noise, 7 SNRs: -5, 0, 5, 10, 15, 20 dB, clean
   A: noises as in multi-condition training, G712 (28028 utterances)
   B: restaurant, street, airport, train station, G712 (28028 utterances)
   C: subway, street (MIRS) (14014 utterances)

Results: Aurora 2: up to +61%

Feature Fusion  Merge synchronous feature streams  Investigate both supervised and unsupervised algorithms

Feature Fusion: multi-stream
 Compute "optimal" exponent weights for each stream s [HMM Gaussian mixture formulation; similar expressions for MM, naïve Bayes, Euclidean/Mahalanobis classifiers]
 Optimality in the sense of minimizing the "total classification error"
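In the HMM Gaussian-mixture formulation referred to here, the exponent-weighted multi-stream observation score takes the standard form (notation assumed: γ_s is the exponent weight of stream s, c_jsm the mixture weights of state j, stream s):

```latex
b_j(\mathbf{o}_t) \;=\; \prod_{s=1}^{S}\left[\sum_{m=1}^{M_s} c_{jsm}\,
\mathcal{N}\big(\mathbf{o}_{st};\,\boldsymbol{\mu}_{jsm},\boldsymbol{\Sigma}_{jsm}\big)\right]^{\gamma_s}
```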

Multi-Stream Classification
 Two-class problem: w1, w2
 Feature vector x is broken up into two independent streams x1 and x2
 Stream weights s1 and s2 are used to "equalize" the "probabilities"

Multi-Stream Classification
 Bayes classification decision (a toy sketch follows below)
 Non-unity weights increase the Bayes error, but the estimation/modeling error may decrease, so stream weights can decrease the total error
 "Optimal" weights minimize the estimation error variance σ_z²
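A toy sketch of the weighted two-stream decision rule, with diagonal-Gaussian class models; the function names and model structure are illustrative assumptions, not the project's classifier.

```python
import numpy as np

def log_gauss(x, mean, var):
    """Diagonal-covariance Gaussian log-density."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def classify_two_stream(x1, x2, models, s1, s2):
    """Weighted two-stream Bayes decision: score each class w as
    s1*log p(x1|w) + s2*log p(x2|w) and pick the argmax.
    models: dict class -> ((mean1, var1), (mean2, var2))."""
    scores = {w: s1 * log_gauss(x1, *m1) + s2 * log_gauss(x2, *m2)
              for w, (m1, m2) in models.items()}
    return max(scores, key=scores.get)
```

With s1 = s2 = 1 this reduces to the plain Bayes rule; the slides' point is that unequal weights can lower the total error when one stream's likelihoods are estimated less reliably.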

Optimal Stream Weights
 If the single-stream classifiers have equal error rates, the optimal stream weights are inversely proportional to the total stream estimation error variance.

Optimal Stream Weights
 If the estimation error variance is equal in each stream, the optimal weights are approximately inversely proportional to the single-stream classification error.
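In symbols, one way to render the two statements (notation assumed, not from the slides: σ_s² is the estimation error variance of stream s and e_s its single-stream classification error):

```latex
\text{equal error rates:}\quad \frac{s_1}{s_2} = \frac{\sigma_2^2}{\sigma_1^2}
\qquad\qquad
\text{equal error variances:}\quad \frac{s_1}{s_2} \approx \frac{e_2}{e_1}
```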

Experimental Results
 Subset of the CUAVE database used: 36 speakers (30 training, 6 testing), 5 sequences of 10 digits per speaker
   Training set: 1500 digits (30 x 5 x 10)
   Test set: 300 digits (6 x 5 x 10)
 Features:
   Audio: 39 features (MFCC_D_A)
   Visual: 105 features (ROIDCT_D_A)
 Multi-stream HMM models, middle integration:
   8-state, left-to-right, whole-digit HMM models
   Single Gaussian mixture
   AV-HMM uses separate audio and video feature streams

Optimal Stream Weights: Results
 Assume: σV²/σA² = 2, SNR-independent
 Correlation: 0.96

Parametric non-linear equalization
 Parametric histogram equalization (a plain histogram-equalization sketch follows below)
 Smoother estimates
 Bi-modal transformation (speech vs. non-speech)
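For background, a minimal non-parametric histogram-equalization sketch that maps each feature dimension to a Gaussian reference. The slide's method is a parametric, bi-modal refinement of this idea, so treat this as the baseline technique rather than the project's algorithm.

```python
import numpy as np
from scipy.stats import norm

def heq_to_gaussian(feats):
    """Map each feature dimension to a standard-normal reference via its
    empirical CDF (rank-based histogram equalization).
    feats: (n_frames, n_dims) array of, e.g., cepstral features."""
    n = feats.shape[0]
    out = np.empty(feats.shape, dtype=float)
    for d in range(feats.shape[1]):
        ranks = feats[:, d].argsort().argsort()   # rank 0..n-1 per frame
        u = (ranks + 0.5) / n                     # empirical CDF in (0, 1)
        out[:, d] = norm.ppf(u)                   # Gaussian reference
    return out
```

A parametric variant would replace the empirical CDF with a fitted mixture (e.g., two Gaussians for speech vs. non-speech), giving the smoother estimates the slide mentions.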

Voice Activity Detection  Bi-spectrum based VAD  Support vector machine based VAD  Combination of VAD with speech enhancement
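As a rough illustration of the SVM-based variant above (not the project's detector; the frame features, kernel choice, and hangover smoothing are all assumptions):

```python
import numpy as np
from sklearn.svm import SVC

def train_svm_vad(X_train, y_train):
    """SVM-based VAD: an RBF-kernel classifier over per-frame features
    (e.g., subband log-energies). y_train: 1 = speech, 0 = non-speech."""
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(X_train, y_train)
    return clf

def vad_decide(clf, X_frames, hangover=5):
    """Classify frames, then apply a simple hangover so short pauses
    inside speech are not clipped (hangover length is illustrative)."""
    raw = clf.predict(X_frames).astype(bool)
    out = raw.copy()
    run = 0
    for i, v in enumerate(raw):
        run = hangover if v else max(run - 1, 0)
        out[i] = run > 0
    return out
```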

Speech Enhancement
 Modified Wiener filtering, with the filter depending on the global SNR (a minimal Wiener gain is sketched below)
 Modified Ephraim-Malah enhancement, based on the E-M spectral attenuation rule
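A minimal frequency-domain Wiener gain for reference; the slide's global-SNR dependence and the Ephraim-Malah attenuation rule are not modeled here, and the floor constant is illustrative.

```python
import numpy as np

def wiener_gain(noisy_psd, noise_psd, snr_floor=1e-3):
    """Wiener gain H = SNR_prio / (1 + SNR_prio) per frequency bin,
    with a floor on the a priori SNR to limit musical noise."""
    snr_prio = np.maximum(noisy_psd / (noise_psd + 1e-12) - 1.0, snr_floor)
    return snr_prio / (1.0 + snr_prio)
```

The enhanced spectrum is the noisy spectrum multiplied bin-wise by this gain, followed by an inverse STFT.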

Non-Native Speech Recognition
 Build non-native models by combining English and native models
 Use phone confusion between English phones and native acoustic models to add alternate model paths
 Extract the confusion matrix automatically by running phone recognition with the native models (see the sketch below)
 Phone pronunciation depends on the word's graphemes: English phone [grapheme] -> French phone
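A small sketch of the confusion-rule extraction step described above, under assumptions: the recognizer output has already been aligned with the English reference into (English phone, recognized native phone) pairs, each English phone maps to single native phones (the project's rules can also map to phone sequences), and the pruning threshold is illustrative.

```python
from collections import Counter, defaultdict

def extract_confusion_rules(decoded_pairs, min_prob=0.2):
    """Derive English-phone -> native-phone confusion rules from
    (english_phone, recognized_native_phone) pairs obtained by running
    the native-model phone recognizer on English speech."""
    counts = defaultdict(Counter)
    for eng, nat in decoded_pairs:
        counts[eng][nat] += 1
    rules = {}
    for eng, c in counts.items():
        total = sum(c.values())
        rules[eng] = [(nat, n / total) for nat, n in c.most_common()
                      if n / total >= min_prob]   # prune rare confusions
    return rules
```

Each retained (native phone, probability) pair then becomes an alternate model path added in parallel with the English model.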

Example for the English phone /tʃ/ [diagram]: extracted English → French rules map it to French phones (/t/, /ʃ/, /k/ appear in the figure), added as alternate French-model paths alongside the English model.

Graphemic constraints
 Example: APPROACH /ah p r ow ch/ → APPROACH (A, ah) (PP, p) (R, r) (OA, ow) (CH, ch)
 Alignment between graphemes and phones for each word of the lexicon
 Lexicon modification: add graphemes for each word (see the sketch below for how the rules expand an entry)
 Confusion rules extraction: (grapheme, English phone) → list of non-native phones
   Example: (A, ah) → a
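To show how the grapheme-constrained rules could be applied to a modified lexicon entry, a small hypothetical sketch; the alignment and rule formats mirror the APPROACH example above but are otherwise assumptions.

```python
from itertools import product

def expand_pronunciations(aligned_word, rules):
    """Generate alternate pronunciations for a grapheme-phone aligned word.
    rules: (grapheme, english_phone) -> list of non-native phone
    alternatives, e.g. ('A', 'ah') -> ['a'] as in the slide's example."""
    per_slot = [[phone] + rules.get((graph, phone), [])
                for graph, phone in aligned_word]
    return [list(p) for p in product(*per_slot)]

# Usage with the APPROACH example:
approach = [("A", "ah"), ("PP", "p"), ("R", "r"), ("OA", "ow"), ("CH", "ch")]
print(expand_pronunciations(approach, {("A", "ah"): ["a"]}))
# -> [['ah', 'p', 'r', 'ow', 'ch'], ['a', 'p', 'r', 'ow', 'ch']]
```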

Approach used: experiments on the HIWIRE database
Results (WER / SER, %) for French, Italian, and Spanish speakers:
 Command and control grammar: baseline vs. confusion vs. graphemes + confusion
 Word loop grammar: baseline vs. confusion vs. graphemes + confusion

Ongoing Work
 Front-end combination and integration of algorithms
 Fixed-platform demonstration: non-native speech demo
 PDA-platform demonstration
 Ongoing research