Towards speaker and environmental robustness in ASR: the HIWIRE project

A. Potamianos (1), G. Bouselmi (2), D. Dimitriadis (3), D. Fohr (2), R. Gemello (4), I. Illina (2), F. Mana (4), P. Maragos (3), M. Matassoni (5), V. Pitsikalis (3), J. Ramírez (6), E. Sanchez-Soto (1), J. Segura (6), and P. Svaizer (5)

(1) Dept. of E.C.E., Tech. Univ. of Crete, Chania, Greece
(2) Speech Group, LORIA, Nancy, France
(3) School of E.C.E., Natl. Tech. Univ. of Athens, Athens, Greece
(4) Loquendo, via Valdellatorre, 4-10149, Torino, Italy
(5) ITC-irst, via Sommarive 18 - Povo (TN), Italy
(6) Dept. of Signal Theory, Univ. of Granada, Spain

Outline
- Introduction: the HIWIRE project
- Goals and objectives
- Research areas:
  - Environmental robustness
  - Speaker robustness
- Experimental results
- Ongoing work

HIWIRE project
- http://www.hiwire.org
- Goals: environment- and speaker-robust ASR
- Showcase: fixed cockpit platform, PDA platform
- Industrial partners: Thales Avionics, Loquendo
- Research partners: LORIA, TUC, NTUA, UGR, ITC-IRST, Thales research
- FP6 project: 6/2004 to 5/2007

Research areas
- Environmental robustness
  - Multi-microphone ASR
  - Robust feature extraction
  - Feature fusion and audio-visual ASR
  - Feature equalization
  - Voice-activity detection
  - Speech enhancement
- Speaker robustness
  - Model transformation
  - Acoustic modeling for non-native speech

Multi-microphone ASR: Outline
- Beamforming and adaptive noise cancellation
- Environmental acoustics estimation

Beamforming: delay-and-sum (D&S)
The availability of multi-channel signals makes it possible to capture the desired source selectively.
- Issue: reliable estimation of the time differences of arrival (TDOAs)
- Method: crosspower spectrum phase (CSP) analysis over multiple frames
- Advantages: robustness, reduced computational power
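The delay-and-sum idea can be sketched in a few lines. This is only an illustration under simplifying assumptions that are not in the slides: integer sample delays, circular shifts, a single short frame instead of CSP averaged over multiple frames, and a naive DFT to stay within the standard library. The function names (`csp_delay`, `delay_and_sum`, `circular_shift`) are hypothetical.

```python
import cmath
import math
import random

def dft(x, inverse=False):
    """Naive O(N^2) DFT; adequate for one short illustrative frame."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

def csp_delay(ref, sig, max_lag):
    """Integer delay of `sig` relative to `ref` from the CSP (phase-only cross-spectrum) peak."""
    X, Y = dft(ref), dft(sig)
    cross = [y * x.conjugate() for x, y in zip(X, Y)]
    # Discard the magnitude and keep only the phase of the cross-spectrum.
    whitened = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    csp = [v.real for v in dft(whitened, inverse=True)]
    n = len(csp)
    return max(range(-max_lag, max_lag + 1), key=lambda lag: csp[lag % n])

def delay_and_sum(channels, delays):
    """Advance each channel by its estimated delay, then average (circular shift assumed)."""
    n = len(channels[0])
    out = [0.0] * n
    for ch, d in zip(channels, delays):
        for t in range(n):
            out[t] += ch[(t + d) % n]
    return [v / len(channels) for v in out]

def circular_shift(x, d):
    """Delay x by d samples with wrap-around (toy stand-in for propagation delay)."""
    return [x[(t - d) % len(x)] for t in range(len(x))]

random.seed(0)
source = [random.gauss(0.0, 1.0) for _ in range(64)]
channels = [circular_shift(source, d) for d in (0, 3, -2)]
delays = [csp_delay(channels[0], ch, max_lag=8) for ch in channels]
enhanced = delay_and_sum(channels, delays)
```

With real recordings the delays are fractional and noisy, which is why the slide stresses reliable TDOA estimation; the toy channels here are exact shifted copies of one source.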

D&S with MarkIII
- Test set: set N1_SNR0 of MC-TIDIGITS (cockpit noise), MarkIII channels
- Clean models, trained on original TIDIGITS
- Results (WERR [%]):

  Configuration   WERR [%]
  C_1             38.5
  C_32            50.8
  DS_C8           79.9
  DS_C16          83.0
  DS_C32          85.3
  DS_C64          85.4

Robust Features for ASR
- Modulation features: AM-FM modulations, Teager energy cepstrum
- Fractal features: dynamical denoising, correlation dimension, multiscale fractal dimension
- Hybrid-merged features
- Reported improvements: up to +62% (Aurora 3), up to +36% (Aurora 2), up to +61% (Aurora 2)

Speech Modulation Features
- Filterbank design
- Short-term AM-FM modulation features
  - Short-term mean instantaneous amplitude (IA-Mean)
  - Short-term mean instantaneous frequency (IF-Mean)
  - Frequency modulation percentages (FMP)
- Short-term energy modulation features
  - Average Teager energy, cepstrum coefficients (TECC)
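A minimal sketch of how instantaneous amplitude and frequency could be obtained from one band-passed signal. It uses the Teager-Kaiser energy operator with the DESA-2 energy separation algorithm. The slides do not state which demodulation algorithm is used, so DESA-2 is an assumption, as is the plain averaging used for the short-term means. Filterbank design and the cepstral step of TECC are omitted.

```python
import math

def teager(x):
    """Teager-Kaiser energy Psi[x](n) = x(n)^2 - x(n-1)*x(n+1) for interior samples."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def desa2(x):
    """DESA-2 estimates of instantaneous amplitude and frequency (rad/sample)."""
    psi_x = teager(x)                                        # psi_x[i] belongs to sample i + 1
    y = [x[n + 1] - x[n - 1] for n in range(1, len(x) - 1)]  # symmetric difference, y[i] at sample i + 1
    psi_y = teager(y)                                        # psi_y[j] belongs to sample j + 2
    amps, freqs = [], []
    for j, py in enumerate(psi_y):
        px = psi_x[j + 1]
        if px <= 0.0 or py <= 0.0:
            continue  # estimates are undefined for non-positive energies
        arg = max(-1.0, min(1.0, 1.0 - py / (2.0 * px)))
        freqs.append(0.5 * math.acos(arg))
        amps.append(2.0 * px / math.sqrt(py))
    return amps, freqs

def mean(values):
    return sum(values) / len(values)

# Toy "band" signal: a pure cosine with amplitude 0.7 and frequency 0.3 rad/sample.
x = [0.7 * math.cos(0.3 * n + 0.2) for n in range(200)]
amps, freqs = desa2(x)
ia_mean, if_mean = mean(amps), mean(freqs)  # short-term means over this single frame
```

For a pure cosine the Teager energy equals A^2 sin^2(Ω), so DESA-2 recovers the amplitude and frequency exactly; on speech the estimates fluctuate, which is what the short-term statistics summarize.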

Modulation Acoustic Features
[Block diagram with stages: Speech; Nonlinear Processing; Demodulation; Robust Feature Transformation/Selection; Regularization + Multiband Filtering; Statistical Processing; V.A.D.]
- Energy features: Teager energy cepstrum coefficients (TECC)
- AM-FM modulation features: mean instantaneous amplitude (IA-Mean), mean instantaneous frequency (IF-Mean), frequency modulation percentages (FMP)

TIMIT-based Speech Databases
- TIMIT database:
  - Training set: 3696 sentences, ~35 phonemes/utterance
  - Testing set: 1344 utterances, 46680 phonemes
  - Sampling frequency: 16 kHz
- Feature vectors: MFCC + C0 + AM-FM + 1st + 2nd time derivatives
- Stream weights: (1) for MFCC and (2) for AM-FM
- 3-state left-right HMMs, 16 mixtures
- All-pair, unweighted grammar
- Performance criterion: phone accuracy rates (%)
- Back-end system: HTK v3.2.0

Results: TIMIT+Noise Up to +106%

Aurora 3 - Spanish
- Connected digits, sampling frequency 8 kHz
- Training set:
  - WM (Well-Matched): 3392 utterances (quiet 532, low noise 1668, high noise 1192)
  - MM (Medium-Mismatch): 1607 utterances (quiet 396, low noise 1211)
  - HM (High-Mismatch): 1696 utterances (quiet 266, low noise 834, high noise 596)
- Testing set:
  - WM: 1522 utterances (quiet 260, low noise 754, high noise 508), 8056 digits
  - MM: 850 utterances (quiet 0, low noise 0, high noise 850), 4543 digits
  - HM: 631 utterances (quiet 0, low noise 377, high noise 254), 3325 digits
- 2 back-end ASR systems (HTK and BLasr)
- Feature vectors: MFCC+AM-FM (or Auditory+AM-FM), TECC
- All-pair, unweighted grammar (or word-pair grammar)
- Performance criterion: word (digit) accuracy rates

Results: Aurora 3 Up to +62%

Fractal Features
[Block diagram: speech signal; N-d signal embedding (noisy embedding); local SVD / geometrical filtering; N-d cleaned (filtered) embedding]
- Filtered Dynamics - Correlation Dimension (FDCD)
- Multiscale Fractal Dimension (MFD)

Databases: Aurora 2
- Task: speaker-independent recognition of digit sequences
- TI-Digits at 8 kHz
- Training (8440 utterances per scenario, 55M/55F)
  - Clean (8 kHz, G712)
  - Multi-condition (8 kHz, G712)
    - 4 noises (artificial): subway, babble, car, exhibition
    - 5 SNRs: 5, 10, 15, 20 dB, clean
- Testing, artificially added noise
  - 7 SNRs: -5, 0, 5, 10, 15, 20 dB, clean
  - A: noises as in multi-condition training, G712 (28028 utterances)
  - B: restaurant, street, airport, train station, G712 (28028 utterances)
  - C: subway, street (MIRS) (14014 utterances)

Results: Aurora 2 Up to +61%

Feature Fusion
- Merge synchronous feature streams
- Investigate both supervised and unsupervised algorithms

Feature Fusion: multi-stream
- Compute “optimal” exponent weights for each stream s [HMM Gaussian mixture formulation; similar expressions for MM, naïve Bayes, Euclidean/Mahalanobis classifier]
- Optimality in the sense of minimizing “total classification error”

Multi-Stream Classification
- Two-class problem: w1, w2
- Feature vector x is broken up into two independent streams x1 and x2
- Stream weights s1 and s2 are used to “equalize” the “probabilities”

Multi-Stream Classification
- Bayes classification decision
- Non-unity weights increase the Bayes error, but the estimation/modeling error may decrease
  - Stream weights can decrease the total error
- “Optimal” weights minimize the estimation error variance σ_z^2
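The weighted two-stream decision can be illustrated with a toy classifier in which the stream weights act as exponents on the per-stream likelihoods, that is, as multipliers in the log domain. One-dimensional Gaussian streams, equal class priors and all numeric values below are assumptions made for the sketch; they are not taken from the slides.

```python
import math

def log_gauss(x, mean, var):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def classify(x1, x2, models, s1=1.0, s2=1.0):
    """Pick the class maximizing s1*log p(x1|w) + s2*log p(x2|w) (equal priors assumed)."""
    scores = {}
    for label, ((m1, v1), (m2, v2)) in models.items():
        scores[label] = s1 * log_gauss(x1, m1, v1) + s2 * log_gauss(x2, m2, v2)
    return max(scores, key=scores.get)

# Hypothetical class models: (mean, variance) for stream 1 and stream 2.
models = {
    "w1": ((0.0, 1.0), (0.0, 1.0)),
    "w2": ((2.0, 1.0), (2.0, 1.0)),
}

# Stream 1 favors w1 while stream 2 favors w2, so the weights settle the decision.
unweighted = classify(0.4, 1.8, models)
weighted = classify(0.4, 1.8, models, s1=1.5, s2=0.5)
```

With unit weights this is the plain Bayes decision under the stream independence assumption; trusting stream 1 more flips the outcome for this observation.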

Optimal Stream Weights
- Equal error rate in the single-stream classifiers
  - Optimal stream weights are inversely proportional to the total stream estimation error variance

Optimal Stream Weights
- Equal estimation error variance in each stream
  - Optimal weights are approximately inversely proportional to the single-stream classification error
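Both rules say that the weights are inversely proportional to some per-stream quantity: the estimation error variance in one case, the single-stream classification error in the other. The helper below is a hypothetical illustration of that proportionality. Scaling the weights so that they sum to the number of streams is an assumption; the slides only state the proportionality.

```python
def inverse_proportional_weights(values, total=None):
    """Weights proportional to 1/value, scaled to sum to `total` (default: number of streams)."""
    if total is None:
        total = float(len(values))
    inverse = [1.0 / v for v in values]
    scale = total / sum(inverse)
    return [scale * inv for inv in inverse]

# Example: the second stream has twice the estimation error variance of the first.
s1, s2 = inverse_proportional_weights([1.0, 2.0])

# Example: equal variances, single-stream error rates of 5% and 20% (made-up numbers).
e1, e2 = inverse_proportional_weights([0.05, 0.20])
```

In the first example the lower-variance stream receives twice the weight of the other; in the second, the more accurate stream receives four times the weight.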

Experimental Results
- Subset of CUAVE database used:
  - 36 speakers (30 training, 6 testing), 5 sequences of 10 digits per speaker
  - Training set: 1500 digits (30x5x10)
  - Test set: 300 digits (6x5x10)
- Features:
  - Audio: 39 features (MFCC_D_A)
  - Visual: 105 features (ROIDCT_D_A)
- Multi-stream HMM models, middle integration:
  - 8-state, left-to-right, whole-digit HMMs
  - Single Gaussian mixture
  - AV-HMM uses separate audio and video feature streams

Optimal Stream Weights Results
- Assume: σ_V^2 / σ_A^2 = 2, SNR-independent
- Correlation: 0.96

Parametric non-linear equalization
- Parametric histogram equalization
- Smoother estimates
- Bi-modal transformation (speech vs. non-speech)

Voice Activity Detection
- Bi-spectrum based VAD
- Support vector machine based VAD
- Combination of VAD with speech enhancement

Speech Enhancement
- Modified Wiener filtering, with the filter depending on the global SNR
- Modified Ephraim-Malah enhancement, based on the E-M spectral attenuation rule

Non-Native Speech Recognition
- Build non-native models by combining English and native models
- Use phone confusion between English phones and native acoustic models to add alternate model paths
- Extract the confusion matrix automatically by running phone recognition with the native models
- Phone pronunciation depends on the word grapheme: English phone [grapheme] -> French phone

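Example for an English phone
[Figure: rules extracted for one English phone, mapping it to French phones (/t/ and /k/ are legible; the remaining phone symbols were lost in extraction), and the resulting combination of the English model with French models]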

Graphemic constraints
- Example: APPROACH /ah p r ow ch/
  - APPROACH: (A, ah) (PP, p) (R, r) (OA, ow) (CH, ch)
- Alignment between graphemes and phones for each word of the lexicon
- Lexicon modification: add graphemes for each word
- Confusion rules extraction: (grapheme, English phone) → list of non-native phones
  - Example: (A, ah) → a
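The rule extraction step can be sketched as a counting exercise over aligned triples. The function name, the `min_count` pruning threshold and the toy observations are hypothetical; the slides do not say how confusion rules are selected from the confusion matrix.

```python
from collections import Counter, defaultdict

def extract_confusion_rules(aligned, min_count=2):
    """Map (grapheme, English phone) to native phones observed at least `min_count` times.

    `aligned` holds (grapheme, english_phone, native_phone) triples, e.g. obtained by
    aligning the lexicon with the output of a native-language phone recognizer.
    """
    counts = defaultdict(Counter)
    for grapheme, english_phone, native_phone in aligned:
        counts[(grapheme, english_phone)][native_phone] += 1
    rules = {}
    for key, counter in counts.items():
        kept = [phone for phone, c in counter.most_common() if c >= min_count]
        if kept:
            rules[key] = kept
    return rules

# Toy observations (made up): "A" pronounced /ah/ is mostly recognized as native "a".
observations = (
    [("A", "ah", "a")] * 3
    + [("A", "ah", "o")]
    + [("CH", "ch", "sh")] * 2
    + [("R", "r", "r")]
)
rules = extract_confusion_rules(observations)
```

Keying the rules on the grapheme as well as the English phone is what distinguishes the graphemic variant from plain phone confusion, where the key would be the English phone alone.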

Experiments: HIWIRE Database

Approach                  French         Italian        Spanish
                          WER    SER     WER    SER     WER    SER
Command and control grammar
  baseline                6      12.8    10.5   19.6    7.0    14.9
  confusion               4.6    10.2    6.9    14.1    5.1    11.8
  +graphemes confusion    4.9    11.3    8.2    15.9    6.2    13.6
Word loop grammar
  baseline                35.7   47.9    43.5   52.0    39.9   53.5
  confusion               27.3   42.1    31.3   46.2    31.3   44.5
  +graphemes confusion    26.2   41.9    30.5   45.5    31.3   46.5
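To compare the approaches across grammars with very different baseline error rates, it can help to express the gains as relative WER reductions. The snippet below is a small worked calculation on the baseline and confusion WER figures from the HIWIRE database experiments; the helper name is hypothetical.

```python
def relative_reduction(baseline, system):
    """Relative error reduction in percent: 100 * (baseline - system) / baseline."""
    return 100.0 * (baseline - system) / baseline

# (baseline WER, confusion WER) in percent, copied from the HIWIRE database results.
wer = {
    ("command and control", "French"): (6.0, 4.6),
    ("command and control", "Italian"): (10.5, 6.9),
    ("command and control", "Spanish"): (7.0, 5.1),
    ("word loop", "French"): (35.7, 27.3),
    ("word loop", "Italian"): (43.5, 31.3),
    ("word loop", "Spanish"): (39.9, 31.3),
}

reductions = {key: relative_reduction(base, conf) for key, (base, conf) in wer.items()}
for (grammar, language), value in reductions.items():
    print(f"{grammar:20s} {language:8s} {value:5.1f}% relative WER reduction")
```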

Ongoing Work
- Front-end combination and integration of algorithms
- Fixed-platform demonstration
  - Non-native speech demo
- PDA-platform demonstration
- Ongoing research
