Published by Jeffery Cobb. Modified over 9 years ago.
1
Towards speaker and environmental robustness in ASR: the HIWIRE project
A. Potamianos (1), G. Bouselmi (2), D. Dimitriadis (3), D. Fohr (2), R. Gemello (4), I. Illina (2), F. Mana (4), P. Maragos (3), M. Matassoni (5), V. Pitsikalis (3), J. Ramírez (6), E. Sanchez-Soto (1), J. Segura (6), and P. Svaizer (5)
(1) Dept. of E.C.E., Tech. Univ. of Crete, Chania, Greece
(2) Speech Group, LORIA, Nancy, France
(3) School of E.C.E., Natl. Tech. Univ. of Athens, Athens, Greece
(4) Loquendo, via Valdellatorre, 4-10149, Torino, Italy
(5) ITC-irst, via Sommarive 18 - Povo (TN), Italy
(6) Dept. of Signal Theory, Univ. of Granada, Spain
2
Outline
Introduction: the HIWIRE project
Goals and objectives
Research areas:
  Environmental robustness
  Speaker robustness
Experimental results
Ongoing work
3
HIWIRE project
http://www.hiwire.org
Goals: environment- and speaker-robust ASR
Showcase: fixed cockpit platform, PDA platform
Industrial partners: Thales Avionics, Loquendo
Research partners: LORIA, TUC, NTUA, UGR, ITC-IRST, Thales research
FP6 project: 6/2004 to 5/2007
4
Research areas
Environmental robustness
  Multi-microphone ASR
  Robust feature extraction
  Feature fusion and audio-visual ASR
  Feature equalization
  Voice-activity detection
  Speech enhancement
Speaker robustness
  Model transformation
  Acoustic modeling for non-native speech
5
Multi-microphone ASR: Outline
Beamforming and Adaptive Noise Cancellation
Environmental Acoustics Estimation
6
Beamforming: Delay-and-Sum (D&S)
Multi-channel signals make it possible to capture the desired source selectively.
Issue: estimation of reliable time differences of arrival (TDOAs)
Method: CSP (cross-power spectrum phase) analysis over multiple frames
Advantages: robustness, reduced computational cost
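The D&S idea above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the project's implementation: the function names `csp_delay` and `delay_and_sum` are hypothetical, the delay is estimated once over the whole signal rather than over multiple frames, delays are integer samples, and the alignment uses a circular shift.

```python
import numpy as np

def csp_delay(ref, sig, max_lag):
    """Estimate the delay (in samples) of `sig` relative to `ref` with a
    CSP (GCC-PHAT) analysis: whiten the cross-spectrum, then pick the peak."""
    n = len(ref) + len(sig)                      # zero-pad to avoid wrap-around
    cross = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    cross /= np.abs(cross) + 1e-12               # keep only the phase
    cc = np.fft.irfft(cross, n)
    # Reorder so that index 0 corresponds to lag -max_lag.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(cc)) - max_lag

def delay_and_sum(channels, max_lag=32):
    """Align every channel to the first one and average them.
    Assumption: integer delays and a circular shift are good enough here."""
    ref = np.asarray(channels[0], dtype=float)
    out = np.zeros_like(ref)
    for ch in channels:
        ch = np.asarray(ch, dtype=float)
        out += np.roll(ch, -csp_delay(ref, ch, max_lag))
    return out / len(channels)
```

Averaging the aligned channels reinforces the source that the delays were estimated for, while uncorrelated noise across microphones is attenuated.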
7
D&S with MarkIII
Test set: set N1_SNR0 of MC-TIDIGITS (cockpit noise), MarkIII channels
Models: clean models, trained on original TIDIGITS
Results (WERR [%]):
  C_1      38.5
  C_32     50.8
  DS_C8    79.9
  DS_C16   83.0
  DS_C32   85.3
  DS_C64   85.4
8
Robust Features for ASR
Modulation Features
  AM-FM Modulations
  Teager Energy Cepstrum
Fractal Features
  Dynamical Denoising
  Correlation Dimension
  Multiscale Fractal Dimension
Hybrid-Merged Features
Reported improvements: up to +62% (Aurora 3), up to +36% (Aurora 2), up to +61% (Aurora 2)
9
Speech Modulation Features
Filterbank Design
Short-Term AM-FM Modulation Features
  Short-Term Mean Instantaneous Amplitude (IA-Mean)
  Short-Term Mean Instantaneous Frequency (IF-Mean)
  Frequency Modulation Percentages (FMP)
Short-Term Energy Modulation Features
  Average Teager Energy, Cepstrum Coefficients (TECC)
10
Modulation Acoustic Features
Block diagram: Speech; Nonlinear Processing; Demodulation; Robust Feature Transformation/Selection; Regularization + Multiband Filtering; Statistical Processing; V.A.D.
Energy features: Teager Energy Cepstrum Coefficients (TECC)
AM-FM modulation features: Mean Inst. Amplitude (IA-Mean), Mean Inst. Frequency (IF-Mean), Frequency Modulation Percentages (FMP)
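The energy features named above build on the discrete Teager-Kaiser energy operator, psi[n] = x[n]^2 - x[n-1] x[n+1]. The sketch below shows the operator and a TECC-style computation for a single frame. It is a rough illustration only: `tecc` is a hypothetical helper, the band-pass filterbank is assumed to have been applied already, and the exact filterbank, regularization and frame settings of the slides are not reproduced.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def tecc(band_signals, n_ceps):
    """Hypothetical TECC-style features for one frame: log of the mean
    Teager energy in each band, followed by an (unnormalized) DCT-II.
    `band_signals` holds the frame after each band-pass filter."""
    mean_te = [np.mean(teager_energy(b)) for b in band_signals]
    log_e = np.log(np.maximum(mean_te, 1e-12))   # floor to keep the log finite
    m = len(log_e)
    k = np.arange(n_ceps)[:, None]
    basis = np.cos(np.pi * k * (np.arange(m)[None, :] + 0.5) / m)
    return basis @ log_e
```

For a pure tone A cos(Omega n + phi) the operator returns the constant A^2 sin^2(Omega), which is why it tracks both amplitude and frequency of a narrowband component.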
11
TIMIT-based Speech Databases
TIMIT Database:
  Training set: 3696 sentences, ~35 phonemes per utterance
  Testing set: 1344 utterances, 46680 phonemes
  Sampling frequency: 16 kHz
Feature vectors: MFCC + C0 + AM-FM + 1st and 2nd time derivatives
Stream weights: (1) for MFCC and (2) for AM-FM
3-state left-right HMMs, 16 mixtures
All-pair, unweighted grammar
Performance criterion: phone accuracy rate (%)
Back-end system: HTK v3.2.0
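For reference, the accuracy figure reported by HTK's scoring tool counts insertions as errors, unlike the plain percent-correct figure. A minimal sketch of both measures follows; the function names are illustrative, and the error counts are assumed to come from an alignment of the recognized and reference phone strings.

```python
def percent_correct(n_ref, deletions, substitutions):
    """Percentage of reference labels recognized correctly."""
    return 100.0 * (n_ref - deletions - substitutions) / n_ref

def accuracy(n_ref, deletions, substitutions, insertions):
    """Accuracy in the HTK sense: insertions are penalized as well."""
    return 100.0 * (n_ref - deletions - substitutions - insertions) / n_ref
```

With 100 reference phones, 5 deletions, 10 substitutions and 3 insertions, percent correct is 85.0 and accuracy is 82.0.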
12
Results: TIMIT+Noise Up to +106%
13
Aurora 3 - Spanish
Connected digits, sampling frequency 8 kHz
Training set:
  WM (Well-Matched): 3392 utterances (quiet 532, low 1668 and high noise 1192)
  MM (Medium-Mismatch): 1607 utterances (quiet 396 and low noise 1211)
  HM (High-Mismatch): 1696 utterances (quiet 266, low 834 and high noise 596)
Testing set:
  WM: 1522 utterances (quiet 260, low 754 and high noise 508), 8056 digits
  MM: 850 utterances (quiet 0, low 0 and high noise 850), 4543 digits
  HM: 631 utterances (quiet 0, low 377 and high noise 254), 3325 digits
2 back-end ASR systems (HTK and BLasr)
Feature vectors: MFCC+AM-FM (or Auditory+AM-FM), TECC
All-pair, unweighted grammar (or word-pair grammar)
Performance criterion: word (digit) accuracy rate
14
Results: Aurora 3 Up to +62%
15
Fractal Features
Block diagram labels: speech signal, N-d signal, noisy embedding, local SVD, geometrical filtering, filtered embedding, N-d cleaned embedding
Features: Filtered Dynamics - Correlation Dimension (FDCD), Multiscale Fractal Dimension (MFD)
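To make the embedding and correlation-dimension terms concrete, here is a small sketch of a time-delay embedding and of the Grassberger-Procaccia correlation sum, from which a correlation dimension is usually read off as the slope of log C(r) against log r. It leaves out the local SVD filtering step named on the slide, and the function names and parameters are illustrative assumptions.

```python
import numpy as np

def delay_embed(x, dim, lag):
    """Embed a 1-D signal in `dim` dimensions using delays of `lag` samples."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * lag
    return np.stack([x[i * lag:i * lag + n] for i in range(dim)], axis=1)

def correlation_sum(points, radius):
    """Fraction of distinct point pairs closer than `radius`."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    upper = np.triu_indices(len(points), k=1)    # each pair counted once
    return float(np.mean(dist[upper] < radius))
```

The pairwise distance matrix grows quadratically with the number of points, so this form is only suitable for short frames.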
16
Databases: Aurora 2
Task: speaker-independent recognition of digit sequences
TI-Digits at 8 kHz
Training (8440 utterances per scenario, 55M/55F):
  Clean (8 kHz, G712)
  Multi-condition (8 kHz, G712)
    4 noises (artificial): subway, babble, car, exhibition
    5 SNRs: 5, 10, 15, 20 dB, clean
Testing, artificially added noise:
  7 SNRs: -5, 0, 5, 10, 15, 20 dB, clean
  A: noises as in multi-condition training, G712 (28028 utterances)
  B: restaurant, street, airport, train station, G712 (28028 utterances)
  C: subway, street (MIRS) (14014 utterances)
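Adding noise artificially at a target SNR, as in the test sets above, amounts to scaling the noise before mixing. The sketch below uses plain average power over the whole utterance, which is a simplifying assumption: it does not reproduce the level measurement or the G712/MIRS filtering used to build the actual database, and `add_noise_at_snr` is a hypothetical name.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix `noise` into `speech` so that the average-power SNR is `snr_db`.
    Assumption: `noise` is at least as long as `speech`."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```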
17
Results: Aurora 2 Up to +61%
18
Feature Fusion
Merge synchronous feature streams
Investigate both supervised and unsupervised algorithms
19
Feature Fusion: multi-stream
Compute “optimal” exponent weights for each stream s
[HMM Gaussian mixture formulation; similar expressions for MM, naïve Bayes, Euclidean/Mahalanobis classifier]
Optimality in the sense of minimizing “total classification error”
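The slide's own formula is not part of this transcript. For orientation, the standard multi-stream HMM observation probability with Gaussian mixtures and exponent weights has the form below; the symbols (state j, stream s, mixture component m, exponent weight w_s) are the usual textbook notation and are an assumption, not necessarily the notation used on the slide.

```latex
b_j(\mathbf{o}_t) \;=\; \prod_{s=1}^{S}
  \left[ \sum_{m=1}^{M_s} c_{jsm}\,
  \mathcal{N}\!\left(\mathbf{o}_{st};\, \boldsymbol{\mu}_{jsm},\, \boldsymbol{\Sigma}_{jsm}\right)
  \right]^{w_s}
```

In the log domain the exponents become multiplicative weights on the per-stream log-likelihoods, which is how they are typically applied in practice.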
20
Multi-Stream Classification
Two-class problem: w_1, w_2
Feature vector x is broken up into two independent streams x_1 and x_2
Stream weights s_1 and s_2 are used to “equalize” the “probabilities”
21
Multi-Stream Classification
Bayes classification decision
Non-unity weights increase the Bayes error, but the estimation/modeling error may decrease
Stream weights can therefore decrease the total error
“Optimal” weights minimize the estimation error variance σ_z^2
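A toy version of the weighted two-class decision can look like the following. It assumes one scalar feature per stream and Gaussian class-conditional densities, both chosen only to keep the example short; `classify` and its arguments are hypothetical names.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def classify(x1, x2, models, s1=1.0, s2=1.0, priors=(0.5, 0.5)):
    """Pick class 1 or 2 from two independent streams x1 and x2.
    `models[k]` holds ((mean, var) for stream 1, (mean, var) for stream 2)
    of class k+1. The stream weights act as exponents on the likelihoods,
    i.e. as multipliers on the log-likelihoods."""
    scores = []
    for prior, (m1, m2) in zip(priors, models):
        score = math.log(prior)
        score += s1 * log_gauss(x1, *m1) + s2 * log_gauss(x2, *m2)
        scores.append(score)
    return 1 if scores[0] >= scores[1] else 2
```

With `s1 = s2 = 1` this is the ordinary Bayes decision for the given models. Lowering the weight of a stream reduces its influence, which helps when that stream's model is poorly estimated.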
22
Optimal Stream Weights
Equal error rate in the single-stream classifiers: the optimal stream weights are inversely proportional to the total stream estimation error variance
23
Optimal Stream Weights
Equal estimation error variance in each stream: the optimal weights are approximately inversely proportional to the single-stream classification error
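Both rules of thumb set the weights inversely proportional to a per-stream quantity: the estimation error variance in one case, the single-stream classification error in the other. A small sketch follows; normalizing the weights to sum to one is an added convention, since the slides only state proportionality.

```python
def inverse_proportional_weights(values):
    """Weights proportional to 1/value, normalized to sum to one.
    `values` are per-stream error variances or per-stream error rates."""
    inverse = [1.0 / v for v in values]
    total = sum(inverse)
    return [w / total for w in inverse]
```

For example, if the video stream's estimation error variance is twice that of the audio stream, `inverse_proportional_weights([1.0, 2.0])` gives the audio stream two thirds of the total weight and the video stream one third.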
24
Experimental Results
Subset of CUAVE database used:
  36 speakers (30 training, 6 testing), 5 sequences of 10 digits per speaker
  Training set: 1500 digits (30x5x10)
  Test set: 300 digits (6x5x10)
Features:
  Audio: 39 features (MFCC_D_A)
  Visual: 105 features (ROIDCT_D_A)
Multi-stream HMM models, middle integration:
  8-state, left-to-right, whole-digit HMMs
  Single Gaussian mixture
  AV-HMM uses separate audio and video feature streams
25
Optimal Stream Weights Results
Assume: σ_V^2 / σ_A^2 = 2
SNR-independent correlation: 0.96
26
Parametric non-linear equalization
Parametric histogram equalization
Smoother estimates
Bi-modal transformation (speech vs. non-speech)
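As background, the sketch below shows ordinary non-parametric histogram equalization of one feature dimension to a standard Gaussian reference. It is not the parametric, bi-modal variant named above, which replaces the empirical CDF with a smoother parametric estimate and treats speech and non-speech separately. The function name is illustrative.

```python
from statistics import NormalDist

def equalize_to_gaussian(values):
    """Map each value to the standard-normal quantile of its rank, so that
    the output follows the reference distribution. Ties are broken by
    position in the input."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    reference = NormalDist()                      # zero mean, unit variance
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = reference.inv_cdf((rank + 0.5) / n)
    return out
```

The mapping is monotonic, so the ordering of the feature values within the utterance is preserved while their distribution is normalized.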
27
Voice Activity Detection
Bi-spectrum based VAD
Support vector machine based VAD
Combination of VAD with speech enhancement
28
Speech Enhancement
Modified Wiener filtering, with a filter that depends on the global SNR
Modified Ephraim-Malah enhancement, based on the E-M spectral attenuation rule
29
Non-Native Speech Recognition
Build non-native models by combining English and native models
Use phone confusion between English phones and native acoustic models to add alternate model paths
Extract the confusion matrix automatically by running phone recognition with the native models
Phone pronunciation depends on the word grapheme: English phone [grapheme] -> French phone
30
Example for an English phone
Figure: English model, extracted rules (English → French), French models; phone labels shown include /t/ and /k/
31
Graphemic constraints
Example: APPROACH /ah p r ow ch/
  APPROACH -> (A, ah) (PP, p) (R, r) (OA, ow) (CH, ch)
Alignment between graphemes and phones for each word of the lexicon
Lexicon modification: add graphemes for each word
Confusion rule extraction: (grapheme, English phone) → list of non-native phones
  Example: (A, ah) → a
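The rule extraction step can be pictured as counting, for every (grapheme, English phone) pair, which native phones the native-model recognizer produced. The sketch below assumes the three-way alignment is already available as triples; the `min_prob` pruning threshold and the function name are hypothetical and not taken from the slides.

```python
from collections import Counter, defaultdict

def extract_rules(aligned, min_prob=0.2):
    """Build (grapheme, English phone) -> list of non-native phones.
    `aligned` is an iterable of (grapheme, english_phone, native_phone)
    triples. Native phones seen in less than `min_prob` of the cases for
    a pair are dropped (hypothetical pruning rule)."""
    counts = defaultdict(Counter)
    for grapheme, english_phone, native_phone in aligned:
        counts[(grapheme, english_phone)][native_phone] += 1
    rules = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        rules[key] = [p for p, c in counter.most_common() if c / total >= min_prob]
    return rules
```

Dropping the grapheme from the key gives the plain phone-confusion variant, where each English phone maps to a list of native phones regardless of spelling.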
32
Experiments: HIWIRE Database

Approach                     French        Italian       Spanish
                             WER   SER     WER   SER     WER   SER
Command and control grammar
  baseline                   6     12.8    10.5  19.6    7.0   14.9
  confusion                  4.6   10.2    6.9   14.1    5.1   11.8
  confusion + graphemes      4.9   11.3    8.2   15.9    6.2   13.6
Word loop grammar
  baseline                   35.7  47.9    43.5  52.0    39.9  53.5
  confusion                  27.3  42.1    31.3  46.2    31.3  44.5
  confusion + graphemes      26.2  41.9    30.5  45.5    31.3  46.5
33
Ongoing Work
Front-end combination and integration of algorithms
Fixed-platform demonstration
Non-native speech demo
PDA-platform demonstration
Ongoing research