Slide 1: Acoustic Modeling of Accented English Speech for Large-Vocabulary Speech Recognition
Konstantin Markov and Satoshi Nakamura
ATR Spoken Language Communication Research Laboratories (ATR-SLC), Kyoto, Japan
SRIV2006, Toulouse, France, May 20, 2006
Slide 2: Outline
- Motivation and previous studies
- HMM-based accent acoustic modeling
- Hybrid HMM/BN acoustic model for accented speech
- Evaluation and results
- Conclusion
Slide 3: Motivation and Previous Studies
Accent variability:
- Causes performance degradation due to the mismatch between training and testing conditions.
- Becomes a major factor for public ASR applications.
Differences due to accent variability are mainly:
- Phonetic: lexicon modification (Liu, ICASSP'98); accent-dependent dictionary (Humphries, ICASSP'98).
- Acoustic (addressed in this work): pooled-data HMM (Chengalvarayan, Eurospeech'01); accent identification (Huang, ICASSP'05).
Slide 4: HMM-based approaches (1) — Multi-accent HMM (MA-HMM)
Accent-dependent data from accents A, B, and C are pooled and used to train a single multi-accent acoustic model (MA-HMM).
Recognition pipeline: input speech → feature extraction → decoder (with the multi-accent AM) → recognition result.
Slide 5: HMM-based approaches (2) — Parallel accent-dependent HMMs (PA-HMM)
Separate accent-dependent HMMs (A-HMM, B-HMM, C-HMM) are trained on the data of each accent and combined into a parallel acoustic model (PA-HMM).
Recognition pipeline: input speech → feature extraction → decoder (with the parallel AM) → recognition result.
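One way such a parallel AM can be realized at decoding time is to run every accent-dependent model on the utterance and keep the best-scoring hypothesis. The sketch below is only an illustration of that idea, not the original ATR implementation; decode_with_model is a hypothetical placeholder for a real decoder call.

```python
# Illustrative sketch of parallel accent-dependent decoding:
# each accent-specific acoustic model decodes the utterance and the
# hypothesis with the highest score is kept.
# decode_with_model() is a hypothetical placeholder, not part of the
# original ATR system.

def parallel_decode(features, accent_models):
    best_hyp, best_score = None, float("-inf")
    for accent, model in accent_models.items():
        # score: e.g. log-likelihood of the best decoding path
        hyp, score = decode_with_model(model, features)
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp
```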
Slide 6: HMM-based approaches (3) — Parallel gender-dependent HMMs (GD-HMM)
Gender-dependent HMMs (F-HMM, M-HMM) are trained on the accent-dependent data of all accents, split by gender, and combined into a parallel acoustic model (GD-HMM).
Recognition pipeline: input speech → feature extraction → decoder (with the parallel AM) → recognition result.
Slide 7: Hybrid HMM/BN Background
HMM/BN structure:
- HMM at the top level: models the temporal characteristics of speech by state transitions.
- BN (Bayesian network) at the bottom level: represents the state PDFs.
Simple BN example (figure): a three-state left-to-right HMM (q1, q2, q3) with a Bayesian network attached to each state; the BN variables are Q (HMM state), M (mixture component index) and X (observation).
The slide gives the state PDF and the state output probability for the case where M is hidden (reconstructed below).
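The equations on this slide did not survive the text export; the following is a reconstruction from the variable definitions above (Q = HMM state, M = mixture component index, X = observation) and should be read as an inferred form, not a verbatim copy of the slide. With M hidden, marginalizing the BN over the mixture component gives the usual mixture form of the state output probability.

```latex
% State output probability of the simple HMM/BN when M is hidden
% (reconstructed from context, not copied from the slide):
P(X \mid Q = q_j) \;=\; \sum_{m} P(X \mid M = m,\, Q = q_j)\, P(M = m \mid Q = q_j)
```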
Slide 8: HMM/BN-based Accent Model
Accent and gender are modeled as additional variables of the BN.
BN topology (figure): gender G = {F, M} and accent A = {A, B, C} are added alongside the state Q, the mixture component index M and the observation X.
When G and A are hidden, the state output probability marginalizes over them (reconstructed below).
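The formula itself was again lost in the export. By definition, hiding G and A means the state output probability marginalizes over them together with the mixture component; the factorization of the joint term inside the sum follows the BN topology shown on the slide. This is an inferred reconstruction, not the original equation.

```latex
% State output probability when the accent A and gender G variables are hidden
% (inferred form; the exact factorization of the joint term follows the BN topology):
P(X \mid Q) \;=\; \sum_{a \in A} \sum_{g \in G} \sum_{m} P(X,\, M = m,\, A = a,\, G = g \mid Q)
```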
Slide 9: HMM/BN Training
Initial conditions:
- Bootstrap HMM: provides the (tied) state structure.
- Labelled data: each feature vector has an accent and a gender label.
Training algorithm (sketched below):
- Step 1: Viterbi alignment of the training data using the bootstrap HMM to obtain state labels.
- Step 2: Initialization of the BN parameters.
- Step 3: Forward-backward based embedded HMM/BN training.
- Step 4: If the convergence criterion is met, stop; otherwise go to Step 3.
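A structural sketch of the four training steps above, to make the control flow explicit. viterbi_align, init_bn_parameters and forward_backward_update are hypothetical placeholders standing in for the corresponding tools, and the likelihood-improvement threshold is an assumed convergence criterion (the slide does not specify one).

```python
# Structural sketch of the HMM/BN training recipe (Steps 1-4 above).
# viterbi_align(), init_bn_parameters() and forward_backward_update()
# are hypothetical placeholders; the likelihood-improvement threshold
# is an assumed convergence criterion.

def train_hmm_bn(bootstrap_hmm, labelled_data, max_iters=20, tol=1e-4):
    # Step 1: Viterbi alignment with the bootstrap HMM gives a state label per frame.
    aligned_data = viterbi_align(bootstrap_hmm, labelled_data)

    # Step 2: initialize the BN parameters from the state, accent and gender labels.
    bn = init_bn_parameters(aligned_data)

    prev_loglik = float("-inf")
    for _ in range(max_iters):
        # Step 3: forward-backward based embedded HMM/BN re-estimation.
        bn, loglik = forward_backward_update(bootstrap_hmm, bn, labelled_data)
        # Step 4: stop once the likelihood stops improving, else repeat Step 3.
        if loglik - prev_loglik < tol:
            break
        prev_loglik = loglik
    return bn
```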
Slide 10: HMM/BN Approach
A single HMM/BN acoustic model is trained on accent- and gender-dependent data: A(M), B(M), C(M), A(F), B(F), C(F).
Recognition pipeline: input speech → feature extraction → decoder (with the HMM/BN AM) → recognition result.
Slide 11: Comparison of State Distributions
(Figure comparing the state output distributions of the MA-HMM, PA-HMM, GD-HMM and HMM/BN models.)
Slide 12: Database and Speech Pre-processing
Database:
- Accents: American (US), British (BRT), Australian (AUS).
- Speakers: 100 per accent (90 for training + 10 for evaluation); 300 utterances per speaker.
- Speech material is the same for each accent: travel-arrangement dialogs.
Speech feature extraction (sketched below):
- 20 ms frames at a 10 ms rate.
- 25-dimensional feature vectors (12 MFCC + 12 ΔMFCC + ΔE).
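As a concrete illustration of the front end described above, the sketch below computes 25-dimensional vectors (12 MFCC + 12 ΔMFCC + ΔE) with 20 ms frames at a 10 ms rate. librosa and the 16 kHz sampling rate are assumptions made for this example; the original ATR front end is not specified on the slide.

```python
# Sketch of the 25-dim feature extraction (12 MFCC + 12 dMFCC + dE),
# 20 ms frames at a 10 ms rate. librosa and 16 kHz sampling are
# assumptions made for this example only.
import numpy as np
import librosa

def extract_features(wav_path, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)
    n_fft = int(0.020 * sr)            # 20 ms analysis window
    hop = int(0.010 * sr)              # 10 ms frame rate
    # 13 coefficients: c0 (energy-like term) plus c1..c12
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    deltas = librosa.feature.delta(mfcc)
    feats = np.vstack([mfcc[1:13],     # 12 static MFCCs
                       deltas[1:13],   # 12 delta MFCCs
                       deltas[0:1]])   # delta of the energy-like term
    return feats.T                     # shape: (n_frames, 25)
```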
Slide 13: Models
Acoustic models:
- All HMM-based AMs: three-state, left-to-right, triphone contexts; 3,275 states (MDL-SSS); variants with 6, 18, 30 and 42 total Gaussians per state.
- HMM/BN model: same state structure and same number of Gaussian components as the HMM models.
Language model:
- Bigram and trigram (600,000 training sentences); 35,000-word vocabulary.
- Test data perplexity: 116.5 and 27.8.
Pronunciation lexicon: American English.
Slide 14: Evaluation Results
Slide 15: Evaluation Results
Word accuracies (%); all models have a total of 42 Gaussians per state. Columns are test data accents.

  Model type   US     BRT    AUS    Average
  US-HMM       91.5   51.1   68.6   70.4
  BRT-HMM      77.9   84.7   83.8   82.1
  AUS-HMM      81.6   73.5   90.9   82.0
  MA-HMM       90.9   82.1   89.5   87.5
  PA-HMM       89.6   81.7   86.4   85.9
  GD-HMM       90.9   82.5   89.3   87.6
  HMM/BN       91.4   83.1   90.3   88.2
Slide 16: Conclusions
- In the matched-accent case, accent-dependent models are the best choice.
- The HMM/BN model is the best, almost matching the results of the accent-dependent models, but it requires more mixture components.
- The multi-accent HMM is the most efficient in terms of performance versus complexity.
- The different performance levels of the accent-dependent models are apparently caused by phonetic differences between the accents.