Slide 1: Acoustic Modeling of Accented English Speech for Large-Vocabulary Speech Recognition
Konstantin Markov and Satoshi Nakamura
ATR Spoken Language Communication Research Laboratories (ATR-SLC), Kyoto, Japan
SRIV2006, Toulouse, France, May 20, 2006
Slide 2: Outline
- Motivation and previous studies
- HMM-based accent acoustic modeling
- Hybrid HMM/BN acoustic model for accented speech
- Evaluation and results
- Conclusion
Slide 3: Motivation and Previous Studies
Accent variability:
- Causes performance degradation due to the mismatch between training and testing conditions.
- Becomes a major factor for public ASR applications.
Differences due to accent variability are mainly:
- Phonetic: lexicon modification (Liu, ICASSP'98); accent-dependent dictionary (Humphries, ICASSP'98).
- Acoustic (addressed in this work): pooled-data HMM (Chengalvarayan, Eurospeech'01); accent identification (Huang, ICASSP'05).
Slide 4: HMM-based approaches (1) — Multi-accent HMM (MA-HMM)
Accent-dependent data from accents A, B, and C are pooled and used to train a single multi-accent acoustic model (MA-HMM).
Recognition pipeline: input speech → feature extraction → decoder (with the multi-accent AM) → recognition result.
Slide 5: HMM-based approaches (2) — Parallel accent-dependent HMMs (PA-HMM)
Separate accent-dependent HMMs (A-HMM, B-HMM, C-HMM) are trained on the data of each accent and combined into a parallel acoustic model (PA-HMM).
Recognition pipeline: input speech → feature extraction → decoder (with the parallel AM) → recognition result.
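One way such a parallel AM can be realized at decoding time is to run every accent-dependent model on the utterance and keep the best-scoring hypothesis. The sketch below is only an illustration of that idea, not the original ATR implementation; decode_with_model is a hypothetical placeholder for a real decoder call.

```python
# Illustrative sketch of parallel accent-dependent decoding:
# each accent-specific acoustic model decodes the utterance and the
# hypothesis with the highest score is kept.
# decode_with_model() is a hypothetical placeholder, not part of the
# original ATR system.

def parallel_decode(features, accent_models):
    best_hyp, best_score = None, float("-inf")
    for accent, model in accent_models.items():
        # score: e.g. log-likelihood of the best decoding path
        hyp, score = decode_with_model(model, features)
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp
```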
Slide 6: HMM-based approaches (3) — Parallel gender-dependent HMMs (GD-HMM)
Gender-dependent HMMs (F-HMM, M-HMM) are trained on the accent-dependent data of all accents, split by gender, and combined into a parallel acoustic model (GD-HMM).
Recognition pipeline: input speech → feature extraction → decoder (with the parallel AM) → recognition result.
Slide 7: Hybrid HMM/BN Background
HMM/BN structure:
- HMM at the top level: models the temporal characteristics of speech by state transitions.
- BN (Bayesian network) at the bottom level: represents the state PDFs.
Simple BN example (figure): a three-state left-to-right HMM (q1, q2, q3) with a Bayesian network attached to each state; the BN variables are Q (HMM state), M (mixture component index) and X (observation).
The slide gives the state PDF and the state output probability for the case where M is hidden (reconstructed below).
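The equations on this slide did not survive the text export; the following is a reconstruction from the variable definitions above (Q = HMM state, M = mixture component index, X = observation) and should be read as an inferred form, not a verbatim copy of the slide. With M hidden, marginalizing the BN over the mixture component gives the usual mixture form of the state output probability.

```latex
% State output probability of the simple HMM/BN when M is hidden
% (reconstructed from context, not copied from the slide):
P(X \mid Q = q_j) \;=\; \sum_{m} P(X \mid M = m,\, Q = q_j)\, P(M = m \mid Q = q_j)
```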
Slide 8: HMM/BN-based Accent Model
Accent and gender are modeled as additional variables of the BN.
BN topology (figure): gender G = {F, M} and accent A = {A, B, C} are added alongside the state Q, the mixture component index M and the observation X.
When G and A are hidden, the state output probability marginalizes over them (reconstructed below).
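The formula itself was again lost in the export. By definition, hiding G and A means the state output probability marginalizes over them together with the mixture component; the factorization of the joint term inside the sum follows the BN topology shown on the slide. This is an inferred reconstruction, not the original equation.

```latex
% State output probability when the accent A and gender G variables are hidden
% (inferred form; the exact factorization of the joint term follows the BN topology):
P(X \mid Q) \;=\; \sum_{a \in A} \sum_{g \in G} \sum_{m} P(X,\, M = m,\, A = a,\, G = g \mid Q)
```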
Slide 9: HMM/BN Training
Initial conditions:
- Bootstrap HMM: provides the (tied) state structure.
- Labelled data: each feature vector has an accent and a gender label.
Training algorithm (sketched below):
- Step 1: Viterbi alignment of the training data using the bootstrap HMM to obtain state labels.
- Step 2: Initialization of the BN parameters.
- Step 3: Forward-backward based embedded HMM/BN training.
- Step 4: If the convergence criterion is met, stop; otherwise go to Step 3.
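A structural sketch of the four training steps above, to make the control flow explicit. viterbi_align, init_bn_parameters and forward_backward_update are hypothetical placeholders standing in for the corresponding tools, and the likelihood-improvement threshold is an assumed convergence criterion (the slide does not specify one).

```python
# Structural sketch of the HMM/BN training recipe (Steps 1-4 above).
# viterbi_align(), init_bn_parameters() and forward_backward_update()
# are hypothetical placeholders; the likelihood-improvement threshold
# is an assumed convergence criterion.

def train_hmm_bn(bootstrap_hmm, labelled_data, max_iters=20, tol=1e-4):
    # Step 1: Viterbi alignment with the bootstrap HMM gives a state label per frame.
    aligned_data = viterbi_align(bootstrap_hmm, labelled_data)

    # Step 2: initialize the BN parameters from the state, accent and gender labels.
    bn = init_bn_parameters(aligned_data)

    prev_loglik = float("-inf")
    for _ in range(max_iters):
        # Step 3: forward-backward based embedded HMM/BN re-estimation.
        bn, loglik = forward_backward_update(bootstrap_hmm, bn, labelled_data)
        # Step 4: stop once the likelihood stops improving, else repeat Step 3.
        if loglik - prev_loglik < tol:
            break
        prev_loglik = loglik
    return bn
```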
Slide 10: HMM/BN Approach
A single HMM/BN acoustic model is trained on accent- and gender-dependent data: A(M), B(M), C(M), A(F), B(F), C(F).
Recognition pipeline: input speech → feature extraction → decoder (with the HMM/BN AM) → recognition result.
Slide 11: Comparison of State Distributions
(Figure comparing the state output distributions of the MA-HMM, PA-HMM, GD-HMM and HMM/BN models.)
Slide 12: Database and Speech Pre-processing
Database:
- Accents: American (US), British (BRT), Australian (AUS).
- Speakers: 100 per accent (90 for training + 10 for evaluation); 300 utterances per speaker.
- Speech material is the same for each accent: travel-arrangement dialogs.
Speech feature extraction (sketched below):
- 20 ms frames at a 10 ms rate.
- 25-dimensional feature vectors (12 MFCC + 12 ΔMFCC + ΔE).
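As a concrete illustration of the front end described above, the sketch below computes 25-dimensional vectors (12 MFCC + 12 ΔMFCC + ΔE) with 20 ms frames at a 10 ms rate. librosa and the 16 kHz sampling rate are assumptions made for this example; the original ATR front end is not specified on the slide.

```python
# Sketch of the 25-dim feature extraction (12 MFCC + 12 dMFCC + dE),
# 20 ms frames at a 10 ms rate. librosa and 16 kHz sampling are
# assumptions made for this example only.
import numpy as np
import librosa

def extract_features(wav_path, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)
    n_fft = int(0.020 * sr)            # 20 ms analysis window
    hop = int(0.010 * sr)              # 10 ms frame rate
    # 13 coefficients: c0 (energy-like term) plus c1..c12
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    deltas = librosa.feature.delta(mfcc)
    feats = np.vstack([mfcc[1:13],     # 12 static MFCCs
                       deltas[1:13],   # 12 delta MFCCs
                       deltas[0:1]])   # delta of the energy-like term
    return feats.T                     # shape: (n_frames, 25)
```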
Slide 13: Models
Acoustic models:
- All HMM-based AMs: three-state, left-to-right, triphone contexts; 3,275 states (MDL-SSS); variants with 6, 18, 30 and 42 total Gaussians per state.
- HMM/BN model: same state structure and same number of Gaussian components as the HMM models.
Language model:
- Bigram and trigram (600,000 training sentences); 35,000-word vocabulary.
- Test data perplexity: 116.5 and 27.8.
Pronunciation lexicon: American English.
Slide 14: Evaluation Results
Slide 15: Evaluation Results
Word accuracies (%); all models have a total of 42 Gaussians per state. Columns are test data accents.

  Model type   US     BRT    AUS    Average
  US-HMM       91.5   51.1   68.6   70.4
  BRT-HMM      77.9   84.7   83.8   82.1
  AUS-HMM      81.6   73.5   90.9   82.0
  MA-HMM       90.9   82.1   89.5   87.5
  PA-HMM       89.6   81.7   86.4   85.9
  GD-HMM       90.9   82.5   89.3   87.6
  HMM/BN       91.4   83.1   90.3   88.2
Slide 16: Conclusions
- In the matched-accent case, accent-dependent models are the best choice.
- The HMM/BN model is the best, almost matching the results of the accent-dependent models, but it requires more mixture components.
- The multi-accent HMM is the most efficient in terms of performance versus complexity.
- The different performance levels of the accent-dependent models are apparently caused by phonetic differences between the accents.