Published by Shana Austin. Modified over 8 years ago.
1
Improving Speech Modelling
Viktoria Maier
Supervised by Prof. Hynek Hermansky
2
Outline
1. State-of-the-art
2. Modelling phoneme duration
3. Suggestions of human perception results for speech modelling
4. Conclusion
I of IV
3
1. Current HMM state-of-the-art Technology
4
State-of-the-art: Overview
Speech Modelling
5
1. Feature Extraction
- For speech recognition: extract features that let us discriminate between different classes (phonemes)
- The more discriminative the features, the easier classification becomes
- Usually extract the frequency content of each frame (MFCC)
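As a rough illustration of the frequency warping behind MFCC features, the sketch below converts between hertz and the mel scale and lists mel-spaced filter centre frequencies. It uses one common variant of the mel formula. The function names, the filter count and the 0 to 8000 Hz range are illustrative assumptions, not values from the slides, and a full MFCC front end (windowing, filterbank energies, log, DCT) is not shown.

```python
import math

def hz_to_mel(f_hz):
    """One common mel-scale variant: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_low_hz, f_high_hz):
    """Centre frequencies (Hz) of n_filters filters equally spaced in mel."""
    lo, hi = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]

# Assumed setup: 10 filters between 0 Hz and 8000 Hz.
centers = mel_center_frequencies(10, 0.0, 8000.0)
for c in centers:
    print(round(c, 1))
```

The centres are dense at low frequencies and sparse at high frequencies, which is the property that makes mel-based features coarser where hearing is less selective.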
6
2. Speech Modelling
- Usually uses Hidden Markov Models
- Characteristics:
  - Number of states
  - Transition probabilities
  - Model to estimate emission likelihoods (GMMs) or posterior probabilities (ANNs)
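The three characteristics listed on this slide can be sketched in a few lines. The example below builds the transition matrix of a strict left-to-right HMM and runs the forward recursion over per-frame emission likelihoods. It is a minimal sketch under assumptions: the function names, the self-loop probability of 0.5 and the explicit exit state are illustrative, and the emission likelihoods are passed in as plain numbers rather than computed by a GMM or ANN.

```python
def left_to_right_transitions(n_states, self_loop=0.5):
    """Transition matrix of a strict left-to-right HMM.

    Row i holds P(next state | state i). Column index n_states is a
    non-emitting exit state, so the matrix has n_states rows and
    n_states + 1 columns. The self_loop value is an assumption.
    """
    trans = [[0.0] * (n_states + 1) for _ in range(n_states)]
    for i in range(n_states):
        trans[i][i] = self_loop            # stay in the same state
        trans[i][i + 1] = 1.0 - self_loop  # advance to the next state (or exit)
    return trans

def forward_likelihood(trans, emissions):
    """Likelihood of the frames over all paths from state 0 to the exit state.

    emissions[t][i] is the emission likelihood of frame t in state i.
    """
    n_states = len(trans)
    alpha = [0.0] * n_states
    alpha[0] = emissions[0][0]             # every path starts in state 0
    for frame in emissions[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * frame[j]
            for j in range(n_states)
        ]
    return alpha[-1] * trans[-1][n_states]  # leave from the last state

trans = left_to_right_transitions(3)
flat = [1.0, 1.0, 1.0]                      # uninformative emissions
print(forward_likelihood(trans, [flat] * 3))  # one path: 0 -> 1 -> 2 -> exit
print(forward_likelihood(trans, [flat] * 2))  # too few frames to reach the exit
```

With only two frames the three-state model cannot reach its exit, so the likelihood is zero. This is the minimum-duration effect that the next section of the talk is concerned with.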
7
2. Modelling Phoneme duration
8
Problems (1)
- In reality, phonemes have different durations
- If an HMM's minimum duration is longer than the phoneme, some states have to model context
9
Problems (2)
- Generally, fewer states give worse performance

Model                      %WER
Baseline TIMIT (S6G32p4)   39.88
TIMIT S4G32p-3             43.21
TIMIT S4G64p-3             41.76
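For readers unfamiliar with the %WER figures in the table, the sketch below shows the usual way such an error rate is computed: the minimum number of substitutions, deletions and insertions needed to turn the hypothesis into the reference, divided by the reference length. The function name and the toy token sequences are illustrative assumptions. The slides do not say which scoring tool was used.

```python
def word_error_rate(reference, hypothesis):
    """Error rate in percent: 100 * (S + D + I) / N via Levenshtein alignment."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                      # delete all i reference tokens
    for j in range(m + 1):
        dist[0][j] = j                      # insert all j hypothesis tokens
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return 100.0 * dist[n][m] / n

# Toy example: one substitution (b -> x) and one deletion (d).
ref = ["a", "b", "c", "d"]
hyp = ["a", "x", "c"]
print(word_error_rate(ref, hyp))
```

Because insertions count as errors, the rate can exceed 100% when the recogniser outputs many spurious tokens, which is why an insertion penalty is typically tuned alongside the model.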
10
Possible solutions
Hypothesis: choose HMMs with a shorter minimum duration for shorter phonemes (prior knowledge)
1. Other topology (jump states)
2. Fewer states
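Both proposed solutions shorten the minimum duration of a model, and that can be checked with a small graph search. The sketch below counts the fewest frames a left-to-right HMM must consume, with and without jump transitions. It rests on assumptions that the slides do not confirm: a jump skips exactly one state, every visited state emits one frame, and the frame shift is 10 ms.

```python
from collections import deque

FRAME_SHIFT_MS = 10  # assumed frame shift, not stated on the slides

def topology(n_states, allow_skip=False):
    """Successor lists for a left-to-right HMM; index n_states is the exit node."""
    succ = {}
    for i in range(n_states):
        nxt = [i, i + 1]                    # self-loop and advance
        if allow_skip and i + 2 <= n_states:
            nxt.append(i + 2)               # jump over one state (assumption)
        succ[i] = nxt
    return succ

def min_frames(succ, n_states):
    """Fewest frames emitted on any path from state 0 to the exit node."""
    queue = deque([(0, 1)])                 # (state, frames emitted so far)
    seen = {0}
    while queue:
        state, frames = queue.popleft()
        for nxt in succ[state]:
            if nxt == n_states:
                return frames
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, frames + 1))
    return None

for n_states, skip in [(6, False), (4, False), (6, True)]:
    frames = min_frames(topology(n_states, skip), n_states)
    print(n_states, "states, jumps" if skip else "states, no jumps",
          "->", frames * FRAME_SHIFT_MS, "ms minimum")
```

Under these assumptions a six-state model needs at least 60 ms, a four-state model 40 ms, and a six-state model with jumps 30 ms, which is why either change can help with very short phonemes.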
11
Test setup – TIMIT
- TIMIT database: 8 dialect regions, 630 speakers
- Without the dialect "sa" utterances
- 3693 sentences for training
- 1344 sentences for testing
- Number of model parameters is constant (fewer states => more Gaussians per state)
12
Modelling phoneme duration: Results (1)

Model                                          %WER
Baseline TIMIT (S6G32p4)                       39.88
TIMIT var. no. of states: S6G32-S4G32p1        40.82
TIMIT var. no. of states: S6G32-S4G64 NO P     39.50
TIMIT jump model for all phonemes (S6G32p2)    41.61
TIMIT var. topology: jump model (S6G32p1)      39.90
13
Modelling phoneme duration: Results analysis
- If the number of states is decreased, more Gaussians per state are necessary to keep model complexity comparable
- Insertion penalty is less important
- Decreasing the model's minimum duration for short phonemes helps correct recognition
- Better results for a variable number of states
14
3. Suggestions of human perception results for speech modelling
15
Human perception tests: Motivation
- Speech is created to be perceived by humans
- We know that human performance is very good and robust
- Simulating human perception may lead to improvements
- Testing on nonsense phoneme sequences (no language model) to isolate the "acoustic model"
16
Human stop-consonant perception (1)
- Tested stop-consonant perception: identical noise burst in variable context
- Liberman, Cooper, Delattre in 1952
- Valid test results?
- Implications for state-of-the-art technology
17
Human stop-consonant perception: Test setup
- Synthetic sounds (Matlab generated)
- 40 test persons, 2 tests each: 17 English, 11 French, 12 others
- Tests on different days
- Headphones: Technics RP-F880
- Quiet room
18
Human stop-consonant perception: Test setup (2)
- 12 different noise burst frequencies
- 7 different two-formant vowels
- No transitions
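To make the stimulus design concrete, the sketch below generates one burst-plus-vowel stimulus of the kind the slide describes: a short band-limited noise burst followed directly by a steady two-formant vowel with no formant transitions. This is not the Matlab code used in the study. The sample rate, durations, bandwidth, the example frequencies and the very crude two-sinusoid "vowel" are all illustrative assumptions.

```python
import math
import random

SAMPLE_RATE = 16000  # Hz, assumed

def noise_burst(center_hz, duration_ms=15, bandwidth_hz=600.0,
                n_components=40, seed=0):
    """Band-limited noise approximated by a sum of random-phase sinusoids."""
    rng = random.Random(seed)
    components = [
        (center_hz + rng.uniform(-0.5, 0.5) * bandwidth_hz,  # frequency in Hz
         rng.uniform(0.0, 2.0 * math.pi))                    # random phase
        for _ in range(n_components)
    ]
    n_samples = SAMPLE_RATE * duration_ms // 1000
    return [
        sum(math.sin(2.0 * math.pi * f * t / SAMPLE_RATE + ph)
            for f, ph in components) / n_components
        for t in range(n_samples)
    ]

def two_formant_vowel(f1_hz, f2_hz, duration_ms=300):
    """Steady vowel stand-in: two fixed-frequency sinusoids, no transitions."""
    n_samples = SAMPLE_RATE * duration_ms // 1000
    return [
        0.5 * math.sin(2.0 * math.pi * f1_hz * t / SAMPLE_RATE)
        + 0.5 * math.sin(2.0 * math.pi * f2_hz * t / SAMPLE_RATE)
        for t in range(n_samples)
    ]

def stimulus(burst_hz, f1_hz, f2_hz):
    """Burst followed immediately by the vowel."""
    return noise_burst(burst_hz) + two_formant_vowel(f1_hz, f2_hz)

# Hypothetical values for one cell of the 12 x 7 stimulus grid.
samples = stimulus(burst_hz=1440.0, f1_hz=700.0, f2_hz=1200.0)
print(len(samples), "samples;", 12 * 7, "stimuli in the full grid")
```

Crossing every burst frequency with every vowel gives 12 x 7 = 84 distinct stimuli, with the burst held identical across vowels so that any change in the perceived consonant must come from the context.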
19
Human stop-consonant perception: Selected results
[Figure panels: in front of /a/; in front of /o/]
20
Suggestions of human perception results for speech modelling
- Suggests that speech data has to be analyzed in some context => consistent with the well-known result that context-dependent phoneme models improve performance
- Suggests the necessity of using multiple Gaussians per state
21
4. Conclusions
22
Conclusions (1)
- Performance can be improved by introducing variable-state HMMs
- A context-independent phoneme model is inadequate with short-term spectral features
23
Conclusions (2)
- Could new features (such as TRAPS) capture relative dependencies better?
- Preference for context-dependent phoneme models with multiple Gaussians
24
Thank you!
25
Human stop-consonant perception: Results 2004, EN vs. FR