Improving Speech Modelling
Viktoria Maier
Supervised by Prof. Hynek Hermansky
Outline 1. State-of-the-art 2. Modelling phoneme duration 3. Suggestions from human perception results for speech modelling 4. Conclusion I of IV
1. Current State-of-the-Art HMM Technology
State-of-the-art: Overview of Speech Modelling
1. Feature Extraction For speech recognition: extract features that enable us to discriminate between different classes (phonemes). The more discriminative the features, the easier the classification. Usually we extract the frequencies contained in each frame (MFCCs).
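The MFCC pipeline mentioned above (framing, spectral analysis, mel warping, decorrelation) can be sketched as follows. This is a minimal illustration assuming numpy; the parameter values (16 kHz audio, 25 ms frames, 10 ms hop, 26 filters, 13 cepstra) are common defaults, not necessarily those used in the experiments reported here.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale conversion.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filt=26, n_ceps=13):
    # 1. Slice the signal into overlapping frames, apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # 2. Power spectrum of each frame.
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 3. Triangular mel filterbank (equally spaced on the mel scale).
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filt + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for j in range(n_filt):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[j, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # 4. Type-II DCT decorrelates the log mel energies; keep first n_ceps.
    k = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * k + 1) / (2 * n_filt)))
    return log_mel @ dct.T
```

Each row of the returned matrix is one frame's feature vector, the unit the HMM states in the next slide model.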
2. Speech Modelling Usually uses Hidden Markov Models. Characteristics: number of states, transition probabilities, and a model to estimate emission likelihoods (GMMs) or posterior probabilities (ANNs).
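The HMM characteristics listed above come together in the forward algorithm, which scores a feature sequence against a model. A minimal log-domain sketch, assuming numpy and that per-frame log emission scores (from GMMs or ANN posteriors) are already available; the function name is illustrative:

```python
import numpy as np

def forward(log_A, log_b, log_pi):
    """Forward algorithm in the log domain.

    log_A:  (S, S) log transition probabilities
    log_b:  (T, S) log emission likelihoods per frame and state
    log_pi: (S,)   log initial-state probabilities
    Returns the total log-likelihood of the observation sequence.
    """
    T, S = log_b.shape
    alpha = log_pi + log_b[0]
    for t in range(1, T):
        # log-sum-exp over predecessor states, then add emission score.
        m = alpha[:, None] + log_A
        mx = m.max(axis=0)
        alpha = mx + np.log(np.exp(m - mx).sum(axis=0)) + log_b[t]
    mx = alpha.max()
    return mx + np.log(np.exp(alpha - mx).sum())
```

The log-sum-exp trick avoids the underflow that plain probability products suffer over hundreds of frames.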
2. Modelling Phoneme Duration
Problem (1) Phonemes in reality have different durations. If the model's minimum duration is longer than the phoneme, some states have to model the surrounding context instead.
Problem (2) Generally, fewer states lead to worse performance. (%WER table: Baseline TIMIT (S6G32p4), TIMIT S4G32p, TIMIT S4G64p — values not recoverable)
Possible solutions Hypothesis: choose shorter-minimum-duration HMMs for shorter phonemes (prior knowledge). 1. Different topology (jump states) 2. Fewer states
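The link between topology and minimum duration can be made concrete: in a strict left-to-right HMM every emitting state must absorb at least one frame, so the minimum duration equals the number of states, and jump (skip) transitions lower it. A small sketch — the function name and state encoding are mine, not taken from the experiments:

```python
from collections import deque

def min_duration(transitions, n_states):
    """Minimum number of frames an HMM must consume.

    transitions: set of allowed (from_state, to_state) pairs over emitting
    states 0..n_states-1; state n_states is a non-emitting exit.
    Entering an emitting state consumes one frame, so the minimum duration
    is the fewest emitting states on any entry-to-exit path (BFS works
    because every edge into an emitting state costs exactly one frame).
    """
    frames = {0: 1}              # entering state 0 consumes one frame
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for a, b in transitions:
            if a == i and b not in frames:
                # The exit state is non-emitting: no extra frame.
                frames[b] = frames[i] + (1 if b < n_states else 0)
                queue.append(b)
    return frames.get(n_states)
```

A 6-state model without skips thus forces every phoneme to last at least 6 frames, which is exactly the problem for short phonemes.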
Test setup – TIMIT TIMIT database: 8 dialect regions, 630 speakers. Without the dialect ("sa") utterances: 3693 sentences for training, 1344 sentences for testing. The number of model parameters is kept constant (fewer states => more Gaussians per state).
Modelling phoneme duration: Results (1) (%WER table: Baseline TIMIT (S6G32p4); variable no. of states S6G32–S4G32p; variable no. of states S6G32–S4G64, NO P; jump model for all phonemes (S6G32p2); variable topology, jump model (S6G32p1): 39.90 — other values not recoverable)
Modelling phoneme duration: Results analysis If the number of states is decreased, an increase in Gaussians per state is necessary to ensure comparable model complexity. The insertion penalty becomes less important. Decreasing the model's minimum duration for short phonemes helps correct recognition. Better results with a variable number of states.
3. Suggestions from human perception results for speech modelling
Human perception tests: Motivation Speech is created to be perceived by humans. We know that human performance is very good and robust. Simulating human perception may lead to improvements. Testing on nonsense phoneme sequences (no language model) isolates the "acoustic model".
Human stop-consonant perception (1) Tested stop-consonant perception: identical noise burst in variable context (Liberman, Cooper, Delattre, 1952). Are the test results still valid? Implications for state-of-the-art technology?
Human stop-consonant perception: Test setup Synthetic sounds (MATLAB-generated). 40 test subjects, 2 tests each: 17 English, 11 French, 12 others. Tests on different days; Technics RP-F880 headphones; quiet room.
Human stop-consonant perception: Test setup (2) 12 different noise-burst frequencies, 7 different two-formant vowels, no transitions.
Human stop-consonant perception: Selected results (figures: responses in front of /a/ and in front of /o/)
Suggestions from human perception results for speech modelling Suggests that speech data has to be analyzed in context => consistent with the well-known result that context-dependent phoneme models improve performance. Suggests the necessity of multiple Gaussians per state.
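The multiple-Gaussians-per-state suggestion refers to the standard GMM emission score, where each mixture component can cover a different contextual variant of the same phoneme. A minimal diagonal-covariance sketch, assuming numpy (the function name is illustrative):

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one feature frame x under a diagonal-covariance
    Gaussian mixture: log sum_k w_k * N(x; mu_k, diag(var_k)).

    x: (d,) frame; weights: (K,); means, variances: (K, d).
    """
    d = x.shape[0]
    # Per-component log normalizer and Mahalanobis term.
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    log_maha = -0.5 * (((x - means) ** 2) / variances).sum(axis=1)
    comp = np.log(weights) + log_norm + log_maha
    m = comp.max()                      # log-sum-exp for numerical stability
    return m + np.log(np.exp(comp - m).sum())
```

With K > 1 components, the mixture can place mass on several distinct spectral shapes of a phoneme — exactly what the context-dependent burst percepts above call for.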
4. Conclusions
Conclusions (1) Performance can be improved by introducing variable-state HMMs. The context-independent phoneme model is inadequate with short-term spectral features.
Conclusions (2) New features (such as TRAPS) to capture relative dependencies better? Preference for context-dependent phoneme models with multiple Gaussians.
Thank you!
Human stop-consonant perception: Results 2004, EN vs. FR