Published by Rudolf Cummings. Modified over 10 years ago.

1 Robust Recognition of Digits and Natural Numbers
Frederico Rodrigues and Isabel Trancoso
INESC/IST, 2000

2 Summary
Problem overview
Baseline system
Extensions to the baseline system
Conclusions and future work

3 The Problem
Speaker
–Gender
–Age
–Vocal tract characteristics
Pronunciation
–Rate of speech
–Stress
–Lombard reflex
Microphone
–Position
–Distortion
Channel
–Distortion
–Noise
Environment
–Background noises
–Intermittent noises
–Cocktail party noises
–Reverberation

4 Corpus Description
Multilingual telephone speech corpus
–SPEECHDAT(M): 1000 speakers
–SPEECHDAT(II): 4000 speakers
Orthographically transcribed, including noise events

5 Noise events
[spk]: Speaker-related noises
[sta]: Stationary noises
[int]: Intermittent noises

7 Train and Test Set Definition
Selection procedure
–Age, gender and region distributions are approximately equal in the train and test sets
SPEECHDAT(II)
–Fixed 500-speaker evaluation set
–Additional 300-speaker development set
SPEECHDAT(M)
–200-speaker evaluation set
Overall ratio of 80% train / 20% test

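The selection procedure above could be sketched as a stratified split. This is only an illustration: the slides do not say how speakers were actually assigned, and the speaker tuples, the stratum labels and the `stratified_split` helper are all hypothetical.

```python
from collections import defaultdict

def stratified_split(speakers, test_every=5):
    """Split speakers so each (gender, age band, region) stratum
    contributes about 1/test_every of its members to the test set.

    speakers: list of (speaker_id, gender, age_band, region) tuples.
    """
    strata = defaultdict(list)
    for spk in speakers:
        strata[spk[1:]].append(spk)  # key = (gender, age_band, region)

    train, test = [], []
    for key in sorted(strata):
        for i, spk in enumerate(sorted(strata[key])):
            # Every test_every-th speaker of a stratum goes to the test set.
            if i % test_every == test_every - 1:
                test.append(spk)
            else:
                train.append(spk)
    return train, test

# Hypothetical metadata: 8 strata with 10 speakers each.
EXAMPLE_SPEAKERS = [
    (f"{g}-{a}-{r}-{i:02d}", g, a, r)
    for g in ("F", "M")
    for a in ("young", "adult")
    for r in ("north", "south")
    for i in range(10)
]

train_set, test_set = stratified_split(EXAMPLE_SPEAKERS)
print(len(train_set), len(test_set))
```

With `test_every=5` each stratum sends one speaker in five to the test set, which matches the 80%/20% ratio quoted on the slide (200 of 1000 speakers for SPEECHDAT(M), 500 + 300 of 4000 for SPEECHDAT(II)).
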
8 Sub-corpus Used
I1 - Isolated digit strings
B1 - Sequences of 10 digits
N* - Natural numbers

9 Feature Extraction
MFCC (Mel Frequency Cepstral Coefficients)
–14 cepstra + 14 Δ cepstra + energy + Δ energy
–Speech signal band-limited between 200 and 3800 Hz
–Hamming window: 25 ms, every 10 ms
Cepstral Mean Subtraction
–Simple but effective technique for channel and speaker normalization

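A minimal sketch of the framing and cepstral mean subtraction steps listed above, in plain Python. The 8 kHz sampling rate is an assumption (typical for telephone speech, but not stated on the slide), and the function names are illustrative, not taken from the authors' system.

```python
import math

def frame_count(num_samples, sample_rate=8000, win_ms=25, shift_ms=10):
    """Number of full frames for a 25 ms window advanced every 10 ms.
    sample_rate=8000 is an assumption, not a value from the slides."""
    win = sample_rate * win_ms // 1000      # 200 samples at 8 kHz
    shift = sample_rate * shift_ms // 1000  # 80 samples at 8 kHz
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // shift

def hamming(n):
    """Hamming window coefficients for a frame of n samples (n > 1)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def cepstral_mean_subtraction(frames):
    """Subtract the per-utterance mean of every cepstral coefficient.

    frames: list of equal-length feature vectors for one utterance.
    """
    if not frames:
        return []
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]

normalized = cepstral_mean_subtraction([[1.0, 2.0], [3.0, 6.0]])
print(frame_count(8000), normalized)
```

A fixed channel acts roughly as a constant offset in the cepstral domain, so removing the utterance-level mean cancels much of it, which is why the slide calls the technique simple but effective.
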
10 Acoustic Modeling
Left-right continuous density HMMs
–Word models for each digit. No skips.
–Silence and filler models with forward and backward skips
Gender-dependent models
HMM: Hidden Markov Model

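To make the "left-right, no skips" topology concrete, here is a small sketch that builds such a transition matrix. The number of states and the initial self-loop probability are hypothetical, since the slides give neither.

```python
def left_right_transitions(num_states, self_loop=0.5):
    """Transition matrix of a left-to-right HMM without skips.

    Row i gives P(next | state i) over num_states emitting states plus
    one final exit column. Only the self-loop and the move to the next
    state are allowed. self_loop=0.5 is an assumed starting value.
    """
    matrix = []
    for i in range(num_states):
        row = [0.0] * (num_states + 1)
        row[i] = self_loop            # stay in the same state
        row[i + 1] = 1.0 - self_loop  # advance exactly one state
        matrix.append(row)
    return matrix

# Hypothetical 4-state digit model.
digit_model = left_right_transitions(4)
for row in digit_model:
    print(row)
```

The silence and filler models on the slide also allow forward and backward skips, which would correspond to extra non-zero entries off these two diagonals.
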
11 Model Topology
Filler and silence model topology

12 Baseline System - Isolated Digits
Choose isolated digits with no noise marks
–HMM parameters initialized with the global mean and variance of the training data
Embedded Baum-Welch re-estimation
Evaluate performance with Viterbi decoding
–Grammar allowing one digit and initial and final silence
–Grammar allowing one digit and any number of fillers or silence

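The initialization step above (often called a flat start) can be sketched as follows. It assumes diagonal covariances and a single Gaussian per state at this stage; the data layout and function names are illustrative only.

```python
def global_mean_variance(frames):
    """Per-dimension mean and variance over all training frames."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean, var

def flat_start(frames, num_states):
    """Give every state of a model the same global mean and variance.
    Each state gets its own copy so later re-estimation can diverge."""
    mean, var = global_mean_variance(frames)
    return [{"mean": list(mean), "var": list(var)} for _ in range(num_states)]

# Two toy 2-dimensional frames standing in for the training data.
states = flat_start([[0.0, 2.0], [2.0, 6.0]], num_states=3)
print(states[0])
```

Because all states start identical, it is the embedded Baum-Welch passes that follow which differentiate them.
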
13 Baseline System - Isolated Digits

14 Baseline System - Isolated Digits
Increment Gaussian mixtures per state up to 3 for the digit models
Introduce files with noise marks
Repeat re-estimation/evaluation process
Increment Gaussian mixtures per state up to 3 for the filler and digit models

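The slides do not say how the number of mixtures was incremented. One common approach, shown here purely as an assumption, is to split the heaviest Gaussian of a state: its weight is halved and the two copies have their means moved apart by a fraction of the standard deviation. The 0.2 perturbation factor and all names below are illustrative.

```python
import math

def split_heaviest(mixture, perturb=0.2):
    """Return a mixture with one more component, obtained by splitting
    the component with the largest weight.

    mixture: list of {"weight", "mean", "var"} dicts (diagonal variance).
    """
    idx = max(range(len(mixture)), key=lambda i: mixture[i]["weight"])
    comp = mixture[idx]
    offset = [perturb * math.sqrt(v) for v in comp["var"]]
    halves = []
    for sign in (1, -1):
        halves.append({
            "weight": comp["weight"] / 2,
            "mean": [m + sign * o for m, o in zip(comp["mean"], offset)],
            "var": list(comp["var"]),
        })
    return mixture[:idx] + halves + mixture[idx + 1:]

def mix_up(mixture, target):
    """Split repeatedly until the state has `target` components.
    In practice re-estimation would be run between increments."""
    while len(mixture) < target:
        mixture = split_heaviest(mixture)
    return mixture

state = mix_up([{"weight": 1.0, "mean": [0.0], "var": [4.0]}], target=3)
print([c["weight"] for c in state])
```

Splitting keeps the weights summing to one, so the state remains a valid mixture before the next re-estimation/evaluation round.
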
15 Connected vs Isolated Digits
Example: number 3 1 2 6 said as:
Isolated digits: t r e S u~ d o j S s 6 j S
Connected digits: t r e z u~ d o j S _ 6 j S

16 Baseline System - Connected Digits
Use the best isolated digit models as bootstrap models
Repeat re-estimation/evaluation process
Gradually increment Gaussian mixtures per state up to 5 for the digit models

17 Baseline System - Results

18 Extension to the Baseline System
New way of modelling the fillers
Same training/evaluation process
Train the 9 filler and silence models with no skips
Build a single filler model by concatenating all filler and silence models

19 New Filler Model Architecture

20 Results With New Filler Model

21 Natural Numbers
Phone models with 3 states and no skips
Larger vocabulary size
May be adapted to other tasks
Phones initialized from models already trained for a directory assistance task
Digits are still modeled by word models
Grammar for natural numbers ranging from zero to hundreds of millions

22 Natural Numbers
Example: number 25
Hypothesis 1: vinte e cinco (twenty and five)
Hypothesis 2: vinte cinco (twenty five)
But “vinte cinco” could also be the sequence of natural numbers 20 5

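The ambiguity in this example can be illustrated with a toy enumerator that lists every way a word sequence can be read as a sequence of numbers. It covers only a handful of Portuguese words and is not the grammar used in the system; the vocabulary tables and the `readings` function are hypothetical.

```python
# Tiny illustrative vocabulary, far smaller than a real number grammar.
UNITS = {"um": 1, "dois": 2, "três": 3, "quatro": 4, "cinco": 5}
TENS = {"vinte": 20, "trinta": 30}

def readings(words):
    """Return all interpretations of `words` as a list of natural numbers."""
    if not words:
        return [[]]
    results = []
    first = words[0]
    if first in TENS:
        # The tens word alone, followed by whatever the rest reads as.
        for rest in readings(words[1:]):
            results.append([TENS[first]] + rest)
        # Tens word directly followed by a unit ("vinte cinco").
        if len(words) >= 2 and words[1] in UNITS:
            for rest in readings(words[2:]):
                results.append([TENS[first] + UNITS[words[1]]] + rest)
        # Tens word, "e", then a unit ("vinte e cinco").
        if len(words) >= 3 and words[1] == "e" and words[2] in UNITS:
            for rest in readings(words[3:]):
                results.append([TENS[first] + UNITS[words[2]]] + rest)
    elif first in UNITS:
        for rest in readings(words[1:]):
            results.append([UNITS[first]] + rest)
    return results

print(readings("vinte e cinco".split()))
print(readings("vinte cinco".split()))
```

In this sketch "vinte e cinco" has a single reading, while "vinte cinco" yields both the number 25 and the sequence 20 5, which is the confusion the slide points out.
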
23 Natural Numbers - Results

24 Sample Application
[Block diagram. Labels: User, Client, Server, State Control, Speech Recording, Feature Extraction, Speech Recognition, Speech Synthesis, DIXI - SVIT, Speech Prompts, Speech / Commands, Synthesised answer / Commands, Answer]

25 Conclusions and Future Work
Explicitly modeling fillers is a difficult task
–The improved filler model decreases the error rate by up to 50%
Develop context-dependent models
–Solve vowel reduction and co-articulation problems
Results may be improved through the use of discriminative training techniques