Frederico Rodrigues and Isabel Trancoso INESC/IST, 2000 Robust Recognition of Digits and Natural Numbers.


1 Frederico Rodrigues and Isabel Trancoso INESC/IST, 2000 Robust Recognition of Digits and Natural Numbers

2 Summary
Problem overview
Baseline system
Extensions to the baseline system
Conclusions and future work

3 The Problem
Speaker: gender, age, vocal tract characteristics, pronunciation, rate of speech, stress, Lombard reflex
Microphone: position, distortion
Channel: distortion, noise
Environment: background noises, intermittent noises, cocktail party noises, reverberation

4 Corpus Description
Multilingual telephone speech corpus
– SPEECHDAT(M): 1000 speakers
– SPEECHDAT(II): 4000 speakers
Orthographically transcribed, including noise events

5 Noise Events
[spk]: Speaker-related noises
[sta]: Stationary noises
[int]: Intermittent noises

6

7 Train and Test Set Definition
Selection procedure
– Age, gender and region distributions are approximately equal in the train and test sets
SPEECHDAT(II)
– Fixed evaluation set of 500 speakers
– Additional development set of 300 speakers
SPEECHDAT(M)
– Evaluation set of 200 speakers
Overall ratio of 80% train / 20% test
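The 80/20 ratio can be checked against the speaker counts given on the corpus and train/test slides. The short sketch below assumes that the SPEECHDAT(II) development set counts toward the held-out 20%; the slides do not state this explicitly, and the dictionary layout is purely illustrative.

```python
# Speaker counts as given on the slides.
# Assumption: the development set is counted as held-out data.
corpora = {
    "SPEECHDAT(II)": {"speakers": 4000, "held_out": 500 + 300},  # evaluation + development
    "SPEECHDAT(M)": {"speakers": 1000, "held_out": 200},         # evaluation only
}


def held_out_fraction(entry):
    """Fraction of speakers that are not used for training."""
    return entry["held_out"] / entry["speakers"]


fractions = {name: held_out_fraction(entry) for name, entry in corpora.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.0%} held out, {1 - frac:.0%} train")
```

Under that assumption both corpora come out at 20% held-out speakers, which matches the overall ratio quoted on the slide.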

8 Sub-corpus Used
I1 - Isolated digit strings
B1 - Sequences of 10 digits
N* - Natural numbers

9 Feature Extraction
MFCC (Mel Frequency Cepstral Coefficients)
– 14 cepstra + 14 Δ cepstra + energy + Δ energy
– Speech signal band-limited between 200 and 3800 Hz
– Hamming window: 25 ms long, shifted every 10 ms
Cepstral Mean Subtraction
– Simple but effective technique for channel and speaker normalization
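As a rough illustration of two steps on this slide, the sketch below cuts a signal into 25 ms Hamming-windowed frames every 10 ms and applies cepstral mean subtraction to a matrix of cepstral vectors. It assumes an 8 kHz sampling rate (typical for telephone speech, but not stated on the slide) and leaves out the mel filterbank and cepstrum computation entirely. The function names are illustrative, not taken from the authors' system.

```python
import math

SAMPLE_RATE = 8000                       # assumption: 8 kHz telephone speech
FRAME_LEN = round(0.025 * SAMPLE_RATE)   # 25 ms window
FRAME_STEP = round(0.010 * SAMPLE_RATE)  # 10 ms frame shift


def hamming(n):
    """Hamming window of n points."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(signal, frame_len=FRAME_LEN, step=FRAME_STEP):
    """Split a list of samples into overlapping Hamming-windowed frames."""
    window = hamming(frame_len)
    framed = []
    for start in range(0, len(signal) - frame_len + 1, step):
        chunk = signal[start:start + frame_len]
        framed.append([s * w for s, w in zip(chunk, window)])
    return framed


def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean of each cepstral coefficient."""
    n_frames = len(cepstra)
    dim = len(cepstra[0])
    means = [sum(frame[d] for frame in cepstra) / n_frames for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in cepstra]
```

A fixed channel acts as a constant offset in the cepstral domain, so removing the utterance mean cancels it; that is why the slide describes the technique as simple but effective.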

10 Acoustic Modeling
Left-right continuous-density HMMs (Hidden Markov Models)
– Word models for each digit, with no skips
– Silence and filler models with forward and backward skips
Gender-dependent models
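To make the "left-right, no skips" constraint concrete, the sketch below builds a transition matrix in which each state can only repeat or advance to its successor. The number of states and the initial self-loop probability are illustrative assumptions; the slides give neither.

```python
def left_right_transitions(n_states, self_loop=0.6):
    """Transition matrix for a left-right HMM with no skips.

    Row i holds the probabilities of leaving emitting state i.
    The extra last column is the exit transition out of the model.
    n_states and self_loop are hypothetical values for illustration.
    """
    matrix = [[0.0] * (n_states + 1) for _ in range(n_states)]
    for i in range(n_states):
        matrix[i][i] = self_loop            # stay in the same state
        matrix[i][i + 1] = 1.0 - self_loop  # advance to the next state (or exit)
    return matrix


digit_model = left_right_transitions(n_states=4)
for row in digit_model:
    print(row)
```

Every entry other than the diagonal and the one to its right stays at zero, which is exactly what forbids skipping a state or moving backwards. The silence and filler models on the slide relax this by allowing forward and backward skips.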

11 Model Topology
Topology of the filler and silence models

12 Baseline System - Isolated Digits
Choose isolated digits with no noise marks
– HMM parameters initialized with the global mean and variance of the training data
Embedded Baum-Welch re-estimation
Evaluate performance with Viterbi decoding
– Grammar allowing one digit with initial and final silence
– Grammar allowing one digit and any number of fillers or silences
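The initialization step on this slide (often called a flat start) can be sketched as follows: every state of every model starts from the same Gaussian, whose mean and variance are computed over all training frames. The data layout and function names are assumptions made for illustration, and the Baum-Welch re-estimation that follows is not shown.

```python
def global_mean_and_variance(frames):
    """Per-dimension mean and variance over all feature vectors."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean, var


def flat_start(n_states, frames):
    """Give every state an identical diagonal Gaussian (hypothetical layout)."""
    mean, var = global_mean_and_variance(frames)
    return [{"mean": list(mean), "var": list(var)} for _ in range(n_states)]


# Toy two-dimensional "training data" standing in for MFCC vectors.
toy_frames = [[0.0, 2.0], [2.0, 4.0]]
states = flat_start(n_states=3, frames=toy_frames)
```

Because all states start out identical, it is the subsequent embedded re-estimation that makes them specialize to different parts of each digit.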

13 Baseline System - Isolated Digits

14 Baseline System - Isolated Digits
Increment Gaussian mixtures per state up to 3 for the digit models
Introduce files with noise marks
Repeat re-estimation/evaluation process
Increment Gaussian mixtures per state up to 3 for the filler and digit models
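The slides do not say how mixture components were added. One common heuristic, shown here only as a sketch, is to split the component with the largest weight into two copies whose means are shifted by a fraction of the standard deviation in opposite directions. The 0.2 offset and the dictionary layout are assumptions, not details from the presentation.

```python
import math


def split_heaviest(mixture, offset=0.2):
    """Return a new mixture with the heaviest component split in two.

    mixture: list of dicts with 'weight', 'mean' and 'var' (diagonal covariance).
    offset:  mean perturbation in units of standard deviation (assumed value).
    """
    idx = max(range(len(mixture)), key=lambda i: mixture[i]["weight"])
    comp = mixture[idx]
    std = [math.sqrt(v) for v in comp["var"]]
    half = comp["weight"] / 2.0
    plus = {
        "weight": half,
        "mean": [m + offset * s for m, s in zip(comp["mean"], std)],
        "var": list(comp["var"]),
    }
    minus = {
        "weight": half,
        "mean": [m - offset * s for m, s in zip(comp["mean"], std)],
        "var": list(comp["var"]),
    }
    return mixture[:idx] + [plus, minus] + mixture[idx + 1:]


single = [{"weight": 1.0, "mean": [0.0], "var": [4.0]}]
two = split_heaviest(single)
```

In a procedure like the one on the slide, each increment would be followed by a round of re-estimation and evaluation before the next component is added.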

15 Connected vs Isolated Digits
Example: the number 3 1 2 6 said as:
Isolated digits: t r e S u~ d o j S s 6 j S
Connected digits: t r e z u~ d o j S _ 6 j S

16 Baseline System - Connected Digits
Use the best isolated-digit models as bootstrap models
Repeat re-estimation/evaluation process
Gradually increment Gaussian mixtures per state up to 5 for the digit models

17 Baseline System - Results

18 Extension to the Baseline System
New way of modelling the fillers
Same training/evaluation process
Train the 9 filler and silence models with no skips
Build a single filler model by concatenating all filler and silence models

19 New Filler Model Architecture

20 Results With New Filler Model

21 Natural Numbers
Phone models with 3 states and no skips
Larger vocabulary size
May be adapted to other tasks
Phones initialized from models already trained for a directory assistance task
Digits are still modeled by word models
Grammar for natural numbers ranging from zero to hundreds of millions

22 Natural Numbers Example
Number 25:
Hypothesis 1: vinte e cinco (twenty and five)
Hypothesis 2: vinte cinco (twenty five)
But “vinte cinco” could also be the sequence of natural numbers 20 5
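The ambiguity in this example can be reproduced with a toy grammar. The sketch below enumerates every way of reading a word sequence as a sequence of natural numbers, allowing a tens word to be followed by a unit either with or without the connective "e". The tiny vocabulary and the recursive enumeration are illustrative assumptions; the grammar used in the actual system covers numbers up to hundreds of millions and is not described in the slides.

```python
# Hypothetical mini-vocabulary, just large enough for the slide's example.
TENS = {"vinte": 20, "trinta": 30}
UNITS = {"um": 1, "dois": 2, "cinco": 5}


def readings(words):
    """Return every reading of `words` as a list of natural numbers."""
    if not words:
        return [[]]
    head = words[0]
    candidates = []  # pairs of (value, number of words consumed)
    if head in UNITS:
        candidates.append((UNITS[head], 1))
    if head in TENS:
        candidates.append((TENS[head], 1))
        if len(words) >= 2 and words[1] in UNITS:
            # "vinte cinco" read as a single number
            candidates.append((TENS[head] + UNITS[words[1]], 2))
        if len(words) >= 3 and words[1] == "e" and words[2] in UNITS:
            # "vinte e cinco"
            candidates.append((TENS[head] + UNITS[words[2]], 3))
    results = []
    for value, used in candidates:
        results.extend([value] + rest for rest in readings(words[used:]))
    return results


print(readings(["vinte", "e", "cinco"]))  # [[25]]
print(readings(["vinte", "cinco"]))       # [[20, 5], [25]]
```

With the connective, only the reading 25 survives; without it, the same words are also a valid sequence of two numbers, which is the confusion the slide points out.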

23 Natural Numbers - Results

24 Sample Application
[Diagram of the sample application. Labels: User, Client, Server, DIXI - SVIT, State Control, Speech Recording, Feature Extraction, Speech Recognition, Speech Synthesis, Speech Prompts, Speech / Commands, Synthesised Answer / Commands, Answer]

25 Conclusions and Future Work
Explicitly modeling fillers is a difficult task
– Improved filler model decreases error rate by up to 50%
Develop context-dependent models
– Solve vowel reduction and co-articulation problems
Results may be improved through the use of discriminative training techniques

