
1 Feature vs. Model Based Vocal Tract Length Normalization for a Speech Recognition-based Interactive Toy Jacky CHAU Department of Computer Science and Engineering The Chinese University of Hong Kong

2 Outline Introduction to the Interactive Toy
Storytelling Engine Authoring Environment Speech Recognition Engine Background of Vocal Tract Length Normalization (VTLN) used in the Speech Recognition Engine, especially suitable for children's utterances Results Conclusion

3 What is an Interactive Toy?
Unlike a traditional storytelling machine (one speaks and one listens), the player can talk to the toy, interact with it and decide the story branching Video Show

4 Interactive Story Teller
Story Telling Engine Speech Recognizer Authoring Environment

5 Storytelling Engine Interface

6 Storytelling Engine Stories are retrieved in a standardized database format, from the local machine or the Internet Multiple-media (video, audio and image) presentation of story segments Branching controlled by input from the speech recognizer

7 Authoring Environment
A tool for creating, modifying and managing multipath stories. What are multipath stories?

8 Multipath Stories An alternative to the traditional linear narrative
A graph consisting of nodes (story segments) and edges connecting them The path traversed depends on user input

9 Authoring Environment (cont'd)
Writes stories in a standard database format that can be shared over the Internet Graphical user interface visualizes interactive stories as a graph structure Editable node properties: name, text, audio, image Editable branch properties: name, keyword

10 User Interface

11 Speech Recognition Engine
The KEY to interaction Phoneme modeling allows recognition of arbitrary keywords/phrases Keyword spotting allows fewer constraints on users' utterances Vocal tract length normalization (VTLN) adapts to children's utterances to increase recognition accuracy

12 Adult vs. Child Utterances
Different vocal tract lengths lead to different formant positions, even for the same word. Reason: the vocal tract acts like a tube, and the longer the tube, the lower the formants. [Figure: adult vs. child spectra showing the shifted 3rd formant]
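The tube analogy can be made concrete with the standard uniform-tube (quarter-wavelength) resonator model; a minimal sketch, where the ~17 cm adult and ~12 cm child tract lengths are illustrative assumptions, not values from this work:

```python
# Uniform-tube model of the vocal tract, closed at the glottis and
# open at the lips: resonances fall at odd quarter-wavelengths,
# f_n = (2n - 1) * c / (4 * L).
C = 343.0  # speed of sound in air, m/s (approximate)

def formants(tract_length_m, n_formants=3):
    """First n resonance frequencies (Hz) of a uniform tube of the
    given length -- a first-order model of a neutral vowel."""
    return [(2 * n - 1) * C / (4 * tract_length_m)
            for n in range(1, n_formants + 1)]

adult = formants(0.17)  # ~17 cm tube: about 504, 1513, 2522 Hz
child = formants(0.12)  # ~12 cm tube: about 715, 2144, 3573 Hz
```

Every resonance of the shorter (child) tube lies higher in frequency, which is exactly the adult/child mismatch that VTLN compensates for.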

13 Why do we need VTLN? Feature extraction captures the vocal tract shape properties Normally, adult features are used to train the models, represented by Gaussian mixtures This creates a mismatch between children's features and adult models, much like a mismatch in template matching Solution: pull the utterances toward the models == feature warping pull the models toward the utterances == model warping

14 Why do we need model warping?
Feature warping is performed on every incoming speech frame, but model warping can use pre-calculated models. Thus, model warping has a lower computational cost during real-time recognition. Feature warping needs to modify the front-end feature extractor, but model warping does not need to change that configuration. Thus, the development time of model warping is shorter.

15 Feature Warping Approach
Frequency warping by interpolation during feature extraction (every frame) Binning and the remaining blocks ==> MFCC

16 Motivation of Model Warping
Binning groups the spectrum coefficients Neighboring coefficients are highly correlated, so the binned coefficients are related [Figure: binning of the spectrum vs. frequency]
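Binning can be viewed as a linear map from spectrum bins to channel energies. A toy sketch with linearly spaced triangular filters (real front ends place the filters on the mel scale; this helper is an illustration, not the system's filterbank):

```python
import numpy as np

def triangular_filterbank(n_bins, n_filters):
    """Toy linearly spaced triangular filterbank: each row groups a
    band of neighboring spectrum bins, with overlapping triangles so
    adjacent channels share (correlated) coefficients."""
    centers = np.linspace(0, n_bins - 1, n_filters + 2)
    fb = np.zeros((n_filters, n_bins))
    for m in range(1, n_filters + 1):
        left, center, right = centers[m - 1], centers[m], centers[m + 1]
        for k in range(n_bins):
            if left <= k <= center:
                fb[m - 1, k] = (k - left) / (center - left)
            elif center < k <= right:
                fb[m - 1, k] = (right - k) / (right - center)
    return fb

# binned = triangular_filterbank(257, 24) @ np.abs(frame_fft)
```

The overlap between neighboring triangles is what makes the binned coefficients related, so a warp applied before binning can be approximated by a warp applied after it.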

17 Model Warping Approach
Transform the MFCC model parameters to the filterbank domain Warp in the filterbank domain Transform back to the MFCC model

18 Model Warping Procedure
Transform MFCCs to filterbank coefficients Warp by interpolation in the filterbank domain (similar to feature warping) Transform back to MFCCs The delta and acceleration MFCCs undergo the same transformation
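The three steps can be sketched on a single Gaussian mean vector. This assumes the MFCCs came from an orthogonal DCT-II over the log filterbank energies and zero-pads the truncated cepstrum; `warp_mfcc_mean` and the channel count are illustrative assumptions, not the system's actual code:

```python
import numpy as np
from scipy.fftpack import dct, idct

def warp_mfcc_mean(mfcc_mean, alpha, n_filters=24):
    """Model warping on one Gaussian mean: invert the DCT to recover
    log filterbank energies, warp them by interpolation (as in feature
    warping), then transform back and truncate to the cepstral order."""
    n_ceps = len(mfcc_mean)
    padded = np.zeros(n_filters)
    padded[:n_ceps] = mfcc_mean                 # zero-pad truncated cepstrum
    log_fb = idct(padded, norm='ortho')         # cepstral -> filterbank domain
    chans = np.arange(n_filters, dtype=float)
    warped = np.interp(alpha * chans, chans, log_fb)  # warp channel axis
    return dct(warped, norm='ortho')[:n_ceps]   # back to cepstral domain
```

Since this runs once per Gaussian rather than once per speech frame, the warped models can be pre-computed before recognition starts.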

19 Experimental Setup Feature vectors with 39 elements: 12 MFCCs plus one frame energy, with delta and acceleration coefficients The 11 digit models were 9-state left-to-right HMMs with a single mixture component The silence model had 3 states, 4 mixtures and a loop-back path Training data: 25 men and 25 women (3570 utterances) Testing data: 25 boys and 25 girls (3570 utterances) Warping factors from 0.88 to 1.12 with step size 0.04
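For reference, the grid of warping factors used in the setup can be generated as:

```python
import numpy as np

# Warping factors from 0.88 to 1.12 with step size 0.04 (7 values);
# the small epsilon keeps the endpoint despite floating-point drift.
alphas = np.round(np.arange(0.88, 1.12 + 1e-9, 0.04), 2)
```

Each utterance (feature warping) or model set (model warping) is evaluated at every factor in this grid.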

20 Feature vs. Model Warping

21 Summary of Results Recognition accuracy for children's utterances

22 Conclusion The model-based approach achieves performance essentially identical to that of the feature-based approach at the same warping factor The model-based approach saves computational cost and development time while keeping the same recognition accuracy as the feature-based approach

