Feature vs. Model Based Vocal Tract Length Normalization for a Speech Recognition-based Interactive Toy
Jacky CHAU
Department of Computer Science and Engineering
The Chinese University of Hong Kong
Outline
Introduction to the Interactive Toy
Storytelling Engine
Authoring Environment
Speech Recognition Engine
Background of Vocal Tract Length Normalization (VTLN) in the Speech Recognition Engine, particularly suited to children's utterances
Results
Conclusion
What is an Interactive Toy?
Unlike a traditional storytelling machine (one speaks, one listens), the player can talk to the toy, interact with it, and decide how the story branches
(Video demonstration)
Interactive Story Teller
Storytelling Engine
Speech Recognizer
Authoring Environment
Storytelling Engine Interface
Storytelling Engine
Stories are retrieved in a standardized database format, from the local machine or the Internet
Multimedia (video, audio and image) presentation of story segments
Branching controlled by input from the speech recognizer
Authoring Environment
A tool for creating, modifying and managing multipath stories. What are multipath stories?
Multipath Stories
An alternative to traditional linear narrative
A graph consisting of nodes (story segments) and edges connecting the nodes
The path traversed depends on user input
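The node-and-edge structure above can be sketched as a small data structure. This is a minimal illustration only; the class and field names (`StoryNode`, `Branch`, `next_node`) are invented for this sketch and are not from the actual system, though the node/branch properties mirror those listed in the Authoring Environment slide.

```python
# Hypothetical sketch of a multipath story graph: nodes are story
# segments, branches carry the spoken keyword that selects them.
from dataclasses import dataclass, field

@dataclass
class Branch:
    name: str
    keyword: str   # spoken keyword that selects this branch
    target: str    # name of the destination node

@dataclass
class StoryNode:
    name: str
    text: str
    branches: list = field(default_factory=list)

story = {
    "intro": StoryNode("intro", "A dragon blocks the road. Fight or run?",
                       [Branch("fight", "fight", "battle"),
                        Branch("run", "run", "escape")]),
    "battle": StoryNode("battle", "You draw your sword..."),
    "escape": StoryNode("escape", "You sprint away..."),
}

def next_node(current: StoryNode, spoken_keyword: str) -> str:
    """Follow the branch whose keyword matches the recognizer output."""
    for b in current.branches:
        if b.keyword == spoken_keyword:
            return b.target
    return current.name  # stay on the current node if nothing matched

print(next_node(story["intro"], "fight"))  # -> battle
```

Traversal then reduces to repeatedly feeding the keyword-spotter output into `next_node` until a node with no branches (a story ending) is reached.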
Authoring Environment (cont'd)
Writes stories in a standard database format that can be shared over the Internet
Graphical user interface visualizes interactive stories as a graph structure
Editable node properties: name, text, audio, image
Editable branch properties: name, keyword
User Interface
Speech Recognition Engine
The KEY to interaction
Phoneme modeling allows recognition of arbitrary keywords/phrases
Keyword spotting allows fewer constraints on users' utterances
Vocal tract length normalization (VTLN) adapts to children's utterances, increasing recognition accuracy
Adult vs. Child Utterances
Different vocal tract lengths
Different formant positions, even for the same word
Reason: the vocal tract behaves like a tube; the longer the tube, the lower the formants
(Figure: adult vs. child spectra, showing the shifted 3rd formant)
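The tube analogy can be made concrete with the quarter-wavelength resonator model: a uniform tube closed at the glottis and open at the lips has resonances F_k = (2k − 1)·c / (4L). This is a textbook idealization, not a computation from the slides; the tube lengths (~17 cm adult, ~11 cm child) are typical figures assumed here.

```python
# Quarter-wavelength resonator model of the vocal tract (an
# idealization; lengths below are typical values, not measured data).
C = 34300.0  # speed of sound in air, cm/s

def formants(length_cm, n=3):
    """F_k = (2k - 1) * c / (4 * L) for a tube closed at one end."""
    return [(2 * k - 1) * C / (4.0 * length_cm) for k in range(1, n + 1)]

adult = formants(17.0)  # roughly 500, 1500, 2500 Hz
child = formants(11.0)  # every formant noticeably higher
```

Shortening the tube from 17 cm to 11 cm raises every formant by the same factor (~1.55), which is exactly the kind of systematic frequency shift that a single warping factor can compensate.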
Why Do We Need VTLN?
Feature extraction captures vocal tract shape properties
Normally, adult features are used to train the models, which are represented by Gaussian mixtures
This creates a mismatch between children's features and adult models, analogous to a mismatch in template matching
Solution: pull the utterance toward the models (feature warping), or pull the models toward the utterance (model warping)
Why Do We Need Model Warping?
Feature warping is performed on every incoming speech frame, whereas model warping can use pre-calculated models. Thus, model warping has a lower computational cost during real-time recognition.
Feature warping requires modifying the front-end feature extractor, but model warping leaves that configuration unchanged. Thus, the development time for model warping is shorter.
Feature Warping Approach
Frequency warping by interpolation during feature extraction (every frame)
Binning and the remaining front-end blocks then produce the MFCCs
Motivation of Model Warping
Binning groups the spectrum coefficients
Neighboring coefficients are highly correlated, so the binned coefficients remain related
(Figure: binned spectrum vs. frequency)
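The binning step can be sketched with triangular filters whose supports overlap, which is why adjacent binned coefficients stay correlated. This toy version spaces the filters on a linear axis for brevity; a real MFCC front end spaces them on the mel scale, and the filter count (24) is an assumption.

```python
# Toy triangular filterbank: adjacent filters share spectral bins,
# so the binned energies are correlated. Linear spacing is a
# simplification (real MFCC binning is mel-spaced).
import numpy as np

def triangular_filterbank(n_fft_bins, n_filters):
    centers = np.linspace(0, n_fft_bins - 1, n_filters + 2)
    fb = np.zeros((n_filters, n_fft_bins))
    bins = np.arange(n_fft_bins)
    for i in range(n_filters):
        lo, c, hi = centers[i], centers[i + 1], centers[i + 2]
        rising = (bins - lo) / (c - lo)
        falling = (hi - bins) / (hi - c)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

fb = triangular_filterbank(257, 24)
rng = np.random.default_rng(0)
energies = fb @ np.abs(rng.standard_normal(257))  # binned spectrum
```

The overlap (each filter's falling slope coincides with its neighbour's rising slope) is the structural reason the filterbank-domain coefficients can be warped coherently.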
Model Warping Approach
Transform the MFCC model parameters to the filterbank domain
Warp in the filterbank domain
Transform back to the MFCC model
Model Warping Procedure
Transform the MFCCs to filterbank coefficients
Warp by interpolation in the filterbank domain (similar to feature warping)
Transform back to MFCCs
The delta and acceleration MFCCs undergo a similar transformation
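The three steps above can be sketched on a single Gaussian mean vector. This is a sketch under stated assumptions: an orthonormal DCT-II is used as the cepstral transform and 24 filterbank channels are assumed, whereas the actual system (e.g. an HTK-style front end) may use a differently scaled DCT and liftering.

```python
# Sketch of warping one MFCC mean vector via the filterbank domain.
# Assumes an orthonormal DCT-II cepstral transform and 24 channels;
# the real front end's exact transform may differ.
import numpy as np
from scipy.fftpack import dct, idct

N_FILTERS = 24

def warp_mfcc_mean(mfcc_mean, alpha):
    # 1. MFCC -> log filterbank (inverse DCT; truncated cepstra zero-padded)
    cep = np.zeros(N_FILTERS)
    cep[:len(mfcc_mean)] = mfcc_mean
    logfb = idct(cep, type=2, norm='ortho')
    # 2. warp by interpolation in the filterbank domain
    idx = np.arange(N_FILTERS)
    warped = np.interp(np.clip(idx / alpha, 0, N_FILTERS - 1), idx, logfb)
    # 3. back to MFCC, keeping the original number of coefficients
    return dct(warped, type=2, norm='ortho')[:len(mfcc_mean)]
```

Because the transform acts on stored Gaussian means rather than on frames, the whole grid of warped models can be precomputed offline, which is the computational advantage claimed on the previous slide.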
Experimental Setup
Feature vectors with 39 elements: 12 MFCCs plus frame energy, with delta and acceleration coefficients
The 11 digit models were 9-state left-to-right HMMs with a single mixture component; the silence model had 3 states, 4 mixtures and a loop-back path
Training data: 25 men and 25 women (3570 utterances)
Testing data: 25 boys and 25 girls (3570 utterances)
Warping factors from 0.88 to 1.12 with a step size of 0.04
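As a quick sanity check on the setup, the warping-factor grid works out to seven models per digit:

```python
# The warping-factor grid from the setup: 0.88 to 1.12, step 0.04.
import numpy as np
alphas = np.round(np.arange(0.88, 1.12 + 1e-9, 0.04), 2).tolist()
print(alphas)  # [0.88, 0.92, 0.96, 1.0, 1.04, 1.08, 1.12]
```

With model warping, all seven warped model sets can be generated once offline and the best-scoring one selected at recognition time.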
Feature vs. Model Warping
Summary of Results
Recognition accuracy for children's utterances
Conclusion
The model-based approach achieves performance essentially identical to that of the feature-based approach at the same warping factor.
The model-based approach saves computational cost and development time while matching the recognition accuracy of the feature-based approach.