
1 Feature vs. Model Based Vocal Tract Length Normalization for a Speech Recognition-based Interactive Toy Jacky CHAU Department of Computer Science and Engineering The Chinese University of Hong Kong

2 Outline Introduction to the Interactive Toy
Storytelling Engine Authoring Environment Speech Recognition Engine Background of Vocal Tract Length Normalization (VTLN) used in the Speech Recognition Engine, especially suitable for children's utterances Results Conclusion

3 What is an Interactive Toy?
Unlike a traditional storytelling machine (one speaks and one listens), the player can talk to the toy, interact with it and decide the story branching Video Show

4 Interactive Story Teller
Story Telling Engine Speech Recognizer Authoring Environment

5 Storytelling Engine Interface

6 Storytelling Engine Stories are retrieved in a standardized database format, from the local machine or the Internet Multiple-media (video, audio and image) presentation of story segments Branching controlled by input from the speech recognizer

7 Authoring Environment
A tool for creating, modifying and managing multipath stories. What are multipath stories?

8 Multipath Stories An alternative to the traditional linear narrative
A graph consisting of nodes (story segments) and edges connecting them The path traversed depends on user input

9 Authoring Environment (cont'd)
Writes stories in a standard database format that can be shared over the Internet Graphical user interface visualizes interactive stories as a graph structure Editable node properties: name, text, audio, image Editable branch properties: name, keyword

10 User Interface

11 Speech Recognition Engine
The KEY to interaction Phoneme modeling allows recognition of arbitrary keywords/phrases Keyword spotting allows fewer constraints on users' utterances Vocal tract length normalization (VTLN) adapts to children's utterances to increase recognition accuracy

12 Adult vs. Child Utterances
Different vocal tract lengths lead to different formant positions, even for the same word. Reason: the vocal tract acts like a tube, and the longer the tube, the lower the formants. [Figure: adult vs. child spectra showing the shifted 3rd formant]
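The tube analogy can be made concrete with the standard uniform-tube (quarter-wavelength) resonator model; a minimal sketch, where the ~17 cm adult and ~12 cm child tract lengths are illustrative assumptions, not values from this work:

```python
# Uniform-tube model of the vocal tract, closed at the glottis and
# open at the lips: resonances fall at odd quarter-wavelengths,
# f_n = (2n - 1) * c / (4 * L).
C = 343.0  # speed of sound in air, m/s (approximate)

def formants(tract_length_m, n_formants=3):
    """First n resonance frequencies (Hz) of a uniform tube of the
    given length -- a first-order model of a neutral vowel."""
    return [(2 * n - 1) * C / (4 * tract_length_m)
            for n in range(1, n_formants + 1)]

adult = formants(0.17)  # ~17 cm tube: about 504, 1513, 2522 Hz
child = formants(0.12)  # ~12 cm tube: about 715, 2144, 3573 Hz
```

Every resonance of the shorter (child) tube lies higher in frequency, which is exactly the adult/child mismatch that VTLN compensates for.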

13 Why do we need VTLN? Feature extraction captures the vocal tract shape properties Normally, adult features are used to train the models, represented by Gaussian mixtures This creates a mismatch between children's features and adult models, much like a mismatch in template matching Solution: pull the utterances toward the models == feature warping pull the models toward the utterances == model warping

14 Why do we need model warping?
Feature warping is performed on every incoming speech frame, but model warping can use pre-calculated models. Thus, model warping has a lower computational cost during real-time recognition. Feature warping needs to modify the front-end feature extractor, but model warping does not need to change that configuration. Thus, the development time of model warping is shorter.

15 Feature Warping Approach
Frequency warping by interpolation during feature extraction (every frame) Binning and the remaining blocks ==> MFCC

16 Motivation of Model Warping
Binning groups the spectrum coefficients Neighboring coefficients are highly correlated, so the binned coefficients are related [Figure: binning of the spectrum vs. frequency]
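Binning can be viewed as a linear map from spectrum bins to channel energies. A toy sketch with linearly spaced triangular filters (real front ends place the filters on the mel scale; this helper is an illustration, not the system's filterbank):

```python
import numpy as np

def triangular_filterbank(n_bins, n_filters):
    """Toy linearly spaced triangular filterbank: each row groups a
    band of neighboring spectrum bins, with overlapping triangles so
    adjacent channels share (correlated) coefficients."""
    centers = np.linspace(0, n_bins - 1, n_filters + 2)
    fb = np.zeros((n_filters, n_bins))
    for m in range(1, n_filters + 1):
        left, center, right = centers[m - 1], centers[m], centers[m + 1]
        for k in range(n_bins):
            if left <= k <= center:
                fb[m - 1, k] = (k - left) / (center - left)
            elif center < k <= right:
                fb[m - 1, k] = (right - k) / (right - center)
    return fb

# binned = triangular_filterbank(257, 24) @ np.abs(frame_fft)
```

The overlap between neighboring triangles is what makes the binned coefficients related, so a warp applied before binning can be approximated by a warp applied after it.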

17 Model Warping Approach
Transform the MFCC model parameters to the filterbank domain Warp in the filterbank domain Transform back to the MFCC model

18 Model Warping Procedure
Transform MFCCs to filterbank coefficients Warp by interpolation in the filterbank domain (similar to feature warping) Transform back to MFCCs The delta and acceleration MFCCs undergo the same transformation
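The three steps can be sketched on a single Gaussian mean vector. This assumes the MFCCs came from an orthogonal DCT-II over the log filterbank energies and zero-pads the truncated cepstrum; `warp_mfcc_mean` and the channel count are illustrative assumptions, not the system's actual code:

```python
import numpy as np
from scipy.fftpack import dct, idct

def warp_mfcc_mean(mfcc_mean, alpha, n_filters=24):
    """Model warping on one Gaussian mean: invert the DCT to recover
    log filterbank energies, warp them by interpolation (as in feature
    warping), then transform back and truncate to the cepstral order."""
    n_ceps = len(mfcc_mean)
    padded = np.zeros(n_filters)
    padded[:n_ceps] = mfcc_mean                 # zero-pad truncated cepstrum
    log_fb = idct(padded, norm='ortho')         # cepstral -> filterbank domain
    chans = np.arange(n_filters, dtype=float)
    warped = np.interp(alpha * chans, chans, log_fb)  # warp channel axis
    return dct(warped, norm='ortho')[:n_ceps]   # back to cepstral domain
```

Since this runs once per Gaussian rather than once per speech frame, the warped models can be pre-computed before recognition starts.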

19 Experimental Setup Feature vectors with 39 elements: 12 MFCCs plus one frame energy, with delta and acceleration coefficients The 11 digit models were 9-state left-to-right HMMs with a single mixture component The silence model had 3 states, 4 mixtures and a loop-back path Training data: 25 men and 25 women (3570 utterances) Testing data: 25 boys and 25 girls (3570 utterances) Warping factors from 0.88 to 1.12 with step size 0.04
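For reference, the grid of warping factors used in the setup can be generated as:

```python
import numpy as np

# Warping factors from 0.88 to 1.12 with step size 0.04 (7 values);
# the small epsilon keeps the endpoint despite floating-point drift.
alphas = np.round(np.arange(0.88, 1.12 + 1e-9, 0.04), 2)
```

Each utterance (feature warping) or model set (model warping) is evaluated at every factor in this grid.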

20 Feature vs. Model Warping

21 Summary of Results Recognition accuracy for children's utterances

22 Conclusion The model-based approach achieves performance essentially identical to that of the feature-based approach at the same warping factor The model-based approach saves computational cost and development time while keeping the same recognition accuracy as the feature-based approach

