
Laboratory for Digital Speech and Audio Processing - DSSP Speech Synthesis in the SPACE Reading Tutor Closing Symposium of the SPACE Project 06 FEB 2009.


2 Speech Synthesis in the SPACE Reading Tutor
Closing Symposium of the SPACE Project, 06 FEB 2009
Yuk On Kong, Lukas Latacz, Werner Verhelst
Laboratory for Digital Speech and Audio Processing, Vrije Universiteit Brussel

3 Introduction

4 To Record or Not to Record: That’s the Question
Pre-recorded speech in existing reading tutors
Advantages / disadvantages?

5 Application-specific TTS
Speaker / voice
Material in speech corpus
How to synthesize
Any extra mode necessary?
The child is too slow…
How to maximize quality

6 Speaker / Voice
Speaker
–Appealing to children
–Female speaker
–Standard Flemish pronunciation, no noticeable regional accent
–Experienced speaker

7 Material in Speech Corpus
Database (about 6 hours)
–Material from stories for children
–Words expected at 6 years of age
–Diphones

8 How to synthesize
Based on the general unit selection paradigm.
Heterogeneous units: units can be of various sizes
Bases:
–Use of longer chunks leads to quality improvement.
–Used for synthesizing domain-specific utterances.
[Fig.: the word “oma” to synthesize and its multi-tier segmentation into word (oma), syllables (o, ma) and segments (_-o, o-m, m-a)]

9 How to synthesize
Basic algorithm:
–Search top-down: at each level, select the longest sequence of targets for which candidates exist, and go to the lower levels where no candidates are found.
Coarticulation:
–Taken into account even across word boundaries
Levels: diphone, syllable, word, phrase
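The top-down search can be sketched roughly as follows. This is a minimal illustration, not the SPACE synthesizer itself: the `Target` class, the `cover` function, and the representation of the database as sets of label tuples are all assumptions made for the example, and the phrase level is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Top-down order of levels (a phrase level could be added on top).
LEVELS = ["word", "syllable", "diphone"]


@dataclass
class Target:
    """Hypothetical target node: a label plus its lower-level children."""
    label: str
    children: List["Target"] = field(default_factory=list)


# Assumed toy database: for each level, every label sequence that occurs
# contiguously in the corpus. A real system would index audio segments.
Database = Dict[str, Set[Tuple[str, ...]]]


def cover(targets: List[Target], db: Database,
          level_idx: int = 0) -> List[Tuple[str, Tuple[str, ...]]]:
    """Cover the targets with the longest runs found at the highest level."""
    level = LEVELS[level_idx]
    chosen: List[Tuple[str, Tuple[str, ...]]] = []
    i = 0
    while i < len(targets):
        run: Tuple[str, ...] = ()
        # Try the longest run starting at i first, then shorter ones.
        for j in range(len(targets), i, -1):
            labels = tuple(t.label for t in targets[i:j])
            if labels in db.get(level, set()):
                run = labels
                break
        if run:
            chosen.append((level, run))
            i += len(run)
        elif level_idx + 1 < len(LEVELS):
            # No candidate at this level: back off to the target's children.
            chosen.extend(cover(targets[i].children, db, level_idx + 1))
            i += 1
        else:
            raise LookupError(f"no candidate for {targets[i].label!r}")
    return chosen


# Toy example built around the word "oma" from the figure.
oma = Target("oma", [Target("o", [Target("_-o")]),
                     Target("ma", [Target("o-m"), Target("m-a")])])
toy_db: Database = {
    "word": {("opa",)},
    "syllable": {("o",)},
    "diphone": {("_-o",), ("o-m", "m-a")},
}
print(cover([oma], toy_db))
```

In this toy database the word "oma" is missing, so the search backs off to the syllable "o" and, for "ma", to a run of two adjacent diphones.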

10 How to synthesize
[Diagram: synthesis pipeline divided into a front-end and a back-end. Module labels: Tokenisation, Text Normalisation, Part of speech, Word Pronunciation, Word Accent, Phrase and Pause Prediction, ToDI Intonation, Silence Insertion, Unit Selection, Speech DB, Unit Concatenation.]
Example input: “Als het flink vriest, kunnen we schaatsen.” (“If it freezes hard, we can go skating.”)

11 How to synthesize
Target prosody is described symbolically
Best sequence of units is selected
–Weighted sum of target and join costs
–Viterbi search
Joins:
–Costs based on spectrum, pitch, energy, duration and adjacency
–PSOLA-based algorithm with optimal coupling

Level      Target cost
Segment    Phonemic identity*; Pause type (if silence)*
Segment    Position in syllable
Syllable   Phoneme sequence; Lexical stress*; ToDI accent*; Is_accented*; Onset and coda type*
Syllable   Onset, nucleus and coda size*; Distance to next/previous stressed syllable (in syllables); Number of stressed syllables until next/previous phrase break
Syllable   Distance to next/previous accented syllable (in syllables); Number of accented syllables until next/previous phrase break
Word       Position in phrase; Part of speech*; Is_content_word*; Has_accented_syllable(s)*; Is_capitalized*; Position in phrase*; Token punctuation*; Token prepunctuation*; Number of words until next/previous phrase break; Number of content words until next/previous phrase break

Features marked with * are also calculated for the neighboring segments, syllables or words. “Neighboring syllables” are restricted to the syllables of the current word. For segments and words, three neighbors on the left and three on the right are taken into account.
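The selection step (a weighted sum of target and join costs, minimised with a Viterbi search) can be illustrated with a short sketch. The function name `viterbi_select`, the two weights, and the toy cost functions are assumptions for the example; the actual cost features are those listed on the slide.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

U = TypeVar("U")


def viterbi_select(
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[int, U], float],
    join_cost: Callable[[U, U], float],
    w_target: float = 1.0,
    w_join: float = 1.0,
) -> Tuple[List[U], float]:
    """Pick one candidate per target so the weighted total cost is minimal."""
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every target needs at least one candidate")
    # best[i][k]: lowest cost of any path ending in candidates[i][k]
    best = [[w_target * target_cost(0, u) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(candidates)):
        row, ptr = [], []
        for u in candidates[i]:
            costs = [best[i - 1][k] + w_join * join_cost(p, u)
                     for k, p in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_min] + w_target * target_cost(i, u))
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)
    # Trace the cheapest path back from the last target.
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][k]
    path: List[U] = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][k])
        k = back[i][k]
    path.reverse()
    return path, total


# Toy units: (index in the speech database, pitch in Hz). Hypothetical data.
def toy_join(prev: Tuple[int, float], cur: Tuple[int, float]) -> float:
    if cur[0] == prev[0] + 1:          # adjacent in the database: free join
        return 0.0
    return 1.0 + abs(cur[1] - prev[1]) / 100.0


toy_candidates = [
    [(10, 100.0), (50, 120.0)],
    [(11, 105.0), (70, 120.0)],
    [(12, 110.0), (71, 90.0)],
]
path, total = viterbi_select(toy_candidates, lambda i, u: 0.0, toy_join)
print(path, total)
```

With a zero target cost, the search prefers the three units that are adjacent in the database, which mirrors the adjacency term in the join costs.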

12 Extra Modes?
Phoneme-by-phoneme mode
–Stress
Syllable mode

13 Extra Modes?
Demonstration:
–Phoneme-by-phoneme (stress on first phoneme)
–Syllable
–Normal mode
Demo words: “Moeilijk” (difficult), “Koffiezetapparaat” (coffee maker)

14 The Child is Too Slow…
Choosing the appropriate reading speed for the child
–Uniform WSOLA time-scaling
–Insertion of additional silences between neighboring words
Reading along
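The combined effect of the two slowing mechanisms on word timing can be sketched as below. The function `word_onsets` and its parameters are hypothetical; the sketch only shows how a uniform scale factor and an inserted silence shift the word onsets that a reading-along display would need.

```python
from typing import List, Sequence


def word_onsets(durations: Sequence[float], scale: float = 1.0,
                gap: float = 0.0) -> List[float]:
    """Onset time (s) of each word after uniform scaling plus inserted silence.

    durations: original word durations in seconds
    scale:     uniform time-scale factor (> 1 means slower speech)
    gap:       extra silence in seconds inserted after every word
    """
    onsets: List[float] = []
    t = 0.0
    for d in durations:
        onsets.append(t)
        t += d * scale + gap
    return onsets


# Three words, read at half speed with a quarter second between words.
print(word_onsets([0.5, 0.25, 0.5], scale=2.0, gap=0.25))
```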

15 The Child is Too Slow…
[Diagram: system architecture. Labels: Reading tutor, Teacher’s module, Audio, Assessment, Error detection, Tracking, Windows XP, Synthesizer, Synthesis module, Playback module, Cygwin, Commands & Timing Info.]

16 How to Maximize Quality
Major synthesis problems
–Join artifacts
–Inappropriate prosody
Interactive tuning of synthesis
–Assisted by quality management
–User can make small changes to the input text

17 How to Maximize Quality
Approach:
–For each word, calculate average target and join costs
–Predictor: [formula not preserved in the transcript], with a threshold based on the max and min of cost c. u_j usually lies between 0 and 1 because of the training settings. Accept if u_j < 0.5 and reject otherwise.
–Weights: linear regression
–Best alphas found iteratively (maximizing F-score)
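Only the decision rule and the F-score criterion survive in the transcript, so the sketch below covers just those two pieces. It assumes that u_j values are already available for each word, that the F-score is computed with "rejected" as the positive class, and that reference labels mark words judged bad by a listener; none of this is confirmed by the slides, and the predictor formula itself is not reconstructed.

```python
from typing import List, Sequence


def reject_flags(u: Sequence[float], cutoff: float = 0.5) -> List[bool]:
    """Slide rule: accept a word if u_j < cutoff, reject otherwise."""
    return [value >= cutoff for value in u]


def f_score(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """F1 score, treating True (rejected / bad word) as the positive class."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical predictor outputs and listener judgments for four words.
u_values = [0.1, 0.7, 0.4, 0.9]
judged_bad = [False, True, True, True]
print(f_score(reject_flags(u_values), judged_bad))
```

An iterative search for the best parameters would call `f_score` once per candidate setting and keep the setting with the highest score.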

18 Other Special Aspects
Phrase and Silence Prediction
Context-dependent Weight Training

19 Phrase and Silence Prediction
Types of pauses: heavy, medium and light
–Phrase breaks: both heavy and medium pauses
Training
–No manual labeling, but based on the pauses automatically labeled in the speech database
–Iterative classification based on these pauses
–Training of memory-based learner (features such as POS, punctuation, ...)
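A memory-based learner stores labeled training instances and classifies a new instance by its nearest stored neighbors. The sketch below uses a plain overlap distance over symbolic features; the feature layout (POS of the current word, POS of the next word, following punctuation) and the "none" label are assumptions for illustration, since the slide only names POS and punctuation as example features.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Instance = Tuple[str, ...]


def overlap_distance(a: Instance, b: Instance) -> int:
    """Number of feature positions where the two instances differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(instance: Instance,
             memory: Sequence[Tuple[Instance, str]], k: int = 1) -> str:
    """Majority label among the k stored instances closest to `instance`."""
    ranked = sorted(memory, key=lambda ex: overlap_distance(instance, ex[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training memory: (POS, next POS, punctuation) -> pause type.
memory: List[Tuple[Instance, str]] = [
    (("N", "V", ","), "medium"),
    (("V", "N", "."), "heavy"),
    (("DET", "N", ""), "none"),
    (("ADJ", "N", ""), "none"),
]
print(classify(("N", "ADV", ","), memory))
```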

20 Context-dependent Weight Training
Automatic adaptation (tuning) of weights
Context-dependent weights
–Context is described symbolically per phone
Training:
–Optimizing weights
–Clustering optimized weights (decision trees)

21 Context-dependent Weight Training
7 subjects
4 conditions: corpus selected randomly or by word frequency, each with context-dependent or untrained weights
25 test utterances, AVI 1-5 (5 utt./level)
Results:

Condition                                                          MOS
Randomly selected corpus; context-dependent weights                3.1
Randomly selected corpus; untrained weights                        3.1
Corpus selected based on word frequency; context-dependent weights 3.3
Corpus selected based on word frequency; untrained weights         3.0

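22 Demonstration
Hierarchical unit selection:
–AVI 1: “Dit is te gek, gilt ze.” (“This is crazy, she shrieks.”)
–AVI 3: “Toch had hij liever de hond gehad.” (“Still, he would rather have had the dog.”)
–AVI 5: “Roel ligt nog een paar dagen in het ziekenhuis.” (“Roel will be in the hospital for a few more days.”)
–AVI 7: “De kleine huizen staan dicht tegen elkaar aan.” (“The small houses stand close up against each other.”)
–AVI 9: “Nou Henk, zie je nu wel dat je moeder hier fantastisch verzorgd wordt!” (“Well, Henk, now you can see that your mother is wonderfully cared for here!”)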

23 WSOLA
Illustration of the WSOLA strategy
Top: original signal
Bottom: WSOLA time-scaling
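Since the figure itself is not preserved, a simplified sketch of the WSOLA idea (waveform similarity overlap-add) may help. For each output frame it searches, within a tolerance around the nominal input position, for the segment that best matches the natural continuation of the previously copied segment, then overlap-adds it with a Hann window. The function name, the parameter defaults, and the use of an unnormalised cross-correlation are choices made for this example, not details taken from the slides.

```python
import math
from typing import List, Sequence


def wsola(x: Sequence[float], alpha: float,
          frame: int = 128, tol: int = 32) -> List[float]:
    """Time-scale x by alpha (alpha > 1 gives longer, slower output).

    frame should be even; tol is the search tolerance in samples.
    """
    if len(x) < frame or alpha <= 0:
        raise ValueError("need len(x) >= frame and alpha > 0")
    hop = frame // 2  # output hop, i.e. 50 % overlap between frames
    # Periodic Hann window: at 50 % overlap, neighboring windows sum to 1.
    win = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame) for n in range(frame)]
    last_start = len(x) - frame  # last valid segment start in the input
    n_frames = max(1, int((len(x) * alpha - frame) // hop) + 1)
    out = [0.0] * ((n_frames - 1) * hop + frame)
    prev = 0  # input start of the previously copied segment
    for k in range(n_frames):
        pos = 0
        if k > 0:
            nominal = int(round(k * hop / alpha))  # where uniform scaling points
            natural = prev + hop                   # natural continuation of prev
            m = min(frame, len(x) - natural)       # samples available to compare
            lo = min(max(0, nominal - tol), last_start)
            hi = min(nominal + tol, last_start)
            pos, best = lo, -math.inf
            for cand in range(lo, hi + 1):
                # Similarity measure: unnormalised cross-correlation.
                score = sum(x[cand + n] * x[natural + n] for n in range(m))
                if score > best:
                    pos, best = cand, score
        for n in range(frame):
            out[k * hop + n] += win[n] * x[pos + n]
        prev = pos
    return out


# Stretch a short sine wave by 50 %.
tone = [math.sin(2 * math.pi * n / 40) for n in range(2000)]
slow = wsola(tone, 1.5)
print(len(tone), len(slow))
```

A production implementation would normally use a normalised similarity measure and vectorised arithmetic; the pure-Python loops here are kept only for readability.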

24 Other Application
Audio-visual TTS
–Example: “The sentence you hear is made out of many combinations of original sound and video, selected from the recordings of natural speech.”
–Database containing about 20 minutes (LIPS Challenge ’08)
–For better audio quality, the database should be much larger

25 Future Work
Optimizing synthesis
–User feedback
Expressive speech synthesis
–Automated prosodic annotations
Quality Management
–Evaluation & optimization of the algorithm
–Compare with the perceived quality of synthesized sentences (MOS)

26 Questions?
Thank you for your attention.
Acknowledgments:
–Prof. Wivine Decoster (our speaker)
–Jacques, Leen and other SPACE members
–Wesley and other DSSP people
–IWT

27 THE END

