1 How (not) to Select Your Voice Corpus: Random Selection vs. Phonologically Balanced
Tanya Lambert, Norbert Braunschweiler, Sabine Buchholz
6th ISCA Workshop on Speech Synthesis, Bonn, Germany, 22-24 August 2007
Copyright 2007, Toshiba Corporation.

2 Overview
- Text selection for a TTS voice
  - Random sub-corpus
  - Phonologically balanced sub-corpus
- Phonetic and phonological inventory of full corpus and its sub-corpora
- Phonetic and phonological coverage of units in test sentences with respect to the full corpus and its sub-corpora
- Voice building: automatic annotation and training
- Objective and subjective evaluations
- Conclusions

3 Selection of Text for a TTS Voice
Voice preparation for a TTS system is affected by:
- Text domain from which text is selected
- Text annotations (phonetic, phonological, prosodic, syntactic)
- The linguistic and signal processing capabilities of the TTS system
- Unit selection method and the type of units selected for speech synthesis
- Corpus training
- Speech annotation (automatic/manual; phonetic details, post-lexical effects)
- Other factors (time and financial resources, voice talent, recording quality, the target audience of a TTS application, etc.)

4 Text Selection
- Our case study tries to answer the following question: What is the effect of different script selection methods on a half-phone unit selection system, automatic corpus annotation and corpus training?
- Full corpus: the ATR American English Speech Corpus for Speech Synthesis (~8 h), used in this year's Blizzard Challenge
  - Random sub-corpus (0.8 h)
  - Phonologically-rich sub-corpus (0.8 h)
[Diagram: Full corpus (~8 h) -> phonologically balanced selection -> Phonbal; Full corpus (~8 h) -> random selection -> Random]

5 Phonologically-Rich Sub-Corpus
[Flow diagram; recoverable labels only]
- Input: phonetically and phonologically transcribed full corpus; lexical units (full corpus); removed stress in consonants
- Set cover algorithm -> Sub-corpus A (1133 sentences)
- 594 sentences covered 1 unit per sentence; 539 sentences (above the cut point)
- Sub-corpus B: sentences from full corpus (emphasis on interrogative, exclamatory, multisyllabic phrases, consonant clusters before and after silence); lexical units (sub-corpus); set cover algorithm
- Sub-corpus B + Sub-corpus A (539 sentences) = Sub-corpus (728 sentences, ~2906 sec)
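The slide names a set cover algorithm but gives no details of it. As a minimal sketch only, assuming each sentence is represented by the set of unit types it contains, a greedy set cover could look like the following. The function name, the toy unit labels, and the reading of the "cut point" as "sentences that contribute more than one new unit" are illustrative assumptions, not the authors' implementation.

```python
def greedy_set_cover(sentence_units):
    """Greedy set cover: repeatedly pick the sentence that adds the most
    not-yet-covered unit types. `sentence_units` maps id -> set of units."""
    uncovered = set().union(*sentence_units.values())
    selected = []  # (sentence id, number of new unit types it contributed)
    while uncovered:
        best = max(sentence_units, key=lambda s: len(sentence_units[s] & uncovered))
        gain = sentence_units[best] & uncovered
        if not gain:
            break
        selected.append((best, len(gain)))
        uncovered -= gain
    return selected


# Toy data with hypothetical unit labels (not taken from the corpus).
toy = {
    "s1": {"a-b", "b-c", "c-d"},
    "s2": {"a-b", "d-e"},
    "s3": {"c-d"},
    "s4": {"e-f"},
}
selection = greedy_set_cover(toy)

# Assumed reading of the cut point: keep sentences that added more than one unit.
above_cut = [sid for sid, gain in selection if gain > 1]
print(selection, above_cut)
```

Greedy selection tends to pick long, unit-dense sentences first and single-unit sentences last, which is one plausible reason for applying a cut point to the tail of the selection.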

6 Random Sub-Corpus
[Flow diagram; recoverable labels only]
- Full corpus -> randomized sequence of sentences
- Removed sentences including foreign words
- Sub-corpus (686 sentences, < 2914 sec) + 1 sentence = 2914 sec
- Sub-corpus (687 sentences, ~2914 sec)
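The random selection on this slide can be sketched as a shuffle followed by taking sentences until a duration budget is reached. This is an illustration under stated assumptions: the function name, the seed, and the toy durations are hypothetical, and the foreign-word filter and the final "+ 1 sentence" adjustment shown on the slide are left out.

```python
import random


def random_subcorpus(durations, budget_sec, seed=0):
    """Shuffle sentence ids, then take them in order until adding the next
    one would exceed the duration budget. `durations` maps id -> seconds."""
    ids = sorted(durations)        # fixed starting order for reproducibility
    random.Random(seed).shuffle(ids)
    chosen, total = [], 0.0
    for sid in ids:
        if total + durations[sid] > budget_sec:
            break
        chosen.append(sid)
        total += durations[sid]
    return chosen, total


# Toy durations in seconds (hypothetical, not corpus data).
toy_durations = {f"s{i}": 4.0 + (i % 5) for i in range(50)}
chosen, total = random_subcorpus(toy_durations, budget_sec=60.0, seed=1)
print(len(chosen), total)
```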

7 Textual and Duration Characteristics of Corpora

                           Full     Arctic   Phonbal  Random
seconds                    28,591   2,914    2,906    2,914
sentences                  6,579    1,032    728      687
words                      79,182   9,196    8,156    8,094
words/sent.                12.0     8.9      11.2     11.8
% sent. with 1-9 words     37.7     54.9     41.0     38.6
% sent. with 10-15 words   27.6     45.1     18.6     26.9
% sent. with > 15 words    34.8     -        40.4     34.5
'?'                        868      1        96       94
'!'                        4        -        -        1
','                        3,977    430      452      410
';'                        30       6        4        3
':'                        17       -        -        -

8 Corpus Selection: Considerations
- Selection of text based on broad phonetic transcription: may be insufficient
- Inclusion of phonological, prosodic and syntactic markings: how to make it effective for a half-phone unit selection system?

Distribution of Unit Types in Full Corpus and its Sub-Corpora

Unit types                    Full     Arctic   Phonbal  Random
diph. (no stress)             1607     1385     1510     1322
lex. diphones                 4332     2716     3306     2735
lex. triphones                17032    7945     8716     8144
sil_CV clusters (no stress)   104      42       46       43
VC_sil clusters (no stress)   184      84       100      75
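The unit-type counts in the table can be illustrated by collecting the distinct adjacent phone pairs in a transcription. The sketch below assumes ARPAbet-style phones with trailing stress digits and treats a "lexical" diphone as one that keeps those stress marks. Both are assumptions made for illustration; the slides do not specify the phone set or the exact unit definitions.

```python
def diphone_types(utterances):
    """Return the set of distinct adjacent phone pairs over all utterances."""
    types = set()
    for phones in utterances:
        types.update(zip(phones, phones[1:]))
    return types


def strip_stress(phones):
    """Drop trailing stress digits (assumed ARPAbet style, e.g. 'AH0' -> 'AH')."""
    return [p.rstrip("012") for p in phones]


# Toy transcriptions (hypothetical, not corpus data).
toy_utts = [
    ["sil", "HH", "AH0", "L", "OW1", "sil"],
    ["sil", "L", "OW0", "sil"],
]
with_stress = diphone_types(toy_utts)
no_stress = diphone_types([strip_stress(u) for u in toy_utts])
print(len(with_stress), len(no_stress))
```

Keeping stress marks splits one plain diphone into several stress-marked variants, which is consistent with the table listing far more lexical diphone types than stress-free ones.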

9 Percentage Distribution of Units in Full Corpus and its Sub-corpora (figure)

10 Distribution of Unit Types in Test Sentences
- Testing distribution of unit types in 400 test sentences
- 100 sentences each from:
  - conv = conversational
  - mrt = modified rhyme test
  - news = news texts
  - novel = sentences from a novel
  - sus = semantically unpredictable sentences

11 Distribution of Lexical Diphone Types per Corpus per Text Genre (figure)

12 Missing Diphone Types from Each Corpus in Relation to Test Sentences (figure)

13 Diphone Types in Each Corpus but not Required in Test Sentences (figure)
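The two comparisons named on slides 12 and 13 amount to set differences between the diphone types in a corpus and those needed by the test sentences. A minimal sketch, with hypothetical names and toy data:

```python
def coverage_report(corpus_types, test_types):
    """Compare the unit types available in a corpus with those a test set needs."""
    missing = test_types - corpus_types   # needed by the test set, absent from the corpus
    unused = corpus_types - test_types    # in the corpus, never needed by the test set
    covered = test_types & corpus_types
    pct = 100.0 * len(covered) / len(test_types) if test_types else 0.0
    return missing, unused, pct


# Toy unit types (hypothetical labels, not corpus data).
corpus_types = {"a-b", "b-c", "c-d", "d-e"}
test_types = {"a-b", "c-d", "e-f", "f-g"}
missing, unused, pct = coverage_report(corpus_types, test_types)
print(sorted(missing), sorted(unused), pct)
```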

14 Voice Building – Automatic Annotation and Training
- Synthesis voices were created from both corpora, Phonbal and Random
- Automatic synthesis voice creation encompasses:
  - Grapheme-to-phoneme conversion
  - Automatic phone alignment
  - Automatic prosody annotation
  - Automatic prosody training (duration, F0, pause, etc.)
  - Speech unit database creation
- Automatic phone alignment:
  - Depends on the quality of grapheme-to-phoneme conversion
  - Depends on the output of text normalisation
  - Uses HMMs with a flat start, i.e. depends on corpus size
  - Respects pronunciation variants
  - Acoustic model topology: three-state Markov, left-to-right with no skips, context-independent, single-Gaussian monophone HMMs
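The model topology and flat start described above can be sketched as follows. This is only an illustration: the self-loop probability of 0.6 is an arbitrary placeholder, the features are reduced to one dimension for brevity, and the function names are hypothetical. It is not the aligner used in the study.

```python
from statistics import fmean, pvariance


def left_to_right_transitions(n_states=3, self_loop=0.6):
    """Transition rows for a left-to-right HMM with no skips.
    Column i is state i; the extra last column is the exit transition."""
    rows = []
    for i in range(n_states):
        row = [0.0] * (n_states + 1)
        row[i] = self_loop            # stay in the same state
        row[i + 1] = 1.0 - self_loop  # move to the next state (or exit)
        rows.append(row)
    return rows


def flat_start(frames, phones, n_states=3):
    """Flat start: every state of every monophone gets the same single
    Gaussian, estimated from all frames (1-D features for simplicity)."""
    mean, var = fmean(frames), pvariance(frames)
    return {p: [(mean, var)] * n_states for p in phones}


A = left_to_right_transitions()
models = flat_start([1.0, 2.0, 3.0, 4.0], ["AH", "L", "OW"])
print(A)
print(models["AH"])
```

Because a flat start begins from global statistics and lets re-estimation separate the phone models, the amount and balance of training data per phone plausibly matters, which fits the slide's remark that this step depends on corpus size.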

15 Voice Building – Automatic Annotation and Training
- Automatic prosody annotation:
  - Prosodizer creates ToBI markup for each sentence
  - Rule-based
  - Depends on quality of phone alignments
  - Depends on quality of text analysis module, i.e. uses PoS, etc.
- Automatic prosody training:
  - Depends on phone alignments, ToBI markup, and text analysis
  - Creates prediction models for:
    - Phone duration
    - Prosodic chunk boundaries
    - Presence or absence of pauses
    - The length of previously predicted pauses
    - The accent property of each word: de-accented, accented, high
    - The F0 contour of each word
- Quality of predicted prosody is an important factor for overall voice quality

16 Objective Evaluation – how good are the phone alignments?
- Comparison of phone alignments in the Phonbal and Random sub-corpora against those in the Full corpus
- Phone alignment of Random corpus is slightly better than that of Phonbal

Metric                     Phonbal   Random
Overlap Rate               95.26     96.35
RMSE of boundaries         6.3 ms    3.3 ms
boundaries within 5 ms     86.6 %    91.8 %
boundaries within 10 ms    97.1 %    99.1 %
boundaries within 20 ms    99.1 %    99.9 %
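The boundary metrics in the table can be computed as sketched below from paired reference and automatic boundary times in milliseconds. The slide does not define Overlap Rate. The version here assumes it is the shared duration of a segment divided by the total span covered by either alignment, which may differ from the authors' definition. All names and numbers are illustrative.

```python
import math


def boundary_metrics(ref_ms, auto_ms, tolerances_ms=(5, 10, 20)):
    """RMSE of boundary errors and percentage of boundaries within each tolerance."""
    errors = [a - r for r, a in zip(ref_ms, auto_ms)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    within = {
        t: 100.0 * sum(abs(e) <= t for e in errors) / len(errors)
        for t in tolerances_ms
    }
    return rmse, within


def overlap_rate(ref_seg, auto_seg):
    """Assumed definition: common duration / duration spanned by either segment."""
    common = max(0.0, min(ref_seg[1], auto_seg[1]) - max(ref_seg[0], auto_seg[0]))
    union = (ref_seg[1] - ref_seg[0]) + (auto_seg[1] - auto_seg[0]) - common
    return common / union if union > 0 else 0.0


# Toy boundaries in ms (hypothetical, not from the study).
ref = [100, 250, 400, 520]
auto = [102, 244, 400, 532]
rmse, within = boundary_metrics(ref, auto)
ovr = overlap_rate((100, 250), (102, 244))
print(rmse, within, ovr)
```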

17 Objective Evaluation – Accuracy of Prosody Prediction
- Comparison of the accuracy of pause prediction, prosodic chunk prediction, and word accent prediction by the modules trained on the Phonbal or on the Random sub-corpus, against the automatic markup of 1000 sentences not in either sub-corpus
- Some prosody modules trained on Random corpus are better

                     Phonbal   Random
Chunks   Precision   58.9      56.3
         Recall      34.2      38.7
Pauses   Precision   63.1      63.4
         Recall      34.1      38.0
acc      Precision   69.7      69.5
         Recall      78.4      78.9
high     Precision   54.7      57.1
         Recall      38.6      41.1
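Precision and recall as reported in the table can be computed from per-item binary decisions as sketched below. The F1 helper is an addition for illustration: the slides do not report F1, and it is included only as one way to weigh the chunk results, where Phonbal has the higher precision and Random the higher recall. Function names and toy labels are hypothetical.

```python
def precision_recall(predicted, reference):
    """Precision and recall (in %) for parallel lists of 0/1 decisions."""
    tp = sum(1 for p, r in zip(predicted, reference) if p and r)
    fp = sum(1 for p, r in zip(predicted, reference) if p and not r)
    fn = sum(1 for p, r in zip(predicted, reference) if not p and r)
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall (not reported on the slide)."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


# Toy decisions, e.g. "is there a pause after this word?" (hypothetical data).
pred = [1, 1, 0, 0, 1]
gold = [1, 0, 0, 1, 1]
p, r = precision_recall(pred, gold)

# Derived from the chunk rows of the table above.
f1_phonbal = f1(58.9, 34.2)
f1_random = f1(56.3, 38.7)
print(p, r, f1_phonbal, f1_random)
```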

18 Subjective Evaluation – Preference Listening Test
- Result of preference test comparing 53 test sentences synthesized with voice Phonbal or voice Random
- 2 groups of listeners:
  - Non-American listeners
  - Native American listeners
- Columns 2 and 3 show the number of times each subject preferred each voice
- Each of the 9 subjects preferred the Random voice

Subject                      Phonbal   Random
Non-American Listeners
  1                          20        33
  2                          21        32
  3                          24        29
  4                          25        28
  All                        90        122
American English Listeners
  1                          21        32
  2                          21        32
  3                          16        37
  4                          23        30
  5                          25        28
  All                        106       159
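The slide reports raw preference counts without a significance test. One way to check whether such counts differ from chance is an exact two-sided binomial (sign) test, sketched here. The choice of test is an assumption, not something the authors state, and pooling judgments across listeners treats them as independent, which is a simplification.

```python
from math import comb


def binomial_two_sided_p(k, n):
    """Exact two-sided p-value for k 'successes' in n trials under p = 0.5."""
    tail = min(k, n - k)
    p_tail = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * p_tail)


# Listener-level sign test: all 9 listeners preferred Random.
p_listeners = binomial_two_sided_p(9, 9)

# Pooled judgments per group, using the "All" rows of the table.
p_non_american = binomial_two_sided_p(122, 90 + 122)
p_american = binomial_two_sided_p(159, 106 + 159)
print(p_listeners, p_non_american, p_american)
```

For the listener-level test, nine out of nine gives 2/512, about 0.004, so a unanimous preference among nine listeners is unlikely under chance even though each individual margin is modest.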

19 Conclusions
- Two synthesis voices were compared in this study:
  - The two voices are based on two separate selections of sentences from the same source corpus
  - The Random corpus was created by a random selection of sentences from the source corpus
  - The Phonbal corpus was created by selecting sentences which optimise its phonetic and phonological coverage
- Listeners consistently preferred the TTS voice built with our system from the Random corpus
- Investigation of the differences between the two sub-corpora revealed:
  - Phonbal has better diphone and lexical diphone coverage
  - Random has better phone alignments
  - Random has slightly better prosody prediction performance

20 Future
- Is the prosody prediction performance only due to better automatic prosody annotation, which is due to better phone alignment?
- Is the random selection inherently better suited to train prosody models on, e.g. because its distribution of sentence lengths is not as skewed as the Phonbal one?
- What exactly is the relation between phone frequency and alignment accuracy?
- Why does the Random corpus have so much better pause alignment when it contains fewer pauses?
- Is it worth trying to construct some kind of prosodically balanced corpus to boost the performance of the trained modules, or would that result in a similar detrimental effect on alignment accuracy?

