Copyright 2007, Toshiba Corporation.
How (not) to Select Your Voice Corpus: Random Selection vs. Phonologically Balanced
Tanya Lambert, Norbert Braunschweiler, Sabine Buchholz
6th ISCA Workshop on Speech Synthesis, Bonn, Germany, 22–24 August 2007

2 Overview
- Text selection for a TTS voice
  - Random sub-corpus
  - Phonologically balanced sub-corpus
- Phonetic and phonological inventory of the full corpus and its sub-corpora
- Phonetic and phonological coverage of units in test sentences with respect to the full corpus and its sub-corpora
- Voice building: automatic annotation and training
- Objective and subjective evaluations
- Conclusions

3 Selection of Text for a TTS Voice
Voice preparation for a TTS system is affected by:
- Text domain from which text is selected
- Text annotations (phonetic, phonological, prosodic, syntactic)
- The linguistic and signal processing capabilities of the TTS system
- Unit selection method and the type of units selected for speech synthesis
- Corpus training
- Speech annotation (automatic/manual; phonetic detail, post-lexical effects)
- Other factors (time and financial resources, voice talent, recording quality, the target audience of the TTS application, etc.)

4 Text Selection
- Our case study tries to answer the following question: what is the effect of different script selection methods on a half-phone unit selection system, automatic corpus annotation and corpus training?
- Full corpus: the ATR American English Speech Corpus for Speech Synthesis (~8 h), used in this year's Blizzard Challenge
- Random sub-corpus (0.8 h)
- Phonologically-rich sub-corpus (0.8 h)
(Diagram: the full corpus (~8 h) yields the Phonbal sub-corpus via phonologically balanced selection and the Random sub-corpus via random selection.)

5 Phonologically-Rich Sub-Corpus
(Flow diagram: the phonetically and phonologically transcribed full corpus feeds a set cover algorithm over its lexical units, giving Sub-corpus A (1,133 sentences; stress in consonants removed). Sentences from the full corpus, with emphasis on interrogative and exclamatory sentences, multisyllabic phrases, and consonant clusters before and after silence, give Sub-corpus B. Sub-corpus B plus the 539 Sub-corpus A sentences above the cut point feed a second set cover pass over the sub-corpus lexical units (594 sentences covered, 1 unit per sentence), yielding the final sub-corpus of 728 sentences, ~2,906 sec.)
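The set-cover selection on this slide can be illustrated with the standard greedy approximation: repeatedly pick the sentence that covers the most still-uncovered unit types. This is a minimal sketch with made-up toy units, not the actual implementation used for the paper (which also involves cut points and stress handling):

```python
def greedy_set_cover(sentences):
    """Greedily pick sentences until no further unit types can be covered.

    sentences: dict mapping sentence id -> set of unit types (e.g. lexical
    diphones) that the sentence contains.  Returns ids in pick order.
    """
    uncovered = set().union(*sentences.values())
    remaining = dict(sentences)
    selected = []
    while uncovered and remaining:
        # the sentence adding the most still-uncovered unit types
        best = max(remaining, key=lambda s: len(remaining[s] & uncovered))
        if not remaining[best] & uncovered:
            break  # no remaining sentence adds coverage
        selected.append(best)
        uncovered -= remaining.pop(best)
    return selected

# toy example with hypothetical diphone-like "units"
corpus = {
    "s1": {"a-b", "b-c"},
    "s2": {"a-b"},
    "s3": {"b-c", "c-d", "d-e"},
}
picked = greedy_set_cover(corpus)  # "s3" first (3 new units), then "s1"
```

Greedy set cover is the usual choice here because the exact problem is NP-hard, while the greedy solution is within a logarithmic factor of optimal.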

6 Random Sub-Corpus
(Flow diagram: the sentences of the full corpus are put into a randomized sequence; sentences containing foreign words are removed; sentences are then taken in order until the duration budget is reached: 686 sentences fall just under 2,914 sec, and adding 1 more sentence gives the final sub-corpus of 687 sentences, ~2,914 sec.)
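The random selection amounts to shuffling the sentence list and accumulating sentences until the duration budget is met. A minimal sketch with hypothetical per-sentence durations (the foreign-word filter is omitted):

```python
import random

def random_subcorpus(durations, budget_sec, seed=0):
    """Take sentences in random order until the duration budget is reached.

    durations: dict mapping sentence id -> duration in seconds.
    Returns the chosen ids and their total duration, which may slightly
    exceed the budget (as on the slide: 686 sentences + 1 = ~2,914 sec).
    """
    ids = sorted(durations)            # fixed base order for reproducibility
    random.Random(seed).shuffle(ids)
    chosen, total = [], 0.0
    for sid in ids:
        if total >= budget_sec:
            break
        chosen.append(sid)
        total += durations[sid]
    return chosen, total

# hypothetical corpus of ten 4-second sentences, 10-second budget
durations = {f"s{i}": 4.0 for i in range(10)}
chosen, total = random_subcorpus(durations, budget_sec=10.0)
```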

7 Textual and Duration Characteristics of Corpora

                Full      Arctic    Phonbal   Random
  seconds       28,591    2,914     2,906     2,914
  sentences     6,579     1,…       728       687
  words         79,182    9,196     8,156     8,094
  '!'           4         -         -         1
  ':'           17        -         -         -

(The words/sentence row, the percentages of sentences with 1–9, 10–15 and >15 words, and the '?', ',' and ';' counts are not recoverable from the transcript.)

8 Corpus Selection - Considerations
- Selection of text based on broad phonetic transcription may be insufficient
- Inclusion of phonological, prosodic and syntactic markings: how to make it effective for a half-phone unit selection system?

Distribution of Unit Types in Full Corpus and its Sub-Corpora
(Table: counts of diphones (no stress), lexical diphones, lexical triphones, sil_CV clusters (no stress) and VC_sil clusters (no stress) in the Full, Arctic, Phonbal and Random corpora; the numbers are not recoverable from the transcript.)

9 Percentage Distribution of Units in Full Corpus and its Sub-corpora

10 Distribution of Unit Types in Test Sentences
- Testing the distribution of unit types in 400 test sentences
- 100 sentences each from: conv = conversational; mrt = modified rhyme test; news = news texts; novel = sentences from a novel; sus = semantically unpredictable sentences

11 Distribution of Lexical Diphone Types per Corpus per Text Genre

12 Missing Diphone Types from Each Corpus in Relation to Test Sentences

13 Diphone Types in Each Corpus but not Required in Test Sentences
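Slides 12 and 13 compare each corpus inventory against the unit types required by the test sentences; the underlying computation is plain set arithmetic. A sketch with toy diphone labels:

```python
def coverage_report(corpus_units, test_units):
    """Diphone types missing from the corpus (slide 12), surplus types not
    required by the test sentences (slide 13), and the covered fraction."""
    missing = test_units - corpus_units
    surplus = corpus_units - test_units
    covered = len(test_units & corpus_units) / len(test_units)
    return missing, surplus, covered

# toy inventories with hypothetical diphone labels
missing, surplus, covered = coverage_report(
    corpus_units={"a-b", "b-c", "c-d"},
    test_units={"a-b", "x-y"},
)
```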

14 Voice Building – Automatic Annotation and Training
- Synthesis voices were created from both corpora, Phonbal and Random
- Automatic synthesis voice creation encompasses:
  - Grapheme-to-phoneme conversion
  - Automatic phone alignment
  - Automatic prosody annotation
  - Automatic prosody training (duration, F0, pauses, etc.)
  - Speech unit database creation
- Automatic phone alignment:
  - Depends on the quality of grapheme-to-phoneme conversion
  - Depends on the output of text normalisation
  - Uses HMMs with a flat start, i.e. depends on corpus size
  - Respects pronunciation variants
  - Acoustic model topology: three-state, left-to-right with no skips, context-independent, single-Gaussian monophone HMMs
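The acoustic-model structure described above (three-state, left-to-right, no skips) corresponds to a transition matrix in which each emitting state can only loop on itself or advance to the next state; a "flat start" then initialises every state's single Gaussian from the global statistics of the training data. A minimal sketch of such a matrix:

```python
# 3-state left-to-right, no-skip topology: row i gives the transition
# probabilities out of state i.  The zero in the top-right corner is what
# forbids skipping the middle state.  The 0.5 values are uninformative
# initial guesses that Baum-Welch re-estimation would replace.
transmat = [
    [0.5, 0.5, 0.0],  # state 0: self-loop or advance to state 1
    [0.0, 0.5, 0.5],  # state 1: self-loop or advance to state 2
    [0.0, 0.0, 1.0],  # state 2: final emitting state (exit modelled separately)
]
```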

15 Voice Building – Automatic Annotation and Training
- Automatic prosody annotation:
  - Prosodizer creates ToBI markup for each sentence
  - Rule-based
  - Depends on the quality of the phone alignments
  - Depends on the quality of the text analysis module, i.e. uses PoS tags, etc.
- Automatic prosody training:
  - Depends on phone alignments, ToBI markup, and text analysis
  - Creates prediction models for:
    - phone duration
    - prosodic chunk boundaries
    - presence or absence of pauses
    - the length of previously predicted pauses
    - the accent property of each word (de-accented, accented, high)
    - the F0 contour of each word
- The quality of the predicted prosody is an important factor for overall voice quality

16 Objective Evaluation – how good are the phone alignments?
- Comparison of phone alignments in the Phonbal and Random sub-corpora against those in the Full corpus
- Phone alignment of the Random corpus is slightly better than that of Phonbal

  Metric                    Phonbal   Random
  Overlap Rate              (values not recoverable)
  RMSE of boundaries        6.3 ms    3.3 ms
  boundaries within 5 ms    86.6 %    91.8 %
  boundaries within 10 ms   97.1 %    99.1 %
  boundaries within 20 ms   99.1 %    99.9 %
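The boundary-accuracy figures in this table can be computed from paired reference/automatic boundary times as below (the Overlap Rate metric is omitted). A sketch using milliseconds and illustrative values:

```python
import math

def boundary_accuracy(ref_ms, hyp_ms, tolerances=(5, 10, 20)):
    """RMSE of boundary placement and the fraction of boundaries falling
    within each tolerance (all times in milliseconds)."""
    diffs = [h - r for r, h in zip(ref_ms, hyp_ms)]
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    within = {t: sum(abs(d) <= t for d in diffs) / len(diffs)
              for t in tolerances}
    return rmse, within

# three boundaries: one perfect, one 10 ms late, one 20 ms early (toy data)
rmse, within = boundary_accuracy([100, 500, 900], [100, 510, 880])
```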

17 Objective Evaluation – Accuracy of Prosody Prediction
- Comparison of the accuracy of pause prediction, prosodic chunk prediction, and word accent prediction by the modules trained on the Phonbal or on the Random sub-corpus, against the automatic markup of 1,000 sentences not in either sub-corpus
- Some prosody modules trained on the Random corpus are better

(Table: precision and recall for chunks, pauses, accented ("acc") and high-accented ("high") words, Phonbal vs. Random; the numbers are not recoverable from the transcript.)
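Precision and recall for pause, chunk and accent prediction reduce to comparing the positions where the trained module predicts an event against the reference markup; a sketch with hypothetical word-index positions:

```python
def precision_recall(predicted, reference):
    """Precision/recall of predicted event positions (e.g. word indices
    carrying a pause or an accent) against a reference markup."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

# hypothetical accents: the module fires at words 1, 4 and 7, while the
# reference markup has accents at words 1, 4, 9 and 12
p, r = precision_recall({1, 4, 7}, {1, 4, 9, 12})
```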

18 Subjective Evaluation – Preference Listening Test
- Result of a preference test comparing 53 test sentences synthesized with voice Phonbal or voice Random
- 2 groups of listeners: non-American listeners and native American English listeners
- Columns 2 and 3 show the number of times each subject preferred each voice
- Each of the 9 subjects preferred the Random voice

  Subject                          Phonbal   Random
  Non-American listeners, all      90        122
  American English listeners, all  (values not recoverable)
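The slide reports only raw preference counts. As an illustration (this analysis is not in the paper), an exact two-sided sign test on the pooled non-American counts, 90 votes for Phonbal vs. 122 for Random, suggests a split this lopsided is unlikely under chance:

```python
from math import comb

def two_sided_sign_test(k, n):
    """Exact two-sided binomial p-value for k successes in n trials
    under the null hypothesis p = 0.5."""
    k = min(k, n - k)                                 # smaller tail
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# 90 Phonbal vs. 122 Random preferences (non-American listeners)
p_value = two_sided_sign_test(90, 90 + 122)  # about 0.03: significant at 5%
```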

19 Conclusions
- Two synthesis voices were compared in this study:
  - The two voices are based on two separate selections of sentences from the same source corpus
  - The Random corpus was created by a random selection of sentences from the source corpus
  - The Phonbal corpus was created by selecting sentences which optimise its phonetic and phonological coverage
- Listeners consistently preferred the TTS voice built with our system from the Random corpus
- Investigation of the differences between the two sub-corpora revealed:
  - Phonbal has better diphone and lexical diphone coverage
  - Random has better phone alignments
  - Random has slightly better prosody prediction performance

20 Future
- Is the prosody prediction performance only due to better automatic prosody annotation, which is in turn due to better phone alignment?
- Is the random selection inherently better suited for training prosody models, e.g. because its distribution of sentence lengths is not as skewed as Phonbal's?
- What exactly is the relation between phone frequency and alignment accuracy?
- Why does the Random corpus have so much better pause alignment when it contains fewer pauses?
- Is it worth trying to construct some kind of prosodically balanced corpus to boost the performance of the trained modules, or would that result in a similar detrimental effect on alignment accuracy?