Joint Prosody Prediction and Unit Selection for Concatenative Speech Synthesis, by Ivan Bulyko and Mari Ostendorf

Joint Prosody Prediction and Unit Selection for Concatenative Speech Synthesis
Ivan Bulyko and Mari Ostendorf
Electrical Engineering Department, University of Washington, Seattle

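Limited Domain Synthesis
[Figure: two pipelines from concept to speech, labeled "Standard Approach" and "Our Approach".]
- Standard approach: prosody prediction assigns a single prosodic target to the canonical pronunciation (example: "Will you return from Seattle to Boston?" with H*, L*, H-H%). Unit selection then runs a dynamic search over the unit DB to obtain a sequence of units, followed by waveform concatenation.
- Our approach: a network of pronunciations with alternative prosodic labels (for example return [H*] or [L*+H], Seattle [L*] or [none], Boston [L*][H-H%] or [H*][H-H%]) is composed with the unit DB, whose arcs carry costs C(i,j), and the best path is found.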
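The joint search in this slide can be illustrated with a small dynamic program. This is only a sketch, not the authors' WFST implementation: it uses a plain Viterbi search in place of composition and best-path operations, and the unit names, the join-cost table, and the zero costs for Boston are hypothetical. The prosody costs for "return" and "Seattle" are the template weights shown on the later WFST slide.

```python
from typing import Callable, Dict, List, Tuple

Candidate = Tuple[str, str]  # (hypothetical unit id, prosodic label)

def best_path(lattice: List[List[Candidate]],
              prosody_cost: List[Dict[str, float]],
              concat_cost: Callable[[str, str], float]):
    """Viterbi search minimising prosody cost plus concatenation cost."""
    scores = [prosody_cost[0][label] for _, label in lattice[0]]
    back: List[List[int]] = []
    for pos in range(1, len(lattice)):
        new_scores, pointers = [], []
        for unit, label in lattice[pos]:
            # Cost of reaching this unit from each candidate at the previous word.
            options = [scores[j] + concat_cost(prev, unit)
                       for j, (prev, _) in enumerate(lattice[pos - 1])]
            j_best = min(range(len(options)), key=options.__getitem__)
            new_scores.append(options[j_best] + prosody_cost[pos][label])
            pointers.append(j_best)
        scores = new_scores
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    j = min(range(len(scores)), key=scores.__getitem__)
    total, path = scores[j], [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, [lattice[pos][j] for pos, j in enumerate(path)]

# Hypothetical candidates: two database units per word, differing in accent.
lattice = [
    [("return_1", "high"), ("return_2", "low")],
    [("Seattle_1", "low"), ("Seattle_2", "none")],
    [("Boston_1", "low"), ("Boston_2", "high")],
]
# Boston costs are not given in the transcript; 0.0 is an assumption.
prosody_cost = [{"high": 0.4, "low": 1.2},
                {"low": 0.5, "none": 0.9},
                {"low": 0.0, "high": 0.0}]
# Invented join costs, for illustration only.
JOIN = {("return_1", "Seattle_1"): 2.0, ("return_1", "Seattle_2"): 0.1,
        ("return_2", "Seattle_1"): 0.3, ("return_2", "Seattle_2"): 1.0,
        ("Seattle_1", "Boston_1"): 0.2, ("Seattle_1", "Boston_2"): 0.6,
        ("Seattle_2", "Boston_1"): 0.5, ("Seattle_2", "Boston_2"): 0.1}

total, units = best_path(lattice, prosody_cost, lambda a, b: JOIN[(a, b)])
```

With these invented join costs the cheapest path picks Seattle with no accent even though the template prefers a low accent there, because the smoother joins outweigh the prosody penalty. That trade-off is what separates joint search from choosing the prosodic target first.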

Choice of Units and Prosodic Categories
Example: "Will you return from Seattle to Boston" annotated with H*, L*, H-H%
Pitch accents:
- high: H*, L+H*
- low: L*, L*+H
- downstepped: !H*, L+!H*, H+!H*
Boundary tones: L-L%, L-H%, H-L%, H-H%
Why symbolic prosodic targets? They capture categorical perceptual differences.

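Modeling Prosody with WFSTs
[Figure: a template WFST and a prosody prediction WFST for "Will you return from Seattle to Boston", combined by union. Accent options marked under the sentence: return low/high, Seattle low/none, Boston low/high, boundary tone H-H%.]
Template arcs (label / cost):
- Will you return[high] from / 0.4
- Will you return[low] from / 1.2
- Seattle[low] / 0.5
- Seattle[none] / 0.9
- to
- Boston[low][H-H%]
- Boston[high][H-H%]
Prosody prediction arcs (label / cost; ds = downstepped):
- from[none] / 0.2
- from[low] / 1.8
- from[high] / 2.2
- from[ds] / 2.7
- Seattle[none] / 1.2
- Seattle[low] / 0.3
- Seattle[high] / 0.8
- Seattle[ds] / (cost missing in the transcript)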
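The union of template and prediction can be sketched as follows. This is a simplification under stated assumptions: each model is reduced to a per-word table of label costs rather than a full automaton, and costs are combined as in the tropical semiring, where the union keeps the cheaper of two alternatives. The grouping of the Seattle arcs into template and prediction is read off the slide, and the Seattle[ds] arc is left out because its cost is missing in the transcript.

```python
from typing import Dict

def union_costs(template: Dict[str, float],
                prediction: Dict[str, float]) -> Dict[str, float]:
    """Union of two weighted label sets: keep every label, at its lower cost."""
    merged = dict(template)
    for label, cost in prediction.items():
        merged[label] = min(cost, merged.get(label, float("inf")))
    return merged

# Costs for the word "Seattle" as they appear on the slide.
seattle_template = {"low": 0.5, "none": 0.9}
seattle_prediction = {"none": 1.2, "low": 0.3, "high": 0.8}

seattle_union = union_costs(seattle_template, seattle_prediction)
cheapest_label = min(seattle_union, key=seattle_union.get)
```

The union lets the search use accents that the template never observed (here "high") while still preferring whichever source assigns the lower cost.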

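Representing Decision Trees with WFSTs
[Figure: a decision tree that splits on feature F, and the equivalent WFST.]
Tree leaves:
- F=a: P(X=s)=0.8, P(X=t)=0.2
- F=b: P(X=s)=0.3, P(X=t)=0.7
WFST arcs (input:output/cost):
- a:s/c(0.8)
- a:t/c(0.2)
- b:s/c(0.3)
- b:t/c(0.7)
Cost function: c(p) = -log(p)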
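A minimal sketch of the mapping on this slide, assuming the tree is given as a table from feature value to leaf distribution. The function name and the tuple layout of an arc are illustrative choices, not part of the original work.

```python
import math
from typing import Dict, List, Tuple

Arc = Tuple[str, str, float]  # (input symbol, output symbol, cost)

def tree_to_arcs(leaves: Dict[str, Dict[str, float]]) -> List[Arc]:
    """Turn each leaf probability P(X=x | F=f) into an arc f:x with cost -log(p)."""
    arcs = []
    for feature_value, distribution in leaves.items():
        for label, p in distribution.items():
            arcs.append((feature_value, label, -math.log(p)))
    return arcs

# The two-leaf tree from the slide.
leaves = {"a": {"s": 0.8, "t": 0.2},
          "b": {"s": 0.3, "t": 0.7}}
arcs = tree_to_arcs(leaves)
costs = {(f, x): c for f, x, c in arcs}
```

Because the costs are negative log probabilities, adding costs along a path corresponds to multiplying probabilities, so the lowest-cost path is the most probable one.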

Modular Structure of Prosody Model
[Figure: block diagram of the prosody model.]
- Utterance level: phrase breaks
- Phrase level: accents, tones
- Components combined (+): prosody prediction WFST, phrase break template prosody WFST, accent & tone template prosody WFST
- Other levels (if necessary)

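Representing Unit DB as WFST
[Figure: units for the words "Seattle", "to", "Boston". Unit $u_i$ is shown with its neighbor $u_{i+1}$, unit $u_k$ with its neighbor $u_{k-1}$, and two distances $d_1$ and $d_2$ are marked between them.]
Concatenation cost: $C(u_i, u_k) = 0.5\,(d_1 + d_2)$
Arc for the word "to": to:$u_k$ / $C(u_i, u_k)$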
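The concatenation cost can be sketched as below. The transcript does not say which pairs of units the distances d1 and d2 compare, nor what distance measure is used. This sketch assumes d1 compares u_i with u_{k-1} (the unit that preceded u_k in the database) and d2 compares u_{i+1} (the unit that followed u_i) with u_k, and it uses Euclidean distance over hypothetical feature vectors.

```python
import math
from typing import Sequence, Tuple

def concatenation_cost(d1: float, d2: float) -> float:
    """C(u_i, u_k) = 0.5 * (d1 + d2), as given on the slide."""
    return 0.5 * (d1 + d2)

def join_arc(word: str, unit_id: str,
             u_i: Sequence[float], u_i_next: Sequence[float],
             u_k_prev: Sequence[float], u_k: Sequence[float]
             ) -> Tuple[str, str, float]:
    """Build the arc word:unit/cost for joining u_i to u_k."""
    d1 = math.dist(u_i, u_k_prev)   # assumption: u_i vs. original left neighbor of u_k
    d2 = math.dist(u_i_next, u_k)   # assumption: original right neighbor of u_i vs. u_k
    return (word, unit_id, concatenation_cost(d1, d2))

# Hypothetical 2-dimensional boundary features for each unit.
arc = join_arc("to", "u_k",
               u_i=(0.0, 0.0), u_i_next=(1.0, 1.0),
               u_k_prev=(3.0, 4.0), u_k=(1.0, 2.0))
```

Under these assumptions the cost is zero when u_i and u_k were adjacent in the original recording (u_{i+1} is u_k and u_{k-1} is u_i), so units that were already contiguous are joined for free.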

Experiments
- 14 target utterances in 3 versions:
  A. no prosody prediction; unit selection is based entirely on the concatenation costs
  B. only one zero-cost prosodic target in the template (all others have very high and equal costs)
  C. a prosody template that allows alternative paths weighted according to their relative frequency
- Travel domain corpus from University of Colorado (~2 hrs)
  - Automatically segmented
  - Annotated with ToBI labels (220 utterances)
- 4 subjects, all native speakers of American English

Conclusions and Future Work
- Combining prosody prediction and unit selection improves naturalness
- The WFST architecture is
  - flexible: accommodates variable-size units and different forms of prosody generation
  - efficient: composition and finding the best path are fast operations, allowing real-time synthesis
- Future work will focus on making these techniques applicable to subword units