Bernd Möbius, CoE MMCI, Saarland University. Text-to-Speech Synthesis, Lecture 7, 8 Dec 2010: Unit Selection Synthesis.



Unit selection synthesis: general procedure

• Dynamic selection of units at synthesis run-time
  - "The best solution to the synthesizer problem is to avoid it." [Carlson & Granström 1991]
  - overcome the restrictions imposed by a fixed unit inventory
  - unit inventory: a large corpus of recorded natural speech
  - select the smallest number of the longest units covering the target phone sequence
  - variable unit size (segments, syllables, words, ...)
  - reduce the perceptual impression of unnaturalness caused by the number of concatenations and by signal processing

Paradigm change

• Inventory construction off-line → unit selection at run-time
• Preserve natural speech as much as possible
  - ideal world: the target utterance is available in the corpus
  - unfortunately, the ideal case is extremely improbable, due to the complexity and combinatorics of language and speech
  - however, longer units may be available in the corpus
• Most extreme strategy (CHATR, Black & Taylor 1994 ...):
  - no modification by signal processing
  - listeners will tolerate occasional glitches if the overall synthesis quality approaches natural speech

Unit selection based on cost functions

• Minimize two cost functions, simultaneously and globally (i.e., for the entire utterance)
  - target cost (unit distortion): how suitable is the candidate?
  - concatenation cost (join cost, continuity distortion): how smooth is the concatenation with adjacent units?

Cost functions

• Target cost (Ct)
  - distance of a candidate unit from the target specification
  - computed at run-time
  - based on features that are inferred from the input text
• Concatenation cost (Cc)
  - distance between two candidates at the concatenation point
  - computable off-line, but usually computed at run-time
  - based on all features in the corpus annotation
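The two cost functions can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the feature names (stress, dur, corpus_pos, f0_end, f0_start) and the dict-based feature vectors are hypothetical.

```python
# Illustrative sketch of target and concatenation costs.
# Feature names and weights are hypothetical examples.

def target_cost(target, candidate, weights):
    """Weighted mismatch between a target specification and a candidate unit.

    Both arguments are dicts of numeric or symbolic features
    (e.g. stress level, predicted duration)."""
    cost = 0.0
    for feat, w in weights.items():
        if isinstance(target[feat], (int, float)):
            cost += w * abs(target[feat] - candidate[feat])  # numeric distance
        else:
            cost += w * (target[feat] != candidate[feat])    # 0/1 mismatch
    return cost

def concat_cost(left, right, f0_weight=1.0):
    """Join cost between two adjacent candidate units.

    Units that were adjacent in the corpus join for free, which favours
    selecting longer contiguous stretches of natural speech."""
    if left["corpus_pos"] + 1 == right["corpus_pos"]:
        return 0.0                                           # adjacent in corpus
    return f0_weight * abs(left["f0_end"] - right["f0_start"])  # F0 continuity
```

The zero-cost adjacency check is what makes the join cost implicitly prefer long units, as the later slides note.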

Selection algorithm

• Minimize Ct and Cc [Hunt & Black 1996]
• [Figure: a sequence of target units above a lattice of candidate units; target costs attach to the candidates, concatenation costs to the transitions between them]

Lattice of candidate units

• Every unit in the corpus is represented by a state in a state-transition network
  - state cost = target cost (unit distortion)
  - transition cost = concatenation cost (continuity distortion)
• Similar to HMM-based speech recognition, but
  - ASR uses probabilistic models
  - unit selection uses cost functions
• The unit selection algorithm selects an optimal sequence of units by finding the path through the state-transition network that jointly minimizes target and join costs [Hunt & Black 1996]
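The path search through this lattice can be sketched as a small dynamic program in the spirit of Viterbi decoding. The sketch below assumes each candidate is a dict carrying a precomputed target cost under a hypothetical key "tcost"; the join cost is passed in as a function.

```python
# Dynamic-programming search over a lattice of candidate units,
# jointly minimizing target and join costs (cf. Hunt & Black 1996).
# Data layout ("tcost" key, dict candidates) is illustrative only.

def select_units(lattice, concat_cost):
    """lattice: list of columns, one per target position; each column is a
    list of candidate dicts with a precomputed target cost in cand["tcost"].
    Returns (total cost, best candidate sequence)."""
    # One surviving (cumulative cost, path) pair per candidate in the column.
    best = [(cand["tcost"], [cand]) for cand in lattice[0]]
    for column in lattice[1:]:
        new = []
        for cand in column:
            # Cheapest extension of any previous path with the join cost.
            prev_cost, prev_path = min(
                ((pc + concat_cost(pp[-1], cand), pp) for pc, pp in best),
                key=lambda entry: entry[0],
            )
            new.append((prev_cost + cand["tcost"], prev_path + [cand]))
        best = new
    return min(best, key=lambda entry: entry[0])
```

With a join cost that is zero for corpus-adjacent candidates, the search naturally strings together long contiguous stretches of the corpus, which is exactly the behavior the slides describe.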

Example: word-based unit selection

• Target utterance: "I have time on Monday."
• Step 1: tabulate all candidate words for the target utterance
• [Figure: candidate table with four candidates for "I" and one each for "have", "time", "on", "Monday"]

Example: word-based unit selection

• Target utterance: "I have time on Monday."
• Step 2: create a directed graph with start (S) and end (E) nodes
• [Figure: graph from S through the candidate words to E, with nodes ordered left to right in time]

Example: word-based unit selection

• Target utterance: "I have time on Monday."
• Step 3: select units along the optimal (cheapest) path
• [Figure: the same graph with the cheapest path from S to E highlighted]

Cost functions: features

• A mismatch between selection criteria in the target vs. the corpus produces costs and results in suboptimal selection
• Commonly used features in the target cost:
  - position in phrase (initial, medial, final)
  - duration (actual vs. predicted by a duration model)
  - sentence mode (declarative, interrogative, continuing)
  - degree of reduction
• Commonly used features in the concatenation cost:
  - adjacency in the corpus (zero cost)
  - left and right phonetic context
  - F0 (and possibly intensity) continuity

Cost functions

• Costs are computed as a weighted sum of the costs incurred on individual features:
  C = w1·c1 + w2·c2 + w3·c3 + ...
• Feature weights are set by an expert or trained on data
  - finding optimal relative weights for the features is still an open problem
• An optimal match on a feature incurs zero cost
  - in the join cost, this prefers sequences of candidates that are adjacent in the corpus, and thus longer units
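One simple way to make the "trained on data" option concrete is an ordinary least-squares fit. This is a hypothetical sketch, not a method from the lecture: it assumes we have, for a set of training examples, the per-feature costs c_i and some measured distortion score (e.g. an acoustic distance to held-out natural speech) that the weighted sum C should predict.

```python
import numpy as np

# Hypothetical sketch: fit the feature weights w_i of
# C = w1*c1 + w2*c2 + ... by least squares against an observed
# distortion score. One possible instance of "trained on data";
# the choice of objective is an assumption, not from the slides.

def fit_weights(feature_costs, distortion):
    """feature_costs: (n_examples, n_features) matrix of per-feature costs c_i
    distortion:    (n_examples,) observed distortion scores
    Returns weights w minimizing ||feature_costs @ w - distortion||^2."""
    w, *_ = np.linalg.lstsq(feature_costs, distortion, rcond=None)
    return w
```

Even with such a fit, the slides' caveat stands: perceptual quality is not a linear function of feature mismatches, so the optimal relative weighting remains an open problem.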

Features: annotation and use in selection

• Off-line:
  - annotation of the speech corpus (feature vectors)
  - phone identities and phone boundaries
  - prosodic features (tones, accents, stress)
  - positional features (in phrase, word, syllable)
  - possibly: concatenation costs
• On-line (at synthesis run-time):
  - target specifications (target feature vectors)
  - target costs
  - typically: concatenation costs

Speech corpus design and size

• The quality of the speech corpus (recordings, annotation, coverage) has a tremendous effect on synthesis quality
• Corpus size is the single most important quality factor
• Some data points:
  - IBM/Cambridge: ~60 min. (ASR corpora)
  - CHATR: phonetically balanced sentences, radio news, isolated sentences: 40 min. (English), 20 min. (Japanese)
  - "bring a novel ... of their own choice" [Campbell 1999]
  - AT&T 1999: news stories and system prompts, ~2 hrs.
  - SmartKom: open + closed domains, 160 min.
  - typical corpus size today: 10+ hrs.

Speech corpus design and size (cont.)

• Careful design of the text materials is essential
  - overall objective: consistently high quality
  - account for all relevant phone variations
  - complete coverage by diphones
  - various text genres (news, dialog, prompts)
  - various speaking styles (prosody, affect)
• Automatic annotation with manual verification
  - small corpora with perfect annotation yield better quality than large corpora without verification

Unit selection synthesis: summary

• Synthesis by re-sequencing and concatenating units selected at run-time from a corpus of natural speech
  + facilitates long units without concatenation
  + reduces the need for signal processing
  + preserves natural speech waveforms
  – tends to produce unstable, unpredictable quality
  – inflexible w.r.t. speaking style and speaker voice
• Standard synthesis technique in the 2000s
  - in competition with HMM-based synthesis (statistical parametric speech synthesis)