2
Joint Prosody Prediction and Unit Selection for Concatenative Speech Synthesis
Ivan Bulyko and Mari Ostendorf
Electrical Engineering Department
University of Washington, Seattle
3
Limited Domain Synthesis
[Diagram: Standard Approach vs. Our Approach]
Stages: Prosody Prediction, Unit Selection, Waveform Concatenation (Concept to Speech)
Standard Approach:
  Canonical Pronunciation: Will you return from Seattle to Boston?
  Prosodic Target: H* L* H-H%
  Unit DB, Dynamic Search, Sequence of Units
Our Approach:
  A Network of Pronunciations:
    Will you return [H*] | return [L*+H]
    from Seattle [L*] | Seattle [none]
    to Boston [L*][H-H%] | Boston [H*][H-H%]
  Unit DB: from Seattle [L*] ... with costs C(i,j)
  Compose, then find best path
4
Choice of Units and Prosodic Categories
Example: Will you return from Seattle to Boston (H* L* H-H%)
Pitch Accents:
  high: H*, L+H*
  low: L*, L*+H
  downstepped: !H*, L+!H*, H+!H*
Boundary Tones: L-L%, L-H%, H-L%, H-H%
Why symbolic prosodic targets? They capture categorical perceptual differences.
5
Modeling Prosody with WFSTs
Example: Will you return from Seattle to Boston
Template labels: return low/high, Seattle low/none, Boston low/high, H-H%
Template arcs (label / cost):
  Will you return[high] from / 0.4
  Will you return[low] from / 1.2
  Seattle[low] / 0.5
  Seattle[none] / 0.9
  to
  Boston[low][H-H%]
  Boston[high][H-H%]
Prosody prediction arcs (label / cost):
  from[none] / 0.2, from[low] / 1.8, from[high] / 2.2, from[ds] / 2.7
  Seattle[none] / 1.2, Seattle[low] / 0.3, Seattle[high] / 0.8, Seattle[ds] / 2.1
  ...
Template + prosody prediction are combined by Union.
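The effect of the union on a single word can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: it assumes costs combine as in the tropical (min, +) semiring, and it assumes that `Seattle[low] / 0.5` and `Seattle[none] / 0.9` belong to the template while the four-way `Seattle[...]` list belongs to the prediction model. The helper `union_costs` and the dictionary layout are hypothetical.

```python
# Hypothetical sketch: each model maps a prosodic label for one word to a cost.
# A real WFST union keeps both arcs; a later best-path search only ever uses
# the cheaper one, so this dictionary keeps just the cheaper cost per label.

def union_costs(*models):
    """Merge alternatives from several models, keeping the lowest cost per label."""
    merged = {}
    for model in models:
        for label, cost in model.items():
            merged[label] = min(cost, merged.get(label, float("inf")))
    return merged

# Costs for the word "Seattle" as read off the slide (assignment to
# template vs. prediction is an assumption).
template = {"low": 0.5, "none": 0.9}
prediction = {"none": 1.2, "low": 0.3, "high": 0.8, "ds": 2.1}

seattle = union_costs(template, prediction)
best_label = min(seattle, key=seattle.get)
print(seattle, best_label)
```

Labels that appear in only one model (here `high` and `ds`) stay available after the union, which is what lets the unit search fall back on them when the database has no good match for the template's preferred labels.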
6
Representing Decision Trees with WFSTs
Decision tree leaves:
  F=a: P(X=s)=0.8, P(X=t)=0.2
  F=b: P(X=s)=0.3, P(X=t)=0.7
Equivalent WFST arcs (input:output/cost):
  a:s/c(0.8), a:t/c(0.2)
  b:s/c(0.3), b:t/c(0.7)
where $c(p) = -\log(p)$
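The mapping from tree leaves to weighted arcs can be written out as a short sketch. The function names and the dictionary layout are illustrative assumptions, and the natural logarithm is assumed because the slide does not state a base.

```python
import math

def cost(p):
    """Cost of a probability, c(p) = -log(p); natural log is an assumption."""
    return -math.log(p)

def tree_to_arcs(leaves):
    """Turn leaf distributions into (input, output, cost) arcs.

    leaves maps a feature value (the path to a leaf) to the leaf's
    distribution over output labels.
    """
    arcs = []
    for feature_value, distribution in leaves.items():
        for output, p in distribution.items():
            arcs.append((feature_value, output, cost(p)))
    return arcs

# The two-leaf tree from the slide.
leaves = {
    "a": {"s": 0.8, "t": 0.2},
    "b": {"s": 0.3, "t": 0.7},
}
arcs = tree_to_arcs(leaves)
for inp, out, w in arcs:
    print(f"{inp}:{out}/{w:.3f}")
```

Because costs are negative log probabilities, adding costs along a path corresponds to multiplying probabilities, so the cheapest path is the most probable one.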
7
Modular Structure of Prosody Model
Levels:
  Utterance level
  Phrase level: phrase breaks; accents, tones
  Other levels (if necessary)
WFST components (combined with "+"):
  Prosody Prediction WFST
  Phrase Break Template Prosody WFST
  Accent & Tone Template Prosody WFST
8
Representing Unit DB as WFST
Example words: Seattle, to, Boston
Units: $u_i$, $u_{i+1}$, $u_{k-1}$, $u_k$; distances: $d_1$, $d_2$
Concatenation Cost: $C(u_i, u_k) = 0.5\,(d_1 + d_2)$
Arc: to:$u_k$ / $C(u_i, u_k)$
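A minimal sketch of how the concatenation cost could drive a best-path search over candidate units follows. Only the formula $C(u_i, u_k) = 0.5\,(d_1 + d_2)$ comes from the slide; the unit names, the distance values, and the `best_path` helper are hypothetical, the distances are taken as given rather than computed from audio, and prosody costs are left out to keep the example small.

```python
def concatenation_cost(d1, d2):
    """C(u_i, u_k) = 0.5 * (d1 + d2), as on the slide."""
    return 0.5 * (d1 + d2)

def best_path(candidates, join_cost):
    """Dynamic-programming search for the cheapest unit sequence.

    candidates: one list of unit ids per word.
    join_cost(prev_unit, unit): cost of concatenating two units.
    Returns (total_cost, path).
    """
    # best[u] = (cost of the cheapest path ending in u, that path)
    best = {u: (0.0, [u]) for u in candidates[0]}
    for units in candidates[1:]:
        nxt = {}
        for u in units:
            prev, (c, path) = min(
                best.items(), key=lambda kv: kv[1][0] + join_cost(kv[0], u)
            )
            nxt[u] = (c + join_cost(prev, u), path + [u])
        best = nxt
    return min(best.values(), key=lambda v: v[0])

# Hypothetical candidate units and (d1, d2) distances for each join.
candidates = [["Seattle_1", "Seattle_2"], ["to_1", "to_2"], ["Boston_1"]]
distances = {
    ("Seattle_1", "to_1"): (0.1, 0.3),
    ("Seattle_1", "to_2"): (0.6, 0.4),
    ("Seattle_2", "to_1"): (0.2, 0.4),
    ("Seattle_2", "to_2"): (0.1, 0.1),
    ("to_1", "Boston_1"): (0.8, 1.0),
    ("to_2", "Boston_1"): (0.2, 0.2),
}

def join_cost(a, b):
    return concatenation_cost(*distances[(a, b)])

total, path = best_path(candidates, join_cost)
print(path, round(total, 3))
```

In a WFST setting the same search is performed by a shortest-path operation on the composed network; the explicit loop here just makes the per-arc costs visible.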
9
Experiments
• 14 target utterances in 3 versions:
  A. no prosody prediction; unit selection is based entirely on the concatenation costs
  B. only one zero-cost prosodic target in the template (all others have very high and equal costs)
  C. a prosody template that allows alternative paths weighted according to their relative frequency
• Travel domain corpus from University of Colorado (~2 hrs)
  – Automatically segmented
  – Annotated with ToBI labels (220 utterances)
• 4 subjects - native speakers of American English
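One way to picture the weighting in version C is to turn relative frequencies of observed prosodic label sequences into path costs. The sketch below is an assumption-laden illustration: it reuses the negative-log cost from the decision-tree slide (the slides do not say that the template weights are computed this way), and both the observed sequences and the `template_costs` helper are invented for the example.

```python
import math
from collections import Counter

def template_costs(observed_sequences):
    """Map each observed label sequence to -log(relative frequency)."""
    counts = Counter(observed_sequences)
    total = sum(counts.values())
    return {seq: -math.log(n / total) for seq, n in counts.items()}

# Hypothetical accent sequences observed for one sentence template.
observed = [
    ("high", "low", "low"),
    ("high", "low", "low"),
    ("low", "none", "high"),
    ("high", "low", "low"),
]
costs = template_costs(observed)
for seq, c in costs.items():
    print(seq, round(c, 3))
```

Under this scheme the most frequent sequence gets the lowest cost but rarer alternatives remain reachable, in contrast with version B, where every alternative to the single target is given the same very high cost.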
10
Conclusions and Future Work
• Combining prosody prediction and unit selection improves naturalness
• The WFST architecture is
  – flexible: accommodates variable size units and different forms of prosody generation
  – efficient: composition and finding the best path are fast operations, allowing real-time synthesis
• Future work will focus on making these techniques applicable to subword units