Joint Prosody Prediction and Unit Selection for Concatenative Speech Synthesis
Ivan Bulyko and Mari Ostendorf
Electrical Engineering Department, University of Washington, Seattle

Limited Domain Synthesis
[Diagram: Concept -> Prosody Prediction -> Unit Selection -> Waveform Concatenation -> Speech]
Example: "Will you return from Seattle to Boston?"
Standard approach: canonical pronunciation with a single prosodic target (H* L* H-H%); a dynamic search over the unit DB yields a sequence of units.
Our approach: a network of pronunciations with alternative prosodic targets, e.g. "Will you return [H*]" or "return [L*+H]", "from Seattle [L*]" or "Seattle [none]", "to Boston [L*][H-H%]" or "Boston [H*][H-H%]". The network is composed with the unit DB, with concatenation costs C(i,j), and the best path is found.

Choice of Units and Prosodic Categories
Example: Will you return from Seattle to Boston (H* L* H-H%)
Pitch accents:
- high: H*, L+H*
- low: L*, L*+H
- downstepped: !H*, L+!H*, H+!H*
Boundary tones: L-L%, L-H%, H-L%, H-H%
Why symbolic prosodic targets? They capture categorical perceptual differences.

Modeling Prosody with WFSTs
Example: Will you return from Seattle to Boston, with accent alternatives low/high, low/none, low/high and boundary tone H-H%.
Example arcs (unit[prosodic label] / cost; ds = downstepped):
- Will you return[high] from / 0.4
- Will you return[low] from / 1.2
- Seattle[low] / 0.5
- Seattle[none] / 0.9
- to
- Boston[low][H-H%]
- Boston[high][H-H%]
- from[none] / 0.2, from[low] / 1.8, from[high] / 2.2, from[ds] / 2.7
- Seattle[none] / 1.2, Seattle[low] / 0.3, Seattle[high] / 0.8, Seattle[ds] / 2.1
- ...
The template and prosody prediction networks are combined by union (+).

Representing Decision Trees with WFSTs
Decision tree leaves:
- F=a: P(X=s)=0.8, P(X=t)=0.2
- F=b: P(X=s)=0.3, P(X=t)=0.7
Equivalent WFST arcs (input:output/cost): a:s/c(0.8), a:t/c(0.2), b:s/c(0.3), b:t/c(0.7)
where $c(p) = -\log(p)$
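The mapping from tree leaves to weighted arcs can be sketched in a few lines of Python. This is only an illustration of the cost definition on the slide, not the authors' implementation: the dictionary layout and the function names `cost` and `tree_to_arcs` are assumptions, and arcs are plain tuples rather than objects from a WFST toolkit.

```python
import math

def cost(p):
    """Arc cost c(p) = -log(p), as defined on the slide."""
    return -math.log(p)

# Leaf distributions from the slide: feature value F -> P(X | F).
leaves = {
    "a": {"s": 0.8, "t": 0.2},
    "b": {"s": 0.3, "t": 0.7},
}

def tree_to_arcs(leaves):
    """Return one (input, output, cost) arc per (feature value, outcome) pair.

    Hypothetical helper: arcs are plain tuples, not a WFST library type.
    """
    return [
        (feature, outcome, cost(p))
        for feature, dist in leaves.items()
        for outcome, p in dist.items()
    ]

arcs = tree_to_arcs(leaves)
for feature, outcome, c in arcs:
    print(f"{feature}:{outcome}/{c:.3f}")
```

Because costs are negative log probabilities, adding costs along a path corresponds to multiplying probabilities, so the lowest-cost path is the most probable one.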

Modular Structure of Prosody Model
[Diagram: the utterance level handles phrase breaks; the phrase level handles accents and tones. Components: Prosody Prediction WFST, Phrase Break Template Prosody WFST, Accent & Tone Template Prosody WFST, combined (+); other levels are added if necessary.]

Representing Unit DB as WFST
[Diagram: units for "Seattle", "to", "Boston"; units $u_i$, $u_{i+1}$, $u_{k-1}$, $u_k$ and distances $d_1$, $d_2$.]
Concatenation cost: $C(u_i, u_k) = 0.5\,(d_1 + d_2)$
Arc label: to:$u_k$ / $C(u_i, u_k)$
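To make the joint search concrete, here is a minimal sketch that combines prosodic target costs with the concatenation cost above and finds the cheapest unit sequence. It swaps in plain dynamic programming over word-sized units for real WFST composition and best-path search. The prosodic costs are the from[...] and Seattle[...] values shown on the "Modeling Prosody with WFSTs" slide; the unit ids, the distance table, and the names `best_path`, `template`, and `distances` are hypothetical.

```python
def concat_cost(d1, d2):
    """Concatenation cost from the slide: C(u_i, u_k) = 0.5 * (d1 + d2)."""
    return 0.5 * (d1 + d2)

def best_path(candidates, target_cost, join_cost):
    """Find the cheapest unit sequence by dynamic programming.

    candidates: one list per word of (unit_id, prosodic_label) pairs.
    target_cost(i, label): prosodic cost of `label` at word position i.
    join_cost(prev_unit, unit): concatenation cost between two units.
    Returns (total_cost, [unit_id, ...]).
    """
    # best[u] = (cost, path) of the cheapest path ending in unit u
    best = {u: (target_cost(0, lab), [u]) for u, lab in candidates[0]}
    for i in range(1, len(candidates)):
        new_best = {}
        for u, lab in candidates[i]:
            total, path = min(
                ((c + join_cost(p, u), pth) for p, (c, pth) in best.items()),
                key=lambda t: t[0],
            )
            new_best[u] = (total + target_cost(i, lab), path + [u])
        best = new_best
    return min(best.values(), key=lambda t: t[0])

# Candidate units for "from" and "Seattle" (unit ids are made up).
candidates = [
    [("from_1", "none"), ("from_2", "low")],
    [("Seattle_1", "low"), ("Seattle_2", "high")],
]
# Prosodic costs taken from the slide's from[...] and Seattle[...] arcs.
template = [
    {"none": 0.2, "low": 1.8},
    {"low": 0.3, "high": 0.8},
]
# Hypothetical (d1, d2) distances for each pair of adjacent units.
distances = {
    ("from_1", "Seattle_1"): (1.0, 1.4),
    ("from_1", "Seattle_2"): (0.1, 0.1),
    ("from_2", "Seattle_1"): (0.0, 0.0),
    ("from_2", "Seattle_2"): (0.6, 0.2),
}

total, path = best_path(
    candidates,
    target_cost=lambda i, label: template[i][label],
    join_cost=lambda u, v: concat_cost(*distances[(u, v)]),
)
print(path, round(total, 2))
```

With these made-up distances the search selects Seattle_2 (label high) even though Seattle[low] has the lower prosodic cost, because its join with from_1 is much cheaper. That trade-off is what a joint search allows and a fixed prosodic target rules out.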

Experiments
- 14 target utterances in 3 versions:
  A. no prosody prediction; unit selection is based entirely on the concatenation costs
  B. only one zero-cost prosodic target in the template (all others have very high and equal costs)
  C. a prosody template that allows alternative paths weighted according to their relative frequency
- Travel domain corpus from University of Colorado (~2 hrs)
  - automatically segmented
  - annotated with ToBI labels (220 utterances)
- 4 subjects, native speakers of American English

Conclusions and Future Work
- Combining prosody prediction and unit selection improves naturalness
- The WFST architecture is
  - flexible: accommodates variable-size units and different forms of prosody generation
  - efficient: composition and finding the best path are fast operations, allowing real-time synthesis
- Future work will focus on making these techniques applicable to subword units
