Generation of F0 Contours Using a Model-Constrained Data-Driven Method
Atsuhiro Sakurai (Texas Instruments Japan, Tsukuba R&D Center)
Nobuaki Minematsu (Dept. of Comm. Eng., The University of Tokyo, Japan)
Keikichi Hirose (Dept. of Frontier Eng., The University of Tokyo, Japan)
Corpus-Based Intonation Modeling
Rule-based approach: ad hoc rules derived from experience
– Human-dependent and labor-intensive
Corpus-based approach: a mapping from linguistic to prosodic features, derived statistically from a database
– Automatic, with the potential to improve as larger corpora become available
– The F0 model: a parametric model that reduces degrees of freedom and improves learning efficiency
F0 Model
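The model diagram/equation on this slide did not survive extraction. For reference (an addition here, not copied from the slide): the parameter names used later (Ap, Aa, t0, t1, t2) are those of Fujisaki's command-response model, conventionally written as

```latex
\log F_0(t) = \log F_b
  + \sum_{i} A_{p,i}\, G_p(t - t_{0,i})
  + \sum_{j} A_{a,j} \left[ G_a(t - t_{1,j}) - G_a(t - t_{2,j}) \right]
```

where Fb is the speaker's base frequency, Gp is the phrase-control impulse response, and Ga is the accent-control step response:

```latex
G_p(t) = \alpha^2 t\, e^{-\alpha t} \quad (t \ge 0), \qquad
G_a(t) = \min\left\{ 1 - (1 + \beta t)\, e^{-\beta t},\ \gamma \right\} \quad (t \ge 0)
```

with both functions zero for t < 0.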
Characteristics of the F0 Model:
– Direct representation of physical F0 contours
– Relatively good correspondence with syntactic structure
– Ability to express an F0 contour with a small number of parameters, giving better training efficiency by reducing degrees of freedom
Training/Generation Mechanism
1) Training phase: linguistic features paired with F0 model parameters, drawn from the prosodic database, feed the training module, which produces the intonation model.
2) Generation phase: the trained intonation model maps linguistic features to F0 model parameters.
Parameter Prediction Using a Neural Network
Neural networks are good at non-linear mappings.
The generalization ability of neural networks can cope with imperfect or inconsistent databases (prosodic databases labeled by hand).
Feedback loops can be used to capture the relation between accentual phrases (partial recurrent networks).
Neural Network Structure
(a) Elman network: input layer → hidden layer → output layer, with a context layer fed back from the hidden layer
(b) Jordan network: input layer → hidden layer → output layer, with a state layer fed back from the output layer
(c) Multi-layer perceptron (MLP): input layer → hidden layer → output layer
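As an illustration (not the authors' code), here is a minimal NumPy sketch of an Elman-style forward pass; the context-layer feedback is what lets a partial recurrent network carry information from one accentual phrase to the next. All layer sizes, weights, and names below are assumptions.

```python
# Minimal Elman-style forward pass (illustrative sketch; layer sizes,
# weights, and names are assumptions, not the authors' implementation).
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 8, 10, 6  # e.g., 8 input features, 6 output features

W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input -> hidden
W_ch = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # context -> hidden
W_hy = rng.normal(scale=0.1, size=(n_out, n_hidden))     # hidden -> output

def elman_forward(phrase_features):
    """Map a sequence of per-phrase feature vectors to output vectors."""
    context = np.zeros(n_hidden)  # context layer starts empty
    outputs = []
    for x in phrase_features:
        hidden = np.tanh(W_xh @ x + W_ch @ context)
        outputs.append(W_hy @ hidden)
        context = hidden  # feedback: hidden state becomes next context
    return outputs

# One utterance = a sequence of per-phrase feature vectors
print(elman_forward([rng.normal(size=n_in) for _ in range(5)]))
```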
Input Features

Input feature                                   Max. value
Position of accentual phrase within utterance   18
Number of morae in accentual phrase             15
Accent type of accentual phrase                  9
Number of words in accentual phrase              8
Part-of-speech of first word                    21
Conjugation form of first word                   7
POS category of last word                       21
Conjugation form of last word                    7
Input Features - Example
Chiisana unagiyani nekkinoyoona monoga minagiru (小さな うなぎ屋に 熱気のような ものが みなぎる; roughly, "Something like a wave of heat fills the small eel restaurant")
For the accentual phrase "unagiyani":
– Position of accentual phrase within utterance: 2
– Number of morae in accentual phrase: 5
– Accent type of accentual phrase: 0
– Number of words in accentual phrase: 2
– POS, conjugation type/category of first word: noun/0
– POS, conjugation type/category of last word: particle/0
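A sketch of how such features might be packed into a network input vector, assuming each value is scaled by the maximum listed on the previous slide; the paper's actual encoding is not shown, and the POS/conjugation category indices below are placeholders.

```python
# Packing the "unagiyani" features into a scaled input vector.
# Scaling by the slide's maxima is an assumption about the encoding;
# NOUN and PARTICLE are hypothetical category indices, since the
# actual category inventory is not given.
MAX_VALUES = [18, 15, 9, 8, 21, 7, 21, 7]

NOUN, PARTICLE = 1, 2  # hypothetical indices into the POS inventory
raw = [
    2,            # position of accentual phrase within utterance
    5,            # number of morae in accentual phrase
    0,            # accent type of accentual phrase
    2,            # number of words in accentual phrase
    NOUN, 0,      # POS and conjugation category of first word
    PARTICLE, 0,  # POS and conjugation category of last word
]

x = [v / m for v, m in zip(raw, MAX_VALUES)]
print(x)  # scaled feature vector in [0, 1]
```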
Output Features
[Figure: phrase and accent commands plotted against time t, with magnitudes Ap and Aa, command times t0, t1, t2, mora boundaries tA, tB, tC, tD marked on the waveform, and the accent nucleus indicated.]

Output feature:
– Phrase command magnitude (Ap)
– Accent command amplitude (Aa)
– Phrase command delay (t0_off = tA - t0)
– Delay of accent command onset (t1_off = tA - t1 or tB - t1)
– Delay of accent command reset (t2_off = tC - t2)
– Phrase command flag

tA, tB, tC, tD: mora boundaries
t0, t1, t2: F0 model parameters
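The delay definitions above can be inverted to recover absolute command times from predicted offsets. A small illustrative helper follows; the function name and the onset_ref switch (for the "tA - t1 or tB - t1" alternative) are assumptions, not from the paper.

```python
# Recover absolute F0-model command times from predicted relative
# delays, using the relations on this slide (t0_off = tA - t0, etc.).
def decode_command_times(t_A, t_B, t_C, t0_off, t1_off, t2_off,
                         onset_ref="A"):
    t0 = t_A - t0_off                                 # phrase command time
    t1 = (t_A if onset_ref == "A" else t_B) - t1_off  # accent command onset
    t2 = t_C - t2_off                                 # accent command reset
    return t0, t1, t2

# Example with made-up mora boundaries and delays (seconds)
print(decode_command_times(0.10, 0.25, 0.60, 0.15, 0.02, 0.05))
```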
Parameter Prediction Using Binary Regression Trees
Background:
– Neural networks provide no additional insight into the modeling
– Binary regression trees are human-interpretable
– The knowledge obtained from binary regression trees could be fed back into other kinds of modeling
Outline:
– Input and output features identical to the neural network case
– Tree-growing stop criterion: minimum number of examples per leaf node (see the sketch below)
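A minimal sketch of such a predictor, using scikit-learn's DecisionTreeRegressor as a stand-in (the authors' tree implementation is not specified); its min_samples_leaf parameter plays the role of the minimum-examples-per-leaf stop criterion, and 30 mirrors the best "stop-30" setting reported later.

```python
# Stand-in for the tree-based parameter predictor (illustrative only).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))  # 8 linguistic features per accentual phrase
y = rng.normal(size=500)       # one F0-model parameter, e.g. Aa

# min_samples_leaf = the "minimum number of examples per leaf node"
tree = DecisionTreeRegressor(min_samples_leaf=30).fit(X, y)
print(tree.predict(X[:3]))
```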
Neural network example
Binary regression tree example
Experimental Results (1): MSE for Neural Networks

The error measure is the mean squared error between the natural and generated log F0 contours:

```latex
MSE = \frac{1}{N} \sum_{i=1}^{N} \left[ \log F_{0,i} - \log F'_{0,i} \right]^2
```

Neural net      #Elements in    Mean square
configuration   hidden layer    error
MLP             10              0.218
MLP             20              0.217
Jordan          10              0.220
Jordan          20              0.215
Elman           10              0.214
Elman           20              0.232

Best configuration: elman-10.
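The MSE above translates directly into code; a minimal sketch, assuming frame-synchronous natural and generated F0 contours as positive arrays (an illustrative helper, not the authors' evaluation script):

```python
# Mean squared error between natural and generated log F0 contours.
import numpy as np

def log_f0_mse(f0_true, f0_pred):
    f0_true, f0_pred = np.asarray(f0_true), np.asarray(f0_pred)
    return np.mean((np.log(f0_true) - np.log(f0_pred)) ** 2)

print(log_f0_mse([120, 140, 160], [118, 150, 155]))  # made-up values in Hz
```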
Experimental Results (2): MSE for Binary Regression Trees (same MSE measure as above)

Stop        Mean square
criterion   error
10          0.215
20          0.222
30          0.210
40          0.220
50          0.217
50          0.220

Best stop criterion: stop-30.
Experimental Results (3): Comparison with Rule-Based Parameter Prediction

Method                              MSE
Neural network (elman-10)           0.214
Binary regression tree (stop-30)    0.210
Rule set I                          0.221
Rule set II                         0.193

Rule set I: phrase and accent commands derived from rules (including phrase command flag)
Rule set II: phrase and accent commands derived from rules (excluding phrase command flag)
Experimental Results (4): Listening Tests
Number of listeners: 8

Preference (number of sentences):
Neural network            28
Binary regression trees   39
Rule-based                13
Conclusions
Advantages of data-driven intonation modeling:
– No need for ad hoc expertise
– Fast and straightforward learning
Difficulties:
– Prediction errors
– Difficulty in finding cause-effect relations for prediction errors
Future work:
– Explore other learning methods
– Address the data-scarcity problem