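Application and Analysis of the Fujisaki Model in Correspondence with the Hierarchical Discourse Prosody Framework (HPG) in Mandarin
Institute of Linguistics, Academia Sinica
蘇昭宇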
Negsst2007 Outline
Hierarchical Framework of Discourse Prosody (HPG)
  Introduction
  The HPG framework
  Prosodic features and templates of Mandarin fluent speech prosody
  Corpus approach and quantitative evidence
Fujisaki Model (F0 model)
  Auto-extraction
  Phrase components
  Accent components
Predicting cross-phrase F0 patterns with higher-level discourse information using the Fujisaki model
  Experiment & results
Conclusion
Reference
1. Tseng, Chiu-yu (2006). "Prosody Analysis", in Advances in Chinese Spoken Language Processing, edited by Chin-Hui Lee, Haizhou Li, Lin-shan Lee, Ren-Hua Wang and Qiang Huo, World Scientific Publishing, Singapore.
2. Tseng, Chiu-yu, Pin, Shao-huang, Lee, Yeh-lin, Wang, Hsin-min and Chen, Yong-cheng (2005). "Fluent speech prosody: framework and modeling", Speech Communication, Vol. 46, issues 3-4 (July 2005), Special Issue on Quantitative Prosody Modelling for Natural Speech Description and Generation.
3. Fujisaki, H. and Hirose, K. (1984). "Analysis of voice fundamental frequency contours for declarative sentences of Japanese", J. Acoust. Soc. Jpn. (E), 5(4).
4. Mixdorff, H. (2000). "A Novel Approach to the Fully Automatic Extraction of Fujisaki Model Parameters", Proceedings of ICASSP 2000, vol. 3, Istanbul, Turkey.
5. Mixdorff, H., Hu, Y. and Chen, G. (2003). "Towards the Automatic Extraction of Fujisaki Model Parameters for Mandarin", Proceedings of Eurospeech 2003, Geneva.
6. Gu, Wentao, Hirose, K. and Fujisaki, H. (2006). "Comparison of Perceived Prosodic Boundaries and Global Characteristics of Voice Fundamental Frequency Contours in Mandarin Speech", ISCSLP 2006: 31-42.
HPG (Hierarchical Prosodic Phrase Grouping) Framework of Discourse Prosody: Fluent Speech Prosody
Introduction of HPG (1/2)
1. From the bottom up, output fluent speech prosody includes lexical prosody (tone), syntactic prosody (intonation) and discourse prosody (cross-phrase semantic associations).
2. From the top down, the HPG framework represents hierarchical constraints from discourse, syntactic and lexical information. Thus, higher-level prosodic units constrain and govern lower-level ones; lower-level units are subject to, and associated through, higher-level units.
3. Phrases in speech flow should NOT be treated as independent, unrelated prosodic units. Rather, intonation units are subordinate prosodic units subject to HPG specifications.
Introduction of HPG (2/2)
4. Output fluent speech prosody results from cumulative, layered contributions of lexical, syntactic and discourse information. Therefore, prosody does NOT stop at phrase intonation.
5. According to HPG specifications, variations of phrase intonation across the speech flow are systematic and predictable.
HPG (Hierarchical Prosodic Phrase Grouping): Discourse Prosody Hierarchy (units and constraints)
[Figure: a schematic representation of how PGs form spoken discourse]
Speech data annotation
The speech data were manually labeled by independent transcribers for perceived boundaries and breaks (pauses), using a 5-step break labeling system corresponding to our framework.
[Figure: prosodic hierarchy showing the Prosodic Group, Breath Group, Prosodic Phrase (initial, medial and final PP) and PW units, with break labels B2, B3, B4 and B5]
Hand-Labeling Perceived Boundaries (Tseng et al., 1999) in Relation to Prosody Organization: Systematic and Predictable
Label: Definition: Characteristics
B0: reduced syllabic boundary: Syllable truncation, which often occurs in fast or informal fluent speech.
B1: normal syllabic boundary (SYL): Usually no identifiable pause; more of a psycholinguistic unit for native speakers.
B2: prosodic word boundary (PW): Perceived as a boundary, usually followed by a slight change in tone of voice.
B3: prosodic phrase boundary (PPh): A clearly perceived pause.
B4: breath group boundary (BG): The perceived end of an exhale cycle, followed by inhaling to begin another breathing cycle. It may coincide with the end of a speech paragraph, where trailing occurs with final lengthening coupled with weakening of speech sounds. However, the speaker may also breathe and go on without ending the speech paragraph.
B5: prosodic group boundary (PG): A complete speech paragraph ends with final lengthening coupled with weakening of speech sounds. The speaker makes a complete stop, takes a new breath, and begins a new speech paragraph.
COSPRO: Flow chart of speech data processing and annotation (read speech) http://
Task flow:
  Designing Text for Narration
  Text for speakers to read
  Recording Speech in Sound Proof Chambers
  Editing Text to Match Speech Files
  Hand Mapping Recorded Speech with Text
  Converting Text to SAMPA
  Segmenting Speech Files using HTK
  Spot-checking by Hand
  Hand Correcting Mismatch
  Hand-labeling Perceived Prosodic Boundaries
  Analyzing Labeled Speech Data
  PG model
Output files and file names: *.wav, *.SAMPA, *.phn, *.adjust, *.break
Footnotes:
  Text: 1. File extension: *.text. 2. Serial numbers for text and wav files are identical.
  Recording: sampling rate 16000 Hz; sampling format 1 channel, 16-bit linear.
  Hand correction: file extension *.adjust. Adjustments: 1. segment boundaries; 2. characters with multiple pronunciations.
Cross-Phrase Prosodic Features and Templates
Corpus investigations and quantitative analyses enabled us to
1. obtain quantitative evidence of the cumulative contributions of prosodic layers to output prosody,
2. derive cross-phrase hierarchical templates corresponding to every prosodic layer for the following 4 acoustic correlates (Tseng et al., 2004; 2005; 2006):
   1. F0 contour templates
   2. Duration cadence templates
   3. Intensity distribution patterns
   4. Pause cadence templates
Quantitative Analysis and Predictions: F0, Duration, Intensity and Breaks
[Figure: a hierarchical linear model applied to Fujisaki parameters, pause, duration and intensity across the SYL, PW, PPh and BG layers, with residues; the Fujisaki parameters are obtained by auto-extraction from the F0 contour]
The Fujisaki Model
Fujisaki Model (1984): an intonation model
Unit: the syntactically defined simple sentence.
The F0 curve corresponding to a single simple phrase, as defined by syntax, can be generated.
Gradually declining baselines can be generated.
The F0 curve can be decomposed into phrase components (Ap) and accent components (Aa).
Evidence obtained for: Japanese, English, German, Mandarin, Thai, Vietnamese, etc.
The Fujisaki Model (1/2)
F0 = Base frequency + Phrase components + Accent components
The Fujisaki Model (2/2)
Terms: phrase components, accent components, base frequency.
[Equations not preserved; the surviving symbol definitions are listed below without their symbols]
  onset of the first accent command in the jth command pair
  duration of the first accent command in the jth command pair
  timing of the ith phrase command
  magnitude of the ith phrase command
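For orientation, the formulation published in Fujisaki and Hirose (1984) can be written as below. The symbol names (T_{0i}, T_{1j}, T_{2j}, α, β, γ) are the conventional ones and are an assumption here, since the slide's own symbols were lost. The slide refers to accent command pairs, as used for tone languages, which would add an index within each pair; that variant is not shown.

```latex
\ln F_0(t) = \ln F_b
  + \sum_{i=1}^{I} A_{p_i}\, G_p(t - T_{0i})
  + \sum_{j=1}^{J} A_{a_j}\,\bigl[\, G_a(t - T_{1j}) - G_a(t - T_{2j}) \,\bigr]

G_p(t) =
\begin{cases}
  \alpha^{2}\, t\, e^{-\alpha t}, & t \ge 0 \\
  0, & t < 0
\end{cases}
\qquad
G_a(t) =
\begin{cases}
  \min\bigl[\, 1 - (1 + \beta t)\, e^{-\beta t},\ \gamma \,\bigr], & t \ge 0 \\
  0, & t < 0
\end{cases}
```

Here F_b is the base frequency, A_{p_i} and T_{0i} are the magnitude and timing of the ith phrase command, and A_{a_j}, T_{1j}, T_{2j} are the amplitude, onset and offset of the jth accent command. The sum is taken in the logarithmic domain, which is how "F0 = Base frequency + Phrase components + Accent components" is usually read.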
Phrase components: (parameter symbol not preserved) = 0.01~0.05
Accent components: (parameter symbol not preserved) = 0.1~0.5
Simulation of Mandarin Prosody with the Fujisaki Model
[Figure: phrase components (Ap), accent components (Aa) and base frequency (Fb), together with the simulated result]
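A minimal sketch of how an F0 curve is generated from Fb, phrase commands and accent commands, assuming the standard Fujisaki formulation in the log domain. The function names, the constants alpha = 3.0, beta = 20.0, gamma = 0.9 and all command values are illustrative assumptions, not parameters taken from the slides.

```python
import numpy as np

def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism, Gp(t); zero for t < 0."""
    tp = np.clip(t, 0.0, None)
    return alpha ** 2 * tp * np.exp(-alpha * tp)

def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, Ga(t); zero for t < 0."""
    tp = np.clip(t, 0.0, None)
    return np.minimum(1.0 - (1.0 + beta * tp) * np.exp(-beta * tp), gamma)

def fujisaki_f0(t, fb, phrase_cmds, accent_cmds):
    """Superimpose the components in the log domain and return F0 in Hz.

    phrase_cmds: iterable of (T0, Ap); accent_cmds: iterable of (T1, T2, Aa).
    """
    log_f0 = np.full_like(t, np.log(fb), dtype=float)
    for t0, ap in phrase_cmds:
        log_f0 += ap * phrase_response(t - t0)
    for t1, t2, aa in accent_cmds:
        log_f0 += aa * (accent_response(t - t1) - accent_response(t - t2))
    return np.exp(log_f0)

# Illustrative values only: two phrase commands and two accent commands.
t = np.arange(0.0, 3.0, 0.01)
f0 = fujisaki_f0(
    t,
    fb=100.0,
    phrase_cmds=[(0.0, 0.5), (1.5, 0.3)],
    accent_cmds=[(0.3, 0.6, 0.4), (1.8, 2.2, -0.2)],
)
```

Each phrase command contributes a slowly decaying hump, which produces the declining baseline, while each accent command adds a local rise or fall on top of it.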
Simulating/Generating F0 Curves with the Fujisaki Model: Auto-extraction of Parameters (other approaches vs. our approach)
Mixdorff (2000, 2003): Interpolation and Smoothing (1/3)
1. Intermediate F0 values are interpolated for unvoiced speech segments.
2. Microprosodic variations are smoothed out.
3. Feature: very close simulation, one phrase at a time.
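The two preprocessing steps above can be sketched as follows. This is an assumption-laden illustration: linear interpolation in log F0 and a moving average stand in for whatever interpolation and smoothing Mixdorff actually uses, and the function names and the window width are hypothetical.

```python
import numpy as np

def interpolate_unvoiced(f0):
    """Return log F0 with unvoiced frames (0 or NaN) filled by linear interpolation."""
    f0 = np.asarray(f0, dtype=float)
    voiced = np.isfinite(f0) & (f0 > 0)
    idx = np.arange(len(f0))
    return np.interp(idx, idx[voiced], np.log(f0[voiced]))

def smooth(x, width=5):
    """Centered moving average (odd width); edges are padded with the end values."""
    pad = width // 2
    padded = np.pad(np.asarray(x, dtype=float), pad, mode="edge")
    kernel = np.ones(width) / width
    return np.convolve(padded, kernel, mode="valid")

# Illustrative contour: two voiced frames around a two-frame unvoiced gap.
log_f0 = interpolate_unvoiced([100.0, 0.0, 0.0, 200.0])
smoothed = smooth(log_f0, width=3)
```

The output is a continuous log F0 contour with the same number of frames as the input, which is what the later filtering step needs.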
Mixdorff (2000, 2003): High-Pass Filtering and Component Separation (2/3)
High-pass filter (stop frequency at 0.5 Hz)
  High-frequency contour (HFC): the output of the high-pass filter.
  Low-frequency contour (LFC): contains the sum of the phrase components and Fb.
Component separation
  Fb: the overall minimum of the LFC.
  Phrase components: the residual after subtracting Fb from the LFC.
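A rough sketch of the separation step, working on an interpolated log F0 contour. The slide gives only the 0.5 Hz stop frequency, so the filter here is a swapped-in FFT brick-wall high-pass rather than Mixdorff's filter design; the frame rate of 100 frames per second and the function name are also assumptions.

```python
import numpy as np

def separate_components(log_f0, frame_rate=100.0, stop_hz=0.5):
    """Split a log F0 contour into HFC, LFC, log Fb and the phrase residual."""
    log_f0 = np.asarray(log_f0, dtype=float)
    spectrum = np.fft.rfft(log_f0)
    freqs = np.fft.rfftfreq(len(log_f0), d=1.0 / frame_rate)
    spectrum[freqs < stop_hz] = 0.0          # brick-wall high-pass; also removes DC
    hfc = np.fft.irfft(spectrum, n=len(log_f0))
    lfc = log_f0 - hfc                       # slow part: phrase components plus log Fb
    log_fb = lfc.min()                       # Fb is the overall minimum of the LFC
    phrase = lfc - log_fb                    # residual after subtracting Fb
    return hfc, lfc, log_fb, phrase

# Synthetic test contour: a slow 0.2 Hz trend plus a fast 5 Hz ripple.
t = np.arange(1000) / 100.0
contour = 4.6 + 0.2 * np.cos(2 * np.pi * 0.2 * t) + 0.1 * np.sin(2 * np.pi * 5.0 * t)
hfc, lfc, log_fb, phrase = separate_components(contour)
```

On this synthetic contour the fast ripple ends up in the HFC and the slow trend in the LFC, so the phrase residual is non-negative by construction.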
Mixdorff (2000, 2003): Optimizing the Simulated F0 Curve (3/3)
Hill-climbing methodology
1. Construct a sub-optimal solution that meets the constraints of the problem.
2. Take the solution and make an improvement upon it.
3. Repeatedly improve the solution until no more improvements are necessary or possible.
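The three steps above can be sketched for the simplest case: fitting the timing and magnitude of a single phrase command to a target phrase contour. This is a generic coordinate-wise hill climb written for illustration, not Mixdorff's actual optimizer; the function names, the step sizes, alpha = 3.0 and the target values are assumptions.

```python
import numpy as np

def phrase_response(t, alpha=3.0):
    """Phrase impulse response Gp(t); zero for t < 0."""
    tp = np.clip(t, 0.0, None)
    return alpha ** 2 * tp * np.exp(-alpha * tp)

def error(params, t, target):
    """Mean squared error between one phrase command and the target contour."""
    t0, ap = params
    return float(np.mean((ap * phrase_response(t - t0) - target) ** 2))

def hill_climb(t, target, start, steps=(0.05, 0.05), min_step=1e-4, max_iter=10000):
    """Try a step in each parameter; keep improvements, halve steps when stuck."""
    best, best_err = list(start), error(start, t, target)
    steps = list(steps)
    for _ in range(max_iter):
        if max(steps) <= min_step:
            break
        improved = False
        for k in range(len(best)):
            for delta in (steps[k], -steps[k]):
                trial = list(best)
                trial[k] += delta
                trial_err = error(trial, t, target)
                if trial_err < best_err:
                    best, best_err, improved = trial, trial_err, True
        if not improved:
            steps = [s / 2.0 for s in steps]
    return best, best_err

# Illustrative target: one phrase command at T0 = 0.2 s with Ap = 0.4.
t = np.arange(0.0, 3.0, 0.01)
target = 0.4 * phrase_response(t - 0.2)
start = (0.0, 0.2)                      # the initial sub-optimal solution
fit, fit_err = hill_climb(t, target, start)
```

Hill climbing only guarantees a local optimum, which is why the quality of the initial solution matters.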
Gu (顧文濤, 2006): Generating F0 Curves Using Speech Samples from COSPRO 05
1. Gu did NOT consider information above the phrase level.
2. Gu compared the generation results with HPG-labeled results.
Gu (2006): Simulation of F0 Curves without Higher-Level and Boundary Information
Features:
1. Local minima of the LFC are identified, and phrase commands (Ap) are inserted at them.
2. F0 curves and boundaries are generated.
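Locating candidate insertion points can be sketched as a search for local minima of the LFC. This shows only the detection step; how Gu (2006) places and scales each phrase command relative to a minimum is not given in the slides, and the function name and sample values are hypothetical.

```python
import numpy as np

def local_minima(lfc):
    """Return frame indices where the LFC is lower than its left neighbour
    and not higher than its right neighbour (end points are excluded)."""
    lfc = np.asarray(lfc, dtype=float)
    inner = np.arange(1, len(lfc) - 1)
    is_min = (lfc[inner] < lfc[inner - 1]) & (lfc[inner] <= lfc[inner + 1])
    return inner[is_min]

# Illustrative LFC values, one per frame.
lfc = [3.0, 2.0, 1.0, 2.0, 3.0, 2.5, 2.0, 2.5]
candidates = local_minima(lfc)
```

Dividing each index by the frame rate gives a candidate phrase command time in seconds.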
Gu (2006) observed that large variations of Ap exist
1. between two speakers,
2. among boundaries.
We observed:
1. The magnitudes of the Ap inserted at larger boundaries (B4, B5) are similar.
2. Similar patterns exist in BGs or PGs.
Why Higher Level Discourse Information? (1/2)
Gu (2006): the traditional approach without higher-level information
Focus:
1. Isolated phrase intonations and boundaries are generated one at a time.
2. Each generated phrase is simulated and fine-tuned individually.
Problems:
1. Large variations of Ap exist between speakers and among boundaries.
2. The variations cannot be predicted or resolved; concatenating the individually generated phrases cannot yield patterns for technological implementation.
Why Higher Level Discourse Information? (2/2)
The Tseng et al. approach with higher-level discourse information (HPG)
Focus: prediction of fluent speech prosody, i.e., cross-phrase F0 curves and boundary breaks.
Advantages:
1. Multiple phrase intonations and boundaries can be predicted according to HPG specifications.
2. Output prosody is NOT a concatenation of independent, isolated phrase intonations.
3. Between-speaker and among-boundary differences in Ap are systematic and predictable, and are therefore NOT treated as variations in the HPG framework.
4. Useful for technology development (speech synthesis).
Experiments
Hypothesis: Predictions of phrase intonation curves can be improved with higher-level information, because HPG specifies cross-phrase associations. Cumulative contributions from the prosodic layers can provide useful information.
Implications for technology development
Speech Data
Sinica COSPRO 08
Carrier paragraph: a 30-syllable, 3-phrase complex sentence representing a short PG was constructed. A single target syllable was embedded in three PG positions, i.e., PG-initial, PG-medial and PG-final.
"△是一個常見的字,一般人常把△字掛在嘴邊,講話時動不動就會提到△" ("△ is a common character; people often have the character △ on their lips, and mention △ at every turn when speaking.")
Speaking rates: 289 and 308 ms/syllable for M054C and F054C
Target syllable analyzed: Tone 1
Experiment 1
Goals:
1. To derive patterns of Ap from speech data.
2. To find evidence of interaction between phrase commands and higher-level prosodic units.
3. To use the evidence found to predict cross-phrase F0 allocation in speech flow.
Distribution of speech data
PG position    Ap range
PG-Initial     0.959~
PG-Medial      0.615~0.04
PG-Final       0.678~0.093
Ranges of Ap values from phrases produced by female speaker F054c in three PG-related positions.
Distribution of speech data
[Figure: a schematic representation of the distribution of Ap for F054c; the horizontal axis represents values of Ap and the vertical axis represents the number of Ap occurrences]
Results
Expected cell mean at the PPh level without the PG effect: (value not preserved)
Expected cell means at the PG level with the PG effect, for PG-Initial, PG-Medial and PG-Final: (values not preserved)
The expected cell means of predictions with and without the PG effect.
[Figure: a schematic representation of the phrase patterns after the PG effect is taken into consideration]
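The comparison between the two kinds of expected cell mean can be illustrated as follows: one grand mean of Ap for all phrases (no PG effect) versus one mean per PG position (with PG effect). The Ap values below are made-up placeholders, not the measured data, and the function names are hypothetical.

```python
from collections import defaultdict

def cell_means(samples):
    """samples: list of (pg_position, ap). Returns (grand mean, mean per position)."""
    groups = defaultdict(list)
    for position, ap in samples:
        groups[position].append(ap)
    grand = sum(ap for _, ap in samples) / len(samples)
    by_position = {pos: sum(vals) / len(vals) for pos, vals in groups.items()}
    return grand, by_position

def sum_squared_error(samples, predict):
    """Total squared error of a predictor that maps a PG position to an Ap value."""
    return sum((ap - predict(pos)) ** 2 for pos, ap in samples)

# Placeholder Ap values for illustration only.
samples = [
    ("initial", 0.8), ("initial", 0.6),
    ("medial", 0.3), ("medial", 0.2),
    ("final", 0.4), ("final", 0.1),
]
grand, by_position = cell_means(samples)
err_without_pg = sum_squared_error(samples, lambda pos: grand)
err_with_pg = sum_squared_error(samples, lambda pos: by_position[pos])
```

When Ap differs systematically by PG position, the per-position means leave a smaller residual than the single grand mean, which is the sense in which the PG effect improves prediction.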
Examples
A single expected cell mean cannot approximate the LFC well, especially in PG-initial and PG-final positions.
[Figure panels: without PG effect; with PG effect]
Superimposed F0 according to the HPG Framework
[Figure: F0 over time (t) at the Syl, PPh and PG layers]
What Does Higher Level Discourse Information Mean?
Swapping PG-initial and PG-final
[Figure: F0 over time (t), original vs. exchanged]
Further Evidence for HPG, Systematic and Predictable: The Same Base Form with Different Distributions Yields Different Output Prosody Styles
Speech Data
Mandarin rhymed classical writing. Styles: regular, semi-regular, irregular.
Weather broadcast. Style: irregular.
[Tables: # of Syl, # of PPh, # of Discourse and speech rate (ms) for the female and male speakers; values not preserved]
Classification of Stylistic Variations
Styles: regular, irregular, semi-regular

Index  Name of file        # of syllables  Style of writing
07a    正氣歌              300             five-character ancient verse (五言古詩)
17a    慈烏夜啼            90              five-character ancient verse (五言古詩)
18a    古詩十九首之一      80              five-character ancient verse (五言古詩)
19a    長干行              150             regular yuefu verse (工整樂府)
21a    古詩十九首之二      50              five-character ancient verse (五言古詩)
22a    石壕吏              120             regular yuefu verse (工整樂府)
23a    古詩十九首之九      40              five-character ancient verse (五言古詩)
24a    新婚別              160             regular yuefu verse (工整樂府)
26a    古詩十九首之十五    50              five-character ancient verse (五言古詩)
27a    春江花月夜          262             regular yuefu verse (工整樂府)

Index  Name of file        # of syllables  Style of writing
01a    禮運大同篇          107             classical prose (古文)
02a    典論論文            187             classical prose (古文)
03a    孟子齊人            202             classical prose (古文)
06a    雜說四              150             classical prose (古文)
Predictions of Ap from Higher Level Information (B3, B4, B5)
[Figure: layers PPh, BG and PG]
Distributions of Layered Contributions in Each Style (Male)
[Figure panels: rhymed classical writing, m056 (regular, semi-regular, irregular); weather broadcast, m054 (irregular)]
The more regular the style, the bigger the planning templates and the more governing from higher-level information.
Distributions of Layered Contributions in Each Style (Female)
[Figure panels: rhymed classical writing, f054 (regular, semi-regular, irregular); weather broadcast, f054 (irregular)]
The more regular the style, the bigger the planning templates and the more governing from higher-level information.
PPh Contributions in Different Styles
[Figure: PPh contributions for styles r, smr, irr and WB; female and male speakers]
Contributions of BG Layer in Different Styles
[Figure: BG contributions for styles r, smr, irr and WB; female and male speakers]
Conclusions
1. Lexical, syntactic and discourse prosody ALL contribute to output prosody. Their interactions are necessary, systematic and predictable from higher-level considerations. HPG accounts for the prosody of fluent continuous speech.
2. How a semantically complete speech paragraph begins, holds and ends across the phrases within it is specified by HPG-related positions: PG-Initial, PG-Medial and PG-Final.
3. Further evidence from Mandarin rhymed classics substantiates HPG as a base form for both the planning and the processing of fluent speech prosody.
4. Stylistic variations are built on the same base form, with varied distributions of layer contributions.