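The Fujisaki Model and the Hierarchical Discourse Prosody Framework HPG: Application and Analysis in Mandarin. Institute of Linguistics, Academia Sinica. Chao-yu Su (蘇昭宇)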


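1 The Fujisaki Model and the Hierarchical Discourse Prosody Framework HPG: Application and Analysis in Mandarin. Institute of Linguistics, Academia Sinica. Chao-yu Su (蘇昭宇), morison@phslab.ihp.sinica.edu.tw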

2 Negsst2007 Outline
Hierarchical Framework of Discourse Prosody (HPG)
- Introduction
- The HPG framework
- Prosodic features and templates of Mandarin fluent speech prosody
- Corpus approach and quantitative evidence
Fujisaki Model (F0 model)
- Auto-extraction
- Phrase components
- Accent components
Predicting cross-phrase F0 patterns with higher-level discourse information using the Fujisaki model
Experiment and results
Conclusion

3 References
1. Tseng, Chiu-yu (2006). "Prosody Analysis", in Advances in Chinese Spoken Language Processing, edited by Chin-Hui Lee, Haizhou Li, Lin-shan Lee, Ren-Hua Wang and Qiang Huo, World Scientific Publishing, Singapore, pp. 57-76.
2. Tseng, Chiu-yu, Pin, Shao-huang, Lee, Yeh-lin, Wang, Hsin-min and Chen, Yong-cheng (2005). "Fluent speech prosody: framework and modeling", Speech Communication, Vol. 46, Issues 3-4 (July 2005), Special Issue on Quantitative Prosody Modelling for Natural Speech Description and Generation, pp. 284-309.
3. Fujisaki, H. and Hirose, K. (1984). "Analysis of voice fundamental frequency contours for declarative sentences of Japanese", J. Acoust. Soc. Jpn. (E), 5(4), pp. 233-242.
4. Mixdorff, H. (2000). "A Novel Approach to the Fully Automatic Extraction of Fujisaki Model Parameters", Proceedings of ICASSP 2000, Vol. 3, pp. 1281-1284, Istanbul, Turkey.
5. Mixdorff, H., Hu, Y. and Chen, G. (2003). "Towards the Automatic Extraction of Fujisaki Model Parameters for Mandarin", Proceedings of Eurospeech 2003, Geneva.
6. Gu, Wentao, Hirose, K. and Fujisaki, H. (2006). "Comparison of Perceived Prosodic Boundaries and Global Characteristics of Voice Fundamental Frequency Contours in Mandarin Speech", ISCSLP 2006, pp. 31-42.

4 HPG (Hierarchical Prosodic Phrase Grouping) Framework of Discourse Prosody: Fluent Speech Prosody

5 Introduction of HPG (1/2)
1. From the bottom up, output fluent speech prosody includes lexical prosody (tone), syntactic prosody (intonation) and discourse prosody (cross-phrase semantic associations).
2. From the top down, the HPG framework represents hierarchical constraints from discourse, syntactic and lexical information. Thus, higher-level prosodic units constrain and govern lower-level ones; lower-level units are subject to, and associated through, higher-level units.
3. Phrases in speech flow should NOT be treated as independent, unrelated prosodic units. Rather, intonation units are subordinate prosodic units subject to HPG specifications.

6 Introduction of HPG (2/2)
4. Output fluent speech prosody results from cumulative, layered contributions from lexical, syntactic and discourse information. Therefore, prosody does NOT stop at phrase intonation.
5. According to HPG specifications, variations of phrase intonation across speech flow are systematic and predictable.

7 HPG (Hierarchical Prosodic Phrase Grouping): Discourse Prosody Hierarchy (units and constraints)
[Figure: a schematic representation of how PGs form spoken discourse]

8 Speech data annotation
The speech data were manually labeled by independent transcribers for perceived boundaries and breaks (pauses), using a 5-step break labeling system corresponding to our framework.
[Figure: a Prosodic Group (bounded by B5) contains Breath Groups (B4); a Breath Group contains initial, medial and final Prosodic Phrases (B3); a Prosodic Phrase contains Prosodic Words (PW, B2)]

9 Hand Labeling Perceived Boundaries (Tseng et al, 1999) in Relation to Prosody Organization: Systematic and Predictable
Break index, definition and characteristics:
- B0, reduced syllabic boundary: Syllable truncation, often occurring in fast or informal fluent speech.
- B1, normal syllabic boundary (SYL): Usually with no identifiable pause; more of a psycholinguistic unit for native speakers.
- B2, prosodic word boundary (PW): Perceived as a boundary, usually followed by a slight change in tone of voice.
- B3, prosodic phrase boundary (PPh): A clearly perceived pause.
- B4, breath group boundary (BG): Perceived end of an exhale cycle, followed by inhaling to begin another breathing cycle. It may be where a speech paragraph ends, with trailing off marked by final lengthening coupled with weakening of speech sounds. However, the speaker may still go on after breathing without ending the speech paragraph.
- B5, prosodic group boundary (PG): A complete speech paragraph ends with final lengthening coupled with weakening of speech sounds. The speaker makes a complete stop, takes a new breath, and begins a new speech paragraph.
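As a small illustration of how the break labels above map onto prosodic units, the sketch below groups a labeled syllable sequence into units at a chosen layer. It assumes that a boundary of a higher index also closes every lower-level unit, which is one reading of the hierarchy and not something the slide states outright. The names `BREAK_UNITS` and `group_units` are hypothetical.

```python
# Hypothetical lookup table: break index -> unit closed by that break.
BREAK_UNITS = {
    0: "reduced syllabic boundary",
    1: "SYL",
    2: "PW",
    3: "PPh",
    4: "BG",
    5: "PG",
}


def group_units(syllables, breaks, level):
    """Split syllables into units closed by any break index >= level.

    breaks[i] is the break index (0-5) labeled after syllables[i].
    Assumption: a higher-level boundary also ends lower-level units.
    """
    units, current = [], []
    for syllable, brk in zip(syllables, breaks):
        current.append(syllable)
        if brk >= level:
            units.append(current)
            current = []
    if current:  # trailing syllables without a closing break
        units.append(current)
    return units


# Toy label sequence (made up for illustration, not corpus data).
toy_syllables = ["s1", "s2", "s3", "s4", "s5", "s6"]
toy_breaks = [1, 2, 1, 3, 1, 5]
prosodic_phrases = group_units(toy_syllables, toy_breaks, level=3)
```

With the toy labels, grouping at level 3 gives two prosodic phrases, and grouping at level 5 gives a single prosodic group.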

10 COSPRO (http://www.myet.com/corpora)
Flow chart of speech data processing and annotation: read speech
Task flow:
- Designing text for narration
- Recording speech in soundproof chambers
- Hand-mapping recorded speech with text
- Editing text to match speech files
- Converting text to SAMPA
- Segmenting speech files using HTK
- Spot-checking by hand
- Hand-correcting mismatches
- Hand-labeling perceived prosodic boundaries
- Analyzing labeled speech data
Output files and file names: *.text, *.wav, *.SAMPA, *.phn, *.adjust, *.break, PG model
Footnotes:
- Text for speakers to read: 1. File extension: *.text. 2. Serial numbers for text and wav files are identical.
- Recording: sampling rate 16000 Hz; sampling format 1 channel, 16-bit linear.
- Hand-correcting mismatches: file extension *.adjust. Adjustments: 1. segment boundaries; 2. characters with multiple pronunciations.

11 Cross-Phrase Prosodic Features and Templates
Corpus investigations and quantitative analyses enabled us to
- obtain quantitative evidence of the cumulative contributions of prosodic layers to output prosody,
- derive cross-phrase hierarchical templates corresponding to every prosodic layer in the following 4 acoustic correlates (Tseng et al, 2004; 2005; 2006):
1. F0 contour templates
2. Duration cadence templates
3. Intensity distribution patterns
4. Pause cadence templates

12 Quantitative Analysis and Predictions: F0, Duration, Intensity and Breaks
[Figure labels: F0 contour; auto-extraction for the Fujisaki model; Fujisaki parameters; hierarchical linear model applied to Fujisaki parameters, pause, duration and intensity; layers SYL, PW, PPh and BG; residues]

13 The Fujisaki Model

14 Fujisaki Model (1984): Intonation Model
Unit: a simple sentence as defined by syntax
- The F0 curve corresponding to a single simple phrase, as defined by syntax, can be generated.
- The F0 curve, including its gradually declining baseline, can be decomposed into phrase components (Ap) and accent components (Aa).
Evidence obtained from: Japanese, English, German, Mandarin, Thai, Vietnamese, etc.

15 The Fujisaki Model (1/2)
F0 = Base frequency + Phrase components + Accent components

16 The Fujisaki Model (2/2)
[Equations: base frequency, phrase components, accent components]
Quantities defined on the slide (the symbols themselves are missing from the transcript):
- magnitude of the ith phrase command
- timing of the ith phrase command
- onset of the first accent command in the jth command pair
- duration of the first accent command in the jth command pair
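For reference, the formulation commonly given for the model of Fujisaki and Hirose (1984) is sketched below. The notation is the convention usually seen in the literature and is an assumption here; the slide's own symbols may differ. In this convention the sum is taken in the logarithmic domain, $A_{p_i}$ and $T_{0i}$ are the magnitude and timing of the $i$th phrase command, $A_{a_j}$ is the amplitude of the $j$th accent command, and $T_{1j}$ and $T_{2j}$ are its onset and offset, so its duration is $T_{2j} - T_{1j}$.

```latex
\ln F_0(t) = \ln F_b
  + \sum_{i=1}^{I} A_{p_i}\, G_p(t - T_{0i})
  + \sum_{j=1}^{J} A_{a_j}\,\bigl[\, G_a(t - T_{1j}) - G_a(t - T_{2j}) \,\bigr]

G_p(t) =
\begin{cases}
  \alpha^{2}\, t\, e^{-\alpha t}, & t \ge 0 \\
  0, & t < 0
\end{cases}

G_a(t) =
\begin{cases}
  \min\bigl[\, 1 - (1 + \beta t)\, e^{-\beta t},\ \gamma \,\bigr], & t \ge 0 \\
  0, & t < 0
\end{cases}
```

Here $F_b$ is the base frequency, $\alpha$ and $\beta$ are the natural angular frequencies of the phrase and accent control mechanisms, and $\gamma$ is the ceiling of the accent component.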

17 Phrase components = 0.01~0.05

18 Accent components = 0.1~0.5

19 Simulation of Mandarin Prosody with Fujisaki Model
[Figure panels: phrase components (Ap); accent components (Aa); simulated result, with base frequency Fb]
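A minimal sketch of how a simulated contour can be assembled from Fb, phrase commands and accent commands is given below. It follows the commonly cited form of the model, in which the components add in the log-F0 domain. The constants `ALPHA`, `BETA` and `GAMMA` are typical values from the literature and are assumptions, as are all function names and the example command values; none of them come from the slides.

```python
import math

ALPHA = 3.0   # phrase control natural frequency (1/s); assumed typical value
BETA = 20.0   # accent control natural frequency (1/s); assumed typical value
GAMMA = 0.9   # ceiling of the accent component; assumed typical value


def phrase_response(t, alpha=ALPHA):
    """Impulse response of the phrase control mechanism, Gp(t)."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=BETA, gamma=GAMMA):
    """Step response of the accent control mechanism, Ga(t)."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, fb, phrase_cmds, accent_cmds):
    """Return F0 in Hz at time t (seconds).

    phrase_cmds: list of (T0, Ap) pairs.
    accent_cmds: list of (T1, T2, Aa) triples (onset, offset, amplitude).
    """
    log_f0 = math.log(fb)
    log_f0 += sum(ap * phrase_response(t - t0) for t0, ap in phrase_cmds)
    log_f0 += sum(
        aa * (accent_response(t - t1) - accent_response(t - t2))
        for t1, t2, aa in accent_cmds
    )
    return math.exp(log_f0)


# Example with hypothetical command values, sampled every 10 ms for 2 s.
contour = [
    fujisaki_f0(i * 0.01, 100.0, [(0.0, 0.5)], [(0.3, 0.6, 0.4)])
    for i in range(200)
]
```

Before any command has taken effect the contour sits at Fb; each phrase command then adds a slowly decaying rise, and each accent command adds a local plateau on top of it.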

20 Simulating/Generating F0 Curves with Fujisaki Model: Auto-extraction of Parameters (other approaches vs. our approach)

21 Mixdorff (2000, 2003): Interpolation and Smoothing (1/3)
1. Intermediate F0 values are interpolated for unvoiced speech segments.
2. Microprosodic variations are smoothed out.
3. Feature: very close simulation, one phrase at a time.

22 Mixdorff (2000, 2003): High-Pass Filtering and Component Separation (2/3)
High-pass filtering
- The smoothed contour is passed through a high-pass filter (stop frequency at 0.5 Hz).
- The output of the high-pass filter is the high frequency contour (HFC).
- Subtracting the HFC from the contour leaves the low frequency contour (LFC), which contains the sum of the phrase components and Fb.
Component separation
- Fb: the overall minimum of the LFC
- Phrase components: the residual of the LFC after subtracting Fb
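The separation step can be sketched as follows. This is not Mixdorff's implementation: the filter is a simple one-pole low-pass run forward and backward, used here as a stand-in, and the LFC is computed directly with the HFC taken as the remainder, which is the complementary view of the high-pass step described above. The input is assumed to be an interpolated, smoothed log-F0 contour sampled at a fixed frame rate, and all function names are hypothetical.

```python
import math


def low_pass(contour, frame_rate_hz, cutoff_hz=0.5):
    """One-pole low-pass filter applied forward then backward.

    Stand-in filter (an assumption); the two passes cancel the phase lag.
    """
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / frame_rate_hz)

    def one_pass(samples):
        out, y = [], samples[0]
        for x in samples:
            y += a * (x - y)
            out.append(y)
        return out

    forward = one_pass(contour)
    return one_pass(forward[::-1])[::-1]


def separate_components(log_f0, frame_rate_hz):
    """Split a log-F0 contour into HFC, LFC, log Fb and phrase component."""
    lfc = low_pass(log_f0, frame_rate_hz)
    hfc = [x - l for x, l in zip(log_f0, lfc)]   # fast, accent-related part
    log_fb = min(lfc)                            # Fb: overall minimum of the LFC
    phrase = [l - log_fb for l in lfc]           # LFC with Fb subtracted
    return hfc, lfc, log_fb, phrase


# Toy contour (made up for illustration): 120 Hz with a slow oscillation.
toy_log_f0 = [math.log(120.0 + 10.0 * math.sin(i / 5.0)) for i in range(300)]
toy_hfc, toy_lfc, toy_log_fb, toy_phrase = separate_components(toy_log_f0, 100.0)
```

By construction the HFC and LFC sum back to the input, and the phrase component is non-negative with its minimum at zero.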

23 Mixdorff (2000, 2003): Optimizing the Simulated F0 Curve (3/3)
Hill-climbing methodology
- Construct a sub-optimal solution that meets the constraints of the problem.
- Take the solution and make an improvement upon it.
- Repeatedly improve the solution until no more improvements are necessary or possible.
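The three steps above can be sketched as a generic coordinate-wise hill climber. This is an illustrative sketch, not the optimizer used in the cited work: the step-halving schedule, the function name `hill_climb` and the toy cost function are all assumptions. In a Fujisaki setting the cost would typically be the squared error between the simulated and the measured contour, with the command parameters as the vector being adjusted.

```python
def hill_climb(cost, start, step=0.1, min_step=1e-4):
    """Minimize cost(params) by repeated single-parameter improvements.

    Starts from a sub-optimal solution, accepts any change to one
    parameter that lowers the cost, and halves the step size whenever
    no change helps. Stops when the step falls below min_step.
    """
    best = list(start)
    best_cost = cost(best)
    while step >= min_step:
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                trial = list(best)
                trial[i] += delta
                trial_cost = cost(trial)
                if trial_cost < best_cost:
                    best, best_cost, improved = trial, trial_cost, True
        if not improved:
            step /= 2.0
    return best, best_cost


# Toy cost with a known minimum at (0.5, 0.3); values are hypothetical.
def toy_cost(params):
    return (params[0] - 0.5) ** 2 + (params[1] - 0.3) ** 2


fitted, fitted_cost = hill_climb(toy_cost, [0.0, 0.0])
```

Like any hill climber, this can stop at a local optimum, which is why the quality of the starting solution matters.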

24 Gu (顧文濤, 2006): Generating F0 Curves Using a Speech Sample from COSPRO_05
1. Gu did NOT consider information above the phrase level.
2. Gu compared generation results with HPG-labeled results.

25 Gu (2006): Simulation of F0 Curves without Higher Level and Boundary Information
Features:
1. Local minima of the LFC are considered, and phrase commands (Ap) are inserted at them.
2. F0 curves and boundaries are generated.
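Locating candidate insertion points can be illustrated with a few lines of code. This is only a sketch of finding local minima in a sampled LFC, not Gu's procedure; the function name `local_minima` and the toy values are hypothetical, and plateaus are deliberately ignored for brevity.

```python
def local_minima(lfc):
    """Return indices whose value is strictly lower than both neighbours."""
    return [
        i
        for i in range(1, len(lfc) - 1)
        if lfc[i] < lfc[i - 1] and lfc[i] < lfc[i + 1]
    ]


# Toy LFC samples (made up for illustration).
toy_lfc = [3.0, 2.0, 1.0, 2.0, 3.0, 2.0, 4.0]
candidate_onsets = local_minima(toy_lfc)
```

Each returned index marks a frame where a phrase command could be considered.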

26 Gu (2006) observed that large variations in Ap exist
1. between two speakers,
2. among boundaries.
We observed:
1. The magnitudes of Ap inserted at larger boundaries (B4, B5) are similar.
2. Similar patterns exist in BGs or PGs.

27 Why Higher Level Discourse Information? (1/2)
Gu (2006)'s traditional approach, without higher-level information
Focus:
1. Isolated phrase intonations and boundaries are generated one at a time.
2. Simulation and fine-tuning of each generation.
Problems:
1. Large variations in Ap exist between speakers and among boundaries.
2. The variations cannot be predicted or resolved; concatenating the individual generations does not yield patterns usable for technological implementation.

28 Why Higher Level Discourse Information? (2/2)
Tseng et al's approach, with higher-level discourse information (HPG)
Focus:
- Prediction of fluent speech prosody, i.e., cross-phrase F0 curves and boundary breaks
Advantages:
1. Multiple phrase intonations and boundaries can be predicted according to HPG specifications.
2. Output prosody is NOT a concatenation of independent, isolated phrase intonations.
3. Between-speaker and among-boundary Ap variations are systematic and predictable and are therefore NOT treated as variations by the HPG framework.
4. Useful for technology development (speech synthesis).

29 Two Experiments
Hypotheses
- Predictions of phrase intonation curves can be improved with higher-level information, because HPG specifies cross-phrase associations.
- Cumulative contributions from prosodic layers can provide useful information.
Implications
- Technology development

30 Speech Data
Sinica COSPRO 08
- Carrier paragraph: a 30-syllable, 3-phrase complex sentence representing a short PG was constructed. A single target syllable was embedded in three PG positions, i.e., PG-initial, PG-medial and PG-final.
- "△是一個常見的字,一般人常把△字掛在嘴邊,講話時動不動就會提到△" ("△ is a commonly seen character; people often have the character △ on their lips and mention △ at every turn when they talk.")
- Speaking rates: 289 and 308 ms/syllable for M054C and F054C
- Target syllable analyzed: Tone 1

31 Experiment 1
Goals:
1. Derive patterns of Ap from speech data.
2. Find evidence of interaction between phrase commands and higher-level prosodic units.
3. Use the evidence found to predict cross-phrase F0 allocation in speech flow.

32 Distribution of speech data
PG position | Ap range
PG-initial | 0.959~0.499
PG-medial | 0.615~0.04
PG-final | 0.678~0.093
Ranges of Ap values from phrases produced by female speaker F054c in the three PG-related positions.

33 Distribution of speech data
[Figure: a schematic representation of the distribution of Ap for F054c; the horizontal axis represents values of Ap and the vertical axis represents the number of Ap occurrences]

34 Results
Expected cell mean at the PPh level, without the PG effect: 0.4595
Expected cell means at the PG level, with the PG effect:
PG-initial | PG-medial | PG-final
0.6984 | 0.3536 | 0.3265
The expected cell means of predictions with and without the PG effect.
[Figure: a schematic representation of the patterns of phrases after the PG effect is taken into consideration]
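A quick arithmetic check on the reported values: the PPh-level mean of 0.4595 equals the unweighted average of the three PG-position means, so each PG-level mean can be read as the PPh-level mean plus a position-specific offset. The sketch below only verifies this relation on the numbers reported above; reading it as an equally weighted average is an interpretation, not a statement about how the authors computed their model. The variable names are hypothetical.

```python
# Expected cell means as reported on the slide.
pg_means = {"PG-initial": 0.6984, "PG-medial": 0.3536, "PG-final": 0.3265}
pph_mean = 0.4595

# Unweighted average of the three PG-position means.
average = sum(pg_means.values()) / len(pg_means)

# Offset of each PG position relative to the PPh-level mean.
offsets = {pos: round(mean - pph_mean, 4) for pos, mean in pg_means.items()}
```

The offsets come out positive for PG-initial and negative for PG-medial and PG-final, and they sum to zero.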

35 Examples
A single expected cell mean cannot approximate the LFC well, especially in PG-initial and PG-final positions.
[Figure panels: without PG effect; with PG effect]

36 Superimposed F0 according to the HPG Framework
[Figure: three panels of F0 against time t, for the Syl, PPh and PG layers]

37 What Does Higher Level Discourse Information Mean?
Swapping PG-initial and PG-final
[Figure: two panels of F0 against time t, labeled Original and Exchanged]

38 Further Evidence for HPG, Systematic and Predictable: Same Base Form and Different Distribution Yield Different Output Prosody Styles

39 Speech Data
Mandarin rhymed classical writing (styles: regular, semi-regular, irregular)
Speaker | # of Syl | # of PPh | # of Discourse | Speech rate (ms)
female f054 | 3502 | 710 | 30 | 271
male m056 | 3510 | 711 | 30 | 202
Weather broadcast (style: irregular)
Speaker | # of Syl | # of PPh | # of Discourse | Speech rate (ms)
female f054 | 7054 | 720 | 34 | 193
male m054 | 7096 | 747 | 34 | 165

40 Classification of Stylistic Variations
Regular
Index | Name of file | # of syllables | Style of writing
07a | 正氣歌 | 300 | five-character old-style poem
17a | 慈烏夜啼 | 90 | five-character old-style poem
18a | 古詩十九首之一 | 80 | five-character old-style poem
19a | 長干行 | 150 | regular yuefu
21a | 古詩十九首之二 | 50 | five-character old-style poem
22a | 石壕吏 | 120 | regular yuefu
23a | 古詩十九首之九 | 40 | five-character old-style poem
24a | 新婚別 | 160 | regular yuefu
26a | 古詩十九首之十五 | 50 | five-character old-style poem
27a | 春江花月夜 | 262 | regular yuefu
Irregular
Index | Name of file | # of syllables | Style of writing
01a | 禮運大同篇 | 107 | classical prose
02a | 典論論文 | 187 | classical prose
03a | 孟子齊人 | 202 | classical prose
06a | 雜說四 | 150 | classical prose
Semi-regular
[The transcript repeats the regular table here.]

41 Predictions of Ap from Higher Level Information (B3, B4, B5)
[Figure labels: PPh, BG, PG]

42 Distributions of Layered Contributions in Each Style (Male)
[Figure panels: rhymed classical writing, m056 (regular, semi-regular, irregular); weather broadcast, m054 (irregular)]
The more regular the style, the bigger the planning templates and the stronger the governing from higher-level information.

43 Distributions of Layered Contributions in Each Style (Female)
[Figure panels: rhymed classical writing, f054 (regular, semi-regular, irregular); weather broadcast, f054 (irregular)]
The more regular the style, the bigger the planning templates and the stronger the governing from higher-level information.

44 PPh Contributions in Different Styles
(r = regular, smr = semi-regular, irr = irregular, WB = weather broadcast)
PPh | r | smr | irr | WB
female | 0.3492 | 0.447919 | 0.641732 | 0.627672
male | 0.4471312 | 0.503637 | 0.732469 | 0.603251

45 Contributions of BG Layer in Different Styles
(r = regular, smr = semi-regular, irr = irregular, WB = weather broadcast)
BG | r | smr | irr | WB
female | 0.3872459 | 0.132183 | 0.12899 | 0.050021
male | 0.1758515 | 0.131306 | 0.070716 | 0.03725

46 Conclusions
1. Lexical, syntactic and discourse prosody ALL contribute to output prosody. Interactions are necessary, systematic and predictable from higher-level considerations. HPG accounts for the prosody of fluent continuous speech.
2. How a semantically complete speech paragraph begins, holds and ends across the phrases within it is specified by HPG-related positions: PG-initial, PG-medial and PG-final.
3. Further evidence from Mandarin rhymed classics substantiates HPG as a base form for both planning and processing of fluent speech prosody.
4. Stylistic variations are built on the same base form, with varied distributions of layer contributions.

