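Application and Analysis of the Fujisaki Model in Mandarin under the Hierarchical Prosodic Phrase Grouping (HPG) Framework of Fluent Speech Prosody
Su Zhao-yu (蘇昭宇), Institute of Linguistics, Academia Sinica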


Negsst2007
Outline
- Hierarchical framework of discourse prosody (HPG)
  - Introduction
  - The HPG framework
  - Prosodic features and templates of Mandarin fluent speech prosody
  - Corpus approach and quantitative evidence
- Fujisaki model (F0 model)
  - Auto-extraction
  - Phrase components
  - Accent components
- Predicting cross-phrase F0 patterns with higher-level discourse information using the Fujisaki model
- Experiments and results
- Conclusion

References
1. Tseng, Chiu-yu (2006). "Prosody Analysis", in Advances in Chinese Spoken Language Processing, edited by Chin-Hui Lee, Haizhou Li, Lin-shan Lee, Ren-Hua Wang and Qiang Huo, World Scientific Publishing, Singapore.
2. Tseng, Chiu-yu, Pin, Shao-huang, Lee, Yeh-lin, Wang, Hsin-min and Chen, Yong-cheng (2005). "Fluent speech prosody: framework and modeling", Speech Communication, Vol. 46, Issues 3-4 (July 2005), Special Issue on Quantitative Prosody Modelling for Natural Speech Description and Generation.
3. Fujisaki, H. and Hirose, K. (1984). "Analysis of voice fundamental frequency contours for declarative sentences of Japanese", J. Acoust. Soc. Jpn. (E), 5(4).
4. Mixdorff, H. (2000). "A Novel Approach to the Fully Automatic Extraction of Fujisaki Model Parameters", Proceedings of ICASSP 2000, vol. 3, Istanbul, Turkey.
5. Mixdorff, H., Hu, Y. and Chen, G. (2003). "Towards the Automatic Extraction of Fujisaki Model Parameters for Mandarin", Proceedings of Eurospeech 2003, Geneva.
6. Gu, Wentao, Hirose, K. and Fujisaki, H. (2006). "Comparison of Perceived Prosodic Boundaries and Global Characteristics of Voice Fundamental Frequency Contours in Mandarin Speech", ISCSLP 2006, pp. 31-42.

HPG (Hierarchical Prosodic Phrase Grouping) Framework of Discourse Prosody: Fluent Speech Prosody

Introduction of HPG (1/2)
1. From the bottom up, output fluent speech prosody includes lexical prosody (tone), syntactic prosody (intonation) and discourse prosody (cross-phrase semantic associations).
2. From the top down, the HPG framework represents hierarchical constraints from discourse, syntactic and lexical information. Higher-level prosodic units therefore constrain and govern lower-level ones; lower-level units are subject to, and associated through, higher-level units.
3. Phrases in speech flow should NOT be treated as independent, unrelated prosodic units. Rather, intonation units are subordinate prosodic units subject to HPG specifications.

Introduction of HPG (2/2)
4. Output fluent speech prosody results from cumulative, layered contributions from lexical, syntactic and discourse information. Therefore, prosody does NOT stop at phrase intonation.
5. According to HPG specifications, variations of phrase intonation across speech flow are systematic and predictable.

HPG (Hierarchical Prosodic Phrase Grouping): Discourse Prosody Hierarchy (units and constraints)
[Figure: a schematic representation of how PGs form spoken discourse]

Speech data annotation
The speech data were manually labeled by independent transcribers for perceived boundaries and breaks (pauses), using a 5-step break labeling system corresponding to our framework.
[Figure: prosodic hierarchy diagram. Labels: Prosodic Group, Breath Group, Prosodic Phrase (initial PP, medial PP, final PP), PW; boundaries B2, B3, B4, B5]

Hand-Labeled Perceived Boundaries (Tseng et al., 1999) in Relation to Prosodic Organization: Systematic and Predictable
Each entry gives the break label, its definition, and its characteristics.
- B0, reduced syllabic boundary: syllable truncation, which often occurs in fast or informal fluent speech.
- B1, normal syllabic boundary (SYL): usually no identifiable pause; more of a psycholinguistic unit for native speakers.
- B2, prosodic word boundary (PW): perceived as a boundary, usually followed by a slight change in tone of voice.
- B3, prosodic phrase boundary (PPh): a clearly perceived pause.
- B4, breath group boundary (BG): the perceived end of an exhale cycle, followed by inhaling to begin another breathing cycle. It may coincide with the end of a speech paragraph, where trailing off occurs with final lengthening coupled with weakening of speech sounds. However, the speaker may also take a breath and continue without ending the speech paragraph.
- B5, prosodic group boundary (PG): a complete speech paragraph ends with final lengthening coupled with weakening of speech sounds. The speaker makes a complete stop, takes a new breath, and begins a new speech paragraph.
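The break labels above can be sketched as a small data structure. The following is a minimal illustration, not the COSPRO annotation tooling: the names `BREAK_UNITS` and `split_at` are hypothetical, and it assumes that a higher break index also closes every lower-level unit (for example, a B3 break also ends the current prosodic word).

```python
# Hypothetical mapping from break index to the unit it closes,
# following the definitions listed on the slide.
BREAK_UNITS = {
    0: "reduced syllabic boundary",
    1: "syllable (SYL)",
    2: "prosodic word (PW)",
    3: "prosodic phrase (PPh)",
    4: "breath group (BG)",
    5: "prosodic group (PG)",
}


def split_at(syllables, breaks, level):
    """Group syllables into units of the given break level.

    syllables: list of syllable labels.
    breaks:    break index (0-5) perceived after each syllable.
    level:     break index that closes a unit; higher indices are
               assumed to close it as well.
    """
    units, current = [], []
    for syllable, brk in zip(syllables, breaks):
        current.append(syllable)
        if brk >= level:
            units.append(current)
            current = []
    if current:  # trailing syllables with no closing break
        units.append(current)
    return units


syllables = ["a", "b", "c", "d", "e"]
breaks = [1, 2, 1, 3, 5]
prosodic_words = split_at(syllables, breaks, 2)
prosodic_phrases = split_at(syllables, breaks, 3)
```

With these toy labels, the same break sequence yields prosodic words at level 2 and prosodic phrases at level 3, which is one way to read the layered organization the slide describes.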

COSPRO: flow chart of speech data processing and annotation (read speech)
http://
[Figure: flow chart with three columns: task flow, output files and file names, footnotes]
Tasks shown:
- Designing text for narration
- Recording speech in soundproof chambers
- Editing text to match speech files
- Hand-mapping recorded speech with text
- Converting text to SAMPA
- Segmenting speech files using HTK
- Spot-checking by hand
- Hand-correcting mismatches
- Hand-labeling perceived prosodic boundaries
- Analyzing labeled speech data
Output files shown: *.wav, *.SAMPA, *.phn, *.adjust, *.break, PG model
Footnotes:
- Text for speakers to read: (1) file extension *.text; (2) serial numbers for text and wav files are identical.
- Recordings: sampling rate 16000 Hz; sampling format 1 channel, 16-bit linear.
- Hand-correcting mismatches: file extension *.adjust; adjustments cover (1) segment boundaries and (2) characters with multiple pronunciations.

Cross-Phrase Prosodic Features and Templates
Corpus investigations and quantitative analyses enabled us to
1. obtain quantitative evidence of the cumulative contributions of prosodic layers to output prosody, and
2. derive cross-phrase hierarchical templates corresponding to every prosodic layer in the following 4 acoustic correlates (Tseng et al., 2004; 2005; 2006):
   1. F0 contour templates
   2. Duration cadence templates
   3. Intensity distribution patterns
   4. Pause cadence templates

Quantitative Analysis and Predictions: F0, Duration, Intensity and Breaks
[Figure: hierarchical linear model. Inputs: Fujisaki parameters, pause, duration, intensity. Layers: SYL, PW, PPh, BG, with residues. Side panel: auto-extraction for the Fujisaki model, relating the F0 contour and the Fujisaki parameters]

The Fujisaki Model

Fujisaki Model (1984): an intonation model
Unit: a simple sentence as defined by syntax
- The F0 curve corresponding to a single simple phrase, as defined by syntax, can be generated.
- The F0 curve, with its gradually declining baseline, can be decomposed into phrase components (Ap) and accent components (Aa).
Evidence obtained for: Japanese, English, German, Mandarin, Thai, Vietnamese, etc.

The Fujisaki Model (1/2)
F0 = base frequency + phrase components + accent components

The Fujisaki Model (2/2)
[Equations for the phrase components and accent components; the formulas and symbols did not survive in the source. The surviving symbol definitions are:]
- base frequency
- magnitude of the ith phrase command
- timing of the ith phrase command
- onset of the first accent command in the jth command pair
- duration of the first accent command in the jth command pair
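For orientation, the form in which the Fujisaki model is usually written (following Fujisaki and Hirose, 1984) is sketched below. This is the common textbook notation, not a recovery of the slide's own equations: the symbols F_b, A_{p,i}, T_{0i}, A_{a,j}, T_{1j}, T_{2j}, alpha, beta and gamma are assumptions. In particular, the slide defines accent commands by onset and duration within command pairs, whereas the standard form below uses an onset T_{1j} and an offset T_{2j}. Note that the components add in the logarithmic F0 domain.

```latex
\ln F_0(t) = \ln F_b
  + \sum_{i=1}^{I} A_{p,i}\, G_p(t - T_{0i})
  + \sum_{j=1}^{J} A_{a,j}\, \bigl[ G_a(t - T_{1j}) - G_a(t - T_{2j}) \bigr]

G_p(t) =
  \begin{cases}
    \alpha^2 t\, e^{-\alpha t}, & t \ge 0 \\
    0, & t < 0
  \end{cases}
\qquad
G_a(t) =
  \begin{cases}
    \min\bigl[\, 1 - (1 + \beta t)\, e^{-\beta t},\ \gamma \,\bigr], & t \ge 0 \\
    0, & t < 0
  \end{cases}
```

Here G_p is the impulse response of the phrase control mechanism and G_a is the step response of the accent control mechanism, capped at the ceiling gamma.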

Phrase components: (parameter symbol not recoverable from the source) = 0.01~0.05

Accent components: (parameter symbol not recoverable from the source) = 0.1~0.5

Simulation of Mandarin Prosody with the Fujisaki Model
[Figure: simulated result decomposed into phrase components (Ap), accent components (Aa) and base frequency (Fb)]
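How phrase and accent commands superimpose on the base frequency can be sketched in a few lines. This is a minimal illustration of the standard Fujisaki formulation, not the extraction or simulation code used for these slides. The function names are hypothetical, and the defaults alpha = 3.0, beta = 20.0 and gamma = 0.9 are commonly quoted values assumed here, not taken from the source.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism, Gp(t)."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, Ga(t), capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, fb, phrase_cmds, accent_cmds,
                alpha=3.0, beta=20.0, gamma=0.9):
    """Return F0 in Hz at time t (seconds).

    fb:          base frequency in Hz.
    phrase_cmds: list of (T0, Ap) pairs, command time and magnitude.
    accent_cmds: list of (T1, T2, Aa) triples, onset, offset, amplitude.
    Components are summed in the log-F0 domain.
    """
    log_f0 = math.log(fb)
    for t0, ap in phrase_cmds:
        log_f0 += ap * phrase_response(t - t0, alpha)
    for t1, t2, aa in accent_cmds:
        log_f0 += aa * (accent_response(t - t1, beta, gamma)
                        - accent_response(t - t2, beta, gamma))
    return math.exp(log_f0)


# Toy commands, invented for illustration only.
contour = [fujisaki_f0(0.01 * k, 100.0, [(0.0, 0.5)], [(0.2, 0.5, 0.3)])
           for k in range(200)]
```

Before any command has started, the contour sits at Fb; each phrase command then adds a slowly decaying hump and each accent command a local plateau on top of it.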

Simulating/Generating F0 Curves with the Fujisaki Model: Auto-extraction of Parameters (other approaches vs. our approach)

Mixdorff (2000, 2003): Interpolation and Smoothing (1/3)
1. Intermediate F0 values are interpolated for unvoiced speech segments.
2. Microprosodic variations are smoothed out.
3. Feature: very close simulation, one phrase at a time.

Mixdorff (2000, 2003): High-Pass Filtering and Component Separation (2/3)
High-pass filter (stop frequency at 0.5 Hz)
- The output of the high-pass filter is the high frequency contour (HFC).
- The low frequency contour (LFC) contains the sum of the phrase components and Fb.
Component separation
- Fb: the overall minimum of the LFC
- Phrase components: the residual of the LFC after subtracting Fb
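The component separation step can be sketched as follows. This is a rough illustration only: a first-order exponential smoother stands in for the actual filter (the slide specifies a high-pass filter with a 0.5 Hz stop frequency, which this sketch does not implement), and the function name, the `smoothing` parameter and the input values are hypothetical. The input is assumed to be an interpolated, smoothed contour sampled at a fixed frame rate.

```python
def separate_components(contour, smoothing=0.05):
    """Split a contour into HFC, LFC, Fb and phrase component.

    contour:   list of (log-)F0 values at a fixed frame rate.
    smoothing: weight of the stand-in low-pass smoother (assumption).
    """
    # Stand-in low-pass: slow-moving part of the contour (LFC).
    lfc = []
    state = contour[0]
    for value in contour:
        state += smoothing * (value - state)
        lfc.append(state)

    # High frequency contour: what the low-pass part does not explain.
    hfc = [value - low for value, low in zip(contour, lfc)]

    # Fb is the overall minimum of the LFC; the phrase component is
    # the residual of the LFC after subtracting Fb.
    fb = min(lfc)
    phrase = [low - fb for low in lfc]
    return hfc, lfc, fb, phrase


# Toy values, invented for illustration only.
hfc, lfc, fb, phrase = separate_components([5.0, 5.2, 5.1, 4.9, 4.8])
```

By construction the HFC and LFC add back up to the input contour, and the phrase component is non-negative and touches zero where the LFC reaches its minimum.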

Mixdorff (2000, 2003): Optimizing the Simulated F0 Curve (3/3)
Hill-climbing methodology
- Construct a sub-optimal solution that meets the constraints of the problem.
- Take the solution and make an improvement upon it.
- Repeatedly improve the solution until no more improvements are necessary or possible.
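The three steps above describe generic hill climbing. A minimal coordinate-wise sketch is given below; it is not Mixdorff's optimizer, and the function name, the step-halving schedule and the toy cost function are all assumptions made for illustration. In a Fujisaki fitting setting, `cost` would measure the mismatch between the simulated and the observed F0 contour.

```python
def hill_climb(start, cost, step=0.5, tol=1e-4, max_passes=10_000):
    """Greedy coordinate-wise hill climbing (minimization).

    start: initial, sub-optimal parameter list.
    cost:  function mapping a parameter list to a number to minimize.
    step:  initial step size; halved whenever a full pass over all
           parameters brings no improvement.
    tol:   stop once the step size falls below this value.
    """
    params = list(start)
    best = cost(params)
    for _ in range(max_passes):
        if step < tol:
            break
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = params.copy()
                trial[i] += delta
                trial_cost = cost(trial)
                if trial_cost < best:
                    # Keep the improvement and move to the next parameter.
                    params, best, improved = trial, trial_cost, True
                    break
        if not improved:
            step /= 2.0
    return params, best


# Toy cost with a known minimum at (3, -1), for illustration only.
solution, error = hill_climb(
    [0.0, 0.0], lambda p: (p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2)
```

Like any hill climber, this only finds a local optimum, so the quality of the starting solution matters.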

Gu Wentao (顧文濤, 2006): Generating F0 Curves Using Speech Samples from COSPRO_05
1. Gu did NOT consider information above the phrase level.
2. Gu compared the generation results with HPG-labeled results.

Gu (2006): Simulation of F0 Curves without Higher-Level and Boundary Information
Features:
1. Local minima of the LFC are identified, and a phrase command (Ap) is inserted at each.
2. F0 curves and boundaries are generated.

Gu (2006) observed that large variations of Ap exist
1. between two speakers, and
2. among boundaries.
We observed:
1. The magnitudes of Ap inserted at larger boundaries (B4, B5) are similar.
2. Similar patterns exist within BGs or PGs.

Why Higher-Level Discourse Information? (1/2)
Gu's (2006) traditional approach, without higher-level information
Focus:
1. Isolated phrase intonations and boundaries are generated one at a time.
2. Each generated phrase is simulated and fine-tuned separately.
Problems:
1. Large variations of Ap exist between speakers and among boundaries.
2. The variations cannot be predicted or resolved; concatenating the separately generated phrases does not yield patterns usable for technological implementation.

Why Higher-Level Discourse Information? (2/2)
Tseng et al.'s approach, with higher-level discourse information (HPG)
Focus:
- Prediction of fluent speech prosody, i.e., cross-phrase F0 curves and boundary breaks
Advantages:
1. Multiple phrase intonations and boundaries can be predicted according to HPG specifications.
2. Output prosody is NOT a concatenation of independent, isolated phrase intonations.
3. Between-speaker and among-boundary Ap variations are systematic and predictable, and are therefore NOT treated as variation in the HPG framework.
4. Useful for technology development (speech synthesis).

Experiments
Hypotheses
- Predictions of phrase intonation curves can be improved with higher-level information, because HPG specifies cross-phrase associations.
- Cumulative contributions from prosodic layers can provide useful information.
Implications
- Technology development

Speech Data
Sinica COSPRO 08
- Carrier paragraph: a 30-syllable, 3-phrase complex sentence representing a short PG was constructed. A single target syllable (△) was embedded in three PG positions: PG-initial, PG-medial and PG-final.
- Carrier text: "△是一個常見的字，一般人常把△字掛在嘴邊，講話時動不動就會提到△" ("△ is a common character; people often have the character △ on their lips, and bring up △ at every turn when they talk.")
- Speaking rates: 289 and 308 ms/syllable for M054C and F054C, respectively
- Target syllable analyzed: Tone 1

Experiment 1
Goals:
1. To derive patterns of Ap from the speech data.
2. To find evidence of interaction between phrase commands and higher-level prosodic units.
3. To use the evidence found to predict cross-phrase F0 allocation in speech flow.

Distribution of speech data
Range of Ap values from phrases produced by female speaker F054c in three PG-related positions:

PG position   Ap range
PG-initial    0.959~
PG-medial     0.615~0.04
PG-final      0.678~0.093

Distribution of speech data
[Figure: schematic distribution of Ap for F054c; the horizontal axis shows values of Ap and the vertical axis shows the number of Ap occurrences]

Results
The expected cell means of predictions with and without the PG effect:
- Expected cell mean at the PPh level, without PG effects: (value missing in source)
- Expected cell means at the PG level, with the PG effect, for PG-initial, PG-medial and PG-final: (values missing in source)
[Figure: schematic representation of the patterns of phrases after the PG effect is taken into consideration]
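The contrast between one expected cell mean and position-specific cell means can be sketched as below. This is a simplified illustration, not the hierarchical linear model used in the study: the function names are hypothetical and the Ap values are invented purely to make the code runnable; they are not corpus data.

```python
from collections import defaultdict


def cell_means(samples):
    """Return (overall mean, {PG position: mean}) for (position, Ap) pairs."""
    overall = sum(ap for _, ap in samples) / len(samples)
    groups = defaultdict(list)
    for position, ap in samples:
        groups[position].append(ap)
    by_position = {pos: sum(vals) / len(vals) for pos, vals in groups.items()}
    return overall, by_position


def squared_error(samples, predict):
    """Sum of squared errors of a predictor that maps position to Ap."""
    return sum((ap - predict(position)) ** 2 for position, ap in samples)


# Invented toy values, NOT measurements from the corpus.
toy = [("initial", 0.8), ("initial", 0.6),
       ("medial", 0.3), ("medial", 0.1),
       ("final", 0.4), ("final", 0.2)]

overall, by_position = cell_means(toy)
error_without_pg = squared_error(toy, lambda pos: overall)
error_with_pg = squared_error(toy, lambda pos: by_position[pos])
```

When Ap differs systematically by PG position, predicting with one mean per position leaves a smaller residual than predicting with a single mean, which is the sense in which the PG layer contributes information.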

Examples
A single expected cell mean cannot approximate the LFC well, especially in PG-initial and PG-final positions.
[Figure: examples without the PG effect and with the PG effect]

Superimposed F0 according to the HPG Framework
[Figure: F0 over time (t) at the Syl, PPh and PG layers]

What Does Higher-Level Discourse Information Mean?
Swapping PG-initial and PG-final
[Figure: F0 over time (t), original vs. exchanged]

Further Evidence for HPG, Systematic and Predictable: the Same Base Form with Different Distributions Yields Different Output Prosody Styles

Speech Data
- Mandarin rhymed classical writing. Styles: regular, semi-regular, irregular
- Weather broadcast. Style: irregular
[Tables, one per corpus. Columns: # of syllables, # of PPh, # of discourses, speech rate (ms); one row per speaker (row labels appear as "female f" and "female m" in the source). Cell values are missing in the source.]

Classification of Stylistic Variations
Style classes: regular, semi-regular, irregular

Index  Name of file          # of syllables  Style of writing
07a    正氣歌                300             five-character ancient-style verse (五言古詩)
17a    慈烏夜啼              90              five-character ancient-style verse (五言古詩)
18a    古詩十九首之一        80              five-character ancient-style verse (五言古詩)
19a    長干行                150             regular yuefu verse (工整樂府)
21a    古詩十九首之二        50              five-character ancient-style verse (五言古詩)
22a    石壕吏                120             regular yuefu verse (工整樂府)
23a    古詩十九首之九        40              five-character ancient-style verse (五言古詩)
24a    新婚別                160             regular yuefu verse (工整樂府)
26a    古詩十九首之十五      50              five-character ancient-style verse (五言古詩)
27a    春江花月夜            262             regular yuefu verse (工整樂府)

Index  Name of file          # of syllables  Style of writing
01a    禮運大同篇            107             classical prose (古文)
02a    典論論文              187             classical prose (古文)
03a    孟子齊人              202             classical prose (古文)
06a    雜說四                150             classical prose (古文)

Predictions of Ap from Higher-Level Information (B3, B4, B5)
[Figure: predictions at the PPh, BG and PG layers]

Distributions of Layered Contributions in Each Style (Male)
[Figure: rhymed classical writing, speaker m056 (regular, semi-regular, irregular); weather broadcast, speaker m054 (irregular)]
The more regular the style, the bigger the planning templates and the stronger the governing effect of higher-level information.

Distributions of Layered Contributions in Each Style (Female)
[Figure: rhymed classical writing, speaker f054 (regular, semi-regular, irregular); weather broadcast, speaker f054 (irregular)]
The more regular the style, the bigger the planning templates and the stronger the governing effect of higher-level information.

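PPh Contributions in Different Styles
[Table: contributions of the PPh layer. Columns: r (regular), smr (semi-regular), irr (irregular), WB (weather broadcast); rows: female, male. Cell values are missing in the source.]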

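Contributions of the BG Layer in Different Styles
[Table: contributions of the BG layer. Columns: r (regular), smr (semi-regular), irr (irregular), WB (weather broadcast); rows: female, male. Cell values are missing in the source.]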

Conclusions
1. Lexical, syntactic and discourse prosody ALL contribute to output prosody. Their interactions are necessary, systematic and predictable from higher-level considerations. HPG accounts for the prosody of fluent continuous speech.
2. How a semantically complete speech paragraph begins, holds and ends across the phrases within it is specified by HPG-related positions: PG-initial, PG-medial and PG-final.
3. Further evidence from Mandarin rhymed classics substantiates HPG as a base form for both planning and processing of fluent speech prosody.
4. Stylistic variations are built on the same base form, with varied distributions of layer contributions.