FLST: Prosodic Models for Speech Technology
Bernd Möbius


Prosody: Duration and intonation
- Temporal and tonal structure in speech synthesis
  - all synthesis methods use models to predict duration and F0
  - models are trained on observed duration and F0 data
- Unit Selection:
  - phone duration and phone-level F0 used in target specification
  - F0 smoothness considered
- HMM synthesis: duration modeled by probability of remaining in the same state
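Taking the HMM statement above literally (a plain HMM with no explicit duration model), the time spent in a state follows a geometric distribution governed by the self-transition probability. A minimal sketch; the function names and the 5 ms frame shift are assumptions for illustration:

```python
def duration_pmf(self_loop_prob, n_frames):
    """P(staying exactly n_frames in a state) = a^(n-1) * (1 - a)."""
    return self_loop_prob ** (n_frames - 1) * (1.0 - self_loop_prob)


def expected_state_duration_ms(self_loop_prob, frame_ms=5.0):
    """Mean of the geometric distribution, 1 / (1 - a) frames, in milliseconds."""
    if not 0.0 <= self_loop_prob < 1.0:
        raise ValueError("self-loop probability must be in [0, 1)")
    return frame_ms / (1.0 - self_loop_prob)


if __name__ == "__main__":
    # A self-loop probability of 0.9 means ten frames on average.
    print(expected_state_duration_ms(0.9))
```

The geometric shape always makes the shortest duration the most likely one, which is one reason to treat this only as the simplest reading of the slide.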

Duration prediction
- Task of duration model in TTS:
  - predict duration of speech sound as precisely as possible, based on factors affecting duration
  - factors must be computable/inferrable from text
- Why is this task difficult?
  - extremely context-dependent durations, e.g. [ɛ] = 35 ms in "jetzt", 252 ms in "Herren"
  - factors: accent status of word, syllabic stress, position in utterance, segmental context, …
  - factors define a huge feature space

Duration models
- Automatic construction of duration models
  - general-purpose statistical prediction systems
    - Classification and Regression Trees [Breiman et al. 1984; e.g. Riley 1992]
    - Multiple regression [e.g. Iwahashi and Sagisaka 1993]
    - Neural Nets [e.g. Campbell 1992]
  - statistically accurate for training data
  - but often insufficient performance on new data

[Figure: Data sparsity: word frequencies]

Data sparsity
- Why is this a problem?
  - data sparsity: feature space (>10k vectors) cannot be covered exhaustively by training data
  - LNRE distribution: large number of rare events
    - rare vectors must not be ignored, because there are so many rare vectors that the probability of encountering at least one of them in any sentence is very high
  - vectors unseen in training data must be predicted by extrapolation and generalization
  - general-purpose prediction systems have poor extrapolation and are not robust w.r.t. missing data
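The LNRE argument can be made concrete with a small calculation. A sketch under two assumptions that are not in the slides: segments are treated as independent draws, and the combined probability mass of all rare vectors (here 0.05) is an invented illustrative value:

```python
def prob_at_least_one_rare(rare_mass, n_segments):
    """Probability that a sentence of n_segments contains at least one rare vector.

    rare_mass is the total probability that a single segment has a feature
    vector of a rare type; independence between segments is assumed.
    """
    return 1.0 - (1.0 - rare_mass) ** n_segments


if __name__ == "__main__":
    # Each rare type is individually unlikely, but together they add up.
    for n in (10, 20, 40):
        print(n, round(prob_at_least_one_rare(0.05, n), 3))
```

Even when rare types account for only a few percent of segments, a sentence of a few dozen segments is more likely than not to contain one.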

Sum-of-products model
- Current best practice: Sum-of-products model [van Santen 1993, 1998; Möbius and van Santen 1996]
  - exploits expert knowledge and well-behaved properties of speech (e.g. directional invariance, monotonicity)
  - uses well-behaved mathematical operations (addition, multiplication)
  - estimates parameters even for unbalanced frequency distributions of features in training data

Sum-of-products model
- Sum-of-products model: general form [van Santen 1993, 1998]
  - K: set of indices of product terms
  - I_i: set of indices of factors occurring in i-th product term
  - S_{i,j}: set of parameters, each corresponding to a level on j-th factor
  - f_j: feature on j-th factor (e.g., f_1 = Vowel_ID, f_2 = ±stress, ...)
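A sketch of the general form these definitions describe, as the model is usually written in van Santen's work; the name DUR and the feature-vector notation are assumptions here:

```latex
\[
  \mathrm{DUR}(\mathbf{f}) \;=\; \sum_{i \in K} \; \prod_{j \in I_i} S_{i,j}(f_j),
  \qquad \mathbf{f} = (f_1, f_2, \dots)
\]
```

Each product term i multiplies one parameter per factor in I_i, chosen by the level f_j of that factor, and the terms are then summed.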

Sum-of-products model
- Sum-of-products model: specific form [van Santen 1993, 1998]
  - V: vowel identity (15 levels)
  - C: consonant after V (2 levels: ±voiced)
  - P: position in phrase (2 levels: medial/final)
  - here: 21 parameters to estimate ( )
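A minimal sketch of how such a model is evaluated. The term structure used here, S(V) + S(P) + S(C) × S(P), is an assumption chosen because it gives 15 + 2 + 2 + 2 = 21 parameters; the slide's own equation is not available, and all parameter values and vowel names below are invented:

```python
from math import prod


def sop_duration(terms, features):
    """Sum over product terms; each term maps factor name -> {level: parameter}."""
    return sum(
        prod(table[features[factor]] for factor, table in term.items())
        for term in terms
    )


def count_parameters(terms):
    """One parameter per level of each factor in each product term."""
    return sum(len(table) for term in terms for table in term.values())


# Placeholder names for the 15 vowel levels (hypothetical).
VOWELS = [f"v{k}" for k in range(15)]

# Assumed structure: S(V) + S(P) + S(C) * S(P); values are illustrative only.
TERMS = [
    {"V": {v: 80.0 + 5.0 * k for k, v in enumerate(VOWELS)}},
    {"P": {"medial": 0.0, "final": 30.0}},
    {"C": {"voiced": 1.5, "voiceless": 1.0},
     "P": {"medial": 10.0, "final": 40.0}},
]

if __name__ == "__main__":
    print(count_parameters(TERMS))
    print(sop_duration(TERMS, {"V": "v0", "C": "voiced", "P": "final"}))
```

Because every parameter is tied to one level of one factor, a duration can be computed for any combination of levels, including combinations never seen in training.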

Sum-of-products model
- SoP model requires:
  - definition of factors affecting duration (literature, pilot)
  - segmented and annotated speech corpus
  - greedy algorithm to optimize coverage: select from large text corpus a smallest subset with same coverage
- SoP model yields:
  - complete picture of temporal characteristics of speaker
  - homogeneous, consistent results for set of factors
  - best performance: r = 0.9 for observed vs. predicted phone durations (Engl., Ger., Fr., Dutch, Chin., Jap., …)
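The greedy coverage step can be sketched as a standard greedy set cover. This is an illustration, not the procedure used for the slides: the data layout (sentence id mapped to the set of feature vectors it contains) and the toy corpus are assumptions, and a greedy pass gives a small subset rather than a guaranteed minimum:

```python
def greedy_cover(sentences):
    """Pick sentences until every feature vector in the corpus is covered.

    sentences: dict mapping sentence id -> set of feature vectors it contains.
    Returns the selected sentence ids in the order they were chosen.
    """
    uncovered = set().union(*sentences.values())
    selected = []
    while uncovered:
        # Take the sentence that covers the most still-uncovered vectors.
        best = max(sentences, key=lambda s: len(sentences[s] & uncovered))
        selected.append(best)
        uncovered -= sentences[best]
    return selected


if __name__ == "__main__":
    # Hypothetical toy corpus; letters stand in for feature vectors.
    corpus = {
        "s1": {"a", "b", "c"},
        "s2": {"c", "d"},
        "s3": {"a"},
        "s4": {"d", "e"},
    }
    print(greedy_cover(corpus))
```

In the toy corpus two of the four sentences already cover all five vectors, so only those would need to be recorded.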

[Figure: SoP model: phonetic tree]

[Figure: Intonation (F0)]

Intonation prediction
- Task of intonation model in TTS
  - compute a continuous acoustic parameter (F0) from a symbolic representation of intonation inferred from text
- Intonation models commonly applied in TTS systems:
  - phonological tone-sequence models (Pierrehumbert)
  - acoustic-phonetic superposition models (Fujisaki)
  - acoustic stylization models (Tilt, PaIntE, INTSINT)
  - perception-based models (IPO)
  - function-oriented models (KIM)

Tone sequence model
- Autosegmental-metrical theory of intonation [Pierrehumbert 1980]
  - intonation is represented by a sequence of high (H) and low (L) tones
  - H and L are members of a primary phonological contrast
  - hierarchy of intonational domains
    - IP – intonation phrase; boundary tones: H%, L%
    - ip – intermediate phrase; phrase tones: H-, L-
    - pw – prosodic word; pitch accents: H*, H*L, L*H, …

Pierrehumbert's model
- Finite-state grammar of well-formed tone sequences (domains: pw, ip, IP)
- Example [adapted from Pierrehumbert 1980, p. 276]
    text:  That's a remarkably clever suggestion.
    tones: %H H* H*L L- L%

[Figure: Pierrehumbert's model: finite-state graph; labels: pw, ip, IP]
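A finite-state grammar of this kind can be sketched as a small acceptor. The grammar below is a simplification assumed for illustration (optional initial boundary tone, then one or more groups of pitch accents closed by a phrase tone, then a final boundary tone), and the tone inventories are cut down to the symbols that appear in the slides:

```python
# Inventories limited to symbols shown in the slides; the full sets are larger.
PITCH_ACCENTS = {"H*", "H*L", "L*H"}
PHRASE_TONES = {"H-", "L-"}
BOUNDARY_TONES = {"H%", "L%"}
INITIAL_BOUNDARY = {"%H"}


def is_well_formed(tones):
    """Accept [%H] (accent+ phrase-tone)+ boundary-tone; reject anything else."""
    state = "start"
    for tone in tones:
        if state == "start" and tone in INITIAL_BOUNDARY:
            state = "after_initial"
        elif state in ("start", "after_initial", "accent", "phrase") \
                and tone in PITCH_ACCENTS:
            state = "accent"
        elif state == "accent" and tone in PHRASE_TONES:
            state = "phrase"
        elif state == "phrase" and tone in BOUNDARY_TONES:
            state = "end"
        else:
            return False  # no transition defined for this tone in this state
    return state == "end"


if __name__ == "__main__":
    print(is_well_formed("%H H* H*L L- L%".split()))
    print(is_well_formed("H* L%".split()))
```

The tone sequence of the example sentence is accepted, while a sequence that skips the phrase tone is rejected.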

ToBI: Tones and Break Indices
- Formalization of intonation model as transcription system [Pitrelli et al. 1992]
  - phonemic (= broad phonetic) transcription
  - originally designed for American English
  - limited applicability to other varieties/languages
    - language-specific inventory of phonological units
    - language-specific details of F0 contours
  - adapted to many languages (e.g. GToBI, JToBI, KToBI)
  - implemented in many TTS systems
    - abstract tonal representation converted to F0 contours by means of phonetic realization rules

[Figure: Fujisaki's model] [Fujisaki 1983, 1988; Möbius 1993]

Fujisaki's model
- Properties:
  - superpositional
  - physiological basis and interpretation of components and control parameters
  - linguistic interpretation of components
  - applied to many (typologically diverse) languages
- Origins:
  - Öhman and Lindqvist (1966), Öhman (1967)
  - Fujisaki et al. (1979), Fujisaki (1983, 1988), …

[Figure: Fujisaki's model: Components] [Möbius 1993]

Fujisaki's model: Example
[Figure, Möbius 1993: approximation of natural F0 by optimal parameter values within linguistic constraints (accents, phrase structure)]
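The superposition idea can be sketched in code using the formulation of Fujisaki's model commonly given in the literature: log F0 is a base value plus phrase components (impulse responses) plus accent components (step responses). The time constants alpha, beta, the ceiling gamma and all command values below are illustrative assumptions, not values from the slides:

```python
import math


def phrase_response(t, alpha=2.0):
    """Impulse response of the phrase control mechanism."""
    return alpha ** 2 * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, limited to gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, fb, phrase_cmds, accent_cmds):
    """F0 in Hz at time t (seconds).

    fb: base frequency in Hz
    phrase_cmds: list of (onset_time, amplitude)
    accent_cmds: list of (onset_time, offset_time, amplitude)
    """
    log_f0 = math.log(fb)
    log_f0 += sum(ap * phrase_response(t - t0) for t0, ap in phrase_cmds)
    log_f0 += sum(
        aa * (accent_response(t - t1) - accent_response(t - t2))
        for t1, t2, aa in accent_cmds
    )
    return math.exp(log_f0)


if __name__ == "__main__":
    phrases = [(0.0, 0.5)]        # one phrase command at t = 0 s
    accents = [(0.3, 0.6, 0.4)]   # one accent command from 0.3 s to 0.6 s
    for t in (0.0, 0.25, 0.5, 1.0, 2.0):
        print(t, round(fujisaki_f0(t, 100.0, phrases, accents), 1))
```

Fitting a natural contour, as in the example figure, amounts to searching for command times and amplitudes that minimize the distance between this function and the measured F0.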

Comparison of models
- Tone sequence (TS) or superposition (SP)?
  - intonation
    - TS: consists of linear sequence of tonal elements
    - SP: overlay of components of longer/shorter domain
  - F0 contour
    - TS: generated from sequences of phonological tones
    - SP: complex patterns from superimposed components
  - interaction
    - TS: tones locally determined, non-interactive
    - SP: simultaneous, highly interactive components

F0 as a complex phenomenon
- Main problem for intonation models: linguistic, paralinguistic, extralinguistic factors, all conveyed by F0
  - lexical tones
  - syllabic stress, word accent
  - stress groups, accent groups
  - prosodic phrasing
  - sentence mode
  - discourse intonation
  - pitch range, register
  - phonation type, voice quality
  - microprosody: intrinsic and coarticulatory F0

Thanks!
More on prosody in speech technology: ASR (Wed Jan 28)