Presentation transcript:

A Statistical Approach to Emotional Prosody Generation
Toshiba Update, 14/09/2005
Zeynep Inanoglu, Machine Intelligence Laboratory, CU Engineering Department
Supervisor: Prof. Steve Young

Agenda
– Previous Toshiba Update
– A Review of Emotional Speech Synthesis
– Motivation for Proposed Approach
– Proposed Approach: Intonation Generation from Syllable HMMs
  – Intonation Models and Training
  – Recognition Performance of Intonation Units
  – Intonation Synthesis from HMMs
  – MLLR-based Intonation Adaptation
  – Perceptual Tests
– Summary and Future Direction

Previous Toshiba Update: A Brief Review
Emotion Recognition
– Demonstrated work on HMM-based emotion detection in voicemail messages (Emotive Alert).
– Reported the set of acoustic features that maximize classification accuracy for each emotion type, identified using the sequential forward floating selection algorithm.
Expressive Speech Synthesis
– Demonstrated the importance of prosody in emotional expression through copy-synthesis of emotional prosody onto neutral utterances.
– Suggested linguistically descriptive intonation units (accents, boundary tones) for prosody modelling.

A Review of Emotional Synthesis
The importance of prosody in emotional expression has been confirmed (Banse & Scherer, 1996; Mozziconacci, 1998). The available prosody rules are mainly defined for global parameters (mean pitch, pitch range, speaking rate, declination), and the interaction of linguistic units and emotion is largely untested (Banziger, 2005). Strategies for emotional synthesis vary with the type of synthesizer:
– Formant synthesis allows control over various segmental and prosodic parameters. Emotional prosody rules extracted from the literature are applied by modifying neutral synthesizer parameters (Cahn, 1990; Burkhardt, 2000; Murray & Arnott, 1995).
– Diphone synthesis allows prosody control by defining target contours and durations based on emotional prosody rules (Schroeder, 2004; Burkhardt, 2005).
– Unit-selection synthesis provides minimal parametric flexibility. Attempts at emotional expression involve recording an entire unit database for each emotion and selecting units from the appropriate database at run time (Iida et al., 2003).
– HMM synthesis allows spectral and prosodic control at the segmental level, and provides a statistical framework for modelling emotions (Tsuzuki et al., 2004).

A Review of Emotional Synthesis: methods by granularity
– Rule-based (formant / diphone synthesis; global parameters): only as good as the hand-crafted rules; poor to medium baseline quality.
– Unit replication (unit-selection synthesis): very good quality, but not scalable; too much effort per emotion.
– Statistical (HMM synthesis; segmental): statistical framework, but too granular for prosody modelling.
– Statistical modelling at the intonational (syllable/phrase) granularity: unexplored.

Motivation For Proposed Approach¹
We propose a generative model of prosody.
– We envision evaluating this prosodic model in a variety of synthesis contexts through signal manipulation schemes such as TD-PSOLA.
Statistical
– Rule-based systems are only as good as their hand-crafted rules. Why not learn the rules from data?
– HMM methods have been successful in speech synthesis.
Syllable-based
– Pitch movements are most relevant at the syllable or intonational-phrase level. However, the effects of emotion on contour shapes and linguistic units are largely unexplored.
Linguistic Units of Intonation
– The coupling of emotion and linguistic phenomena has not been investigated.

¹ This work will be published in the Proceedings of ACII, October 2005, Beijing.

Overview
[Block diagram: neutral speech data with syllable boundaries and labels feeds HMM training; emotion data feeds MLLR; syllable labels, phonetic labels, and mean pitch feed F0 generation, producing a synthesized contour.]
Step 1: Train context-sensitive intonation models (HMMs) on neutral data.
Step 2: Generate intonation contours from the HMMs.
Step 3: Adapt the models via MLLR given a small amount of emotion data.
Step 4: Transplant the generated contour onto an utterance using TD-PSOLA.
The current focus is on pitch modelling only, with syllable-based intonation models.

Intonation Models and Training
Basic Models
– Seven basic models: A (accent), C (unstressed), RB (rising boundary), FB (falling boundary), ARB, AFB, SIL.
Context-Sensitive Models
– Tri-unit models (preceding and following intonation unit).
– Full-context models (position of the syllable in the intonational phrase, forward counts of accents and boundary tones in the IP, position of the vowel in the syllable, number of phones in the syllable).
– Decision-tree-based parameter tying was performed for the context-sensitive models.
Data: Boston Radio Corpus. Features: normalized raw F0 and energy values as well as their differentials (a feature-extraction sketch follows).
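As a concrete illustration, here is a minimal sketch of building such observation vectors. The per-utterance z-score normalization, the delta window width, and the function names are assumptions for illustration, not details taken from the slides.

```python
import numpy as np

def deltas(x, width=1):
    """First-order differentials via a standard regression window."""
    pad = np.pad(x, width, mode="edge")
    num = sum(t * pad[width + t: len(x) + width + t]
              for t in range(-width, width + 1))
    den = 2 * sum(t * t for t in range(1, width + 1))
    return num / den

def intonation_features(f0, energy):
    """Per-frame observation vectors: normalized F0 and energy plus deltas.
    Normalization here is z-scoring per utterance (an assumption);
    f0 is assumed already interpolated through unvoiced regions."""
    f0n = (f0 - f0.mean()) / f0.std()
    en = (energy - energy.mean()) / energy.std()
    return np.stack([f0n, en, deltas(f0n), deltas(en)], axis=1)  # (T, 4)
```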

Recognition Results
Evaluation of the models was performed in a recognition framework, to assess how well the models represent intonation units and to quantify the benefits of incorporating context. A held-out test set was used for predicting intonation sequences. Basic models were tested with varying numbers of mixture components and compared with the accuracy of the full-context models.

Basic label set (7 models, 3 emitting states, N-component GMMs):
  N = 1    %Corr = 53.26   %Acc = 44.52
  N = 2    %Corr = 53.36   %Acc = 45.48
  N = 4    %Corr = 54.65   %Acc = 46.31
  N = 10   %Corr = 59.58   %Acc = 50.40
Full-context label set with decision-tree-based tying: %Acc = 55.88.
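%Corr and %Acc here follow the standard HTK definitions over the aligned label sequences: correctness ignores insertions, accuracy penalizes them. A minimal sketch, assuming the alignment counts are already available (e.g. from HResults):

```python
def corr_acc(n_labels, subs, dels, ins):
    """HTK-style scores from an alignment of recognized vs. reference labels:
    %Corr = (N - D - S) / N, %Acc = (N - D - S - I) / N, as percentages."""
    corr = 100.0 * (n_labels - dels - subs) / n_labels
    acc = 100.0 * (n_labels - dels - subs - ins) / n_labels
    return corr, acc

# hypothetical counts: corr_acc(1000, 350, 100, 60) -> (55.0, 49.0)
```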

Intonation Synthesis From HMM
The goal is to generate an optimal sequence of observations O directly from the syllable HMMs, given the intonation models λ:

    O_max = argmax_O P(O | λ)

The optimal state sequence Q_max is predetermined by basic duration models, so the parameter generation problem becomes

    O_max = argmax_O P(O | Q_max, λ)

Without further constraints, the solution is simply the sequence of state mean vectors for Q_max. We therefore used the cepstral parameter generation algorithm of the HTS system for interpolated F0 generation (Tokuda et al., 1995): differential F0 features (Δf and ΔΔf) are used as constraints in contour generation, and the maximization is done for the static parameters only.

Toshiba Update 14/09/2005 A single observation vector consists of static and dynamic features: The relationship between the static and dynamic features are as follows: This relationship can be expressed in matrix form where O is the sequence of full feature vectors and F is the sequence of static features only. W is the matrix form of window functions. The maximization problem then becomes: The solution is a set of equations that can be solved in a time recursive manner. (Tokuda et al, 1995) Intonation Synthesis From HMM

Intonation Synthesis From HMM
[The overview block diagram is repeated here, highlighting the F0 generation path: syllable labels and mean pitch drive F0 generation from the trained context-sensitive HMMs to produce the synthesized contour.]

Perceptual Effects of Intonation Units
[Figure: contours generated for the label sequences "a a c c c c" and "a a c a c fb".]

Pitch Contour Samples
[Figure: generated neutral contours transplanted onto unseen utterances, comparing the original contour with synthesized tri-unit and full-context contours.]

MLLR Adaptation to Emotional Speech
Maximum Likelihood Linear Regression (MLLR) adaptation computes a set of linear transformations for the mean and variance parameters of a continuous HMM. The number of transforms is determined by a regression tree together with a threshold for what counts as "enough" adaptation data; when a node has too little data, the transformation of its parent node is used. Adaptation data come from the Emotional Prosody Corpus, which consists of four-syllable phrases spoken in a variety of emotions. Happy and sad speech were chosen for this experiment. A sketch of the mean transform follows.
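For the mean parameters, each regression class k applies an affine transform μ̂ = A_k μ + b_k estimated from the adaptation data. A minimal sketch of applying (not estimating) such transforms, with the class assignments assumed to come from the regression tree:

```python
import numpy as np

def apply_mllr_means(means, transforms, classes):
    """Apply per-regression-class affine transforms to HMM mean vectors.
    means: (n_states, d) array; transforms: {class_id: (A, b)};
    classes: class id per state, from the regression tree (assumed given)."""
    adapted = np.empty_like(means)
    for i, mu in enumerate(means):
        A, b = transforms[classes[i]]
        adapted[i] = A @ mu + b  # mu_hat = A mu + b
    return adapted
```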

MLLR Adaptation to Happy & Sad Data
[Figure: contours generated for the label sequence "arb c ccc c" by the neutral, sad, and happy models.]

Perceptual Tests
Test 1: How natural are the neutral contours? Ten listeners were asked to rate utterances in terms of naturalness of intonation. Some utterances were unmodified and others had synthetic contours. A t-test (p < 0.05) on the data showed no significant difference between the ratings of the two hidden groups, i.e. the synthetic contours did not significantly degrade perceived quality.
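A minimal sketch of such a comparison in SciPy, with hypothetical ratings (the slides do not report the raw scores):

```python
from scipy import stats

# hypothetical 1-5 naturalness ratings for the two hidden groups
ratings_original = [4, 5, 4, 3, 4, 5, 4, 4, 3, 4]
ratings_synthetic = [4, 4, 3, 4, 4, 4, 5, 3, 4, 4]

# two-sample t-test; p > 0.05 would indicate no significant difference
t_stat, p_value = stats.ttest_ind(ratings_original, ratings_synthetic)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```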

Perceptual Tests
Test 2: Does adaptation work? The goal is to find out whether the adapted models produce contours that people perceive as more emotional than the neutral contours. Given pairs of utterances, 14 listeners were asked to identify the happier/sadder one.

Perceptual Tests
Utterances with sad contours were identified 80% of the time; this was significant (p < 0.01). Listeners formed a bimodal distribution in their ability to detect happy utterances: overall, only 46% of the happy intonation was identified as happier than neutral ("smiling voice" is notoriously difficult in the literature). The happy models worked better on utterances with more accents and rising boundaries, so the organization of labels matters!
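One way to check the significance of the 80% identification rate is a binomial test against the 50% chance level; a sketch with a hypothetical trial count, since the slides do not state the number of pairs:

```python
from scipy import stats

n_trials = 140                       # hypothetical: 14 listeners x 10 pairs
n_correct = int(0.80 * n_trials)     # 80% identified as sadder

# one-sided binomial test against 50% chance
result = stats.binomtest(n_correct, n_trials, p=0.5, alternative="greater")
print(f"p = {result.pvalue:.2e}")    # well below 0.01 for these counts
```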

Summary and Future Direction
A statistical approach to prosody generation was proposed, with an initial focus on F0 contours. The results of the perceptual tests were encouraging and yielded guidelines for future work:
– Bypass the use of perceptual labels: use lexical stress information as a prior in automatic labelling of corpora.
– Investigate the role of emotion on accent frequency to build a "language model" of emotion.
– Duration modelling: evaluate the HSMM framework as well as duration adaptation using vowel-specific conversion functions.
– Voice source modelling: treat LF parameters as part of prosody.
– Investigate the use of graphical models to allow hierarchical constraints on the generated parameters.
– Incorporate the framework into one or more TTS systems.