Automatic Exploration of Corpus-Specific Properties for Expressive Text-to-Speech (A Case Study in Emphasis)
Raul Fernandez and Bhuvana Ramabhadran
IBM T.J. Watson Research Center
Sixth Speech Synthesis Workshop (SSW6), Bonn, Germany, August 22-24, 2007

Slide 2: Outline
- Motivation
- Review of Expressive TTS Architecture
- Expression Mining: Emphasis
- Evaluation

Slide 3: Expressive TTS
We have shown that corpus-based approaches to expressive concatenative TTS (CTTS) manage to convey expressiveness when the corpus is designed to contain the desired expression(s). There are, however, shortcomings to this approach:
- Adding new expressions, or increasing the size of the repository for an existing one, is expensive and time-consuming.
- The footprint of the system grows as we add new expressions.
Without abandoning this framework, we propose to partially address these limitations with an approach that exploits the properties of the existing databases to maximize the expressive range of the TTS system.

Slides 4-6: Some observations about data and listeners
- Production variability: speakers produce subtle expressive variations, even when they are asked to speak in a mostly neutral style.
- Perceptual confusability/redundancy: several studies have shown that listeners' interpretations of the prosodic-acoustic realizations of different expressions overlap.
[Figure: overlap among the Anger, Fear, Sad, and Neutral expressions]

Slide 7: Expression Mining
- Goals:
  - Exploit the variability present in a given dataset to increase the expressive range of the TTS engine.
  - Augment the corpus-based approach to expressive synthesis with expression mining.
- Challenge: automatically annotating the instances in the corpus where an expression of interest occurs. (The approach may still require collecting a smaller expression-specific corpus to bootstrap data-driven learning algorithms.)
- Case study: emphasis.

Slide 8: Outline (next: Review of Expressive TTS Architecture)

Slide 9: The Expressive Framework of the IBM TTS System
- The IBM Expressive Text-to-Speech system consists of:
  - a rule-based front-end for text analysis
  - acoustic models (decision trees) for generating candidate synthesis units
  - prosody models (decision trees) for generating pitch and duration targets
  - a module to carry out a Viterbi search
  - a waveform-generation module to concatenate the selected units
- Expressiveness is achieved in this framework by associating symbolic attribute vectors with the synthesis units. These attribute values influence:
  - target prosody generation
  - unit-search selection

Slides 10-12: Attributes
- Style: {Default, Good News, Apologetic, Uncertain, ...}
- Emphasis: {0, 1}
- Possibly further attributes (e.g., voice quality = {breathy, ...})
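To make the attribute mechanism concrete, here is a minimal sketch of such a symbolic attribute vector in Python; the class and field names are illustrative assumptions, not the IBM system's actual data structures:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UnitAttributes:
    """Symbolic attribute vector attached to each synthesis unit."""
    style: str = "default"   # e.g. "good_news", "apologetic", "uncertain"
    emphasis: int = 0        # 0 = unemphasized, 1 = emphasized
```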

Slide 13: How do attributes influence the search?
- The corpus is tagged a priori.
- At run time:
  - The input is tagged at the word level (e.g., via user-provided mark-up) with annotations indicating the desired attribute.
  - Annotations are propagated down to the unit level.
- A component of the target cost function penalizes label substitutions, using a penalty matrix indexed by the requested and candidate labels (e.g., Neutral, Good News, Bad News).
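A small sketch of how such a penalty matrix might enter the target cost, with invented labels and penalty values (the real system's weights are not given on the slide):

```python
# Hypothetical substitution penalties: cost of selecting a candidate unit
# whose label differs from the requested label. Values are illustrative only.
SUBSTITUTION_PENALTY = {
    ("neutral", "neutral"): 0.0, ("neutral", "good_news"): 1.0, ("neutral", "bad_news"): 1.0,
    ("good_news", "neutral"): 0.5, ("good_news", "good_news"): 0.0, ("good_news", "bad_news"): 2.0,
    ("bad_news", "neutral"): 0.5, ("bad_news", "good_news"): 2.0, ("bad_news", "bad_news"): 0.0,
}

def attribute_target_cost(requested: str, candidate: str, weight: float = 1.0) -> float:
    """Label-substitution component of the target cost for one unit."""
    return weight * SUBSTITUTION_PENALTY[(requested, candidate)]
```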

Slide 14: How do attributes influence the search? (cont.)
- Additionally, the style attribute has style-specific prosody models (for pitch and duration) associated with it, so prosody targets are produced according to the requested style.
[Diagram: normalized text and target style feed model-output generation; the prosody model matching the target style (Style 1, 2, or 3) produces the prosody targets]
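The per-style dispatch the diagram implies could look like the following sketch, assuming one trained model per style; the ProsodyModel interface is hypothetical:

```python
# Hypothetical per-style prosody models; each maps normalized text to
# pitch/duration targets. The "predict" interface is an assumption.
class ProsodyModel:
    def __init__(self, style: str):
        self.style = style

    def predict(self, normalized_text: str) -> list[tuple[float, float]]:
        # Would return (pitch_target_hz, duration_ms) per phone; stubbed here.
        return []

PROSODY_MODELS = {s: ProsodyModel(s) for s in ("neutral", "good_news", "apologetic")}

def prosody_targets(normalized_text: str, target_style: str):
    # Fall back to the neutral model when no style-specific model exists.
    model = PROSODY_MODELS.get(target_style, PROSODY_MODELS["neutral"])
    return model.predict(normalized_text)
```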

Slide 15: Outline (next: Expression Mining: Emphasis)

Slide 16: Mining Emphasis
Pipeline:
Emphasis Corpus (~1K sents.) → Statistical Learner → Trained Emphasis Classifier
Baseline Corpus (~10K sents.) → Trained Emphasis Classifier → Baseline Corpus w/ Emphasis Labels → Build TTS System w/ Emphasis
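In code, the pipeline reduces to train-then-autolabel. This sketch assumes word-level feature vectors and an sklearn-style classifier; the names and confidence threshold are illustrative:

```python
def mine_emphasis(classifier, X_emphasis, y_emphasis, X_baseline, threshold=0.5):
    """Train on the small hand-labeled corpus, then auto-label the big one.

    X_emphasis, y_emphasis: word-level features and 0/1 emphasis labels
    from the ~1K-sentence emphasis corpus; X_baseline: features for the
    ~10K-sentence baseline corpus. Returns predicted emphasis labels.
    """
    classifier.fit(X_emphasis, y_emphasis)
    # Probability of the "emphasis" class for every baseline word.
    p_emphasis = classifier.predict_proba(X_baseline)[:, 1]
    return (p_emphasis >= threshold).astype(int)
```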

Slide 17: Training Materials
- Two sets of recordings, one from a female and one from a male speaker of US English.
- Approximately 1K sentences in the script; approximately 20% of the words carry emphasis.
- Recordings are single-channel, 22.05 kHz.
- Examples (emphasized words in capitals):
  - "To hear DIRECTIONS to this destination say YES."
  - "I'd LOVE to hear how it SOUNDS."
  - "It is BASED on the information that the company gathers, but not DEPENDENT on it."
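The slide's examples mark emphasized words in capitals, so deriving word-level training labels from a script line can be sketched as follows (the all-caps convention is inferred from the examples, not a documented format):

```python
import re

def emphasis_labels(script_line: str) -> list[tuple[str, int]]:
    """Map each word to 1 if it is written in ALL CAPS (emphasized), else 0."""
    words = re.findall(r"[A-Za-z']+", script_line)
    # len(w) > 1 avoids flagging the pronoun "I" as emphasized.
    return [(w, int(w.isupper() and len(w) > 1)) for w in words]

# emphasis_labels("I'd LOVE to hear how it SOUNDS.")
# -> [("I'd", 0), ('LOVE', 1), ('to', 0), ('hear', 0), ('how', 0), ('it', 0), ('SOUNDS', 1)]
```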

Slide 18: Modeling Emphasis – Classification Scheme
- Emphasis is modeled at the word level.
- Feature set: prosodic features derived from (i) pitch (absolute and speaker-normalized), (ii) duration, and (iii) energy measures.
- Individual classifiers (k-nearest neighbor, SVM, naive Bayes) are trained on the prosodic features, and their intermediate output probabilities are stacked to produce the final output probabilities; this marginally improves the generalization performance estimated through 10-fold cross-validation.
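A minimal reconstruction of this scheme with scikit-learn's StackingClassifier; the three base learners follow the slide, but the hyperparameters and the logistic-regression combiner are assumptions rather than the authors' configuration:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Base learners named on the slide; probability=True lets the SVM emit the
# intermediate class probabilities that the stacking layer consumes.
base_learners = [
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("svm", make_pipeline(StandardScaler(), SVC(probability=True))),
    ("nb", GaussianNB()),
]
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),  # assumed combiner
    stack_method="predict_proba",
)

# 10-fold CV estimate of generalization accuracy, as on the slide:
# scores = cross_val_score(stack, X, y, cv=10)  # X, y: prosodic features, labels
```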

Slide 19: Modeling Emphasis – Classification Results
Per-class TP rate, FP rate, precision, and F-measure were reported for the emphasis and not-emphasis classes for each voice; overall accuracy:
- Male voice: 91.2% of instances correctly classified.
- Female voice: 89.9% of instances correctly classified.
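The reported quantities are standard confusion-matrix statistics; the helper below restates their textbook definitions (it is not code from the paper):

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """TP rate (recall), FP rate, precision, F-measure, and accuracy for one class."""
    tp_rate = tp / (tp + fn)
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"tp_rate": tp_rate, "fp_rate": fp_rate, "precision": precision,
            "f_measure": f_measure, "accuracy": accuracy}
```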

Slides 20-25: What does it find in the corpus?
Sentences from the baseline corpus in which the classifier found emphasis:
- "I think they will diverge from bonds, and they may even go up."
- "Please say the full name of the person you want to call."
- "There's a long fly ball to deep center field. Going, going. It's gone, a home run."

Slide 26: Outline (next: Evaluation)

Slide 27: Listening Tests – Stimuli and Conditions
Three sentence types, defined by whether the target text carries emphasis mark-up and by the synthesis sources used:
- Type A: no emphasis in the text; baseline neutral units.
- Type B: emphasis in the text; units from the small training corpus with explicit emphasis.
- Type C: emphasis in the text; units from the training corpus augmented with mined emphasis labels from the baseline corpus.
Condition 1 pair: one Type-A sentence vs. one Type-B sentence (in random order).
Condition 2 pair: one Type-A sentence vs. one Type-C sentence (in random order).

Slide 28: Listening Tests – Setup
- Condition 1 (12 pairs): B1 vs A1, A2 vs B2, A3 vs B3, ..., B12 vs A12
- Condition 2 (12 pairs): A1 vs C1, A2 vs C2, C3 vs A3, ..., C12 vs A12
- All 24 pairs are shuffled together to form List 1; List 2 presents the same pairs with the within-pair order reversed (e.g., A2 vs C2 becomes C2 vs A2).
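An illustrative reconstruction of the playlist construction; stimulus IDs such as "A1" stand in for the synthesized sentences:

```python
import random

# Condition 1 pairs (A vs B) and Condition 2 pairs (A vs C), 12 each.
pairs = [(f"A{i}", f"B{i}") for i in range(1, 13)]
pairs += [(f"A{i}", f"C{i}") for i in range(1, 13)]

# Randomize the order within each pair, then shuffle across pairs -> List 1.
list1 = [tuple(random.sample(pair, 2)) for pair in pairs]
random.shuffle(list1)

# List 2 presents the same pairs with the within-pair order reversed.
list2 = [(second, first) for (first, second) in list1]
```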

Slide 29: Listening Tests – Task Description
- A total of 31 participants listen to a playlist (16 to List 1; 15 to List 2).
- For each pair of stimuli, listeners are asked to select which member of the pair contains emphasis-bearing words.
- No information is given about which words may be emphasized.
- Listeners may opt to listen to a pair repeatedly.

Slide 30: Listening Tests – Results
Share of pairs in which listeners selected each stimulus as the one bearing emphasis:

Condition     Neutral (A)   Emphatic (B/C)
1 (A vs B)    61.6%         38.4%
2 (A vs C)    48.7%         51.3%

Slide 31: Conclusions
- When only the limited expressive corpus is used, listeners actually prefer the neutral baseline. A possible explanation is that biasing the search heavily toward a small corpus introduces artifacts that interfere with the perception of emphasis.
- However, when the small expressive corpus is augmented with automatic annotations, the perception of intended emphasis increases significantly, by 13 percentage points (p < 0.001).
- Although further work is needed to reliably convey emphasis, we have demonstrated the advantages of automatically mining the dataset to augment the search space of expressive synthesis units.
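As a rough sanity check on the reported significance, a two-proportion z-test can be reconstructed under the assumption that all 31 listeners judged 12 pairs per condition (372 judgments each); the counts are back-computed from the reported percentages, not the authors' raw data:

```python
from statsmodels.stats.proportion import proportions_ztest

n = 31 * 12  # assumed judgments per condition: 31 listeners x 12 pairs = 372
# Emphatic stimulus chosen: 51.3% in Condition 2 vs 38.4% in Condition 1.
emphatic_chosen = [round(0.513 * n), round(0.384 * n)]
z, p = proportions_ztest(emphatic_chosen, [n, n])
print(f"z = {z:.2f}, p = {p:.4f}")  # roughly z = 3.5, p < 0.001
```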

Slide 32: Future Work
- Explore alternative feature sets to improve automatic emphasis classification.
- Extend the proposed framework to automatically detect more complex expressions in a "neutral" database and augment the search space of our expressive systems (e.g., good news; apologies; uncertainty).
- Explore how the perceptual confusion between different labels can be exploited to increase the range of expressiveness of the TTS system.
[Diagram: perceptual confusions among expression labels such as Neutral (N), Apologetic (A), Good News (GN), and Uncertain (U)]
