SSW6, Bonn, Aug. 2007
Communicative Speech Synthesis with XIMERA: a First Step
Shinsuke Sakai 1,2, Jinfu Ni 1,2, Ranniery Maia 1,2, Keiichi Tokuda 1,3, Minoru Tsuzaki 1,4, Tomoki Toda 1,5, Hisashi Kawai 2,6, Satoshi Nakamura 1,2
1 NiCT, Japan; 2 ATR-SLC, Japan; 3 Nagoya Institute of Technology, Japan; 4 Kyoto City University of Arts, Japan; 5 Nara Institute of Science and Technology, Japan; 6 KDDI Research and Development Labs, Japan

Introduction
Motivation: concatenative synthesis achieves high naturalness, but the output is monotonous (it always speaks in the same, very articulate way). Example: "Taihen moushiwake arimasen ga, gokouhyou ni tsuki genzai shinagire to natte orimasu." ("We are very sorry, but due to its popularity the item is currently out of stock.", an apology spoken rather objectively.)
Goal: synthesizers that can speak in a style appropriate to the communicative purpose.
Input extension: style tags, e.g., marking "There was no room available tonight." as bad news.
Technical approach: concatenative synthesis (XIMERA) with style-specific target (HMM) models and/or style-specific unit databases (cf. IBM: Eide et al. 2003, 2004; Pitrelli et al. 2006).
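
To make the style-tag idea concrete, here is a minimal sketch of how tagged input could be represented and parsed. The <style="..."> tag syntax and the parse_tagged_input helper are hypothetical illustrations; the slide does not specify XIMERA's actual input format.

```python
import re

# Hypothetical style-tag markup: <style="bad-news">...</style>.
STYLE_TAG = re.compile(r'<style="(?P<style>[^"]+)">(?P<text>.*?)</style>', re.S)

def parse_tagged_input(markup):
    """Split tagged input into (style, text) segments; untagged text is neutral."""
    segments, pos = [], 0
    for m in STYLE_TAG.finditer(markup):
        if m.start() > pos:                       # untagged text before this tag
            segments.append(("neutral", markup[pos:m.start()]))
        segments.append((m.group("style"), m.group("text")))
        pos = m.end()
    if pos < len(markup):                         # trailing untagged text
        segments.append(("neutral", markup[pos:]))
    return [(s, t.strip()) for s, t in segments if t.strip()]

print(parse_tagged_input('<style="bad-news">There was no room available tonight.</style>'))
# -> [('bad-news', 'There was no room available tonight.')]
```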

Overview of the XIMERA speech synthesis system
– Large corpora (for Japanese: 110 hours male, 60 hours female).
– HMM target models, including prosody.
– Cost functions optimized through perceptual experiments.
– Japanese, Chinese, and English versions.
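
For readers less familiar with concatenative synthesis, the sketch below shows the general cost-based unit-selection search that systems like XIMERA build on. In XIMERA the target costs come from HMM target models (including prosody) and the cost functions are perceptually tuned; here both cost functions and weights are abstracted as user-supplied arguments, so this is an illustration of the search, not XIMERA's actual cost design.

```python
def select_units(targets, candidates, target_cost, join_cost, w_t=1.0, w_j=1.0):
    """Viterbi search for the unit sequence minimizing total target + join cost.

    targets:    one target specification per segment (e.g., per phone)
    candidates: one list of candidate units per target
    """
    # best[i][k] = (cumulative cost, back-pointer) for candidate k of target i
    best = [[(w_t * target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = w_t * target_cost(targets[i], u)
            cost, back = min(
                (best[i - 1][k][0] + w_j * join_cost(prev, u) + tc, k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Trace back the lowest-cost path.
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][k])
        k = best[i][k][1]
    return list(reversed(path))
```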

Corpus development
Speaker: F009 (Japanese female); a 60-hour neutral DB is already available.
Prompt text: a subset (corresponding to 2.6 hours of speech) extracted from the neutral-DB prompts and modified to have conversational endings:
– 1/2: newspaper sentences with conversational utterance-end expressions (conversion tool with hand correction).
– 1/4: the phonetically balanced 500-sentence set (ATR503), half as is, half with conversational sentence ends.
– 1/4: BTEC (basic travel conversation) sentences that can work as news, plus sentences from novels and essays with conversational endings.
Approximately 3 hours of speech were collected in each of the good-news and bad-news styles.
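
For concreteness, the prompt-set proportions above translate into roughly the following amounts of speech; the per-category hour figures are simple arithmetic from the 2.6-hour total and are not reported on the slide.

```python
# Rough breakdown of the 2.6-hour communicative prompt set by source
# (proportions from the slide; the per-category hours are derived here).
total_hours = 2.6
proportions = {
    "newspaper sentences, conversational endings": 0.50,
    "ATR503 phonetically balanced set":            0.25,
    "BTEC / novels / essays":                      0.25,
}
for source, frac in proportions.items():
    print(f"{source}: ~{frac * total_hours:.2f} h")
# newspaper ~1.30 h, ATR503 ~0.65 h, BTEC/novels/essays ~0.65 h
```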

XIMERA flowchart (figure). Speech databases (B: bad news, 2.7h; N: neutral, 2.3h; G: good news, 2.8h) feed the communicative target HMM models and unit databases; the unit-database configurations shown are B3.2h, G3h, N2.6h, N10h, B3.2h+N10h, and G3h+N10h.

Experiment: target models and unit databases
Main questions:
Q1: Are the good-news / bad-news styles perceived as intended?
Q2: Do we need a style-specific unit DB, or are style-specific target models enough?
Q3: Does adding the neutral DB help naturalness?
Eight systems were built from different combinations of target HMMs (G, N, B) and unit DBs (G2, G2+N10, N2, N10, B2, B2+N10): systems 1-3 use the good-news target, systems 4-5 the neutral target, and systems 6-8 the bad-news target.

Experiment: listening test design
Test data:
– All sentences carefully designed so that they can be interpreted as good / bad / neutral news.
– 10 sentences x 8 systems = 80 waveforms.
Listeners: native Japanese speakers.
Experiment I: 5-level opinion score on naturalness; 40 waveforms.
Experiment II: 7-level opinion score on style perception; 40 waveforms. For example: -3: sounds like bad news; -2: fairly sure it sounds like bad news; -1: somewhat sounds like bad news; 0: no distinction; and +1 to +3 analogously for good news.
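
A small sketch of how the two listening-test measures could be aggregated. The example ratings and the choice to count a style as recognized when the rating falls strictly on the intended side of 0 are assumptions; the slide does not give the exact scoring procedure.

```python
from statistics import mean

def mos(naturalness_ratings):
    """Mean opinion score from 5-level naturalness ratings (1-5)."""
    return mean(naturalness_ratings)

def style_recognition_rate(style_ratings, intended="good"):
    """Fraction of 7-level style ratings (-3..+3) on the intended side of 0."""
    on_intended_side = [
        r > 0 if intended == "good" else r < 0 for r in style_ratings
    ]
    return sum(on_intended_side) / len(style_ratings)

# Hypothetical ratings for one system:
print(mos([4, 3, 4, 3, 4]))                                 # 3.6
print(style_recognition_rate([2, 3, 1, -1, 2, 0], "good"))  # 0.666...
```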

Experiments: results
Observations:
(1) The intended style perception was well achieved while maintaining good naturalness: good news recognized 66.7%, MOS 3.6 (G-G2); bad news recognized 98.4%, MOS 2.9 (B-B2).
(2) The good-news style sounded more natural to listeners, possibly because good news is more similar to neutral speech.
(3) Style perception was clearer for bad news.

Experiments: results (cont'd)
Other observations:
(1) The target model alone is not enough; a unit DB for the specific style makes a difference.
(2) Adding neutral data does not improve naturalness (a slight degradation instead).
(3) Speech in the good-news / bad-news styles sounded more natural when the systems were built with the same amount of data.

Experiments: F0-related observations
In natural speech:
– bad-news speech: the F0 mean is lower and the dynamic range is narrower;
– good-news speech: the F0 mean is a little higher and the dynamic range a little wider.
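
The F0 statistics mentioned here (mean and dynamic range over voiced frames) can be computed as in the sketch below; the example contours and the use of a 5th-95th percentile span as the "dynamic range" are illustrative assumptions.

```python
import numpy as np

def f0_stats(f0_contour_hz):
    """Return (mean F0, dynamic range) over voiced frames (F0 > 0)."""
    f0 = np.asarray(f0_contour_hz, dtype=float)
    voiced = f0[f0 > 0]                      # unvoiced frames are marked as 0
    mean_f0 = voiced.mean()
    lo, hi = np.percentile(voiced, [5, 95])  # robust span instead of min/max
    return mean_f0, hi - lo

# Hypothetical contours (Hz), unvoiced frames as 0:
bad_news  = [180, 185, 0, 178, 175, 0, 182, 176]
good_news = [210, 240, 0, 225, 260, 0, 205, 245]
print(f0_stats(bad_news))    # lower mean, narrower range
print(f0_stats(good_news))   # higher mean, wider range
```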

Conclusion
An initial attempt at communicative speech synthesis with good-news and bad-news styles, using about 3 hours of style-specific corpora for each style.
The intended style perception was well achieved while maintaining good naturalness: good news recognized at 66.7% with MOS 3.6 (G-G2); bad news recognized at 98.4% with MOS 2.9 (B-B2).
Not only style-specific target models but also style-specific unit databases were effective in synthesizing speech in the intended styles.
Future work: investigate the contributions of the spectral, F0, and duration features separately, rather than the models as a whole.

Appendices

(Appendix) Test sentences: input01 through input10, each designed to be interpretable as neutral, good news, or bad news.
