The Creation of Emotional Effects for an Arabic Speech Synthesis System
Prepared by: Waleed Mohamed Azmy
Under supervision of: Prof. Dr. Mahmoud Ismail Shoman, Dr. Sherif Mahdy Abdou

Agenda
- Motivations & Applications
- Emotional Synthesis Approaches
- Unit Selection & Blending Data Approach
- Arabic Speech Database & Challenges
- Festival – Speech Synthesis Framework
- Arabic Voice Building
- Proposed Utterance Structure
- Proposed Target Cost Function
- System Evaluation & Results
- Conclusion & Future Work

Motivations & Applications
Emotions are an inseparable component of natural human speech. A synthesizer can therefore reach the level of human speech only if it is able to synthesize emotions. With the rapidly growing demand for speech synthesis systems, emotional synthesis has become one of the major trends in this field.

Motivations & Applications
Emotional speech synthesis can be applied in applications such as:
- Spoken dialogue systems
- Customer-care centers
- Task planning
- Tutorial systems
- Automated agents
Future trends are moving towards artificial intelligence.

Emotional Synthesis Approaches
Attempts to add emotional effects to synthesized speech have existed for more than a decade. Most emotional synthesis approaches inherit the properties of the normal synthesis technique they are built on. Different emotional synthesis techniques provide control over acoustic parameters to very different degrees. Important emotion-related acoustic parameters include pitch, duration and speaking rate.

Emotional Synthesis Approaches
Emotional formant synthesis:
- Also known as rule-based synthesis.
- No human speech recordings are involved at run time.
- The resulting speech sounds relatively unnatural and "robot-like".
Emotional diphone concatenation:
- Uses diphone concatenation on emotional recordings.
- A majority of reports show some degree of success in conveying some emotions.
- May harm voice quality.

Emotional Synthesis Approaches
HMM-based parametric synthesis:
- Uses HMM models to change the speaking style and emotional expression of the synthetic speech.
- Models are trained on speech produced with different speaking styles.
- Requires several hundred sentences spoken in each different speaking style.
Emotional unit selection:
- Uses variable-length units.
- Units that are similar to the given target are likely to be selected.
- This is the approach used in our proposed system.

Emotional Unit Selection
Units are selected from large databases of natural speech. The quality of unit selection synthesis depends heavily on the quality of the speaker and on the coverage of the database. The quality of the synthesis relies on the fact that little or no signal processing is done on the selected units. Two major cost functions are used:
- Join cost function
- Target cost function
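The interplay of the two cost functions can be illustrated with a minimal sketch of a dynamic-programming (Viterbi-style) search over candidate units. This is a generic illustration, not the search used in Festival or in the proposed system; the function name `select_units` and the toy cost functions in the usage note are assumptions made for the example.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")  # target specification type (hypothetical)
U = TypeVar("U")  # candidate unit type (hypothetical)


def select_units(
    targets: Sequence[T],
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[T, U], float],
    join_cost: Callable[[U, U], float],
) -> List[U]:
    """Pick one candidate per target so that the sum of all target costs
    and all join costs between consecutive units is minimal."""
    if not targets:
        return []
    # history[i][k] = (best total cost ending in candidate k, backpointer)
    history = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        column = []
        for u in candidates[i]:
            # Cheapest way to reach u from any candidate of the previous target.
            best_cost, best_prev = min(
                (history[-1][j][0] + join_cost(p, u), j)
                for j, p in enumerate(candidates[i - 1])
            )
            column.append((best_cost + target_cost(targets[i], u), best_prev))
        history.append(column)
    # Backtrack from the cheapest final candidate.
    k = min(range(len(history[-1])), key=lambda j: history[-1][j][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][k])
        k = history[i][k][1]
    path.reverse()
    return path
```

With toy integer "units", `select_units([1, 5], [[1], [5, 4]], lambda t, u: abs(t - u), lambda p, u: 10 * abs(p - u))` prefers the unit `4` over the exact target match `5`, because the heavily weighted join cost outweighs the small target mismatch.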

Blending Data Approach
There are two main techniques for building emotional voices for unit selection.
Tiering techniques:
- Limited-domain synthesis.
- Require a large number of databases for the different emotions.
Blending techniques:
- General-purpose applications.
- All databases are placed in one data pool used for selection.
- This is the technique used in our system.
In the blending approach, the system chooses units from only one database. This requires a careful selection criterion to match the target speaking style.

Arabic Speech Database
Our system uses the RDI TTS Saudi speaker database. The database consists of 10 hours of recordings with neutral emotion and one hour of recordings for four different emotions: sadness, happiness, surprise and questioning. The EGG signal is also recorded to support pitch marking during the synthesis process. The database is of a male speaker, sampled at 16 kHz. An HMM-based Viterbi alignment procedure is used to produce the phone-level segmentation boundaries.

Festival – Speech Synthesis Framework
The Festival TTS system was developed at the University of Edinburgh by Alan Black and Paul Taylor. Festival is primarily a research toolkit for speech synthesis (TTS). It provides a state-of-the-art unit selection synthesis module called "Multisyn". Festival uses a data structure called an utterance structure, which consists of items and relations. An utterance represents some chunk of text that is to be rendered as speech.

Festival – Speech Synthesis framework Example of utterance structure:

Arabic Voice Building
Voice building is one of the major steps in building a unit selection synthesizer. The voice building toolkit that comes with Festival was used to build our Arabic emotional voice. The following steps were used:
- Generating power factors and normalizing the wave files
- Using the EGG signals for accurate pitchmark extraction
- Generating LPC coefficients and residuals from the wave files
- Generating MFCCs, pitch values and spectral coefficients

Arabic Voice Building (Pitch Marking)
In unit selection, accurate estimation of pitchmarks is necessary for pitch modification and ensures optimal quality of the synthetic speech. We used the EGG signal to extract the pitchmarks. Pitchmarking was improved through a matrix optimization process. The cut-off frequencies of the low-pass and high-pass filters were the parameters chosen for the optimization process.

Arabic Voice Building (Pitch Marking)
Example of the default parameters of the pitchmarking application versus the optimized ones. Figure panels: Default Pitchmarking; Optimized Pitchmarking.

Proposed Utterance Structure
The HMM-based alignment transcription is converted to the utterance structure. ASMO-449 is used to transliterate the Arabic text to English characters in the utterance files. The utterance structures of both the utterances in the training databases and the target utterances are changed to carry emotional information. A new feature called emotion has been added to the word item type in the utterance structure.

Proposed Utterance Structure
The proposed system is designed to work with three emotional states (normal, sad and question), so the emotion feature takes one of three values:
- Normal → normal state
- Sad → sad emotion state
- Question → question emotion state
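The idea of an utterance made of items and relations, with an emotion feature attached to each word item, can be modelled with a minimal sketch. This is an illustrative data model only, not Festival's own API; the class names `Item` and `Utterance`, the helper `tag_emotion` and the example words are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The three emotional states named on the slide.
ALLOWED_EMOTIONS = {"Normal", "Sad", "Question"}


@dataclass
class Item:
    """One item (for example a word) carrying named features."""
    name: str
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class Utterance:
    """An utterance as a set of named relations, each an ordered list of items."""
    relations: Dict[str, List[Item]] = field(default_factory=dict)


def tag_emotion(utt: Utterance, emotion: str) -> None:
    """Set the emotion feature on every item of the Word relation."""
    if emotion not in ALLOWED_EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")
    for word in utt.relations.get("Word", []):
        word.features["emotion"] = emotion


# Hypothetical two-word utterance, transliterated for illustration.
utt = Utterance(relations={"Word": [Item("marhaban"), Item("bikum")]})
tag_emotion(utt, "Sad")
```

Restricting the feature to a fixed set of values keeps the later cost computation simple, since the emotion labels can be compared directly.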

Proposed Target Cost Function
The target cost is a weighted sum of functions that check whether features in the target utterance match the corresponding features in the candidate utterance. The standard target cost in Festival does not account for differences in the emotional state of the utterance. A new emotional target cost is therefore proposed.
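A weighted-sum target cost of this kind is often written in the following general form in the unit selection literature. The symbols are assumptions chosen for illustration, and this is not necessarily the exact expression used in Festival or in the proposed system.

```latex
\[
  T(t_i, u_i) \;=\; \sum_{j=1}^{p} w_j \, T_j(t_i, u_i)
\]
```

Here $t_i$ is the target specification, $u_i$ is a candidate unit, $T_j$ is the mismatch function for the $j$-th feature, $w_j$ is its weight and $p$ is the number of features compared.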

Proposed Emotional Target Cost Function The target cost: The Emotional target cost:

Proposed Target Cost Function
In general, the algorithm favors units that have a similar emotional state, or that are classified as emotional, using two stages of penalties. The weighting factor of the emotional target cost is tuned by trial and error to find an appropriate value.
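One possible reading of the two-stage penalty can be sketched as follows. The staging rule (no penalty for an exact emotion match, a smaller penalty when both target and candidate are emotional but differ, a full penalty otherwise), the penalty values and the function names are all assumptions made for illustration; the original formula is not reproduced here.

```python
# Hypothetical penalty values; the real values are not given in the source.
PARTIAL_PENALTY = 0.5   # both emotional, but different emotions
FULL_PENALTY = 1.0      # emotional versus normal mismatch


def emotional_target_cost(target_emotion: str, candidate_emotion: str) -> float:
    """Assumed two-stage penalty on the emotion feature."""
    if candidate_emotion == target_emotion:
        return 0.0  # exact match: no penalty
    if target_emotion != "Normal" and candidate_emotion != "Normal":
        return PARTIAL_PENALTY  # stage 1: candidate is at least emotional
    return FULL_PENALTY  # stage 2: normal/emotional mismatch


def total_target_cost(
    base_cost: float,
    target_emotion: str,
    candidate_emotion: str,
    emotion_weight: float = 1.0,
) -> float:
    """Add the weighted emotional cost to an existing target cost value."""
    return base_cost + emotion_weight * emotional_target_cost(
        target_emotion, candidate_emotion
    )
```

Under these assumptions, raising `emotion_weight` makes the search more willing to accept a poorer phonetic or prosodic match in exchange for a unit recorded in the right emotional state, which is the trade-off that trial-and-error tuning has to balance.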

System Evaluation
Six sentences were synthesized for each emotional state. The sentences were taken from news websites, everyday conversations and the Holy Qur'an. Two major sets of evaluations were used:
- Deterministic evaluation
- Perceptual evaluation
In the deterministic evaluation tests, the emotional classifier system Emovoice is applied to the output utterances. Emovoice is trained on the Arabic data using an SVM model and its complete feature set.

System Evaluation
In the perceptual tests, 15 participants were involved. Two types of listening tests were performed in order to evaluate the system perceptually. The first tests intelligibility:
- Listeners were asked to type in what they heard.
- The Word Error Rate (WER) is computed.
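The WER computation can be sketched with the standard definition: the word-level edit distance (substitutions, deletions and insertions) between what the listener typed and the reference sentence, divided by the number of reference words. The function name and the whitespace tokenization are assumptions; the source does not describe how the texts were normalized.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (r != h)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)
```

For example, a listener who types one wrong word in a four-word sentence produces a WER of 0.25.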

System Evaluation
The second test measures naturalness and emotiveness. The experiment took place on a computer, where subjects listened to the synthesized sentences through headphones. The listeners were asked to rate the quality of the output voice in terms of naturalness and emotiveness. Participants gave ratings between 1 and 4, for poor, acceptable, good and excellent respectively.

System Evaluation - Results
The confusion matrix of the classifier output emotion against the target emotional state of the utterance:
Classified as: Normal, Sad, Question
Values: 5, 1, 6

System Evaluation - Results
The WER results are shown in the figure. The highest average WER is 8%, for the sad emotion.

System Evaluation - Results
The descriptive statistics of the naturalness and emotiveness ratings are summarized below.

Naturalness
Emotion     Mean rating   Standard deviation
Normal      2.72          1.01
Sad         2.71          1.02
Question    2.56          0.96
All         2.67          1.06

Emotiveness
Emotion     Mean rating   Standard deviation
Normal      3.28          0.71
Sad         2.76          1.09
Question    3.08          0.81
All         3.04          0.95
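Statistics of this kind can be computed from raw listener ratings with a minimal sketch such as the following. The ratings in the example are made up for illustration and are not the study's data, and the use of the sample standard deviation is an assumption, since the source does not say which estimator was used.

```python
from statistics import mean, stdev
from typing import Dict, Sequence, Tuple

# Labels of the 1 to 4 rating scale used in the listening test.
SCALE = {1: "poor", 2: "acceptable", 3: "good", 4: "excellent"}


def describe(ratings: Sequence[int]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of ratings on the 1-4 scale."""
    if any(r not in SCALE for r in ratings):
        raise ValueError("ratings must be integers from 1 to 4")
    return mean(ratings), stdev(ratings)


def describe_by_emotion(
    ratings: Dict[str, Sequence[int]],
) -> Dict[str, Tuple[float, float]]:
    """Per-emotion statistics plus an 'All' row over the pooled ratings."""
    table = {emotion: describe(values) for emotion, values in ratings.items()}
    pooled = [r for values in ratings.values() for r in values]
    table["All"] = describe(pooled)
    return table


# Hypothetical ratings, for illustration only.
example = {"Normal": [3, 4, 3, 2], "Sad": [2, 3, 4, 1], "Question": [3, 3, 2, 4]}
summary = describe_by_emotion(example)
```

Pooling all ratings for the "All" row, instead of averaging the per-emotion standard deviations, also captures the spread between emotions.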

System Evaluation - Results
The overall mean of the naturalness ratings is 2.7, which approaches good-quality naturalness. The overall mean of the emotiveness ratings is about 3.0, which indicates a good emotive state in the synthesized speech. The naturalness and emotiveness ratings for question sentences have a lower mean and high variance, which means that these sentences were, to some extent, not recognized as natural human speech. However, they still show good emotiveness scores.

Conclusions
The main goal of this research was to develop an emotional Arabic TTS voice. The research focused on three important emotional states: normal, sad and question. According to the different tests performed on the system, it shows promising results. Most participants perceived an acceptably natural voice with a clear, good emotive state.

Future Work
- Increase the duration of acted or real emotional utterances in the RDI Arabic speech database.
- Despite the work done on accurate pitchmarking, further enhancements are needed, especially for question utterances.
- Optimize the pitch modification module in Festival for better concatenation across different emotions.
- Use emotional speech conversion based on signal processing and feature modelling technologies. The initial key features are commonly known to be pitch and duration.

Questions?