The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman Dr. Sherif Mahdy Abdou
Agenda Motivations & Applications Emotional Synthesis Approaches Unit Selection & Blending Data Approach Arabic Speech Database & Challenges Festival – Speech Synthesis framework Arabic Voice Building Proposed Utterance Structure Proposed Target Cost Function System Evaluation & Results Conclusion & Future Works
Motivations & Applications Emotions are inseparable components of the natural human speech. Because of that, the level of human speech can only be achieved with the ability to synthesize emotions. As the tremendous increasing in the demand of speech synthesis systems, Emotional synthesis become one of the major tends in this field.
Motivations & Applications Emotional speech synthesis can be applied to different applications like: Spoken Dialogue Systems Customer-care centers Task planning Tutorial systems Automated agents Future trends are towards approaching Artificial Intelligence
Emotional Synthesis Approaches Attempts to add emotion effects to synthesised speech have existed for more than a decade. Most emotional synthesis approaches inherent properties of the various normal synthesis techniques used. Different emotional synthesis techniques provide control over acoustic parameters to very different degrees. Important emotional related acoustic parameters like: Pitch, Duration, speaking rate …etc
Emotional Synthesis Approaches Emotional Formant Synthesis Also known as rule-based synthesis No human speech recordings are involved at run time. The resulting speech sounds relatively unnatural and “robot-like”. Emotional Diphone Concatenation Use Diphone concatenation on Emotional recordings A majority reports shows a degree of success in conveying some emotions. May harm voice quality
Emotional Synthesis Approaches HMM-Based Parametric synthesis Use HMM models to change speaking style and emotional expression of the synthetic speech. Train models on speech produced with different speaking styles. Requires several hundreds of sentences spoken in a different speaking style. Emotional-Based Unit-Selection Use variable length units Units that are similar to the given target are likely to be selected. It is the used approach in our proposed system.
Emotional Unit Selection Units are selected from large databases of natural speech. The quality of unit selection synthesis depends heavily on the quality of the speaker and on the coverage of the database. The quality of the synthesis relies on the fact that little or no signal processing is done on the selected units. Two major cost functions are used Joint Cost function Target Cost function
Blending Data Approach Two main techniques of building emotional voices for unit selection Tiering techniques Limited domain synthesis. Requires a large number of databases for different emotions. Blending techniques General purpose application. All databases are located in one data pool used for selection. … used technique in our system. In blending approach, The system will choose the units from only one Database. Requires careful selection criterion to match target speaking style.
Arabic Speech Database In our system we used RDI TTS Saudi speaker database. This database consists of 10 hours of recording with neutral emotion and one hour of recordings for four different emotions that are sadness, happiness, surprise and questioning. The EGG signal is also recorded to support pitch marking during the synthesis process. The database is for male speaker sampled with 16 kHZ sampling rate. HMMs based Viterbi alignment procedure is used to produce the phone level segmentation boundaries
Festival – Speech Synthesis framework The Festival TTS system was developed at the University of Edinburgh by Alan Black and Paul Taylor. Festival is primarily a research toolkit for speech synthesis (TTS). It provides a state-of-the-art unit selection synthesis module called “Multisyn”. Festival uses a data structure called an utterance structure that consists of items and relations. An utterance represents some chunk of text that is to be rendered as speech
Festival – Speech Synthesis framework Example of utterance structure:
Arabic Voice Building Voice building phase is one of the major steps in building a unit selection synthesizer. The voice building toolkit that comes with festival has been used for building our Arabic Emotional voice. The following steps were used: Generate power factors and wave file normalization Using EGGs for accurate pitch marking extraction Generating LPC and residuals from wave files Generate MFCCs, pitch values and spectral coefficients
Arabic Voice Building(Pitch marking) In unit selection; Accurate estimation of pitchmarks is necessary for pitch modification to assure optimal quality of synthetic speech. We used EGG signal to extract pitchmarks. Pitchmarking enhancements has been carried out by matrix optimization process. The low pass filter and high pass filter cut-off frequencies have been chosen for the optimization process.
Arabic Voice Building(Pitch marking) Example of the default parameters of the pitchmarking application versus the optimized ones. Default Pitchmarking Optimized Pitchmarking
Proposed Utterance Structure The HMM-based alignment transcription is converted to the utterance structure. ASMO-449 is used to transliterate the Arabic text to English characters in the utterance files. The utterance structures of both the utterances in the training databases and the target utterances is changed to carry emotional information. A new feature called emotion has been added to the word item type in the utterance structure.
Proposed Utterance Structure The proposed system is designed to work with three emotional state (normal, sad and question). So, The emotion feature takes one of three emotional values. Normal Normal state Sad Sad emotion state Question Question emotion sate
Proposed Target Cost Function The target cost is a weighted sum of functions that check if features in the target utterance match those features in the candidate utterance. The standard target cost in festival does not contain any computation differences for emotional state of the utterance. A new emotional target cost is proposed
Proposed Emotional Target Cost Function The target cost: The Emotional target cost:
Proposed Target Cost Function The algorithm in general favors units that are similar in the emotional sate or classified to be emotional in a two stages of penalties. The tuning weighting factor of the emotional target cost is optimized by try- and-error to find appropriate value.
System Evaluation Six sentences were synthesized for each emotional state. The Sentences were from the news websites, usual conversations and the holy Qur’an. Two major set of evaluation used Deterministic Evaluation Perceptual Evaluation In deterministic evaluation tests the emotional classifier system Emovoice is used across the output utterance. Emovoice is trained on the Arabic data using SVM model and its complete feature set.
System Evaluation In perceptual tests, 15 participant have involved Two types of listening tests were performed in order to evaluate the system perceptually. Test intelligibility The listener was asked to type in what they heard Word Error Rate (WER) is computed
System Evaluation Test Naturalness & Emotiveness The actual experiment took place on a computer where subjects had to listen to the synthesized sentences using headphones. The listeners are asked to rate the quality of the output voice in terms of (Naturalness & Emotiveness) Participants were asked to give ratings between 1 and 4 for poor, acceptable, good and excellent respectively.
System Evaluation - Results The confusion matrix of the classifier output emotion and the target emotional state of the utterance is Classified As Normal Sad Question 5 1 6
System Evaluation - Results The results of the WER is shown in figure. It shows a maximum average of 8% in sad emotion WER
System Evaluation - Results The descriptive statistics of the naturalness and emotiveness ratings is summarized.. Naturalness Emotiveness Rating Mean Standard Deviation Normal 2.72 1.01 Sad 2.71 1.02 Question 2.56 0.96 All 2.67 1.06 Rating Mean Standard Deviation Normal 3.28 0.71 Sad 2.76 1.09 Question 3.08 0.81 All 3.04 0.95
System Evaluation - Results The overall mean of naturalness ratings is 2.7 which approach a good quality naturalness. The average ratings of overall emotiveness are 3.1 which indicate good emotive state of the synthesized speech The naturalness and emotiveness ratings for question emotion sentences has lower mean value and high variance which means that they were not –to some extent- recognized as natural human speech. However it shows a good emotiveness scores.
Conclusions The main goal of this research was to develop an Emotional Arabic TTS voice. This research focused on three important emotional sates; normal, sad and questions. According to the different tests performed on the system, it shows promising results. At most the participants feel acceptable natural voice with clear good emotive state.
Future works It is recommended to increase the duration of acted or real emotional utterances in the RDI Arabic speech database. However the work done for accurate pitch- marking, some further enhancements are needed especially for question speech utterances. Optimizing the pitch modification module in festival for better concatenation with different emotions. Use emotional speech conversion based on signal processing and feature modelling technologies. The initial key features are commonly known to be pitch and duration
Questions ?