Kannada Text to Speech Synthesis Systems: Emotion Analysis By D.J. RAVI Research Scholar, JSS Research Foundation, S.J College of Engg, Mysore-06.

Outline  Introduction  Phonetic Nature of Kannada language  Prosodic Feature Values  Time Duration  Intensity  Pitch  Result Analysis  Conclusions  References

Introduction  Inclusion of Emotional aspects into speech will improve the Naturalness of speech synthesis system.  The different emotions like Sadness, Anger, Happiness are manifested in speech as prosodic elements like Time Duration, Pitch & Intensity.  The prosodic values corresponding to different emotions are analyzed at word as well as phonemic level, using speech analysis and manipulation tool PRAAT.  This paper presents the emotional analysis of the prosodic features such as time duration, pitch and intensity of Kannada speech.

 Our analysis shows that the time-duration variation for different emotions at the word level is: Anger < Neutral < Happiness < Sadness. Time duration is least for anger and highest for sadness.  Intensity, in contrast, follows: Anger > Happiness > Neutral > Sadness. Intensity is highest for anger and least for sadness.  Also, the time-duration variation at the phonemic level is larger for vowels than for consonants.  The pitch contour is almost flat for neutral speech and hence shows a large variation for the different emotions.

 Kannada is a Dravidian language and is phonetic in nature: its written form has a direct correspondence to the spoken form.  The phonemes are divided into two types: vowels (swaras) and consonants (vyanjanas).  The Kannada language has  13 vowels and  34 basic consonants. Phonetic Nature of Kannada language

Vowels (Swaras)  Independently existing letters. Consonants (Vyanjanas)  Dependent on a vowel to take an independent written form.

Consonant (Vyanjana) + Vowel (matra) --> Letter (Akshara). Kagunitha: the combination of a consonant phoneme and a vowel phoneme produces a syllable. Consonant phoneme + Vowel phoneme => Syllable
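The kagunitha composition above maps directly onto Unicode: a consonant letter followed by a dependent vowel sign renders as one akshara. A minimal sketch using the standard codepoints for /ka/ and the vowel sign AA (the variable names are illustrative):

```python
import unicodedata

# A consonant letter (vyanjana) combined with a dependent vowel sign
# (matra) yields one written syllable (akshara).
ka = "\u0C95"        # KANNADA LETTER KA
matra_aa = "\u0CBE"  # KANNADA VOWEL SIGN AA

akshara = ka + matra_aa  # two codepoints, rendered as one syllable /kaa/

print(unicodedata.name(ka))        # KANNADA LETTER KA
print(unicodedata.name(matra_aa))  # KANNADA VOWEL SIGN AA
print(akshara)
```

Note that the akshara remains two codepoints internally; it is the rendering engine that draws them as a single glyph cluster.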

Unicode  A universal character set.  Provides a unique number for each character of a language.  Supported across all platforms and all languages.

Kannada Unicode
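The Kannada script is assigned the Unicode block U+0C80 to U+0CFF. A short sketch listing a few assigned codepoints from that block (the sample codepoints are standard Unicode assignments, not taken from the slides):

```python
import unicodedata

# The Kannada script occupies the Unicode block U+0C80..U+0CFF.
KANNADA_BLOCK = range(0x0C80, 0x0D00)

# A few assigned codepoints: letter A, letter KA, vowel sign AA.
sample = [0x0C85, 0x0C95, 0x0CBE]
names = [unicodedata.name(chr(cp)) for cp in sample]

for cp, name in zip(sample, names):
    print(f"U+{cp:04X}  {chr(cp)}  {name}")
```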

Basic units to Word (Pada)

Table 1: Kannada consonant phonemes, categorized according to the method of speech production and articulation. Rows give the manner of articulation; columns give the place of articulation.

Manner                    Bilabial   Labio-dental   Dental   Retroflex   Palatal   Velar   Glottal
Plosive (unaspirated)     p  b                      t  d     ṭ  ḍ                  k  g
Plosive (aspirated)       ph bh                     th dh    ṭh ḍh                 kh gh
Affricate (unaspirated)                                                  č  j
Affricate (aspirated)                                                    čh jh
Nasal                     m                         n        ṇ                     ṅ
Fricative                                           s        ṣ           š                 h
Liquid (lateral)                                    l        ḷ
Liquid (trill)                                      r
Semivowel                            v                                   y

The rows are arranged according to the manner of articulation (the method of speech production), and the columns according to the place of articulation. The phonetic nature of the language and the systematic categorization of the alphabet set can be used effectively for analysis and modeling.

Prosody, as related to language, refers to aspects like rhythm, melody and stress. These features are quantity (duration), stress (intensity) and intonation (pitch). Phonemes need to be categorized into groups based on position and context. Each syllable is broken down into combinations of vowels and consonants, and the durational patterns of the resulting phonemes at word-initial, medial and final positions are analyzed. Prosody
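The positional categorization described above can be sketched as a simple tagging pass over a word's phoneme sequence; the function name is illustrative, and the phoneme split of /appa/ matches the phoneme-level analysis later in this presentation:

```python
def tag_positions(phonemes):
    """Label each phoneme of a word as word-initial, medial, or final."""
    tags = []
    for i, p in enumerate(phonemes):
        if i == 0:
            pos = "initial"
        elif i == len(phonemes) - 1:
            pos = "final"
        else:
            pos = "medial"
        tags.append((p, pos))
    return tags

# /appa/ = vowel /a/ + consonants /p p/ + vowel /a/
print(tag_positions(["a", "p", "p", "a"]))
# [('a', 'initial'), ('p', 'medial'), ('p', 'medial'), ('a', 'final')]
```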

Average phoneme duration by word position:

Position   Initial   Medial   Final
Duration   11 ms     9 ms     8 ms

The waveform, pitch contour, time duration and average intensity of the phrase /ba illi/ (come here), uttered in different emotions by the same person, are shown in Figure 1. The plot shows that the prosodic features vary distinctly across the different emotions in comparison with neutral speech. Prosodic Feature Values

Figure 1 shows that the time duration of the sentence /ba illi/ (come here) is least for anger and highest for sadness. In comparison with neutral speech (606 ms), the duration increases for happiness (750 ms) and sadness (1106 ms), but reduces considerably for anger (447 ms): Anger < Neutral < Happiness < Sadness. The duration pattern varies from person to person, but the different emotions show the same general trends. Time Duration
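The word-level durations quoted above reproduce the reported ordering directly; a quick sketch using the values from Figure 1:

```python
# Durations (ms) of /ba illi/ reported for each emotion in Figure 1.
durations_ms = {"Anger": 447, "Neutral": 606, "Happiness": 750, "Sadness": 1106}

# Sorting by duration recovers the reported ordering.
ordering = sorted(durations_ms, key=durations_ms.get)
print(" < ".join(ordering))  # Anger < Neutral < Happiness < Sadness
```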

Table 2: Duration of words uttered by three speakers in different emotions (% change in comparison with neutral speech)

Word               Emotion      Speaker 1   Speaker 2   Speaker 3
/yelli/ (where)    Anger
                   Happiness
                   Sadness
/appa/ (father)    Anger
                   Happiness
                   Sadness

Table 2 gives the duration of the speech of the three speakers, uttering two words in different emotions, as a percentage of the neutral-speech duration. Neutral speech is taken as 100%, and the duration with each emotion is expressed in terms of the neutral duration (% duration = duration with emotion x 100 / neutral duration). Even though the percentages differ across the three speakers, the general trend is the same for each emotion.
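The percentage computation used in Table 2 can be written out directly; the function name is illustrative, and the example values are the /ba illi/ durations reported with Figure 1:

```python
def percent_of_neutral(duration_ms, neutral_ms):
    """Express an emotional duration as a percentage of the neutral duration."""
    return duration_ms * 100.0 / neutral_ms

# Example with the /ba illi/ durations reported earlier (neutral = 606 ms):
print(round(percent_of_neutral(447, 606), 1))   # anger: 73.8
print(round(percent_of_neutral(1106, 606), 1))  # sadness: 182.5
```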

Table 3: Duration of the individual words in the sentence /ninna hesaru enu/ (What is your name) for different emotions (% change in comparison with neutral speech)

Emotion      /ninna/   /hesaru/   /enu/
Anger
Happiness
Sadness

Table 3 gives the duration of each word of the sentence /ninna hesaru enu/ (What is your name) in different emotions, as a percentage of the neutral-speech duration. Here also, the different emotions show the same general trends.

Table 4: Duration of phonemes (ms) in the word /appa/ (father) for different emotions

Emotion      /a/   /p/   /p/   /a/   Total duration (ms)
Anger
Neutral
Happiness
Sadness

Table 4 gives the duration values of the phonemes in the word /appa/ (the vowels /a/ and the consonant /p/). The phonemes also follow the general trend of duration variation across the different emotions.

Figure 2: Duration (ms) change of the word /appa/ (father) for different emotions.
Figure 3: Duration (ms) change of the vowel /a/ and consonant /p/ in the word /appa/ (father) for four different emotions.

Table 5: Average intensity (dB) variation for different emotions (% in comparison with neutral speech)

Sample                                 Emotion     Intensity (%)
/ba illi/ (come here)                  Anger
                                       Happiness
                                       Sadness     98.90
/basava bandidana/ (has basava come)   Anger
                                       Happiness
                                       Sadness     94.98

Intensity
From Figure 1 it is seen that anger is articulated with maximum intensity, whereas sadness has minimum intensity, i.e. Anger > Happiness > Neutral > Sadness. Table 5 confirms that the average intensity variation for the different emotions is least for sadness and greatest for anger.
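Average intensity, as PRAAT reports it, is an energy measure expressed in dB. A minimal sketch of how a relative (full-scale) dB level follows from raw samples; the 440 Hz test tone and the function name are purely illustrative:

```python
import math

def mean_intensity_db(samples):
    """Relative intensity in dB: 20*log10 of the RMS amplitude."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms)

# One second of a full-scale 440 Hz sine sampled at 16 kHz:
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
print(round(mean_intensity_db(tone), 2))  # -3.01 (dB relative to full scale)
```

PRAAT references intensity to an absolute pressure level instead of digital full scale, but the RMS-then-log computation is the same idea.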

Table 6: Average pitch (Hz) variation for different emotions (% in comparison with neutral speech)

Sample                                 Emotion     Pitch (%)
/ba illi/ (come here)                  Anger
                                       Happiness
                                       Sadness
/basava bandidana/ (has basava come)   Anger
                                       Happiness
                                       Sadness

Pitch
From Figures 4, 5 and 6, the pitch contour of neutral speech is almost flat and of minimum value. The three figures that follow show the pitch contour of each emotional sentence together with its corresponding emotionless (neutral) counterpart.
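The observation that the neutral contour is "almost flat" can be quantified as the spread of the F0 track. A sketch with invented contour values, used purely for illustration:

```python
import statistics

def contour_spread(f0_hz):
    """Spread of a pitch contour (Hz); near zero for a flat contour."""
    return statistics.pstdev(f0_hz)

# Hypothetical F0 tracks (Hz) sampled along an utterance:
neutral = [120, 121, 120, 119, 120, 121]
anger = [180, 210, 240, 220, 260, 200]

print(contour_spread(neutral) < contour_spread(anger))  # True: neutral is flatter
```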

Figure 4: Pitch contours of the sentence (Why did you do this) uttered with anger emotion and without emotion.

Figure 5: Pitch contours of the sentence (What a beautiful flower) uttered with happiness emotion and without emotion.

Figure 6: Pitch contours of the sentence (I am extremely unhappy) uttered with sadness emotion and without emotion.

Result Analysis For instance, to simulate anger, the duration has to be reduced while the pitch and intensity are increased. Similarly, to simulate sadness, the duration and pitch have to be increased while the intensity is reduced. Due to the phonetic categorization of the alphabet set, rules need to be framed only for each category of phonemes, since the phonemes in each category share similar phonetic features. This reduces the complexity of prosodic modeling as well as of framing the synthesis rules. Rules for prosodic modification of the different phonemes can be framed from the phonemic-level analysis.
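The modification rules just described can be sketched as per-emotion scaling factors applied to the neutral prosodic values. The factors and neutral values below are illustrative placeholders, not the measured values from this study:

```python
# Illustrative scaling factors relative to neutral (duration, pitch, intensity),
# following the reported trends: anger shortens duration and raises pitch and
# intensity; sadness lengthens duration, raises pitch, and lowers intensity.
EMOTION_RULES = {
    "anger":   {"duration": 0.75, "pitch": 1.3, "intensity": 1.2},
    "sadness": {"duration": 1.8,  "pitch": 1.1, "intensity": 0.9},
}

def apply_emotion(neutral, emotion):
    """Scale neutral prosodic values to simulate an emotion."""
    rules = EMOTION_RULES[emotion]
    return {feature: value * rules[feature] for feature, value in neutral.items()}

neutral = {"duration": 606.0, "pitch": 120.0, "intensity": 70.0}
angry = apply_emotion(neutral, "anger")
print(angry)  # shorter duration, higher pitch and intensity than neutral
```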

From the manner of articulation of the different emotions it can be seen that the rise time and fall time capture more emotion information than any other prosodic parameter. For angry speech the duration is lowest and the intensity is highest, whereas for sad speech the duration is highest and the intensity is lowest.

The duration percentages of the different emotions, computed for different words spoken by different speakers in comparison with neutral speech, show that word duration is highest for sadness, followed by happiness and neutral, and is smallest for anger. The pitch contour is almost flat for neutral speech, and the average pitch of emotional speech is higher than that of neutral speech. The intensity of a word is lowest for sadness and highest for anger. The phoneme-level duration analysis shows that vowels capture the emotional variation more than consonants do. Conclusions

These results can be used effectively for framing rules for emotional speech synthesis. Incorporating these durational effects into a speech synthesis system will produce better speech than a system that does not use this knowledge.

References  I.R. Murray, M.D. Edgington, D. Campion, et al., "Rule-Based Emotion Synthesis Using Concatenated Speech," Proc. of the ISCA Workshop on Speech and Emotion, Belfast, Northern Ireland.  X.J. Ma, W. Zhang, W.B. Zhu, et al., "Probability Based Prosody Model for Unit Selection," Proc. of ICASSP'04, Montreal, Canada, May.  Pascal van Lieshout, Ph.D., "PRAAT," Oral Dynamics Lab, October 7.  D.J. Ravi and Sudarshan Patilkulkarni, "Kannada Text-To-Speech Systems: Duration Analysis," Proc. of ISCO 2009, Coimbatore, pp. 53.  D.J. Ravi and Sudarshan Patilkulkarni, "Speaker Dependent Duration Analysis of Vowels and Consonants for Kannada Text-To-Speech Systems," Proc. of NICE 2009, Bangalore.  D.J. Ravi and Sudarshan Patilkulkarni, "Time Duration Variation Analysis of Vowels and Consonants for Kannada Text to Speech Systems," Journal of Advance Research in Computer Engineering: An International Journal, July to December 2009.  Deepa P. Gopinath, Sheeba P.S. and Achuthsankar S. Nair, "Emotional Analysis for Malayalam Text to Speech Synthesis Systems," SETIT 2007.