Published by Emil Hawkins. Modified over 9 years ago.
2
Emotions in Hindi -Recognition and Conversion S.S. Agrawal CDAC, Noida & KIIT, Gurgaon email: ssagrawal@cdacnoida.in, ss_agrawal@hotmail.com
3
Contents
Intonation patterns with sentence type categories
A relationship between F0 values in vowels and emotions: an analytical study
Recognition and perception of emotions based on spectral and prosodic values obtained from vowels
F0 pattern analysis of emotion sentences in Hindi
Emotion conversion using the intonation database from sentences and words
Comparison of machine and perception experiments
4
Intonation Patterns of Hindi
Hindi speech possesses pitch patterns that depend on the meaning, structure and type of the sentence. Intonation also decides the meaning of certain words depending on the type of sentence or phrase in which they occur. In Hindi we observe three levels of intonation, which can be classified as ‘normal’, ‘high’ and ‘low’. In exceptional cases the presence of VH (very high) and EH (extremely high) is felt, though this rarely occurs. For observing intonation patterns due to sentence type, we may classify sentences into the following eight categories: Affirmative, Negative, Interrogative, Imperative, Doubtful, Desiderative, Conditional, and Exclamatory.
5
Intonation patterns of Hindi
Affirmative (MHL pitch pattern)
Negative (MHL pitch pattern)
Imperative (ML pitch pattern)
Doubtful
6
Intonation Patterns of Hindi
Desiderative
Exclamatory (MHM pitch pattern)
7
Application on Emotional Behavior
Recognition of Emotion
Conversion of Emotion
8
Emotion Recognition
Natural human-machine interaction requires machine-based emotional intelligence.
To respond satisfactorily to human emotions, computer systems need accurate emotion recognition.
Emotion recognition can also be used to monitor the physiological state of individuals in demanding work environments, and to augment automated medical or forensic data analysis systems.
9
METHOD
Material
Speakers: six male graduate students (from the drama club, Aligarh Muslim University, Aligarh), native speakers of Hindi, age group 20-23 years
Sentences: 5 short neutral Hindi sentences
Emotions: neutral, happiness, anger, sadness, and fear
Repetitions: 4
In this way there were 600 (6×5×5×4) sentences.
10
Recording
Electret microphone
Partially sound-treated room
“PRAAT” software
Sampling rate 16 kHz, 16 bit
The distance between mouth and microphone was kept at about 30 cm.
11
Listening test
The 600 sentences were first randomized across sentences and speakers, and then presented to 20 naive listeners, who assigned each one to one of five categories: neutral, happiness, anger, sadness, and fear. Only those sentences whose emotion was identified by at least 80% of the listeners were selected for this study. After selection, we were left with 400 sentences.
12
Acoustic Analysis
Prosody-related features: mean value of pitch (F0), duration, rms value of sound pressure, and speech power
Spectral features: 15 mel frequency cepstral coefficients (MFCCs)
13
Prosody features
For the present study, the central 60 ms portion of each vowel /a/ occurring at different positions in the sentences (underlined in the sentences given in the Appendix) was used to measure all the features. There were 13 /a/ vowels in total (3 in the first sentence, 3 in the second, 2 in the third, 4 in the fourth, and 1 in the fifth). After extracting the 60 ms portion of every /a/ vowel, the values were averaged over all the vowels of each sentence. Besides F0, speech power and sound pressure were also calculated.
14
Feature extraction method
Praat software was used to measure all the prosody features. The figure shows the waveform (upper) and spectrogram (lower, with pitch as a blue line) for the word /sItar/ spoken with (a) anger, (b) fear, (c) happiness, (d) neutral and (e) sadness, as obtained in “PRAAT” software.
15
[Figure panels: (a) Anger, (b) Fear, (c) Happiness, (d) Neutral]
16
[Figure panel: (e) Sadness]
17
Table 1: F0 value (Hz) for vowel
Emotion      A       E       I       i       Av
Anger        237     218     222.2   244     234.6
Sadness      107     110.4   106.0   111     110.1
Neutral      134.5   131.0   132.5   146.9   136.3
Happiness    194.5   190.5   189.1   189.0   191.8
Fear         160.9   163.7   162.3   191.2   173.2
18
Spectral features
MFCC coefficients were calculated using MATLAB.
Frame duration was 16 ms, with an overlap of 9 ms between frames.
From each frame 3 MFCCs were calculated; as there were five frames, this gave 15 MFCCs for each sample.
Thus in total there are 19 parameters (the 15 MFCCs plus the 4 prosodic features).
All 19 measured parameters of the sentences of each emotion were then normalized with respect to the parameters of the neutral sentences of the given speaker.
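The slides do not say how the normalization with respect to neutral was computed. A minimal sketch, assuming each parameter is divided by the same speaker's mean neutral value (subtracting the neutral mean would be an equally plausible reading); the function name and array layout are illustrative, not taken from the study:

```python
import numpy as np

def normalize_to_neutral(features, neutral_features):
    """Scale one speaker's feature vectors by that speaker's neutral baseline.

    features:         array of shape (n_samples, n_params), emotional utterances
    neutral_features: array of shape (n_neutral, n_params), same speaker, neutral
    Assumption: normalization is a ratio to the mean neutral value per parameter.
    """
    features = np.asarray(features, dtype=float)
    neutral_mean = np.mean(np.asarray(neutral_features, dtype=float), axis=0)
    return features / neutral_mean
```

With this reading, a normalized F0 of 1.7 would mean the utterance's F0 is 70% above the speaker's neutral F0, which makes values comparable across speakers.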
19
Recognition of emotion
Independent variables: measured acoustic parameters
Dependent variables: emotional categories
Recognition was carried out by people as well as by a neural network classifier.
By people
The selected 400 sentences were randomized sentence-wise and speaker-wise. These randomized sentences were presented to 20 native listeners of Hindi, who identified the emotions within five categories: neutral, happiness, anger, sadness, and fear. All the listeners were educated, from a Hindi background, and in the age group of 18 to 28 years.
20
By neural network classifier (using PRAAT software)
Training: 70% of the data
Classification test: 30% of the data
As the parameters were normalized with respect to the neutral category, only four emotions (anger, fear, happiness, and sadness) were recognized by the classifier.
21
Contd…
In the present study a 3-layered (two hidden layers and one output layer) feed-forward neural network was used, in which each hidden layer had 10 nodes. There were 19 input units, representing the acoustic parameters used. The output layer had 4 units, representing the output categories (4 emotions in the present case). Results were obtained with the neural network classifier using 2000 training epochs and 1 run for each data set.
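The 19-10-10-4 layout described above can be sketched as a forward pass. The study used Praat's built-in classifier; the NumPy code below is only a stand-in to show the shapes involved, and the tanh activation, softmax output and random initial weights are assumptions not given in the slides:

```python
import numpy as np

def init_network(rng, sizes=(19, 10, 10, 4)):
    """Create (weights, bias) pairs for a 19-10-10-4 feed-forward network."""
    return [(rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Return class probabilities for a batch of 19-dimensional feature vectors."""
    h = np.asarray(x, dtype=float)
    for w, b in params[:-1]:
        h = np.tanh(h @ w + b)                 # two hidden layers, 10 nodes each
    w, b = params[-1]
    logits = h @ w + b                         # 4 output units, one per emotion
    logits = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)   # softmax over the 4 emotions
```

Calling `forward(init_network(np.random.default_rng(0)), batch)` on a batch of shape (n, 19) gives an (n, 4) array whose rows sum to 1; training (2000 epochs in the study) is not shown.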
22
RESULT AND DISCUSSION
Recognition of emotion
By people
Most recognizable emotion: anger (82.3%)
Least recognizable emotion: fear (75.8%)
Average recognition of emotion: 78.3%
Recognition of emotion was in the order: anger > sadness > neutral > happiness > fear
23
Performance
Table 1. Confusion matrix of recognition of emotion by people (%)
Category     Neutral   Happiness   Anger   Sadness   Fear
Neutral      77.0      1.0         3.8     14.2      4.0
Happiness    4.1       76.5        7.8     5.4       6.2
Anger        7.2       5.0         82.3    2.1       3.4
Sadness      7.3       1.7         3.4     80.0      7.6
Fear         5.1       4.5         2.8     11.8      75.8
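The summary figures quoted for the listeners (best, worst and average recognition) follow directly from the diagonal of this confusion matrix. A short check, using only the values in the table; the variable names are illustrative:

```python
import numpy as np

labels = ["Neutral", "Happiness", "Anger", "Sadness", "Fear"]

# One row per category of the listener confusion matrix, values in percent.
confusion = np.array([
    [77.0,  1.0,  3.8, 14.2,  4.0],
    [ 4.1, 76.5,  7.8,  5.4,  6.2],
    [ 7.2,  5.0, 82.3,  2.1,  3.4],
    [ 7.3,  1.7,  3.4, 80.0,  7.6],
    [ 5.1,  4.5,  2.8, 11.8, 75.8],
])

# Correct recognition per emotion is the diagonal of the matrix.
per_emotion = dict(zip(labels, np.diag(confusion)))
average = float(np.mean(np.diag(confusion)))
ranking = sorted(labels, key=per_emotion.get, reverse=True)
```

Here `average` evaluates to 78.32, which rounds to the 78.3% reported, and `ranking` reproduces the order anger, sadness, neutral, happiness, fear.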
24
By neural network classifier (NNC)
The confusion matrix obtained by the NNC is shown in Table 2.
Most recognizable emotions: anger (90%) and sadness (90%)
Least recognizable emotion: fear (60%)
Average recognition of emotion: 80%
Recognition of emotion was in the order: anger = sadness > happiness > fear.
Figure 2 shows a histogram comparing the percentage of correct recognition of emotion by people and by the NNC.
25
Table 2. Confusion matrix of recognition of emotion by NNC (%)
Category     Happiness   Anger   Sadness   Fear
Happiness    80.0        3.3     3.3       3.4
Anger        6.7         90.0    0.0       3.3
Sadness      0.0         0.0     90.0      10.0
Fear         10.0        6.7     23.3      60.0
26
Figure 2. Comparison of percentage correct recognition of emotion by people and NNC
27
Emotion Conversion
28
Intonation based Emotional Database
Six native speakers
20 Hindi sentences
Five expressive styles: neutral, sadness, anger, surprise, happy
29
Happiness
F0 curve of utterances:
rise and fall pattern at the beginning of the sentences
hold pattern at the end of the sentences
31
Anger
F0 curve of utterances:
rise and fall at the beginning of the sentences
fall towards the end of the sentences
33
Sadness
F0 contour of utterances:
fall or hold at the end of the sentences
fall and rise at the beginning of the sentences
fall-fall pattern throughout the contour
35
Normal
F0 curve of utterances:
falls at the end of the utterances
rise and fall at the beginning of the sentences
In most cases we observed a fall in sentence-final position, irrespective of the speaker.
37
Surprise
F0 curve of utterances:
rise and fall pattern in sentence-initial position
rise pattern in sentence-final position
Most of the utterances of the surprise emotion take the form of a question-based surprise state.
39
Emotion Conversion
Storing all utterances in every expressive style is a difficult and time-consuming task, and it also consumes a huge amount of memory. An approach is needed that minimizes the time and memory required for an emotion-rich database. With this in mind, the authors have proposed an algorithm for emotion conversion.
40
Contd…
This algorithm requires only the neutral utterances to be stored in the database. Utterances in the other expressive styles are produced from the neutral ones. The proposed algorithm is based on the linear modification model (LMM), in which fundamental frequency (F0) is one of the factors used to convert emotions.
41
Intonation based Emotional Database
This is another database, directly associated with the main module of emotion conversion. It keeps the pitch point values (Table 1) for the utterances already present in the Speech Database. The number of pitch points depends on the number of syllables in the sentence and on the resolution frequency (fr). The resolution frequency is the minimum amount by which every remaining pitch point lies above or below the line that connects its two neighbouring pitch points.
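The role of the resolution frequency can be illustrated with a small stylization routine. This is a sketch under assumptions rather than the exact procedure used in the study: distances are measured in Hz along the frequency axis, the two end points are always kept, and the function name is illustrative.

```python
def stylize(points, fr):
    """Remove pitch points that lie within `fr` Hz of the line joining their neighbours.

    points: list of (time_s, f0_hz) tuples sorted by time.
    Repeatedly drops the interior point closest to the straight line between
    its two neighbours, stopping once that closest distance exceeds fr.
    """
    pts = list(points)
    while len(pts) > 2:
        best_i, best_d = None, None
        for i in range(1, len(pts) - 1):
            (t0, f0), (t1, f1), (t2, f2) = pts[i - 1], pts[i], pts[i + 1]
            on_line = f0 + (f2 - f0) * (t1 - t0) / (t2 - t0)  # interpolated F0 at t1
            d = abs(f1 - on_line)
            if best_d is None or d < best_d:
                best_i, best_d = i, d
        if best_d > fr:
            break              # every remaining point is more than fr from its line
        del pts[best_i]
    return pts
```

A larger `fr` removes more points, so the number of pitch points stored per sentence shrinks as the resolution frequency grows.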
42
Table: Pitch Point Table (neutral emotion, recorded by one of the speakers).
43
Pitch point values in Hz; "-" means no pitch point.
Sentence  Pt1    Pt2    Pt3    Pt4    Pt5    Pt6    Pt7    Pt8    Pt9    Pt10   Pt11
1         200.4  232.2  391.4  236.5  185.7  208.1  496.8  211.6  179.2  -      -
2         244.2  213.8  262.6  -      159.6  210.0  172.6  177.4  -      -      -
3         200.9  262.3  231    259    219.7  175.8  87.7   88.8   207.6  201.9  -
4         200.1  231.1  183.8  230.1  234.6  188.7  173.2  152.1  246.8  233.7  -
5         227.3  255.3  220.1  249    189.9  231.7  166.7  221.5  187.4  170.5  203.4
6         232.9  252.7  197.7  237.5  205.5  258.3  206.8  246.3  201.9  193.5  -
7         205.7  237.6  203.4  228.2  165.9  202.1  -      -      -      -      -
8         260.9  230.1  251.8  211.6  238.3  200.3  98.4   94.2   202.3  182    -
9         258.5  215.5  202.3  233.7  175.8  144.3  83     197.5  181.9  -      -
10        229    203.9  316.8  229.4  207.2  79     256.8  192.8  202.4  148.3  193
11        208.5  201.8  235    203.5  216.7  507.9  489    216.2  168.4  85.3   96.7
12        253.1  223.5  251.4  221.6  249.9  189.7  172.7  85.6   89.2   203.3  186.8
13        229.4  204    273.6  198.3  240.7  200.3  234    161    198.3  -      -
14        244.6  265.6  224.4  280.5  198.4  265.6  165.7  191.7  -      -      -
15        259.6  209.7  308.8  235.6  224.7  252.5  205.4  177.4  -      -      -
16        210.6  223.2  181.3  91.3   93.6   -      -      -      -      -      -
17        277    225.1  107.3  105.2  229.7  110.9  108.8  211.1  198.4  98.1   93.4
18        273    234.4  262.9  204.4  228    506.0  257.6  180.5  185.7  -      -
19        264.8  219.6  254.5  195.9  225.1  184.2  192.7  97.5   189.9  209.8  178
20        242.5  207.9  257.7  179    201.3  162.8  191.2  -      -      -      -
44
F0 Based Emotion Conversion Emotion conversion at Sentence level Emotion conversion at Word level
45
F0 Based Emotion Conversion
In these methods the pitch points (Pi) were studied for the source emotion (neutral) and the target emotion (surprise), and the differences between corresponding pitch points were evaluated after normalization. These differences indicate the values by which the pitch points of the source utterance must be increased or decreased to convert it to the target utterance. For pitch analysis the step length is taken as 0.01 second, and the minimum and maximum pitch as 75 Hz and 500 Hz. A stylization process is then performed to remove the excess pitch points, and the valid number of pitch points is noted.
46
F0 Based Emotion Conversion…
On comparing the source and target emotion training sets, the pitch points are divided into four groups, with initial frequencies x1, x2, x3, and x4 respectively. Based on observation of the training set, y1, y2, y3 and y4 are added to the corresponding “x” values. In some cases the pitch point number also matters and is taken into account when deciding the transformed F0 value. The xi and yi values were obtained after rigorous analysis of the pitch patterns of neutral and emotional utterances.
47
Pitch point 1
Range difference (y) (Hz)   +40    +100   +150
Utterance frequency         82%    10%    8%
Pitch point 2
Range difference (y) (Hz)   -40    +40    >+100
Utterance frequency         25%    70%    5%
Pitch point 3
Range difference (y) (Hz)   -100   +25    >+80
Utterance frequency         10%    73%    17%
Pitch point 4
Range difference (y) (Hz)   -10    +40    >+80
Utterance frequency         17%    55%    28%
48
Sentence Based Emotion Conversion - Algorithm
Select the desired sound waveform
Convert the speech waveform to a pitch tier
// Stylization
For all Pi
    Select the Pi that is closest to the straight line and compare with the resolution frequency (fr)
    if distance between Pi and the straight line > fr
        Stop the process
    else
        Repeat for the other Pi
Divide the pitch points into four groups
For each group
    group[1] = x1+y1 || x1-y1
    group[2] = x2+y2+2(pitch point number)
    group[3] = x3+y3
    group[4] = x4+y4+3(pitch point number)
Remove the existing pitch points
Add the newly calculated pitch points in place of the old pitch points.
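The group-wise update step can be sketched as follows. Several details are assumptions made only for illustration: the stylized pitch points are split into four groups of roughly equal size, the rule is applied to every point in a group, the positive sign is taken for group 1, and the default y values are the most frequent range differences for pitch points 1 to 4 in the neutral-to-surprise training data. The function name is hypothetical.

```python
def convert_pitch_points(freqs, y=(40.0, 40.0, 25.0, 40.0)):
    """Shift stylized neutral pitch points (Hz) towards the target emotion.

    freqs: neutral F0 values in time order.
    y:     per-group offsets y1..y4 in Hz (assumed defaults, see lead-in).
    """
    n = len(freqs)
    converted = []
    for idx, x in enumerate(freqs):
        number = idx + 1              # 1-based pitch point number
        group = idx * 4 // n + 1      # assumption: four roughly equal groups
        new = x + y[group - 1]        # x_g + y_g
        if group == 2:
            new += 2 * number         # group[2] = x2 + y2 + 2(pitch point number)
        elif group == 4:
            new += 3 * number         # group[4] = x4 + y4 + 3(pitch point number)
        converted.append(new)
    return converted
```

For eight points at 200 Hz this gives 240, 240, 246, 248, 225, 225, 261 and 264 Hz. The converted values would then replace the original pitch points before resynthesis.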
49
Figure 1. Pitch points for natural neutral emotion Figure 2. Pitch points for natural surprise emotion
50
Experimental Results
For this process the sentence “कल तुम्हें फाँसी हो जाएगी।” (“You will be hanged tomorrow.”) was considered; the results are given in Figure 5 and Table 5. In the figure, the upper panel shows the natural surprise utterance and the lower panel shows the utterance transformed from neutral to surprise. Table 5 compares the two, pitch point by pitch point.
51
Figure 5 Natural and transformed Surprise emotion utterance.
52
Table 5. Comparison table for “कल तुम्हें फाँसी हो जाएगी।”
Pitch point   Natural surprise utterance (Hz)   Transformed surprise utterance (Hz)
1             349.5                             326.4
2             389                               389.7
3             244.5                             291.6
4             217.9                             251.3
5             255.6                             479.7
6             414.1                             316.9
7             261.7                             321.8
8             492.1                             461.3
9             324.2                             355.4
10            375.9                             386.7
11            399.2                             452
53
Analysis For Word Boundary Detection
Where one word ends and a new word starts in a sentence, the intensity value decreases significantly. There are many points in recorded speech where the intensity decreases, but not every such point is a word boundary (refer to the figure). In most cases there is a word boundary at a point where the intensity decreases and the pitch value is undefined. There are several regions where the pitch value is undefined, and in each such region only one of the many candidate points can be a word boundary. Sometimes there are several low-intensity points where the pitch value is defined, and these may also be word boundaries. In a region where pitch values are defined, no two word boundaries exist within a time span of 0.10 seconds.
55
Word Boundary Detection Algorithm
Rule 1: Intensity valleys above the threshold value I0 are not considered as word segment boundaries.
Rule 2: Intensity valleys below I0 are considered as word boundaries.
Rule 3: Valleys in a non-pitch range can be considered as word segment boundaries.
Rule 4: If there is more than one intensity valley during a pitch contour pattern, the lowest valley is considered as the word segment boundary.
Rule 5: If there is no intensity point in an undefined pitch pattern, there is no word boundary.
Rule 6: If there are several intensity valleys in a pitch-defined range and the time difference between them is less than 0.9 sec, only the lowest intensity point is considered as a word boundary.
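One way to read these rules as a procedure is sketched below. It is an interpretation, not the authors' implementation: the `Valley` record, the `region` index (one number per contiguous pitch-defined or pitch-undefined stretch) and the `min_gap` parameter with its 0.10 s default are all assumptions introduced for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Valley:
    time: float       # position of the intensity valley, in seconds
    intensity: float  # intensity at the valley
    region: int       # index of the pitch-defined/undefined stretch it falls in
    voiced: bool      # True if pitch is defined at this valley

def lowest(valleys):
    return min(valleys, key=lambda v: v.intensity)

def detect_word_boundaries(valleys, i0, min_gap=0.10):
    """Return boundary times chosen from intensity valleys below threshold i0."""
    # Rules 1 and 2: only valleys below the threshold are candidates.
    candidates = sorted((v for v in valleys if v.intensity < i0),
                        key=lambda v: v.time)
    by_region = {}
    for v in candidates:
        by_region.setdefault(v.region, []).append(v)

    boundaries = []
    for group in by_region.values():
        if not group[0].voiced:
            # Rules 3 and 5: at most one boundary per pitch-undefined region.
            boundaries.append(lowest(group))
            continue
        # Rules 4 and 6: among valleys closer together than min_gap, keep the lowest.
        cluster = [group[0]]
        for v in group[1:]:
            if v.time - cluster[-1].time < min_gap:
                cluster.append(v)
            else:
                boundaries.append(lowest(cluster))
                cluster = [v]
        boundaries.append(lowest(cluster))
    return sorted(b.time for b in boundaries)
```

Regions without any candidate valley contribute no boundary, which covers Rule 5 without extra code.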
56
Emotion Conversion algorithm (Word level)
For all Pi
    Select the Pi that is closest to the straight line and compare with the resolution frequency (fr)
    if distance between Pi and the straight line > fr
        Stop the process
    else
        Repeat for the other Pi
Divide the pitch points into word segments as produced by WBD
For each word segment’s F0 min, F0 max, F0 beg and F0 end
    Word_{i,f} = C_{n,i,f} * X_{n,i,f}
    // where n is the emotional state, i denotes the word segment and f denotes the F0 values for the various prosodic points
Remove the existing pitch points
Add the newly calculated pitch points and duration in place of the old pitch points.
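The per-word scaling Word_{i,f} = C_{n,i,f} * X_{n,i,f} can be sketched as a lookup and multiply. The dictionary layout and function name are illustrative assumptions; the coefficients C for a target emotion would come from training data and are not given in the slides.

```python
PROSODIC_POINTS = ("f0_min", "f0_max", "f0_beg", "f0_end")

def convert_word_segments(segments, coefficients):
    """Scale each word segment's F0 landmarks by emotion-specific coefficients.

    segments:     one dict per word segment i, holding X for each prosodic point f
    coefficients: one dict per word segment i, holding C for the target emotion n
    """
    return [
        {point: coeffs[point] * segment[point] for point in PROSODIC_POINTS}
        for segment, coeffs in zip(segments, coefficients)
    ]
```

For example, a segment with F0 min 100 Hz and F0 max 200 Hz and a placeholder coefficient of 1.5 for every point would be mapped to 150 Hz and 300 Hz.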
57
Algorithm Implementation
Original neutral utterance: AAJ College Jana Hain
Sentence based conversion: Anger, Happy, Sad, Surprise
Word based conversion: Anger, Happy, Sad, Surprise
58
Results (Word Boundary Detection)
Speaker ID   Recognition Rate (%)   False Recognition (%)
S1           88                     10.2
S2           91                     11.6
S3           87.6                   9.3
S4           93.4                   10.4
S5           91.6                   8.5
S6           85.2                   11.3
S7           82.5                   9.3
S8           89.7                   10.6
S9           93.2                   6.8
S10          93.1                   7.2
S11          87.3                   9.1
S12          70.3                   15.6
S13          88.8                   10.3
S14          93.2                   10.0
S15          86.6                   9.5
59
Results (Word Boundary Detection)
                     Word Boundary   Non-Word Boundary
Word Boundary        90.8%           9.2%
Non-Word Boundary    15.6%           80.1%
60
Perception Test
61
Comparison Of Transformed Emotions
62
Transformed Perception matrix
Listeners are divided into 3 groups of 5 candidates each.
Emotion     Perception
Surprise    91.2%
Sadness     89.6%
Neutral     10.4%
Anger       8.8%
63
Conclusion and Future work
In this paper we have not considered the alignment of pitch points by linguistic rules; this will be the next objective for emotion conversion. We have taken only the F0 and energy factors into account in our experiment; other factors such as spectrum, duration and syllable information have not been considered and can be investigated further. The experiment was performed on 800 utterances, which is not a large enough set; the database should be enlarged to improve the results. Since Hindi shows a few distinctions from other languages, it is justified to design a Hindi-based intonational model into which the transformation of emotions can be incorporated.