Emotions in Hindi - Recognition and Conversion
S.S. Agrawal, CDAC, Noida & KIIT, Gurgaon


Contents
Intonation patterns with sentence-type categories
Relationship between F0 values in vowels and emotions: an analytical study
Recognition and perception of emotions based on spectral and prosodic values obtained from vowels
F0 pattern analysis of emotion sentences in Hindi
Emotion conversion using the intonation database from sentences and words
Comparison of machine and perception experiments

Intonation Patterns of Hindi
Hindi speech possesses pitch patterns that depend on meaning, structure and sentence type. Intonation also decides the meaning of certain words depending on the type of sentence or phrase in which they occur. In Hindi we observe three levels of intonation, which can be classified as 'normal', 'high' and 'low'. In exceptional cases a VH (very high) or EH (extremely high) level is felt, though it occurs rarely. For observing intonation patterns due to sentence type, we may classify sentences into the following categories: Affirmative, Negative, Interrogative, Imperative, Doubtful, Desiderative, Conditional, and Exclamatory.

Intonation patterns of Hindi
Affirmative (MHL pitch pattern)
Negative (MHL pitch pattern)
Imperative (ML pitch pattern)
Doubtful

Intonation Patterns of Hindi
Desiderative
Exclamatory (MHM pitch pattern)

Applications to Emotional Behavior
Recognition of Emotion
Conversion of Emotion

Emotion Recognition
Natural human-machine interaction requires machine-based emotional intelligence: to respond satisfactorily to human emotions, computer systems need accurate emotion recognition. Such recognition can also be used to monitor the physiological state of individuals in demanding work environments and to augment automated medical or forensic data-analysis systems.

METHOD
Material
Speakers: six male graduate students (from the drama club, Aligarh Muslim University, Aligarh), native speakers of Hindi, age group: … years.
Sentences: five short neutral Hindi sentences.
Emotions: neutral, happiness, anger, sadness, and fear.
Repetitions: 4.
In this way there were 600 (6 × 5 × 5 × 4) sentences.

Recording
Electret microphone
Partially sound-treated room
"PRAAT" software
Sampling rate 16 kHz / 16 bit
Distance between mouth and microphone was about 30 cm

Listening test
The 600 sentences above were first randomized across sentences and speakers and then presented to 20 naive listeners, who assigned each to one of five categories: neutral, happiness, anger, sadness, and fear. Only those sentences whose emotion was identified by at least 80% of the listeners were selected for this study. After selection, we were left with 400 sentences.
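The selection step above can be expressed compactly. A minimal sketch (not from the original study), assuming the listener responses are kept in a simple dictionary keyed by stimulus id; `responses`, `intended` and the example data are illustrative only:

```python
# Hedged sketch: keep only stimuli on which at least 80% of the 20 listeners
# agreed with the intended emotion label. Data structures are hypothetical.

from collections import Counter

def select_stimuli(responses, intended, threshold=0.80):
    """responses: {sentence_id: [listener labels]}, intended: {sentence_id: label}."""
    selected = []
    for sid, labels in responses.items():
        agreement = Counter(labels)[intended[sid]] / len(labels)
        if agreement >= threshold:
            selected.append(sid)
    return selected

# Example: a sentence is kept only if at least 16 of 20 listeners chose "anger".
demo = {"s1_anger_rep1": ["anger"] * 17 + ["fear"] * 3}
print(select_stimuli(demo, {"s1_anger_rep1": "anger"}))
```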

Acoustic Analysis
Prosody-related features: mean pitch (F0), duration, RMS sound pressure, and speech power.
Spectral features: 15 mel-frequency cepstral coefficients (MFCCs).

Prosody features
For the present study, the central 60 ms portion of the vowel /a/ occurring at different positions in all the sentences (underlined in the sentences given in the Appendix) was used to measure all the features. In total there were 13 /a/ vowels (3 in the first sentence, 3 in the second, 2 in the third, 4 in the fourth, and 1 in the fifth). After extracting the 60 ms segments from all the /a/ vowels, the values were averaged over the vowels of each sentence. Besides F0, speech power and sound pressure were also calculated.
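A minimal sketch of this measurement, assuming the /a/ vowel boundaries are already annotated and using the praat-parselmouth Python interface to Praat rather than the Praat GUI; file names and boundary times are placeholders:

```python
# Hedged sketch: mean F0 over the central 60 ms of an annotated /a/ vowel.
# Requires the praat-parselmouth package; paths and times are illustrative.

import numpy as np
import parselmouth

def central_60ms_f0(wav_path, vowel_start, vowel_end, pitch_floor=75, pitch_ceiling=500):
    mid = 0.5 * (vowel_start + vowel_end)
    snd = parselmouth.Sound(wav_path)
    segment = snd.extract_part(from_time=mid - 0.030, to_time=mid + 0.030)
    pitch = segment.to_pitch(pitch_floor=pitch_floor, pitch_ceiling=pitch_ceiling)
    f0 = pitch.selected_array["frequency"]
    f0 = f0[f0 > 0]                      # drop unvoiced frames
    return float(np.mean(f0)) if f0.size else float("nan")

# Averaging over the /a/ tokens of one sentence (times are placeholders).
tokens = [("sent1.wav", 0.42, 0.53), ("sent1.wav", 0.91, 1.02)]
print(np.nanmean([central_60ms_f0(p, s, e) for p, s, e in tokens]))
```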

Feature extraction method
Praat software was used to measure all the prosody features. The figure shows the waveform (upper) and spectrogram (lower), with pitch as a blue line, for the word /sItar/ in (a) anger, (b) fear, (c) happiness, (d) neutral and (e) sadness, as obtained in PRAAT.

Figure panels: (a) Anger, (b) Fear, (c) Happiness, (d) Neutral, (e) Sadness.

Table 1: F0 values for the vowels (columns: A, E, I, i, Av) under Anger, Sadness, Neutral, Happiness and Fear (numeric values omitted).

Spectral features
MFCC coefficients were calculated in MATLAB. The frame duration was 16 ms, with 9 ms overlap between frames. From each frame 3 MFCCs were calculated, and since there were five frames we obtained 15 MFCCs for each sample. Together with the prosodic measures, there are therefore 19 parameters in total. All 19 measured parameters of the sentences of each emotion were then normalized with respect to the parameters of the neutral sentences of the same speaker.
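The slide describes a MATLAB computation; the following is a rough Python equivalent using librosa, not the authors' code. It reads the 9 ms figure as the frame shift (which is what yields exactly five frames in a 60 ms segment at 16 kHz); the file name and that interpretation are assumptions:

```python
# Hedged sketch: 3 MFCCs per 16 ms frame, 9 ms shift, at 16 kHz, keeping the
# five frames of a 60 ms vowel segment, i.e. 15 spectral values per sample.

import librosa
import numpy as np

y, sr = librosa.load("vowel_segment.wav", sr=16000)   # placeholder 60 ms /a/ segment
mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=3,
    n_fft=int(0.016 * sr),       # 16 ms frame  -> 256 samples
    hop_length=int(0.009 * sr),  # 9 ms shift   -> 144 samples
    center=False,                # 960 samples -> exactly 5 frames
)
spectral_15 = mfcc.T.flatten()   # 5 frames x 3 coefficients = 15 values
print(spectral_15.shape)

# The 15 MFCCs plus the 4 prosodic measures give the 19-dimensional vector,
# which is then normalized against the same speaker's neutral sentences.
```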

Recognition of emotion
Independent variables: measured acoustic parameters. Dependent variables: emotional categories. Recognition was carried out both by human listeners and by a neural network classifier.
By people
The selected 400 sentences were randomized sentence-wise and speaker-wise and presented to 20 native listeners of Hindi, who identified the emotion of each as one of five categories: neutral, happiness, anger, sadness, and fear. All listeners had a Hindi-medium educational background and were aged 18 to 28 years.

By neural network classifier (using PRAAT software)
70% of the data was used for training and 30% for the classification test. As the parameters were normalized with respect to the neutral category, only four emotions (anger, fear, happiness, and sadness) were recognized by the classifier.

Contd.
In the present study a 3-layered (two hidden layers and one output layer) feed-forward neural network was used, in which both hidden layers had the same number of nodes. There were 19 input units representing the acoustic parameters used. The output layer had 4 units representing the output categories (the 4 emotions in the present case). Results were obtained with the neural network classifier using 2000 training epochs and 1 run for each data set.
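As a rough illustration of this set-up (not the Praat classifier actually used), here is a scikit-learn feed-forward network with 19 inputs, two hidden layers and 4 outputs, a 70/30 split and up to 2000 iterations; the hidden-layer size and the random features are placeholders, since the slides do not give them:

```python
# Hedged sketch of the classifier configuration described above.
# 10 units per hidden layer is only a placeholder value.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X = np.random.rand(400, 19)                           # placeholder 19-dim feature vectors
y = np.random.choice(["anger", "fear", "happiness", "sadness"], size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```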

RESULTS AND DISCUSSION
Recognition of emotion by people
Most recognizable emotion: anger (82.3%). Least recognizable emotion: fear (75.8%). Average emotion recognition: 78.3%. Recognition of emotion was in the order: anger > sadness > neutral > happy > fear.

Table 1. Confusion matrix of emotion recognition by people (rows: intended emotion; columns: perceived emotion: Neutral, Happiness, Anger, Sadness, Fear; cell values omitted).

By neural network classifier (NNC)
The confusion matrix obtained by the NNC is shown in Table 2. Most recognizable emotions: anger (90%) and sadness (90%). Least recognizable emotion: fear (60%). Average emotion recognition: 80%. Recognition of emotion was in the order: anger = sadness > happy > fear. Figure 2 shows a histogram comparing the percentage of correct emotion recognition by people and by the NNC.

Table 2. Confusion matrix of emotion recognition by the NNC (rows: intended emotion; columns: perceived emotion: Happiness, Anger, Sadness, Fear; cell values omitted).

Figure 2. Comparison of percentage correct recognition of emotion by people and NNC.

Emotion Conversion

Intonation-based Emotional Database
Six native speakers, 20 Hindi sentences, five expressive styles: Neutral, Sadness, Anger, Surprise, Happy.

Happiness
The F0 curve of the utterances shows a rise-and-fall pattern at the beginning of the sentences and a hold pattern at the end.

Anger
The F0 curve of the utterances rises and falls at the beginning of the sentences and falls towards the end.

Sadness
The F0 contour of the utterances falls or holds at the end of the sentences, falls and rises at the beginning, and shows a fall-fall pattern throughout the contour.

Normal
The F0 curve of the utterances falls at the end and rises and falls at the beginning of the sentences. In most cases we observed a fall in sentence-final position irrespective of the speaker.

Surprise
The F0 curve of the utterances shows a rise-and-fall pattern in sentence-initial position and a rise pattern in sentence-final position. Most of the surprise utterances take the form of a question-based surprise state.

Emotion Conversion
Storing all utterances for every expressive style is a difficult and time-consuming task and consumes a huge amount of memory. An approach is needed that minimizes the time and memory required for an emotion-rich database. With this in mind, the authors have proposed an algorithm for emotion conversion.

Contd.
The algorithm requires storing only neutral utterances in the database; utterances in the other expressive styles are produced from the neutral ones. The proposed algorithm is based on a linear modification model (LMM), in which the fundamental frequency (F0) is one of the factors used to convert emotions.

Intonation-based Emotional Database
A second database is directly associated with the main emotion-conversion module. It stores the pitch point values (Table 1) for the utterances already present in the speech database. The number of pitch points depends on the number of syllables in the sentence and on the resolution frequency (fr). The resolution frequency is the minimum amount by which every remaining pitch point must lie above or below the line that connects its two neighbouring pitch points; a small sketch of this criterion follows.
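A small sketch of how this criterion can be applied during stylization, assuming pitch points are plain (time, F0) pairs; the numeric values are illustrative only:

```python
# Hedged sketch of the resolution-frequency idea: a pitch point is redundant if
# it lies closer than fr (Hz) to the straight line joining its two neighbours,
# and is removed during stylization. Points are (time_s, f0_hz) tuples.

def stylize(points, fr=2.0):
    pts = list(points)
    changed = True
    while changed and len(pts) > 2:
        changed = False
        for i in range(1, len(pts) - 1):
            (t0, f0), (t1, f1), (t2, f2) = pts[i - 1], pts[i], pts[i + 1]
            # F0 of the straight line between the neighbours, evaluated at t1
            f_line = f0 + (f2 - f0) * (t1 - t0) / (t2 - t0)
            if abs(f1 - f_line) < fr:        # closer than fr -> drop the point
                del pts[i]
                changed = True
                break
    return pts

contour = [(0.10, 180), (0.25, 182), (0.40, 230), (0.62, 205), (0.80, 140)]
print(stylize(contour, fr=25.0))             # fr chosen only to make the demo prune
```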

Table: Pitch point table (neutral emotion recorded by one of the speakers). Columns: Sentence, Pt1 (Hz), Pt2, ..., Pt10, ... (pitch point values omitted).

F0-Based Emotion Conversion
Emotion conversion at sentence level
Emotion conversion at word level

F0-Based Emotion Conversion
In these methods, the pitch points (Pi) were studied for the desired source emotion (neutral) and target emotion (surprise), and the differences between corresponding pitch points were evaluated after normalization. These differences indicate by how much the pitch points of the source utterance must be increased or decreased to convert it into the target utterance. For pitch analysis the step length is taken as 0.01 second and the minimum and maximum pitch as 75 Hz and 500 Hz. A stylization process is then performed to remove excess pitch points, and the valid number of pitch points is noted.

F0-Based Emotion Conversion (contd.)
Comparing the source and target emotion training sets, the pitch points are divided into four groups, whose initial frequencies are set to x1, x2, x3, and x4 respectively. On the basis of observations of the training set, y1, y2, y3 and y4 are added to the corresponding x values. In some cases the pitch point number also matters and influences the transformed F0 value. The xi and yi values were obtained after a rigorous analysis of the pitch patterns of neutral and emotional utterances.

Pitch point 1: range difference y (Hz): …, …, …; utterance frequency: 82%, 10%, 8%
Pitch point 2: range difference y (Hz): …, …, >+100; utterance frequency: 25%, 70%, 5%
Pitch point 3: range difference y (Hz): …, …, >+80; utterance frequency: 10%, 73%, 17%
Pitch point 4: range difference y (Hz): …, …, >+80; utterance frequency: 17%, 55%, 28%

Sentence-Based Emotion Conversion - Algorithm
Select the desired sound waveform
Convert the speech waveform into a pitch tier
// Stylization
For all pitch points Pi:
    Select the Pi closest to the straight line and compare its distance with the resolution frequency (fr)
    if distance between Pi and the straight line > fr:
        stop the stylization
    else:
        remove Pi and repeat for the other Pi's
Divide the pitch points into four groups
For each group:
    group[1] = x1 + y1  (or x1 - y1)
    group[2] = x2 + y2 + 2 * (pitch point number)
    group[3] = x3 + y3
    group[4] = x4 + y4 + 3 * (pitch point number)
Remove the existing pitch points
Add the newly calculated pitch points in place of the old ones
(A Python sketch of this rule is given below.)
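A Python rendering of the group-wise rule above, assuming stylized (time, F0) pitch points and placeholder x/y values (the authors' trained values are not reproduced here):

```python
# Hedged sketch of the sentence-level rule: stylized pitch points are split
# into four positional groups and replaced using the group offsets learned
# from neutral/target training pairs. x and y values below are placeholders.

def convert_sentence(points, x, y):
    """points: stylized (time_s, f0_hz) pitch points of the neutral utterance.
    x, y: per-group base frequencies and offsets (4 values each)."""
    n = len(points)
    group_size = max(1, (n + 3) // 4)                 # four roughly equal groups
    converted = []
    for idx, (t, _) in enumerate(points):
        g = min(idx // group_size, 3)                 # group index 0..3
        if g == 0:
            f_new = x[0] + y[0]                       # slide rule: x1+y1 (or x1-y1)
        elif g == 1:
            f_new = x[1] + y[1] + 2 * (idx + 1)       # pitch-point number matters here
        elif g == 2:
            f_new = x[2] + y[2]
        else:
            f_new = x[3] + y[3] + 3 * (idx + 1)
        converted.append((t, f_new))                  # replaces the old pitch point
    return converted

neutral = [(0.10, 180), (0.30, 210), (0.55, 195), (0.80, 150)]
print(convert_sentence(neutral, x=[180, 200, 190, 160], y=[40, 30, 20, 35]))
```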

Figure 1. Pitch points for natural neutral emotion Figure 2. Pitch points for natural surprise emotion

Experimental Results
For this process the sentence "कल तुम्हें फाँसी हो जाएगी।" (roughly, "Tomorrow you will be hanged.") was considered, and the results are given in Figure 5 and Table 5. In the figure, the upper panel shows the natural surprise utterance and the lower panel shows the utterance transformed from neutral to surprise. Table 5 summarizes the conversion algorithm pitch point by pitch point.

Figure 5. Natural and transformed surprise emotion utterances.

Table 5. Comparison table for "कल तुम्हें फाँसी हो जाएगी।". Columns: pitch points, natural surprise utterance (Hz), transformed surprise utterance (Hz) (values omitted).

Analysis for Word Boundary Detection
In a sentence, where one word ends and a new word starts, the intensity value decreases significantly. There are many points in recorded speech where the intensity decreases, but not every such point is a word boundary (refer to the figure). In most cases there is a word boundary at a point where the intensity decreases and the pitch value is undefined. There are several regions where the pitch value is undefined, and in each such region only one of the candidate points can be a word boundary. Sometimes there are low-intensity points where the pitch value is defined, and these may also be word boundaries. In regions where pitch values are defined, no two word boundaries occur within a time span of 0.10 seconds.

Word Boundary Detection Algorithm
Rule 1: Intensity valleys above the threshold value I0 are not considered word-segment boundaries.
Rule 2: Intensity valleys below I0 are considered word boundaries.
Rule 3: Valleys in a non-pitch (unvoiced) region can be considered word-segment boundaries.
Rule 4: If there is more than one intensity valley within a pitch contour pattern, the valley with the lowest value is taken as the word-segment boundary.
Rule 5: If there is no intensity valley in an undefined-pitch region, there is no word boundary there.
Rule 6: If there are several intensity valleys in a pitch-defined range and their time difference is less than 0.9 sec, only the lowest intensity point is considered a word boundary.
(A sketch of these rules is given below.)
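A rough sketch of how these rules might be applied to per-frame intensity and F0 tracks; the threshold I0, the 0.10 s gap (taken from the analysis slide rather than Rule 6's 0.9 s), and the synthetic demo signals are assumptions:

```python
# Hedged sketch of the word-boundary rules, applied to per-frame intensity (dB)
# and F0 (Hz, 0 = unvoiced) tracks sampled every `step` seconds.

import numpy as np

def word_boundaries(intensity, f0, step=0.01, i0=45.0, min_gap=0.10):
    ints, f0 = np.asarray(intensity), np.asarray(f0)
    # local intensity minima (valleys) below the threshold I0 (Rules 1-2)
    valleys = [i for i in range(1, len(ints) - 1)
               if ints[i] < ints[i - 1] and ints[i] < ints[i + 1] and ints[i] < i0]
    boundaries = []
    for i in valleys:
        if f0[i] == 0:                               # valley in an unvoiced region (Rule 3)
            boundaries.append(i)
        else:
            # voiced region: keep only the lowest valley within min_gap (Rules 4/6)
            close = [j for j in boundaries if f0[j] > 0 and abs(j - i) * step < min_gap]
            if not close:
                boundaries.append(i)
            elif ints[i] < min(ints[j] for j in close):
                boundaries = [j for j in boundaries if j not in close] + [i]
    return [round(i * step, 3) for i in sorted(boundaries)]

t = np.arange(0, 1.0, 0.01)
demo_int = 60 + 10 * np.sin(2 * np.pi * 3 * t)       # synthetic intensity curve
demo_f0 = np.where(demo_int > 55, 180.0, 0.0)        # synthetic voicing pattern
print(word_boundaries(demo_int, demo_f0, i0=55.0))
```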

Emotion Conversion Algorithm (Word level)
For all pitch points Pi:
    Select the Pi closest to the straight line and compare its distance with the resolution frequency (fr)
    if distance between Pi and the straight line > fr:
        stop the stylization
    else:
        remove Pi and repeat for the other Pi's
Divide the pitch points into word segments as produced by word boundary detection (WBD)
For each word segment's F0_min, F0_max, F0_beg and F0_end:
    Word_{i,f} = C_{n,i,f} * X_{n,i,f}
    // where n is the emotional state, i denotes the word segment and f denotes the F0 value at the various prosodic points
Remove the existing pitch points
Add the newly calculated pitch points and durations in place of the old ones
(A Python sketch of this rule is given below.)
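A small sketch of the word-level scaling rule, assuming per-word F0 anchors for the neutral utterance and placeholder emotion-specific factors C:

```python
# Hedged sketch: for each word segment i, the prosodic F0 anchors f in
# {min, max, beg, end} of the neutral utterance X are scaled by emotion-specific
# factors C[n][i][f]. The factor values below are placeholders.

def convert_words(word_f0, factors, emotion):
    """word_f0: list of per-word dicts with keys 'min', 'max', 'beg', 'end'.
    factors: {emotion: [per-word dict of multipliers]}."""
    out = []
    for i, anchors in enumerate(word_f0):
        c = factors[emotion][i]
        out.append({f: anchors[f] * c[f] for f in ("min", "max", "beg", "end")})
    return out

neutral_words = [
    {"min": 150, "max": 210, "beg": 180, "end": 160},   # word 1
    {"min": 140, "max": 190, "beg": 170, "end": 130},   # word 2
]
surprise_factors = {"surprise": [
    {"min": 1.05, "max": 1.30, "beg": 1.10, "end": 1.40},
    {"min": 1.05, "max": 1.35, "beg": 1.15, "end": 1.50},
]}
print(convert_words(neutral_words, surprise_factors, "surprise"))
```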

Algorithm Implementation
Original neutral utterance: "Aaj College Jana Hain" (roughly, "I have to go to college today").
Sentence-based conversion: Anger, Happy, Sad, Surprise.
Word-based conversion: Anger, Happy, Sad, Surprise.

Results (Word Boundary Detection)
Per-speaker table with columns Speaker ID, Recognition Rate (%), and False Recognition (%), for speakers S1 through S15 (values omitted).

Results (Word Boundary Detection)
                      Word Boundary    Non-Word Boundary
Word Boundary         90.8%            9.2%
Non-Word Boundary     15.6%            80.1%

Perception Test

Comparison Of Transformed Emotions

Transformed Perception Matrix
Listeners were divided into 3 groups of 5 candidates each.
Emotion: Perception
Surprise: 91.2%
Sadness: 89.6%
Neutral: 10.4%
Anger: 8.8%

Conclusion and Future Work
In this paper we have not considered the alignment of pitch points by linguistic rules; this will be the next objective for emotion conversion. We have taken only F0 and energy into account for our experiment; other factors such as spectrum, duration and syllable information were not considered and can be investigated further. The experiment has been performed on 800 utterances, which is not rich enough in terms of numbers; the database should be enlarged to improve accuracy. Since Hindi shows a few distinctions from other languages, it is justified to design a Hindi-based intonational model into which the transformation of emotions can be incorporated.