1
M. Brendel (1), R. Zaccarelli (1), L. Devillers (1,2)
(1) LIMSI-CNRS, (2) Paris-South University
French National Research Agency, Affective Avatar project (2007-2010)
2
Example application of the system for emotion detection from speech: controlling an affective avatar, e.g. in Skype, where the speaker is depicted by his/her avatar. The avatar should show expressive behavior in face and gesture corresponding to the emotion detected in the voice, with speech synchronized to the lip movements. The mapping between the output of the emotion detection system and the expressive avatar was done with the ECA team at LIMSI and the other project partners.
3
This application poses two main challenges for emotion detection:
- speaker-independent emotion detection
- real-time emotion detection

We focus on emotion detection for 4 macro-classes:
- Anger (Annoyance, Hot anger, Impatience)
- Sadness (Disappointment, Sadness)
- Positive (Amusement, Joy, Satisfaction)
- Neutral
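As an illustrative aside (not from the original slides), a minimal Python sketch of folding fine-grained annotations into the 4 macro-classes; the label spellings are assumptions based on the list above, not the actual corpus annotation scheme:

```python
# Hypothetical fine-grained -> macro-class mapping; label spellings are
# assumptions based on the macro-class list above, not the corpus scheme.
MACRO_CLASSES = {
    "Annoyance": "Anger", "Hot anger": "Anger", "Impatience": "Anger",
    "Disappointment": "Sadness", "Sadness": "Sadness",
    "Amusement": "Positive", "Joy": "Positive", "Satisfaction": "Positive",
    "Neutral": "Neutral",
}

def to_macro(label: str) -> str:
    """Fold a fine-grained emotion label into one of the 4 macro-classes."""
    return MACRO_CLASSES[label]
```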
4
The choice of appropriate corpora for training models is fundamental. Data must be as close as possible to the behaviors observed in the real application, but sometimes such an application does not exist yet. The corpus must be large enough, with a large number of speakers and sufficient variability of emotional expressions, including complex, mixed, and shaded emotions. Available corpora in the community are mainly acted and small, with few speakers and little variation in the expression of emotions, and without any application in sight. LIMSI corpora were mainly collected in call centers (bank, emergency, or stock exchange call centers) and contain many negative emotions.
5
CINEMO (Rollet et al., 2009; Schuller et al., 2010) contains acted emotional expressions (mainly everyday situations) obtained through dubbing exercises (Cine Karaoké) played by 50 speakers; manually segmented; 2 coders (the annotation scheme allows annotating mixtures of emotions); many shaded emotions and mixtures.
JEMO was obtained through an emotion detection game with 39 speakers, using a first prototype of a real-time detection system; automatically segmented; 2 coders; prototypical emotions: very few mixtures of emotions were annotated.
Question of this paper: can we mix different kinds of corpora, recorded in the same conditions, to train a more efficient classifier?
6
Sub-corpora of consensual segments were chosen for training the models for detection of the 4 classes; mixtures of emotions were not considered.

Sub-corpus CINEMO (50 speakers):
            POS   SAD   ANG   NEU   TOTAL
#segments   313   364   344   510   1012

Sub-corpus JEMO (38 different speakers):
            POS   SAD   ANG   NEU   TOTAL
#segments   316   223   179   416   1062
7
We compared Corpus-Anger with Corpus-All on a set of acoustic features; for each feature, we plotted its values across the three corpora (Tahon & Devillers, Speech Prosody 2010).

Features: 1 rolloff05%, 2 rolloff25%, 3 rolloff50%, 4 rolloff75%, 5 rolloff95%, 6 centroid, 7 spectralslope, 8 spectralorigin, 9 bandenergy0-250, 10 bandenergy250-650, 11 bandenergy0-650, 12 bandenergy650-1k, 13 bandenergy1k-4k, 14-37 barkband1...barkband24, 38-50 mfcc0...mfcc12, 51 zcr, 52 meanloudness, 53 rmsintensity*1000, 54 rapportMaxMinF0, 55 varF0, 56 F2-F1, 57 F3-F2, 58 varF1, 59 varF2, 60 varF3, 61 voicedratio, 62 jitterlocal, 63 shimmerlocal, 64 HNR
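A sketch of this kind of per-feature comparison, assuming each corpus has already been summarized as one value per feature (for instance a normalized mean); the data below is a random placeholder, not the study's measurements:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
feature_idx = np.arange(1, 65)       # the 64 features listed above
corpora = {                          # placeholder per-feature summaries
    "Corpus-Anger": rng.random(64),
    "Corpus-All": rng.random(64),
}

for name, values in corpora.items():
    plt.plot(feature_idx, values, marker="o", label=name)
plt.xlabel("feature index (see list above)")
plt.ylabel("per-feature summary value")
plt.legend()
plt.show()
```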
8
Computed on voiced segments.

Low-Level Descriptors (# features computed with functionals) | Functionals
Energy (29)                  | moments (2)
RMS Energy (22)              | absolute mean, max
F0 (23)                      | extremes (2)
Zero-Crossing Rate (18)      | 2 x values, range
MFCC 1-16 (366)              | linear regression (2): MAE/MSE, slope
                             | quartiles (2): quartile, tquartile
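A minimal sketch of the LLD-plus-functionals scheme: each frame-level contour of a voiced segment is summarized by statistical functionals into a fixed-length feature vector. The functionals here are a small illustrative subset, not the exact set in the table:

```python
import numpy as np

def functionals(contour: np.ndarray) -> list[float]:
    """Summarize one frame-level contour with simple statistical functionals."""
    q1, q2, q3 = np.percentile(contour, [25, 50, 75])
    slope = np.polyfit(np.arange(len(contour)), contour, 1)[0]  # regression slope
    return [
        contour.mean(), contour.std(),     # moments
        contour.min(), contour.max(),      # extremes
        contour.max() - contour.min(),     # range
        q1, q2, q3,                        # quartiles
        slope,
    ]

def segment_vector(llds: dict[str, np.ndarray]) -> np.ndarray:
    """Apply the functionals to every LLD contour of one voiced segment."""
    return np.array([v for contour in llds.values() for v in functionals(contour)])

# Placeholder segment: contours such as energy, F0, ZCR, one MFCC band...
seg = {"energy": np.random.rand(120), "f0": np.random.rand(120),
       "zcr": np.random.rand(120), "mfcc1": np.random.rand(120)}
print(segment_vector(seg).shape)   # 4 LLDs x 9 functionals = (36,)
```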
9
RR/UAR (recognition rate / unweighted average recall):

              Test CINEMO   Test JEMO
Train CINEMO  0.50/0.48     0.51/0.48
Train JEMO    0.43/0.39     0.60/0.55

Training on CINEMO and testing on JEMO performs better than vice versa. It seems better to train on the wider set (CINEMO has more variability of emotional expressions, in different contexts) and test on the narrower one (JEMO contains more prototypical emotions) than the other way around. Surprisingly, training on CINEMO and testing on JEMO gives slightly better performance than testing on CINEMO itself.
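A sketch of the cross-corpus protocol behind this table, with RR computed as overall accuracy and UAR as the unweighted mean of per-class recalls; the SVM classifier and the random placeholder data are assumptions, not the paper's exact setup:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Placeholder matrices standing in for the real extracted feature sets.
X_cinemo, y_cinemo = rng.normal(size=(1012, 64)), rng.integers(0, 4, 1012)
X_jemo, y_jemo = rng.normal(size=(1062, 64)), rng.integers(0, 4, 1062)

def rr_uar(clf, X, y):
    """RR = recognition rate (accuracy); UAR = unweighted average recall."""
    pred = clf.predict(X)
    return accuracy_score(y, pred), recall_score(y, pred, average="macro")

clf = make_pipeline(StandardScaler(), SVC())
clf.fit(X_cinemo, y_cinemo)            # train on CINEMO ...
print(rr_uar(clf, X_jemo, y_jemo))     # ... test on JEMO (and vice versa)
```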
10
RR/UAR       SpD CV (WEKA)   SpI CV
CINEMO       0.57/0.56       0.50/0.48
JEMO         0.63/0.59       0.60/0.55
CINEMO+JEMO  0.58/0.56       0.54/0.51

Be careful: WEKA's standard cross-validation is speaker-dependent (SpD); speaker-independent (SpI) scores are lower.
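The SpD/SpI gap in this table comes from segment-level shuffling: WEKA's standard k-fold can put segments of the same speaker in both train and test folds. A sketch of speaker-independent cross-validation using scikit-learn's GroupKFold (placeholder data; the original experiments used WEKA):

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 64))            # placeholder features
y = rng.integers(0, 4, size=2000)          # 4 macro-classes
speakers = rng.integers(0, 88, size=2000)  # speaker id for each segment

clf = make_pipeline(StandardScaler(), SVC())
# GroupKFold keeps all segments of one speaker inside a single fold,
# so every test fold contains only unseen speakers (SpI evaluation).
uar = cross_val_score(clf, X, y, groups=speakers,
                      cv=GroupKFold(n_splits=5), scoring="recall_macro")
print(uar.mean())
```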
11
Unifying the corpora thus improved the results. We could not do better than JEMO on itself, but the good result of JEMO on itself is clearly due to it being a small corpus with only prototypical emotions; it has poor generalization power (training on JEMO, testing on CINEMO: 0.43/0.39). After balancing tests, we can also conclude that the performance improvement is mainly due to the larger number of instances.
12
13
RR/UAR   SpI CV, all features   SpI CV with SFFS
Female   0.59/0.55              0.65 (31 features)
Male     0.52/0.49              0.55 (38 features)
All      0.54/0.51

Class distribution by gender:

         Positive  Sadness  Anger  Neutral
Male     252       262      267    432
Female   377       325      256    494
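A sketch of the gender-dependent feature selection step, using mlxtend's SequentialFeatureSelector with floating=True as a stand-in for SFFS; the SVM wrapper, the fixed feature budget, and the plain (speaker-dependent) k-fold here are assumptions made for brevity:

```python
import numpy as np
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Placeholder for the female sub-corpus (1452 segments in the table above).
X_female = rng.normal(size=(1452, 64))
y_female = rng.integers(0, 4, size=1452)

# SFFS: greedy forward additions with conditional backward removals,
# scored by UAR; stop at 31 features as reported for the female model.
sffs = SFS(SVC(), k_features=31, forward=True, floating=True,
           scoring="recall_macro", cv=5)
sffs.fit(X_female, y_female)
print(sffs.k_feature_idx_)   # indices of the selected features
```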
14
Unifying both corpora (88 speakers) improves the results:
- the number of instances is approximately doubled;
- the classes are more balanced;
- the two corpora enrich each other.

Splitting the corpus by gender is also beneficial: the models trained on the gender sub-corpora are better, and gender information was available in our affective avatar application. Feature selection also seems beneficial (cross-corpora studies are needed).
15
16
Emotional databases are often small and sparse resources when collected in natural contexts (often less than 10% of utterances are emotional), so it is difficult to build generic models from a single corpus.
- Find measures for qualifying emotional databases.
- Cross-corpora studies are very important.
- Use multiple corpora collected in different contexts to train models.
17
Thanks for your attention.