Published by Erik Young. Modified over 9 years ago.
2
Emotions in IVR Systems
Julia Hirschberg
COMS 4995/6998
Thanks to Sue Yuen and Yves Scherer
3
Motivation
“The context of emergency gives a larger palette of complex and mixed emotions.”
Emotions in emergency situations are more extreme, and are “really felt in a natural way.”
Debate on acted vs. real emotions
Ethical concerns?
4
Real-life Emotions Detection with Lexical and Paralinguistic Cues on Human-Human Call Center Dialogs (Devillers & Vidrascu ’06)
Domain: Medical emergencies
Motive: Study real-life speech in highly emotional situations
Emotions studied: Anger, Fear, Relief, Sadness (but finer-grained annotation)
Corpus: 680 dialogs, 2258 speaker turns
Training-test split: 72% - 28%
Machine learning method: Log-likelihood ratio (linguistic), SVM (paralinguistic)
5
CEMO Corpus
688 dialogs, avg 48 turns per dialog
Annotation:
– Decisions of 2 annotators are combined in a soft vector
– Emotion mixtures
– 8 coarse-level emotions, 21 fine-grained emotions
– Inter-annotator agreement for client turns: 0.57 (moderate)
– Consistency checks: self-reannotation procedure (85% similarity), perception test (no details given)
Restrict corpus to caller utterances: 2258 utterances, 680 speakers.
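One way to picture the "soft vector" is as a normalized tally of the labels given by both annotators. The sketch below is only an illustration: it assumes each annotator supplies a major label and an optional minor label, and the weights of 2 and 1 are hypothetical values, not taken from the slides.

```python
from collections import Counter

def soft_vector(annotations, major_weight=2.0, minor_weight=1.0):
    """Combine (major, minor) labels from several annotators into one
    normalized emotion vector. The weights are illustrative assumptions."""
    scores = Counter()
    for major, minor in annotations:
        scores[major] += major_weight
        if minor is not None:
            scores[minor] += minor_weight
    total = sum(scores.values())
    return {label: weight / total for label, weight in scores.items()}

# Annotator 1 hears Fear with some Anger; annotator 2 hears only Fear.
vec = soft_vector([("Fear", "Anger"), ("Fear", None)])
print(vec)
```

A turn whose vector puts most of its mass on a single label can then be treated as non-mixed, while a flatter vector signals an emotion mixture.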
6
Features
Lexical features / Linguistic cues:
– Unigrams of user utterances, stemmed
Prosodic features / Paralinguistic cues:
– Loudness (energy)
– Pitch contour (F0)
– Speaking rate
– Voice quality (jitter, ...)
– Disfluency (pauses)
– Non-linguistic events (mouth noise, crying, …)
– Normalized by speaker
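Speaker normalization can be sketched as a per-speaker z-score, so that a naturally high-pitched or loud caller is not mistaken for an emotional one. The slides do not say which normalization was used; z-scoring and the function name below are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

def normalize_by_speaker(turns):
    """turns: list of (speaker_id, feature_value) pairs.
    Returns z-scores in the same order, computed per speaker (an assumed scheme)."""
    by_speaker = defaultdict(list)
    for speaker, value in turns:
        by_speaker[speaker].append(value)
    stats = {s: (mean(v), pstdev(v)) for s, v in by_speaker.items()}
    normalized = []
    for speaker, value in turns:
        mu, sigma = stats[speaker]
        # A speaker with constant values carries no information: map to 0.
        normalized.append((value - mu) / sigma if sigma > 0 else 0.0)
    return normalized

# Mean F0 per turn (Hz) for two callers with very different baselines.
turns = [("a", 100.0), ("a", 200.0), ("b", 300.0), ("b", 500.0)]
print(normalize_by_speaker(turns))
```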
7
Annotation
Utterances annotated with one of the following non-mixed emotions:
– Anger, Fear, Relief, Sadness
– Justification for this choice?
8
Lexical Cue Model
Log-likelihood ratio:
– 4 unigram emotion models (1 for each emotion)
– A general task-specific model
– Interpolation coefficient to avoid data sparsity problems
– A coefficient of 0.75 gave the best results
Stemming:
– Cut inflectional suffixes (more important for rich-morphology languages like French)
– Improves overall recognition rates by 12-13 points
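The log-likelihood ratio idea can be sketched as follows: each emotion's unigram model is interpolated with the general model, and an utterance is scored by the average log ratio of the interpolated probability to the general probability. This is a minimal sketch under assumptions: the slide gives only the coefficient 0.75, so applying it to the emotion model, the add-one smoothing of the general model, and the length normalization are all illustrative choices.

```python
import math
from collections import Counter

LAMBDA = 0.75  # interpolation coefficient reported as best on the slide

def train(labeled_utterances):
    """labeled_utterances: list of (emotion, list_of_stems)."""
    emotion_counts = {}
    general = Counter()
    for emotion, stems in labeled_utterances:
        emotion_counts.setdefault(emotion, Counter()).update(stems)
        general.update(stems)
    return emotion_counts, general

def llr_score(stems, emotion_model, general):
    vocab_size = len(general) + 1          # +1 slot for unseen words
    general_total = sum(general.values())
    emotion_total = sum(emotion_model.values())
    score = 0.0
    for w in stems:
        # Add-one smoothing (an assumption) keeps the general model non-zero.
        p_gen = (general[w] + 1) / (general_total + vocab_size)
        p_emo = emotion_model[w] / emotion_total
        p_mix = LAMBDA * p_emo + (1 - LAMBDA) * p_gen
        score += math.log(p_mix / p_gen)
    return score / max(len(stems), 1)      # normalize by utterance length

def classify(stems, emotion_counts, general):
    return max(emotion_counts,
               key=lambda e: llr_score(stems, emotion_counts[e], general))

data = [("Relief", ["thank", "you"]), ("Relief", ["thank", "agree"]),
        ("Fear", ["help", "hurt"]), ("Fear", ["help", "afraid"])]
models, general = train(data)
print(classify(["thank"], models, general))
```

Interpolating with the general model means a word never seen with a given emotion lowers that emotion's score without driving it to minus infinity, which is the data sparsity problem the coefficient is meant to address.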
9
Paralinguistic (Prosodic) Cue Model
100 features, fed into SVM classifier:
– F0 (pitch contour) and spectral features (formants)
– Energy (loudness)
– Voice quality (jitter, shimmer, ...)
  Jitter: cycle-to-cycle variation in pitch period
  Shimmer: cycle-to-cycle variation in amplitude (loudness)
  NHR: Noise-to-harmonics ratio
  HNR: Harmonics-to-noise ratio
– Speaking rate, silences, pauses, filled pauses
– Mouth noise, laughter, crying, breathing
Normalized by speaker (~24 user turns per dialog)
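Jitter and shimmer are both relative cycle-to-cycle perturbation measures, which a short sketch makes concrete. The code assumes the glottal periods and per-cycle peak amplitudes have already been extracted by a pitch tracker; the function name and the sample values are hypothetical, and the formula is the common "local" variant (mean absolute difference between consecutive cycles, divided by the mean), which the paper may or may not have used.

```python
def local_perturbation(values):
    """Mean absolute difference between consecutive cycles, relative to the mean.
    Applied to pitch periods this gives local jitter; applied to per-cycle
    peak amplitudes it gives local shimmer."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

# Hypothetical measurements for four consecutive glottal cycles.
periods_s = [0.010, 0.011, 0.010, 0.011]   # period lengths in seconds
amplitudes = [0.80, 0.78, 0.82, 0.80]      # peak amplitude per cycle

print(f"jitter  = {local_perturbation(periods_s):.3%}")
print(f"shimmer = {local_perturbation(amplitudes):.3%}")
```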
10
Results
            Anger   Fear   Relief   Sadness   Total
# Utts         49    384      107       100     640
Lexical       59%    90%      86%       34%     78%
Prosodic      39%    64%      58%       57%   59.8%
Relief is associated with lexical markers like “thanks” or “I agree.”
“Sadness is more prosodic or syntactic than lexical.”
11
Prosody-based Automatic Detection of Annoyance and Frustration in Human-Computer Dialog (Ang et al. ’02)
What’s new:
– Naturally-produced emotions
– Automatic methods (except for style features)
– Emotion vs. style
Corpus: DARPA Communicator, 21,899 utts
Labels:
– Neutral, Annoyed, Frustrated, Tired, Amused, Other, NA
– Hyperarticulation, pausing, ‘raised voice’
– Repeats and corrections
– Data quality: nonnative, spkr switch, system developer
12
Annotation, Features, Method
ASR output
Prosodic features
– Duration, rate, pause, pitch, energy, tilt
Utterance position, correction labels
Language model
Feature selection
Downsampled data to correct for neutral skew
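Downsampling to correct for neutral skew can be sketched as random undersampling: every class is cut down to the size of the smallest one, so a classifier cannot score well simply by always answering "neutral". The exact procedure is not given on the slide; the function below and its fixed seed are illustrative assumptions.

```python
import random
from collections import defaultdict

def downsample(examples, seed=0):
    """examples: list of (features, label) pairs.
    Returns a shuffled subset with the same number of examples per label."""
    by_label = defaultdict(list)
    for item in examples:
        by_label[item[1]].append(item)
    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, target))
    rng.shuffle(balanced)
    return balanced

# Toy corpus: 90 neutral utterances, 10 annoyed ones.
data = [(i, "neutral") for i in range(90)] + [(i, "annoyed") for i in range(10)]
balanced = downsample(data)
print(len(balanced))
```

With balanced classes, chance accuracy on a two-way task is 50%, which makes reported accuracies easier to interpret.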
13
Results
Useful features: duration, rate, pitch, repeat/correction, energy, position
Raised voice is the only stylistic predictor, and it is acoustically defined
Comparable to human agreement
14
Two Stream Emotion Recognition for Call Center Monitoring (Gupta & Rajput ’07)
Goal: Help supervisors evaluate agents at call centers
Method: Develop a two-stream technique to detect strong emotion
– Acoustic features
– Lexical features
– Weighted combination
Corpus: 900 calls (60h)
15
Two-Stream Recognition
Acoustic Stream
– Extracted features based on pitch and energy
Semantic Stream
– Performed speech-to-text conversion
– Text classification algorithms (TF-IDF) identified phrases such as “pleasure,” “thanks,” “useless,” & “disgusting.”
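To show why TF-IDF surfaces words such as “useless” or “thanks,” here is a minimal sketch of the weighting itself. It uses the textbook form (term frequency times the log of inverse document frequency); the paper's exact variant is not stated on the slide, and the three toy transcripts are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists.
    Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Invented recognizer output for three caller turns.
docs = [["thanks", "for", "the", "help"],
        ["this", "is", "useless"],
        ["thanks", "a", "pleasure"]]
for doc_weights in tf_idf(docs):
    print(max(doc_weights, key=doc_weights.get), doc_weights)
```

A term that occurs in every transcript gets weight zero, so only words that distinguish one turn from the others carry weight in the classifier.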
16
Implementation
Method:
– Two streams analyzed separately:
  speech utterance / acoustic features
  spoken text / semantics / speech recognition of conversation
– Confidence levels of two streams combined
– Examined 3 emotions: Neutral, Hot-anger, Happy
Tested two data sets:
– LDC data
– 20 real-world call-center calls
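Combining the confidence levels of the two streams can be sketched as a weighted sum per emotion followed by picking the highest score. This is only a sketch: the slides do not give the weighting, so the default weight of 0.5 and the confidence values below are hypothetical.

```python
def combine_streams(acoustic, semantic, acoustic_weight=0.5):
    """acoustic, semantic: {emotion: confidence} from each stream.
    Returns (best_emotion, combined_scores). The weight is an assumed parameter."""
    emotions = set(acoustic) | set(semantic)
    combined = {
        e: acoustic_weight * acoustic.get(e, 0.0)
        + (1.0 - acoustic_weight) * semantic.get(e, 0.0)
        for e in emotions
    }
    best = max(combined, key=combined.get)
    return best, combined

# Hypothetical confidences: the acoustic stream leans neutral,
# the semantic stream hears angry wording.
acoustic = {"neutral": 0.5, "hot-anger": 0.4, "happy": 0.1}
semantic = {"neutral": 0.2, "hot-anger": 0.7, "happy": 0.1}
label, scores = combine_streams(acoustic, semantic)
print(label, scores)
```

In this toy case the acoustic stream alone would have chosen neutral, while the combined score selects hot-anger, which is the kind of correction a second stream is meant to provide.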
17
Two Stream - Conclusion
Table 2 suggested that two-stream analysis is more accurate than acoustic or semantic analysis alone
Recognition on LDC data was significantly higher than on real-world data
Neutral emotions were recognized less accurately
Combining the two streams improved identification of “happy” and “anger” emotions (~20%)
Low acoustic stream accuracy may be attributed to the length of sentences in real-world data: normal people do not exhibit significantly different emotions in long sentences
18
Questions
Gupta & Rajput analyzed 3 emotions (happy, neutral, hot-anger): Why break it down into these categories? Implications? Can this technique be applied to a wider range of emotions? For other applications?
Speech-to-text may not transcribe the complete conversation. Would further examination greatly improve results? What are the pros and cons?
Pitch range was 50-400 Hz. The research may not be applicable outside this range. Do you think it is necessary to examine other frequencies?
In this paper, the TF-IDF (Term Frequency – Inverse Document Frequency) technique is used to classify utterances. Accuracy for acoustics only is about 55%. Previous research suggests that alternative techniques may be better. Would implementing them give better results? What are the pros and cons of using the TF-IDF technique?
19
Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis
Previous work:
– 1995: Mozziconacci suggested that VQ combined with f0 could create affect
– 2002: Gobl suggested that synthesized stimuli with VQ can add affective coloring. The study suggested that “VQ + f0” stimuli are more affective than “f0 only”
– 2003: Gobl tested VQ with a large f0 range, but did not examine the contribution of affect-related f0 contours
Objective: To examine the effects of VQ and f0 on affect expression
20
Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis
3 series of stimuli based on the Swedish utterance “ja adjo”:
– Stimuli exemplifying VQ
– Stimuli with modal voice quality and different affect-related f0 contours
– Stimuli combining both
Tested parameters exemplifying 5 voice qualities (VQ):
– Modal voice
– Breathy voice
– Whispery voice
– Lax-creaky voice
– Tense voice
15 synthesized stimuli test samples (see Table 1)
21
What is Voice Quality? Phonation Gestures
Derived from a variety of laryngeal and supralaryngeal features
Adductive tension: interarytenoid muscles adduct the arytenoid cartilages
Medial compression: adductive force on the vocal processes; adjusts the ligamental glottis
Longitudinal tension: tension of the vocal folds
22
Tense Voice Very strong tension of vocal folds, very high tension in vocal tract
23
Whispery Voice
Very low adductive tension
Medial compression moderately high
Longitudinal tension moderately high
Little or no vocal fold vibration
Turbulence generated by friction of air in and above larynx
24
Creaky Voice
Vocal fold vibration at low frequency, irregular
Low tension (only ligamental part of glottis vibrates)
The vocal folds strongly adducted
Longitudinal tension weak
Moderately high medial compression
25
Breathy Voice
Tension low
– Minimal adductive tension
– Weak medial compression
Medium longitudinal vocal fold tension
Vocal folds do not come together completely, leading to frication
26
Modal Voice
“Neutral” mode
Muscular adjustments moderate
Vibration of vocal folds periodic, full closing of glottis, no audible friction
Frequency of vibration and loudness in low to mid range for conversational speech
27
Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis
Six sub-tests with 20 native speakers of Hiberno-English.
Rated on 12 affective attributes, arranged as six opposing pairs:
– Sad / happy
– Intimate / formal
– Relaxed / stressed
– Bored / interested
– Apologetic / indignant
– Fearless / scared
Participants were asked to mark their response on a scale between the two attributes of a pair (e.g. Intimate to Formal), which also included “No affective load.”
28
Voice Quality and f0 Test: Conclusion
Categorized results into 4 groups.
No simple one-to-one mapping between voice quality and affect
“Happy” was the most difficult affect to synthesize
Suggested that, in addition to f0, VQ should be used to synthesize affectively colored speech.
VQ appears to be crucial for expressive synthesis
29
Voice Quality and f0 Test: Discussion
If the scale runs from 1 to 7, then the midpoint, 4, should be “neutral”; however, most ratings are less than 2. Do the conclusions (see Fig 2) seem strong?
In terms of VQ and f0, the groupings in Fig 2 seem to suggest that certain affects are closely related. What are the implications of this? For example, are happy and indignant affects closer than relaxed or formal? Do you agree?
Do you consider an intimate voice more “breathy” or “whispery”? Does your intuition agree with the paper?
Yanushevskaya found that VQ accounts for the highest affect ratings overall. How can the range of voice quality be compared with frequency? Do you think they are comparable? Is there a different way to describe these qualities?
30
Questions?