Prosody-Based Detection of Annoyance and Frustration in Communicator Dialogs
Liz Shriberg*  Andreas Stolcke*  Jeremy Ang+
* SRI International and International Computer Science Institute
+ UC Berkeley
DARPA ROAR Workshop, 11/30/01

Introduction
- Prosody = the rhythm, melody, and "tone" of speech
- Largely unused in current ASU systems
- Prior work: prosody aids many tasks:
  - automatic punctuation
  - topic segmentation
  - word recognition
- Today's talk: detection of user frustration in DARPA Communicator data (ROAR project suggested by Jim Bass)

Talk Outline
- Data and labeling
- Prosodic and other features
- Classifier models
- Results
- Conclusions and future directions

Key Questions
- How frequent are annoyance and frustration in Communicator dialogs?
- How reliably can humans label them?
- How well can machines detect them?
- What prosodic or other features are useful?

Data Sources
Labeled Communicator data from various sites:
- NIST June 2000 collection: 392 dialogs, 7,515 utterances
- CMU 1/2001-8/2001 data: 205 dialogs, 5,619 utterances
- CU 11/1999-6/2001 data: 240 dialogs, 8,765 utterances
Each site used different formats and conventions, so we tried to minimize the number of sources while maximizing the amount of data.
Thanks to NIST, CMU, Colorado, Lucent, UW.

Data Annotation
- 5 undergraduates with different backgrounds (emotion should be judged by the "average Joe")
- Labeling jointly funded by SRI and ICSI
- Each dialog labeled independently by 2+ people in a 1st pass (July-Sept 2001), after calibration
- 2nd "consensus" pass for all disagreements, by two of the same labelers (Oct-Nov 2001)
- Used a customized version of the Rochester Dialog Annotation Tool (DAT), which produces SGML output

Data Labeling
- Emotion: neutral, annoyed, frustrated, tired/disappointed, amused/surprised, no-speech/NA
- Speaking style: hyperarticulation, perceived pausing between words or syllables, raised voice
- Repeats and corrections: repeat/rephrase, repeat/rephrase with correction, correction only
- Miscellaneous useful events: self-talk, noise, non-native speaker, speaker switches, etc.
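For concreteness, the four labeling tiers can be written down as a small schema. Below is a minimal sketch in Python; the tier and label names follow the slide, but the layout itself is illustrative and is not the actual DAT/SGML format.

```python
# Annotation tiers from the labeling scheme. The dict layout is illustrative
# only; the real annotations were produced with DAT and stored as SGML.
LABEL_SCHEMA = {
    "emotion": ["neutral", "annoyed", "frustrated", "tired/disappointed",
                "amused/surprised", "no-speech/NA"],
    "speaking_style": ["hyperarticulation", "perceived_pausing",
                       "raised_voice"],
    "repeats_corrections": ["repeat/rephrase", "repeat/rephrase+correction",
                            "correction_only"],
    "misc_events": ["self-talk", "noise", "non-native_speaker",
                    "speaker_switch"],
}
```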

Emotion Samples
[Slide plays ten numbered audio clips; recoverable labels and partial transcripts include Annoyed: "Yes", "Late morning" (HYP); Frustrated: "No", "No, I am …" (HYP), "There is no Manila..."; Neutral: "July 30", "Yes"; Disappointed/tired: "No"; Amused/surprised. (HYP) marks hyperarticulated speech.]

Emotion Class Distribution
To get enough data per class, we grouped annoyed and frustrated together, versus everything else (with speech).

Prosodic Model
- Used CART-style decision trees as classifiers
- Downsampled to equal class priors (due to the low rate of frustration, and to normalize across sites)
- Automatically extracted prosodic features based on recognizer word alignments
- Used automatic feature-subset selection to counter the greediness of the tree-growing algorithm
- Used 3/4 of the data for training and 1/4 for testing, with no call overlap between the sets
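The slides give no code, but the training setup is conventional. Here is a minimal sketch using scikit-learn's DecisionTreeClassifier as a stand-in for the CART trees, with synthetic placeholder data, a per-call train/test split, and downsampling to equal class priors; every data value and parameter below is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Placeholder data: 2000 utterances x 10 prosodic features, ~15% positive
# (annoyed/frustrated), spread over ~300 calls. Real features come from
# recognizer word alignments.
X = rng.normal(size=(2000, 10))
y = (rng.random(2000) < 0.15).astype(int)
calls = rng.integers(0, 300, size=2000)

# Split by call so no call appears in both train and test (3/4 vs. 1/4).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=calls))

def downsample_to_equal_priors(idx, y, rng):
    """Keep an equal number of training utterances from each class."""
    n = min(int(np.sum(y[idx] == c)) for c in (0, 1))
    return np.concatenate(
        [rng.choice(idx[y[idx] == c], size=n, replace=False) for c in (0, 1)])

train_bal = downsample_to_equal_priors(train_idx, y, rng)
tree = DecisionTreeClassifier(min_samples_leaf=50)  # CART-style stand-in
tree.fit(X[train_bal], y[train_bal])
print("test accuracy:", tree.score(X[test_idx], y[test_idx]))
```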

Prosodic Features
Duration and speaking rate features:
- duration of phones, vowels, and syllables
- normalized by phone/vowel means in the training data
- normalized by speaker (over all utterances, or over the first 5 only)
- speaking rate (vowels/time)
Pause features:
- duration and count of utterance-internal pauses at various threshold durations
- ratio of speech frames to total utterance-internal frames
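To make the pause and duration features concrete, here is a minimal sketch that computes a few of them from phone-level alignments. The segment format, pause label, thresholds, and helper names are all assumptions; the normalization mirrors the MAXPHDUR_N feature that appears in the baseline tree later in the talk.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    label: str    # phone label, or "pau" for a pause (assumed convention)
    start: float  # seconds
    end: float

def pause_features(segments, thresholds=(0.1, 0.25, 0.5)):
    """Duration and count of utterance-internal pauses at several threshold
    durations, plus the ratio of speech time to total internal time."""
    pauses = [s.end - s.start for s in segments[1:-1] if s.label == "pau"]
    total = segments[-1].end - segments[0].start
    feats = {"speech_ratio": (total - sum(pauses)) / total if total else 0.0}
    for t in thresholds:
        long_pauses = [p for p in pauses if p >= t]
        feats[f"pause_count_{t}"] = len(long_pauses)
        feats[f"pause_dur_{t}"] = sum(long_pauses)
    return feats

def max_norm_phone_dur(segments, phone_means):
    """Maximum phone duration normalized by training-set phone means
    (cf. MAXPHDUR_N in the baseline tree)."""
    durs = [(s.end - s.start) / phone_means.get(s.label, 1.0)
            for s in segments if s.label != "pau"]
    return max(durs, default=0.0)
```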

Prosodic Features (cont.)
Pitch features:
- F0-fitting approach developed at SRI (Sönmez)
- LTM model of F0 estimates the speaker's F0 range
- Many features to capture pitch range, contour shape and size, slopes, and locations of interest
- Normalized by speaker using LTM parameters, estimated from all utterances in a call or from only the first 5
[Slide figure: raw and LTM-fitted log-F0 contours plotted over time.]
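The actual F0-fitting and LTM estimation are not reproduced here. As a rough stand-in, the sketch below estimates a speaker's log-F0 baseline and topline from voiced frames via percentiles, then normalizes an utterance's maximum F0 against that range (cf. MAXF0_TOPLN in the baseline tree). The percentile estimate is an assumption, not the Sönmez method.

```python
import numpy as np

def speaker_f0_range(f0_frames, lo_pct=5, hi_pct=95):
    """Estimate a speaker's log-F0 baseline and topline from voiced frames.
    Percentiles stand in for the lognormal tied mixture (LTM) fit."""
    voiced = np.log(f0_frames[f0_frames > 0])
    return np.percentile(voiced, lo_pct), np.percentile(voiced, hi_pct)

def max_f0_vs_topline(utt_f0, baseline, topline):
    """Utterance maximum log-F0 relative to the speaker's topline, scaled by
    the speaker's range (0 = at topline, negative = below it)."""
    voiced = np.log(utt_f0[utt_f0 > 0])
    if voiced.size == 0 or topline <= baseline:
        return 0.0
    return (voiced.max() - topline) / (topline - baseline)

# Online use: estimate (baseline, topline) from only the first 5 utterances
# of a call, then normalize each subsequent utterance against that range.
```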

Features (cont.)
Spectral tilt features:
- average of the 1st cepstral coefficient
- average slope of a linear fit to the magnitude spectrum
- difference in log energies between high and low bands
- extracted from the longest normalized vowel region
Other (nonprosodic) features:
- position of the utterance in the dialog
- whether the utterance is a repeat or correction
- to check correlations: hand-coded style features, including hyperarticulation
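A minimal sketch of two of the tilt measurements, assuming 8 kHz audio and a 1 kHz band split (both assumptions). In the system these are extracted from the longest normalized vowel region and averaged, rather than taken from an arbitrary frame.

```python
import numpy as np

def spectral_tilt_slope(frame, sr=8000):
    """Slope of a linear fit to the log-magnitude spectrum of one frame,
    in dB per Hz; a flatter (less negative) tilt suggests more vocal effort."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    logmag = 20 * np.log10(spec[1:] + 1e-10)  # skip the DC bin
    slope, _ = np.polyfit(freqs[1:], logmag, deg=1)
    return slope

def band_energy_diff(frame, sr=8000, split_hz=1000):
    """Difference in log energies between the high and low bands."""
    power = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    lo = power[freqs < split_hz].sum()
    hi = power[freqs >= split_hz].sum()
    return np.log(hi + 1e-10) - np.log(lo + 1e-10)
```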

Language Model Features
- Train a 3-gram LM on the data from each class
- LMs used word classes (AIRLINE, CITY, etc.) from the SRI Communicator recognizer
- Given a test utterance, choose the class whose LM assigns the highest likelihood (assumes equal priors)
- In the prosodic decision tree, use the sign of the likelihood difference as an input feature
- Finer-grained LM scores cause overtraining
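A toy sketch of the per-class LM decision, with add-one-smoothed n-gram counts standing in for real SRILM-style 3-gram models; the vocabulary size and smoothing are assumptions. Tokens would ideally already be mapped to the recognizer's word classes (CITY, AIRLINE, ...).

```python
import math
from collections import Counter

def train_ngram(utterances, n=3):
    """Count n-grams and their (n-1)-gram histories for one emotion class."""
    grams, hists = Counter(), Counter()
    for toks in utterances:
        padded = ["<s>"] * (n - 1) + toks + ["</s>"]
        for i in range(len(padded) - n + 1):
            grams[tuple(padded[i:i + n])] += 1
            hists[tuple(padded[i:i + n - 1])] += 1
    return grams, hists

def loglik(toks, grams, hists, n=3, vocab=5000):
    """Add-one-smoothed log likelihood of an utterance under one class LM."""
    padded = ["<s>"] * (n - 1) + toks + ["</s>"]
    return sum(
        math.log((grams[tuple(padded[i:i + n])] + 1)
                 / (hists[tuple(padded[i:i + n - 1])] + vocab))
        for i in range(len(padded) - n + 1))

def lm_sign_feature(toks, lm_af, lm_neutral):
    """Binary input feature for the prosodic tree: the sign of the
    likelihood difference between the two class LMs."""
    return 1 if loglik(toks, *lm_af) > loglik(toks, *lm_neutral) else -1
```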

Results: Human and Machine Baseline
[Slide shows a chart of human labeler agreement and baseline classifier accuracy; the numbers are not preserved in this transcript.]

Results (cont.)
- Human-human labels agree 72%; this is a complex decision task:
  - inherent continuum
  - speaker differences
  - relative vs. absolute judgments?
- Human labels agree 84% with the "consensus" labels (a biased comparison)
- The tree model agrees 76% with consensus, better than the original labelers agreed with each other
- The prosodic model makes use of a dialog state feature, but even without it, it still beats human-human agreement
- Language model features alone are not good predictors (the dialog feature alone is better)

Baseline Prosodic Tree
(Legend on slide color-codes duration, pitch, and other features. Leaf values are class posteriors; AF = annoyed/frustrated, N = neutral/else.)

REPCO in {ec2,rr1,rr2,rex2,inc,ec1,rex1}: 0.7699 0.2301 AF
| MAXF0_IN_MAXV_N < 126.93: 0.4735 0.5265 N
| MAXF0_IN_MAXV_N >= 126.93: 0.8296 0.1704 AF
| | MAXPHDUR_N < 1.6935: 0.6466 0.3534 AF
| | | UTTPOS < 5.5: 0.1724 0.8276 N
| | | UTTPOS >= 5.5: 0.7008 0.2992 AF
| | MAXPHDUR_N >= 1.6935: 0.8852 0.1148 AF
REPCO in {0}: 0.3966 0.6034 N
| UTTPOS < 7.5: 0.1704 0.8296 N
| UTTPOS >= 7.5: 0.4658 0.5342 N
| | VOWELDUR_DNORM_E_5 < 1.2396: 0.3771 0.6229 N
| | | MINF0TIME < 0.875: 0.2372 0.7628 N
| | | MINF0TIME >= 0.875: 0.5 0.5 AF
| | | | SYLRATE < 4.7215: 0.562 0.438 AF
| | | | | MAXF0_TOPLN < -0.2177: 0.3942 0.6058 N
| | | | | MAXF0_TOPLN >= -0.2177: 0.6637 0.3363 AF
| | | | SYLRATE >= 4.7215: 0.2816 0.7184 N
| | VOWELDUR_DNORM_E_5 >= 1.2396: 0.5983 0.4017 AF
| | | MAXPHDUR_N < 1.5395: 0.3841 0.6159 N
| | | | MINF0TIME < 0.435: 0.1 0.9 N
| | | | MINF0TIME >= 0.435: 0.4545 0.5455 N
| | | | | RISERATIO_DNORM_E_5 < 0.69872: 0.3284 0.6716 N
| | | | | RISERATIO_DNORM_E_5 >= 0.69872: 0.6111 0.3889 AF
| | | MAXPHDUR_N >= 1.5395: 0.6728 0.3272 AF

Predictors of Annoyed/Frustrated
Prosodic pitch features:
- high maximum fitted F0 in the longest normalized vowel
- high speaker-normalized (first 5 utterances) ratio of F0 rises to falls
- maximum F0 close to the speaker's estimated F0 "topline"
- minimum fitted F0 late in the utterance (no "?" intonation)
Prosodic duration and speaking rate features:
- long maximum phone-normalized phone duration
- long maximum phone- and speaker-normalized (first 5 utterances) vowel duration
- low syllable rate (slower speech)
Other features:
- utterance is a repeat, rephrase, or explicit correction
- utterance occurs after the 5th-7th position in the dialog

Effect of Class Definition
For less ambiguous or more extreme tokens, performance is significantly better than our baseline.

Error Tradeoffs (ROC)
[Slide shows an ROC curve trading false alarms against misses for the annoyed/frustrated class.]
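The curve itself is not preserved in the transcript; the sketch below shows how such an error-tradeoff curve is produced by sweeping the decision threshold over classifier scores. The labels and scores are placeholders.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Placeholder posteriors for the annoyed/frustrated class; in the real
# system these would be tree leaf posteriors on held-out calls.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.20, 0.40, 0.35, 0.80, 0.10, 0.65, 0.30, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("operating points (FPR, TPR):", list(zip(fpr, tpr)))
print("AUC:", auc(fpr, tpr))
# Each threshold trades false alarms (flagging neutral speech) against
# misses (overlooking frustration), tracing out the ROC on the slide.
```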

Conclusion
- Emotion labeling is a complex decision task
- Cases that labelers independently agree on are classified with high accuracy
- Extreme emotion (e.g., "frustration") is classified even more accurately
- Classifiers rely heavily on prosodic features, particularly duration and stylized pitch
- Speaker normalizations help, and they can be computed online

Conclusions (cont.)
- Two nonprosodic features are important: utterance position and repeat/correction
- Even if repeat/correction is not used, prosody is still a good predictor (better than human-human agreement)
- The language model is an imperfect surrogate for the underlying important feature, repeat/correction
- Look for other useful dialog features!

Future Directions
- Use realistic data to get more genuine frustration
- Improve features:
  - use new F0 fitting; capture voice quality
  - base features on ASR output (1-best is straightforward)
  - optimize online normalizations
- Extend modeling:
  - model frustration sequences; include dialog state
  - exploit speaker "habits"
- Produce prosodically "tagged" data using combinations of current feature primitives
- Extend the task to other useful emotions and domains

Thank You