Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources Kate Forbes-Riley and Diane Litman Learning Research and Development Center and Computer Science Department University of Pittsburgh

Overview
- Motivation: spoken dialogue tutoring systems
- Emotion annotation: positive, negative and neutral student states
- Machine learning experiments:
  - extract linguistic features from student speech
  - use different feature sets to predict emotions
  - best-performing feature set: speech & text, turn & context (84.75% accuracy, 44% error reduction)

Motivation  Bridge Learning Gap between Human Tutors and Computer Tutors  (Aist et al., 2002): Adding human-provided emotional scaffolding to a reading tutor increases student persistence  Our Approach: Add emotion prediction and adaptation to ITSPOKE, our Intelligent Tutoring SPOKEn dialogue system (demo paper)

Experimental Data  Human Tutoring Spoken Dialogue Corpus 128 dialogues (physics problems), 14 subjects 45 average student and tutor turns per dialogue Same physics problems, subject pool, web interface, and experimental procedure as ITSPOKE

Emotion Annotation Scheme (Sigdial'04)
- Perceived "emotions"
- Task- and context-relative
- 3 main emotion classes: negative, neutral, positive
- 3 minor emotion classes: weak negative, weak positive, mixed

Example Annotated Excerpt (weak, mixed -> neutral)
Tutor: Uh let us talk of one car first.
Student: ok. (EMOTION = NEUTRAL)
Tutor: If there is a car, what is it that exerts force on the car such that it accelerates forward?
Student: The engine. (EMOTION = POSITIVE)
Tutor: Uh well engine is part of the car, so how can it exert force on itself?
Student: um... (EMOTION = NEGATIVE)

Emotion Annotated Data
- 453 student turns, 10 dialogues, 9 subjects
- 2 annotators, 3 main emotion classes
- 385/453 turns agreed on (84.99%, Kappa = 0.68)

Confusion matrix (annotator 1 rows, annotator 2 columns):

            Negative   Neutral   Positive
Negative        90         6          4
Neutral          …         …          …
Positive         0         5         15
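Agreement figures like these can be reproduced with standard tools; below is a minimal sketch using scikit-learn's Cohen's Kappa implementation. The annotator label lists are invented placeholders, not the 453-turn corpus annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-turn labels from two annotators (illustrative only).
annotator1 = ["neutral", "negative", "neutral", "positive", "neutral"]
annotator2 = ["neutral", "negative", "positive", "positive", "neutral"]

# Raw percent agreement and chance-corrected Kappa.
agreement = sum(a == b for a, b in zip(annotator1, annotator2)) / len(annotator1)
kappa = cohen_kappa_score(annotator1, annotator2)
print(f"agreement = {agreement:.2%}, kappa = {kappa:.2f}")
```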

Feature Extraction per Student Turn
Five feature types:
- acoustic-prosodic (1)
- non acoustic-prosodic:
  - lexical (2)
  - other automatic (3)
  - manual (4)
- identifiers (5)
Research questions:
- utility of different features
- speaker and task dependence

Feature Types (1): Acoustic-Prosodic Features (normalized)
- 4 pitch (f0): max, min, mean, standard deviation
- 4 energy (RMS): max, min, mean, standard deviation
- 4 temporal:
  - turn duration (seconds)
  - pause length preceding turn (seconds)
  - tempo (syllables/second)
  - internal silence in turn (zero f0 frames)
- Available to ITSPOKE in real time
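As an illustration only, here is a minimal sketch of how such turn-level statistics could be computed, assuming a pitch tracker and energy extractor have already produced per-frame f0 and RMS arrays for the turn; the function and field names are hypothetical and not ITSPOKE's actual code. Speaker normalization would be applied afterwards.

```python
import numpy as np

def acoustic_prosodic_features(f0, rms, turn_dur_s, pause_before_s, n_syllables):
    """Turn-level acoustic-prosodic features.

    f0, rms: numpy arrays of per-frame pitch (Hz) and RMS energy for the turn.
    Frames with f0 == 0 are treated as unvoiced/silent, mirroring the
    'zero f0 frames' internal-silence feature; assumes some frames are voiced.
    """
    voiced = f0[f0 > 0]
    return {
        # 4 pitch features over voiced frames
        "f0_max": voiced.max(), "f0_min": voiced.min(),
        "f0_mean": voiced.mean(), "f0_std": voiced.std(),
        # 4 energy features
        "rms_max": rms.max(), "rms_min": rms.min(),
        "rms_mean": rms.mean(), "rms_std": rms.std(),
        # 4 temporal features
        "duration_s": turn_dur_s,
        "prior_pause_s": pause_before_s,
        "tempo_syl_per_s": n_syllables / turn_dur_s,
        "internal_silence_frames": int((f0 == 0).sum()),
    }
```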

Feature Types (2): Lexical Features
- word occurrence vector for the turn
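A word occurrence vector of this kind can be built, for example, with scikit-learn's CountVectorizer using binary presence/absence rather than counts; the example turns below are invented, not corpus data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented example student turns.
turns = ["the engine", "um", "ok", "the force of gravity"]

# binary=True marks word presence/absence per turn rather than word counts.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(turns)          # sparse turns-by-vocabulary matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())
```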

Feature Types (3): Other Automatic Features (available from ITSPOKE logs)
- Turn begin time (seconds from dialogue start)
- Turn end time (seconds from dialogue start)
- Is temporal barge-in (student turn begins before tutor turn ends)
- Is temporal overlap (student turn begins and ends within tutor turn)
- Number of words in turn
- Number of syllables in turn
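The two timing flags follow directly from the logged turn boundaries. A minimal sketch under that assumption (the timestamp parameter names are hypothetical):

```python
def timing_flags(student_start, student_end, tutor_start, tutor_end):
    """Temporal barge-in/overlap flags from logged turn boundaries (seconds)."""
    # Barge-in: student starts speaking before the tutor turn has ended.
    is_barge_in = student_start < tutor_end
    # Overlap: the student turn lies entirely within the tutor turn.
    is_overlap = tutor_start <= student_start and student_end <= tutor_end
    return is_barge_in, is_overlap

# Example: student starts 0.4 s before the tutor finishes speaking.
print(timing_flags(student_start=12.1, student_end=13.0,
                   tutor_start=8.0, tutor_end=12.5))   # (True, False)
```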

Feature Types (4): Manual Features (currently available only from human transcription)
- Is prior tutor question (tutor turn contains "?")
- Is student question (student turn contains "?")
- Is semantic barge-in (student turn begins at a tutor word/pause boundary)
- Number of hedging/grounding phrases (e.g. "mm-hm", "um")
- Is grounding (canonical phrase turn not preceded by a tutor question)
- Number of false starts in turn (e.g. "acc-acceleration")

Feature Types (5): Identifier Features
- subject ID
- problem ID
- subject gender

Machine Learning (ML) Experiments
- Weka software: boosted decision trees give best results (Litman & Forbes, ASRU 2003)
- Baseline: predict the majority class (neutral); accuracy = 72.74%
- Methodology: 10 runs of 10-fold cross-validation
- Evaluation metrics:
  - Mean accuracy: % correct
  - Relative improvement over baseline (RI):
    RI = (error(baseline) - error(x)) / error(baseline)
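Weka's boosted decision trees and the 10x10 cross-validation setup could be approximated in scikit-learn roughly as below. This is a sketch, not the authors' actual pipeline; the feature matrix X and integer-encoded label vector y are assumed to have been built from the turn features above.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def evaluate(X, y):
    """X: turns-by-features matrix; y: integer-encoded emotion labels."""
    # AdaBoost over shallow decision trees, roughly analogous to the
    # boosted decision trees reported on the slide.
    clf = AdaBoostClassifier(n_estimators=100)

    # 10 runs of 10-fold cross-validation.
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    acc = cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()

    # Majority-class baseline and relative improvement (RI) over it.
    baseline = np.max(np.bincount(y)) / len(y)
    ri = ((1 - baseline) - (1 - acc)) / (1 - baseline)
    return acc, ri
```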

Acoustic-Prosodic vs. Other Features
- Baseline = 72.74%; RI range = 12.69% - …%

Feature Set                         -ident
speech                              76.20%
lexical                             78.31%
lexical + automatic                 80.38%
lexical + automatic + manual        83.19%

- Acoustic-prosodic features ("speech") outperform the majority baseline, but the other feature types yield even higher accuracy, and accuracy rises as more of them are added

Acoustic-Prosodic plus Other Features

Feature Set                               -ident
speech + lexical                          79.26%
speech + lexical + automatic              79.64%
speech + lexical + automatic + manual     83.69%

- Baseline = 72.74%; RI range = 23.29% - …%
- Adding acoustic-prosodic features to the other feature sets does not significantly improve performance

Adding Contextual Features
- (Litman et al. 2001; Batliner et al. 2003): adding contextual features improves prediction accuracy
- Local features: the values of all features for the two student turns preceding the student turn to be predicted
- Global features: running averages and totals of all features, over all student turns preceding the student turn to be predicted
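A minimal sketch of how local and global contextual features might be derived from a sequence of per-turn feature dictionaries; the representation is assumed for illustration and is not taken from the authors' implementation.

```python
def add_context_features(turn_feats, n_local=2):
    """turn_feats: list of {feature_name: value} dicts, one per student turn,
    in dialogue order. Returns a parallel list augmented with context features."""
    augmented = []
    for i, feats in enumerate(turn_feats):
        row = dict(feats)
        # Local: raw feature values of the two preceding student turns.
        for k in range(1, n_local + 1):
            prev = turn_feats[i - k] if i - k >= 0 else {}
            for name in feats:
                row[f"local{k}_{name}"] = prev.get(name, 0.0)
        # Global: running averages over all preceding student turns.
        history = turn_feats[:i]
        for name in feats:
            values = [h[name] for h in history if name in h]
            row[f"global_avg_{name}"] = sum(values) / len(values) if values else 0.0
        augmented.append(row)
    return augmented
```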

Previous Feature Sets plus Context
- Same feature set with no context: 83.69%

Feature Set                              +context        -ident
speech + lexical + auto + manual         local           82.44
speech + lexical + auto + manual         global          84.75
speech + lexical + auto + manual         local+global    81.43

- Adding global contextual features marginally improves performance

Feature Usage

Feature Type           Turn + Global
Acoustic-Prosodic          16.26%
  Temporal                 13.80%
  Energy                    2.46%
  Pitch                     0.00%
Other                      83.74%
  Lexical                  41.87%
  Automatic                 9.36%
  Manual                   32.51%

Accuracies over ML Experiments

Related Research in Emotional Speech
- Actor/native read speech corpora (Polzin & Waibel 1998; Oudeyer 2002; Liscombe et al. 2003)
  - more emotions; multiple dimensions
  - acoustic-prosodic predictors
- Naturally-occurring speech corpora (Ang et al. 2002; Lee et al. 2002; Batliner et al. 2003; Devillers et al. 2003; Shafran et al. 2003)
  - fewer emotions (e.g. E / -E); Kappas < 0.6
  - additional (non acoustic-prosodic) predictors
- Few address the tutoring domain

Summary  Methodology: Annotation of student emotions in spoken human tutoring dialogues, extraction of linguistic features, and use of different feature sets to predict emotions  Our best-performing feature set contains acoustic- prosodic, lexical, automatic and hand-labeled features from turn and context (Accuracy = 85%, RI = 44%)  This research is a first step towards implementing emotion prediction and adaptation in ITSPOKE

Current Directions
- Address the same questions in the ITSPOKE computer tutoring corpus (ACL'04)
- Label human tutor reactions to student emotions, in order to:
  - develop adaptive strategies for ITSPOKE
  - examine the utility of different annotation granularities
  - determine whether greater tutor response to student emotions correlates with student learning and other performance measures

Thank You! Questions?

Prior Research: Affective Computer Tutoring
- (Kort, Reilly and Picard, 2001): propose a cyclical model of emotion change during learning; developing a non-dialogue computer tutor that will use eye-tracking/facial features to predict emotion and support movement into positive emotions
- (Aist, Kort, Reilly, Mostow and Picard, 2002): adding human-provided emotional scaffolding to an automated reading tutor increases student persistence
- (Evens et al., 2002): for CIRCSIM, a computer dialogue tutor for physiology problems, hypothesize adaptive strategies for recognized student emotional states; e.g. if frustration is detected, the system should respond to hedges and self-deprecation by supplying praise and restructuring the problem
- (de Vicente and Pain, 2002): use human observations of student motivational states in videoed interaction with a non-dialogue computer tutor to develop rules for detection
- (Ward and Tsukahara, 2003): a spoken dialogue computer "tutor-support" uses prosodic and contextual features of the user turn (e.g. "on a roll", "lively", "in trouble") to infer an appropriate response as users try to remember train stations; preferred over randomly chosen acknowledgments (e.g. "yes", "right", "that's it", ...)
- (Conati and Zhou, 2004): use Dynamic Bayesian Networks to reason under uncertainty about abstracted student knowledge and emotional states through time, based on student moves in a non-dialogue computer game, and to guide selection of "tutor" responses
- Most of this work will be relevant to developing ITSPOKE adaptation techniques

ML Experiment 3: Other Evaluation Metrics
- alltext + speech + ident: leave-one-out cross-validation (accuracy = 82.08%)
- Best for neutral; better for negatives than for positives
- Baseline (precision, recall, F-measure): neutral = .73, 1, .84; negatives and positives = 0, 0, 0

Class        Precision   Recall   F-Measure
Negative         …          …         …
Neutral          …          …         …
Positive         …          …         …

Machine Learning (ML) Experiments
- Weka machine-learning software: boosted decision trees give best results (Litman & Forbes, 2003)
- Baseline: predict the majority class (neutral); accuracy = 72.74%
- Methodology: 10 runs of 10-fold cross-validation
- Evaluation metrics:
  - Mean accuracy: % correct, where error(x) = 100% - %correct(x)
  - Standard error: SE = std(x)/sqrt(n), n = 10 runs; +/- 2*SE gives a 95% confidence interval
  - Relative improvement over baseline (RI):
    RI = (error(baseline) - error(x)) / error(baseline)

Outline  Introduction  ITSPOKE Project  Emotion Annotation  Machine-Learning Experiments  Conclusions and Current Directions

ITSPOKE: Intelligent Tutoring SPOKEn Dialogue System
- Back-end is the text-based Why2-Atlas tutorial dialogue system (VanLehn et al., 2002)
- Sphinx2 speech recognizer
- Cepstral text-to-speech synthesizer
- Try ITSPOKE during the demo session!

Experimental Procedure  Students take a physics pretest  Students read background material  Students use the web and voice interface to work through up to 10 problems with either ITSPOKE or a human tutor  Students take a post-test

ML Experiment 3: 8 Feature Sets + Context
- Global context marginally better than local or combined
- No significant difference between +/- ident sets
- e.g., speech without context: 76.20% (-ident), 77.41% (+ident)

Feature Set    +context        -ident   +ident
speech         local             …        …
speech         global            …        …
speech         local+global      …        …

- Adding context marginally improves some performances

8 Feature Sets
- speech: normalized acoustic-prosodic features
- lexical: lexical items in the turn
- autotext: lexical + automatic features
- alltext: lexical + automatic + manual features
- +ident: each of the above + identifier features