Predicting Student Emotions in Computer-Human Tutoring Dialogues Diane J. Litman and Kate Forbes-Riley University of Pittsburgh Pittsburgh, PA USA
Motivation Bridge Learning Gap between Human Tutors and Computer Tutors Our Approach: Add emotion prediction and adaptation to ITSPOKE, our Intelligent Tutoring SPOKEn dialogue system
Back-end is Why2-Atlas system (VanLehn et al., 2002) Sphinx2 speech recognition and Cepstral text-to-speech
Human-Computer Excerpt
Tutor26: Let me ask my question in a more general way. What happens to the velocity of a body when there is no force acting on it?
Student27: dammit (ASR: it is)
Tutor28: Could you please repeat that?
Student29: same (ASR: i same)
Tutor30: Right. And what happens to the HORIZONTAL velocity of a body when there is no HORIZONTAL force acting on it?
Student31: zero (ASR: the zero)
Tutor32: Well... OK, let's consider how forces in the vertical direction affect the motion of a body in a horizontal direction in a different scenario
Student33: oh gravity you already said this (ASR: beats gravity you're exert less)
Tutor34: Fine. Are there any other forces acting on the apple as it falls?
Student35: no why are you doing this again (ASR: no y and to it yes)
Tutor36: Correct. The only force acting on the apple as it falls to the earth is the earth's gravitational force. In which direction does gravitational force act?
Student37: downward you computer (ASR: downward you computer)
Outline Data and Emotion Annotation Machine Learning Experiments –extract linguistic features from student turns –use different feature sets to predict emotions 19-36% relative reduction of baseline error –comparison with human tutoring
ITSPOKE Dialogue Corpus – 100 spoken tutoring dialogues (physics problems) with ITSPOKE on average, 19.4 minutes and 25 student turns – 20 subjects university students who have never taken college physics and who are native speakers
Emotion Annotation Scheme (Sigdial’04)
‘Emotion’: emotions/attitudes that may impact learning
Annotation of Student Turns
Emotion Classes
–negative: e.g. uncertain, bored, irritated, confused, sad
–positive: e.g. confident, enthusiastic
–neutral: no weak or strong expression of negative or positive emotion
Example Annotated Excerpt
ITSPOKE: What happens to the velocity of a body when there is no force acting on it?
Student: dammit (NEGATIVE) ASR: it is
ITSPOKE: Could you please repeat that?
Student: same (NEUTRAL) ASR: i same
Agreement Study
333 student turns, 15 dialogues
2 annotators (the authors)
           Negative   Neutral   Positive
Negative   89         30        6
Neutral
Positive   619
Emotion Classification Tasks Negative, Neutral, Positive Kappa =.4, Weighted Kappa =.5 Focus of this talk Negative, Non-Negative Kappa =.5 Emotional, Non-Emotional Kappa =.3 Results on par with prior research Kappas of in (Ang et al. 2002; Narayanan 2002; Shafran et al. 2003)
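The Kappa values on this slide measure inter-annotator agreement corrected for chance. As a minimal sketch of how unweighted Cohen's kappa is computed from a two-annotator confusion matrix (the weighted variant is not covered here), the function below uses a small hypothetical 2x2 table, not the study's annotation counts.

```python
from typing import List


def cohen_kappa(matrix: List[List[int]]) -> float:
    """Unweighted Cohen's kappa for a square confusion matrix.

    matrix[i][j] = number of turns annotator A labeled i and annotator B labeled j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: proportion of turns on the diagonal.
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Chance agreement: product of the two annotators' marginal proportions.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total ** 2)
    return (observed - expected) / (1 - expected)


# Hypothetical counts for illustration only (NOT the study's data).
toy_matrix = [[20, 5],
              [10, 15]]
print(round(cohen_kappa(toy_matrix), 3))  # 0.4
```

Here observed agreement is 0.7 and chance agreement is 0.5, so kappa is (0.7 - 0.5) / (1 - 0.5) = 0.4.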
Feature Extraction per Student Turn Three feature types 1.Acoustic-prosodic 2.Lexical 3.Identifiers Research questions –Relative utility of acoustic-prosodic, lexical and identifier features –Impact of speech recognition –Comparison with human tutoring (HLT/NAACL, 2004)
Feature Types (1)
Acoustic-Prosodic Features
–4 pitch (f0): max, min, mean, standard dev.
–4 energy (RMS): max, min, mean, standard dev.
–4 temporal: turn duration (seconds), pause length preceding turn (seconds), tempo (syllables/second), internal silence in turn (zero f0 frames)
available to ITSPOKE in real time
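The twelve acoustic-prosodic features could be assembled per turn roughly as follows. This is a sketch under stated assumptions, not the ITSPOKE implementation: the function and key names are hypothetical, the f0 and RMS contours are assumed to arrive as per-frame lists, pitch statistics are taken over voiced (non-zero f0) frames only, the population standard deviation is used, and internal silence is expressed as the fraction of zero-f0 frames.

```python
import statistics
from typing import Dict, List


def summarize(values: List[float], prefix: str) -> Dict[str, float]:
    """Max, min, mean, and (population) standard deviation of a contour."""
    return {
        f"{prefix}_max": max(values),
        f"{prefix}_min": min(values),
        f"{prefix}_mean": statistics.fmean(values),
        f"{prefix}_std": statistics.pstdev(values),
    }


def turn_features(f0: List[float], rms: List[float], duration_s: float,
                  prior_pause_s: float, n_syllables: int) -> Dict[str, float]:
    """Build the 12 per-turn features (4 pitch, 4 energy, 4 temporal)."""
    # Assumption: frames with f0 == 0 are unvoiced/silent and excluded from pitch stats.
    voiced = [v for v in f0 if v > 0]
    feats: Dict[str, float] = {}
    feats.update(summarize(voiced, "f0"))
    feats.update(summarize(rms, "rms"))
    feats["duration"] = duration_s
    feats["prior_pause"] = prior_pause_s
    feats["tempo"] = n_syllables / duration_s  # syllables per second
    # Assumption: internal silence as the share of zero-f0 frames in the turn.
    feats["internal_silence"] = (len(f0) - len(voiced)) / len(f0)
    return feats


# Hypothetical frame values for illustration only.
example = turn_features(f0=[0.0, 100.0, 200.0, 0.0], rms=[0.1, 0.3],
                        duration_s=2.0, prior_pause_s=0.5, n_syllables=4)
print(len(example))  # 12
```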
Feature Types (2) Word Occurrence Vectors Human-transcribed lexical items in the turn ITSPOKE-recognized lexical items
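A word occurrence vector marks, for each vocabulary item, whether it appears in the turn. The sketch below builds one vocabulary from human transcriptions and one from recognizer output, using the two student turns from the annotated excerpt. Lowercased whitespace tokenization and the function names are assumptions made for illustration.

```python
from typing import List


def build_vocab(turns: List[str]) -> List[str]:
    """Sorted list of distinct lowercase tokens across all turns."""
    return sorted({tok for turn in turns for tok in turn.lower().split()})


def occurrence_vector(turn: str, vocab: List[str]) -> List[int]:
    """1 if the vocabulary word occurs in the turn, else 0 (counts are ignored)."""
    tokens = set(turn.lower().split())
    return [1 if word in tokens else 0 for word in vocab]


# The same two student turns, as transcribed by a human and as recognized by ASR.
transcribed = ["dammit", "same"]
recognized = ["it is", "i same"]

human_vocab = build_vocab(transcribed)  # ['dammit', 'same']
asr_vocab = build_vocab(recognized)     # ['i', 'is', 'it', 'same']

print(occurrence_vector(transcribed[0], human_vocab))  # [1, 0]
print(occurrence_vector(recognized[0], asr_vocab))     # [0, 1, 1, 0]
```

The first turn shows how recognition errors change the lexical features: the transcribed word "dammit" never reaches the ASR-based vector.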
Feature Types (3) Identifier Features student id student gender problem id
Machine Learning Experiments Weka software: Boosted decision trees –gave best results in pilot studies (ASRU 2003) Baseline: Majority class (neutral) Methodology: 10 runs of 10-fold cross validation Evaluation Metric: Accuracy Datasets: –Agreed (202/333 turns where annotators agreed) –Consensus (all 333 turns after annotators resolved disagreements)
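The evaluation setup (majority-class baseline, 10 runs of 10-fold cross validation) can be sketched in plain Python. This only illustrates the bookkeeping, not the Weka boosted decision trees themselves. The function names, the seed, and the round-robin fold assignment are assumptions.

```python
import random
from collections import Counter
from typing import Iterator, List, Tuple


def majority_baseline_accuracy(labels: List[str]) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def repeated_kfold(n_items: int, k: int = 10, runs: int = 10,
                   seed: int = 0) -> Iterator[Tuple[int, int, List[int], List[int]]]:
    """Yield (run, fold, train_indices, test_indices) for `runs` reshuffled k-fold splits."""
    rng = random.Random(seed)
    for run in range(runs):
        indices = list(range(n_items))
        rng.shuffle(indices)  # a fresh random partition for every run
        for fold in range(k):
            test = indices[fold::k]  # every k-th shuffled item goes to this fold
            held_out = set(test)
            train = [i for i in indices if i not in held_out]
            yield run, fold, train, test


# 202 agreed turns, 10 x 10 folds -> 100 train/test splits to average accuracy over.
splits = list(repeated_kfold(202))
print(len(splits))  # 100
```

A classifier's reported accuracy would then be the mean over all 100 test folds, compared against `majority_baseline_accuracy` on the same labels.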
Acoustic-Prosodic vs. Lexical Features (Agreed Turns)
Baseline = 46.52%
Feature Set       -ident
speech            55.49%
lexical           52.66%
speech+lexical    62.08%
Both acoustic-prosodic (“speech”) and lexical features significantly outperform the majority baseline
Combining feature types yields an even higher accuracy
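Accuracy gains over a baseline are often restated as a relative reduction of baseline error. As a worked example under the usual definition (assumed here, since the slides do not spell out the formula), the sketch below applies it to the speech+lexical accuracy and the majority baseline from this slide.

```python
def relative_error_reduction(baseline_acc: float, model_acc: float) -> float:
    """Fraction of the baseline's error removed by the model (accuracies in percent)."""
    baseline_err = 100.0 - baseline_acc
    model_err = 100.0 - model_acc
    return (baseline_err - model_err) / baseline_err


# Values from the slide: baseline 46.52%, speech+lexical (-ident) 62.08%.
reduction = relative_error_reduction(46.52, 62.08)
print(f"{reduction * 100:.1f}%")  # 29.1%
```

The baseline error is 53.48 points and the model removes 15.56 of them, which is about 29.1% of the baseline error.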
Adding Identifier Features (Agreed Turns)
Baseline = 46.52%
Feature Set       -ident    +ident
speech            55.49%    62.03%
lexical           52.66%    67.84%
speech+lexical    62.08%    63.52%
Adding identifier features improves all results
With identifier features, lexical information now yields the highest accuracy
Using Automatic Speech Recognition (Agreed Turns)
Baseline = 46.52%
Feature Set       -ident    +ident
lexical           52.66%    67.84%
ASR               57.95%    65.70%
speech+lexical    62.08%    63.52%
speech+ASR        61.22%    62.23%
Surprisingly, using ASR output rather than human transcriptions does not particularly degrade accuracy
Summary of Results (Agreed Turns)
Summary of Results (Consensus Turns) - Using consensus rather than agreed data decreases predictive accuracy for all feature sets, but other observations generally hold
Comparison with Human Tutoring (Agreed Turns) - In human tutoring dialogues, emotion prediction (and annotation) is more accurate and based on somewhat different features
Related Research in Emotional Speech
Elicited Speech (Polzin & Waibel 1998; Oudeyer 2002; Liscombe et al. 2003)
Naturally-Occurring Speech (Ang et al. 2002; Lee et al. 2002; Batliner et al. 2003; Devillers et al. 2003; Shafran et al. 2003)
Our Work
–naturally-occurring tutoring data
–analysis of comparable human and computer corpora
Current Directions Develop adaptive strategies for ITSPOKE –annotate human tutor turns –evaluate ITSPOKE with emotion adaptation Co-training to address annotation bottleneck –Maeireizo, Litman, and Hwa: Saturday poster
Summary Recognition of annotated student emotions in spoken computer tutoring dialogues Feature sets containing acoustic-prosodic, lexical, and/or identifier features yield significant improvements in predictive accuracy compared to majority class baselines –role of differing feature types and speech recognition errors –comparable analysis of human tutoring dialogues –paper contains details regarding two other emotion prediction tasks This research is a first step towards implementing emotion prediction and adaptation in ITSPOKE
Thank You! Questions?
Example Annotated Excerpt
ITSPOKE: What else do you need to know to find the box's acceleration?
Student: the direction (NEGATIVE, UNCERTAIN) ASR: add directions
ITSPOKE: If you see a body accelerate, what caused that acceleration?
Student: force (POSITIVE, CONFIDENT) ASR: force
ITSPOKE: Good job. Say there is only one force acting on the box. How is this force, the box's mass, and its acceleration related?
Student: velocity (NEGATIVE, UNCERTAIN) ASR: velocity
ITSPOKE: Could you please repeat that?
Student: velocity (NEGATIVE, IRRITATED) ASR: velocity
Prior Research: Affective Computer Tutoring
(Kort, Reilly, and Picard, 2001): propose a cyclical model of emotion change during learning; developing a non-dialog computer tutor that will use eye-tracking/facial features to predict emotion and support movement into positive emotions.
(Aist, Kort, Reilly, Mostow, and Picard, 2002): adding human-provided emotional scaffolding to an automated reading tutor increases student persistence.
(Evens et al., 2002): for CIRCSIM, a computer dialog tutor for physiology problems; hypothesize adaptive strategies for recognized student emotional states; e.g. if detecting frustration, system should respond to hedges and self-deprecation by supplying praise and restructuring the problem.
(de Vicente and Pain, 2002): use human observation about student motivational states in videoed interaction with a non-dialog computer tutor to develop rules for detection.
(Ward and Tsukahara, 2003): spoken dialog computer “tutor-support” uses prosodic and contextual features of the user turn (e.g. “on a roll”, “lively”, “in trouble”) to infer an appropriate response as users remember train stations. Preferred over randomly chosen acknowledgments (e.g. “yes”, “right”, “that’s it”, …).
(Conati and Zhou, 2004): use Dynamic Bayesian Networks to reason under uncertainty about abstracted student knowledge and emotional states through time, based on student moves in a non-dialog computer game, and to guide selection of “tutor” responses.
Most will be relevant to developing ITSPOKE adaptation techniques
Experimental Procedure Students take a physics pretest Students read background material Students use the web and voice interface to work through up to 10 problems with either ITSPOKE or a human tutor Students take a post-test