A Tutorial Dialogue System that Adapts to Student Uncertainty Diane Litman Computer Science Department & Intelligent Systems Program & Learning Research and Development Center
Outline Motivation The ITSPOKE System and Corpora Detecting and Adapting to Student Uncertainty (joint work with Kate Forbes-Riley) – Uncertainty Detection and Adaptation – Experimental Evaluation »Wizard-of-Oz »Fully-Automated Summing Up
Tutorial Dialogue Systems Why is one-on-one tutoring so effective? “...there is something about discourse and natural language (as opposed to sophisticated pedagogical strategies) that explains the effectiveness of unaccomplished human [tutors].” [Graesser, Person et al. 2001] Goal: improve Intelligent Tutoring Systems using Natural Language Processing
More generally... Natural Language Processing and Tools for Learning Learning Language (reading, writing, speaking) Using Language (to teach everything else) Tutors Scoring Readability Processing Language Conversational Tutors / Peers CSCL Discourse Coding Lecture Retrieval Questioning & Answering
ITSPOKE: Intelligent Tutoring Spoken Dialogue System Back-end is Why2-Atlas [VanLehn, Jordan, Rose et al. 2002] Speech Enhanced – Sphinx2 speech recognition – Cepstral text-to-speech Reimplemented, other changes
ITSPOKE Corpora Wizard Tutoring (ITSPOKE-WOZ) –81 students / 405 dialogues –human performs speech recognition, semantic analysis –computer performs dialogue management Computer Tutoring (ITSPOKE-AUTO) –72 students / 360 dialogues
Experimental Procedure College students without physics –Read a small background document –Took a multiple-choice Pretest –Worked 5 problems (dialogues) with ITSPOKE –Took an isomorphic Posttest Goal was to optimize Learning Gain – e.g., Posttest – Pretest
Why Uncertainty? Most frequent student state in our dialogue corpora [Litman and Forbes-Riley 2004] Focus of other learning sciences, speech and language processing, and psycholinguistic studies [Craig et al. 2004; Liscombe et al. 2005; Pon-Barry et al. 2006; Dijkstra et al. 2006] Reliably annotated: inter-annotator agreement of 0.73 Kappa [Forbes-Riley et al. 2008]
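For readers unfamiliar with the Kappa statistic cited above, here is a minimal sketch of how chance-corrected agreement between two annotators can be computed. It assumes Cohen's kappa (the slide does not say which variant was used), and the labels are hypothetical toy data, not the ITSPOKE annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(labels_a)
    # Proportion of turns on which the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy annotations (U = uncertain, N = not uncertain).
annotator_1 = ["U", "U", "N", "N", "N", "U", "N", "N"]
annotator_2 = ["U", "N", "N", "N", "N", "U", "N", "U"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Raw agreement on the toy data is 0.75, but kappa is lower because some of that agreement would be expected by chance.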
Corpus-Based Detection Methodology Learn detection models from training corpora –Use spoken language processing to automatically extract features from user turns –Use extracted features (e.g., prosodic, lexical) to predict uncertainty annotations Evaluate learned models on testing corpora –Significant reduction of error compared to baselines [Litman and Forbes-Riley 2006; Litman et al. 2007]
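The train/test methodology above can be sketched in a few lines. This is only an illustration under stated assumptions: a one-feature threshold learner (a decision stump) stands in for the actual learned models, the single numeric feature is a hypothetical "hedging score" per student turn, and all data values are invented.

```python
def majority_baseline(train_labels):
    """Return the most frequent training label (predicted for every turn)."""
    return max(set(train_labels), key=train_labels.count)


def train_stump(values, labels):
    """Pick the threshold that maximizes training accuracy.

    A turn is predicted uncertain (True) when its feature value >= threshold.
    """
    best_threshold, best_correct = None, -1
    for threshold in sorted(set(values)):
        correct = sum((v >= threshold) == y for v, y in zip(values, labels))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical feature values and uncertainty annotations (True = uncertain).
train_values = [0, 1, 0, 2, 3, 0, 1, 2, 0, 1]
train_labels = [False, False, False, True, True, False, True, True, False, False]
test_values = [0, 2, 1, 3, 0]
test_labels = [False, True, False, True, True]

# Learn on the training corpus, evaluate on the held-out testing corpus.
threshold = train_stump(train_values, train_labels)
model_acc = accuracy([v >= threshold for v in test_values], test_labels)

baseline_label = majority_baseline(train_labels)
baseline_acc = accuracy([baseline_label] * len(test_labels), test_labels)

# Relative reduction of error compared to the majority-class baseline.
relative_error_reduction = ((1 - baseline_acc) - (1 - model_acc)) / (1 - baseline_acc)
```

The comparison against a majority-class baseline is one common choice; the slide does not specify which baselines were used.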
System Adaptation: How to Respond? Theory-based –[VanLehn et al. 2003; Craig et al. 2004] Corpus-based –How do humans respond? e.g. [Forbes-Riley, Rotaru, Litman, and Tetreault 2007] * –What are optimal responses? e.g. [Chi, VanLehn and Litman 2010] * * Best paper awards
Theory-Based Adaptation: Uncertainty as Learning Opportunity Uncertainty represents one type of learning impasse, and is also associated with cognitive disequilibrium – An impasse motivates a student to take an active role in constructing a better understanding of the principle. [VanLehn et al. 2003] –A state of failed expectations causing deliberation aimed at restoring equilibrium. [Craig et al. 2004] Hypothesis: The system should adapt to uncertainty in the same way it responds to other impasses (e.g., incorrectness)
Adaptation to Student Uncertainty in ITSPOKE Most systems respond only to (in)correctness Literature suggests uncertain as well as incorrect student answers signal learning impasses Experimentally manipulate tutor responses to student uncertainty, over and above correctness, and investigate impact on learning –Platform: Adaptive version(s) of ITSPOKE
Normal (non-adaptive) ITSPOKE System Initiative Dialogue Format: –Tutor Question – Student Answer – Tutor Response Tutor Response Types: –to Corrects (C): positive feedback (e.g. “Fine”) –to Incorrects (I): negative feedback (e.g. “Well…”) and »Bottom Out: correct answer with reasoning »Subdialogue: questions walk through reasoning
Adaptive ITSPOKE(s)
Our Prior Work: Rank correctness (C, I) + uncertainty (U, nonU) states in terms of impasse severity
State:    I+nonU | I+U  | C+U   | C+nonU
Severity: most   | less | least | none
Adaptation Hypothesis:
–ITSPOKE already resolves I impasses (I+nonU, I+U), but it ignores one type of U impasse (C+U)
–Performance improvement if ITSPOKE provides additional content to resolve all impasses
Two Uncertainty Adaptations Simple Adaptation –Same response for all 3 impasses –Feedback on only (in)correctness Complex Adaptation –Different responses for the 3 impasses –Feedback on both uncertainty and (in)correctness
Simple Adaptation Example: C+U TUTOR1: By the same reasoning that we used for the car, what’s the overall net force on the truck equal to? STUDENT1: The force of the car hitting it?? [C+U] TUTOR2: Fine. [FEEDBACK] We can derive the net force on the truck by summing the individual forces on it, just like we did for the car. First, what horizontal force is exerted on the truck during the collision? [SUBDIALOGUE] Same TUTOR2 subdialogue if student was I+U or I+nonU
Experiment 1: ITSPOKE-WOZ Wizard of Oz version of ITSPOKE –Human recognizes speech, annotates correctness and uncertainty –Provides upper-bound language performance Conditions –Simple Adaptation: used same response for all impasses –Complex Adaptation: used different responses for each impasse –Normal Control: used original system (no adaptation) –Random Control: gave Simple Adaptation to random 20% of correct answers (to control for additional tutoring)
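The Normal, Simple Adaptation, and Random Control conditions can be summarized as a small response policy. This is a sketch under assumptions, not the ITSPOKE implementation: the function name, the condition strings, and the "remediation" / "move_on" labels are hypothetical, with "remediation" standing for a Bottom Out or Subdialogue. Complex Adaptation is omitted because it varies the response per impasse type.

```python
import random


def tutor_response(correctness, uncertainty, condition, rng=random):
    """Return (feedback, content) for one student answer.

    correctness: "C" or "I"; uncertainty: "U" or "nonU";
    condition: "normal", "simple", or "random" (hypothetical names).
    """
    # Feedback reflects only (in)correctness in all three conditions.
    feedback = "positive" if correctness == "C" else "negative"
    if correctness == "I":
        # Every condition already remediates incorrect answers (I+U, I+nonU).
        return feedback, "remediation"
    if condition == "simple" and uncertainty == "U":
        # Simple Adaptation: treat C+U like the other impasses.
        return feedback, "remediation"
    if condition == "random" and rng.random() < 0.20:
        # Random Control: extra tutoring after a random 20% of correct answers.
        return feedback, "remediation"
    return feedback, "move_on"


# A correct-but-uncertain answer under two conditions.
normal_reply = tutor_response("C", "U", "normal")
simple_reply = tutor_response("C", "U", "simple")
```

Passing `rng` explicitly keeps the Random Control branch reproducible when the policy is exercised in isolation.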
Results I: Learning
Metric: Learning Gain (Posttest – Pretest)
Condition           | N  | Mean | Diff                | p
Normal Control      | 21 | .183 | < Simple Adaptation | .03
Random Control      | 20 |      |                     |
Simple Adaptation   | 20 |      |                     |
Complex Adaptation  | 20 |      |                     |
F(3, 77) = 3.275, p = 0.02
Simple Adaptation yields more student learning than Normal Control (original ITSPOKE) [Forbes-Riley and Litman 2010]
Similar results for learning efficiency [Forbes-Riley and Litman 2009]
Additional Evaluations - Metacognition Do metacognitive performance measures differ across experimental conditions? –e.g., Monitoring Accuracy [Nietfeld et al. 2006] Do metacognitive and cognitive performance measures (i.e. learning) correlate?
Metacognitive Results Simple (and random) increased monitoring accuracy compared to normal (p <.06 in paired contrasts) Monitoring Accuracy is positively correlated with learning [Litman and Forbes-Riley 2009]
Experiment 2: ITSPOKE-AUTO Fully automated ITSPOKE –Sphinx2 speech recognizer / TuTalk semantic analyzer »Correctness Accuracy of 85% –Weka uncertainty model »Logistic regression (includes lexical, prosodic, dialogue features) »Uncertainty Accuracy of 80% Only 3 Conditions –Simple Adaptation –Normal Control –Random Control
Preliminary Results: ITSPOKE-AUTO Simple Adaptation yields more student learning than Normal and Random Controls Differences only significant for a subset of students Noisy uncertainty detection is the system bottleneck 3 of the 4 metacognitive metrics remain correlated with learning [Forbes-Riley and Litman, 2010]
Current and Future Research More sophisticated ITSPOKE adaptations –User modeling (domain knowledge, gender) –Multiple student states (disengagement) –Motivation [Ward 2010] Remediate metacognition, not just domain content
Summing Up Spoken dialogue contributes to the success of human tutors Using presently available technology, successful tutorial dialogue systems can also be built Adapting to uncertainty can further improve performance –Learning gains, efficiency, metacognition Tutors can serve as platforms for learning science studies
Related Projects Natural Language Processing and Tools for Learning Learning Language (reading, writing, speaking) Using Language (to teach everything else) Processing Language Conversational Tutors Tutor Abstraction and Specialization during Reflective Conversation [Katz/Jordan/Litman poster] Semantic Class Acquisition via Web-Learning [Lipschultz/Litman poster]
Related Projects Natural Language Processing and Tools for Learning Learning Language (reading, writing, speaking) Using Language (to teach everything else) Processing Language Computer-Supported Peer Review for Writing [Xiong/Litman/Schunn poster]
Acknowledgements ITSPOKE group past and present –Hua Ai, Min Chi, Joanna Drummond, Kate Forbes-Riley, Heather Friedberg, Alison Huettner, Michael Lipschultz, Beatriz Maeireizo-Tokeshi, Greg Nicholas, Amruta Purandare, Mihai Rotaru, Scott Silliman, Joel Tetreault, Art Ward, Wenting Xiong –Jan Wiebe, Rebecca Hwa, Wendy Chapman Why2-Atlas and Human Tutoring groups –Kurt VanLehn, Pamela Jordan, Carolyn Rose –Micki Chi, Scotty Craig, Bob Hausmann, Marguerite Roy, Sandra Katz
Thank You! Questions? Further Information –
The End
Example Student States in ITSPOKE
ITSPOKE: What else do you need to know to find the box's acceleration?
Student: the direction [UNCERTAIN]
ITSPOKE: If you see a body accelerate, what caused that acceleration?
Student: force [CERTAIN]
ITSPOKE: Good job. Say there is only one force acting on the box. How is this force, the box's mass, and its acceleration related?
Student: velocity [UNCERTAIN]
ITSPOKE: Could you please repeat that?
Student: velocity [ANNOYED]
WOZ-TUT Screenshot
Bigram Dependency Analysis: “Student Certainness – Tutor Positive Feedback” Bigrams
Table: EXPECTED vs. OBSERVED counts of tutor turns (Tutor IncludesPos, Tutor OmitsPos) following each student state (neutral, certain, uncertain, mixed)
Critical χ2 value at p = .001 is 16.27
–Less Tutor Positive Feedback after Student Neutral turns
–More Tutor Positive Feedback after “Emotional” turns
Survey Tutoring Uncertainty Spoken Dialogue
Learning Efficiency Results
Metric: Normalized learning gain / total tutoring time in minutes
Condition           | N  | Mean | Diff           | p
Normal Control      | 21 | .010 | < Simple Adapt | .004
Random Control      | 20 |      |                |
Simple Adaptation   | 20 |      |                |
Complex Adaptation  | 20 | .011 | < Simple Adapt | .013
F(3, 77) = 3.56, p = 0.02
Given same amount of tutoring time, Simple Adaptation yields more student learning than either Normal Control or Complex Adaptation
Results also hold using raw learning gain, and total number of student turns
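The efficiency metric above is simple arithmetic, sketched here for concreteness. The raw gain (Posttest – Pretest) follows the slides; the normalized gain formula, (posttest - pretest) / (1 - pretest) with scores expressed as proportions, is an assumption, since the slides do not define it. The example scores are hypothetical.

```python
def learning_gain(pretest, posttest):
    """Raw learning gain: Posttest - Pretest."""
    return posttest - pretest


def normalized_learning_gain(pretest, posttest):
    """Gain as a fraction of the room left to improve (ASSUMED definition).

    Scores are proportions in [0, 1]; a perfect pretest leaves no room.
    """
    if pretest >= 1.0:
        raise ValueError("normalized gain is undefined for a perfect pretest")
    return (posttest - pretest) / (1.0 - pretest)


def learning_efficiency(pretest, posttest, tutoring_minutes):
    """Normalized learning gain per minute of tutoring."""
    if tutoring_minutes <= 0:
        raise ValueError("tutoring time must be positive")
    return normalized_learning_gain(pretest, posttest) / tutoring_minutes


# Hypothetical student: 50% on the pretest, 75% on the posttest, 50 minutes.
raw = learning_gain(0.50, 0.75)
efficiency = learning_efficiency(0.50, 0.75, 50)
```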
Bias
             | Correct | Incorrect
NonUncertain | CnonU   | InonU
Uncertain    | CU      | IU
Bias scores greater than and less than zero indicate over-confidence and under-confidence respectively, with zero indicating best performance
Discrimination
             | Correct | Incorrect
NonUncertain | CnonU   | InonU
Uncertain    | CU      | IU
Discrimination scores greater than zero indicate higher metacognitive performance, in terms of certainty for correct responses and uncertainty for incorrect responses
Results I: Means across Conditions
Metacognitive Measure    | Complex Adaptation (20) | Simple Adaptation (20) | Random Control (20) | Normal Control (21)
Average Impasse Severity |                         |                        |                     |
Monitoring Accuracy      |                         |                        |                     |
Bias                     |                         |                        |                     |
Discrimination           |                         |                        |                     |
No statistically significant differences or trends for bias
Trend for discrimination differences overall (p = .09)
However, contrary to our predictions, Complex Adaptation reduced discrimination ability compared to Random Control and Simple Adaptation (p < .04 in paired contrasts)
Intelligent Tutoring
Corpus-Based Adaptation: How Do Human Tutors Respond? An empirical method for designing dialogue systems adaptive to student state –extraction of “dialogue bigrams” from annotated human tutoring corpora –χ² analysis to identify dependent bigrams –generalizable to any domain with corpora labeled for user state and system response
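The dependency test on dialogue bigrams can be sketched as a Pearson chi-square computation over a contingency table of student state by tutor response. The counts below are hypothetical and purely illustrative; the critical value 16.27 (p = .001) is the one quoted elsewhere in these slides for the 4 x 2 "Student Certainness – Tutor Positive Feedback" table. The sketch assumes every row and column total is nonzero.

```python
def chi_square(observed):
    """Pearson chi-square statistic and degrees of freedom.

    observed: contingency table as a list of rows of counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Count expected if student state and tutor response were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


# Hypothetical bigram counts: rows are student states
# (neutral, certain, uncertain, mixed); columns are
# (tutor includes positive feedback, tutor omits positive feedback).
bigram_counts = [
    [200, 800],
    [150, 100],
    [120, 130],
    [70, 60],
]
CRITICAL_P001 = 16.27  # critical value at p = .001, as quoted in the slides

statistic, dof = chi_square(bigram_counts)
is_dependent = statistic > CRITICAL_P001
```

Comparing each cell's observed count with its expected count then shows the direction of a dependency, for example fewer positive-feedback turns than expected after neutral student turns.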
Example Human Tutoring Excerpt
S: So the- when you throw it up the acceleration will stay the same? [Uncertain]
T: Acceleration uh will always be the same because there is- that is being caused by force of gravity which is not changing. [Restatement, Expansion]
S: mm-k. [Neutral]
T: Acceleration is– it is in- what is the direction uh of this acceleration- acceleration due to gravity? [Short Answer Question]
S: It’s- the direction- it’s downward. [Certain]
T: Yes, it’s vertically down. [Positive Feedback, Restatement]
Findings Statistically significant dependencies exist between students’ state of certainty and the responses of an expert human tutor –After uncertain, tutor Bottoms Out and avoids expansions –After certain, tutor Restates –After any emotion, tutor increases Feedback Dependencies suggest adaptive strategies for implementation in our computer tutor [Forbes-Riley and Litman 2010]
Complex Adaptation Example: C+U TUTOR2: That’s exactly right, but you seem unsure, so let’s sum up. [FEEDBACK] The net force on the truck is equal to the impact force on it. We can prove this just like we did for the car. First, we know that gravity and the normal force on the truck must cancel each other, otherwise the truck would not be at rest vertically. Second, we know that the impact force is the only horizontal force exerted on the truck. [LONG BOTTOM OUT]
Complex Adaptation Example: I+U TUTOR1: By the same reasoning that we used for the car, what’s the overall net force on the truck equal to? STUDENT1: Zero?? [I+U] TUTOR2: That’s not correct, but don’t worry. You seem to be aware of your mistake. Let’s resolve it. [FEEDBACK] The net force on the truck is equal to the impact force on it. Let’s walk through this answer step by step. [SHORT BOTTOM OUT] We can derive the net force on the truck by summing the individual forces on it, just like we did for the car. First, what horizontal force is exerted on the truck during the collision? [SUBDIALOGUE]
Complex Adaptation Example: I+nonU TUTOR1: By the same reasoning that we used for the car, what’s the overall net force on the truck equal to? STUDENT1: The force of gravity. [I+nonU] TUTOR2: I’m sorry, but I see an error in your answer. Let’s fix it. [FEEDBACK] We can derive the net force on the truck by summing the individual forces on it, just like we did for the car. First, what horizontal force is exerted on the truck during the collision? [SUBDIALOGUE]
Discussion
Predictions versus results:
–Complex Adaptation > Simple Adaptation > Random Control > Normal Control
Why didn’t Simple Adaptation and Complex Adaptation outperform Random Control?
–Random Control adapted to some C+U, diminishing differences
–Adapting to C+nonU may increase certainty
Why didn’t Complex Adaptation outperform Simple Adaptation?
–Complex Adaptation’s human-based content responses were based on frequency, not effectiveness
Complex Adaptation to Uncertainty Depending on whether the answer is C+U, I+U, or I+nonU: –ITSPOKE gives same content but varies dialogue act »Based on human tutor responses significantly associated with C+U, I+U, I+nonU answers –ITSPOKE gives complex feedback on uncertainty and (in)correctness »Based on empathetic computer tutor literature (Wang et al., 2005; Hall et al., 2004; Burleson et al., 2004)
Impasse Severity
Use the scalar value associated with each student turn to compute an average impasse severity, per student
Nominal State: I+nonU | I+U  | C+U   | C+nonU
Scalar State:
Severity:      most   | less | least | none
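The per-student average can be sketched as follows. The scalar values did not survive in this copy of the slide, so the mapping 3, 2, 1, 0 below is an assumption chosen only to respect the stated ordering (most, less, least, none); the turn sequence is hypothetical.

```python
# ASSUMED scalar values: only the ordering (most > less > least > none)
# comes from the slides, not the specific numbers.
SEVERITY = {"I+nonU": 3, "I+U": 2, "C+U": 1, "C+nonU": 0}


def average_impasse_severity(turn_states):
    """Mean scalar severity over one student's annotated turns."""
    if not turn_states:
        raise ValueError("student has no annotated turns")
    return sum(SEVERITY[state] for state in turn_states) / len(turn_states)


# Hypothetical student with four annotated turns.
student_turns = ["C+nonU", "C+U", "I+U", "I+nonU"]
avg_severity = average_impasse_severity(student_turns)
```

Under this mapping a smaller average means the student spent fewer turns in severe impasse states.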
Results II
Correlations of Metacognitive Measures with Posttest, after controlling for Pretest
Metacognitive Measure (n=81) | R | p
Average Impasse Severity     |   |
Monitoring Accuracy          |   |
Average Impasse Severity (where smaller is better) is negatively correlated with learning [Litman and Forbes-Riley 2009]
Monitoring Accuracy (where higher is better) is positively correlated with learning [Litman and Forbes-Riley 2009]
Preliminary Results: ITSPOKE-AUTO
Metacognitive Measure    | WOZ R | WOZ p | AUTO R | AUTO p
Average Impasse Severity |       |       |        |
Monitoring Accuracy      |       |       |        |
Impasse Severity and Monitoring Accuracy remain correlated with learning in ITSPOKE-AUTO corpus [Forbes-Riley and Litman, submitted]
Monitoring Accuracy
             | Correct | Incorrect
NonUncertain | CnonU   | InonU
Uncertain    | CU      | IU
The wizard's annotations for each student are first represented in an array, where each cell represents a mutually exclusive option motivated by Feeling of (Another’s) Knowing [Smith and Clark 1993; Brennan and Williams 1995] which is closely related to uncertainty [Dijkstra et al. 2006]
The array is then used to compute monitoring accuracy
Ranges from -1 (no monitoring accuracy) to 1 (perfect monitoring accuracy)
Metacognitive Performance Metrics
Knowledge monitoring accuracy (HC) (Nietfeld et al., 2006)
Monitoring one’s own knowledge ≈ one’s Certainty level ≈ one’s Feeling of Knowing (FOK)
–HC has been used to measure FOK accuracy (Smith & Clark, 1993): the accuracy with which one’s certainty corresponds to correctness
Feeling of Another’s Knowing (FOAK): inferring the FOK of someone else (Brennan & Williams, 1995)
–We use HC to measure FOAK accuracy (our uncertainty is inferred)
$$HC = \frac{(\mathit{COR\_CER} + \mathit{INC\_UNC}) - (\mathit{INC\_CER} + \mathit{COR\_UNC})}{(\mathit{COR\_CER} + \mathit{INC\_UNC}) + (\mathit{INC\_CER} + \mathit{COR\_UNC})}$$
–First numerator term: cases where (un)certainty and (in)correctness agree
–Second numerator term: cases where (un)certainty and (in)correctness are at odds
–Denominator sums over all cases
–Scores range from -1 (no accuracy) to 1 (perfect accuracy)
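The HC formula translates directly into code. The sketch below takes the four cell counts for one student; the function name and the example counts are hypothetical, and it assumes that "certain" corresponds to the nonU annotation (so COR_CER is the C+nonU cell, and so on).

```python
def monitoring_accuracy(cor_cer, inc_unc, inc_cer, cor_unc):
    """Knowledge monitoring accuracy (HC) from the four cell counts.

    cor_cer: correct and certain      inc_unc: incorrect and uncertain
    inc_cer: incorrect and certain    cor_unc: correct and uncertain
    """
    agree = cor_cer + inc_unc      # (un)certainty matches (in)correctness
    at_odds = inc_cer + cor_unc    # (un)certainty contradicts (in)correctness
    total = agree + at_odds
    if total == 0:
        raise ValueError("HC is undefined when there are no annotated turns")
    return (agree - at_odds) / total


# Hypothetical student: 6 C+nonU, 2 I+U, 1 I+nonU, 1 C+U turns.
hc = monitoring_accuracy(cor_cer=6, inc_unc=2, inc_cer=1, cor_unc=1)
```

A student whose certainty always matches correctness scores 1, and one whose certainty never does scores -1.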