Correlations with Learning in Spoken Tutoring Dialogues
Diane Litman
Learning Research and Development Center and Computer Science Department
University of Pittsburgh
Motivation An empirical basis for optimizing dialogue behaviors in spoken tutorial dialogue systems What aspects of dialogue correlate with learning? –Student behaviors –Tutor behaviors –Interacting student and tutor behaviors Do correlations generalize across tutoring situations? –Human-human tutoring –Human-computer tutoring
Approach Initial: learning correlations with superficial dialogue characteristics [Litman et al., Intelligent Tutoring Systems Conf., 2004] –Easy to compute automatically and in real-time, but… –Correlations in the literature did not generalize to our spoken or human-computer corpora –Results were difficult to interpret »e.g., do longer student turns contain more explanations? Current: learning correlations with deeper “dialogue act” codings [Forbes-Riley et al., Artificial Intelligence and Education Conf., 2005]
ITSPOKE (Version1): Intelligent Tutoring SPOKEn Dialogue System Back-end is text-based Why2-Atlas tutorial dialogue system (VanLehn et al., 2002) Student speech digitized from microphone input; Sphinx2 speech recognizer Tutor speech played via headphones/speakers; Cepstral text-to-speech synthesizer Other additions: access to Why2-Atlas “internals”, speech recognition repairs, etc.
Two Spoken Tutoring Corpora Human-Human Corpus –14 students –1 human tutor –128 physics problems (dialogues) –5948 student turns, 5505 tutor turns Computer-Human Corpus –20 students –ITSPOKE (Version1) tutor –100 physics problems (dialogues) –2445 student turns, 2967 tutor turns
Dialogue Acts Dialogue Acts represent intentions behind utterances –Both domain-independent and tutoring-specific tagsets »e.g., Graesser and Person, 1994; Graesser et al., 1995; Chi et al., 2001 –Used in prior studies of correlations with learning »e.g., tutor acts in AutoTutor (Jackson et al., 2004), dialogue acts in human tutoring (Chi et al., 2001) ITSPOKE Study –Student and tutor dialogue acts –Unigrams and bigrams of dialogue acts –Human and computer tutoring –Spoken input and output
Tagset (1): (Graesser and Person, 1994) Tutor and Student Question Acts Short Answer Question: basic quantitative relationships Long Answer Question: definition/interpretation of concepts Deep Answer Question: reasoning about causes/effects
Tagset (2): inspired by (Graesser et al., 1995) Tutor Feedback Acts Positive Feedback: overt positive response Negative Feedback: overt negative response Tutor State Acts Restatement: repetitions and rewordings Recap: restating earlier-established points Request/Directive: directions for argument Bottom Out: complete answer after problematic response Hint: partial answer after problematic response Expansion: novel details
Tagset (3): inspired by (Chi et al., 2001) Student Answer Acts Deep Answer: at least 2 concepts with reasoning Novel/Single Answer: one new concept Shallow Answer: one given concept Assertion: answers such as “I don’t know” Tutor and Student Non-Substantive Acts: do not contribute to physics discussion
Annotated Human-Human Excerpt
T: Which one will be faster? [Short Answer Question]
S: The feathers. [Novel/Single Answer]
T: The feathers - why? [Restatement, Deep Answer Question]
S: Because there’s less matter. [Deep Answer]
All turns in both corpora were manually coded for dialogue acts (Kappa > .6)
Correlations with Unigram Measures (student and tutor-centered analyses) For each student, and each student and tutor dialogue act tag, compute –Tag Total: number of turns containing the tag –Tag Percentage: (tag total) / (turn total) –Tag Ratio: (tag total) / (turns containing tag of that type) Correlate measures with posttest, after regressing out pretest
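The unigram measures and the pretest-controlled correlation described above can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the function names, the representation of each turn as a set of tags, and the use of a first-order partial correlation as the way to "regress out" pretest are all assumptions.

```python
from math import sqrt

def tag_measures(turns, tag, tags_of_same_type):
    """Compute Tag Total, Tag Percentage, and Tag Ratio for one student.

    turns: list of sets, one set of dialogue act tags per turn (assumed format).
    tags_of_same_type: set of all tags of the same type as `tag`
    (e.g., all Answer acts), used as the Tag Ratio denominator.
    """
    tag_total = sum(1 for t in turns if tag in t)
    turn_total = len(turns)
    type_total = sum(1 for t in turns if t & tags_of_same_type)
    return {
        "total": tag_total,
        "percentage": tag_total / turn_total if turn_total else 0.0,
        "ratio": tag_total / type_total if type_total else 0.0,
    }

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_corr(measure, posttest, pretest):
    """Correlation of measure with posttest, controlling for pretest."""
    r_mp = pearson(measure, posttest)
    r_mz = pearson(measure, pretest)
    r_pz = pearson(posttest, pretest)
    return (r_mp - r_mz * r_pz) / sqrt((1 - r_mz ** 2) * (1 - r_pz ** 2))

# Hypothetical student turns, using tag names from Tagset (3).
ANSWER_ACTS = {"Deep Answer", "Novel/Single Answer", "Shallow Answer", "Assertion"}
student_turns = [
    {"Deep Answer"},
    {"Shallow Answer"},
    {"Non-Substantive"},
    {"Deep Answer", "Assertion"},
]
print(tag_measures(student_turns, "Deep Answer", ANSWER_ACTS))
```

Each measure would be computed once per student, giving one value per student to pass to `partial_corr` alongside that student's pretest and posttest scores.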
Human-Computer Results (20 students)
[Table: Student Dialogue Acts, with columns Mean, R, p; cell values not recoverable]
–# Deep Answer
Human-Computer Results (continued)
[Table: Tutor Dialogue Acts, with columns Mean, R, p; only the means shown below are recoverable]
–# Deep Answer Question
–% Deep Answer Question (Mean: 6.27%)
–% Question Act (Mean: 76.89%)
–(Short Answer Question)/Question
–(Deep Answer Question)/Question
–# Positive Feedback
Human-Human Results (14 students)
[Table: Student Dialogue Acts, with columns Mean, R, p; cell values not recoverable]
–# Novel/Single Answer
–# Deep Answer
–(Novel/Single Answer)/Answer
–(Short Answer Question)/Question
–(Long Answer Question)/Question
Human-Human Results (continued)
[Table: Tutor Dialogue Acts, with columns Mean, R, p; only the mean shown below is recoverable]
–# Request/Directive
–% Request/Directive (Mean: 5.65%)
–# Restatement
–# Negative Feedback
Discussion Computer Tutoring: knowledge construction –Positive correlations »Student answers displaying reasoning »Tutor questions requiring reasoning Human Tutoring: more complex –Positive correlations »Student utterances introducing a new concept –Mostly negative correlations »Student attempts at deeper reasoning »Tutor attempts to direct the dialogue
Correlations with Bigram Measures (interaction-centered analyses) For each student, and each tag sequence containing both a tutor and a student dialogue act, compute –[Student Act_n - Tutor Act_n+1] Totals »all bigrams constructed by pairing each Student Dialogue Act in turn n with each Tutor Dialogue Act in turn n+1 –[Tutor Act_n - Student Act_n+1] Totals »all bigrams constructed by pairing each Tutor Dialogue Act in turn n with each Student Dialogue Act in turn n+1 Correlate measures with posttest, after regressing out pretest
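The bigram construction above can be sketched as follows. This is an illustrative assumption about the data layout, not the study's code: each turn is taken to be a (speaker, list of acts) pair, and `cross_speaker_bigrams` is a hypothetical name. The example turns reuse the annotated human-human excerpt.

```python
from collections import Counter

def cross_speaker_bigrams(turns):
    """Count bigrams pairing every act in turn n with every act in turn n+1.

    turns: ordered list of (speaker, [acts]) pairs (assumed format).
    Only adjacent turns by different speakers are paired, so each bigram
    holds one student act and one tutor act.
    """
    counts = Counter()
    for (spk_a, acts_a), (spk_b, acts_b) in zip(turns, turns[1:]):
        if spk_a == spk_b:
            continue  # skip same-speaker pairs; only cross-speaker bigrams count
        for a in acts_a:
            for b in acts_b:
                counts[((spk_a, a), (spk_b, b))] += 1
    return counts

excerpt = [
    ("Tutor", ["Short Answer Question"]),
    ("Student", ["Novel/Single Answer"]),
    ("Tutor", ["Restatement", "Deep Answer Question"]),
    ("Student", ["Deep Answer"]),
]

for bigram, total in cross_speaker_bigrams(excerpt).items():
    print(bigram, total)
```

Per-student totals for each bigram would then be correlated with posttest, controlling for pretest, in the same way as the unigram measures.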
Bigram Results Many bigrams incorporate, as either the first or second element, a dialogue act corresponding to one of the unigram results, e.g. –[Student Deep Answer – Tutor Deep Answer Question] –[Tutor Recap - Student Deep Answer] Other dialogue acts only correlate with learning as part of a larger dialogue pattern, e.g. – [Student Shallow Answer - Tutor Restatement] –[Tutor Restatement – Student Shallow Answer]
Discussion Computer Tutoring –n-grams seem able to capture effective learning patterns in this simpler corpus Human Tutoring –despite mostly negative correlations, students are learning! –suggests effective learning patterns are too complicated to be captured with n-grams
Current Directions “Correctness” annotation –Are more Deep Answers “incorrect” or “partially correct” in the human-human corpus? –Do correct answers positively correlate with learning? Beyond the turn level –Correlations with larger dialogue act patterns (e.g., trigrams, n-grams) –Computation and use of hierarchical discourse structure
Research Question Does the performance of a system’s speech recognizer and/or text-to-speech system relate to learning? –Speech recognition accuracy correlates with user satisfaction in non-tutoring systems (Litman and Pan, 2002; Walker et al., 2002) –The nature of a computer’s voice relates to learning and motivation in pedagogical agents (Baylor et al., 2003; Atkinson)
ASR Performance Measures (1): Rejections
ITSPOKE: Therefore, what is the magnitude of this gravitational force in the horizontal direction?
STUDENT: significant
ASR: significant (False Rejection)
ITSPOKE: Could you please repeat that?
STUDENT: great
ASR: crate (True Rejection)
ITSPOKE: I'm sorry, I'm having trouble understanding you. Please try again.
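One way to read the two rejection labels in the example above is sketched below. It assumes, based only on this example, that a rejected turn counts as a False Rejection when the recognizer's hypothesis matched what the student said, and as a True Rejection otherwise. The function name and the exact-match comparison are hypothetical.

```python
def classify_rejection(transcript, asr_hypothesis):
    """Label a rejected student turn as a true or false rejection.

    transcript: what the student actually said (human transcription).
    asr_hypothesis: what the recognizer produced before rejecting the turn.
    Assumption: a case-insensitive exact match means the rejection was
    unnecessary (false); any mismatch means it was warranted (true).
    """
    same = transcript.strip().lower() == asr_hypothesis.strip().lower()
    return "False Rejection" if same else "True Rejection"

print(classify_rejection("significant", "significant"))
print(classify_rejection("great", "crate"))
```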
ASR Performance Measures (2): “Transcription” Misrecognitions
ITSPOKE: Yeah. Does the packet have an acceleration? If yes, please specify its direction.
STUDENT: yes downward
ASR: is downward
word error rate: 50%
binary word error: True
ITSPOKE: How would you describe the vertical component of the packet's velocity? (e.g., decreasing, zero, etc.)
STUDENT: increasing
ASR: decreasing
word error rate: 100%
binary word error: True
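The word error rates in these two examples are consistent with the standard definition: word-level edit distance between the student's words and the ASR output, divided by the number of words the student said. The sketch below assumes that definition and whitespace tokenization; the function names are illustrative and this is not ITSPOKE's scoring code.

```python
def word_edit_distance(reference, hypothesis):
    """Minimum substitutions, insertions, and deletions between word lists."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """Edit distance divided by the number of reference words."""
    n = len(reference.split())
    if n == 0:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(reference, hypothesis) / n

def binary_word_error(reference, hypothesis):
    """True if the turn contains any word error at all."""
    return word_edit_distance(reference, hypothesis) > 0

print(word_error_rate("yes downward", "is downward"))   # 0.5
print(word_error_rate("increasing", "decreasing"))      # 1.0
```

The binary measure discards the size of the error, so both example turns count equally as misrecognized.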
ASR Performance Measures (3): “Semantic” Misrecognitions
ITSPOKE: Yeah. Does the packet have an acceleration? If yes, please specify its direction.
STUDENT: yes downward
NLU(student): downward
ASR: is downward
NLU(ASR): downward
word error rate: 50%
semantic error rate: 0%
binary word error: True
binary semantic error: False
ITSPOKE: How would you describe the vertical component of the packet's velocity? (e.g., decreasing, zero, etc.)
STUDENT: increasing
NLU(student): increase
ASR: decreasing
NLU(ASR): decrease
word error rate: 100%
semantic error rate: 100%
binary word error: True
binary semantic error: True
Correlations with Learning? Computed totals, percentages, and ratios of: –Rejections »False, True, Both –Misrecognitions »Transcription and Semantic »Word and Binary –ASR Problems (Rejections + Misrecognitions) »Transcription and Semantic »Word and Binary Found no significant correlations or trends!
ITSPOKE (Version2)
Two output conditions: Prerecorded Output (human voice) and Synthesized Output (text-to-speech)
ITSPOKE: Terrific. Let's try the original question again. If gravity is the only force acting on an object, will it be moving or staying still?
STUDENT: moving (ASR: moving)
ITSPOKE: Yes. Not only are the person, keys, and elevator moving, they have only gravitational forces acting on them. When an object is falling and has only gravitational force on it, it is said to be in what?
STUDENT: free fall (ASR: free fall)
New Computer Tutoring Experiment Same subject pool, physics problems, web interface, and experimental procedure as before, except –ITSPOKE (Version2) Pre-recorded voice condition –30 students (150 dialogues) Text-to-speech condition –29 students (145 dialogues)
Summary Many dialogue act correlations – positive correlations with deep reasoning and questioning in computer tutoring – correlations in human tutoring more complex – student, tutor, and interactive perspectives all useful No correlations with ASR problems Stay tuned … –New dialogue act patterns and “correctness” analysis –Pre-recorded versus text-to-speech
Acknowledgments
The ITSPOKE Group
–Staff: Scott Silliman
–Research Associates: Kate Forbes-Riley, Joel Tetreault, Alison Huettner (consultant)
–Graduate Students: Ai Hua, Beatriz Maeireizo, Amruta Purandare, Mihai Rotaru, Arthur Ward
Kurt VanLehn and the Why2 Team
Hypotheses Compared to typed dialogues, spoken interactions will yield better learning gains, and will be more efficient and natural Different student behaviors will correlate with learning in spoken versus typed dialogues, and will be elicited by different tutor actions Findings in human-human and human-computer dialogues will vary as a function of system performance
Motivation Working hypothesis regarding learning gains –Human Dialogue > Computer Dialogue > Text Most human tutoring involves face-to-face spoken interaction, while most computer dialogue tutors are text-based –Evens et al., 2001; Zinn et al., 2002; VanLehn et al., 2002; Aleven et al., 2001 Can the effectiveness of dialogue tutorial systems be further increased by using spoken interactions?
Potential Benefits of Speech Self-explanation correlates with learning and occurs more in speech – Hausmann and Chi, 2002 Speech contains prosodic information, providing new sources of information for dialogue adaptation –Forbes-Riley and Litman, 2004 Spoken computational environments may prime a more social interpretation that enhances learning –Moreno et al., 2001; Graesser et al., 2003 Potential for hands-free interaction –Smith, 1992; Aist et al., 2003
Spoken Tutorial Dialogue Systems Recent tutoring systems have begun to add spoken language capabilities –Rickel and Johnson, 2000; Graesser et al., 2001; Mostow and Aist, 2001; Aist et al., 2003; Fry et al., 2001; Schultz et al., 2003 However, there has been little empirical analysis of the learning ramifications of using speech
Architecture
[Figure: ITSPOKE architecture diagram. Labeled components: www browser, www server, ITSpoke (Text Manager, Spoken Dialogue Manager), Cepstral, Speech Analysis (Sphinx), Essay Analysis (Carmel, Tacitus-lite+), Content Dialogue Manager (Ape, Carmel), Why2. Other labels in the diagram: java, essay, dialogue, student text (xml), tutor turn (xml), html, xml, text, dialogue repair goals, tutorial goals.]
Speech Recognition: Sphinx2 (CMU) Probabilistic language models for different dialogue states Initial training data –typed student utterances from Why2-Atlas corpora Later training data –spoken utterances obtained during development and pilot testing of ITSPOKE Total vocabulary – 1240 unique words “Semantic Accuracy” Rate = 92.4%
Speech Synthesis: Cepstral Commercial outgrowth of Festival text-to-speech synthesizer (Edinburgh, CMU) Required additional processing of Why2-Atlas prompts (e.g., f=m*a)
Common Experimental Aspects Students take a physics pretest Students read background material Students use web interface to work through up to 10 problems with either a computer or a human tutor Students take a posttest –40 multiple choice questions, isomorphic to pretest
ITSPOKE Corpora Comparison

Human-Human (…1.3 minutes into session…)
Student: Can I ask you questions?
Tutor: Yes
Student: Is that ok?
Tutor: Mm-hm
Student: Um from what we learned from the last problem they're gonna have the same- the keys and you have the same acceleration right?
Tutor: Yes
Student: So that means they'll both hit the bottom at the same time. But I don't understand where- I don't understand if you're in the elevator-
Tutor: You see
Student: Where are you going to-?
Tutor: The uh let me uh the key uh- the person holds the key in front of-
Student: Their face yeah-
Tutor: Uh his or her face and then lets it uh uh it is let go so the the the question relates to the relative position of the key and the face uh-
Student: So the key and the face-
Tutor: Uh how will they- as they go- as they both fall what will be their relative position? That is the question.
Student: (sigh)

Human-Computer (…3.5 minutes into session…)
Tutor: I didn't understand. What force acts on all objects within close proximity to earth?
Student: uh the force of gravity
Tutor: Fine. Besides the Earth's gravitational force, are there any other forces acting on the person, keys, and elevator after the release?
Student: no
Tutor: Yep. Thus, keep in mind that after the release, the only force on the person, keys, and elevator is the force of gravity. If gravity is the only force acting on an object, will it be moving or staying still?
Student: it will be moving
Tutor: Fine. Not only are the person, keys, and elevator moving, they have only gravitational forces on them. When an object is falling and has only gravitational force on it, it is said to be in what?
Student: freefall
Tutor: Good. Thus, after the release the only force on the keys and man is the downward force of earth's gravity, so they are in freefall. Before the release, what is the relationship between the man's velocity and the keys' velocity?
Student: they're equal
Tutor: Correct. We know that after the release, the keys, the person, and the elevator are in freefall. Is freefall acceleration near earth a constant? Or does it vary depending upon which object is in freefall?
Student: it's a constant
Learning Correlations after Controlling for Pretest
[Table: Dependent Measure, with columns R and p for Human Spoken (14) and R and p for Human Typed (20); cell values not recoverable]
–Ave. Stud. Words/Turn
–Intercept: Stud. Words/Turn
–Ave. Tut. Words/Turn
Learning Correlations after Controlling for Pretest
[Table: Dependent Measure, with columns R and p for Spoken (ITSPOKE) and R and p for Typed (Why2-Atlas); cell values not recoverable]
–Tot. Stud. Words
–Tot. Subdialogues/KCD
ASR-Learning Correlations Key ASR MIS: ASR Misrecognition SEM MIS: Semantic Misrecognition TRUE REJ: True Rejection FALSE REJ: False Rejection REJ: Total Rejections (true or false) ASR PROB: ASR Misrecognition or Rejection SEM PROB: Semantic Misrecognition TIMEOUT: Timeout
Correlation Results (20 students)
[Table: SP Measure, with columns Mean, R and p for Learning, and R and p for Time; cell values not recoverable]
–# ASR MIS
–# SEM MIS
–# TRUE REJ
–# FALSE REJ
–# REJ
–# ASR PROB
–# SEM PROB
–# TIMEOUT
Current Directions Online dialogue act annotation during computer tutoring –Tutor acts can be authored –Student acts need to be recognized “Correctness” annotation –Are more Deep Answers “incorrect” or “partially correct” in the human-human corpus? –Do correct answers positively correlate with learning? Beyond the turn level –Learning correlations with dialogue act patterns (e.g., bigrams) –Computation and use of discourse structure
Primary Research Question How does speech-based dialogue interaction impact the effectiveness of tutoring systems for student learning?
Spoken Versus Typed Human and Computer Dialogue Tutoring (ITS 2004) Human Tutoring: spoken dialogue yielded learning and efficiency gains –Many differences in superficial dialogue characteristics Computer Tutoring: spoken dialogue made less difference Learning Correlations: few results –Different dialogue characteristics correlate in human versus computer, and in spoken versus typed
Motivation An empirical basis for authoring (or learning) optimal dialogue behaviors in spoken tutorial dialogue systems Previous Approach: learning correlations with superficial dialogue characteristics –Easy to compute automatically and in real-time, but… –Correlations in the literature did not generalize to our spoken or human-computer corpora –Results were difficult to interpret »e.g., do longer student turns contain more explanations? Current Approach: –learning correlations with measures based on deeper dialogue codings
Current Empirical Studies Does learning correlate with measures of automatic speech recognition (ASR) performance? –ITSPOKE (Version 1) corpus Does the use of pre-recorded audio rather than text-to-speech improve learning and other measures? –ITSPOKE (Version 2) corpus Can speech recognition “goats” be screened in advance?
User Survey: after (Baylor et al., 2003), (Walker et al., 2002)
–It was easy to learn from the tutor
–The tutor did not interfere with my understanding of the content
–The tutor believed I was knowledgeable
–The tutor was useful
–The tutor was effective in conveying ideas
–The tutor was precise in providing advice
–The tutor helped me to concentrate
–It was easy to understand the tutor
–I knew what I could say or do at each point in the conversations with the tutor
–The tutor worked the way I expected it to
–Based on my experience using the tutor to learn physics, I would like to use such a tutor regularly
Response options: almost always, often, sometimes, rarely, almost never
Summary Dialogue Act Annotation and Learning Correlations –Human tutoring and ITSPOKE (Version 1) corpora Speech Recognition and Learning Correlations –ITSPOKE (Version 1) corpus ITSPOKE (Version 2) and New Corpus Collection –Pre-recorded versus synthesized speech (in progress)