Evaluation of Tutoring Systems. Kurt VanLehn, PSLC Summer School 2006

Outline
- Formative evaluations: How to improve the tutor?
- Summative evaluations: How well does the tutor work?
- Parametric evaluations: Why does the tutor work?

User interface testing (usually the first formative evaluation)
- When
  - Early, before the rest of the tutor is built
- Who
  - Students
  - Instructors
- How
  - Talk aloud with headset mikes and Camtasia
  - Sit with the user and ask them about every frown
  - Other?
- Also great for finding good problems to assign

The Andes user interface

Pyrenees' GUI for the same problem

Andes’ GUI for defining a velocity vector

Pyrenees' method
T: What would you like to define?  a) displacement  b) velocity  c) acceleration
S: b
T: Velocity of…  a) Hailstone  b) Cloud  c) Earth
S: a
T: Type?  a) instantaneous  b) average
S: a
T: Time point?  a) T0  b) T1
S: a
T: Orientation?
S: 270 deg
T: Name?
S: vi
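To make the contrast with Andes' direct-manipulation GUI concrete, here is a toy Python sketch of the kind of menu-driven prompt loop the dialogue above illustrates. It is not Pyrenees' actual implementation; every name in it is made up for illustration.

    # A toy sketch of a menu-driven definition dialogue, not Pyrenees' real code.
    def ask(prompt, options=None):
        """Ask one tutor question; return the chosen option or a free-form answer."""
        if options:
            menu = "  ".join(f"{chr(ord('a') + i)}) {opt}" for i, opt in enumerate(options))
            answer = input(f"T: {prompt}  {menu}\nS: ").strip().lower()
            return options[ord(answer[0]) - ord("a")]
        return input(f"T: {prompt}\nS: ").strip()

    quantity    = ask("What would you like to define?", ["displacement", "velocity", "acceleration"])
    body        = ask(f"{quantity.capitalize()} of...", ["Hailstone", "Cloud", "Earth"])
    kind        = ask("Type?", ["instantaneous", "average"])
    time_point  = ask("Time point?", ["T0", "T1"])
    orientation = ask("Orientation?")
    name        = ask("Name?")
    print(f"Defined {name}: {kind} {quantity} of {body} at {time_point}, {orientation}")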

Wizard of Oz (a formative evaluation)
- Format
  - A human in the next room watches a copy of the student's screen
  - The human responds when the student presses the Hint button or makes an error
  - The human must type very fast! Paste from stock answers?
- User interface evaluation
  - Does the human have enough information?
  - Can the human intervene early enough?
- Knowledge elicitation
  - What tutoring tactics were used?
  - What conditions determine when each tactic is used?

Snapshot critiques (a late formative evaluation)
- Procedure (a sampling sketch follows below)
  - The ITS keeps a log file
  - Afterwards, randomly select events in the log where the student got help
  - Print the context leading up to each help message
  - Expert tutors write their own help on the paper
- How frequently does the expert's help match the ITS's?
  - How frequently do two experts' help messages match?
- Add to the ITS the help that the experts agree on
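A minimal sketch of the random-sampling step, assuming a plain-text log with one event per line and a hypothetical "HELP_GIVEN" marker on help events; the file name and marker are illustrative, not from the slides.

    import random

    with open("tutor_log.txt") as f:          # hypothetical log file
        lines = f.readlines()

    # Find every event where the tutor gave help, then sample up to 20 of them.
    help_indices = [i for i, line in enumerate(lines) if "HELP_GIVEN" in line]
    sample = random.sample(help_indices, k=min(20, len(help_indices)))

    for i in sorted(sample):
        context = lines[max(0, i - 10):i + 1]  # events leading up to the help message
        print("".join(context))
        print("Expert tutor's help: ____________________\n")  # space for a handwritten critique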

Outline
- Formative evaluations: How to improve the tutor?
- Summative evaluations: How well does the tutor work? (next)
- Parametric evaluations: Why does the tutor work?

Summative evaluations
- Question: Is the tutor more effective than a control?
- Typical design
  - The experimental group uses the tutor
  - The control group learns via the "traditional" method
  - Pre- and post-tests
- Data analysis
  - Did the tutor group "do better" than the control?

Three feasible designs. In each, one factor is forced to be equal and the other two may vary.

                    Like homework      Like seatwork      Mastery learning
Training problems   Tutor = control    Tutor > control?   Tutor < control?
Training duration   Tutor < control?   Tutor = control    Tutor < control?
Post-test score     Tutor > control?   Tutor > control?   Tutor = control

Control conditions
- Typical control conditions
  - Existing classroom instruction
  - Textbook & exercise problems (with feedback?)
  - Another tutoring system
  - Human tutoring
    - A null result does not "prove" that the computer tutor equals the human tutor
- Define your control condition early
  - It drives the design of the tutor

Assessments (tests)
- Your tests
- The instructor's normal tests
- Standardized tests

When to test
- Pre-test
- Immediate post-test
- Delayed post-test
  - Measures retention
- Learning (pre-test, training, post-test)
  - Measures acceleration of future learning (also called preparation for learning)

Example of acceleration of future learning (Min Chi & VanLehn, in prep.)
- Design
  - Training on probability, then physics
  - During probability only:
    - Half the students were taught an explicit strategy
    - Half were not taught a strategy (normal instruction)
[Figure: pre/post training score charts for probability and for physics, contrasting ordinary transfer with preparation for learning]

Content of post-tests
- Some problems from the pre-test
  - Determines whether any learning occurred at all
- Some problems similar to the training problems
  - Measures near transfer
- Some problems dissimilar to the training problems
  - Measures far transfer
- Use your cognitive task analysis!

Bad tests happen, so pilot, pilot, pilot!
- Blatant mistakes (show up in the means)
  - Too hard (floor effect)
  - Too easy (ceiling effect)
  - Too long (mental attrition)
- Subtle mistakes (check the variance)
  - The test doesn't cover some training content
  - The test over-covers some training content
  - The test is too sensitive to background knowledge
    - e.g., reading, basic math

Did the conditions differ?
- My advice: always do ANCOVAs (a sketch follows below)
  - Condition is the independent variable
  - Post-test score is the dependent variable
  - Pre-test score is the covariate
- Others' advice:
  - Do ANOVAs on gain scores
  - If the pre-test scores are not significantly different, do ANOVAs on the post-test scores
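A minimal sketch of that ANCOVA in Python using pandas and statsmodels; the file name and the column names ('condition', 'pre', 'post') are illustrative assumptions, not part of the slides.

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.read_csv("scores.csv")          # hypothetical: one row per student

    # Post-test score is the dependent variable, condition is the factor of
    # interest, and pre-test score is the covariate.
    model = smf.ols("post ~ C(condition) + pre", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))  # F-test for condition, adjusted for the pre-test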

Effect sizes: Cohen's d
- Should be based on post-test scores (a sketch follows below):
  [mean(experimental) - mean(control)] / standard_deviation(control)
- Common but misleading usage:
  [mean(post-test) - mean(pre-test)] / standard_deviation(pre-test)
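The recommended version in a few lines of Python; the score lists are made-up numbers purely for illustration.

    import statistics

    def cohens_d(experimental, control):
        """Effect size using the control group's standard deviation, as on the slide."""
        return ((statistics.mean(experimental) - statistics.mean(control))
                / statistics.stdev(control))

    print(cohens_d(experimental=[78, 85, 90, 72, 88],
                   control=[70, 74, 68, 80, 72]))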

Error bars help visualize results

Scatter plots help visualize results
[Figure: scatter plot of post-test score vs. pre-test score (or GPA), with separate points and regression lines for the Andes and Control groups]

If the slopes were different, there would be an aptitude-treatment interaction (ATI)
[Figure: the same scatter plot of post-test score vs. pre-test score (or GPA), with Andes and Control regression lines of different slopes]
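A minimal sketch of such a plot with matplotlib; the score arrays below are made-up illustrative numbers, not data from any study.

    import numpy as np
    import matplotlib.pyplot as plt

    andes_pre,   andes_post   = [40, 55, 60, 72, 85], [62, 70, 75, 83, 90]
    control_pre, control_post = [42, 50, 63, 70, 88], [48, 55, 66, 72, 86]

    def plot_condition(pre, post, label):
        plt.scatter(pre, post, label=label)
        slope, intercept = np.polyfit(pre, post, 1)   # least-squares regression line
        xs = np.linspace(min(pre), max(pre), 50)
        plt.plot(xs, slope * xs + intercept)

    plot_condition(andes_pre, andes_post, "Andes")
    plot_condition(control_pre, control_post, "Control")
    plt.xlabel("Pre-test score (or GPA)")
    plt.ylabel("Post-test score")
    plt.legend()
    plt.show()
    # Roughly parallel lines suggest no ATI; clearly different slopes suggest an
    # aptitude-treatment interaction.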

Which students did the tutor help?
- Divide subjects into high vs. low pre-test groups
- Plot the gains for each group
- Called an "aptitude-treatment interaction" (ATI)
- Needs more subjects

Which topics did the tutor teach best?
- Divide the test items (e.g., into deep vs. shallow knowledge)
- Plot the gains for each group of items
- Needs more items

Log file analyses
- Did students use the tutor as expected?
  - Using help too much (help abusers)
  - Using help too little (help refusers)
  - Copying a solution from someone else (exclude them?)
- Correlations with gain (a sketch follows below)
  - Errors corrected with or without help
  - Proportion of bottom-out hints
  - Time spent thinking before/after a hint
- Learning curves for productions
  - If the curve is not smooth, is it really a single production?
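One such correlation, sketched in Python with pandas under assumed log and test-score formats; the file names, event labels, and column names are hypothetical.

    import pandas as pd

    log   = pd.read_csv("tutor_log.csv")     # hypothetical: one row per tutor event
    tests = pd.read_csv("test_scores.csv")   # hypothetical: one row per student

    # Proportion of each student's hint requests that reached the bottom-out hint.
    hints = log[log["event"].isin(["hint", "bottom_out_hint"])]
    bottom_out_rate = (hints.groupby("student")["event"]
                            .apply(lambda e: (e == "bottom_out_hint").mean()))

    tests = tests.set_index("student")
    tests["gain"] = tests["post"] - tests["pre"]
    print(tests["gain"].corr(bottom_out_rate))   # Pearson correlation with learning gain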

Practical issues
- All experiments
  - Human subjects institutional review board (IRB)
- Lab experiments
  - Recruiting subjects over a whole semester; their knowledge varies
  - Attrition: students quit before they finish
- Field (classroom) experiments
  - Access to classrooms and teachers
  - Instructors' enthusiasm, tech-savviness, and agreement with the pedagogy
  - Ethics of requiring use or non-use of the tutor in high-stakes classes
  - Their tests vs. your tests
- Web experiments
  - Ensuring random assignment in the face of attrition

Outline
- Formative evaluations: How to improve the tutor?
- Summative evaluations: How well does the tutor work?
- Parametric evaluations: Why does the tutor work? (next)

Parametric evaluations: Why does the tutor work?
- Hypothesize sources of benefit, such as:
  - Explication of a hidden problem-solving skill
  - Novel reification (GUI)
    - e.g., showing goals on the screen
  - Novel sequence of exercises/topics
    - e.g., story problems first, then equations
  - Immediate feedback and help
- Plan an experiment or a sequence of experiments
  - Don't try to do all 2^N combinations in one study
  - Vary only 1 or 2 factors

Two types of parametric experiments
- Remove a putative benefit from the tutor
  - Two conditions:
    1. Tutor
    2. Tutor minus a benefit (e.g., immediate feedback)
- Add a putative benefit to the control
  - Three conditions:
    1. Control
    2. Control plus a benefit (e.g., explication of a hidden skill)
    3. Tutor

In vivo experimentation
- Requires high internal validity
  - Helps us understand human learning
  - All but a few factors are controlled
  - Summative evaluations of tutoring usually vary too many factors
- Often done in the context of tutoring systems
  - Parametric studies
  - Offline studies, where the tutoring system serves as the pre/post test

Evaluations of tutoring systems
- Formative evaluations: How to improve the tutor?
  - Pilot test the user interface alone
  - Wizard of Oz
  - Hybrids
- Summative evaluations: How well does the tutor work?
  - Two conditions: with and without the tutor
  - Many supplementary analyses are possible
- Parametric evaluations: Why does the tutor work?
  - Compare different versions of the tutor
  - Try adding the tutor's putative benefits to the control