
1 Evaluation of Tutoring Systems Kurt VanLehn PSLC Summer School 2006

2 Outline
- Formative evaluations: How to improve the tutor?
- Summative evaluations: How well does the tutor work?
- Parametric evaluations: Why does the tutor work?

3 User interface testing (usually the first formative eval)
- When
  - Early, before the rest of the tutor is built
- Who
  - Students
  - Instructors
- How
  - Talk aloud with headset mikes and Camtasia
  - Sit with the user and ask them about every frown
  - Other?
- Also great for finding good problems to assign

4 The Andes user interface

5 Pyrenee’s GUI for same problem

6 Andes’ GUI for defining a velocity vector

7 Pyrenees' method
T: What would you like to define? a) displacement b) velocity c) acceleration
S: b
T: Velocity of... a) Hailstone b) Cloud c) Earth
S: a
T: Type? a) instantaneous b) average
S: a
T: Time point? a) T0 b) T1
S: a
T: Orientation?
S: 270 deg
T: Name?
S: vi
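
The turn-taking above can be scripted as a simple sequence of forced-choice prompts. A minimal sketch follows; the function and data-structure names are illustrative, not Pyrenees' actual code.

```python
# Minimal sketch of a menu-driven definition dialogue in the style of the
# Pyrenees transcript above. Prompt texts mirror the slide; everything else
# (names, structure) is illustrative.

STEPS = [
    ("quantity",    "What would you like to define?", ["displacement", "velocity", "acceleration"]),
    ("body",        "Velocity of...",                 ["Hailstone", "Cloud", "Earth"]),
    ("type",        "Type?",                          ["instantaneous", "average"]),
    ("time point",  "Time point?",                    ["T0", "T1"]),
    ("orientation", "Orientation (degrees)?",         None),  # free-form numeric entry
    ("name",        "Name?",                          None),  # free-form text entry
]

def define_quantity():
    """Walk the student through one definition, one forced-choice step at a time."""
    answers = {}
    for slot, prompt, options in STEPS:
        if options:
            labels = " ".join(f"{chr(97 + i)}) {opt}" for i, opt in enumerate(options))
            choice = input(f"T: {prompt} {labels}\nS: ").strip().lower()
            answers[slot] = options[ord(choice) - ord("a")]
        else:
            answers[slot] = input(f"T: {prompt}\nS: ").strip()
    return answers

if __name__ == "__main__":
    print(define_quantity())
```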

8 Wizard of Oz (a formative evaluation)
- Format
  - Human in the next room watches a copy of the screen
  - Human responds when the student presses the Hint button or makes an error
  - Human must type very fast! Paste from stock answers?
- User interface evaluation
  - Does the human have enough information?
  - Can the human intervene early enough?
- Knowledge elicitation
  - What tutoring tactics were used?
  - What conditions determine when each tactic is used?

9 Snapshot critiques (a late formative evaluation)
- Procedure
  - ITS keeps a log file
  - Afterwards, randomly select events in the log where the student got help
  - Print the context leading up to the help message
  - Expert tutors write their help on the paper
- How frequently does the expert's help match the ITS's?
  - How frequently do two experts' help match?
- Add to the ITS the help that experts agree on
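
A minimal sketch of the sampling step, assuming a hypothetical log format (one JSON event per line with "type", "time", and "text" fields):

```python
# Sketch of the snapshot-critique sampling step: pick random help events from an
# ITS log and print the context leading up to each one, with space for an expert's
# handwritten critique. The log format is assumed for illustration.
import json
import random

def sample_help_events(log_path, n_samples=20, context_size=5, seed=0):
    with open(log_path) as f:
        events = [json.loads(line) for line in f]
    help_indices = [i for i, e in enumerate(events) if e["type"] == "help_given"]
    random.seed(seed)
    chosen = random.sample(help_indices, min(n_samples, len(help_indices)))
    for i in sorted(chosen):
        print("=" * 60)
        for e in events[max(0, i - context_size): i]:  # context leading up to the help
            print(f"  {e['time']}  {e['type']}: {e.get('text', '')}")
        print(f"ITS HELP -> {events[i].get('text', '')}")
        print("Expert tutor's help: ____________________________")

# sample_help_events("andes_session.log")  # hypothetical log file name
```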

10 Outline
- Formative evaluations: How to improve the tutor?
- Summative evaluations: How well does the tutor work? (Next)
- Parametric evaluations: Why does the tutor work?

11 Summative evaluations
- Question: Is the tutor more effective than a control?
- Typical design
  - Experimental group uses the tutor
  - Control group learns via the "traditional" method
  - Pre & post tests
- Data analysis
  - Did the tutor group "do better" than the control?

12 Three feasible designs. One factor is forced to be equal; two factors vary.

                     Like homework      Like seatwork      Mastery learning
  Training problems  Tutor = control    Tutor > control?   Tutor < control?
  Training duration  Tutor < control?   Tutor = control    Tutor < control?
  Post-test score    Tutor > control?   Tutor > control?   Tutor = control

13 Control conditions
- Typical control conditions
  - Existing classroom instruction
  - Textbook & exercise problems (feedback?)
  - Another tutoring system
  - Human tutoring
    - A null result does not "prove" computer tutor = human tutor
- Define your control condition early
  - It drives the design of the tutor

14 Assessments (tests)
- Your tests
- Instructor's normal tests
- Standardized tests

15 When to test
- Pre-test
- Immediate post-test
- Delayed post-test
  - Measures retention
- Learning (pre-test, training, post-test)
  - Measures acceleration of future learning (also called preparation for learning)

16 Example of acceleration of future learning (Min Chi & VanLehn, in prep.)
- Design
  - Training on probability, then physics
  - During probability only:
    - Half the students taught an explicit strategy
    - Half not taught a strategy (normal instruction)
- [Figure: pre/post training scores for probability and for physics, labeled "Preparation for learning" and "Ordinary transfer"]

17 Content of post-tests
- Some problems from the pre-test
  - Determines if any learning occurred at all
- Some problems similar to training problems
  - Measures near transfer
- Some problems dissimilar to training problems
  - Measures far transfer
- Use your cognitive task analysis!

18 Bad tests happen, so pilot, pilot, pilot!
- Blatant mistakes (show up in the means)
  - Too hard (floor effect)
  - Too easy (ceiling effect)
  - Too long (mental attrition)
- Subtle mistakes (check the variance)
  - Test doesn't cover some training content
  - Test over-covers some training content
  - Test is too sensitive to background knowledge
    - e.g., reading, basic math
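
A quick pilot-screening sketch along these lines; the thresholds are illustrative, not standard cutoffs:

```python
# Quick pilot-test screening: flag floor/ceiling effects from item means and
# near-zero-variance items. Thresholds are illustrative only.
import numpy as np

def screen_pilot_items(scores, floor=0.15, ceiling=0.85, min_var=0.01):
    """scores: (students x items) array of item scores scaled to [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    means = scores.mean(axis=0)
    variances = scores.var(axis=0)
    for j, (m, v) in enumerate(zip(means, variances)):
        flags = []
        if m < floor:
            flags.append("too hard (floor effect)")
        if m > ceiling:
            flags.append("too easy (ceiling effect)")
        if v < min_var:
            flags.append("almost no variance")
        if flags:
            print(f"item {j}: mean={m:.2f}, var={v:.2f} -> {', '.join(flags)}")

# Example: 5 students x 4 items (toy data)
screen_pilot_items([[1, 1, 0.2, 1], [1, 1, 0.1, 1], [0.9, 1, 0.0, 1],
                    [1, 0.8, 0.2, 1], [1, 1, 0.1, 1]])
```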

19 Did the conditions differ?
- My advice: always do ANCOVAs
  - Condition is the independent variable
  - Post-test score is the dependent variable
  - Pre-test score is the covariate
- Others' advice:
  - Do ANOVAs on gains
  - If pre-test scores are not significantly different, do ANOVAs on post-test scores
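
A minimal ANCOVA sketch with statsmodels, assuming illustrative column names (pre, post, condition) and toy data:

```python
# Minimal ANCOVA sketch: condition is the independent variable, post-test score
# the dependent variable, pre-test score the covariate. Data are toy numbers.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "pre":       [40, 55, 62, 48, 51, 58, 45, 60],
    "post":      [62, 70, 78, 66, 60, 64, 55, 72],
    "condition": ["tutor"] * 4 + ["control"] * 4,
})

model = smf.ols("post ~ C(condition) + pre", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # condition effect adjusted for pre-test
print(model.params)                     # adjusted difference between conditions
```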

20 Effect sizes: Cohen's d
- Should be based on post-test scores:
  [mean(experimental) - mean(control)] / standard_deviation(control)
- Common but misleading usage:
  [mean(post-test) - mean(pre-test)] / standard_deviation(pre-test)
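
The recommended formula, translated directly into code (the numbers are toy values for illustration only):

```python
# Cohen's d as recommended above: difference of post-test means divided by the
# control group's standard deviation (sample SD used here).
import numpy as np

def cohens_d(experimental_post, control_post):
    experimental_post = np.asarray(experimental_post, dtype=float)
    control_post = np.asarray(control_post, dtype=float)
    return (experimental_post.mean() - control_post.mean()) / control_post.std(ddof=1)

print(cohens_d([62, 70, 78, 66], [60, 64, 55, 72]))  # toy data
```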

21 Error bars help visualize results

22 Scatter plots help visualize results
[Figure: scatter plot of post-test score vs. pre-test score (or GPA), with separate points for Andes and Control]

23 If slopes were different, would have aptitude-treatment interaction (ATI)
[Figure: scatter plot of post-test score vs. pre-test score (or GPA), with regression lines for Andes and Control]
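
One way to test whether the slopes differ is to add a condition x pre-test interaction term to the ANCOVA model; a sketch, reusing the toy data layout from the ANCOVA example above:

```python
# Sketch of an aptitude-treatment interaction test: a significant
# condition x pre-test interaction means the two regression slopes differ.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "pre":       [40, 55, 62, 48, 51, 58, 45, 60],
    "post":      [62, 70, 78, 66, 60, 64, 55, 72],
    "condition": ["tutor"] * 4 + ["control"] * 4,
})

ati_model = smf.ols("post ~ C(condition) * pre", data=df).fit()
print(ati_model.summary())  # inspect the C(condition):pre coefficient and its p-value
```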

24 Which students did the tutor help?
- Divide subjects into high/low pre-test
- Plot gains
- Called "aptitude-treatment interaction" (ATI)
- Need more subjects

25 Which topics did the tutor teach best?
- Divide test items (e.g., into deep/shallow knowledge)
- Plot gains
- Need more items

26 Log file analyses
- Did students use the tutor as expected?
  - Using help too much (help abusers)
  - Using help too little (help refusers)
  - Copying a solution from someone else (exclude?)
- Correlations with gain
  - Errors corrected with or without help
  - Proportion of bottom-out hints
  - Time spent thinking before/after a hint
- Learning curves for productions
  - If not a smooth curve, is it really a single production?
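
A sketch of one of the listed correlations, assuming the log parser has already produced per-student bottom-out-hint proportions and gains (the dictionaries and numbers below are placeholders):

```python
# Correlate each student's bottom-out-hint proportion with their pre-to-post gain.
# A negative r would suggest that help abuse goes with smaller gains.
from scipy.stats import pearsonr

bottom_out_proportion = {"s1": 0.10, "s2": 0.45, "s3": 0.05, "s4": 0.30, "s5": 0.20}
gain = {"s1": 18, "s2": 4, "s3": 22, "s4": 9, "s5": 15}  # post-test minus pre-test

students = sorted(bottom_out_proportion)
r, p = pearsonr([bottom_out_proportion[s] for s in students],
                [gain[s] for s in students])
print(f"r = {r:.2f}, p = {p:.3f}")
```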

27 Practical issues
- All experiments
  - Human subjects institutional review board (IRB)
- Lab experiments
  - Recruiting subjects over a whole semester; knowledge varies
  - Attrition: students quit before they finish
- Field (classroom) experiments
  - Access to classrooms and teachers
  - Instructors' enthusiasm, tech-savvy, agreement with the pedagogy
  - Ethics of requiring use/non-use of the tutor in high-stakes classes
  - Their tests vs. your tests
- Web experiments
  - Ensuring random assignment vs. attrition

28 Outline
- Formative evaluations: How to improve the tutor?
- Summative evaluations: How well does the tutor work?
- Parametric evaluations: Why does the tutor work? (Next)

29 Parametric evaluations: Why does the tutor work?
- Hypothesize sources of benefit, such as
  - Explication of a hidden problem-solving skill
  - Novel reification (GUI)
    - E.g., showing goals on the screen
  - Novel sequence of exercises/topics
    - E.g., story problems first, then equations
  - Immediate feedback & help
- Plan an experiment or a sequence of experiments
  - Don't try to do all 2^N combinations in one study
  - Vary only 1 or 2 factors

30 Two types of parametric experiments
- Removing a putative benefit from the tutor
  - Two conditions:
    1. Tutor
    2. Tutor minus a benefit (e.g., immediate feedback)
- Adding a putative benefit to the control
  - Three conditions:
    1. Control
    2. Control plus a benefit (e.g., explication of a hidden skill)
    3. Tutor

31 In vivo experimentation
- High internal validity required
  - Helps us understand human learning
  - All but a few factors are controlled
  - Summative evaluations of tutoring usually vary too many
- Often done in the context of tutoring systems
  - Parametric
  - Offline, but the tutoring system serves as the pre/post test

32 Evaluations of tutoring systems
- Formative evaluations: How to improve the tutor?
  - Pilot test the user interface alone
  - Wizard of Oz
  - Hybrids
- Summative evaluations: How well does the tutor work?
  - 2 conditions: with and without the tutor
  - Many supplementary analyses are possible
- Parametric evaluations: Why does the tutor work?
  - Compare different versions of the tutor
  - Try putative benefits of the tutor with the control

