Ryan S.J.d. Baker Adam B. Goldstein Neil T. Heffernan Detecting the Moment of Learning.

Presentation transcript:

Ryan S.J.d. Baker Adam B. Goldstein Neil T. Heffernan Detecting the Moment of Learning

Talk Outline Introduction Data P(J) model –Labeling Process –Features –ML Procedure –Results Spikiness Models Conclusions

In recent years… There has been work towards developing increasingly accurate models that can predict whether a student has learned a skill by a given point in time [Corbett & Anderson, 1995; Martin & VanLehn, 1995; Shute, 1995; Conati et al, 2002; Beck et al, 2007, 2008; Pardos et al, 2008; Baker et al, 2008, 2010; Pavlik et al, 2009]

E.g. W R W W R W W R R W R R R The student has an 84% chance of now knowing the skill
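For readers who have not seen it, a minimal sketch of the standard Bayesian Knowledge Tracing update that produces this kind of estimate follows; the parameter values are illustrative assumptions, not values fit to this dataset.

```python
# Minimal Bayesian Knowledge Tracing (BKT) sketch.
# The parameters p_L0, p_T, p_G, p_S below are illustrative assumptions.
def bkt_p_known(observations, p_L0=0.2, p_T=0.1, p_G=0.2, p_S=0.1):
    """Return P(Ln) after a sequence of responses ('R' = right, 'W' = wrong)."""
    p_L = p_L0
    for obs in observations:
        if obs == 'R':  # correct response: Bayes' rule update with slip/guess
            p_L_obs = p_L * (1 - p_S) / (p_L * (1 - p_S) + (1 - p_L) * p_G)
        else:           # incorrect response
            p_L_obs = p_L * p_S / (p_L * p_S + (1 - p_L) * (1 - p_G))
        # Allow for the chance the skill was learned at this opportunity
        p_L = p_L_obs + (1 - p_L_obs) * p_T
    return p_L

# Sequence from the slide; the exact value depends on the chosen parameters
print(bkt_p_known("WRWWRWWRRWRRR"))
```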

In this paper… We go a step further, and try to assess not just –Whether a student knows the skill But also –When the student learned it

E.g. The student probably learned the skill at W R W W R W W R R W R R R

Why is this useful? Better understand the conditions and antecedents of learning May be possible to change style of practice after these inflection points, from focusing on learning skill to focusing on gaining fluency –Even if we’re just catching an inflection point in the strength of association rather than an actual “eureka” moment, this still might be relevant and useful

How do we do it? Very much like the models that detected contextual probability of guessing and slipping (Baker, Corbett, & Aleven, 2008)

How do we do it? We take an action, and the probability the student knows the skill at that point, according to Bayesian Knowledge Tracing (Corbett & Anderson, 1995) We look at the next two actions We apply Bayes’ Theorem This gives us training labels; we then develop a model that uses only features from the current action and the past

High-Level 5% probability student knew skill W W W –Skill was probably not learned at red action

High-Level 90% probability student knew skill R R R –Skill was probably not learned at red action

High-Level 30% probability student knew skill R R R –Skill was quite possibly learned at red action (or previous action)

High-Level 30% probability student knew skill W R R –Skill was quite possibly learned at red action (or next action)

High-Level 30% probability student knew skill W W R –Skill was probably not learned at red action

Now, for more details… My co-author, Adam Goldstein

Talk Outline Introduction Data P(J) model –Labeling Process –Features –ML Procedure –Results Spikiness Models Conclusions

Data used 232 students' use of CMU's Middle School Cognitive Tutor Math classes in one middle school in the Pittsburgh suburbs Used the tutor twice a week as part of their regular curriculum 581,785 transactions 171,987 problem steps over 253 skills

Talk Outline Introduction Data P(J) model –Labeling Process –Features –ML Procedure –Results Spikiness Models Conclusions

Labeling P(J) Bear with me, it's worth it Primarily concerned with this quantity: P(J) = P(~Ln ^ T | A+1+2) *Note how it is distinct from P(T) P(T) = P(T | ~Ln) P(J) = P(~Ln ^ T)

P(J) is distinct from P(T) Bear with me, it's worth it Primarily concerned with this quantity: P(J) = P(~Ln ^ T | A+1+2) *Note how it is distinct from P(T) P(T) = P(T | ~Ln) P(J) = P(~Ln ^ T)

Labeling P(J) We can better understand P(~Ln ^ T | A+1+2) with an application of Bayes' rule: P(~Ln ^ T | A+1+2) = P(A+1+2 | ~Ln ^ T) * P(~Ln ^ T) / P(A+1+2)

Labeling P(J) Base probability P(~Ln ^ T) computed using a student's current P(~Ln) and P(T) from BKT P(A+1+2) is a function of the only three relevant scenarios, {Ln, ~Ln ^ T, ~Ln ^ ~T}, and their contingent probabilities P(A+1+2) = P(A+1+2 | Ln)P(Ln) + P(A+1+2 | ~Ln ^ T)P(~Ln ^ T) + P(A+1+2 | ~Ln ^ ~T)P(~Ln ^ ~T)

Labeling P(J) And finally: Probability of the actions at N+1 and N+2 is a function of BKT's probabilities for guessing (G), slipping (S), and learning the skill (T) (Correct answers are notated with a C and incorrect answers are notated with a ~C) (A full list of equations is available in the paper) P(A+1+2 = C, C | Ln) = P(~S)P(~S) P(A+1+2 = C, ~C | Ln) = P(~S)P(S) P(A+1+2 = ~C, C | Ln) = P(S)P(~S) P(A+1+2 = ~C, ~C | Ln) = P(S)P(S)

Labeling P(J) P(A+1+2 = C, C | Ln) = P(~S)^2 P(A+1+2 = C, ~C | Ln) = P(~S)P(S) P(A+1+2 = ~C, C | Ln) = P(S)P(~S) P(A+1+2 = ~C, ~C | Ln) = P(S)^2 Future data is used only in training.
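Putting the preceding slides together, a minimal Python sketch of the labeling computation might look as follows. The ~Ln ^ ~T case is filled in under the same assumptions (a guess governs the action at N+1, and the skill may still be learned before N+2); the authoritative, full set of equations is the one in the paper, and all names here are illustrative.

```python
# Sketch of the P(J) training-label computation: P(J) = P(~Ln ^ T | A+1+2).
# Uses the skill's BKT parameters and the correctness of the NEXT TWO actions,
# so it only applies where future data is available (i.e., for training labels).
def p_j_label(p_Ln, p_T, p_G, p_S, next_two):
    """p_Ln: BKT's P(Ln) at this action; next_two: correctness of actions N+1, N+2."""
    c1, c2 = next_two

    def p_actions_if_known():
        # Skill known at N+1 and N+2: correctness governed only by slips
        return ((1 - p_S) if c1 else p_S) * ((1 - p_S) if c2 else p_S)

    def p_actions_if_unlearned():
        # Skill still unknown at N+1: a guess governs that action,
        # and the skill may be learned before N+2 (probability p_T)
        first = p_G if c1 else (1 - p_G)
        second = p_T * ((1 - p_S) if c2 else p_S) + (1 - p_T) * (p_G if c2 else (1 - p_G))
        return first * second

    p_learned_now = (1 - p_Ln) * p_T           # P(~Ln ^ T)
    p_not_learned = (1 - p_Ln) * (1 - p_T)     # P(~Ln ^ ~T)

    p_A = (p_actions_if_known() * p_Ln             # already knew the skill
           + p_actions_if_known() * p_learned_now  # learned it at this action
           + p_actions_if_unlearned() * p_not_learned)

    return p_actions_if_known() * p_learned_now / p_A
```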

Labeling P(J) But don't forget: P(J) = P(~Ln ^ T | A+1+2)

Talk Outline Introduction Data P(J) model –Labeling Process –Features –ML Procedure –Results Spikiness Models Conclusions

Features of P(J) Used log data from students' already-completed usage of the tutor Defined behaviors that may be indicative of knowledge acquisition Developed a means to quantify or observe those behaviors Used the same set of features as in [Baker, Corbett, and Aleven 2008]

Features of P(J) In training –The label P(J) uses future data from the logs –We machine-learn weights for each feature to predict P(J), using only past/present data In test –To predict P(J), we calculate these features and apply the learned weights, using only information available at run time

Example Features All features use only first actions

What some of those numbers mean P(J) is higher following incorrect responses –[Citation] P(J) decreases as the total number of times student got this skill wrong increases –Might need intervention not available in the tutor

What some of those numbers mean P(J) is lower following help requests –Stands out in contrast to [Beck et al 2008] P(J) is higher when help has been used recently, i.e. in the last 5 and/or 8 steps

Talk Outline Introduction Data P(J) model –Labeling Process –Features –ML Procedure –Results Spikiness Models Conclusions

Features of P(J) In RapidMiner, ran linear regression to model the correlation between our features and the P(J) label Two feature sets run through 6-fold student-level cross-validation –25 features, including Ln and Ln-1: 0.446 correlation to labels –23 features, not including Ln and Ln-1: 0.301 correlation
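The original analysis was run in RapidMiner; the sketch below is a hypothetical Python/scikit-learn equivalent of the procedure, with placeholder arrays X (features), y (P(J) labels), and student_ids.

```python
# Linear regression evaluated with 6-fold cross-validation that keeps all of a
# student's actions in the same fold, then correlates predictions with the labels.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold

def cross_validated_correlation(X, y, student_ids, n_folds=6):
    predictions = np.zeros(len(y), dtype=float)
    for train_idx, test_idx in GroupKFold(n_splits=n_folds).split(X, y, groups=student_ids):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        predictions[test_idx] = model.predict(X[test_idx])
    r, _ = pearsonr(predictions, y)  # correlation between predicted and labeled P(J)
    return r
```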

Features of P(J) An argument could be made that using BKT probabilities (Ln) in the definition of the label (~Ln ^ T) is wrong –We consider this to be valid: the interesting part is the T, not the Ln Even if you don't buy it, a 0.301 correlation coefficient is certainly still something

Back to Ryan For some discussion of analysis of P(J)

Talk Outline Introduction Data P(J) model –Labeling Process –Features –ML Procedure –Results Spikiness Models Conclusions

Research question Does learning in intelligent tutors have more the character of gradual learning (such as strengthening of a memory association [cf. Pavlik & Anderson, 2008]), or of learning marked by "eureka" moments, where a skill is understood suddenly? [Lindstrom & Gulz, 2008] Does this vary by skill?

To answer this We can plot P(J) over time, and see how “spiky” the graph is Note that this is effectively the derivative of the more standard theoretical learning curve (cf. Corbett & Anderson, 1995; Koedinger et al, 2008)
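A minimal sketch of such a plot, with hypothetical P(J) values for one student/skill pair:

```python
# Plot P(J) against opportunity to practice for one student on one skill.
# The p_j values below are hypothetical.
import matplotlib.pyplot as plt

p_j = [0.02, 0.03, 0.02, 0.45, 0.05, 0.03, 0.02]
plt.plot(range(1, len(p_j) + 1), p_j, marker='o')
plt.xlabel('Opportunity to practice (OPTOPRAC)')
plt.ylabel('P(J)')
plt.title('P(J) over practice opportunities for one student/skill')
plt.show()
```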

Real Data for One Student (Two different skills) [graphs of P(J) plotted against opportunity to practice (OPTOPRAC), one per skill]

As you can see… One skill was learned gradually, the other skill was learned suddenly Note that the first graph had *two* spikes This was actually very common in the data, even more common than single spikes –I would very much appreciate hypotheses for why this happens, as I don’t have a good theoretical explanation for this

We can quantify the difference between these graphs We can quantify the degree to which a learning sequence involves a “eureka” moment, through a metric we call “spikiness” For a given student/skill pair, spikiness = Max P(J)/Avg P(J) –Scaled from 1 to infinity

Looking at spikiness We only consider action sequences at least 6 problem steps long –(Shorter sequences tend to look spiky more often, a mathematical consequence of using a within-sequence average) We only consider the first 20 problem steps –After that, the student is probably floundering
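A minimal sketch of the spikiness metric as defined above, including the length restrictions from this slide; the names are illustrative.

```python
# Spikiness for one student/skill pair: max P(J) / mean P(J),
# computed only for sequences of at least 6 problem steps,
# and truncated to the first 20 steps.
def spikiness(p_j_sequence, min_steps=6, max_steps=20):
    seq = p_j_sequence[:max_steps]
    if len(seq) < min_steps:
        return None  # too short to yield a meaningful value
    return max(seq) / (sum(seq) / len(seq))
```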

Spikiness by skill Min: 1.12 Max: Avg: 8.55 SD: Future work: What characterizes spiky skills and gradually-learned skills?

Spikiness by student Min: 2.22 Max: Avg: 6.81 SD: 3.09 Students are less spiky than skills

Interestingly The correlation between a student's spikiness and their final average P(Ln) across skills is a high 0.71, statistically significantly different from chance Suggests that learning spikes may be an early predictor of whether a student is going to achieve good learning of specific material –May someday be the basis of better knowledge tracing

One of… One of many analyses potentially enabled by this model

Worth Noting Across all actions on a skill, P(J) levels generally don't quite add up to a total of 1 In general, our model is representative of P(J) at lower levels but tends to underestimate the height of spikes –May be a result of using a linear modeling approach for a fundamentally non-linear phenomenon –May also be that P(J) is actually too high in the training labels (where it often sums to significantly more than 1) –Could be normalized; for the purposes of spikiness analyses, we believe the model biases towards seeing less total spikiness

Talk Outline Introduction Data P(J) model –Labeling Process –Features –ML Procedure –Results Spikiness Models Conclusions

In this paper We have advanced a new model that is able to infer, with moderate accuracy, exactly when a student learns a skill We've now gotten the model to correlate about twice as well with the training labels, by looking at more than just the first attempt at the problem step –For more details, see our poster at EDM2010

In this paper We have used this model to analyze the differences in spikiness of different skills (and different students) Discovering that double-spikes are common (a finding we don’t yet understand) And that student spikiness predicts the post-test

Other future work Why are some skills spikier than others? How can we use spikiness to improve student knowledge modeling? What are the conditions and antecedents of learning spikes?

Tools sharing We would be very happy to share the spreadsheets and code that we used to calculate the P(J) labels with any interested colleague Please email us

Thanks! Any Questions?