Individualizing BKT: Are Skill Parameters More Important Than Student Parameters?
Michael V. Yudelson, Carnegie Mellon University

"Important" here is a provocation; it is meant as "capturing more variance of performance" and hence, if accounted for, being more useful for modeling.

Modeling Student Learning (1)

Sources of performance variability:
- Time – learning happens with repetition
- Knowledge quanta – learning is better (or only) visible when aggregated by skill
- Students – people learn differently
- Content – some chunks of material are easier or harder than others

Modeling Student Learning (2)

Accounting for variability:
- Time – implicitly present in time-series data
- Knowledge quanta – skills are frequently used as units of transfer
- Content – many models address modality and/or instances of content
- Students – significant attention is given to accounting for individual differences

Of Students and Skills

- Without accounting for skills (the what), there is little chance of seeing learning: the component theory of transfer reliably defeats the faculty theory and item-based models (Koedinger et al., 2016)
- The student (the who) is, arguably, the strongest contender for the next most potent factor
- Skill-level and student-level factors in models of learning: which one is more influential when predicting performance?

Focus of This Work

- Subject: mathematics
- Model: Bayesian Knowledge Tracing (BKT)
- Investigation: adding per-student parameters
- Extension: [globally] weighting skill vs. student parameters
- Question: which [global] weight is larger – per-skill or per-student?

If I jump to the last slide now, we will be done, but let me still go through the slides in the middle.

Bayesian Knowledge Tracing

(Figure: unrolled view of a single-skill BKT model.)

Parameters:
- Values ∈ [0, 1]
- Rows sum to 1
- Forgetting (pF) = 0

Why 4 parameters?
- Rows sum to 1, so the last value of every row can be omitted
- Forgetting is 0, so there is no 5th parameter
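For concreteness, here is a minimal Python sketch of the predict/update cycle behind these four parameters (function and variable names are mine; the talk itself contains no code):

```python
# Classic 4-parameter BKT: p(L0) = initial mastery, p(T) = learning rate,
# p(S) = slip, p(G) = guess; forgetting is fixed at 0.

def bkt_predict(p_mastery, p_slip, p_guess):
    """Probability of a correct response given the current mastery estimate."""
    return p_mastery * (1 - p_slip) + (1 - p_mastery) * p_guess

def bkt_update(p_mastery, correct, p_learn, p_slip, p_guess):
    """Posterior over mastery after one observed response, then the learning step."""
    if correct:
        evidence = p_mastery * (1 - p_slip) + (1 - p_mastery) * p_guess
        posterior = p_mastery * (1 - p_slip) / evidence
    else:
        evidence = p_mastery * p_slip + (1 - p_mastery) * (1 - p_guess)
        posterior = p_mastery * p_slip / evidence
    # No forgetting: mastery can only grow via the learning transition.
    return posterior + (1 - posterior) * p_learn

# Usage: trace one skill across a short sequence of responses (0 = wrong, 1 = right).
p_L = 0.3  # p(L0)
for correct in [0, 1, 1]:
    print(round(bkt_predict(p_L, 0.1, 0.2), 3))
    p_L = bkt_update(p_L, correct, 0.15, 0.1, 0.2)
```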

Individualization

BKT individualization via student-level parameters:
- 1PL IRT, AFM – a "student ability" intercept
- Splitting BKT parameters into student/skill components (Corbett & Anderson, 1995; Yudelson et al., 2013)
- Multiplexing Init for different student cohorts (Pardos & Heffernan, 2010)
- Parameters set only within student, not across students (Lee & Brunskill, 2012)

Additive/Compensatory BKT Individualization

BKT parameter P ∈ {Init, Learn, Slip, Guess}

iBKT: splitting parameter P (Yudelson et al., 2013)
- P = f(P_user, P_skill) = Sigmoid( logit(P_user) + logit(P_skill) )
- Per-student and per-skill parameters are added on the logit scale and converted back to the probability scale
- Setting all P_user = 0.5 reduces iBKT to standard BKT
- The iBKT model is fit using block coordinate descent

iBKT-W: making the parameter split more interesting
- P = f(P_u, P_k, W_0, W_u, W_k, W_uk) = Sigmoid( W_0 + W_u·logit(P_u) + W_k·logit(P_k) + W_uk·logit(P_u)·logit(P_k) )
- W_0 – bias, hopefully low
- W_u vs. W_k – student vs. skill weight
- W_uk – interaction of the student and skill components
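A sketch of the two combination functions in Python (names such as ibkt_param and ibkt_w_param are mine; the formulas are the ones on this slide):

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def ibkt_param(p_user, p_skill):
    """iBKT: add student and skill components on the logit scale."""
    return sigmoid(logit(p_user) + logit(p_skill))

def ibkt_w_param(p_user, p_skill, w0, w_u, w_k, w_uk):
    """iBKT-W: globally weighted combination with bias and interaction terms."""
    lu, lk = logit(p_user), logit(p_skill)
    return sigmoid(w0 + w_u * lu + w_k * lk + w_uk * lu * lk)

# A student component of 0.5 has logit 0, so iBKT collapses to standard BKT:
assert abs(ibkt_param(0.5, 0.3) - 0.3) < 1e-9
```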

Fitting BKT Models

iBKT: hmm-scalable
- Public version (standard BKT only): https://github.com/IEDMS/standard-bkt
- Fits standard and individualized BKT models using a suite of gradient-based solvers
- Exact inference of student/skill parameters (via block coordinate descent)

iBKT-W: JAGS/WinBUGS via R's rjags package
- Hierarchical Bayesian Model (HBM) with flexible hyper-parameterization
- Skill parameters – drawn from a uniform distribution
- Student parameters – drawn from a Gaussian distribution
- Only the Init and Learn parameters are individualized
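A rough simulation of the prior structure just described, with unit-scale placeholders for the hyper-parameters (this mimics the generative assumptions only; it is not the actual JAGS/WinBUGS model):

```python
import numpy as np

rng = np.random.default_rng(0)
n_skills, n_students = 30, 336  # sizes taken from the data slide

# Skill-level Init and Learn on the probability scale, uniform priors.
skill_init = rng.uniform(0.0, 1.0, size=n_skills)
skill_learn = rng.uniform(0.0, 1.0, size=n_skills)

# Student-level offsets on the logit scale, Gaussian priors
# (the standard deviation here is a placeholder hyper-parameter).
student_init = rng.normal(0.0, 1.0, size=n_students)
student_learn = rng.normal(0.0, 1.0, size=n_students)

def combine(student_logit, skill_prob):
    """Additive iBKT combination: student logit offset + skill logit."""
    return 1.0 / (1.0 + np.exp(-(student_logit + np.log(skill_prob / (1 - skill_prob)))))

# Per-(student, skill) Init matrix; only Init and Learn are individualized,
# Slip and Guess stay per-skill.
init_matrix = combine(student_init[:, None], skill_init[None, :])  # shape 336 x 30
```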

Data

KDD Cup 2010 Educational Data Mining Challenge: Carnegie Learning's Cognitive Tutor data
http://pslcdatashop.web.cmu.edu/KDDCup

One curriculum unit, Linear Inequalities (JAGS/WinBUGS is less computationally efficient, hence the restriction):
- 336 students
- 66,307 transactions
- 30 skills

BKT Models: Statistical Fit

Model                              Parameters    Hyper-parameters   RMSE      Accuracy
Majority class (predict correct)   –             –                  0.52516   0.7242
Standard BKT (hmm-scalable)        4N*           –                  0.40571   0.7561
Standard BKT (HBM)                 4N            –                  0.40299   0.7569
iBKT (hmm-scalable)                4N + 2M**     –                  0.39376   0.7680
iBKT (HBM)                         4N + 2M       4                  0.39287   0.7692
iBKT-W (HBM)                       4N + 2M + 4   12                 0.39236   0.7687
iBKT-W-2G (HBM)***                 –             16                 0.39252   0.7689

*   N – number of skills
**  M – number of students
*** Init and Learn fit as a mixture of 2 Gaussians
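For reference, the two fit metrics in the table can be computed as follows (a minimal sketch; it does not reproduce the study's exact evaluation protocol):

```python
import numpy as np

def rmse(y_true, p_pred):
    """Root mean squared error between 0/1 outcomes and predicted probabilities."""
    return float(np.sqrt(np.mean((y_true - p_pred) ** 2)))

def accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of responses whose thresholded prediction matches the outcome."""
    return float(np.mean((p_pred >= threshold) == y_true))

y = np.array([1, 0, 1, 1])
p = np.array([0.8, 0.3, 0.6, 0.4])
print(rmse(y, p), accuracy(y, p))  # ~0.403, 0.75
```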

The Story of Two Gaussians

(Figures: distributions of per-student parameters under iBKT-W HBM and iBKT-W-2G* HBM.)

* Fitting a mixture of 3 Gaussians results in a bimodal distribution as well.
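A hypothetical post-hoc illustration of the 2-Gaussian pattern, using scikit-learn on placeholder data (in the study the mixture was fit inside the hierarchical model, not like this):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Placeholder per-student (Init, Learn) values on the logit scale, drawn
# bimodally to mimic the reported pattern; real values come from the fit model.
low = rng.normal(-1.0, 0.3, size=(150, 2))
high = rng.normal(1.0, 0.3, size=(186, 2))
params = np.vstack([low, high])  # 336 students

gmm = GaussianMixture(n_components=2, random_state=0).fit(params)
membership = gmm.predict(params)
print(gmm.means_)               # two well-separated component means
print(np.bincount(membership))  # component sizes
```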

Student vs. Skill

Model           W0      Wskill   Wstudent   Wstudent*skill
iBKT-W HBM      0.012   0.565    1.420      0.004
iBKT-W-2G HBM   0.019   0.700    1.274      0.007

- The [global] bias W0 is low
- The [global] interaction term Wstudent*skill is even lower
- The student's [global] weight in the additive function is visibly higher
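Plugging the reported iBKT-W HBM weights into the combination formula makes the asymmetry concrete: the same deviation from the neutral 0.5 moves the combined probability further when it comes from the student side (a self-contained sketch using the table's numbers):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def logit(p):
    return math.log(p / (1 - p))

# Reported iBKT-W HBM weights from the table above.
w0, w_k, w_u, w_uk = 0.012, 0.565, 1.420, 0.004

def combined(p_u, p_k):
    lu, lk = logit(p_u), logit(p_k)
    return sigmoid(w0 + w_u * lu + w_k * lk + w_uk * lu * lk)

print(round(combined(0.7, 0.5), 3))  # student at 0.7, skill neutral -> 0.771
print(round(combined(0.5, 0.7), 3))  # skill at 0.7, student neutral -> 0.620
```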

Discussion (1)

Student parameters vs. skill parameters:
- Bias and interaction terms are effectively 0 (a little disappointing in the case of the interaction)
- Student parameters are weighted higher (in the 2 reported models plus 7 additional models tried)
- Only a small chance of over-fit, despite the random-factor treatment: 30 skills (uniform distribution), 336 students (Gaussian distribution)
- The Wk and Wu weights could be compensating for, or shifting, the individual student/skill parameters

Discussion (2)

(Figures: per-student Init (x-axis) vs. Learn (y-axis) scatterplots; legend: BKT – Init, Learn; logistic – Sigmoid(intercept), Sigmoid(slope).)

- iBKT via hmm-scalable – exact inference (fixed effect)
- iBKT-W-2G via HBM – regularization via setting priors

Wouldn't it be nice to have students here?

Discussion (3)

- Small differences in statistical fit: models with similar accuracies could be vastly different
- Significant differences in the amount of practice they would prescribe

(Figure: prescribed practice time, hh:mm.)

Discussion (4)

iBKT-W-2G: what do the 2 Gaussians represent?
- Problems, time, hints, errors, % correct, {time, errors, hints} per problem – none of these explains the membership
- Lower Init & Learn vs. higher Init & Learn – this does explain the membership: a latent uni-dimensional student ability

Thank you!

Michael V. Yudelson (C) 2016