Deep Knowledge Tracing

Deep Knowledge Tracing. Chris Piech∗, Jonathan Spencer∗, Jonathan Huang∗‡, Surya Ganguli∗, Mehran Sahami∗, Leonidas Guibas∗, Jascha Sohl-Dickstein∗†. ∗Stanford University, †Khan Academy, ‡Google.

Contributions: (1) a novel application of recurrent neural networks (RNNs) to tracing student knowledge (Model Introduction); (2) a demonstration that our model does not need expert annotations (Previous Work); (3) a 25% gain in AUC over the best previous result (Experimental Results); (4) the ability to power a number of other applications (Other Applications).

Thanks!

Knowledge Tracing. Definition: the task of modeling student knowledge over time, usually by observing the correctness of a student's answers to exercises, so that we can accurately predict how students will perform on future interactions. Knowledge tracing attempts to monitor the student's changing knowledge state during practice. Interactions: $x_t = (q_t, a_t)$, where $q_t$ is the exercise being answered and $a_t$ indicates whether it was answered correctly. In most previous work, exercise tags denote the single "concept" that human experts assign to an exercise.

Motivation. Develop computer-assisted education by building models of large-scale student trace data on MOOCs: resources can be suggested to students based on their individual needs; content which is predicted to be too easy or too hard can be skipped or delayed; and formal testing is no longer necessary if a student's ability undergoes continuous assessment. The knowledge tracing problem is inherently difficult, as human learning is grounded in the complexity of both the human brain and human knowledge. Most previous work in education relies on first-order Markov models with restricted functional forms.

Dataset (1): ASSISTments. An online tutor that simultaneously teaches and assesses students in grade-school mathematics. Tutorial help is given if the student answers a question wrong or asks for help; it assists the student in learning the required knowledge by breaking each problem into sub-questions (called scaffolding) or by giving the student hints on how to solve the question. A question is only marked as correct if the student answers it correctly on the first attempt without requesting help. (Slide shows an example ASSISTments item where the student answers incorrectly and is given tutorial help.)

Dataset (1), continued. The items in a problem set all have the exact same skill tagging; in the response data, 0 corresponds to an incorrect answer and 1 to a correct one. Each student completed an average of 3 of the 42 problem sets in this dataset, and each problem set was completed by an average of 312 students.

Dataset (2): KDD Cup 2010. Task: predict a value between 0 and 1 indicating the probability of a correct first attempt for each student-step. The data matrix is sparse: not all students are given every problem, and some problems were completed by only 1 or 2 students. There is a strong temporal dimension to the data: students improve over the course of the school year, students must master some skills before moving on to others, and incorrect responses to some items lead to incorrect assumptions on other items, so contestants must pay attention to temporal as well as conceptual relationships among items. The whole collection of steps comprises the solution; the last step can be considered the "answer", and the others are "intermediate" steps. Every knowledge skill/concept is associated with one or more exercises, and a single exercise can be associated with one or more skills/concepts.

Dataset (2): KDD Cup 2010.

Recurrent Neural Networks (RNN).

Recurrent Neural Networks (RNN). Traditional RNNs map an input sequence of vectors $x_1, \dots, x_T$ to an output sequence of vectors $y_1, \dots, y_T$. This is achieved by computing a sequence of 'hidden' states $h_1, \dots, h_T$, which can be viewed as successive encodings of relevant information from past observations that will be useful for future predictions: $h_t = \tanh(W_{hx} x_t + W_{hh} h_{t-1} + b_h)$ and $y_t = \sigma(W_{yh} h_t + b_y)$, where both $\tanh$ and the sigmoid function $\sigma(\cdot)$ are applied to each dimension of the input. The model is parameterized by an input weight matrix $W_{hx}$, recurrent weight matrix $W_{hh}$, initial state $h_0$, and readout weight matrix $W_{yh}$; biases for latent and readout units are given by $b_h$ and $b_y$.
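
As a minimal sketch (not the authors' implementation), the RNN update above can be written in NumPy as follows; the array shapes and argument names are assumptions for illustration.

```python
import numpy as np

def rnn_forward(x_seq, W_hx, W_hh, W_yh, b_h, b_y, h0):
    """Vanilla RNN: h_t = tanh(W_hx x_t + W_hh h_{t-1} + b_h),
    y_t = sigmoid(W_yh h_t + b_y), both applied element-wise."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h, ys = h0, []
    for x_t in x_seq:                                # iterate over time steps t = 1..T
        h = np.tanh(W_hx @ x_t + W_hh @ h + b_h)     # hidden state update
        ys.append(sigmoid(W_yh @ h + b_y))           # readout: per-exercise predictions
    return ys
```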

Long Short-Term Memory (LSTM): a more complex variant of RNNs that often proves more powerful. In the variant of LSTMs used here, latent units retain their values until explicitly cleared by the action of a 'forget gate'. They thus more naturally retain information over many time steps, which is believed to make them easier to train. Additionally, hidden units are updated using multiplicative interactions, so they can perform more complicated transformations for the same number of latent units. The update equations for an LSTM are significantly more complicated than for a vanilla RNN.
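
For a hedged sense of how an LSTM could slot into this setup, here is a short PyTorch sketch; the paper does not prescribe a particular library, and the layer sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

M, hidden_dim = 100, 200                     # assumed sizes, for illustration only
lstm = nn.LSTM(input_size=2 * M, hidden_size=hidden_dim, batch_first=True)
readout = nn.Linear(hidden_dim, M)           # maps hidden state to per-exercise logits

x = torch.zeros(32, 50, 2 * M)               # a batch of 32 students, 50 interactions each
h_seq, _ = lstm(x)                           # hidden states at every time step
y = torch.sigmoid(readout(h_seq))            # predicted correctness probabilities in (0,1)^M
```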

Previous Work: Bayesian Knowledge Tracing (BKT), Learning Factors Analysis (LFA), Performance Factors Analysis (PFA), SPARse Factor Analysis (SPARFA).

Standard Bayesian Knowledge Tracing (BKT) (1995). State: {unlearned, learned}; observation: {incorrect, correct}. BKT is the most popular approach for building temporal models of student learning and is extensively used in Intelligent Tutoring Systems. It models a learner's latent knowledge state as a set of binary variables and assumes a two-state learning model: a skill moves from unlearned to learned through practice (reading the text, or at each opportunity to apply the rule), and there is no forgetting (no transition from learned back to unlearned). If a skill is learned, the student may still make a mistake (a slip); if it is unlearned, the student may still guess correctly. [2008] Four parameters per skill, fit using student performance data, relate performance to learning; statistically, this is equivalent to a two-node dynamic Bayesian network. At the beginning of using the tutor, each student has an initial probability $P(L_0)$ of knowing each skill, and at each opportunity to practice a skill the student does not know, the student has a certain probability $P(T)$ of learning it, regardless of whether the answer is correct. The system's estimate that a student knows a skill is continually updated every time the student gives a first response (correct, incorrect, or a help request) to a problem step: first, the system re-calculates the probability that the student knew the skill before the response, using the evidence from the response (help requests are treated as evidence that the student does not know the skill).
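
A minimal sketch of the standard BKT update (posterior given the observed response, then the learning transition); this follows the textbook formulation rather than any particular tutor's implementation, and the parameter values in the example are placeholders.

```python
def bkt_update(p_know, correct, p_slip, p_guess, p_transit):
    """One BKT step: compute P(learned | response), then apply the learning transition."""
    if correct:
        evidence = p_know * (1 - p_slip)                        # knew it and did not slip
        posterior = evidence / (evidence + (1 - p_know) * p_guess)
    else:
        evidence = p_know * p_slip                              # knew it but slipped
        posterior = evidence / (evidence + (1 - p_know) * (1 - p_guess))
    return posterior + (1 - posterior) * p_transit              # chance of learning from this practice

# Example: prior knowledge 0.3, observe a correct first response.
p = bkt_update(p_know=0.3, correct=True, p_slip=0.1, p_guess=0.2, p_transit=0.15)
```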

Extensions and Drawbacks. Extensions: contextualization of guessing and slipping estimates; estimating prior knowledge for individual learners; estimating problem difficulty. Drawbacks: the binary representation of student understanding may be unrealistic; the meaning of the hidden variables and their mappings onto exercises can be ambiguous, rarely meeting the model's expectation of a single concept per exercise; and the binary response data used to model transitions imposes a limit on the kinds of exercises that can be modeled.

SPARse Factor Analysis (SPARFA) (JMLR 2014). The probabilities that the learners answer the questions correctly are governed by $WC + M$, where $W$ holds question-concept associations, $C$ holds each learner's concept knowledge, and $M$ holds intrinsic question difficulties, passed through a link function.

SPARse Factor Analysis (SPARFA) (JMLR 2014), continued. Three observations: (1) typical educational domains of interest involve only a small number of key concepts; (2) each question involves only a small subset of the abstract concepts; (3) the entries of $W$ should be non-negative. Drawback: such models are both more restricted in functional form and more expensive (due to inference of latent variables) than the method we present here.
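
As an illustrative sketch of the SPARFA-style prediction $WC + M$, assuming a logistic link for simplicity (SPARFA itself is formulated with a probit or logit link); all values below are toy numbers.

```python
import numpy as np

def sparfa_predict(W, C, mu):
    """P(correct)[q, s] = link(W @ C + mu): W is questions x concepts (sparse, non-negative),
    C is concepts x learners, and mu is the per-question intrinsic difficulty."""
    Z = W @ C + mu[:, None]            # add each question's difficulty to every learner column
    return 1.0 / (1.0 + np.exp(-Z))    # logistic link as a stand-in

# Toy example: 3 questions, 2 concepts, 4 learners.
W = np.array([[1.0, 0.0], [0.0, 0.8], [0.5, 0.5]])
C = np.random.randn(2, 4)
mu = np.array([0.2, -0.1, 0.0])
probs = sparfa_predict(W, C, mu)
```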

RNN Model. In contrast to hidden Markov models as they appear in education, which are also dynamic, RNNs have a high-dimensional, continuous representation of latent state. A notable advantage of this richer representation is the ability to use information from an input in a prediction at a much later point in time.

RNN Model: inputs, outputs, and loss. For datasets with a small number $M$ of unique exercises, the input is $x_t = \{q_t, a_t\} \in \{0,1\}^{2M}$, a one-hot encoding of which exercise was answered ($q_t$) and whether it was answered correctly ($a_t$). The output is $y_t \in (0,1)^M$, where each entry represents the predicted probability that the student would answer that particular exercise correctly; the prediction of $a_{t+1}$ can then be read from the entry in $y_t$ corresponding to $q_{t+1}$. Loss: the training objective is the negative log likelihood of the observed sequence of student responses under the model. To prevent overfitting, dropout was applied to $h_t$ when computing $y_t$ during training.
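
A minimal sketch of the input encoding and the negative-log-likelihood objective described above; the exact index convention of the one-hot code is an assumption, since the slide only specifies a length-$2M$ one-hot vector.

```python
import numpy as np

def encode_interaction(q, a, M):
    """One-hot encode (q, a) as a length-2M vector: index q if incorrect, index M + q if correct."""
    x = np.zeros(2 * M)
    x[q + a * M] = 1.0
    return x

def nll_loss(y_seq, q_seq, a_seq):
    """Negative log likelihood of the observed responses: the prediction for step t+1 is read
    from the entry of y_t that corresponds to the next exercise q_{t+1}."""
    eps, loss = 1e-9, 0.0
    for t in range(len(q_seq) - 1):
        p = y_seq[t][q_seq[t + 1]]      # predicted P(correct) for the next exercise
        a = a_seq[t + 1]                # observed correctness of the next response
        loss -= a * np.log(p + eps) + (1 - a) * np.log(1 - p + eps)
    return loss
```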

RNN Model.

Experiment results: datasets. The 25% improvement is computed as $(0.86 - 0.69)/0.69 \approx 25\%$. Simulated Data: 2000 simulated students answering 50 exercises drawn from $k \in \{1, \dots, 5\}$ concepts; each student has a latent knowledge state for each concept, and each exercise has both a single concept and a difficulty. Khan Academy Data: 1.4 million exercises completed by 47,495 students across 69 different exercise types. ASSISTments Dataset: the largest publicly available knowledge tracing dataset.

Experiment results.

Other Applications (1): Improving Curricula. The task is choosing the best sequence of learning items to present to a student. Given a student with an estimated hidden knowledge state, we can query our RNN to calculate what their expected knowledge state would be if we were to assign them a particular exercise.
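
One way this query could drive curriculum selection is a one-step (depth-1) expectimax, sketched below under assumptions: `model.predict` is a hypothetical interface mapping an interaction history to the output vector $y_t$, and mean predicted accuracy over all exercises stands in as the measure of knowledge.

```python
import numpy as np

def expected_knowledge(model, history, exercise):
    """Expected knowledge after assigning `exercise`, marginalizing over whether the
    student answers it correctly (probabilities taken from the model itself)."""
    y = model.predict(history)                      # current per-exercise P(correct)
    p_correct = y[exercise]
    gain = 0.0
    for a, p in ((1, p_correct), (0, 1.0 - p_correct)):
        y_next = model.predict(history + [(exercise, a)])
        gain += p * float(np.mean(y_next))          # mean predicted accuracy as knowledge proxy
    return gain

def choose_next_exercise(model, history, M):
    """Greedy one-step expectimax choice over all M candidate exercises."""
    return max(range(M), key=lambda e: expected_knowledge(model, history, e))
```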

Experimental Results: expectimax vs. mixing (exercises from different topics are intermixed) and blocking (students answer a series of exercises of the same type). We tested the different curricula for selecting exercises on a subset of five concepts over the span of 30 exercises from the ASSISTments dataset.

Other Applications (2): Discovering Exercise Relationships. The task of discovering latent structure or concepts in the data, by assigning an influence $J_{ij}$ to every directed pair of exercises $i$ and $j$: $J_{ij} = \frac{y(j \mid i)}{\sum_k y(j \mid k)}$, where $y(j \mid i)$ is the correctness probability assigned by the RNN to exercise $j$ when exercise $i$ is answered correctly in the first time step. This characterization of the dependencies captured by the RNN recovers the prerequisites associated with exercises.
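
A minimal sketch of computing the influence matrix $J$ from a trained model, again using the hypothetical `model.predict` interface (a history of (exercise, correctness) pairs in, the vector $y_t$ out).

```python
import numpy as np

def influence_matrix(model, M):
    """J[i, j] = y(j|i) / sum_k y(j|k): condition on a single correct answer to exercise i
    in the first time step and read off the predicted correctness for every exercise j."""
    Y = np.zeros((M, M))
    for i in range(M):
        Y[i] = model.predict([(i, 1)])              # row i holds y(. | i)
    return Y / Y.sum(axis=0, keepdims=True)         # normalize each column j over conditioning exercises k
```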

Experimental Results.

Future Work: incorporate other features as inputs (such as time taken); explore other educational impacts (such as hint generation and dropout prediction); validate hypotheses posed in the education literature (such as spaced repetition and modeling how students forget); and track knowledge over more complex learning activities.