Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 February 13, 2012.

Today’s Class Classification and Behavior Detection

Prediction Pretty much what it says A student is using a tutor right now. Is he gaming the system or not? A student has used the tutor for the last half hour. How likely is it that she knows the skill in the next step? A student has completed three years of high school. What will be her score on the college entrance exam?

Two Key Types of Prediction This slide adapted from slide by Andrew W. Moore, Google

Classification There is something you want to predict (“the label”) The thing you want to predict is categorical – The answer is one of a set of categories, not a number – CORRECT/WRONG (sometimes expressed as 0,1) – HELP REQUEST/WORKED EXAMPLE REQUEST/ATTEMPT TO SOLVE – WILL DROP OUT/WON’T DROP OUT – WILL SELECT PROBLEM A,B,C,D,E,F, or G

Where do those labels come from? Field observations (take PSY503) Text replays (take PSY503) Post-test data (take PSY503) Tutor performance Survey data School records Where else?

Classification Associated with each label is a set of “features”, which you may be able to use to predict the label:

Skill          pknow   time   totalactions   right
ENTERINGGIVEN    …       …         …         WRONG
ENTERINGGIVEN    …       …         …         RIGHT
USEDIFFNUM       …       …         …         WRONG
ENTERINGGIVEN    …       …         …         RIGHT
REMOVECOEFF      …       …         …         WRONG
REMOVECOEFF      …       …         …         RIGHT
USEDIFFNUM       …       …         …         RIGHT
…

Classification The basic idea of a classifier is to determine which features, in which combination, can predict the label:

Skill          pknow   time   totalactions   right
ENTERINGGIVEN    …       …         …         WRONG
ENTERINGGIVEN    …       …         …         RIGHT
USEDIFFNUM       …       …         …         WRONG
ENTERINGGIVEN    …       …         …         RIGHT
REMOVECOEFF      …       …         …         WRONG
REMOVECOEFF      …       …         …         RIGHT
USEDIFFNUM       …       …         …         RIGHT
…

Classification Of course, usually there are more than 4 features, and more than 7 actions/data points. These days, 800,000 student actions and 26 features would be a medium-sized data set.

Classification One way to classify is with a Decision Tree (like J48)

[Decision tree diagram: the root node splits on PKNOW at 0.5; the branches split on TIME at 6 s. and on TOTALACTIONS at 4; the leaves are labeled RIGHT and WRONG]

Classification One way to classify is with a Decision Tree (like J48)

[The same decision tree diagram, now applied to a new, unlabeled action: Skill = COMPUTESLOPE, right = ?]

J48/C4.5 Can handle both numerical and categorical predictor variables – Tries to find the optimal split in numerical variables Repeatedly looks for the variable which best splits the data in terms of predictive power Later prunes out branches that turned out to have low predictive power
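Below is a minimal sketch of this style of tree learner, using scikit-learn's DecisionTreeClassifier (which implements CART rather than J48/C4.5, but splits and prunes in the same spirit). The feature values and the pruning setting are made-up illustrations, not values from the lecture data.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical action-level data in the same shape as the table above
data = pd.DataFrame({
    "pknow":        [0.1, 0.7, 0.3, 0.9, 0.2, 0.8],
    "time":         [12,   4,   8,   3,  15,   5],
    "totalactions": [6,    2,   5,   1,   7,   3],
    "right":        ["WRONG", "RIGHT", "WRONG", "RIGHT", "WRONG", "RIGHT"],
})

X = data[["pknow", "time", "totalactions"]]
y = data["right"]

# The learner repeatedly picks the split with the most predictive power;
# ccp_alpha > 0 prunes back branches that add little (roughly analogous to
# C4.5's post-pruning of low-value branches)
tree = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01).fit(X, y)

# Classify a new, unlabeled action (e.g. the COMPUTESLOPE row on the earlier slide)
new_action = pd.DataFrame([[0.4, 7.0, 2]], columns=["pknow", "time", "totalactions"])
print(tree.predict(new_action))
```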

Step Regression Linear regression (discussed in detail in a later class), with a cut-off Essentially assigns a weight to each parameter and then computes a numerical value. Then all values below 0.5 are treated as 0, and all values >= 0.5 are treated as 1
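As a rough illustration only (not the exact procedure from any particular paper), this kind of step regression can be sketched as a linear regression whose numeric output is thresholded at 0.5. The features and labels below are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features (e.g. pknow, time) and 0/1 labels (WRONG=0, RIGHT=1)
X = np.array([[0.1, 12.0], [0.7, 4.0], [0.3, 8.0], [0.9, 3.0], [0.2, 15.0], [0.8, 5.0]])
y = np.array([0, 1, 0, 1, 0, 1])

model = LinearRegression().fit(X, y)    # assigns a weight to each parameter
raw = model.predict(X)                  # numeric predicted values
labels = (raw >= 0.5).astype(int)       # >= 0.5 treated as 1, below 0.5 treated as 0
print(raw.round(2), labels)
```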

And of course… There are lots of other classification algorithms you can use... K* (instance-based classification) JRip (rule-based classification) PART (rule-based classification using partial decision trees) Neural Network Logistic Regression SMO (support vector machine) In your favorite Machine Learning package
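A hedged sketch of how interchangeable these are in practice: in scikit-learn (much as in Weka or RapidMiner) the algorithms sit behind a common fit/predict interface, so swapping one for another is a one-line change. KNeighborsClassifier stands in only loosely for an instance-based method like K*, and the data here is synthetic.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier      # instance-based
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier        # neural network
from sklearn.svm import SVC                             # support vector machine

rng = np.random.default_rng(0)
X = rng.random((100, 3))                 # synthetic features
y = (X[:, 0] > 0.5).astype(int)          # synthetic labels

for clf in [KNeighborsClassifier(), LogisticRegression(),
            MLPClassifier(max_iter=2000), SVC()]:
    clf.fit(X, y)
    print(type(clf).__name__, round(clf.score(X, y), 2))
```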

If there’s time at the end of class… We could go through some of these algorithms

Comments? Questions?

What data set should you generally test on? A vote… – Raise your hands as many times as you like

What data set should you generally test on?
– The data set you trained your classifier on
– A data set from a different tutor
– Split your data set in half (by students), train on one half, test on the other half
– Split your data set in ten (by actions): train on nine of the sets, test on the tenth; do this ten times
Votes?

What data set should you generally test on?
– The data set you trained your classifier on
– A data set from a different tutor
– Split your data set in half (by students), train on one half, test on the other half
– Split your data set in ten (by actions): train on nine of the sets, test on the tenth; do this ten times
What are the benefits and drawbacks of each?

The dangerous one (though still sometimes OK) The data set you trained your classifier on If you do this, there is serious danger of over-fitting

The dangerous one (though still sometimes OK) You have ten thousand data points. You fit a parameter for each data point. “If data point 1, RIGHT. If data point 78, WRONG…” Your accuracy is 100%. Your kappa is 1. Your model will neither work on new data, nor will it tell you anything.
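A small, self-contained illustration of this point (assuming scikit-learn; the data are random numbers, so the features carry no real signal): an unconstrained tree that effectively memorizes every training point scores perfectly on the data it was fit to, and that tells you nothing.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score

rng = np.random.default_rng(0)
X = rng.random((1000, 3))                 # random "features"
y = rng.integers(0, 2, 1000)              # labels unrelated to the features

memorizer = DecisionTreeClassifier().fit(X, y)   # unconstrained: ~one leaf per point
pred = memorizer.predict(X)                       # tested on the training data itself

print(accuracy_score(y, pred), cohen_kappa_score(y, pred))   # 1.0 and 1.0
```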

The dangerous one (though still sometimes OK) The data set you trained your classifier on When might this one still be OK?

The dangerous one (though still sometimes OK) The data set you trained your classifier on When might this one still be OK? – Computing complexity-based goodness metrics such as BIC – Determining the maximum possible performance of a modeling approach
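For reference, one common form of BIC (variants exist, so treat the exact constants as an assumption here): BIC = k·ln(n) − 2·ln(L̂), where k is the number of model parameters, n the number of data points, and L̂ the maximized likelihood. A tiny helper with purely hypothetical numbers:

```python
import math

def bic(log_likelihood: float, n_params: int, n_points: int) -> float:
    # BIC = k * ln(n) - 2 * ln(L); lower is better
    return n_params * math.log(n_points) - 2.0 * log_likelihood

print(bic(log_likelihood=-520.3, n_params=4, n_points=1000))  # hypothetical values
```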

K-fold cross validation (standard) Split your data set in ten (by action). Train on nine of the sets, test on the tenth. Do this ten times. What can you infer from this?

K-fold cross validation (standard) Split your data set in ten (by action). Train on nine of the sets, test on the tenth. Do this ten times. What can you infer from this? – Your detector will work with new data from the same students

K-fold cross validation (standard) Split your data set in ten (by action). Train on nine of the sets, test on the tenth. Do this ten times. What can you infer from this? – Your detector will work with new data from the same students How often do we really care about this?
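A minimal sketch of this action-level 10-fold setup, assuming scikit-learn and synthetic data (real work would use features like those earlier in the lecture):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
X = rng.random((200, 3))                                    # synthetic action-level features
y = (X[:, 0] + rng.normal(0, 0.2, 200) > 0.5).astype(int)   # synthetic labels

kappas = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    clf = DecisionTreeClassifier(max_depth=3).fit(X[train_idx], y[train_idx])
    kappas.append(cohen_kappa_score(y[test_idx], clf.predict(X[test_idx])))

print(round(float(np.mean(kappas)), 2))   # averaged over the ten held-out folds
```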

K-fold cross validation (student-level) Split your data set in half (by student), train on one half, test on the other half What can you infer from this?

K-fold cross validation (student-level) Split your data set in half (by student), train on one half, test on the other half What can you infer from this? – Your detector will work with data from new students from the same population (whatever it was) – Possible to do in RapidMiner – Not possible to do in Weka
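The slide mentions RapidMiner and Weka; as an assumption-laden sketch of the same idea in scikit-learn, GroupKFold keeps every action from a given student in the same fold, so the test students are never seen during training. Student IDs and data below are synthetic.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(1)
X = rng.random((200, 3))                      # synthetic action-level features
y = (X[:, 0] > 0.5).astype(int)
students = rng.integers(0, 20, 200)           # hypothetical student ID for each action

kappas = []
for train_idx, test_idx in GroupKFold(n_splits=10).split(X, y, groups=students):
    clf = DecisionTreeClassifier(max_depth=3).fit(X[train_idx], y[train_idx])
    kappas.append(cohen_kappa_score(y[test_idx], clf.predict(X[test_idx])))

print(round(float(np.mean(kappas)), 2))
```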

K-fold or leave-one-out Really not clear which one is best (as discussed in previous lecture) Certain kinds of re-sampling/bootstrapping/etc. are easier to do with k-fold cross-validation

A data set from a different tutor The most stringent test When your model succeeds at this test, you know you have a good/general model When it fails, it’s sometimes hard to know why

An interesting alternative Leave-out-one-tutor-cross-validation (cf. Baker, Corbett, & Koedinger, 2006) – Train on data from 3 or more tutors – Test on data from a different tutor – (Repeat for all possible combinations) – Good for giving a picture of how well your model will perform in new lessons
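A sketch of that procedure under the same assumptions as the earlier snippets (scikit-learn, synthetic data, hypothetical tutor IDs): LeaveOneGroupOut trains on all but one tutor lesson and tests on the held-out lesson, once per lesson.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(2)
X = rng.random((400, 3))                      # synthetic action-level features
y = (X[:, 1] > 0.5).astype(int)
tutors = rng.integers(0, 4, 400)              # hypothetical IDs for four tutor lessons

for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=tutors):
    clf = DecisionTreeClassifier(max_depth=3).fit(X[train_idx], y[train_idx])
    held_out = tutors[test_idx][0]
    kappa = cohen_kappa_score(y[test_idx], clf.predict(X[test_idx]))
    print(f"held-out tutor {held_out}: kappa {kappa:.2f}")
```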

Worth noting If you want to know if your model will work on new populations, cross-validate at the population level rather than the student level

Comments? Questions?

Homework 3 Let’s look at some of the homework 3 solutions Please comment on what’s right and wrong, what’s clever, etc. We’ll look at the approaches, the goodness, the final models

Homework 3 Now let’s take the best homework Any other ideas for how to come up with a better model? – Let’s try them!

Feature Engineering There are lots of fancy algorithms But typically your detector is no better than your features – Features that have good construct validity are more likely to produce a good model – Particularly nice example of this in Sao Pedro et al. (under review) In the next assignment, you’ll create your own features to try to produce a better model

Assignment 4 Let’s review Assignment 4

Comments? Questions?

Next Class Wednesday, February 15 3pm-5pm AK232 Feature engineering and feature distillation SPECIAL GUEST LECTURER: SUJITH GOWDA Assignments Due: 4. Feature Engineering

The End

Bonus Slides If there’s time

BKT with Multiple Skills

Conjunctive Model (Pardos et al., 2008) The probability a student can answer an item with skills A and B is P(CORR|A^B) = P(CORR|A) * P(CORR|B) But how should credit or blame be assigned to the various skills?
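A minimal worked example of the conjunctive rule quoted above; the per-skill probabilities are made up for illustration.

```python
# P(CORR | A ^ B) = P(CORR | A) * P(CORR | B)
p_corr_given_A = 0.8          # hypothetical probability of answering correctly given skill A
p_corr_given_B = 0.6          # hypothetical probability given skill B

p_corr_conjunctive = p_corr_given_A * p_corr_given_B
print(p_corr_conjunctive)     # 0.48 -- requiring both skills makes the item harder than either alone
```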

Koedinger et al.’s (2011) Conjunctive Model Equations for 2 skills

Koedinger et al.’s (2011) Conjunctive Model Generalized equations

Koedinger et al.’s (2011) Conjunctive Model Handles the case where multiple skills apply to an item better than classical BKT does

Other BKT Extensions? Additional parameters? Additional states?

Many others Compensatory Multiple Skills (Pardos et al., 2008) Clustered Skills (Ritter et al., 2009)