Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 February 13, 2012
Today’s Class Classification and Behavior Detection
Prediction Pretty much what it says A student is using a tutor right now. Is he gaming the system or not? A student has used the tutor for the last half hour. How likely is it that she knows the skill in the next step? A student has completed three years of high school. What will be her score on the college entrance exam?
Two Key Types of Prediction This slide adapted from slide by Andrew W. Moore, Google
Classification There is something you want to predict (“the label”) The thing you want to predict is categorical – The answer is one of a set of categories, not a number – CORRECT/WRONG (sometimes expressed as 0,1) – HELP REQUEST/WORKED EXAMPLE REQUEST/ATTEMPT TO SOLVE – WILL DROP OUT/WON’T DROP OUT – WILL SELECT PROBLEM A,B,C,D,E,F, or G
Where do those labels come from? Field observations (take PSY503) Text replays (take PSY503) Post-test data (take PSY503) Tutor performance Survey data School records Where else?
Classification Associated with each label is a set of “features”, which you may be able to use to predict the label [Table: columns Skill, pknow, time, totalactions, right. Rows (Skill, right): ENTERINGGIVEN, WRONG; ENTERINGGIVEN, RIGHT; USEDIFFNUM, WRONG; ENTERINGGIVEN, RIGHT; REMOVECOEFF, WRONG; REMOVECOEFF, RIGHT; USEDIFFNUM, RIGHT; …. The pknow, time, and totalactions values are missing.]
Classification The basic idea of a classifier is to determine which features, in which combination, can predict the label [Table: columns Skill, pknow, time, totalactions, right. Rows (Skill, right): ENTERINGGIVEN, WRONG; ENTERINGGIVEN, RIGHT; USEDIFFNUM, WRONG; ENTERINGGIVEN, RIGHT; REMOVECOEFF, WRONG; REMOVECOEFF, RIGHT; USEDIFFNUM, RIGHT; …. The pknow, time, and totalactions values are missing.]
Classification Of course, usually there are more than 4 features And more than 7 actions/data points These days, 800,000 student actions, and 26 features, would be a medium-sized data set
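Classification One way to classify is with a Decision Tree (like J48) [Figure: decision tree with splits on PKNOW (<0.5 / >=0.5), TIME (<6s / >=6s), and TOTALACTIONS (<4 / >=4), and leaves RIGHT and WRONG]
Classification One way to classify is with a Decision Tree (like J48) [Figure: decision tree with splits on PKNOW (<0.5 / >=0.5), TIME (<6s / >=6s), and TOTALACTIONS (<4 / >=4), and leaves RIGHT and WRONG] [Table: a new data point to classify, with columns Skill, pknow, time, totalactions, right. Skill = COMPUTESLOPE, right = ?. The pknow, time, and totalactions values are missing.]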
J48/C4.5 Can handle both numerical and categorical predictor variables – For a numerical variable, tries to find the optimal split point Repeatedly looks for the variable that best splits the data in terms of predictive power (C4.5 scores candidate splits by information gain ratio) Later prunes out branches that turn out to have low predictive power
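The search for a split point in a numerical variable can be sketched roughly as follows. This is a simplified illustration, not J48 itself: it scores candidate thresholds by plain information gain (C4.5 uses gain ratio), it uses midpoints between distinct values as candidates, and the pknow values and labels are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_numeric_split(values, labels):
    """Return (threshold, information_gain) for the best 'value < threshold' split."""
    base = entropy(labels)
    best_threshold, best_gain = None, 0.0
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v < threshold]
        right = [lab for v, lab in zip(values, labels) if v >= threshold]
        # Weighted entropy of the two branches after the split.
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain

# Made-up pknow values and labels, for illustration only.
pknow = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
right = ["WRONG", "WRONG", "WRONG", "RIGHT", "RIGHT", "RIGHT"]
threshold, gain = best_numeric_split(pknow, right)
```

A tree learner would repeat this search over every variable, split on the winner, and recurse into each branch before pruning.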
Step Regression Linear regression (discussed in detail in a later class), with a cut-off Essentially assigns a weight to each feature, and then computes a numerical value from the weighted sum Then all values below 0.5 are treated as 0, and all values >= 0.5 are treated as 1
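A minimal sketch of the prediction step, assuming the weights have already been fitted by linear regression. The weights, intercept, and feature values below are hypothetical and not fitted to any data.

```python
def step_regression_predict(features, weights, intercept, cutoff=0.5):
    """Weighted sum of the features, then a hard cut-off to 0 or 1."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return 1 if score >= cutoff else 0

# Hypothetical weights for (pknow, time, totalactions).
weights = [0.8, -0.01, 0.02]
intercept = 0.1

pred_high = step_regression_predict([0.9, 5.0, 3.0], weights, intercept)   # score is about 0.83
pred_low = step_regression_predict([0.2, 20.0, 1.0], weights, intercept)   # score is about 0.08
```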
And of course… There are lots of other classification algorithms you can use in your favorite machine learning package – K* (instance-based classification) – JRip (rule-based classification; Weka's implementation of RIPPER) – PART (rule-based classification using partial decision trees) – Neural Network – Logistic Regression – SMO (support vector machine)
If there’s time at the end of class… We could go through some of these algorithms
Comments? Questions?
What data set should you generally test on? A vote… – Raise your hands as many times as you like
What data set should you generally test on? – The data set you trained your classifier on – A data set from a different tutor – Split your data set in half (by students), train on one half, test on the other half – Split your data set in ten (by actions). Train on nine of the ten subsets, test on the tenth. Do this ten times, so that each subset is the test set once. Votes?
What data set should you generally test on? – The data set you trained your classifier on – A data set from a different tutor – Split your data set in half (by students), train on one half, test on the other half – Split your data set in ten (by actions). Train on nine of the ten subsets, test on the tenth. Do this ten times, so that each subset is the test set once. What are the benefits and drawbacks of each?
The dangerous one (though still sometimes OK) The data set you trained your classifier on If you do this, there is serious danger of over-fitting
The dangerous one (though still sometimes OK) You have ten thousand data points. You fit a parameter for each data point. “If data point 1, RIGHT. If data point 78, WRONG…” Your accuracy is 100% Your kappa is 1 Your model will neither work on new data, nor will it tell you anything.
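The "one parameter per data point" case can be simulated in a few lines. This is an illustrative sketch only: the labels are random noise, the "model" is a lookup table, and the fallback guess of RIGHT for unseen points is an assumption made for the demonstration.

```python
import random

def cohens_kappa(predicted, actual):
    """Cohen's kappa: agreement between two label lists, corrected for chance."""
    n = len(actual)
    observed = sum(p == a for p, a in zip(predicted, actual)) / n
    labels = set(predicted) | set(actual)
    expected = sum((predicted.count(lab) / n) * (actual.count(lab) / n) for lab in labels)
    if expected == 1.0:
        return 0.0  # kappa is undefined here; 0.0 is an arbitrary choice for this sketch
    return (observed - expected) / (1.0 - expected)

rng = random.Random(0)

# Ten thousand training points whose labels are pure noise.
train = {i: rng.choice(["RIGHT", "WRONG"]) for i in range(10_000)}
memorized = dict(train)  # "If data point 1, RIGHT. If data point 78, WRONG..."

def predict(point_id):
    # Points never seen in training get a fixed fallback guess.
    return memorized.get(point_id, "RIGHT")

train_kappa = cohens_kappa([predict(i) for i in train], list(train.values()))

# New data points from the same noise process.
new = {i: rng.choice(["RIGHT", "WRONG"]) for i in range(10_000, 20_000)}
new_kappa = cohens_kappa([predict(i) for i in new], list(new.values()))
```

On the training set the lookup table scores a kappa of 1; on the new points it does no better than chance.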
The dangerous one (though still sometimes OK) The data set you trained your classifier on When might this one still be OK? – Computing complexity-based goodness metrics such as BIC (Bayesian Information Criterion) – Determining the maximum possible performance of a modeling approach
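As a rough illustration of why a complexity-based metric can be computed on the training data, here is the standard BIC formula, k ln(n) - 2 ln(L), applied to two hypothetical fits. The log-likelihood values are invented for the example.

```python
import math

def bic(log_likelihood, n_params, n_points):
    """Bayesian Information Criterion; lower is better."""
    return n_params * math.log(n_points) - 2.0 * log_likelihood

# Two hypothetical models fit to the same 10,000 training points.
small_model_bic = bic(log_likelihood=-5200.0, n_params=5, n_points=10_000)
# A model that memorizes every point fits perfectly (log-likelihood 0)
# but pays for its 10,000 parameters.
memorizer_bic = bic(log_likelihood=0.0, n_params=10_000, n_points=10_000)
```

The parameter penalty makes the memorizing model score far worse than the small one, even though its fit to the training data is perfect.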
K-fold cross validation (standard) Split your data set in ten (by action). Train on nine of the ten subsets, test on the tenth. Do this ten times, so that each subset serves as the test set once. What can you infer from this? – Your detector will work with new data from the same students How often do we really care about this?
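A minimal sketch of the action-level split, assuming each action is identified only by its row index. The function name and the choice of 25 rows are illustrative.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every item is in the test set exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k roughly equal folds.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_items=25, k=10))
```

Because rows are shuffled without regard to who produced them, actions from the same student will usually land in both the training and test folds.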
K-fold cross validation (student-level) Split your data set in half (by student), train on one half, test on the other half What can you infer from this? – Your detector will work with data from new students from the same population (whatever it was) – Possible to do in RapidMiner – Not possible to do in Weka
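Splitting by student means assigning whole students, not individual actions, to folds. The sketch below assumes each logged action carries a student ID; the IDs and the function name are hypothetical.

```python
import random

def student_level_folds(student_ids, k=2, seed=0):
    """Yield (train_idx, test_idx) so that no student appears on both sides."""
    students = sorted(set(student_ids))
    random.Random(seed).shuffle(students)
    # Each student, with all of their actions, goes into exactly one fold.
    fold_of = {s: i % k for i, s in enumerate(students)}
    for fold in range(k):
        test_idx = [i for i, s in enumerate(student_ids) if fold_of[s] == fold]
        train_idx = [i for i, s in enumerate(student_ids) if fold_of[s] != fold]
        yield train_idx, test_idx

# Hypothetical action log: one student ID per logged action.
actions = ["s1", "s1", "s2", "s3", "s3", "s3", "s4", "s2"]
student_splits = list(student_level_folds(actions, k=2))
```

With k=2 this is the "split in half by student" design from the slide; a larger k gives student-level k-fold cross-validation.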
K-fold or leave-one-out Really not clear which one is best (as discussed in previous lecture) Certain kinds of re-sampling/bootstrapping/etc. are easier to do with k-fold cross-validation
A data set from a different tutor The most stringent test When your model succeeds at this test, you know you have a good/general model When it fails, it’s sometimes hard to know why
An interesting alternative Leave-out-one-tutor-cross-validation (cf. Baker, Corbett, & Koedinger, 2006) – Train on data from 3 or more tutors – Test on data from a different tutor – (Repeat for all possible combinations) – Good for giving a picture of how well your model will perform in new lessons
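The tutor-level scheme can be sketched in the same way as the student-level one, holding out one tutor at a time. The tutor names below are placeholders, and this shows only the simplest variant (train on all remaining tutors), not necessarily the exact procedure in the cited paper.

```python
def leave_one_tutor_out(tutor_ids):
    """Yield (held_out_tutor, train_idx, test_idx), holding out each tutor in turn."""
    for held_out in sorted(set(tutor_ids)):
        train_idx = [i for i, t in enumerate(tutor_ids) if t != held_out]
        test_idx = [i for i, t in enumerate(tutor_ids) if t == held_out]
        yield held_out, train_idx, test_idx

# Hypothetical log: the tutor (lesson) each action came from.
tutors = ["tutorA", "tutorB", "tutorC", "tutorD", "tutorB", "tutorA"]
tutor_splits = list(leave_one_tutor_out(tutors))
```

With four tutors, each round trains on three and tests on the fourth.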
Worth noting If you want to know if your model will work on new populations Cross-validate at the population level rather than the student level
Comments? Questions?
Homework 3 Let’s look at some of the homework 3 solutions Please comment on what’s right and wrong, what’s clever, etc. We’ll look at the approaches, the goodness, the final models
Homework 3 Now let’s take the best homework Any other ideas for how to come up with a better model? – Let’s try them!
Feature Engineering There are lots of fancy algorithms But typically your detector is no better than your features – Features that have good construct validity are more likely to produce a good model – Particularly nice example of this in Sao Pedro et al. (under review) In the next assignment, you’ll create your own features to try to produce a better model
Assignment 4 Let’s review Assignment 4
Comments? Questions?
Next Class Wednesday, February 15 3pm-5pm AK232 Feature engineering and feature distillation SPECIAL GUEST LECTURER: SUJITH GOWDA Assignments Due: 4. Feature Engineering
The End
Bonus Slides If there’s time
BKT with Multiple Skills
Conjunctive Model (Pardos et al., 2008) The probability a student can answer an item with skills A and B is P(CORR | A ∧ B) = P(CORR | A) × P(CORR | B) But how should credit or blame be assigned to the various skills?
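The product rule above extends naturally to any number of skills. A small sketch, with made-up per-skill probabilities:

```python
import math

def p_correct_conjunctive(p_correct_per_skill):
    """P(CORR) for an item needing all listed skills: the product of the per-skill values."""
    return math.prod(p_correct_per_skill)

# Made-up per-skill probabilities of a correct answer.
p_two_skills = p_correct_conjunctive([0.9, 0.8])         # about 0.72
p_three_skills = p_correct_conjunctive([0.9, 0.8, 0.7])  # about 0.504
```

Each additional required skill can only lower the predicted probability of a correct answer, which is part of what makes assigning blame after an error difficult.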
Koedinger et al.’s (2011) Conjunctive Model Equations for 2 skills
Koedinger et al.’s (2011) Conjunctive Model Generalized equations
Koedinger et al.’s (2011) Conjunctive Model Handles case where multiple skills apply to an item better than classical BKT
Other BKT Extensions? Additional parameters? Additional states?
Many others Compensatory Multiple Skills (Pardos et al., 2008) Clustered Skills (Ritter et al., 2009)