Classifiers, Part 1 (Week 1, Video 3)

Prediction
• Develop a model which can infer a single aspect of the data (predicted variable) from some combination of other aspects of the data (predictor variables)
• Sometimes used to predict the future
• Sometimes used to make inferences about the present

Classification
• There is something you want to predict (“the label”)
• The thing you want to predict is categorical: the answer is one of a set of categories, not a number
• CORRECT/WRONG (sometimes expressed as 0, 1); we’ll talk about this specific problem later in the course, within latent knowledge estimation
• HELP REQUEST / WORKED EXAMPLE REQUEST / ATTEMPT TO SOLVE
• WILL DROP OUT / WON’T DROP OUT
• WILL ENROLL IN MOOC A, B, C, D, E, F, or G

Where do those labels come from?
• In-software performance
• School records
• Test data
• Survey data
• Field observations or video coding
• Text replays

Classification
• Associated with each label is a set of “features”, which maybe you can use to predict the label

Skill         | pknow | time | totalactions | right
ENTERINGGIVEN |   …   |  …   |      …       | WRONG
ENTERINGGIVEN |   …   |  …   |      …       | RIGHT
USEDIFFNUM    |   …   |  …   |      …       | WRONG
ENTERINGGIVEN |   …   |  …   |      …       | RIGHT
REMOVECOEFF   |   …   |  …   |      …       | WRONG
REMOVECOEFF   |   …   |  …   |      …       | RIGHT
USEDIFFNUM    |   …   |  …   |      …       | RIGHT
…

Classification
• The basic idea of a classifier is to determine which features, in which combination, can predict the label
• (Same table of Skill, pknow, time, totalactions, and right as on the previous slide)
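To make the features-predict-label idea concrete, here is a minimal sketch (not from the lecture) using scikit-learn, a package not mentioned on the slides, chosen purely for illustration; the numeric feature values and the new row below are made up, since the slide's numbers are not reproduced above.

```python
from sklearn.tree import DecisionTreeClassifier

# Each row: [pknow, time in seconds, totalactions]; the skill column is omitted
# here to keep the example purely numeric.
X = [
    [0.30,  9, 1],   # ENTERINGGIVEN
    [0.80,  5, 1],   # ENTERINGGIVEN
    [0.20, 12, 2],   # USEDIFFNUM
    [0.75,  4, 1],   # ENTERINGGIVEN
    [0.40, 15, 3],   # REMOVECOEFF
    [0.65,  7, 1],   # REMOVECOEFF
    [0.70,  6, 1],   # USEDIFFNUM
]
y = ["WRONG", "RIGHT", "WRONG", "RIGHT", "WRONG", "RIGHT", "RIGHT"]  # the label

classifier = DecisionTreeClassifier().fit(X, y)
print(classifier.predict([[0.55, 8, 2]]))  # predicted label for a new action
```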

Classifiers
• There are hundreds of classification algorithms
• A good data mining package will have many implementations: RapidMiner, SAS Enterprise Miner, Weka, KEEL

Classification
• Of course, usually there are more than 4 features
• And more than 7 actions/data points

Domain-Specificity
• Specific algorithms work better for specific domains and problems
• We often have hunches for why that is
• But it’s more in the realm of “lore” than really “engineering”

Some algorithms I find useful
• Step Regression
• Logistic Regression
• J48/C4.5 Decision Trees
• JRip Decision Rules
• K* Instance-Based Classifiers
• There are many others!

Step Regression
• Not step-wise regression
• Used for binary classification (0, 1)

Step Regression
• Fits a linear regression function (as discussed in previous class) with an arbitrary cut-off
• Selects parameters, assigns a weight to each parameter, and computes a numerical value
• Then all values below 0.5 are treated as 0, and all values >= 0.5 are treated as 1

Example
• Y = 0.5a + 0.7b - 0.2c + 0.4d
• Cut-off 0.5
• The slides step through a small table of a, b, c, d values and the resulting Y, showing which rows fall below the cut-off (class 0) and which fall at or above it (class 1)

Quiz
• Y = 0.5a + 0.7b - 0.2c + 0.4d
• Cut-off 0.5
• Given a row of values for a, b, c, and d, compute Y: is the prediction 0 or 1?
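As a concrete illustration, here is a minimal Python sketch (not from the lecture) of a step regression prediction using the weights from the example above; the input rows are made up.

```python
def step_regression_predict(a, b, c, d, cutoff=0.5):
    """Compute the linear regression value, then threshold it into a 0/1 class."""
    y = 0.5 * a + 0.7 * b - 0.2 * c + 0.4 * d
    return 1 if y >= cutoff else 0

print(step_regression_predict(1, 0, 1, 0))  # Y = 0.3 -> class 0
print(step_regression_predict(0, 1, 0, 1))  # Y = 1.1 -> class 1
```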

Note
• In RapidMiner, step regression is obtained by running linear regression on binary data
• Other packages offer it through different functions

Step regression: should you use it?
• Step regression is not preferred by statisticians, due to the lack of a closed-form expression
• But it often does better in EDM, due to lower over-fitting

Logistic Regression
• Another algorithm for binary classification (0, 1)

Logistic Regression
• Given a specific set of values of the predictor variables
• Fits a logistic function to the data to find the frequency/odds of a specific value of the dependent variable

Logistic Regression
• The logistic function: p(m) = 1 / (1 + e^(-m))

m = a0 + a1v1 + a2v2 + a3v3 + a4v4…

Logistic Regression
• m = 0.2A + 0.3B + 0.5C
• The slides step through a table of A, B, C, m, and P(M) values: for instance, A = B = C = 0 gives m = 0 and p(m) = 0.5; m = -1 gives p(m) ≈ 0.27; and a very large m, such as m = 50, gives p(m) ≈ 1
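To make the mapping from m to a probability concrete, here is a minimal Python sketch (not from the lecture) using the logistic function and the example weights above; the input values are made up.

```python
import math

def logistic_probability(A, B, C):
    """m = 0.2A + 0.3B + 0.5C pushed through the logistic function."""
    m = 0.2 * A + 0.3 * B + 0.5 * C
    return 1.0 / (1.0 + math.exp(-m))

print(logistic_probability(0, 0, 0))      # m = 0  -> p = 0.5
print(logistic_probability(-5, 0, 0))     # m = -1 -> p ≈ 0.27
print(logistic_probability(100, 100, 0))  # m = 50 -> p ≈ 1.0
```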

Relatively conservative
• Thanks to its simple functional form, logistic regression is a relatively conservative algorithm
• I’ll explain this in more detail later in the course

Good for
• Cases where changes in value of predictor variables have predictable effects on probability of predicted variable class
• m = 0.2A + 0.3B + 0.5C: higher A always leads to higher probability
• But there are some data sets where this isn’t true!

What about interaction effects?
• A = Bad
• B = Bad
• A + B = Good

What about interaction effects?
• Ineffective Educational Software = Bad
• Off-Task Behavior = Bad
• Ineffective Educational Software PLUS Off-Task Behavior = Good

Logistic and Step Regression are good when interactions are not particularly common
• They can be given interaction effects through automated feature distillation (we’ll discuss this later)
• But they are not particularly optimal for this

What about interaction effects?
• Fast Responses + Material Student Already Knows -> Associated with Better Learning
• Fast Responses + Material Student Does Not Know -> Associated with Worse Learning

Decision Trees
• An approach that explicitly deals with interaction effects

Decision Tree
• Example tree: the root splits on KNOWLEDGE (<0.5 vs. >=0.5); the branches below split on TIME (<6s vs. >=6s) and TOTALACTIONS (<4 vs. >=4), and the leaves predict RIGHT or WRONG
• A new row (Skill = COMPUTESLOPE, with values for knowledge, time, and totalactions) is dropped down the tree; in the worked example it comes out RIGHT

Quiz
• Using the same tree, classify another COMPUTESLOPE row from its knowledge, time, and totalactions values: does it come out RIGHT or WRONG?
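As an illustration, here is a hand-coded Python sketch of a tree like the one described above; the thresholds come from the slide, but the exact branch layout and the example row's values are assumptions made for illustration only.

```python
def decision_tree_predict(knowledge, time_seconds, totalactions):
    if knowledge >= 0.5:
        # Higher prior knowledge: decide based on how many actions were taken
        return "RIGHT" if totalactions < 4 else "WRONG"
    else:
        # Lower prior knowledge: decide based on how long the student took
        return "RIGHT" if time_seconds >= 6 else "WRONG"

# A made-up COMPUTESLOPE row
print(decision_tree_predict(knowledge=0.6, time_seconds=4, totalactions=2))  # RIGHT
```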

Decision Tree Algorithms
• There are several
• I usually use J48, which is an open-source re-implementation in Weka/RapidMiner of C4.5 (Quinlan, 1993)

J48/C4.5
• Can handle both numerical and categorical predictor variables
• Tries to find the optimal split in numerical variables
• Repeatedly looks for the variable which best splits the data, in terms of predictive power
• Later prunes out branches that turn out to have low predictive power
• Note that different branches can have different features!

Can be adjusted…
• To split based on more or less evidence
• To prune based on more or less predictive power
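Here is a minimal sketch of these two adjustments using scikit-learn's DecisionTreeClassifier (which implements CART rather than J48/C4.5, but exposes analogous controls); the data below is made up.

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0.2, 10, 1], [0.8, 3, 1], [0.4, 7, 5], [0.9, 2, 2],
     [0.1, 12, 4], [0.7, 5, 1], [0.3, 9, 3], [0.6, 4, 2]]
y = [0, 1, 0, 1, 0, 1, 0, 1]

tree = DecisionTreeClassifier(
    min_samples_split=4,  # require more evidence before making a split
    ccp_alpha=0.01,       # prune more aggressively (higher value = more pruning)
).fit(X, y)

print(tree.predict([[0.5, 6, 2]]))
```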

Relatively conservative
• Thanks to the pruning step, it is a relatively conservative algorithm
• We’ll discuss conservatism in a later class

Good when data has natural splits

Good when multi-level interactions are common

Good when same construct can be arrived at in multiple ways
• A student is likely to drop out of college when he starts assignments early but lacks prerequisites
• OR when he starts assignments the day they’re due

Later Lectures
• More classification algorithms
• Goodness metrics for comparing classifiers
• Validating classifiers
• What does it mean for a classifier to be conservative?

Next Lecture
• Building regressors and classifiers in RapidMiner