Using Random Forest as a Tool for Policy Analysis


Using Random Forest as a Tool for Policy Analysis Reuben Ternes November 2012

Overview Part 1: Policy Analysis – Test Optional Part 2: The Weaknesses of Parametric Statistics Part 3: Data Mining: Random Forest as an Alternative Part 4: Real-World Example

Part 1: Policy Analysis – Test Optional

Test Optional? Lots of institutions are considering going 'test-optional' these days. Should yours? How can we use IR data to arrive at a reasonable policy recommendation? Just as a caveat: OU is not considering changing our admissions rules; this is more of a theoretical exercise. We're already partially test-optional.

Test-Optional Literature A 2003 study (Geiser with Studley) suggested that HSGPA was better than the SAT at predicting first-year GPAs. 80,000+ sample size, University of California data. Used regression, logistic regression, and some HLM to reach their conclusions. Fairly rigorous methodology. A 2007 follow-up study (Geiser & Santelices) found the same pattern with 4-year outcomes. Very influential.

Test-Optional Literature Since Then The literature on the topic is vast. Most of it supports the notion that HSGPA is a better predictor than the SAT/ACT. Many find that ACT/SAT scores add predictive validity; some do not, or find that the addition is trivial. Almost all of the literature uses a parametric regression (of some kind) to estimate the SAT/ACT's predictive validity.

Part 2: The Weaknesses of Parametric Statistics

What's Wrong with Regression? OLS regression is a fantastic tool. But its failings as a predictive tool are well known: missing data is difficult to deal with, categorical data is difficult to deal with, interactions must be modeled by hand, non-linearities must be modeled by hand, it handles data sets with lots of variables poorly, and overfitting is common. It is not a good tool for understanding the predictive contribution of ACT scores.

Regression Is a Parametric Technique All parametric statistical techniques make certain assumptions about the data. In regression: normality, homoscedasticity, linearity, and no perfect multicollinearity, among others…

Parametric Assumptions In practice, these assumptions are often incorrect. We still use parametric statistics because they are useful. But they are not perfect estimators of the predictive contributions of different variables! And, they don’t always make good predictions!

Regression: Categorical Data Imagine that you have one categorical variable with 10 categories. In regression, you have to code this as dummy (0/1) variables: nine of them, with one category left out as the reference. If you have 10 such variables, that is roughly 90 additional variables in your regression model. This reduces your degrees of freedom! Now imagine that you also have interaction terms with 10 other continuous variables. That's nearly 1,000 different terms!
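To make the explosion concrete, here is a minimal sketch (the category counts and column names are hypothetical, and pandas is used only for the encoding) of how dummy coding inflates the column count:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10 categorical predictors, each with 10 levels.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    f"cat_{i}": rng.integers(0, 10, size=1000).astype(str)
    for i in range(10)
})

# Dummy coding with a reference category drops one level per variable.
dummies = pd.get_dummies(df, drop_first=True)
print(df.shape[1], "raw columns ->", dummies.shape[1], "dummy columns")  # 10 -> 90
```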

Regression: Interactions & Non-Linearities Now imagine that you have 10 continuous variables. You should, at the very least, include quadratic and cubic versions of these variables in your model, just in case they are not linearly related. Now you have 30 variables. And don't forget your interaction terms!
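A quick sketch with scikit-learn's PolynomialFeatures (the variable counts are illustrative, not from the original presentation) shows how fast polynomial and interaction terms accumulate:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical design matrix: 10 continuous predictors.
X = np.random.rand(1000, 10)

# A degree-3 expansion adds squared and cubed terms plus every pairwise
# and three-way interaction product.
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)
print(X.shape[1], "predictors ->", X_poly.shape[1], "terms")  # 10 -> 285
```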

Regression: Overfitting But if you actually model all of this, you've probably gone too far. Eventually, you'll start modeling noise, not 'real' patterns. It is difficult to figure out when you've overfitted your data when using regression. What will happen? Your fit on the data you have will look good, but your actual predictions on new cases will be poor.
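A small simulated example (synthetic data and a deliberately over-expanded polynomial model, purely for illustration) shows the symptom: the in-sample fit looks good while out-of-sample predictions fall apart.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Simulated data: a weak linear signal buried in noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X[:, 0] + rng.normal(scale=2.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A heavily expanded model chases the noise: the training fit looks great,
# the out-of-sample fit does not.
model = make_pipeline(PolynomialFeatures(degree=4), LinearRegression())
model.fit(X_train, y_train)
print("train R^2:", round(model.score(X_train, y_train), 2))
print("test  R^2:", round(model.score(X_test, y_test), 2))
```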

Regression: Missing Data You must model missing data, or those cases are lost. It is common to fill in the median or mean value for continuous data, the most common response for categorical data, or to code values as missing. If you don't impute, then every case that has missing data, even if the case is mostly complete, won't be used in the final analysis. If the data isn't missing at random, you could be in serious trouble. Often, you don't know why data is missing.
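A minimal sketch, using made-up applicant values and scikit-learn's SimpleImputer, of the trade-off between listwise deletion and simple imputation:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical applicant records with scattered missing values.
df = pd.DataFrame({
    "hs_gpa": [3.6, 3.1, np.nan, 2.8, 3.9],
    "act":    [24, np.nan, 28, 21, np.nan],
})

# Listwise deletion: any row with a missing value disappears from the analysis.
print("rows kept after listwise deletion:", len(df.dropna()))  # 2 of 5

# Median imputation keeps every row, but fills gaps with a typical value,
# which can mislead if the data are not missing at random.
imputed = SimpleImputer(strategy="median").fit_transform(df)
print(imputed)
```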

Part 3: Data Mining: Random Forest as an Alternative

Recent History of Data Mining The Netflix Prize, Target, Yahoo, Amazon, etc. All are using prediction algorithms to match customers with products. The prediction tools they use are much more sophisticated than simple regression!

How Data Mining Can Help Inform Policy There are other ways to understand predictive contributions. Data mining/machine learning algorithms have improved greatly over the past decade and are now recognized as much better predictors than many standard regression techniques. Random Forest, in particular, stands out.

Random Forest Random Forest deals with missing data well, is robust to over-fitting, is relatively easy to use, can handle hundreds of different variables, accepts categorical (i.e., non-numerical) data, makes no distributional assumptions (it is non-parametric), and has good overall performance.

How Does It Work? It builds lots of (decision) trees, randomly. (That's why it's called Random Forest.)

An Example Decision Tree

How Random Forest Works: Overview Step 1) Build each tree from a random subset of the predictor variables. The size of the subset is the square root of the number of predictors (classification outcomes) or one third of it (continuous outcomes). Step 2) Grow the tree on N random cases drawn from the dataset with replacement (bootstrapping); for each tree, approximately 1/3 of the dataset isn't used. Step 3) After building the tree, 'run' the unused cases through it and record the result for each. Step 4) Repeat this process 500-1,000 times. Probabilities are generated by the total proportion of 'yes' votes; regression predictions are generated by averaging the trees' predictions.
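The same recipe expressed with scikit-learn's RandomForestRegressor, as a sketch on simulated data (the parameter values simply mirror the steps above; nothing here comes from the original analysis):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Simulated data: 12 predictors and a continuous outcome (think first-year GPA).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 12))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] * X[:, 2] + rng.normal(size=2000)

forest = RandomForestRegressor(
    n_estimators=500,   # Step 4: repeat 500-1,000 times
    max_features=1/3,   # Step 1: random subset of predictors (~p/3 for regression)
    bootstrap=True,     # Step 2: draw N cases with replacement
    oob_score=True,     # Step 3: score each tree on the ~1/3 of cases it never saw
    random_state=0,
)
forest.fit(X, y)
print("out-of-bag R^2:", round(forest.oob_score_, 3))
```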

The Random Part Is Important You could build one giant decision tree with dozens of variables. But it would be big. Too big. It suffers from some of the same problems as standard regression techniques (it overfits, poorly models interaction effects, etc.). Instead, Random Forest uses random elements to its advantage: 1) it builds many smaller trees (500-1,000) using a random sample of the predictors, and 2) it samples N cases with replacement.

Why Make Many Random Trees? The trees are smaller, and smaller trees are easier to deal with. That means you can make a lot of them. Aggregating lots of small trees does a better job of capturing interaction effects without overfitting. Ditto for non-linearities (the split point on any continuous predictor will be different for every tree).

Why Sample with Replacement? It keeps N high, but creates a hold-out set. This hold-out set is used to create an (unbiased) estimate of the error rate. This means you don't need a separate test set! (Essentially, every tree has a training data set and a test data set rolled into one.) There are known issues with sampling with replacement: it does not affect raw predictions, but it does affect variable importance measures.
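A sketch, on simulated data, showing that the out-of-bag estimate typically lands close to the score on a genuinely held-out test set:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Simulated data with a known signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(3000, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=3000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

# Each tree is scored on the bootstrap cases it never saw, so the OOB estimate
# should track the score on data the forest never touched.
print("OOB R^2:     ", round(forest.oob_score_, 3))
print("held-out R^2:", round(forest.score(X_test, y_test), 3))
```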

Pause for Questions Let me pause for questions before continuing.

Random Forest: Results Random Forest results are not like regression output. You get a variable importance list, based on node purity measures (Gini impurity), whose numbers are pretty much uninterpretable. There is no explanation of how variables interact with the outcome, and no established method to create p-values. You really only get: prediction results, a vague sense of how important each variable is, and either an error rate (categorical outcomes) or a percent of total variance explained (continuous outcomes).
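For example, the only 'coefficient table' you can pull out of a fitted forest is an importance ranking. A sketch on simulated data follows; the predictor names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical predictors and a simulated outcome.
rng = np.random.default_rng(0)
cols = ["hs_gpa", "act", "ses", "credits", "age"]
X = pd.DataFrame(rng.normal(size=(1000, 5)), columns=cols)
y = 0.8 * X["hs_gpa"] + 0.3 * X["act"] + rng.normal(size=1000)

forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

# Impurity-based importances: fine for ranking predictors, but they carry
# no units, no sign, and no p-values.
importances = pd.Series(forest.feature_importances_, index=cols)
print(importances.sort_values(ascending=False))
```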

Testing Policy with Random Forest If you can't get p-values, how can you do policy analysis with Random Forest? What you can do is run various sets of predictions and compare the accuracy of those predictions, systematically excluding the variables that you are interested in examining.
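A sketch of that compare-with-and-without approach on simulated data (the column names and effect sizes are invented; the author's actual analysis averaged five trials of 500 trees each):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Simulated applicant data; in practice this would be institutional records
# with first-year GPA as the outcome.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "hs_gpa": rng.normal(3.2, 0.4, 5000),
    "act":    rng.normal(24, 4, 5000),
    "ses":    rng.normal(0, 1, 5000),
})
df["fy_gpa"] = (0.6 * df["hs_gpa"] + 0.02 * df["act"]
                + 0.1 * df["ses"] + rng.normal(0, 0.5, 5000))

def oob_variance_explained(predictors):
    """Fit a forest on the given predictor set and return its out-of-bag R^2."""
    forest = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
    forest.fit(df[predictors], df["fy_gpa"])
    return forest.oob_score_

with_act = oob_variance_explained(["hs_gpa", "act", "ses"])
without_act = oob_variance_explained(["hs_gpa", "ses"])
print(f"with ACT: {with_act:.3f}   without ACT: {without_act:.3f}")
```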

Part 4: Real-World Example of Random Forest

The Question Should your institution go test-optional? Another way to ask this question: how much do admissions tests tell us about future student outcomes? We will test just first-year GPA, but you could test anything (retention, graduation, etc.).

Admissions Models I consider three models. 1) Saturated: an extreme (unrealistic) amount of data on incoming students, obtained late during the admissions cycle; more information than a human could process to make a decision; about 50 variables we collect during the admissions cycle. 2) Just HS GPA and ACT scores. 3) HS GPA, ACT scores, and one measure of SES, obtained by aggregating the percentage of Pell students by zip code for OU over the last 10 years. I test this model because one of the common complaints against standardized tests is that they mostly measure SES.

Results: Saturated Model Averaged over 5 trials (500 trees per trial). All variables – 29.9% of total variance explained. Exclude ACT scores – 29.7% of total variance explained. Conclusion: ACT scores do not add much information to the full model, though they probably add something. But this is an unrealistic model for admissions decisions, so it doesn't answer our question.

Results: HS GPA + ACT Scores Only Model Averaged over 5 trials (500 trees per trial). HS GPA + ACT scores – 21.2% of total variance explained. Exclude ACT scores – 20.2% of total variance explained. Conclusion: ACT scores improve predictions by a noticeable, but still small, amount at OU.

Results: HS GPA, ACT Scores, SES Model Averaged over 5 trials (500 trees per trial). HS GPA, ACT, and SES – 25.0% of total variance explained. Exclude ACT scores – 21.6% of total variance explained. Conclusion 1: ACT scores improve predictions noticeably. Conclusion 2: There are some very important and non-trivial interaction effects between ACT scores and SES. If our goal is to develop predictive decision rules that correlate with academic success, we are leaving a lot of useful information out by not considering SES data.