Using a Random Forest model to predict enrollment Ward Headstrom Institutional Research Humboldt State University CAIR 2013 1.

Slides:



Advertisements
Similar presentations
Copyright , SPSS Inc. 1 Practical solutions for dealing with missing data Rob Woods Senior Consultant.
Advertisements

The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 4: An Introduction To Data Mining (II) Johannes Gehrke
PRE-SCHOOL QUANT WORKSHOP II R THROUGH EXCEL. NEW YORK TIMES INFOGRAPHICS GALARY The Jobless Rate for People Like You Home Prices in Selected Cities For.
Writing functions in R Some handy advice for creating your own functions.
Random Forest Predrag Radenković 3237/10
Plackett-Burman Design of Experiments 1. Statistics developed in the early 20 th century 2. Tool to help improve yields in farming 3. Many types of experiments/techniques.
CART: Classification and Regression Trees Chris Franck LISA Short Course March 26, 2013.
A Quick Overview By Munir Winkel. What do you know about: 1) decision trees 2) random forests? How could they be used?
Paper presentation for CSI5388 PENGCHENG XI Mar. 23, 2005
Ann Arbor ASA ‘Up and Running’ Series: SPSS Prepared by volunteers of the Ann Arbor Chapter of the American Statistical Association, in cooperation with.
Decision Tree Rong Jin. Determine Milage Per Gallon.
Machine Learning: Ensemble Methods
Introduction to SPSS Edward A. Greenberg, PhD
Middle Eastern Women 2005 Optional Business Computer Skills Review Paula Ecklund.
Chapter 9 – Classification and Regression Trees
Generic Approaches to Model Validation Presented at Growth Model User’s Group August 10, 2005 David K. Walters.
Regression Chapter 16. Regression >Builds on Correlation >The difference is a question of prediction versus relation Regression predicts, correlation.
The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition. Conducting post-hoc tests of compound coefficients using simple slopes for a categorical.
ISCG8025 Machine Learning for Intelligent Data and Information Processing Week 3 Practical Notes Application Advice *Courtesy of Associate Professor Andrew.
Nurissaidah Ulinnuha. Introduction Student academic performance ( ) Logistic RegressionNaïve Bayessian Artificial Neural Network Student Academic.
Introduction to R Introductions What is R? RStudio Layout Summary Statistics Your First R Graph 17 September 2014 Sherubtse Training.
The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition. Conducting post-hoc tests of compound coefficients using simple slopes for a categorical.
Random Forests Ujjwol Subedi. Introduction What is Random Tree? ◦ Is a tree constructed randomly from a set of possible trees having K random features.
Data Analytics CMIS Short Course part II Day 1 Part 1: Introduction Sam Buttrey December 2015.
© 2015 by Wade Rogers Introduction to R Cytomics Workshop December, 2015.
Competition II: Springleaf Sha Li (Team leader) Xiaoyan Chong, Minglu Ma, Yue Wang CAMCOS Fall 2015 San Jose State University.
Combining multiple learners Usman Roshan. Decision tree From Alpaydin, 2010.
STA302: Regression Analysis. Statistics Objective: To draw reasonable conclusions from noisy numerical data Entry point: Study relationships between variables.
Descriptive Statistics using R. Summary Commands An essential starting point with any set of data is to get an overview of what you are dealing with You.
BY International School of Engineering {We Are Applied Engineering} Disclaimer: Some of the Images and content have been taken from multiple online sources.
Predicting Energy Consumption in Buildings using Multiple Linear Regression Introduction Linear regression is used to model energy consumption in buildings.
Data Science Credibility: Evaluating What’s Been Learned
Machine Learning: Ensemble Methods
A Smart Tool to Predict Salary Trends of H1-B Holders
Elizabeth R McMahon 14 April 2017
A Statistical Analysis Utilizing Detailed Institutional Data
Introduction to Machine Learning
Chapter 7. Classification and Prediction
Introduction to Machine Learning and Tree Based Methods
The Commute: The Battle of Finding Distance
Eco 6380 Predictive Analytics For Economists Spring 2016
V.A Student Activity Sheet 3: Growth Model
Trees, bagging, boosting, and stacking
University of Chicago School of Social Service Administration
Predict House Sales Price
NBA Draft Prediction BIT 5534 May 2nd 2018
Employee Turnover: Data Analysis and Exploration
Finding Answers through Data Collection
Exam #3 Review Zuyin (Alvin) Zheng.
Teaching Analytics with Case Studies: Finding Love in a Classification Tree Ruth Hummel, PhD JMP Academic Ambassador.
FAFSA What Students and Families Need to Know
Dealing with missing data
Just Enough to be Dangerous: Basic Statistics for the Non-Statistician
SQL for Calculating Likelihood Ratios
Experiments in Machine Learning
Azure Machine Learning Studio: Four Tips from the Pros
Statistical Inference about Regression
Introduction to Predictive Modeling
1/18/2019 ST3131, Lecture 1.
Predicting Students’ Course Success Using Machine Learning Approach
Analyzing Stability in Colorado K-12 Public Schools
Analytics – Statistical Approaches
Classification with CART
Analysis for Predicting the Selling Price of Apartments Pratik Nikte
Data analysis with R and the tidyverse
Interpreting Data for use in Charts and Graphs.
Analysis on Accelerated Learning Cohorts
Developing Honors College Admissions Rubric to Ensure Student Success
The Web and Using the VTAC Analysis Tool
STT : Intro. to Statistical Learning
Presentation transcript:

Using a Random Forest model to predict enrollment Ward Headstrom Institutional Research Humboldt State University CAIR

Overview Forecasting enrollment to assist University planning The R language The Random Forest model Binary Logistic Regression model Cautions and Conclusions The example I am going to use is projecting New enrollment. These techniques can easily be applied to predicting… Retention Graduation Other future events 2

Simple enrollment projections 1) how many student enrolled last year? 2) enhance by breaking it down into subgroups 3) possibly use linear regressions (trends) 4) enhance further by looking at to-date information 3

2014 projection = 2013 “to-date” yield * 2014 apps = 1,369/3,692*4,361 = 1,617 4

5

6

But what about… Why applicant yield might not be the best predictor: Admits more likely to enroll Confirms more likely to enroll Denied or withdrawn will not enroll Housing deposits may be good indicator of intent Local applicants more likely than distant applicants Certain majors may be more likely to enroll White applicants may be more likely than URM Ideally, we would like to use all the data we have about applicants to predict how likely they are to enroll. Variables: demographics, academics, actions to-date Model 1: Random Forest Model 2: Binary logistic regression 7

Data files All the data fields you think might help predict yields Major discipline Region of origin Sex Ethnicity Academic preparation Actions Visited campus Confirmed intent to enroll Paid housing deposit Be careful of institutional actions admission cancellation 8

The language R CAIR comment: an emphasis on R would be “limiting to institutions that used other software”. The first (only?) implementation of Random Forest models R is open source – free to use Many online tutorials: guide-for-complete-beginners/ guide-for-complete-beginners/ 9

R and RStudio overview 4 panes – help, history, import dataset, packages Function-based: function(data,options) Case-specific language Object types: data.frame, vector, scalars, factor, models Useful commands: command line as calculator assignment -> or <- functions: na.omit(), summary(), table(), tolower() subsets: dataframe[row select, column select] graphics: hist(), plot() library(), especially library(randomForest) 10

RStudio 11

Import data into R 12

Decision Trees 13

Random Forest Model Developed by Leo Brieman and Adele Cutler Plan: grow a random forest of 500 decision trees randomForest(cenreg~variable1+variable2+…,data=train) Randomly picks fields for each tree Randomly selects rows to exclude from each tree Measure of variable importance Out Of Box estimate of error rate and Confusion matrix Run new data through all 500 trees and let them vote 14

Random Forest model of applicant yield 15

varImpPlot(rf) 16

1 st tree in Random Forest For categorical predictors, the splitting point is represented by an integer, whose binary expansion gives the identities of the categories that goes to left or right. For example, if a predictor has four categories, and the split point is 13. The binary expansion of 13 is (1, 0, 1, 1) (because 13 = 1*2^0 + 0*2^1 + 1*2^2 + 1*2^3), so cases with categories 1, 3, or 4 in this predictor get sent to the left, and the rest to the right. 17

Testing and making a Projection Random Forest projects that 42% of current Spring apps will enroll, compared to 45% of last year’s apps to-date and 34% of training years’. f 18

Binary Logistic Regression 19

Binary logistic regression model of applicant yield 20

21

22

BLR model – testing and projecting Binary Logistic Regression predicted 324 of current Spring applicants will enroll, compared to 340 projected by Random Forest model. 23

Cautions and Conclusions Null or new values in variables will cause problems Beware of to-date variables (e.g. intent_td). Make sure that procedures have not changed in a way that will affect behavior. R is a very powerful tool which can be very useful if you are willing to invest some time learning it. Multivariate models may improve the accuracy of your predictions. Corroborate with simple models and consultation with involved staff. 24

Questions? Comments? This presentation: My 25