KAIR 2013, Nov 7, 2013
A Data Driven Analytic Strategy for Increasing Yield and Retention at Western Kentucky University
Matt Bogard
Office of Institutional Research, Western Kentucky University
Purpose
- Are there opportunities at the applicant stage to improve our yield, implement cost savings, and shape our freshman class to maximize retention?
- Is there a way of knowing which applicants are most likely to enroll and be retained?
Methodology
- Machine Learning vs. Statistical Inference: emphasis on accurate predictions vs. inferences about the particular roles of specific variables
- Decision Trees
- Ensemble Methods
- Gradient Boosting
- Neural Networks
Decision Tree Basics: Algorithm
- Chooses variables and split values that partition the data into groups that differ on the outcome of interest (retention)
- Evaluates all possible splits and ranks them by an adjusted χ² p-value
- Prunes the tree, using validation data, to the fewest splits that give the most accurate predictions
- The final model is characterized by the split values for each explanatory variable, which form a set of rules for classifying new cases
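The split search described above can be sketched for a single numeric predictor. This is a minimal illustration, not the procedure used by any particular software: the Bonferroni multiplier stands in for the "adjusted" p-value, and the GPA and retention values are hypothetical toy data.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # degenerate table: no evidence of association
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

def best_split(x, y):
    """Return (threshold, adjusted_p) for the best binary split of x on a 0/1 outcome y."""
    values = sorted(set(x))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
    best = (None, 1.0)
    for t in candidates:
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        a, b = sum(left), len(left) - sum(left)
        c, d = sum(right), len(right) - sum(right)
        stat = chi2_2x2(a, b, c, d)
        p = math.erfc(math.sqrt(stat / 2))     # upper tail, 1 degree of freedom
        p_adj = min(1.0, p * len(candidates))  # Bonferroni adjustment (an assumption)
        if p_adj < best[1]:
            best = (t, p_adj)
    return best

# Hypothetical toy data: high school GPA and retention (1 = retained).
gpa = [2.0, 2.2, 2.5, 2.7, 3.0, 3.2, 3.5, 3.8]
retained = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, p_adj = best_split(gpa, retained)
print(threshold, p_adj)
```

A full tree would apply the same search recursively within each partition and across all candidate variables, then prune against validation data.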
Basic Decision Tree Visualization
Benefits of Decision Trees
"Approaching problems by looking for a data model imposes an a priori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems." (Leo Breiman, Statistical Modeling: The Two Cultures, Statistical Science, 2001)
- Non-parametric and non-linear
- No distributional assumptions
- Treat the data generation process as unknown
- No required functional form for predictors
- Identify complex interactions
Ensemble Methods
- Generalization error: how well a model predicts across training, validation, and test data sets
- Ensemble: the combined predictions of several learners or models
- "The generalization error of a weighted combination of predictors in an ensemble is equal to the average error of the individual predictors minus 'disagreement' among them" (Krogh & Sollich, 1997, Statistical Mechanics of Ensemble Learning, Physical Review E)
- Ensemble error is therefore no larger than the weighted average of the errors of the individual predictors
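The decomposition quoted above can be checked numerically for a single case under squared error. The sketch below assumes non-negative weights that sum to 1; the predictions, weights, and target are hypothetical numbers chosen only for illustration.

```python
def ensemble_decomposition(preds, weights, target):
    """Return (ensemble error, weighted average error, disagreement) for one case."""
    ensemble = sum(w * f for w, f in zip(weights, preds))
    avg_error = sum(w * (f - target) ** 2 for w, f in zip(weights, preds))
    # Disagreement: weighted spread of the individual predictions around the ensemble.
    disagreement = sum(w * (f - ensemble) ** 2 for w, f in zip(weights, preds))
    ensemble_error = (ensemble - target) ** 2
    return ensemble_error, avg_error, disagreement

# Hypothetical predicted retention probabilities from three models for one student
# who was in fact retained (target = 1.0).
preds = [0.9, 0.6, 0.4]
weights = [0.5, 0.3, 0.2]
ens_err, avg_err, disagreement = ensemble_decomposition(preds, weights, 1.0)

# The ensemble error equals the average error minus the disagreement.
print(ens_err, avg_err - disagreement)
```

Because the disagreement term is never negative, the ensemble error cannot exceed the weighted average error of its members.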
Gradient Boosting
- Boosting algorithms build an ensemble from a series of weak learners
- Each tree in the series is fit to resampled training data, weighted by the classification accuracy of the previous tree
- The combined series of trees forms a single model
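Neural Networks
- A nonlinear model of complex relationships with 'hidden' layers
- Using logistic activation functions, NNETs can be visualized as an ensemble of logits:

$$
\begin{aligned}
Y   &= W_0 + W_1 H_1 + W_2 H_2 + W_3 H_3 + W_4 H_4 \\
H_1 &= \operatorname{logit}(w_{10} + w_{11} x_1 + w_{12} x_2) \\
H_2 &= \operatorname{logit}(w_{20} + w_{21} x_1 + w_{22} x_2) \\
H_3 &= \operatorname{logit}(w_{30} + w_{31} x_1 + w_{32} x_2) \\
H_4 &= \operatorname{logit}(w_{40} + w_{41} x_1 + w_{42} x_2)
\end{aligned}
$$

Here logit(z) denotes the logistic activation function, 1 / (1 + e^{-z}).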
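The forward pass of the network above (two inputs, four hidden units, one output) can be written in a few lines. This is a sketch only: the weight values are hypothetical, and no training step is shown.

```python
import math

def logistic(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, hidden_weights, output_weights):
    """Compute Y for inputs x1, x2.

    hidden_weights: one (w_j0, w_j1, w_j2) tuple per hidden unit H_j.
    output_weights: [W0, W1, ..., W4], with W0 the output bias.
    """
    hidden = [logistic(w0 + w1 * x1 + w2 * x2) for w0, w1, w2 in hidden_weights]
    return output_weights[0] + sum(W * h for W, h in zip(output_weights[1:], hidden))

# Hypothetical weights for the four hidden units and the output layer.
hidden_weights = [(0.1, 0.8, -0.5), (-0.3, 0.2, 0.9), (0.0, -1.1, 0.4), (0.5, 0.3, 0.3)]
output_weights = [-0.2, 1.0, 0.7, -0.4, 0.6]
print(forward(1.0, 2.0, hidden_weights, output_weights))
```

Each hidden unit has the same form as a logistic regression on the inputs, which is why the network can be read as a weighted combination of logistic models.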
Gradient Boosting vs. Decision Trees vs. NNETs vs. Logistic Regression
- Decision trees and gradient boosting are both robust to the data generation process
- Decision trees have a more transparent model structure, which is lost in ensemble methods like gradient boosting and in neural networks
- Neural networks have issues with input selection and are more complex to train
- A decision tree's posterior probability distribution may not be very smooth
Gradient Boosting vs. Decision Trees vs. Logistic Regression
Logistic regression:
- Provides a smooth posterior probability distribution
- Has a less transparent model structure than decision trees, but a more transparent one than gradient boosting
- Can be used for inference or as an agnostic learning algorithm based on a specified functional form*
*Some may refuse to make this distinction and make inferences where inappropriate
Machine Learning vs. Inference
Trees can guide and direct further inferential work, but can be misleading in terms of causal relationships if you are not careful
Fitting the Models
Results
- Focus: how well the model predicts behavior, rather than inferences about the roles of specific variables
- Tradeoff between discrimination (measured by ROC) and model calibration (Cook, 2007)
- Gradient boosting outperformed the other models based on calibration
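The difference between discrimination and calibration can be illustrated with two sets of scores that rank applicants identically but differ in how well the predicted probabilities match observed rates. This is a minimal sketch with hypothetical scores and outcomes; the two-bin calibration table is a simplification chosen for brevity.

```python
def auc(scores, labels):
    """Area under the ROC curve: chance a random positive outranks a random negative."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)  # ties count 1/2
    return wins / (len(pos) * len(neg))

def calibration_table(scores, labels, n_bins=2):
    """Mean predicted probability vs. observed rate within score-sorted bins."""
    ranked = sorted(zip(scores, labels))
    size = len(ranked) // n_bins
    rows = []
    for b in range(n_bins):
        # The last bin absorbs any leftover cases.
        chunk = ranked[b * size:] if b == n_bins - 1 else ranked[b * size:(b + 1) * size]
        mean_pred = sum(s for s, _ in chunk) / len(chunk)
        observed = sum(l for _, l in chunk) / len(chunk)
        rows.append((mean_pred, observed))
    return rows

# Hypothetical outcomes (1 = retained) and two models' predicted probabilities.
labels = [0, 0, 1, 0, 1, 0, 1, 1]
scores_a = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
scores_b = [s * 0.5 for s in scores_a]  # same ranking, probabilities halved

print(auc(scores_a, labels), auc(scores_b, labels))
print(calibration_table(scores_a, labels))
print(calibration_table(scores_b, labels))
```

Both models have the same ROC area because they order the cases the same way, but only the first model's predicted probabilities line up with the observed rates in each bin.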
Scorecard
Using our models, we can sort applicants into 4 categories for enrollment propensity and predicted retention.
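One way such a four-category sort might look is a two-by-two grid over the two model scores. The sketch below is an assumption for illustration only: the 0.5 cutoffs, the category labels, and the applicant records are hypothetical and are not taken from the scorecard itself.

```python
def scorecard_category(p_enroll, p_retain, enroll_cut=0.5, retain_cut=0.5):
    """Assign an applicant to one of four cells based on two predicted probabilities."""
    likely_enroll = p_enroll >= enroll_cut   # hypothetical cutoff
    likely_retain = p_retain >= retain_cut   # hypothetical cutoff
    if likely_enroll and likely_retain:
        return "high enrollment, high retention"
    if likely_enroll:
        return "high enrollment, low retention"
    if likely_retain:
        return "low enrollment, high retention"
    return "low enrollment, low retention"

# Hypothetical applicants: (id, predicted enrollment propensity, predicted retention).
applicants = [("A", 0.82, 0.91), ("B", 0.75, 0.35), ("C", 0.20, 0.88), ("D", 0.15, 0.30)]
for applicant_id, p_enroll, p_retain in applicants:
    print(applicant_id, scorecard_category(p_enroll, p_retain))
```

In practice the cutoffs would be chosen from the score distributions and recruitment goals rather than fixed at 0.5.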
Implementation: Use advanced analytics to develop a strategic recruitment and retention strategy
Ad hoc Reports
- Report by counselor/region/territory
- Report by county/school
- Report by student demographics
- Report by prospect source
- …other??
IR-DSS
Detail Reporting
Additional Reading
Bogard, M.T. (2013). A Data Driven Analytic Strategy for Increasing Yield and Retention at Western Kentucky University Using SAS Enterprise BI and SAS Enterprise Miner. Paper 044-2013. Proceedings of the SAS® Global Forum 2013 Conference. Cary, NC: SAS Institute Inc.
de Ville, Barry. (2006). Decision Trees for Business Intelligence and Data Mining Using SAS® Enterprise Miner. SAS® Institute.
de Ville, Barry and Neville, Padraic. (2013). SAS® Institute.
Friedman, Jerome H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29, 1189-1232. Available at http://stat.stanford.
Hastie, Tibshirani and Friedman. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition. Springer-Verlag.
Breiman, L. (2001). Statistical Modeling: The Two Cultures. Statistical Science, 16 (3), 199-231.
Cook, Nancy R. (2007). Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction. Circulation, 115 (7), 928-935.
Krogh, A. & Sollich, P. (1997, January). Statistical mechanics of ensemble learning. Physical Review E (Statistical Physics, Plasmas, Fluids, and Related Interdisciplinary Topics), 55 (1), 811-825.