A Core Curriculum for Undergraduate Data Science Todd Iverson Brant Deppa Silas Bergen Tisha Hooks April Kerby Chris Malone Winona State University
Supervised Learning DSCI 425 Brant Deppa, Ph.D. Professor of Statistics & Data Science Winona State University bdeppa@winona.edu
Course Topics
Numeric Response ($Y$)
  Introduction to Supervised Learning
  Review of multiple regression (predictors vs. terms)
  Cross-validation for numeric response (discussed throughout)
  Automatic term selectors (ACE/AVAS/MARS)
  Projection pursuit regression (PPR) (neural networks)
  Penalized/regularized regression (ridge, LASSO, ElasticNet)
  Dimension reduction (PCR and PLS regression)
  Tree-based models (CART, bagging, random forests, boosting, treed regression)
  Nearest neighbor regression
Course Topics (cont’d)
Categorical/Nominal/Ordinal Response ($Y$)
  Introduction to classification problems
  Nearest neighbor classification
  Naïve Bayes classification
  Tree-based models for classification
  Discriminant analysis
  Neural networks
  Support vector machines (SVM)
  Multiple logistic regression
  Blending/stacking models (in progress)
Supervised Learning – General Problem
$\hat{Y}$ = predicted value if the response is numeric
$\hat{Y}$ = predicted class if the response is nominal or ordinal, or $\hat{Y} = P(\text{class } k \mid \mathbf{X})$
Cross-validation (CV) methods are generally used for this purpose.
Cross-Validation – Split Sample
K-Fold Cross-Validation
The data are divided into $k$ roughly equal-sized subgroups. Each subgroup acts as the validation set in turn, with the model fit to the remaining $k-1$ subgroups. The average error across all validation sets is computed and used to choose between rival models.
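The k-fold procedure above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the R used in the course; the function names and the "predict the training mean" model are hypothetical stand-ins for whatever candidate model is being evaluated.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_mean(y_train):
    """Hypothetical stand-in model: predict the training mean everywhere."""
    return sum(y_train) / len(y_train)

def k_fold_cv_mse(y, k=5, seed=0):
    """Average validation MSE across the k folds."""
    folds = k_fold_indices(len(y), k, seed)
    fold_mse = []
    for fold in folds:
        held_out = set(fold)
        # Fit on everything except the current fold
        y_train = [y[i] for i in range(len(y)) if i not in held_out]
        pred = fit_mean(y_train)
        # Score on the held-out fold only
        fold_mse.append(sum((y[i] - pred) ** 2 for i in fold) / len(fold))
    return sum(fold_mse) / k

y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
cv_error = k_fold_cv_mse(y, k=5)
```

To compare rival models, the same folds would be reused for each model and the model with the smaller average error preferred.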
Bootstrap Cross-Validation
Data: $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)$, where the $\mathbf{x}_i$'s are the p-dimensional predictor vectors.
Bootstrap sample: $(\mathbf{x}_1^*, y_1^*), (\mathbf{x}_2^*, y_2^*), \ldots, (\mathbf{x}_n^*, y_n^*)$, where each $(\mathbf{x}_i^*, y_i^*)$ is a randomly selected observation from the original data, drawn with replacement.
Observations not selected (i.e. out-of-bootstrap, OOB) constitute the validation set. We can then calculate quality-of-prediction metrics for each iteration. This process is repeated a large number of times (B = 500, 1000, 5000, etc.).
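A rough sketch of the bootstrap scheme just described, again in plain Python with hypothetical names: each iteration draws n indices with replacement, treats the never-drawn indices as the OOB validation set, and scores a stand-in model (the in-bag mean) on them.

```python
import random

def bootstrap_split(n, rng):
    """Draw n indices with replacement; indices never drawn are OOB."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    oob = [i for i in range(n) if i not in drawn]
    return in_bag, oob

def bootstrap_cv_mse(y, B=200, seed=1):
    """Average OOB mean squared error over B bootstrap iterations."""
    rng = random.Random(seed)
    errors = []
    for _ in range(B):
        in_bag, oob = bootstrap_split(len(y), rng)
        if not oob:  # rare: every observation was drawn, nothing to validate on
            continue
        # Stand-in model (an assumption): predict the in-bag mean
        pred = sum(y[i] for i in in_bag) / len(in_bag)
        errors.append(sum((y[i] - pred) ** 2 for i in oob) / len(oob))
    return sum(errors) / len(errors)

y = [3.0, 5.0, 4.0, 8.0, 9.0, 7.0, 12.0, 11.0, 15.0, 14.0]
oob_error = bootstrap_cv_mse(y)
```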
Monte Carlo Cross-Validation (MCCV)
Because each of the cross-validation strategies has an element of randomness, we can expect the results to vary from one CV run to the next. With MCCV we conduct split-sample CV multiple times and aggregate the results to quantify predictive performance for a candidate model.
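As an illustration of the idea, the sketch below repeats a random train/validation split many times and summarizes the validation errors. It is plain Python with hypothetical names; the 80/20 split fraction, the number of repetitions, and the "predict the training mean" model are all assumptions made for the example.

```python
import random

def split_sample_mse(y, train_frac, rng):
    """One split-sample CV: random train/validation split, validation MSE."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    n_train = int(train_frac * len(y))
    train, valid = idx[:n_train], idx[n_train:]
    pred = sum(y[i] for i in train) / len(train)  # stand-in model
    return sum((y[i] - pred) ** 2 for i in valid) / len(valid)

def mccv(y, n_reps=100, train_frac=0.8, seed=2):
    """Repeat split-sample CV n_reps times; return mean, SD, and all results."""
    rng = random.Random(seed)
    results = [split_sample_mse(y, train_frac, rng) for _ in range(n_reps)]
    mean = sum(results) / n_reps
    sd = (sum((r - mean) ** 2 for r in results) / (n_reps - 1)) ** 0.5
    return mean, sd, results

y = [float(v) for v in (3, 5, 4, 8, 9, 7, 12, 11, 15, 14,
                        13, 18, 17, 21, 20, 19, 24, 26, 25, 30)]
mean_mse, sd_mse, all_mse = mccv(y)
```

The spread of `all_mse` shows how much a single split-sample result can vary, which is the motivation for aggregating.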
Multiple Regression
Suppose we have a numeric response ($Y$) and a set of predictors $X_1, X_2, \ldots, X_p$. The multiple regression model is given by
$$E(Y \mid \mathbf{X}) = \beta_0 + \beta_1 U_1 + \beta_2 U_2 + \ldots + \beta_{k-1} U_{k-1}$$
$$\mathbf{Y} = \beta_0 + \beta_1 \mathbf{U}_1 + \ldots + \beta_{k-1} \mathbf{U}_{k-1} + \mathbf{e}$$
where $U_j$ = the $j^{th}$ term. Terms are functions of predictors.
Types of Terms
Intercept: $U_0 = 1$
Predictor terms: $U_j = X_k$. Terms can be the predictors themselves as long as the predictor is meaningfully numeric (i.e. $X_k$ = a count or a measurement).
Polynomial terms: $U_j = X_k^p$. These terms are integer powers of numeric predictors, e.g. $X_k^2$, $X_k^3$, etc.
Transformation terms: $U_j = X_k^{(\lambda)}$. Here $X_k^{(\lambda)}$ is the Tukey family of power transformations; $\log(X_k)$ or $\log(X_k + 1)$ fit here.
Dummy terms: $U_j = 1$ if a specific condition is met, $U_j = 0$ if the condition is not met.
“Hockey stick” function terms:
$$U_k = (X_j - c)_+ = \begin{cases} X_j - c & \text{if } X_j > c \\ 0 & \text{if } X_j \le c \end{cases} \qquad \text{where } c \in (\min X_j, \max X_j)$$
or
$$U_k = (c - X_j)_+ = \begin{cases} c - X_j & \text{if } X_j < c \\ 0 & \text{if } X_j \ge c \end{cases}$$
These types of terms are used in fitting Multivariate Adaptive Regression Spline (MARS) models.
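The two hockey stick terms can be written directly as functions. The sketch below is plain Python with hypothetical names; the coefficients in `piecewise_linear` are made-up values chosen only to show how a pair of hinge terms at the same knot $c$ gives a different slope on each side of the knot.

```python
def hinge_pos(x, c):
    """(x - c)_+ : equals x - c when x > c, otherwise 0."""
    return max(x - c, 0.0)

def hinge_neg(x, c):
    """(c - x)_+ : equals c - x when x < c, otherwise 0."""
    return max(c - x, 0.0)

def piecewise_linear(x, b0, b_pos, b_neg, c):
    """Intercept plus one pair of hockey stick terms sharing the knot c."""
    return b0 + b_pos * hinge_pos(x, c) + b_neg * hinge_neg(x, c)

# Illustrative (made-up) coefficients: slope +2 to the right of c = 1,
# and the fit rises by 3 per unit as x moves left of c.
at_knot = piecewise_linear(1.0, b0=5.0, b_pos=2.0, b_neg=3.0, c=1.0)
right_of_knot = piecewise_linear(3.0, b0=5.0, b_pos=2.0, b_neg=3.0, c=1.0)
left_of_knot = piecewise_linear(0.0, b0=5.0, b_pos=2.0, b_neg=3.0, c=1.0)
```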
Types of Terms
Factor terms: Suppose the $k^{th}$ predictor ($X_k$) is a nominal/ordinal variable with $m$ levels ($m > 2$). Then we choose one of the levels as the reference group and create dummy terms for the remaining ($m-1$) levels.
Interaction terms: $U_{jk} = U_j \cdot U_k$. Here the term $U_{jk}$ is a product of two terms ($U_j$ and $U_k$), where these two terms could be of any of the other term types.
Linear combination terms: $U_j = \alpha_{j1} X_1 + \alpha_{j2} X_2 + \ldots + \alpha_{jp} X_p$. PCR/PLS use these terms as basic building blocks.
Spline basis terms:
$$U_k = \begin{cases} (X_j - k_{m-1})_+^p & \text{if } k_{m-1} \le X_j < k_m \\ 0 & \text{otherwise} \end{cases}$$
where the $k_m$ are the knots.
Nonparametric terms: $U_k = f_k(x_j)$, where the $f_k$ are estimated by smoothing an appropriate scatterplot.
Types of Terms
Trigonometric terms: $U_j = \sin\left(\frac{2\pi X}{m}\right)$, $U_k = \cos\left(\frac{2\pi X}{m}\right)$
The periodicities used can be based on trial and error, knowledge of the physical phenomenon being studied (sunspots in this case), or tools like spectral analysis that identify important periodicities. We will not delve into harmonic analysis in this course; this is only presented to illustrate that by using appropriate terms we can develop models capable of fitting complex relationships between the response and the predictors.
Rob Hyndman – great online forecasting text with supporting R library (fpp2)
Activity: Building a MLR model for diamond prices
Multivariate Adaptive Regression Splines (MARS)
$$\mathbf{Y} = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(\mathbf{X}) + \mathbf{e}$$
where each $h_m$ is a hockey stick function or an interaction (product) of hockey stick functions.
If we have factor terms, those are handled in the usual way: dummy variables for all but one level of the factor. These dummy terms can be involved in interactions as well.
Multivariate Adaptive Regression Splines (MARS) (Friedman, 1991)
The R package earth contains functions for building and plotting MARS models.
    library(earth)
    # degree = 1 fits an additive model (no interactions between hockey stick terms)
    mod = earth(y ~ ., data = yourdata, degree = 1)
Activity: Build a MARS model for the diamond price data.
Tree-Based Models: CART (Breiman et al., 1984)
For regression trees,
$$\hat{y} = \hat{f}(\mathbf{x}) = \sum_{m=1}^{M} c_m I(\mathbf{x} \in R_m)$$
Assuming we are minimizing the RSS, the fitted value is the mean response in each terminal node region $R_m$:
$$\hat{c}_m = \text{ave}(y_i \mid \mathbf{x}_i \in R_m)$$
Tree-based Models: CART
The task of determining the neighborhoods $R_m$ is solved by determining a split coordinate or variate $j$, i.e. which variable to split on, and a split point $s$. A split coordinate and split point define the rectangles $R_1$ and $R_2$ as
$$R_1(j,s) = \{\mathbf{x} \mid x_j \le s\} \quad \text{and} \quad R_2(j,s) = \{\mathbf{x} \mid x_j > s\}$$
The residual sum of squares (RSS) for a split determined by $(j,s)$ is
$$RSS(j,s) = \min_{c_1} \sum_{\mathbf{x}_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{\mathbf{x}_i \in R_2(j,s)} (y_i - c_2)^2$$
The goal at any given stage is to find the pair $(j,s)$ for which $RSS(j,s)$ is minimal, i.e. for which the overall RSS is maximally reduced. The inner minimizations are solved by taking $c_1$ and $c_2$ to be the mean responses in $R_1$ and $R_2$.
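The search for the best pair $(j,s)$ can be sketched as an exhaustive loop. This is a plain-Python illustration with hypothetical names and a tiny made-up data set, not the rpart implementation; using midpoints between adjacent observed values as candidate split points is an assumption made for the example.

```python
def rss_about_mean(values):
    """RSS of a node when its fitted value is the node mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(X, y):
    """Exhaustive search over split variable j and split point s.

    Returns (j, s, rss) for the split with the smallest RSS(j, s).
    """
    best = None
    for j in range(len(X[0])):
        levels = sorted(set(row[j] for row in X))
        # Candidate split points: midpoints between adjacent observed values
        for lo, hi in zip(levels, levels[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= s]   # R1(j, s)
            right = [yi for row, yi in zip(X, y) if row[j] > s]   # R2(j, s)
            rss = rss_about_mean(left) + rss_about_mean(right)
            if best is None or rss < best[2]:
                best = (j, s, rss)
    return best

# Made-up data: the response jumps when the first predictor exceeds 3
X = [[1, 5], [2, 3], [3, 8], [4, 1], [5, 7], [6, 2]]
y = [10, 11, 10, 30, 31, 29]
j, s, rss = best_split(X, y)
```

A full tree would apply the same search recursively within $R_1$ and $R_2$ until a stopping rule is met.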
Ensemble Tree-based Models
Ensemble models combine multiple models by averaging their predictions. For trees, the most common ensembling methods are:
  Bagging (Breiman, 1996)
  Random or Bootstrap Forests (Breiman, 2001)
  Boosting (Friedman, 1999)
Tree-based Models: Bagging
Suppose we are interested in predicting a numeric response variable. Let
$Y_x = y \mid \mathbf{x}$ = response value for a given set of $\mathbf{x}$'s, and
$\hat{f}(\mathbf{x})$ = prediction from a specified model based on $\mathbf{x}$.
For example, $\hat{f}(\mathbf{x})$ might come from an MLR model or from an RPART model. Let $\mu_{\hat{f}} = E(\hat{f}(\mathbf{x}))$, where the expectation is with respect to the distribution underlying the training sample (since, viewed as a random variable, $\hat{f}(\mathbf{x})$ is a function of the training sample, which can be viewed as a high-dimensional random variable) and not $\mathbf{x}$ (which is considered fixed).
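Bagging
$$\begin{aligned}
MSE(\text{prediction}) &= E\left[(Y_x - \hat{f}(\mathbf{x}))^2\right] \\
&= E\left[(Y_x - \mu_{\hat{f}} + \mu_{\hat{f}} - \hat{f}(\mathbf{x}))^2\right] \\
&= E\left[(Y_x - \mu_{\hat{f}})^2\right] + E\left[(\hat{f}(\mathbf{x}) - \mu_{\hat{f}})^2\right] \\
&= E\left[(Y_x - \mu_{\hat{f}})^2\right] + Var(\hat{f}(\mathbf{x})) \\
&\ge E\left[(Y_x - \mu_{\hat{f}})^2\right]
\end{aligned}$$
If we could base our prediction on $\mu_{\hat{f}}$ instead of $\hat{f}(\mathbf{x})$, we would shrink the MSE(prediction) and improve the predictive performance of our model.
How can we approximate $\mu_{\hat{f}} = E(\hat{f}(\mathbf{x}))$?
$$\hat{\mu}_{\hat{f}}^*(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b^*(\mathbf{x})$$
where $\hat{f}_b^*$ is the model fit to the $b^{th}$ bootstrap sample.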
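The bootstrap average can be sketched as follows. This is a plain-Python illustration with hypothetical names: the base model is a one-split regression tree ("stump") on a single predictor, standing in for a full RPART tree, and the data are made up.

```python
import random

def fit_stump(x, y):
    """One-split regression tree on a single predictor; returns a predictor."""
    def sse(v):
        m = sum(v) / len(v)
        return sum((t - m) ** 2 for t in v)
    levels = sorted(set(x))
    if len(levels) == 1:  # no split possible: predict the mean
        m = sum(y) / len(y)
        return lambda x0: m
    best = None
    for lo, hi in zip(levels, levels[1:]):
        s = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= s]
        right = [yi for xi, yi in zip(x, y) if xi > s]
        rss = sse(left) + sse(right)
        if best is None or rss < best[0]:
            best = (rss, s, sum(left) / len(left), sum(right) / len(right))
    _, s, c1, c2 = best
    return lambda x0: c1 if x0 <= s else c2

def bagged_predictor(x, y, B=50, seed=3):
    """Average the predictions of B stumps fit to bootstrap samples."""
    rng = random.Random(seed)
    n = len(x)
    fits = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        fits.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return lambda x0: sum(f(x0) for f in fits) / B

x = [float(i) for i in range(1, 21)]
y = [2.0 * xi for xi in x]
f_bag = bagged_predictor(x, y)
```

Because each bootstrap sample puts the split in a slightly different place, the averaged prediction is smoother than that of any single stump.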
Bagging
We will now use bagging to hopefully arrive at an even better model for the price of a diamond. For simplicity we will first use smaller and simpler trees to illustrate the idea of bagging. Below are four different trees fit to bootstrap samples drawn from the full diamonds dataset. For each tree fit I used cp = 0.005 and minsplit = 5. The bagged estimate of the diamond price is the mean of the predictions from the trees fit to the $B$ bootstrap samples.
These trees do not vary much
Because the trees are so similar, their predictions are highly correlated. Thus averaging their predictions provides little benefit and will not produce a reasonable estimate of $\mu_{\hat{f}}$.
Bagging with 𝑩 = 𝟏𝟎
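Tree-based Models: Random Forests (Breiman, 2001)
At each split, only a randomly chosen subset of the predictors is considered as candidates for the split. This is the key feature of the random forest that breaks the correlation between trees fit to the bootstrap samples.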
Random Forests
How to control the growth of your random forest model:
  ntree – number of trees to grow in your forest, like B or nbagg in bagging.
  mtry – number of predictors to choose randomly for each split (default = $p/3$ for regression problems and $\sqrt{p}$ for classification problems).
  nodesize – minimum size of the terminal nodes in terms of the number of observations contained in them; default is 1 for classification problems and 5 for regression problems. Larger values here speed up the fitting process because trees in the forest will not be as big.
  maxnodes – maximum number of terminal nodes a tree can have in the forest. Smaller values will speed up fitting.
Activity: Diamonds in the Forest
    Diamond.RF = randomForest(logPrice ~ ., data = DiaNew.train, mtry = ??, ntree = ??, nodesize = ??, maxnodes = ??)
Tree-based Models: Gradient Boosting
Key Features:
  Each tree in the sequence is trying to explain variation left over from the previous tree!
  Trees at each stage are simpler (fewer terminal nodes).
  More “knobs” to adjust.
Training a Gradient Boosted Model
  n.trees = number of layers (trees) in the sequence.
  interaction.depth = number of terminal nodes in each layer minus 1.
  n.minobsinnode = minimum number of observations allowed in a terminal node.
  shrinkage = learning rate; smaller values require running more layers.
  bag.fraction = fraction of cases used in fitting the next layer.
  train.fraction = allows for a training/validation split.
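To make the roles of the number of layers and the shrinkage concrete, here is a minimal plain-Python sketch of boosting for squared-error loss. It is not the gbm implementation: the names are hypothetical, each layer is a one-split tree (comparable to interaction.depth = 1), and there is no bag.fraction subsampling.

```python
def fit_stump(x, r):
    """Best single split of x for residuals r; returns (s, left_mean, right_mean)."""
    best = None
    levels = sorted(set(x))
    for lo, hi in zip(levels, levels[1:]):
        s = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        c1, c2 = sum(left) / len(left), sum(right) / len(right)
        rss = sum((v - c1) ** 2 for v in left) + sum((v - c2) ** 2 for v in right)
        if best is None or rss < best[0]:
            best = (rss, s, c1, c2)
    return best[1:]

def boost(x, y, n_trees=100, shrinkage=0.1):
    """Each layer fits a stump to the residuals left by the layers before it."""
    f0 = sum(y) / len(y)           # start from the overall mean
    fitted = [f0] * len(y)
    layers = []
    for _ in range(n_trees):
        resid = [yi - fi for yi, fi in zip(y, fitted)]
        s, c1, c2 = fit_stump(x, resid)
        layers.append((s, c1, c2))
        # Take only a small step (shrinkage) in the direction of the new layer
        fitted = [fi + shrinkage * (c1 if xi <= s else c2)
                  for xi, fi in zip(x, fitted)]
    return f0, layers, fitted

def training_mse(y, fitted):
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / len(y)

x = [float(i) for i in range(1, 11)]
y = [xi ** 2 for xi in x]
_, _, fit_few = boost(x, y, n_trees=10)
_, _, fit_many = boost(x, y, n_trees=200)
```

With a small shrinkage value, ten layers leave much of the variation unexplained while two hundred fit the training data closely, which is why a validation split or cross-validation is needed to choose the number of layers.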
Tree-based Models: Gradient Boosting Activity: Can you train a GBM that does better than our previous models?
Treed Regression
Rather than splitting our data into disjoint regions in the predictor/term space and then using the mean of the response in these regions as the basis of a prediction, treed regression fits a linear model in each of the “terminal nodes”. What? Show me please…
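A toy sketch of the idea, in plain Python with hypothetical names and made-up numbers: the "tree" here is a single fixed split on a categorical quality label, and a separate least-squares line of price on carat is fit within each node. An actual treed regression procedure would choose the splits from the data; that step is not shown.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope

def fit_treed(groups, x, y):
    """One split on a categorical variable, then a line in each terminal node."""
    models = {}
    for g in set(groups):
        xs = [xi for gi, xi in zip(groups, x) if gi == g]
        ys = [yi for gi, yi in zip(groups, y) if gi == g]
        models[g] = fit_line(xs, ys)
    return models

def predict_treed(models, g, x0):
    """Route the case to its node, then use that node's line."""
    b0, b1 = models[g]
    return b0 + b1 * x0

# Made-up data: price rises faster with carat for the "high" quality group
quality = ["low"] * 4 + ["high"] * 4
carat = [0.5, 1.0, 1.5, 2.0] * 2
price = [500.0, 1000.0, 1500.0, 2000.0, 2000.0, 3500.0, 5000.0, 6500.0]
models = fit_treed(quality, carat, price)
```

A regression tree would predict one mean price per node; here each node keeps its own price-versus-carat slope.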
Treed Regression
This approach seems like it might work well for these data, as we have different qualities of diamonds based on color, clarity, and cut, and we know that price is strongly associated with carat size. Thus a tree that first breaks up diamonds in terms of quality, then examines the relationship between price and carat size, might work great?!?
Classification Problems
Many of the methods we have considered can be used for classification problems as well, e.g. any of the tree-based methods can be used for classification. Others that are unique to classification include discriminant analysis, nearest neighbor classification, Naïve Bayes classifiers, support vector machines (SVM), etc. (Script file with a few examples in Block 4 folder on GitHub)