A Core Curriculum for Undergraduate Data Science


1 A Core Curriculum for Undergraduate Data Science
Todd Iverson Brant Deppa Silas Bergen Tisha Hooks April Kerby Chris Malone Winona State University

2 Supervised Learning DSCI 425
Brant Deppa, Ph.D. Professor of Statistics & Data Science Winona State University

3 Course Topics
Numeric Response (Y)
Introduction to Supervised Learning
Review of multiple regression (predictors vs. terms)
Cross-Validation for numeric response (discussed throughout)
Automatic term selectors (ACE/AVAS/MARS)
Projection pursuit regression (PPR) (→ neural networks)
Penalized/regularized regression (ridge, LASSO, ElasticNet)
Dimension reduction (PCR and PLS regression)
Tree-based models (CART, bagging, random forests, boosting, treed regression)
Nearest neighbor regression

4 Course Topics (cont’d)
Categorical/Nominal/Ordinal Response (Y)
Introduction to classification problems
Nearest neighbor classification
Naïve Bayes classification
Tree-based models for classification
Discriminant analysis
Neural networks
Support vector machines (SVM)
Multiple logistic regression
Blending/stacking models (in progress)

5 Supervised Learning – General Problem
π‘Œ =π‘π‘Ÿπ‘’π‘‘π‘–π‘π‘‘π‘’π‘‘ π‘£π‘Žπ‘™π‘’π‘’ 𝑖𝑓 π‘Ÿπ‘’π‘ π‘π‘œπ‘›π‘ π‘’ 𝑖𝑠 π‘›π‘’π‘šπ‘’π‘Ÿπ‘–π‘ π‘Œ =π‘π‘Ÿπ‘’π‘‘π‘–π‘π‘‘π‘’π‘‘ π‘π‘™π‘Žπ‘ π‘  𝑖𝑓 π‘Ÿπ‘’π‘ π‘π‘œπ‘›π‘ π‘’ π‘›π‘œπ‘šπ‘–π‘›π‘Žπ‘™ π‘œπ‘Ÿ π‘œπ‘Ÿπ‘‘π‘–π‘›π‘Žπ‘™ or π‘Œ = 𝑃 (π‘π‘™π‘Žπ‘ π‘  π‘˜|𝑿) Cross-validation (CV) methods are generally used for this purpose.

6 Cross-Validation – Split Sample
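A minimal sketch of split-sample cross-validation in R. The ggplot2 diamonds data is used here as a stand-in for the course data, and the 80/20 split and the candidate model are illustrative assumptions.

library(ggplot2)                      # diamonds data used as a stand-in
set.seed(1)

n     <- nrow(diamonds)
train <- sample(1:n, size = floor(0.80 * n))       # 80% training, 20% validation

# Fit the candidate model on the training portion only
fit <- lm(log(price) ~ carat + cut + color + clarity, data = diamonds[train, ])

# Assess predictive performance on the held-out validation portion
pred <- predict(fit, newdata = diamonds[-train, ])
sqrt(mean((log(diamonds$price[-train]) - pred)^2))  # validation RMSEP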

7 K-Fold Cross-Validation
Data is divided into $k$ roughly equal-sized subgroups. Each subgroup acts as a validation set in turn. The average error across all validation sets is computed and used to choose between rival models.
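A minimal k-fold CV sketch under the same assumptions (k = 10 and the candidate model are illustrative):

set.seed(1)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(diamonds)))   # random fold assignment

cv.err <- numeric(k)
for (j in 1:k) {
  fit  <- lm(log(price) ~ carat + cut + color + clarity,
             data = diamonds[folds != j, ])
  pred <- predict(fit, newdata = diamonds[folds == j, ])
  cv.err[j] <- mean((log(diamonds$price[folds == j]) - pred)^2)
}
mean(cv.err)   # average validation MSE across the k folds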

8 Bootstrap Cross-Validation
Observations not selected in a given bootstrap sample (i.e. the out-of-bag (OOB) cases) constitute the validation set. We can then calculate quality-of-prediction metrics for each iteration.
Data: $(\boldsymbol{x}_1, y_1), (\boldsymbol{x}_2, y_2), \ldots, (\boldsymbol{x}_n, y_n)$, where the $\boldsymbol{x}_i$'s are the $p$-dimensional predictor vectors.
Bootstrap sample: $(\boldsymbol{x}_1^*, y_1^*), (\boldsymbol{x}_2^*, y_2^*), \ldots, (\boldsymbol{x}_n^*, y_n^*)$, where each $(\boldsymbol{x}_i^*, y_i^*)$ is a randomly selected observation drawn from the original data with replacement. This process is then repeated a large number of times (B = 500, 1000, 5000, etc.).
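A minimal bootstrap (OOB) CV sketch; B = 100 and the candidate model are illustrative assumptions.

set.seed(1)
B <- 100
n <- nrow(diamonds)
oob.err <- numeric(B)
for (b in 1:B) {
  boot <- sample(1:n, size = n, replace = TRUE)   # bootstrap sample (with replacement)
  oob  <- setdiff(1:n, boot)                      # out-of-bag cases = validation set
  fit  <- lm(log(price) ~ carat + cut + color + clarity, data = diamonds[boot, ])
  pred <- predict(fit, newdata = diamonds[oob, ])
  oob.err[b] <- mean((log(diamonds$price[oob]) - pred)^2)
}
mean(oob.err)   # bootstrap estimate of the validation MSE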

9 Monte Carlo Cross-Validation (MCCV)
As each of these cross-validation strategies has an element of randomness, we can expect the results to vary from one CV run to the next. With MCCV we conduct split-sample CV multiple times and aggregate the results to quantify the predictive performance of a candidate model.
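MCCV is simply the split-sample procedure above repeated many times; a short sketch (100 repeats of an 80/20 split are assumptions):

set.seed(1)
mccv <- replicate(100, {
  train <- sample(1:nrow(diamonds), size = floor(0.80 * nrow(diamonds)))
  fit   <- lm(log(price) ~ carat + cut + color + clarity, data = diamonds[train, ])
  pred  <- predict(fit, newdata = diamonds[-train, ])
  mean((log(diamonds$price[-train]) - pred)^2)
})
c(mean = mean(mccv), sd = sd(mccv))   # aggregate the repeated split-sample results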

10 Multiple Regression
Suppose we have a numeric response ($Y$) and a set of predictors $X_1, X_2, \ldots, X_p$. The multiple regression model is given by
$E(Y \mid \boldsymbol{X}) = \beta_0 + \beta_1 U_1 + \beta_2 U_2 + \ldots + \beta_{k-1} U_{k-1}$
$Y = \beta_0 + \beta_1 U_1 + \ldots + \beta_{k-1} U_{k-1} + e$
where $U_j$ is the $j$th term. Terms are functions of the predictors.

11 Types of Terms
"Hockey stick" terms: $U_k = (X_j - c)_+$, which equals $X_j - c$ if $X_j > c$ and $0$ if $X_j \le c$, where $c \in (\min X_j, \max X_j)$; or $U_k = (c - X_j)_+$, which equals $c - X_j$ if $X_j < c$ and $0$ if $X_j \ge c$. These types of terms are used in fitting Multivariate Adaptive Regression Spline (MARS) models.
Intercept: $U_0 = 1$
Predictor terms: $U_j = X_k$ ← terms can be the predictors themselves as long as the predictor is meaningfully numeric (i.e. $X_k$ is a count or a measurement).
Polynomial terms: $U_j = X_k^p$ ← these terms are integer powers of numeric predictors, e.g. $X_k^2$, $X_k^3$, etc.
Transformation terms: $U_j = X_k^{(\lambda)}$ ← here $X_k^{(\lambda)}$ is a member of the Tukey family of power transformations, e.g. $\log X_k$ or $\log(X_k + 1)$.
Dummy terms: $U_j = 1$ if a specific condition is met, $0$ if the condition is not met.

12 Types of Terms
Factor terms: suppose the $k$th predictor ($X_k$) is a nominal/ordinal variable with $m$ levels ($m > 2$). Then we choose one of the levels as the reference group and create dummy terms for the remaining $(m - 1)$ levels.
Interaction terms: $U_{jk} = U_j \cdot U_k$ ← here the term $U_{jk}$ is the product of two terms ($U_j$ and $U_k$), where those two terms can be of any other term type.
Linear combination terms: $U_j = \alpha_{j1} X_1 + \alpha_{j2} X_2 + \ldots + \alpha_{jp} X_p$ ← PCR and PLS regression use these terms as their basic building blocks.
Spline basis terms: $U_k = (X_j - k_{m-1})_+^p$ if $k_{m-1} \le X_j < k_m$, $0$ otherwise.
Nonparametric terms: $U_k = f_k(X_j)$, where the $f_k$ are estimated by smoothing an appropriate scatterplot.

13 Types of Terms
Trigonometric terms: $U_j = \sin(2\pi X / m)$, $U_k = \cos(2\pi X / m)$, where $m$ is the period.
The periodicities used can be based on trial and error, knowledge of the physical phenomenon being studied (sunspots in this case), or tools like spectral analysis that identify important periodicities. We will not delve into harmonic analysis in this course; this is presented only to illustrate that, by using appropriate terms, we can develop models capable of fitting complex relationships between the response and the predictors. Rob Hyndman has a great online forecasting text with a supporting R library (fpp2).
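To make the term types concrete, here is a hedged sketch of building several of them by hand for an R model formula; the diamonds stand-in data, the knot c = 1 for the hockey-stick terms, and the dummy condition are all illustrative assumptions.

library(ggplot2)
dia <- as.data.frame(diamonds)
dia$hs.right <- pmax(dia$carat - 1, 0)          # (carat - 1)+ hockey-stick term, assumed knot c = 1
dia$hs.left  <- pmax(1 - dia$carat, 0)          # (1 - carat)+ hockey-stick term
dia$ideal    <- as.numeric(dia$cut == "Ideal")  # dummy term (condition is illustrative)

fit <- lm(log(price) ~ carat + I(carat^2) +   # predictor and polynomial terms
            log(carat) +                      # transformation term
            hs.right + hs.left +              # hockey-stick terms
            ideal + ideal:carat,              # dummy term and an interaction term
          data = dia)
summary(fit)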

14 Activity: Building a MLR model for diamond prices

15 Multivariate Adaptive Regression Splines (MARS)
$Y = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(\boldsymbol{X}_m) + e$, where the $h_m(\boldsymbol{X}_m)$ are hockey-stick functions and interactions of hockey-stick functions.
If we have factor terms, those are handled in the usual way: dummy variables for all but one level of the factor. These dummy terms can be involved in interactions as well.

16 Multivariate Adaptive Regression Splines (MARS)

17 Multivariate Adaptive Regression Splines (MARS) (Friedman, 1991)
The package earth contains functions for building and plotting MARS models.
mod = earth(y ~ ., data = yourdata, degree = 1)
Activity: Build a MARS model for the diamond price data.
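A hedged sketch of the activity using earth; the diamonds stand-in data, degree = 2, and the train/validation split are assumptions.

library(earth)
library(ggplot2)
set.seed(1)

train <- sample(1:nrow(diamonds), size = floor(0.80 * nrow(diamonds)))

# degree = 2 allows interactions between hockey-stick basis functions
mars.fit <- earth(log(price) ~ carat + cut + color + clarity + depth + table,
                  data = diamonds[train, ], degree = 2)
summary(mars.fit)   # selected hockey-stick terms and their coefficients

pred <- predict(mars.fit, newdata = diamonds[-train, ])
sqrt(mean((log(diamonds$price[-train]) - pred)^2))   # validation RMSEP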

18 Tree-Based Models: CART (Breiman et al., 1984)
For regression trees, $\hat{y} = \hat{f}(\boldsymbol{x}) = \sum_{m=1}^{M} c_m I(\boldsymbol{x} \in R_m)$. Assuming we are minimizing the RSS, the fitted value is the mean response in each terminal-node region $R_m$: $\hat{c}_m = \mathrm{ave}(y_i \mid \boldsymbol{x}_i \in R_m)$.
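A minimal regression-tree sketch with rpart; the stand-in data and the control settings are assumptions. The fitted value in each terminal node is the mean log price of the training cases that fall in it.

library(rpart)
library(ggplot2)

# method = "anova" grows a regression tree by minimizing RSS at each split
dia.tree <- rpart(log(price) ~ carat + cut + color + clarity,
                  data = diamonds, method = "anova",
                  control = rpart.control(cp = 0.005, minsplit = 5))

dia.tree                  # printed tree: each terminal node reports its node mean
head(predict(dia.tree))   # predictions are the terminal-node means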

19 Tree-Based Models: CART

20 Tree-based Models: CART
The task of determining the neighborhoods $R_m$ is solved by determining a split coordinate (variable) $j$, i.e. which variable to split on, and a split point $s$. A split coordinate and split point define the rectangles $R_1$ and $R_2$ as
$R_1(j,s) = \{\boldsymbol{x} \mid x_j \le s\}$ and $R_2(j,s) = \{\boldsymbol{x} \mid x_j > s\}$
The residual sum of squares (RSS) for a split determined by $(j,s)$ is
$RSS(j,s) = \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2$
The goal at any given stage is to find the pair $(j,s)$ for which $RSS(j,s)$ is minimal, i.e. for which the overall RSS is maximally reduced.
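A small didactic sketch of this split search for a single numeric predictor: for each candidate split point s it computes RSS(j, s) as the sum of the two within-region RSS values around the region means. This brute-force version only illustrates the idea; it is not how rpart is implemented.

# Brute-force search for the best split point on one numeric predictor x
best.split <- function(x, y) {
  s.cand <- sort(unique(x))
  s.cand <- (s.cand[-1] + s.cand[-length(s.cand)]) / 2   # midpoints between observed values
  rss <- sapply(s.cand, function(s) {
    left  <- y[x <= s]
    right <- y[x >  s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(split.point = s.cand[which.min(rss)], rss = min(rss))
}

# Example: best single split of log(price) on carat (diamonds stand-in data)
with(ggplot2::diamonds, best.split(carat, log(price)))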

21 Ensemble Tree-based Models
Ensemble models combine multiple models by averaging their predictions. For trees, the most common approaches to ensembling are:
Bagging (Breiman, 1996)
Random or bootstrap forests (Breiman, 2001)
Boosting (Friedman, 1999)

22 Tree-based Models: Bagging
Suppose we are interested in predicting a numeric response variable: $y \mid \boldsymbol{x}$ = the response value for a given set of $\boldsymbol{x}$'s = $Y_{\boldsymbol{x}}$, and $\hat{f}(\boldsymbol{x})$ = the prediction from a specified model based on $\boldsymbol{x}$. For example, $\hat{f}(\boldsymbol{x})$ might come from an MLR model or from an RPART model. Let $\mu_{\hat{f}} = E(\hat{f}(\boldsymbol{x}))$, where the expectation is with respect to the distribution underlying the training sample (since, viewed as a random variable, $\hat{f}(\boldsymbol{x})$ is a function of the training sample, which can be viewed as a high-dimensional random variable) and not $\boldsymbol{x}$ (which is considered fixed).

23 Bagging
$MSE(\text{prediction}) = E[(Y_{\boldsymbol{x}} - \hat{f}(\boldsymbol{x}))^2] = E[(Y_{\boldsymbol{x}} - \mu_{\hat{f}} + \mu_{\hat{f}} - \hat{f}(\boldsymbol{x}))^2] = E[(Y_{\boldsymbol{x}} - \mu_{\hat{f}})^2] + E[(\hat{f}(\boldsymbol{x}) - \mu_{\hat{f}})^2] = E[(Y_{\boldsymbol{x}} - \mu_{\hat{f}})^2] + Var(\hat{f}(\boldsymbol{x})) \ge E[(Y_{\boldsymbol{x}} - \mu_{\hat{f}})^2]$
If we could base our prediction on $\mu_{\hat{f}}$ instead of $\hat{f}(\boldsymbol{x})$, we would shrink MSE(prediction) and improve the predictive performance of our model. How can we approximate $\mu_{\hat{f}} = E(\hat{f}(\boldsymbol{x}))$? Average the predictions from models fit to $B$ bootstrap samples:
$\hat{\mu}_{\hat{f}}^*(\boldsymbol{x}) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b^*(\boldsymbol{x})$

24 Bagging We will now use bagging to hopefully arrive at an even better model the price of a diamond. For simplicity we will first use smaller and simpler trees to to illustrate the idea of bagging. Below are four different trees fit to bootstrap samples drawn from the full diamonds dataset. For each tree fit I used cp =.005 and minsplit = 5. The bagged estimate of the diamond price is the mean of the predictions from the trees fit the 𝐡 bootstrap samples.

25 These trees do not vary much
These trees do not vary much. Thus averaging their predictions provides little benefit and will not produce a much better estimate of $\mu_{\hat{f}}$ than any single tree.

26 Bagging with B = 10

27 Tree-based Models: Random Forests (Breiman, 2001)
Randomly selecting a subset of the predictors as split candidates at each node is the key feature of the random forest that breaks the correlation between trees fit to the bootstrap samples.

28 Random Forests
How to control the growth of your random forest model:
ntree – number of trees to grow in your forest, like B or nbagg in bagging.
mtry – number of predictors chosen at random as split candidates at each split (default = $p/3$ for regression problems and $\sqrt{p}$ for classification problems).
nodesize – minimum size of the terminal nodes in terms of the number of observations contained in them; the default is 1 for classification problems and 5 for regression problems. Larger values speed up the fitting process because trees in the forest will not be as big.
maxnodes – maximum number of terminal nodes a tree in the forest can have. Smaller values will speed up fitting.

29 Activity: Diamonds in the Forest
Diamond.RF = randomForest(logPrice ~ ., data = DiaNew.train, mtry = ??, ntree = ??, nodesize = ??, maxnodes = ??)
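One hedged way to fill in the activity call; the tuning values below are illustrative starting points, and the DiaNew.train data frame with its logPrice response is taken from the slide (an assumption about the course data).

library(randomForest)
set.seed(1)

Diamond.RF <- randomForest(logPrice ~ ., data = DiaNew.train,
                           mtry = 3,        # predictors tried at each split (illustrative)
                           ntree = 500,     # number of trees in the forest
                           nodesize = 5)    # regression default; maxnodes left unset (no limit)

Diamond.RF          # prints the OOB error estimate and % variance explained
plot(Diamond.RF)    # OOB MSE as a function of the number of trees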

30 Tree-based Models: Gradient Boosting
Key features:
Each tree in the sequence tries to explain variation left over from the previous tree!
Trees at each stage are simpler (fewer terminal nodes).
There are more "knobs" to adjust.

31 Training a Gradient Boosted Model
n.trees = number of layers (trees) in the sequence.
interaction.depth = number of terminal nodes in each layer minus 1.
n.minobsinnode = minimum number of observations that can be in a node and still be split.
shrinkage = learning rate; the smaller the shrinkage, the more layers you generally need to run.
bag.fraction = fraction of cases used in fitting the next layer.
train.fraction = allows for a training/validation split.
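A hedged gbm sketch using the parameters just listed; the diamonds stand-in data and all tuning values are assumptions, not recommendations.

library(gbm)
library(ggplot2)
set.seed(1)

dia <- as.data.frame(diamonds)
dia$logPrice <- log(dia$price)
dia <- dia[sample(nrow(dia)), ]   # shuffle, since train.fraction takes the first rows

gbm.fit <- gbm(logPrice ~ carat + cut + color + clarity + depth + table,
               data = dia, distribution = "gaussian",
               n.trees = 2000,            # number of layers
               interaction.depth = 3,     # splits per layer
               n.minobsinnode = 10,
               shrinkage = 0.05,          # smaller shrinkage -> more layers needed
               bag.fraction = 0.5,        # fraction of cases used for each layer
               train.fraction = 0.8)      # hold out 20% as a validation set

best.iter <- gbm.perf(gbm.fit, method = "test")   # layers minimizing validation error
pred <- predict(gbm.fit, newdata = dia, n.trees = best.iter)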

32 Tree-based Models: Gradient Boosting

33 Tree-based Models: Gradient Boosting
Activity: Can you train a GBM that does better than our previous models?

34 Treed Regression
Rather than split our data into disjoint regions in the predictor/term space and then use the mean of the response in each region as the basis of a prediction, treed regression fits a linear model in the "terminal nodes". What? Show me please…

35 Treed Regression

36 Treed Regression
This approach seems like it might work well for these data: we have different qualities of diamonds based on color, clarity, and cut, and we know that price is strongly associated with carat size. Thus a tree that first breaks up the diamonds in terms of quality, and then examines the relationship between price and carat size within each node, might work great?!?
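The slides do not name a package for treed regression, but one way to sketch the idea in R is model-based recursive partitioning with partykit::lmtree, which splits on the quality factors and fits a linear model in log(carat) inside each terminal node. The formula, the diamonds stand-in data, and the depth limit are assumptions.

library(partykit)
library(ggplot2)

dia <- as.data.frame(diamonds)
dia$logPrice <- log(dia$price)
dia$logCarat <- log(dia$carat)

# Split on the quality variables; fit logPrice ~ logCarat inside each terminal node
dia.lmtree <- lmtree(logPrice ~ logCarat | cut + color + clarity,
                     data = dia, maxdepth = 4)

plot(dia.lmtree)   # tree with a fitted regression shown in each terminal node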

37 Classification Problems
Many of the methods we have considered can be used for classification problems as well, e.g. any of the tree-based methods can be used for classification. Others that are unique to classification include discriminant analysis, nearest neighbor classification, Naïve Bayes classifiers, support vector machines (SVM), etc. (A script file with a few examples is in the Block 4 folder on GitHub.)
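As one quick illustration from this list, a hedged k-nearest-neighbor classification sketch on the built-in iris data; the training split and k = 5 are assumptions.

library(class)
set.seed(1)

train <- sample(1:nrow(iris), size = 100)

# knn() takes numeric predictor matrices for the training and test cases,
# plus the class labels of the training cases
knn.pred <- knn(train = iris[train, 1:4],
                test  = iris[-train, 1:4],
                cl    = iris$Species[train], k = 5)

table(knn.pred, iris$Species[-train])   # confusion matrix on the held-out cases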

