Published by Edvin Andersson. Modified over 5 years ago.
1
A Core Curriculum for Undergraduate Data Science
Todd Iverson, Brant Deppa, Silas Bergen, Tisha Hooks, April Kerby, Chris Malone (Winona State University)
2
Supervised Learning DSCI 425
Brant Deppa, Ph.D., Professor of Statistics & Data Science, Winona State University
3
Course Topics
Numeric Response ($Y$)
- Introduction to supervised learning
- Review of multiple regression (predictors vs. terms)
- Cross-validation for numeric response (discussed throughout)
- Automatic term selectors (ACE/AVAS/MARS)
- Projection pursuit regression (PPR) (→ neural networks)
- Penalized/regularized regression (ridge, LASSO, ElasticNet)
- Dimension reduction (PCR and PLS regression)
- Tree-based models (CART, bagging, random forests, boosting, treed regression)
- Nearest neighbor regression
4
Course Topics (cont'd)
Categorical/Nominal/Ordinal Response ($Y$)
- Introduction to classification problems
- Nearest neighbor classification
- Naïve Bayes classification
- Tree-based models for classification
- Discriminant analysis
- Neural networks
- Support vector machines (SVM)
- Multiple logistic regression
- Blending/stacking models (in progress)
5
Supervised Learning: General Problem
$\hat{y}$ = predicted value of the response if numeric
$\hat{y}$ = predicted class of the response if nominal or ordinal, or $\hat{P} = \hat{P}(\text{class } k \mid \boldsymbol{X})$
Cross-validation (CV) methods are generally used to judge how well a model makes these predictions.
6
Cross-Validation: Split Sample
7
K-Fold Cross-Validation
The data are divided into $k$ roughly equal-sized subgroups. Each subgroup acts as the validation set in turn, with the model fit to the remaining $k-1$ subgroups. The average error across all validation sets is computed and used to choose between rival models.
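The fold-by-fold procedure can be sketched in base R. This is a minimal illustration, not the course's own script: the simulated data frame `dat`, the choice `k = 5`, the linear model, and RMSE as the error metric are all assumptions made for the example.

```r
# k-fold cross-validation for a simple linear model (base R only).
set.seed(1)
n <- 100
dat <- data.frame(x = runif(n))
dat$y <- 2 + 3 * dat$x + rnorm(n, sd = 0.5)    # simulated data (assumption)

k <- 5
fold <- sample(rep(1:k, length.out = n))        # random fold label for each row

fold_rmse <- sapply(1:k, function(i) {
  train <- dat[fold != i, ]                     # fit on the other k - 1 folds
  valid <- dat[fold == i, ]                     # fold i is the validation set
  fit <- lm(y ~ x, data = train)
  sqrt(mean((valid$y - predict(fit, newdata = valid))^2))
})

cv_rmse <- mean(fold_rmse)                      # average error across the k folds
cv_rmse
```

To compare rival models, the same `fold` labels would be reused for each candidate so that every model is scored on identical validation sets.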
8
Bootstrap Cross-Validation
Data: $(\boldsymbol{x}_1, y_1), (\boldsymbol{x}_2, y_2), \ldots, (\boldsymbol{x}_n, y_n)$, where the $\boldsymbol{x}_i$'s are the $p$-dimensional predictor vectors.
Bootstrap sample: $(\boldsymbol{x}_1^*, y_1^*), (\boldsymbol{x}_2^*, y_2^*), \ldots, (\boldsymbol{x}_n^*, y_n^*)$, where each $(\boldsymbol{x}_i^*, y_i^*)$ is a randomly selected observation from the original data, drawn with replacement.
Observations not selected (i.e. out-of-bootstrap, OOB) constitute the validation set. We can then calculate quality-of-prediction metrics for each iteration. This process is repeated a large number of times ($B$ = 500, 1000, 5000, etc.).
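A minimal base R sketch of bootstrap cross-validation follows. The simulated data, the linear model, `B = 200`, and RMSE as the metric are assumptions chosen to keep the example short; they are not taken from the course materials.

```r
# Bootstrap cross-validation: fit on a bootstrap sample, validate on the OOB rows.
set.seed(2)
n <- 100
dat <- data.frame(x = runif(n))
dat$y <- 2 + 3 * dat$x + rnorm(n, sd = 0.5)    # simulated data (assumption)

B <- 200
oob_rmse <- replicate(B, {
  idx <- sample(n, replace = TRUE)              # bootstrap sample of row indices
  oob <- setdiff(seq_len(n), idx)               # rows never drawn: out-of-bootstrap
  fit <- lm(y ~ x, data = dat[idx, ])
  pred <- predict(fit, newdata = dat[oob, , drop = FALSE])
  sqrt(mean((dat$y[oob] - pred)^2))             # metric for this iteration
})

mean(oob_rmse)                                  # aggregate over the B iterations
```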
9
Monte Carlo Cross-Validation (MCCV)
Because each of the cross-validation strategies has an element of randomness, we can expect the results to vary from one CV run to the next. With MCCV we conduct split-sample CV multiple times and aggregate the results to quantify predictive performance for a candidate model.
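Repeated split-sample CV might look like the sketch below. The helper name `mccv`, its arguments `reps` and `p_train`, the simulated data, and the linear model are all hypothetical choices for illustration.

```r
# Monte Carlo cross-validation: repeat a random train/validation split many times.
set.seed(3)
n <- 100
dat <- data.frame(x = runif(n))
dat$y <- 2 + 3 * dat$x + rnorm(n, sd = 0.5)    # simulated data (assumption)

mccv <- function(data, reps = 100, p_train = 2/3) {
  n <- nrow(data)
  replicate(reps, {
    train_id <- sample(n, size = floor(p_train * n))   # a fresh random split
    fit <- lm(y ~ x, data = data[train_id, ])
    valid <- data[-train_id, ]
    sqrt(mean((valid$y - predict(fit, newdata = valid))^2))
  })
}

rmse <- mccv(dat)
c(mean = mean(rmse), sd = sd(rmse))   # the spread shows how much one split can vary
```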
10
Multiple Regression
Suppose we have a numeric response ($Y$) and a set of predictors $X_1, X_2, \ldots, X_p$. The multiple regression model is given by

$$E(Y \mid \boldsymbol{X}) = \beta_0 + \beta_1 U_1 + \beta_2 U_2 + \cdots + \beta_{k-1} U_{k-1}$$
$$Y = \beta_0 + \beta_1 U_1 + \cdots + \beta_{k-1} U_{k-1} + e$$

where $U_j$ is the $j$th term. Terms are functions of predictors.
11
Types of Terms
- Intercept: $U_0 = 1$
- Predictor terms: $U_j = X_j$. Terms can be the predictors themselves as long as the predictor is meaningfully numeric (i.e. $X_j$ = a count or a measurement).
- Polynomial terms: $U_j = X_j^p$. These terms are integer powers of numeric predictors, e.g. $X_j^2$, $X_j^3$, etc.
- Transformation terms: $U_j = X_j^{(\lambda)}$. Here $X_j^{(\lambda)}$ is the Tukey family of power transformations; $\log(X_j)$ or $\log(X_j + 1)$ fit here.
- Dummy terms: $U_j = 1$ if a specific condition is met, and $U_j = 0$ if the condition is not met.
- "Hockey stick" function terms:

$$U_j = (X_j - c)^+ = \begin{cases} X_j - c & \text{if } X_j > c \\ 0 & \text{if } X_j \le c \end{cases} \qquad \text{where } c \in (\min X_j, \max X_j)$$

or

$$U_j = (c - X_j)^+ = \begin{cases} c - X_j & \text{if } X_j < c \\ 0 & \text{if } X_j \ge c \end{cases}$$

These types of terms are used in fitting Multivariate Adaptive Regression Spline (MARS) models.
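The hockey stick terms are easy to build by hand. The sketch below is a base R illustration only: the helper names `hinge_pos` and `hinge_neg`, the knot at 5, and the simulated data are assumptions for the example.

```r
# Hockey stick (hinge) terms as plain R functions.
hinge_pos <- function(x, knot) pmax(x - knot, 0)   # (x - c)^+
hinge_neg <- function(x, knot) pmax(knot - x, 0)   # (c - x)^+

x_demo <- c(0, 1, 2, 3, 4)
hinge_pos(x_demo, 2)   # 0 0 0 1 2
hinge_neg(x_demo, 2)   # 2 1 0 0 0

# A hinge term used as a regression term: flat up to x = 5, then slope 2.
set.seed(4)
x <- runif(200, 0, 10)
y <- ifelse(x > 5, 1 + 2 * (x - 5), 1) + rnorm(200, sd = 0.3)  # simulated (assumption)
fit <- lm(y ~ hinge_pos(x, 5))
coef(fit)              # intercept near 1, hinge slope near 2
```

Here the knot is supplied by hand; a MARS fit searches over candidate knots and predictors instead.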
12
Types of Terms
- Factor terms: Suppose the $j$th predictor ($X_j$) is a nominal/ordinal variable with $m$ levels ($m > 2$). Then we choose one of the levels as the reference group and create dummy terms for the remaining ($m - 1$) levels.
- Interaction terms: $U_{jk} = U_j \times U_k$. Here the term $U_{jk}$ is a product of two terms ($U_j$ and $U_k$), where these two terms could be of any other term type.
- Linear combination terms: $U_j = \alpha_{j1} X_1 + \alpha_{j2} X_2 + \cdots + \alpha_{jp} X_p$. PCR/PLS use these terms as basic building blocks.
- Spline basis terms: $U_k = (X_j - t_{k-1})^+$ if $t_{k-1} \le X_j < t_k$, and $U_k = 0$ otherwise, where the $t_k$ are knots.
- Nonparametric terms: $U_j = f_j(x_j)$, where the $f_j$ are estimated by smoothing an appropriate scatterplot.
13
Types of Terms
Trigonometric terms: $U_j = \sin\left(\frac{2\pi X}{p}\right)$, $U_k = \cos\left(\frac{2\pi X}{p}\right)$, for a periodicity $p$
The periodicities used can be based on trial and error, on knowledge of the physical phenomenon being studied (sunspots in this case), or on tools like spectral analysis that identify important periodicities. We will not delve into harmonic analysis in this course. It is presented only to illustrate that, by using appropriate terms, we can develop models capable of fitting complex relationships between the response and the predictors. Rob Hyndman: great online forecasting text with a supporting R library (fpp2).
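As a small illustration of periodic terms, the base R sketch below fits sine and cosine terms with `lm`. It assumes the terms take the form sin(2πt/p) and cos(2πt/p) with a known period `p = 12`; the simulated series and the names `time_idx`, `p`, and `y` are made up for the example and are not the sunspot data.

```r
# Fitting a periodic signal with trigonometric terms (base R only).
set.seed(5)
time_idx <- 1:120
p <- 12                                          # assumed known period
y <- 10 + 3 * sin(2 * pi * time_idx / p) +
          2 * cos(2 * pi * time_idx / p) +
          rnorm(120, sd = 0.5)                   # simulated series (assumption)

fit <- lm(y ~ sin(2 * pi * time_idx / p) + cos(2 * pi * time_idx / p))
round(coef(fit), 2)                              # close to 10, 3 and 2
```

With several candidate periods, one pair of sine and cosine terms would be added per period.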
14
Activity: Building an MLR model for diamond prices
15
Multivariate Adaptive Regression Splines (MARS)
$$Y = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(\boldsymbol{X}) + e$$
Each basis function $h_m$ is a hockey stick function or an interaction (product) of hockey stick functions. If we have factor terms, those are handled in the usual way: dummy variables for all but one level of the factor. These dummy terms can be involved in interactions as well.
16
Multivariate Adaptive Regression Splines (MARS)
17
Multivariate Adaptive Regression Splines (MARS) (Friedman, 1991)
The package earth contains functions for building and plotting MARS models.
    library(earth)
    mod = earth(y ~ ., data = yourdata, degree = 1)   # degree = 1: additive model, no interaction terms
Activity: Build a MARS model for the diamond price data.
18
Tree-Based Models: CART (Breiman et al., 1984)
For regression trees,
$$\hat{y} = \hat{f}(x) = \sum_{m=1}^{M} c_m I(x \in R_m)$$
Assuming we are minimizing the RSS, the fitted value is the mean response in each terminal node region $R_m$:
$$\hat{c}_m = \mathrm{ave}(y_i \mid x_i \in R_m)$$
19
Tree-Based Models: CART
20
Tree-based Models: CART
The task of determining the neighborhoods $R_m$ is solved by determining a split coordinate or variate $j$, i.e. which variable to split on, and a split point $s$. A split coordinate and split point define the rectangles $R_1$ and $R_2$ as
$$R_1(j,s) = \{x \mid x_j \le s\} \quad \text{and} \quad R_2(j,s) = \{x \mid x_j > s\}$$
The residual sum of squares (RSS) for a split determined by $(j,s)$ is
$$RSS(j,s) = \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2$$
The goal at any given stage is to find the pair $(j,s)$ such that $RSS(j,s)$ is minimal, i.e. the overall RSS is maximally reduced.
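The search for the best pair (j, s) can be sketched as an exhaustive scan. This is a base R illustration, not how rpart is implemented: the helper names `best_split` and `best_split_all`, the use of midpoints as candidate split points, and the toy data are assumptions.

```r
# Exhaustive search for the split (j, s) that minimizes the RSS.
best_split <- function(x, y) {
  xs <- sort(unique(x))
  cand <- (xs[-1] + xs[-length(xs)]) / 2         # midpoints between adjacent values
  rss <- sapply(cand, function(s) {
    left <- y[x <= s]
    right <- y[x > s]
    # the inner minimizers c1 and c2 are the means of the two regions
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(s = cand[which.min(rss)], rss = min(rss))
}

best_split_all <- function(X, y) {
  res <- lapply(X, best_split, y = y)            # best split point for each predictor
  rss <- sapply(res, `[[`, "rss")
  j <- names(which.min(rss))                     # predictor with the smallest RSS
  c(list(j = j), res[[j]])
}

X <- data.frame(x1 = c(2, 5, 1, 6, 3, 4), x2 = c(1, 2, 3, 4, 5, 6))   # toy data
y <- c(1, 1, 1, 5, 5, 5)
best_split_all(X, y)   # j = "x2", s = 3.5, rss = 0
```

Growing a full tree would apply the same search recursively within each resulting region.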
21
Ensemble Tree-based Models
Ensemble models combine multiple models by averaging their predictions. For trees, the most common ensembling methods are:
- Bagging (Breiman, 1996)
- Random or bootstrap forests (Breiman, 2001)
- Boosting (Friedman, 1999)
22
Tree-based Models: Bagging
Suppose we are interested in predicting a numeric response variable, where
$y \mid x$ = response value for a given set of $x$'s $= Y_x$, and
$\hat{f}(x)$ = prediction from a specified model based on $x$.
For example, $\hat{f}(x)$ might come from an MLR model or from an RPART model. Let $\phi(x) = E(\hat{f}(x))$, where the expectation is with respect to the distribution underlying the training sample (viewed as a random variable, $\hat{f}(x)$ is a function of the training sample, which can itself be viewed as a high-dimensional random variable) and not $x$ (which is considered fixed).
23
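Bagging
$$\begin{aligned}
\text{MSE(prediction)} &= E\left[(Y_x - \hat{f}(x))^2\right] \\
&= E\left[(Y_x - \phi(x) + \phi(x) - \hat{f}(x))^2\right] \\
&= E\left[(Y_x - \phi(x))^2\right] + E\left[(\phi(x) - \hat{f}(x))^2\right] \\
&= E\left[(Y_x - \phi(x))^2\right] + \mathrm{Var}(\hat{f}(x)) \\
&\ge E\left[(Y_x - \phi(x))^2\right]
\end{aligned}$$
The cross term vanishes because $E(\hat{f}(x)) = \phi(x)$ and the new response $Y_x$ is independent of the training sample. If we could base our prediction on $\phi(x)$ instead of $\hat{f}(x)$, we would shrink the MSE(prediction) and improve the predictive performance of our model. How can we approximate $\phi(x) = E(\hat{f}(x))$?
$$\phi(x) \approx \frac{1}{B} \sum_{b=1}^{B} \hat{f}^*_b(x)$$
where $\hat{f}^*_b$ is the model fit to the $b$th bootstrap sample.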
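The bootstrap average can be sketched in base R. To stay self-contained, the example bags a hand-rolled one-split tree (a "stump") in place of the rpart trees used in the course; `fit_stump`, `predict_stump`, the simulated data, and `B = 100` are all assumptions made for illustration.

```r
# Bagging: average the predictions of models fit to B bootstrap samples.
set.seed(7)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)                 # simulated data (assumption)

fit_stump <- function(x, y) {                    # one-split regression tree
  xs <- sort(unique(x))
  cand <- (xs[-1] + xs[-length(xs)]) / 2
  rss <- sapply(cand, function(s) {
    sum((y[x <= s] - mean(y[x <= s]))^2) + sum((y[x > s] - mean(y[x > s]))^2)
  })
  s <- cand[which.min(rss)]
  list(s = s, left = mean(y[x <= s]), right = mean(y[x > s]))
}
predict_stump <- function(st, x) ifelse(x <= st$s, st$left, st$right)

B <- 100
x_new <- c(2, 5, 8)                              # points at which to predict
boot_preds <- replicate(B, {
  idx <- sample(n, replace = TRUE)               # bootstrap sample
  predict_stump(fit_stump(x[idx], y[idx]), x_new)
})                                               # 3 x B matrix of predictions

bagged <- rowMeans(boot_preds)                   # (1/B) * sum over b of f*_b(x)
bagged
```

Each row of `boot_preds` holds the B bootstrap predictions at one new point, so the row variance gives a rough sense of how unstable a single fit is there.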
24
Bagging
We will now use bagging to hopefully arrive at an even better model for the price of a diamond. For simplicity, we will first use smaller and simpler trees to illustrate the idea of bagging. Below are four different trees fit to bootstrap samples drawn from the full diamonds dataset. For each tree fit I used cp = .005 and minsplit = 5. The bagged estimate of the diamond price is the mean of the predictions from the trees fit to the $B$ bootstrap samples.
25
These trees do not vary much
Because these trees are so similar, averaging their predictions yields little benefit and does not produce a reasonable estimate of $\phi(x)$.
26
Bagging with $B$ bootstrap samples
27
Tree-based Models: Random Forests (Breiman, 2001)
At each split, only a random subset of the predictors is considered as candidates. This is the key feature of the random forest that breaks the correlation between trees fit to the bootstrap samples.
28
Random Forests
How to control the growth of your random forest model:
- ntree: number of trees to grow in your forest, like B or nbagg in bagging.
- mtry: number of predictors to choose randomly for each split (default = $p/3$ for regression problems and $\sqrt{p}$ for classification problems).
- nodesize: minimum size of the terminal nodes in terms of the number of observations contained in them. The default is 1 for classification problems and 5 for regression problems. Larger values speed up the fitting process because the trees in the forest will not be as big.
- maxnodes: maximum number of terminal nodes a tree in the forest can have. Smaller values will speed up fitting.
29
Activity: Diamonds in the Forest
Diamond.RF = randomForest(logPrice~., data=DiaNew.train, mtry=??, ntree=??, nodesize=??, maxnodes=??)
30
Tree-based Models: Gradient Boosting
Key Features:
- Each tree in the sequence is trying to explain variation left over from the previous tree!
- Trees at each stage are simpler (fewer terminal nodes).
- More "knobs" to adjust.
31
Training a Gradient Boosted Model
- n.trees = number of layers (trees) in the sequence.
- interaction.depth = number of terminal nodes in each layer − 1.
- n.minobsinnode = minimum number of observations allowed in a terminal node.
- shrinkage = learning rate; the smaller the value, the more layers need to be run.
- bag.fraction = fraction of cases used in fitting the next layer.
- train.fraction = allows for a training/validation split.
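The idea that each layer explains what the previous layers left over can be sketched without the gbm package. The base R example below is an illustration under squared-error loss with one-split trees, not gbm's implementation: `fit_stump`, `predict_stump`, `n_trees`, and `shrinkage` are hypothetical names standing in for the corresponding ideas, and the data are simulated.

```r
# Gradient boosting for squared error: fit each new stump to the current residuals.
set.seed(9)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)                 # simulated data (assumption)

fit_stump <- function(x, y) {                    # one-split tree (2 terminal nodes)
  xs <- sort(unique(x))
  cand <- (xs[-1] + xs[-length(xs)]) / 2
  rss <- sapply(cand, function(s) {
    sum((y[x <= s] - mean(y[x <= s]))^2) + sum((y[x > s] - mean(y[x > s]))^2)
  })
  s <- cand[which.min(rss)]
  list(s = s, left = mean(y[x <= s]), right = mean(y[x > s]))
}
predict_stump <- function(st, x) ifelse(x <= st$s, st$left, st$right)

n_trees <- 100                                   # number of layers
shrinkage <- 0.1                                 # small steps, so more layers are needed
pred <- rep(mean(y), n)                          # start from the overall mean
train_mse <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  resid <- y - pred                              # variation left over so far
  st <- fit_stump(x, resid)                      # next layer targets the residuals
  pred <- pred + shrinkage * predict_stump(st, x)
  train_mse[m] <- mean((y - pred)^2)
}

train_mse[c(1, 10, 100)]                         # training error falls as layers are added
```

Training error always falls here, so a validation split or cross-validation is what would guide the choice of the number of layers in practice.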
32
Tree-based Models: Gradient Boosting
33
Tree-based Models: Gradient Boosting
Activity: Can you train a GBM that does better than our previous models?
34
Treed Regression
Rather than splitting our data into disjoint regions in the predictor/term space and then using the mean of the response in these regions as the basis of a prediction, treed regression fits a linear model in the "terminal nodes". What? Show me please…
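A toy version of the idea in base R: partition the data first, then fit a separate linear model within each region. In this sketch the "tree" is a single split on a factor chosen by hand rather than found by an algorithm, and the data frame `dat` with columns `quality`, `carat`, and `logPrice` is simulated for illustration; it is not the course's diamonds data.

```r
# Treed regression sketch: a linear model inside each "terminal node".
set.seed(10)
n <- 300
dat <- data.frame(
  quality = factor(sample(c("low", "high"), n, replace = TRUE)),
  carat   = runif(n, 0.3, 2)
)
slope <- ifelse(dat$quality == "high", 3, 1)     # price rises faster for high quality
dat$logPrice <- 5 + slope * dat$carat + rnorm(n, sd = 0.2)   # simulated (assumption)

# "Tree" step: one split on quality. "Regression" step: lm within each node.
node_fits <- lapply(split(dat, dat$quality), function(d) lm(logPrice ~ carat, data = d))
sapply(node_fits, coef)                          # one intercept and slope per node

# Prediction: route a new observation to its node, then use that node's model.
new_obs <- data.frame(quality = "high", carat = 1)
predict(node_fits[[as.character(new_obs$quality)]], newdata = new_obs)
```

An ordinary regression tree would predict a single mean within each node, whereas here each node keeps its own relationship between price and carat.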
35
Treed Regression
36
Treed Regression
This approach seems like it might work well for these data: we have different qualities of diamonds based on color, clarity, and cut, and we know that price is strongly associated with carat size. Thus a tree that first breaks up diamonds in terms of quality and then examines the relationship between price and carat size might work great!
37
Classification Problems
Many of the methods we have considered can be used for classification problems as well, e.g. any of the tree-based methods can be used for classification. Others that are unique to classification include discriminant analysis, nearest neighbor classification, Naïve Bayes classifiers, support vector machines (SVM), etc. (Script file with a few examples in the Block 4 folder on GitHub.)