A Core Curriculum for Undergraduate Data Science

A Core Curriculum for Undergraduate Data Science
Todd Iverson, Brant Deppa, Silas Bergen, Tisha Hooks, April Kerby, Chris Malone
Winona State University

Supervised Learning (DSCI 425)
Brant Deppa, Ph.D., Professor of Statistics & Data Science, Winona State University
bdeppa@winona.edu

Course Topics: Numeric Response (Y)
- Introduction to supervised learning
- Review of multiple regression (predictors vs. terms)
- Cross-validation for a numeric response (discussed throughout)
- Automatic term selectors (ACE/AVAS/MARS)
- Projection pursuit regression (PPR) and neural networks
- Penalized/regularized regression (ridge, LASSO, elastic net)
- Dimension reduction (PCR and PLS regression)
- Tree-based models (CART, bagging, random forests, boosting, treed regression)
- Nearest neighbor regression

Course Topics (cont'd): Categorical/Nominal/Ordinal Response (Y)
- Introduction to classification problems
- Nearest neighbor classification
- Naïve Bayes classification
- Tree-based models for classification
- Discriminant analysis
- Neural networks
- Support vector machines (SVM)
- Multiple logistic regression
- Blending/stacking models (in progress)

Supervised Learning – General Problem

$\hat{Y}$ = predicted value if the response is numeric
$\hat{Y}$ = predicted class if the response is nominal or ordinal, or $\hat{Y} = \hat{P}(\text{class } k \mid \mathbf{X})$

To compare candidate models we need to estimate how well each will predict new observations; cross-validation (CV) methods are generally used for this purpose.

Cross-Validation – Split Sample

K-Fold Cross-Validation
The data are divided into $k$ roughly equal-sized subgroups. Each subgroup acts as a validation set in turn, with the model fit to the remaining $k-1$ subgroups. The average error across all validation sets is computed and used to choose between rival models.
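
To make this concrete, here is a minimal sketch of k-fold CV in base R (not the course's own code), comparing two candidate regression models; the ggplot2 diamonds data stands in for the course's diamond dataset, and the fold count and formulas are illustrative choices.

library(ggplot2)   # for the diamonds data, used here only as an example
set.seed(1)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(diamonds)))   # random fold labels

cv.rmse <- function(form) {
  errs <- sapply(1:k, function(f) {
    train <- diamonds[folds != f, ]
    valid <- diamonds[folds == f, ]
    fit   <- lm(form, data = train)
    sqrt(mean((valid$price - predict(fit, valid))^2))     # RMSE on the held-out fold
  })
  mean(errs)                                              # average error across the k folds
}

cv.rmse(price ~ carat)                   # simple candidate model
cv.rmse(price ~ carat + cut + clarity)   # richer rival model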

Bootstrap Cross-Validation

Data: $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)$, where the $\mathbf{x}_i$ are $p$-dimensional predictor vectors.

Bootstrap sample: $(\mathbf{x}_1^*, y_1^*), (\mathbf{x}_2^*, y_2^*), \ldots, (\mathbf{x}_n^*, y_n^*)$, where each $(\mathbf{x}_i^*, y_i^*)$ is an observation drawn at random, with replacement, from the original data.

The model is fit to the bootstrap sample; observations not selected (i.e. the out-of-bootstrap, or OOB, observations) constitute the validation set, and we calculate quality-of-prediction metrics for each iteration. This process is repeated a large number of times (B = 500, 1000, 5000, etc.).
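
A minimal sketch of the same idea with the resampling done by hand in R; B, the model formula, and the use of the ggplot2 diamonds data are illustrative assumptions.

library(ggplot2)
set.seed(1)
B <- 100
n <- nrow(diamonds)
oob.rmse <- replicate(B, {
  idx <- sample(n, replace = TRUE)     # bootstrap sample of row indices
  oob <- setdiff(1:n, idx)             # out-of-bootstrap observations form the validation set
  fit <- lm(price ~ carat + cut, data = diamonds[idx, ])
  sqrt(mean((diamonds$price[oob] - predict(fit, diamonds[oob, ]))^2))
})
mean(oob.rmse)   # aggregate OOB error over the B bootstrap iterations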

Monte Carlo Cross-Validation (MCCV)
Because each of the cross-validation strategies has an element of randomness, we can expect the results to vary from one CV run to the next. With MCCV we conduct split-sample CV multiple times and aggregate the results to quantify the predictive performance of a candidate model.
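
A short sketch of MCCV as repeated split-sample validation; the number of repetitions, the 2/3–1/3 split, and the model formula are arbitrary illustrative choices, again using the ggplot2 diamonds data.

library(ggplot2)
set.seed(1)
reps <- 50
n    <- nrow(diamonds)
mccv.rmse <- replicate(reps, {
  train.idx <- sample(n, size = floor(2/3 * n))   # one random split-sample CV run
  fit   <- lm(price ~ carat + cut, data = diamonds[train.idx, ])
  valid <- diamonds[-train.idx, ]
  sqrt(mean((valid$price - predict(fit, valid))^2))
})
c(mean = mean(mccv.rmse), sd = sd(mccv.rmse))     # aggregate over the repeated splits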

Multiple Regression
Suppose we have a numeric response $Y$ and a set of predictors $X_1, X_2, \ldots, X_p$. The multiple regression model is given by

$E(Y \mid \mathbf{X}) = \beta_0 + \beta_1 U_1 + \beta_2 U_2 + \cdots + \beta_{k-1} U_{k-1}$

$Y = \beta_0 + \beta_1 U_1 + \cdots + \beta_{k-1} U_{k-1} + e$

where $U_j$ is the $j$th term. Terms are functions of the predictors.
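
As a small illustration (not from the course notes), the terms below include a log transformation, a polynomial term, and the dummy terms R generates automatically for a factor; the variable names come from the ggplot2 diamonds data.

library(ggplot2)
# Terms U_j as functions of the predictors: log(carat), depth^2, and dummies for cut.
fit <- lm(log(price) ~ log(carat) + I(depth^2) + cut, data = diamonds)
summary(fit)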

Types of Terms

Intercept: $U_0 = 1$.

Predictor terms: $U_j = X_k$. A term can be a predictor itself as long as the predictor is meaningfully numeric (i.e. $X_k$ is a count or a measurement).

Polynomial terms: $U_j = X_k^p$. These are integer powers of numeric predictors, e.g. $X_k^2$, $X_k^3$, etc.

Transformation terms: $U_j = X_k^{(\lambda)}$. Here $X_k^{(\lambda)}$ is a member of the Tukey power family of transformations, e.g. $\log X_k$ or $\log(X_k + 1)$.

Dummy terms: $U_j = 1$ if a specific condition is met, and $U_j = 0$ if it is not.

"Hockey stick" functions:
$U_k = (X_j - c)_+ = \begin{cases} X_j - c & \text{if } X_j > c \\ 0 & \text{if } X_j \le c \end{cases}$  or  $U_k = (c - X_j)_+ = \begin{cases} c - X_j & \text{if } X_j < c \\ 0 & \text{if } X_j \ge c \end{cases}$,  where $c \in (\min X_j, \max X_j)$.
These types of terms are used in fitting Multivariate Adaptive Regression Spline (MARS) models.

Types of Terms (cont'd)

Factor terms: Suppose the $k$th predictor $X_k$ is a nominal/ordinal variable with $m$ levels ($m > 2$). Then we choose one of the levels as the reference group and create dummy terms for the remaining $m - 1$ levels.

Interaction terms: $U_{jk} = U_j \cdot U_k$. Here the term $U_{jk}$ is the product of two terms ($U_j$ and $U_k$), and those two terms could be of any other term type.

Linear combination terms: $U_j = \alpha_{j1} X_1 + \alpha_{j2} X_2 + \cdots + \alpha_{jp} X_p$. PCR and PLS regression use these terms as basic building blocks.

Spline basis terms: piecewise polynomial pieces between knots, e.g. $U_k = \begin{cases} (X_j - \kappa_{m-1})^p & \text{if } \kappa_{m-1} \le X_j < \kappa_m \\ 0 & \text{otherwise} \end{cases}$

Nonparametric terms: $U_k = f_k(X_j)$, where the $f_k$ are estimated by smoothing an appropriate scatterplot.

Types of Terms (cont'd)

Trigonometric terms: $U_j = \sin\left(\frac{2\pi X}{m}\right)$, $U_k = \cos\left(\frac{2\pi X}{m}\right)$, where $m$ is the period. The periodicities used can be based on trial and error, on knowledge of the physical phenomenon being studied (sunspots, in the example shown), or on tools such as spectral analysis that identify important periodicities. We will not delve into harmonic analysis in this course; this is presented only to illustrate that, by using appropriate terms, we can develop models capable of fitting complex relationships between the response and the predictors. (Rob Hyndman's online forecasting text, with its supporting R library fpp2, is a great reference.)

Activity: Building a MLR model for diamond prices

Multivariate Adaptive Regression Splines (MARS)

$Y = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(\mathbf{X}) + e$

where the $h_m(\mathbf{X})$ are "hockey stick" (hinge) functions of the predictors or products of such functions (interactions). If we have factor terms, those are handled in the usual way: dummy variables for all but one level of the factor. These dummy terms can be involved in interactions as well.

Multivariate Adaptive Regression Splines (MARS)

Multivariate Adaptive Regression Splines (MARS) (Friedman, 1991)
The R package earth contains functions for building and plotting MARS models:

mod = earth(y ~ ., data = yourdata, degree = 1)

Activity: Build a MARS model for the diamond price data.
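
A slightly fuller, hedged sketch using the earth package; the formula, degree = 2 (which allows hinge-function interactions), and the use of the ggplot2 diamonds data in place of the course dataset are illustrative assumptions.

library(earth)
library(ggplot2)
mars.fit <- earth(log(price) ~ carat + depth + table + cut + color + clarity,
                  data = diamonds, degree = 2)   # degree = 2 permits interactions of hinge terms
summary(mars.fit)   # selected hinge functions and their coefficients
evimp(mars.fit)     # estimated variable importance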

Tree-Based Models: CART (Breiman et al., 1984)
For regression trees,

$\hat{y} = \hat{f}(\mathbf{x}) = \sum_{m=1}^{M} c_m I(\mathbf{x} \in R_m)$

Assuming we are minimizing the RSS, the fitted value is the mean response in each terminal-node region $R_m$: $\hat{c}_m = \text{ave}(y_i \mid \mathbf{x}_i \in R_m)$.

Tree-Based Models: CART

Tree-based Models: CART
The task of determining the neighborhoods $R_m$ is solved by determining a split coordinate (variable) $j$, i.e. which variable to split on, and a split point $s$. A split coordinate and split point define the rectangles $R_1$ and $R_2$ as

$R_1(j,s) = \{\mathbf{x} \mid x_j \le s\}$  and  $R_2(j,s) = \{\mathbf{x} \mid x_j > s\}$

The residual sum of squares (RSS) for a split determined by $(j,s)$ is

$RSS(j,s) = \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2$

The goal at any given stage is to find the pair $(j,s)$ for which $RSS(j,s)$ is minimal, i.e. for which the overall RSS is maximally reduced.
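
A hedged sketch of fitting such a regression tree with rpart; the cp and minsplit values echo those used later for bagging, and the ggplot2 diamonds data again stands in for the course dataset.

library(rpart)
library(ggplot2)
cart.fit <- rpart(log(price) ~ carat + cut + color + clarity,
                  data = diamonds, method = "anova",          # RSS-based regression tree
                  control = rpart.control(cp = 0.005, minsplit = 5))
printcp(cart.fit)                            # complexity-parameter table, useful for pruning
plot(cart.fit); text(cart.fit, cex = 0.7)    # draw the fitted tree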

Ensemble Tree-based Models
Ensemble models combine multiple models by averaging their predictions. For trees, the most common approaches to building an ensemble are:
- Bagging (Breiman, 1996)
- Random or bootstrap forests (Breiman, 2001)
- Boosting (Friedman, 1999)

Tree-based Models: Bagging
Suppose we are interested in predicting a numeric response variable: $y \mid \mathbf{x}$ = the response value for a given set of $x$'s $= Y_x$, and $\hat{f}(\mathbf{x})$ = the prediction from a specified model based on $\mathbf{x}$. For example, $\hat{f}(\mathbf{x})$ might come from a MLR model or from an RPART model.

Let $\mu_{\hat{f}} = E(\hat{f}(\mathbf{x}))$, where the expectation is with respect to the distribution underlying the training sample (viewed as a random variable, $\hat{f}(\mathbf{x})$ is a function of the training sample, which can itself be viewed as a high-dimensional random variable) and not with respect to $\mathbf{x}$ (which is considered fixed).

Bagging

$MSE(\text{prediction}) = E\left[(Y_x - \hat{f}(\mathbf{x}))^2\right]$
$= E\left[(Y_x - \mu_{\hat{f}} + \mu_{\hat{f}} - \hat{f}(\mathbf{x}))^2\right]$
$= E\left[(Y_x - \mu_{\hat{f}})^2\right] + E\left[(\hat{f}(\mathbf{x}) - \mu_{\hat{f}})^2\right]$   (the cross term vanishes because $E[\hat{f}(\mathbf{x})] = \mu_{\hat{f}}$)
$= E\left[(Y_x - \mu_{\hat{f}})^2\right] + Var(\hat{f}(\mathbf{x}))$
$\ge E\left[(Y_x - \mu_{\hat{f}})^2\right]$

If we could base our prediction on $\mu_{\hat{f}}$ instead of $\hat{f}(\mathbf{x})$, we would shrink the MSE(prediction) and improve the predictive performance of our model. How can we approximate $\mu_{\hat{f}} = E(\hat{f}(\mathbf{x}))$? Average the predictions from models fit to $B$ bootstrap samples:

$\hat{\mu}_{\hat{f}}^*(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b^*(\mathbf{x})$

Bagging
We will now use bagging to, we hope, arrive at an even better model for the price of a diamond. For simplicity we will first use smaller, simpler trees to illustrate the idea of bagging. Below are four different trees fit to bootstrap samples drawn from the full diamonds dataset; for each tree fit I used cp = .005 and minsplit = 5. The bagged estimate of the diamond price is the mean of the predictions from the trees fit to the $B$ bootstrap samples.
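
A minimal sketch of bagging "by hand" with rpart, under the same cp = .005 and minsplit = 5 settings; B = 10, the formula, and the use of the ggplot2 diamonds data are illustrative.

library(rpart)
library(ggplot2)
set.seed(1)
B <- 10
bag.preds <- sapply(1:B, function(b) {
  idx <- sample(nrow(diamonds), replace = TRUE)     # draw a bootstrap sample
  fit <- rpart(log(price) ~ carat + cut + color + clarity,
               data = diamonds[idx, ],
               control = rpart.control(cp = 0.005, minsplit = 5))
  predict(fit, diamonds)                            # predictions for every diamond from this tree
})
bagged <- rowMeans(bag.preds)                       # bagged estimate: average over the B trees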

These trees do not vary much. Thus the benefit of averaging their predictions will be limited: the average will not be a much better estimate of $\mu_{\hat{f}}$ than any single tree.

Bagging with B = 10

Tree-based Models: Random Forests (Breiman, 2001)
Random forests fit trees to bootstrap samples, but at each split only a randomly chosen subset of the predictors is considered as candidates for the split. This is the key feature of the random forest that breaks the correlation between the trees fit to the bootstrap samples.

Random Forests
How to control the growth of your random forest model:
- ntree – number of trees to grow in your forest, like B or nbagg in bagging.
- mtry – number of predictors chosen randomly as candidates for each split (default = $p/3$ for regression problems and $\sqrt{p}$ for classification problems).
- nodesize – minimum size of the terminal nodes, in terms of the number of observations contained in them; the default is 1 for classification problems and 5 for regression problems. Larger values speed up the fitting process because the trees in the forest will not be as big.
- maxnodes – maximum number of terminal nodes a tree in the forest can have. Smaller values will speed up fitting.

Activity: Diamonds in the Forest

Diamond.RF = randomForest(logPrice ~ ., data = DiaNew.train, mtry = ??, ntree = ??, nodesize = ??, maxnodes = ??)
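
One hedged way to fill in the ?? placeholders, assuming the DiaNew.train data frame with a logPrice response from the activity; the specific values are starting points for experimentation, not recommendations.

library(randomForest)
set.seed(1)
Diamond.RF <- randomForest(logPrice ~ ., data = DiaNew.train,
                           mtry = 3,         # predictors tried at each split
                           ntree = 500,      # number of trees in the forest
                           nodesize = 5,     # minimum terminal-node size (regression default)
                           maxnodes = NULL)  # NULL lets trees grow until nodesize is reached
Diamond.RF         # prints OOB mean squared residuals and % variance explained
plot(Diamond.RF)   # OOB error as trees are added to the forest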

Tree-based Models: Gradient Boosting
Key features:
- Each tree in the sequence tries to explain the variation left over from the previous trees!
- The trees at each stage are simpler (fewer terminal nodes).
- There are more "knobs" to adjust.

Training a Gradient Boosted Model
- n.trees = number of layers (trees) in the sequence.
- interaction.depth = number of terminal nodes in each layer, minus 1.
- n.minobsinnode = minimum number of observations a node can contain and still be split.
- shrinkage = the learning rate; smaller values generally require more layers.
- bag.fraction = fraction of cases used in fitting the next layer.
- train.fraction = allows for a training/validation split.
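
A hedged sketch mapping these parameters onto a gbm() call; all tuning values are illustrative, and the ggplot2 diamonds data (shuffled first, since train.fraction takes the initial rows as the training set) stands in for the course dataset.

library(gbm)
library(ggplot2)
set.seed(1)
dia <- diamonds[sample(nrow(diamonds)), ]   # shuffle so the training/validation split is random
gbm.fit <- gbm(log(price) ~ carat + cut + color + clarity,
               data = dia, distribution = "gaussian",
               n.trees = 2000,            # number of layers/trees in the sequence
               interaction.depth = 3,     # controls the size of the tree at each layer
               shrinkage = 0.05,          # learning rate; smaller values need more trees
               n.minobsinnode = 10,
               bag.fraction = 0.5,        # random subsample used to fit each layer
               train.fraction = 0.8)      # hold out 20% as a validation set
best.iter <- gbm.perf(gbm.fit, method = "test")   # number of layers chosen by validation error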

Tree-based Models: Gradient Boosting

Tree-based Models: Gradient Boosting Activity: Can you train a GBM that does better than our previous models?

Treed Regression
Rather than splitting our data into disjoint regions of the predictor/term space and then using the mean of the response in each region as the basis for a prediction, treed regression fits a linear model in each of the "terminal nodes." What? Show me please…

Treed Regression

Treed Regression
This approach seems like it might work well for these data: we have diamonds of different qualities based on color, clarity, and cut, and we know that price is strongly associated with carat size. Thus a tree that first breaks up the diamonds in terms of quality, then examines the relationship between price and carat size within each group, might work great?!?
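
One way to fit such a model, assuming the partykit package (an assumption; the course may use a different tool): the variables after the | are used for splitting, and a linear model in carat is fit within each terminal node. maxdepth is limited only to keep the illustration small, and the ggplot2 diamonds data stands in for the course dataset.

library(partykit)
library(ggplot2)
# Split on diamond quality (cut, color, clarity); regress log(price) on carat within each node.
tr.fit <- lmtree(log(price) ~ carat | cut + color + clarity,
                 data = diamonds, maxdepth = 3)
plot(tr.fit)                              # terminal-node scatterplots with fitted lines
predict(tr.fit, newdata = diamonds[1:5, ])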

Classification Problems
Many of the methods we have considered can be used for classification problems as well; e.g., any of the tree-based methods can be used for classification. Others that are unique to classification include discriminant analysis, nearest neighbor classification, naïve Bayes classifiers, support vector machines (SVM), etc. (A script file with a few examples is in the Block 4 folder on GitHub.)