Peter Fox, Data Analytics – ITWS-4963/ITWS-6965, Week 11a, April 14, 2015: Interpreting cross-validation, bootstrapping, bagging, boosting, etc.

coleman

> head(coleman)
  salaryP fatherWc sstatus teacherSc motherLev Y
  [data rows not preserved in the transcript]

What were you doing?

> library(robustbase)   # for lmrob()
> library(cvTools)
> call <- call("lmrob", formula = Y ~ .)
> # set up folds for cross-validation
> folds <- cvFolds(nrow(coleman), K = 5, R = 10)
> # perform cross-validation
> cvTool(call, data = coleman, y = coleman$Y, cost = rtmspe,
+        folds = folds, costArgs = list(trim = 0.1))
             CV
 [1,] ...
  ...
[10,] ...
(the ten replicated CV values were not preserved in the transcript)
Warning messages:
1: In lmrob.S(x, y, control = control) :
   S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
2: In lmrob.S(x, y, control = control) :
   S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
3: In lmrob.S(x, y, control = control) :
   find_scale() did not converge in 'maxit.scale' (= 200) iterations
4: In lmrob.S(x, y, control = control) :
   find_scale() did not converge in 'maxit.scale' (= 200) iterations

Did you get this plot – how?

> cvFits
5-fold CV results:
  Fit  CV
1  LS  ...
2  MM  ...
3 LTS  ...

Best model:
   CV
 "MM"

LS, LTS, MM?

LS = (ordinary) least squares; LTS = least trimmed squares; MM = MM-estimation.
The breakdown value of an estimator is the smallest fraction of contamination that can cause the estimator to take on values arbitrarily far from its value on the uncontaminated data, so it can be used as a measure of the robustness of the estimator. Rousseeuw and Leroy (1987) and others introduced high-breakdown-value estimators for linear regression.
LTS – see the SAS/STAT robust regression documentation: viewer.htm#statug_rreg_sect018.htm#statug.rreg.robustregfltsest
MM – viewer.htm#statug_rreg_sect019.htm
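The cvFits object on the previous slide is not constructed anywhere in this transcript. A minimal sketch of how such an LS/MM/LTS comparison could be built with cvTools (assuming robustbase provides lmrob() and ltsReg(), and reusing the folds defined earlier):

library(robustbase)   # coleman data, lmrob (MM), ltsReg (LTS)
library(cvTools)
data(coleman)
folds <- cvFolds(nrow(coleman), K = 5, R = 10)

fitLm    <- lm(Y ~ ., data = coleman)        # LS
fitLmrob <- lmrob(Y ~ ., data = coleman)     # MM
fitLts   <- ltsReg(Y ~ ., data = coleman)    # LTS

cvFitLm    <- cvLm(fitLm, cost = rtmspe, folds = folds, trim = 0.1)
cvFitLmrob <- cvLmrob(fitLmrob, cost = rtmspe, folds = folds, trim = 0.1)
cvFitLts   <- cvLts(fitLts, cost = rtmspe, folds = folds, trim = 0.1)

cvFits <- cvSelect(LS = cvFitLm, MM = cvFitLmrob, LTS = cvFitLts)
cvFits         # table like the one shown above
plot(cvFits)   # box-and-whisker plot of the replicated CV errors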

50 and 75% subsets

# 50% subsets
fitLts50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5)
cvFitLts50 <- cvLts(fitLts50, cost = rtmspe, folds = folds,
                    fit = "both", trim = 0.1)
# 75% subsets
fitLts75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75)
cvFitLts75 <- cvLts(fitLts75, cost = rtmspe, folds = folds,
                    fit = "both", trim = 0.1)
# combine and plot results
cvFitsLts <- cvSelect("0.5" = cvFitLts50, "0.75" = cvFitLts75)

cvFitsLts (50/75)

> cvFitsLts
5-fold CV results:
  Fit reweighted raw
  ...        ...    ...   (values for the 0.5 and 0.75 fits not preserved in the transcript)

Best model:
reweighted     raw
   "0.75"  "0.75"

Tuning

tuning <- list(tuning.psi = c(3.14, 3.44, 3.88, 4.68))
# perform cross-validation
cvFitsLmrob <- cvTuning(fitLmrob$call, data = coleman, y = coleman$Y,
                        tuning = tuning, cost = rtmspe, folds = folds,
                        costArgs = list(trim = 0.1))

cvFitsLmrob

> cvFitsLmrob
5-fold CV results:
  tuning.psi  CV
  ...         ...

Optimal tuning parameter:
   tuning.psi
CV        ...

Lab on Friday

library(boot)                     # cv.glm(), glm.diag()
data(mammals, package = "MASS")
mammals.glm <- glm(log(brain) ~ log(body), data = mammals)
(cv.err <- cv.glm(mammals, mammals.glm)$delta)
(cv.err.6 <- cv.glm(mammals, mammals.glm, K = 6)$delta)
# As this is a linear model we could calculate the leave-one-out
# cross-validation estimate without any extra model-fitting.
muhat <- fitted(mammals.glm)
mammals.diag <- glm.diag(mammals.glm)
(cv.err <- mean((mammals.glm$y - muhat)^2 / (1 - mammals.diag$h)^2))

Cost functions, etc.

# leave-one-out and 11-fold cross-validation prediction error for
# the nodal data set (in the boot package). Since the response is a
# binary variable an appropriate cost function is
> cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
> nodal.glm <- glm(r ~ stage + xray + acid, binomial, data = nodal)
> (cv.err <- cv.glm(nodal, nodal.glm, cost, K = nrow(nodal))$delta)
[1] ...
> (cv.11.err <- cv.glm(nodal, nodal.glm, cost, K = 11)$delta)
[1] ...

cvTools

http://cran.r-project.org/web/packages/cvTools/cvTools.pdf

Very powerful and flexible package for CV (regression) but very much a black box! If you use it, become very, very familiar with the outputs and be prepared to experiment…
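A low-tech way to open that black box is simply to look inside the returned objects, e.g. for the cvFits object from the earlier slide (plain R, nothing cvTools-specific):

str(cvFits)       # every component of the object
unclass(cvFits)   # print it as a plain list, bypassing cvTools' print method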

Bootstrap aggregation (bagging)

Improves the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting.
Usually applied to decision-tree methods, but it can be used with any type of method.
– Bagging is a special case of the model-averaging approach.
Harder to interpret – why?
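The adabag examples below do all of this for you; purely to illustrate the idea, here is a minimal hand-rolled bagging sketch (my own illustration, not from the slides) that fits an rpart tree to each bootstrap resample and averages the predictions:

library(rpart)

bag_predict <- function(formula, data, newdata, B = 100) {
  preds <- replicate(B, {
    idx <- sample(nrow(data), replace = TRUE)    # bootstrap resample
    fit <- rpart(formula, data = data[idx, ])    # unstable base learner
    predict(fit, newdata = newdata)
  })
  rowMeans(preds)                                # aggregate by averaging
}

set.seed(42)
bagged <- bag_predict(dist ~ speed, data = cars, newdata = cars, B = 50)
head(cbind(observed = cars$dist, bagged))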

[Figure: Ozone data – 100 bootstrap samples and their average]

Bagging shows improvements for unstable procedures (Breiman, 1996), e.g. neural nets, classification and regression trees, and subset selection in linear regression… but it can mildly degrade the performance of stable methods such as K-nearest neighbors.

Bagging (bootstrap aggregation)*

library(adabag)      # bagging(), predict.bagging()
library(mlbench)
library(rpart)
data(BreastCancer)
l <- length(BreastCancer[, 1])
sub <- sample(1:l, 2*l/3)
# (note: the trees are fit on the full data here, while prediction is on the rows outside sub)
BC.bagging <- bagging(Class ~ ., data = BreastCancer[, -1], mfinal = 20,
                      control = rpart.control(maxdepth = 3))
BC.bagging.pred <- predict.bagging(BC.bagging, newdata = BreastCancer[-sub, -1])
BC.bagging.pred$confusion
               Observed Class
Predicted Class benign malignant
      benign       ...       ...
      malignant    ...       ...
BC.bagging.pred$error
[1] ...

A little later

> data(BreastCancer)
> l <- length(BreastCancer[, 1])
> sub <- sample(1:l, 2*l/3)
> BC.bagging <- bagging(Class ~ ., data = BreastCancer[, -1], mfinal = 20,
+                       control = rpart.control(maxdepth = 3))
> BC.bagging.pred <- predict.bagging(BC.bagging, newdata = BreastCancer[-sub, -1])
> BC.bagging.pred$confusion
               Observed Class
Predicted Class benign malignant
      benign       ...       ...
      malignant      7        78
> BC.bagging.pred$error
[1] ...
# a repeat of the same run: sample() draws a new split, so the results change

Bagging (Vehicle)

> data(Vehicle)
> l <- length(Vehicle[, 1])
> sub <- sample(1:l, 2*l/3)
> Vehicle.bagging <- bagging(Class ~ ., data = Vehicle[sub, ], mfinal = 40,
+                            control = rpart.control(maxdepth = 5))
> Vehicle.bagging.pred <- predict.bagging(Vehicle.bagging, newdata = Vehicle[-sub, ])
> Vehicle.bagging.pred$confusion
               Observed Class
Predicted Class bus opel saab van
           bus  ...  ...  ... ...
           opel ...  ...  ... ...
           saab ...  ...  ... ...
           van  ...  ...  ... ...
> Vehicle.bagging.pred$error
[1] ...

Weak models…

A weak learner: a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing).
A strong learner: a classifier that is arbitrarily well correlated with the true classification.
Can a set of weak learners create a single strong learner?

Boosting…

Boosting is primarily about reducing bias in supervised learning. Most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier.
– The weak classifiers are typically weighted in some way that is usually related to their accuracy.
After a weak learner is added, the data are reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight. Thus, future weak learners focus more on the examples that previous weak learners misclassified.
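adabag also provides AdaBoost-style boosting through the same interface as the bagging() calls above; a hedged sketch (not shown on the slides) using the same kind of BreastCancer split:

library(adabag)
library(mlbench)

data(BreastCancer)
l <- length(BreastCancer[, 1])
sub <- sample(1:l, 2*l/3)
BC.adaboost <- boosting(Class ~ ., data = BreastCancer[sub, -1], mfinal = 20,
                        control = rpart.control(maxdepth = 3))
BC.adaboost.pred <- predict.boosting(BC.adaboost, newdata = BreastCancer[-sub, -1])
BC.adaboost.pred$confusion
BC.adaboost.pred$error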

Diamonds

require(ggplot2)   # or load the package first
data(diamonds)
head(diamonds)     # look at the data!
# ggplot(diamonds, aes(clarity, fill = cut)) + geom_bar()
ggplot(diamonds, aes(clarity)) + geom_bar() + facet_wrap(~ cut)
ggplot(diamonds) + geom_histogram(aes(x = price)) + geom_vline(xintercept = 12000)
ggplot(diamonds, aes(clarity)) + geom_freqpoly(aes(group = cut, colour = cut))

[Figure: ggplot(diamonds, aes(clarity)) + geom_freqpoly(aes(group = cut, colour = cut))]

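The glmboost() call on the next slide uses an Expensive column that the transcript never defines. Given the geom_vline at a price of 12000 above, it was presumably created along these lines (hypothetical reconstruction):

diamonds$Expensive <- ifelse(diamonds$price >= 12000, 1, 0)   # assumed threshold
table(diamonds$Expensive)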

Using diamonds… boost (glm)

> library(mboost)
> mglmboost <- glmboost(as.factor(Expensive) ~ ., data = diamonds,
+                       family = Binomial(link = "logit"))
> summary(mglmboost)
Generalized Linear Models Fitted via Gradient Boosting

Call:
glmboost.formula(formula = as.factor(Expensive) ~ ., data = diamonds,
    family = Binomial(link = "logit"))

Negative Binomial Likelihood

Loss function: {
    f <- pmin(abs(f), 36) * sign(f)
    p <- exp(f)/(exp(f) + exp(-f))
    y <- (y + 1)/2
    -y * log(p) - (1 - y) * log(1 - p)
}

Using diamonds… boost (glm)

> summary(mglmboost)   # continued
Number of boosting iterations: mstop = 100
Step size: 0.1
Offset: ...

Coefficients:
NOTE: Coefficients from a Binomial model are half the size of coefficients
  from a model fitted via glm(..., family = 'binomial').
  See Warning section in ?coef.mboost
(Intercept)       carat   clarity.L
        ...         ...         ...
attr(,"offset")
[1] ...

Selection frequencies:
      carat (Intercept)   clarity.L
        ...         ...         ...

Cluster boosting

Assessment of the cluster-wise stability of a clustering of data, which can be cases × variables or dissimilarity data. The data are resampled using several schemes (bootstrap, subsetting, jittering, replacement of points by noise), and the Jaccard similarities of the original clusters to the most similar clusters in the resampled data are computed. The mean over these similarities is used as an index of the stability of a cluster (other statistics can be computed as well).

Cluster boosting

Quite general clustering methods are possible, i.e. methods estimating or fixing the number of clusters, and methods producing overlapping clusters or not assigning all cases to clusters (declaring them as "noise"). In R, clustermethod = X is used to select the method, e.g. k-means.
Lab on Friday… (iris, etc.)
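The description above matches the clusterboot() function in the fpc package; a minimal sketch on the iris measurements (assuming fpc, with k-means via kmeansCBI as the clustering method):

library(fpc)

set.seed(123)
cb <- clusterboot(iris[, 1:4], B = 100, bootmethod = "boot",
                  clustermethod = kmeansCBI, krange = 3)
print(cb)      # mean Jaccard similarity per cluster = stability index
cb$bootmean    # the stability indices themselves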

Example – bodyfat

The response variable is body fat measured by DXA (DEXfat), which can be seen as the gold standard for measuring body fat. However, DXA measurements are too expensive and complicated for broad use. Anthropometric measurements such as waist or hip circumference are, in comparison, very easy to obtain in a standard screening. A prediction formula based only on these measures could therefore be a valuable alternative with high clinical relevance for daily usage.
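The slides never show the data being loaded: the bodyfat data set originally shipped with mboost and now lives in the TH.data package, so (hedged) something like this precedes the code below:

library(mboost)
data("bodyfat", package = "TH.data")
dim(bodyfat)   # 71 observations: DEXfat plus 9 age/anthropometric predictors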


bodyfat

## regular linear model using three variables
lm1 <- lm(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat)
## estimate the same model by glmboost
glm1 <- glmboost(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat)
# We consider all available variables as potential predictors.
glm2 <- glmboost(DEXfat ~ ., data = bodyfat)
# or one could essentially call:
preds <- names(bodyfat[, names(bodyfat) != "DEXfat"])              ## names of predictors
fm <- as.formula(paste("DEXfat ~", paste(preds, collapse = "+")))  ## build formula

Compare linear models

> coef(lm1)
(Intercept)     hipcirc kneebreadth    anthro3a
        ...         ...         ...         ...
> coef(glm1, off2int = TRUE)   ## off2int adds the offset to the intercept
(Intercept)     hipcirc kneebreadth    anthro3a
        ...         ...         ...         ...

Conclusion?

> fm
DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth +
    anthro3a + anthro3b + anthro3c + anthro4
> coef(glm2, which = "")   ## select all
 (Intercept)          age    waistcirc      hipcirc elbowbreadth
         ...          ...          ...          ...          ...
 kneebreadth     anthro3a     anthro3b     anthro3c      anthro4
         ...          ...          ...          ...          ...
attr(,"offset")
[1] ...

plot(glm2, off2int = TRUE)

plot(glm2, ylim = range(coef(glm2, which = preds)))

> summary(bodyfat)
      age            DEXfat        waistcirc        hipcirc       elbowbreadth    kneebreadth       anthro3a
 Min.   :19.00   Min.   :11.21   Min.   :  ...   Min.   :  ...   Min.   :5.200   Min.   :  ...   Min.   :  ...
 1st Qu.:  ...   1st Qu.:  ...   1st Qu.:  ...   1st Qu.:  ...   1st Qu.:  ...   1st Qu.:  ...   1st Qu.:3.540
 Median :56.00   Median :29.63   Median :  ...   Median :  ...   Median :6.500   Median :  ...   Median :3.970
 Mean   :50.86   Mean   :30.78   Mean   :  ...   Mean   :  ...   Mean   :6.508   Mean   :  ...   Mean   :  ...
 3rd Qu.:  ...   3rd Qu.:  ...   3rd Qu.:  ...   3rd Qu.:  ...   3rd Qu.:  ...   3rd Qu.:  ...   3rd Qu.:4.155
 Max.   :67.00   Max.   :62.02   Max.   :  ...   Max.   :  ...   Max.   :7.400   Max.   :  ...   Max.   :4.680
    anthro3b        anthro3c        anthro4
 Min.   :2.580   Min.   :2.050   Min.   :  ...
 1st Qu.:  ...   1st Qu.:  ...   1st Qu.:5.040
 Median :4.390   Median :3.990   Median :5.530
 Mean   :4.291   Mean   :3.886   Mean   :  ...
 3rd Qu.:  ...   3rd Qu.:  ...   3rd Qu.:5.840
 Max.   :5.010   Max.   :4.620   Max.   :  ...
(values shown as ... were not preserved in the transcript)

Other forms of boosting

gamboost = Generalized Additive Model boosting – gradient boosting for optimizing arbitrary loss functions, where component-wise smoothing procedures are utilized as (univariate) base-learners.

> gam1 <- gamboost(DEXfat ~ bbs(hipcirc) + bbs(kneebreadth) + bbs(anthro3a),
+                  data = bodyfat)
> # Using plot() on a gamboost object automatically delivers the partial
> # effects of the different base-learners:
> par(mfrow = c(1, 3))   ## 3 plots in one device
> plot(gam1)             ## get the partial effects
# base-learner types include bbs (splines), bols (linear), btree (trees)…


> gam2 <- gamboost(DEXfat ~ ., baselearner = "bbs", data = bodyfat,
+                  control = boost_control(trace = TRUE))
[   1] risk: ...
[  53] Final risk: ...
> set.seed(123)       ## set seed to make results reproducible
> cvm <- cvrisk(gam2) ## default method is 25-fold bootstrap cross-validation

> cvm
  Cross-validated Squared Error (Regression)
  gamboost(formula = DEXfat ~ ., data = bodyfat, baselearner = "bbs",
      control = boost_control(trace = TRUE))

  Optimal number of boosting iterations: 33

> mstop(cvm)   ## extract the optimal mstop
[1] 33
> gam2[ mstop(cvm) ]   ## set the model automatically to the optimal mstop

	 Model-based Boosting

Call:
gamboost(formula = DEXfat ~ ., data = bodyfat, baselearner = "bbs",
    control = boost_control(trace = TRUE))

	 Squared Error (Regression)

Loss function: (y - f)^2

Number of boosting iterations: mstop = 33
Step size: 0.1
Offset: ...
Number of baselearners: 9

plot(cvm)

> names(coef(gam2))   ## displays the selected base-learners at the optimal mstop (33)
[1] "bbs(waistcirc, df = dfbase)"    "bbs(hipcirc, df = dfbase)"     "bbs(kneebreadth, df = dfbase)"
[4] "bbs(anthro3a, df = dfbase)"     "bbs(anthro3b, df = dfbase)"    "bbs(anthro3c, df = dfbase)"
[7] "bbs(anthro4, df = dfbase)"
> gam2[1000, return = FALSE]   # return = FALSE just suppresses "print(gam2)"
[  101] risk: ...
[  153] risk: ...
  ...
[  933] risk: ...
[  985] Final risk: ...

> names(coef(gam2))   ## displays the selected base-learners, now at iteration 1000
[1] "bbs(age, df = dfbase)"          "bbs(waistcirc, df = dfbase)"   "bbs(hipcirc, df = dfbase)"
[4] "bbs(elbowbreadth, df = dfbase)" "bbs(kneebreadth, df = dfbase)" "bbs(anthro3a, df = dfbase)"
[7] "bbs(anthro3b, df = dfbase)"     "bbs(anthro3c, df = dfbase)"    "bbs(anthro4, df = dfbase)"
> glm3 <- glmboost(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat,
+                  family = QuantReg(tau = 0.5), control = boost_control(mstop = 500))
> coef(glm3, off2int = TRUE)
(Intercept)     hipcirc kneebreadth    anthro3a
        ...         ...         ...         ...


Compare to rpart

> library(rpart)
> fattree <- rpart(DEXfat ~ ., data = bodyfat)
> plot(fattree)
> text(fattree)
> labels(fattree)
[1] "root"            "waistcirc =3.42"  "hipcirc =101.3"
[7] "waistcirc>=88.4" "hipcirc =109.9"


[Figure slides: "cars", "iris", "cars" – plots only, not preserved in the transcript]

Optimizing

Coefficients:
(Intercept)       speed
        ...         ...
attr(,"offset")
[1] ...

Call:
glmboost.formula(formula = dist ~ speed, data = cars,
    control = boost_control(mstop = 1000), family = Laplace())

Coefficients:
(Intercept)       speed
        ...         ...
attr(,"offset")
[1] ...
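The slide shows two sets of (lost) coefficients but only one Call. One plausible reconstruction — an assumption, not something stated in the transcript — is that a default squared-error glmboost fit is being compared with the Laplace (absolute-error) fit named in the Call:

cars.ls  <- glmboost(dist ~ speed, data = cars,
                     control = boost_control(mstop = 1000))
cars.lad <- glmboost(dist ~ speed, data = cars,
                     control = boost_control(mstop = 1000), family = Laplace())
coef(cars.ls)    # squared-error (L2) coefficients
coef(cars.lad)   # Laplace (L1) coefficients, as in the Call above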


Sparse matrix example

> coef(mod, which = which(beta > 0))
 V306  V1052  V1090  V3501  V4808  V5473  V7929  V8333  V8799  V...
  ...    ...    ...    ...    ...    ...    ...    ...    ...   ...
attr(,"offset")
[1] ...
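Neither mod nor beta is defined in the transcript; a minimal sketch of the kind of high-dimensional setup the variable names suggest (the sizes, seed and sparsity level here are all assumptions):

library(mboost)

set.seed(1907)
n <- 100; p <- 10000                  # far more predictors than observations
beta <- numeric(p)
beta[sample(p, 10)] <- 1              # only 10 truly non-zero coefficients
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("V", 1:p)))
y <- drop(X %*% beta) + rnorm(n)

mod <- glmboost(x = X, y = y, control = boost_control(mstop = 200))
coef(mod, which = which(beta > 0))    # were the truly relevant predictors picked up?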


Aside: Boosting and SVM…

Remember "margins" from the SVM – partitioning the "linear" or transformed space? In boosting we are effectively (though not explicitly) attempting to maximize the minimum margin over the training examples.

Variants on boosting – loss fn

cars.gb <- blackboost(dist ~ speed, data = cars,
                      control = boost_control(mstop = 50))
### plot fit
plot(dist ~ speed, data = cars)
lines(cars$speed, predict(cars.gb), col = "red")
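To actually vary the loss function that the slide title refers to, the same tree-based fit can be rerun with a different family and overlaid on the plot — a hedged sketch:

cars.gb.lad <- blackboost(dist ~ speed, data = cars, family = Laplace(),
                          control = boost_control(mstop = 50))
lines(cars$speed, predict(cars.gb.lad), col = "blue", lty = 2)   # absolute-error fit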

Blackboosting (cf. brown)

Gradient boosting for optimizing arbitrary loss functions, where regression trees are utilized as base-learners.

> cars.gb

	 Model-based Boosting

Call:
blackboost(formula = dist ~ speed, data = cars,
    control = boost_control(mstop = 50))

	 Squared Error (Regression)

Loss function: (y - f)^2

Number of boosting iterations: mstop = 50
Step size: 0.1
Offset: ...
Number of baselearners: 1