Peter Fox
Data Analytics – ITWS-4600/ITWS-6600
Week 12a, April 19, 2016
Cross-validation, Revisiting Regression – local models, and non-parametric…
coleman
> head(coleman)
  salaryP fatherWc sstatus teacherSc motherLev  Y
  …
Cross-validation package cvTools
> call <- call("lmrob", formula = Y ~ .)
> # set up folds for cross-validation
> folds <- cvFolds(nrow(coleman), K = 5, R = 10)
> # perform cross-validation
> cvTool(call, data = coleman, y = coleman$Y, cost = rtmspe,
+        folds = folds, costArgs = list(trim = 0.1))
       CV
 [1,]  …
 [2,]  …
  …
[10,]  …
Warning messages:
1: In lmrob.S(x, y, control = control) :
  S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
2: In lmrob.S(x, y, control = control) :
  S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
3: In lmrob.S(x, y, control = control) :
  find_scale() did not converge in 'maxit.scale' (= 200) iterations
4: In lmrob.S(x, y, control = control) :
  find_scale() did not converge in 'maxit.scale' (= 200) iterations
Evaluating?
> cvFits
5-fold CV results:
   Fit CV
1   LS  …
2   MM  …
3  LTS  …

Best model:
  CV
"MM"
LS, LTS, MM?
The breakdown value of an estimator is defined as the smallest fraction of contamination that can cause the estimator to take on values arbitrarily far from its value on the uncontaminated data. The breakdown value can therefore be used as a measure of an estimator's robustness. Rousseeuw and Leroy (1987) and others introduced high-breakdown-value estimators for linear regression.
LTS – see viewer.htm#statug_rreg_sect018.htm#statug.rreg.robustregfltsest
MM – see viewer.htm#statug_rreg_sect019.htm
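To make the breakdown idea concrete, here is a minimal sketch (not from the slides; the synthetic data and the single gross outlier are illustrative assumptions) contrasting ordinary least squares with the LTS and MM estimators from the robustbase package used on the next slides:

# Hedged sketch: effect of one gross outlier on LS vs. robust fits (robustbase assumed)
library(robustbase)

set.seed(42)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30, sd = 0.5)
y[30] <- 60                      # one grossly contaminated response

coef(lm(y ~ x))                  # LS: pulled toward the outlier
coef(ltsReg(y ~ x))              # LTS: high-breakdown fit
coef(lmrob(y ~ x))               # MM: high breakdown plus high efficiency

The LS slope shifts noticeably because of the single bad point, while the robust fits stay close to the uncontaminated relationship.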
50% and 75% subsets
# 50% subsets
fitLts50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5)
cvFitLts50 <- cvLts(fitLts50, cost = rtmspe, folds = folds,
                    fit = "both", trim = 0.1)
# 75% subsets
fitLts75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75)
cvFitLts75 <- cvLts(fitLts75, cost = rtmspe, folds = folds,
                    fit = "both", trim = 0.1)
# combine and plot results
cvFitsLts <- cvSelect("0.5" = cvFitLts50, "0.75" = cvFitLts75)
cvFitsLts (50/75)
> cvFitsLts
5-fold CV results:
   Fit reweighted raw
1  0.5          …   …
2 0.75          …   …

Best model:
reweighted        raw
    "0.75"     "0.75"
Tuning
tuning <- list(tuning.psi = c(3.14, 3.44, 3.88, 4.68))
# perform cross-validation
cvFitsLmrob <- cvTuning(fitLmrob$call, data = coleman,
                        y = coleman$Y, tuning = tuning, cost = rtmspe,
                        folds = folds, costArgs = list(trim = 0.1))
cvFitsLmrob
> cvFitsLmrob
5-fold CV results:
  tuning.psi CV
  …

Optimal tuning parameter:
   tuning.psi
CV          …
Lab on Friday
library(boot)   # cv.glm, glm.diag
library(MASS)   # mammals data
mammals.glm <- glm(log(brain) ~ log(body), data = mammals)
(cv.err <- cv.glm(mammals, mammals.glm)$delta)
[1] …
(cv.err.6 <- cv.glm(mammals, mammals.glm, K = 6)$delta)
[1] …
# As this is a linear model we could calculate the leave-one-out
# cross-validation estimate without any extra model-fitting.
muhat <- fitted(mammals.glm)
mammals.diag <- glm.diag(mammals.glm)
(cv.err <- mean((mammals.glm$y - muhat)^2 / (1 - mammals.diag$h)^2))
[1] …
Cost functions, etc.
# leave-one-out and 11-fold cross-validation prediction error for
# the nodal data set (in the boot package). Since the response is a
# binary variable, an appropriate cost function is
> cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
> nodal.glm <- glm(r ~ stage + xray + acid, binomial, data = nodal)
> (cv.err <- cv.glm(nodal, nodal.glm, cost, K = nrow(nodal))$delta)
[1] …
> (cv.11.err <- cv.glm(nodal, nodal.glm, cost, K = 11)$delta)
[1] …
cvTools
http://cran.r-project.org/web/packages/cvTools/cvTools.pdf
Very powerful and flexible package for CV (regression) but very much a black box!
If you use it, become very, very familiar with the outputs and be prepared to experiment…
Diamonds
require(ggplot2)   # or load the package first
data(diamonds)
head(diamonds)     # look at the data!
# ggplot(diamonds, aes(clarity, fill = cut)) + geom_bar()
ggplot(diamonds, aes(clarity)) + geom_bar() + facet_wrap(~ cut)
ggplot(diamonds) + geom_histogram(aes(x = price)) + geom_vline(xintercept = 12000)
ggplot(diamonds, aes(clarity)) + geom_freqpoly(aes(group = cut, colour = cut))
[Figure: frequency polygons of diamond clarity, grouped and coloured by cut]
ggplot(diamonds, aes(clarity)) + geom_freqpoly(aes(group = cut, colour = cut))
bodyfat
library(mboost)   # glmboost, gamboost
data("bodyfat", package = "TH.data")   # the bodyfat data ship with TH.data in current R
## regular linear model using three variables
lm1 <- lm(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat)
## Estimate same model by glmboost
glm1 <- glmboost(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat)
# We consider all available variables as potential predictors.
glm2 <- glmboost(DEXfat ~ ., data = bodyfat)
# or one could essentially call:
preds <- names(bodyfat[, names(bodyfat) != "DEXfat"])   ## names of predictors
fm <- as.formula(paste("DEXfat ~", paste(preds, collapse = "+")))   ## build formula
Compare linear models
> coef(lm1)
(Intercept)     hipcirc kneebreadth    anthro3a
          …           …           …           …
> coef(glm1, off2int = TRUE)   ## off2int adds the offset to the intercept
(Intercept)     hipcirc kneebreadth    anthro3a
          …           …           …           …
Conclusion?
> fm
DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth +
    anthro3a + anthro3b + anthro3c + anthro4
> coef(glm2, which = "")   ## select all
 (Intercept)          age    waistcirc      hipcirc elbowbreadth  kneebreadth
           …            …            …            …            …            …
    anthro3a     anthro3b     anthro3c      anthro4
           …            …            …            …
attr(,"offset")
[1] …
> gam2 <- gamboost(DEXfat ~ ., baselearner = "bbs", data = bodyfat,
+                  control = boost_control(trace = TRUE))
[   1] … risk: …
[  53] …
Final risk: …
> set.seed(123)       ## set seed to make results reproducible
> cvm <- cvrisk(gam2) ## default method is 25-fold bootstrap cross-validation
> cvm
Cross-validated Squared Error (Regression)
gamboost(formula = DEXfat ~ ., data = bodyfat, baselearner = "bbs",
    control = boost_control(trace = TRUE))

Optimal number of boosting iterations: 33
> mstop(cvm)          ## extract the optimal mstop
[1] 33
> gam2[mstop(cvm)]    ## set the model automatically to the optimal mstop

        Model-based Boosting

Call:
gamboost(formula = DEXfat ~ ., data = bodyfat, baselearner = "bbs",
    control = boost_control(trace = TRUE))

        Squared Error (Regression)

Loss function: (y - f)^2

Number of boosting iterations: mstop = 33
Step size: 0.1
Offset: …
Number of baselearners: 9
plot(cvm)
[Figure: cross-validated risk across boosting iterations, from plot(cvm)]
> names(coef(gam2))   ## displays the selected base-learners at the optimal mstop (33)
[1] "bbs(waistcirc, df = dfbase)"  "bbs(hipcirc, df = dfbase)"    "bbs(kneebreadth, df = dfbase)"
[4] "bbs(anthro3a, df = dfbase)"   "bbs(anthro3b, df = dfbase)"   "bbs(anthro3c, df = dfbase)"
[7] "bbs(anthro4, df = dfbase)"
> gam2[1000, return = FALSE]   # return = FALSE just suppresses "print(gam2)"
[  101] … risk: …
[  153] … risk: …
  …
[  985] … Final risk: …
> names(coef(gam2))   ## displays the selected base-learners, now at iteration 1000
[1] "bbs(age, df = dfbase)"          "bbs(waistcirc, df = dfbase)"   "bbs(hipcirc, df = dfbase)"
[4] "bbs(elbowbreadth, df = dfbase)" "bbs(kneebreadth, df = dfbase)" "bbs(anthro3a, df = dfbase)"
[7] "bbs(anthro3b, df = dfbase)"     "bbs(anthro3c, df = dfbase)"    "bbs(anthro4, df = dfbase)"
> glm3 <- glmboost(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat,
+                  family = QuantReg(tau = 0.5), control = boost_control(mstop = 500))
> coef(glm3, off2int = TRUE)
(Intercept)     hipcirc kneebreadth    anthro3a
          …           …           …           …
More local methods…
Why local?
Sparse?
Remember this one? How would you apply local methods here?
SVM-type
– One-class classification: this model tries to find the support of a distribution and thus allows for outlier/novelty detection.
– Epsilon-regression: here the data points lie between the two borders of the margin, which is maximized under suitable conditions to avoid outlier inclusion.
– Nu-regression: with analogous modifications of the regression model, as in the classification case.
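These formulations map onto the type argument of svm() in e1071. A minimal sketch, with toy data that are purely an assumption for illustration:

# Hedged sketch of the svm() 'type' argument in e1071
library(e1071)

set.seed(1)
x <- matrix(rnorm(100 * 2), ncol = 2)
y <- x[, 1]^2 + x[, 2] + rnorm(100, sd = 0.1)

## one-class classification: support of the distribution / novelty detection
oc <- svm(x, y = NULL, type = "one-classification", nu = 0.1)
table(predict(oc, x))          # TRUE = "inlier", FALSE = suspected outlier

## epsilon- and nu-regression on the same predictors
er <- svm(x, y, type = "eps-regression", epsilon = 0.1)
nr <- svm(x, y, type = "nu-regression", nu = 0.5)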
Reminder: SVM and margin
Loss functions…
[Figure: loss functions for classification, outlier detection, and regression]
Regression
By using a different loss function, the ε-insensitive loss ||y − f(x)||_ε = max{0, ||y − f(x)|| − ε}, SVMs can also perform regression. This loss function ignores errors that are smaller than a certain threshold ε > 0, thus creating a tube around the true output.
Example: lm vs. svm
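The slide's figure is not reproduced here; the following minimal sketch, assuming the e1071 package and the built-in cars data (not necessarily the data used on the slide), shows the kind of lm-versus-svm comparison it illustrates:

# Hedged sketch: compare a linear fit with an SVM regression fit
library(e1071)

fit.lm  <- lm(dist ~ speed, data = cars)
fit.svm <- svm(dist ~ speed, data = cars)   # numeric response -> eps-regression by default

ord <- order(cars$speed)
plot(cars, main = "lm vs. svm regression")
lines(cars$speed[ord], predict(fit.lm,  cars)[ord], col = "blue", lwd = 2)
lines(cars$speed[ord], predict(fit.svm, cars)[ord], col = "red",  lwd = 2)
legend("topleft", legend = c("lm", "svm"), col = c("blue", "red"), lwd = 2)

# in-sample RMSE, for a rough comparison only
sqrt(mean((cars$dist - predict(fit.lm,  cars))^2))
sqrt(mean((cars$dist - predict(fit.svm, cars))^2))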
Again, SVM in R
– The svm() function in e1071 provides a rigid interface to libsvm along with visualization and parameter tuning methods.
– kernlab features a variety of kernel-based methods and includes an SVM method based on the optimizers used in libsvm and bsvm.
– Package klaR includes an interface to SVMlight, a popular SVM implementation that additionally offers classification tools such as Regularized Discriminant Analysis.
– svmpath – you get the idea…
KNN is local – right?
K-nearest neighbors is a simple algorithm that stores all available cases and predicts the numerical target based on a similarity measure (e.g., distance functions). KNN has been used in statistical estimation and pattern recognition as a non-parametric technique since the beginning of the 1970s.
Distance…
A simple implementation of KNN regression calculates the average of the numerical target of the K nearest neighbors. Another approach uses an inverse-distance-weighted average of the K nearest neighbors. Choosing K! KNN regression uses the same distance functions as KNN classification.
See knn.reg and also the kknn package:
http://cran.r-project.org/web/packages/kknn/kknn.pdf
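A minimal base-R sketch of the two averaging schemes just described (a hand-rolled illustration, not the knn.reg or kknn implementations; the helper name knn_predict and the use of the cars data are assumptions):

# Hedged sketch: KNN regression with plain and inverse-distance-weighted averaging
knn_predict <- function(x_train, y_train, x_new, k = 5, weighted = FALSE) {
  sapply(x_new, function(x0) {
    d   <- abs(x_train - x0)          # 1-d distance to the query point
    idx <- order(d)[seq_len(k)]       # indices of the k nearest neighbours
    if (!weighted) {
      mean(y_train[idx])              # simple average of the k targets
    } else {
      w <- 1 / (d[idx] + 1e-8)        # inverse-distance weights
      sum(w * y_train[idx]) / sum(w)
    }
  })
}

# usage on the built-in cars data
grid <- seq(min(cars$speed), max(cars$speed), length.out = 50)
plot(cars)
lines(grid, knn_predict(cars$speed, cars$dist, grid, k = 5), col = "blue")
lines(grid, knn_predict(cars$speed, cars$dist, grid, k = 5, weighted = TRUE), col = "red")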
Classes of local regression
Locally (weighted) scatterplot smoothing
– LOESS
– LOWESS
Fitting is done locally: for the fit at a point x, points in a neighborhood of x are used, weighted by their distance from x (with differences in 'parametric' variables being ignored when computing the distance).
Classes of local regression
The size of the neighborhood is controlled by α (set by span). For α < 1, the neighborhood includes a proportion α of the points; for α > 1, all points are used, with the 'maximum distance' assumed to be α^(1/p) times the actual maximum distance for p explanatory variables.
Classes of local regression
For the default family, fitting is by (weighted) least squares. For family = "symmetric" a few iterations of an M-estimation procedure with Tukey's biweight are used. Be aware that as the initial value is the least-squares fit, this need not be a very resistant fit. It can be important to tune the control list to achieve acceptable speed.
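A minimal sketch, assuming the built-in cars data (an illustrative choice, not from the slides), of how span and family control a loess fit:

# Hedged sketch: loess with different spans and the robust 'symmetric' family
fit.small  <- loess(dist ~ speed, data = cars, span = 0.3)   # small neighbourhood
fit.large  <- loess(dist ~ speed, data = cars, span = 0.9)   # larger neighbourhood
fit.robust <- loess(dist ~ speed, data = cars, span = 0.9,
                    family = "symmetric")                    # M-estimation with Tukey's biweight

grid <- seq(min(cars$speed), max(cars$speed), length.out = 100)
plot(cars)
lines(grid, predict(fit.small,  data.frame(speed = grid)), col = "blue")
lines(grid, predict(fit.large,  data.frame(speed = grid)), col = "red")
lines(grid, predict(fit.robust, data.frame(speed = grid)), col = "darkgreen")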
Friedman
supsmu (in modreg, now part of stats) is a running-lines smoother which chooses between three spans for the lines. The running-lines smoothers are symmetric, with k/2 data points on each side of the predicted point, and values of k of 0.5 * n, 0.2 * n and 0.05 * n, where n is the number of data points. If span is specified, a single smoother with span span * n is used.
Friedman
The best of the three smoothers is chosen by cross-validation for each prediction. The best spans are then smoothed by a running-lines smoother and the final prediction chosen by linear interpolation. "For small samples (n < 40) or if there are substantial serial correlations between observations close in x-value, then a prespecified fixed span smoother (span > 0) should be used. Reasonable span values are 0.2 to 0.4."
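A minimal sketch of Friedman's super smoother on the built-in cars data (the data choice and the fixed span value are assumptions for illustration):

# Hedged sketch: Friedman's super smoother, supsmu (now in the stats package)
sm.auto  <- supsmu(cars$speed, cars$dist)              # spans chosen by cross-validation
sm.fixed <- supsmu(cars$speed, cars$dist, span = 0.3)  # a single prespecified span

plot(cars)
lines(sm.auto,  col = "blue")
lines(sm.fixed, col = "red")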
Local non-parametric
lplm (in Rearrangement) – a local nonparametric method: a local linear regression estimator with a box kernel (the default), for conditional mean functions.
Ridge regression
– Addresses ill-posed regression problems using filtering approaches (e.g., high-pass)
– Often called "regularization"
– lm.ridge (in MASS)
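A minimal sketch of lm.ridge, assuming the built-in longley data and a lambda grid taken from the MASS help example (neither is specified on the slide):

# Hedged sketch: ridge regression over a grid of penalties with MASS::lm.ridge
library(MASS)

fit.ridge <- lm.ridge(GNP.deflator ~ ., data = longley,
                      lambda = seq(0, 0.1, by = 0.001))
plot(fit.ridge)      # coefficient paths versus lambda
select(fit.ridge)    # HKB, L-W and GCV choices of lambda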
Quantile regression
– is desired if conditional quantile functions are of interest. One advantage of quantile regression, relative to ordinary least squares regression, is that the quantile regression estimates are more robust against outliers in the response measurements.
– In practice we often prefer using different measures of central tendency and statistical dispersion to obtain a more comprehensive analysis of the relationship between variables.
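A minimal sketch, assuming the quantreg package (not named on the slide, but the standard R implementation) and its engel data:

# Hedged sketch: conditional quantile functions with quantreg::rq
library(quantreg)
data(engel)

fit.q <- rq(foodexp ~ income, tau = c(0.25, 0.5, 0.75), data = engel)
coef(fit.q)          # one column of coefficients per requested quantile

fit.ls <- lm(foodexp ~ income, data = engel)
coef(fit.ls)         # compare with the ordinary least-squares fit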
More…
– Partial Least Squares Regression (PLSR) – mvr (in pls)
– Principal Component Regression (PCR)
– Canonical Powered Partial Least Squares (CPPLS)
PCR creates components to explain the observed variability in the predictor variables, without considering the response variable at all. On the other hand, PLSR does take the response variable into account, and therefore often leads to models that are able to fit the response variable with fewer components. Whether or not that ultimately translates into a better model, in terms of its practical use, depends on the context.
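A minimal sketch, assuming the pls package and its gasoline data (both assumptions, not from the slide), comparing PCR and PLSR with leave-one-out cross-validation; both functions are wrappers around mvr:

# Hedged sketch: PCR vs. PLSR with leave-one-out cross-validation (pls package)
library(pls)
data(gasoline)

fit.pcr  <- pcr(octane ~ NIR, ncomp = 10, data = gasoline, validation = "LOO")
fit.plsr <- plsr(octane ~ NIR, ncomp = 10, data = gasoline, validation = "LOO")

# cross-validated RMSEP per number of components;
# PLSR typically reaches a given error level with fewer components
RMSEP(fit.pcr)
RMSEP(fit.plsr)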
Splines
smooth.spline, splinefun (stats, formerly modreg) and ns (in splines) – a spline is a numeric function that is piecewise-defined by polynomial functions, and which possesses a sufficiently high degree of smoothness at the places where the polynomial pieces connect (known as knots).
Splines
For interpolation, splines are often preferred to polynomial interpolation – they yield similar results to interpolating with higher-degree polynomials while avoiding instability due to overfitting.
Features: simplicity of construction, ease and accuracy of evaluation, and capacity to approximate complex shapes.
Most common: the cubic spline, i.e., of order 3 – in particular, the cubic B-spline.
cars
[Figure: the cars data (speed vs. stopping distance)]
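The slide presumably shows a fit to the cars data; a minimal sketch (an assumption, not the slide's actual code) using a smoothing spline and a natural cubic spline basis inside lm:

# Hedged sketch: smoothing spline and a natural cubic B-spline basis on cars
library(splines)

fit.ss <- smooth.spline(cars$speed, cars$dist)       # smoothing parameter chosen by GCV
fit.ns <- lm(dist ~ ns(speed, df = 4), data = cars)  # natural cubic spline basis with 4 df

plot(cars)
lines(fit.ss, col = "blue")
grid <- seq(min(cars$speed), max(cars$speed), length.out = 100)
lines(grid, predict(fit.ns, data.frame(speed = grid)), col = "red")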
Smoothing / local …
https://web.njit.edu/all_topics/Prog_Lang_Docs/html/library/modreg/html/00Index.html
http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf
Lab on Friday…
And reminder – Assignment 7 due on Apr. 29 – Friday 5pm
Next week – mixed models! i.e. optimizing…
Open lab on Fri. 29th (no new work)…