Peter Fox
Data Analytics – ITWS-4600/ITWS-6600
Week 12a, April 19, 2016
Cross-validation, Revisiting Regression – local models, and non-parametric…
coleman
> head(coleman)
  salaryP fatherWc sstatus teacherSc motherLev  Y
  …
Cross-validation package cvTools
> call <- call("lmrob", formula = Y ~ .)
> # set up folds for cross-validation
> folds <- cvFolds(nrow(coleman), K = 5, R = 10)
> # perform cross-validation
> cvTool(call, data = coleman, y = coleman$Y, cost = rtmspe,
+        folds = folds, costArgs = list(trim = 0.1))
       CV
 [1,]  …
 [2,]  …
  …
[10,]  …
Warning messages:
1: In lmrob.S(x, y, control = control) :
  S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
2: In lmrob.S(x, y, control = control) :
  S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
3: In lmrob.S(x, y, control = control) :
  find_scale() did not converge in 'maxit.scale' (= 200) iterations
4: In lmrob.S(x, y, control = control) :
  find_scale() did not converge in 'maxit.scale' (= 200) iterations
Evaluating?
> cvFits
5-fold CV results:
   Fit CV
1   LS  …
2   MM  …
3  LTS  …

Best model:
  CV
"MM"
LS, LTS, MM?
The breakdown value of an estimator is defined as the smallest fraction of contamination that can cause the estimator to take on values arbitrarily far from its value on the uncontaminated data. The breakdown value can therefore be used as a measure of an estimator's robustness. Rousseeuw and Leroy (1987) and others introduced high-breakdown-value estimators for linear regression.
LTS – see viewer.htm#statug_rreg_sect018.htm#statug.rreg.robustregfltsest
MM – see viewer.htm#statug_rreg_sect019.htm
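To make the breakdown idea concrete, here is a minimal sketch (not from the slides; the synthetic data and the single gross outlier are illustrative assumptions) contrasting ordinary least squares with the LTS and MM estimators from the robustbase package used on the next slides:

# Hedged sketch: effect of one gross outlier on LS vs. robust fits (robustbase assumed)
library(robustbase)

set.seed(42)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30, sd = 0.5)
y[30] <- 60                      # one grossly contaminated response

coef(lm(y ~ x))                  # LS: pulled toward the outlier
coef(ltsReg(y ~ x))              # LTS: high-breakdown fit
coef(lmrob(y ~ x))               # MM: high breakdown plus high efficiency

The LS slope shifts noticeably because of the single bad point, while the robust fits stay close to the uncontaminated relationship.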
50% and 75% subsets
# 50% subsets
fitLts50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5)
cvFitLts50 <- cvLts(fitLts50, cost = rtmspe, folds = folds,
                    fit = "both", trim = 0.1)
# 75% subsets
fitLts75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75)
cvFitLts75 <- cvLts(fitLts75, cost = rtmspe, folds = folds,
                    fit = "both", trim = 0.1)
# combine and plot results
cvFitsLts <- cvSelect("0.5" = cvFitLts50, "0.75" = cvFitLts75)
cvFitsLts (50/75)
> cvFitsLts
5-fold CV results:
   Fit reweighted raw
1  0.5          …   …
2 0.75          …   …

Best model:
reweighted        raw
    "0.75"     "0.75"
Tuning
tuning <- list(tuning.psi = c(3.14, 3.44, 3.88, 4.68))
# perform cross-validation
cvFitsLmrob <- cvTuning(fitLmrob$call, data = coleman,
                        y = coleman$Y, tuning = tuning, cost = rtmspe,
                        folds = folds, costArgs = list(trim = 0.1))
cvFitsLmrob
> cvFitsLmrob
5-fold CV results:
  tuning.psi CV
  …

Optimal tuning parameter:
   tuning.psi
CV          …
Lab on Friday
library(boot)   # cv.glm, glm.diag
library(MASS)   # mammals data
mammals.glm <- glm(log(brain) ~ log(body), data = mammals)
(cv.err <- cv.glm(mammals, mammals.glm)$delta)
[1] …
(cv.err.6 <- cv.glm(mammals, mammals.glm, K = 6)$delta)
[1] …
# As this is a linear model we could calculate the leave-one-out
# cross-validation estimate without any extra model-fitting.
muhat <- fitted(mammals.glm)
mammals.diag <- glm.diag(mammals.glm)
(cv.err <- mean((mammals.glm$y - muhat)^2 / (1 - mammals.diag$h)^2))
[1] …
Cost functions, etc.
# leave-one-out and 11-fold cross-validation prediction error for
# the nodal data set (in the boot package). Since the response is a
# binary variable, an appropriate cost function is
> cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
> nodal.glm <- glm(r ~ stage + xray + acid, binomial, data = nodal)
> (cv.err <- cv.glm(nodal, nodal.glm, cost, K = nrow(nodal))$delta)
[1] …
> (cv.11.err <- cv.glm(nodal, nodal.glm, cost, K = 11)$delta)
[1] …
cvTools
http://cran.r-project.org/web/packages/cvTools/cvTools.pdf
Very powerful and flexible package for CV (regression) but very much a black box!
If you use it, become very, very familiar with the outputs and be prepared to experiment…
Diamonds
require(ggplot2)   # or load the package first
data(diamonds)
head(diamonds)     # look at the data!
# ggplot(diamonds, aes(clarity, fill = cut)) + geom_bar()
ggplot(diamonds, aes(clarity)) + geom_bar() + facet_wrap(~ cut)
ggplot(diamonds) + geom_histogram(aes(x = price)) + geom_vline(xintercept = 12000)
ggplot(diamonds, aes(clarity)) + geom_freqpoly(aes(group = cut, colour = cut))
[Figure: frequency polygons of diamond clarity, grouped and coloured by cut]
ggplot(diamonds, aes(clarity)) + geom_freqpoly(aes(group = cut, colour = cut))
bodyfat
library(mboost)   # glmboost, gamboost
data("bodyfat", package = "TH.data")   # the bodyfat data ship with TH.data in current R
## regular linear model using three variables
lm1 <- lm(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat)
## Estimate same model by glmboost
glm1 <- glmboost(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat)
# We consider all available variables as potential predictors.
glm2 <- glmboost(DEXfat ~ ., data = bodyfat)
# or one could essentially call:
preds <- names(bodyfat[, names(bodyfat) != "DEXfat"])   ## names of predictors
fm <- as.formula(paste("DEXfat ~", paste(preds, collapse = "+")))   ## build formula
Compare linear models
> coef(lm1)
(Intercept)     hipcirc kneebreadth    anthro3a
          …           …           …           …
> coef(glm1, off2int = TRUE)   ## off2int adds the offset to the intercept
(Intercept)     hipcirc kneebreadth    anthro3a
          …           …           …           …
Conclusion?
> fm
DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth +
    anthro3a + anthro3b + anthro3c + anthro4
> coef(glm2, which = "")   ## select all
 (Intercept)          age    waistcirc      hipcirc elbowbreadth  kneebreadth
           …            …            …            …            …            …
    anthro3a     anthro3b     anthro3c      anthro4
           …            …            …            …
attr(,"offset")
[1] …
> gam2 <- gamboost(DEXfat ~ ., baselearner = "bbs", data = bodyfat,
+                  control = boost_control(trace = TRUE))
[   1] … risk: …
[  53] …
Final risk: …
> set.seed(123)       ## set seed to make results reproducible
> cvm <- cvrisk(gam2) ## default method is 25-fold bootstrap cross-validation
> cvm
Cross-validated Squared Error (Regression)
gamboost(formula = DEXfat ~ ., data = bodyfat, baselearner = "bbs",
    control = boost_control(trace = TRUE))

Optimal number of boosting iterations: 33
> mstop(cvm)          ## extract the optimal mstop
[1] 33
> gam2[mstop(cvm)]    ## set the model automatically to the optimal mstop

        Model-based Boosting

Call:
gamboost(formula = DEXfat ~ ., data = bodyfat, baselearner = "bbs",
    control = boost_control(trace = TRUE))

        Squared Error (Regression)

Loss function: (y - f)^2

Number of boosting iterations: mstop = 33
Step size: 0.1
Offset: …
Number of baselearners: 9
plot(cvm)
[Figure: cross-validated risk across boosting iterations, from plot(cvm)]
> names(coef(gam2))   ## displays the selected base-learners at the optimal mstop (33)
[1] "bbs(waistcirc, df = dfbase)"  "bbs(hipcirc, df = dfbase)"    "bbs(kneebreadth, df = dfbase)"
[4] "bbs(anthro3a, df = dfbase)"   "bbs(anthro3b, df = dfbase)"   "bbs(anthro3c, df = dfbase)"
[7] "bbs(anthro4, df = dfbase)"
> gam2[1000, return = FALSE]   # return = FALSE just suppresses "print(gam2)"
[  101] … risk: …
[  153] … risk: …
  …
[  985] … Final risk: …
> names(coef(gam2))   ## displays the selected base-learners, now at iteration 1000
[1] "bbs(age, df = dfbase)"          "bbs(waistcirc, df = dfbase)"   "bbs(hipcirc, df = dfbase)"
[4] "bbs(elbowbreadth, df = dfbase)" "bbs(kneebreadth, df = dfbase)" "bbs(anthro3a, df = dfbase)"
[7] "bbs(anthro3b, df = dfbase)"     "bbs(anthro3c, df = dfbase)"    "bbs(anthro4, df = dfbase)"
> glm3 <- glmboost(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat,
+                  family = QuantReg(tau = 0.5), control = boost_control(mstop = 500))
> coef(glm3, off2int = TRUE)
(Intercept)     hipcirc kneebreadth    anthro3a
          …           …           …           …
More local methods…
Why local?
Sparse?
Remember this one? How would you apply local methods here?
SVM-type
– One-class classification: this model tries to find the support of a distribution and thus allows for outlier/novelty detection.
– Epsilon-regression: here the data points lie between the two borders of the margin, which is maximized under suitable conditions to avoid outlier inclusion.
– Nu-regression: with analogous modifications of the regression model, as in the classification case.
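These formulations map onto the type argument of svm() in e1071. A minimal sketch, with toy data that are purely an assumption for illustration:

# Hedged sketch of the svm() 'type' argument in e1071
library(e1071)

set.seed(1)
x <- matrix(rnorm(100 * 2), ncol = 2)
y <- x[, 1]^2 + x[, 2] + rnorm(100, sd = 0.1)

## one-class classification: support of the distribution / novelty detection
oc <- svm(x, y = NULL, type = "one-classification", nu = 0.1)
table(predict(oc, x))          # TRUE = "inlier", FALSE = suspected outlier

## epsilon- and nu-regression on the same predictors
er <- svm(x, y, type = "eps-regression", epsilon = 0.1)
nr <- svm(x, y, type = "nu-regression", nu = 0.5)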
Reminder: SVM and margin
Loss functions…
[Figure: loss functions for classification, outlier detection, and regression]
Regression
By using a different loss function, the ε-insensitive loss ||y − f(x)||_ε = max{0, ||y − f(x)|| − ε}, SVMs can also perform regression. This loss function ignores errors that are smaller than a certain threshold ε > 0, thus creating a tube around the true output.
Example: lm vs. svm
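The slide's figure is not reproduced here; the following minimal sketch, assuming the e1071 package and the built-in cars data (not necessarily the data used on the slide), shows the kind of lm-versus-svm comparison it illustrates:

# Hedged sketch: compare a linear fit with an SVM regression fit
library(e1071)

fit.lm  <- lm(dist ~ speed, data = cars)
fit.svm <- svm(dist ~ speed, data = cars)   # numeric response -> eps-regression by default

ord <- order(cars$speed)
plot(cars, main = "lm vs. svm regression")
lines(cars$speed[ord], predict(fit.lm,  cars)[ord], col = "blue", lwd = 2)
lines(cars$speed[ord], predict(fit.svm, cars)[ord], col = "red",  lwd = 2)
legend("topleft", legend = c("lm", "svm"), col = c("blue", "red"), lwd = 2)

# in-sample RMSE, for a rough comparison only
sqrt(mean((cars$dist - predict(fit.lm,  cars))^2))
sqrt(mean((cars$dist - predict(fit.svm, cars))^2))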
Again, SVM in R
– The svm() function in e1071 provides a rigid interface to libsvm along with visualization and parameter tuning methods.
– kernlab features a variety of kernel-based methods and includes an SVM method based on the optimizers used in libsvm and bsvm.
– Package klaR includes an interface to SVMlight, a popular SVM implementation that additionally offers classification tools such as Regularized Discriminant Analysis.
– svmpath – you get the idea…
KNN is local – right?
K-nearest neighbors is a simple algorithm that stores all available cases and predicts the numerical target based on a similarity measure (e.g., distance functions). KNN has been used in statistical estimation and pattern recognition as a non-parametric technique since the beginning of the 1970s.
Distance…
A simple implementation of KNN regression calculates the average of the numerical target of the K nearest neighbors. Another approach uses an inverse-distance-weighted average of the K nearest neighbors. Choosing K! KNN regression uses the same distance functions as KNN classification.
See knn.reg and also the kknn package:
http://cran.r-project.org/web/packages/kknn/kknn.pdf
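A minimal base-R sketch of the two averaging schemes just described (a hand-rolled illustration, not the knn.reg or kknn implementations; the helper name knn_predict and the use of the cars data are assumptions):

# Hedged sketch: KNN regression with plain and inverse-distance-weighted averaging
knn_predict <- function(x_train, y_train, x_new, k = 5, weighted = FALSE) {
  sapply(x_new, function(x0) {
    d   <- abs(x_train - x0)          # 1-d distance to the query point
    idx <- order(d)[seq_len(k)]       # indices of the k nearest neighbours
    if (!weighted) {
      mean(y_train[idx])              # simple average of the k targets
    } else {
      w <- 1 / (d[idx] + 1e-8)        # inverse-distance weights
      sum(w * y_train[idx]) / sum(w)
    }
  })
}

# usage on the built-in cars data
grid <- seq(min(cars$speed), max(cars$speed), length.out = 50)
plot(cars)
lines(grid, knn_predict(cars$speed, cars$dist, grid, k = 5), col = "blue")
lines(grid, knn_predict(cars$speed, cars$dist, grid, k = 5, weighted = TRUE), col = "red")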
Classes of local regression
Locally (weighted) scatterplot smoothing
– LOESS
– LOWESS
Fitting is done locally: for the fit at a point x, points in a neighborhood of x are used, weighted by their distance from x (with differences in 'parametric' variables being ignored when computing the distance).
Classes of local regression
The size of the neighborhood is controlled by α (set by span). For α < 1, the neighborhood includes a proportion α of the points; for α > 1, all points are used, with the 'maximum distance' assumed to be α^(1/p) times the actual maximum distance for p explanatory variables.
Classes of local regression
For the default family, fitting is by (weighted) least squares. For family = "symmetric" a few iterations of an M-estimation procedure with Tukey's biweight are used. Be aware that as the initial value is the least-squares fit, this need not be a very resistant fit. It can be important to tune the control list to achieve acceptable speed.
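A minimal sketch, assuming the built-in cars data (an illustrative choice, not from the slides), of how span and family control a loess fit:

# Hedged sketch: loess with different spans and the robust 'symmetric' family
fit.small  <- loess(dist ~ speed, data = cars, span = 0.3)   # small neighbourhood
fit.large  <- loess(dist ~ speed, data = cars, span = 0.9)   # larger neighbourhood
fit.robust <- loess(dist ~ speed, data = cars, span = 0.9,
                    family = "symmetric")                    # M-estimation with Tukey's biweight

grid <- seq(min(cars$speed), max(cars$speed), length.out = 100)
plot(cars)
lines(grid, predict(fit.small,  data.frame(speed = grid)), col = "blue")
lines(grid, predict(fit.large,  data.frame(speed = grid)), col = "red")
lines(grid, predict(fit.robust, data.frame(speed = grid)), col = "darkgreen")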
Friedman
supsmu (in modreg, now part of stats) is a running-lines smoother which chooses between three spans for the lines. The running-lines smoothers are symmetric, with k/2 data points on each side of the predicted point, and values of k of 0.5 * n, 0.2 * n and 0.05 * n, where n is the number of data points. If span is specified, a single smoother with span span * n is used.
Friedman
The best of the three smoothers is chosen by cross-validation for each prediction. The best spans are then smoothed by a running-lines smoother and the final prediction chosen by linear interpolation. "For small samples (n < 40) or if there are substantial serial correlations between observations close in x-value, then a prespecified fixed span smoother (span > 0) should be used. Reasonable span values are 0.2 to 0.4."
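A minimal sketch of Friedman's super smoother on the built-in cars data (the data choice and the fixed span value are assumptions for illustration):

# Hedged sketch: Friedman's super smoother, supsmu (now in the stats package)
sm.auto  <- supsmu(cars$speed, cars$dist)              # spans chosen by cross-validation
sm.fixed <- supsmu(cars$speed, cars$dist, span = 0.3)  # a single prespecified span

plot(cars)
lines(sm.auto,  col = "blue")
lines(sm.fixed, col = "red")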
Local non-parametric
lplm (in Rearrangement) – a local nonparametric method: a local linear regression estimator with a box kernel (the default), for conditional mean functions.
Ridge regression
– Addresses ill-posed regression problems using filtering approaches (e.g., high-pass)
– Often called "regularization"
– lm.ridge (in MASS)
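A minimal sketch of lm.ridge, assuming the built-in longley data and a lambda grid taken from the MASS help example (neither is specified on the slide):

# Hedged sketch: ridge regression over a grid of penalties with MASS::lm.ridge
library(MASS)

fit.ridge <- lm.ridge(GNP.deflator ~ ., data = longley,
                      lambda = seq(0, 0.1, by = 0.001))
plot(fit.ridge)      # coefficient paths versus lambda
select(fit.ridge)    # HKB, L-W and GCV choices of lambda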
Quantile regression
– is desired if conditional quantile functions are of interest. One advantage of quantile regression, relative to ordinary least squares regression, is that the quantile regression estimates are more robust against outliers in the response measurements.
– In practice we often prefer using different measures of central tendency and statistical dispersion to obtain a more comprehensive analysis of the relationship between variables.
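A minimal sketch, assuming the quantreg package (not named on the slide, but the standard R implementation) and its engel data:

# Hedged sketch: conditional quantile functions with quantreg::rq
library(quantreg)
data(engel)

fit.q <- rq(foodexp ~ income, tau = c(0.25, 0.5, 0.75), data = engel)
coef(fit.q)          # one column of coefficients per requested quantile

fit.ls <- lm(foodexp ~ income, data = engel)
coef(fit.ls)         # compare with the ordinary least-squares fit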
More…
– Partial Least Squares Regression (PLSR) – mvr (in pls)
– Principal Component Regression (PCR)
– Canonical Powered Partial Least Squares (CPPLS)
PCR creates components to explain the observed variability in the predictor variables, without considering the response variable at all. On the other hand, PLSR does take the response variable into account, and therefore often leads to models that are able to fit the response variable with fewer components. Whether or not that ultimately translates into a better model, in terms of its practical use, depends on the context.
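A minimal sketch, assuming the pls package and its gasoline data (both assumptions, not from the slide), comparing PCR and PLSR with leave-one-out cross-validation; both functions are wrappers around mvr:

# Hedged sketch: PCR vs. PLSR with leave-one-out cross-validation (pls package)
library(pls)
data(gasoline)

fit.pcr  <- pcr(octane ~ NIR, ncomp = 10, data = gasoline, validation = "LOO")
fit.plsr <- plsr(octane ~ NIR, ncomp = 10, data = gasoline, validation = "LOO")

# cross-validated RMSEP per number of components;
# PLSR typically reaches a given error level with fewer components
RMSEP(fit.pcr)
RMSEP(fit.plsr)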
Splines
smooth.spline, splinefun (stats, formerly modreg) and ns (in splines) – a spline is a numeric function that is piecewise-defined by polynomial functions, and which possesses a sufficiently high degree of smoothness at the places where the polynomial pieces connect (known as knots).
Splines
For interpolation, splines are often preferred to polynomial interpolation – they yield similar results to interpolating with higher-degree polynomials while avoiding instability due to overfitting.
Features: simplicity of construction, ease and accuracy of evaluation, and capacity to approximate complex shapes.
Most common: the cubic spline, i.e., of order 3 – in particular, the cubic B-spline.
cars
[Figure: the cars data (speed vs. stopping distance)]
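The slide presumably shows a fit to the cars data; a minimal sketch (an assumption, not the slide's actual code) using a smoothing spline and a natural cubic spline basis inside lm:

# Hedged sketch: smoothing spline and a natural cubic B-spline basis on cars
library(splines)

fit.ss <- smooth.spline(cars$speed, cars$dist)       # smoothing parameter chosen by GCV
fit.ns <- lm(dist ~ ns(speed, df = 4), data = cars)  # natural cubic spline basis with 4 df

plot(cars)
lines(fit.ss, col = "blue")
grid <- seq(min(cars$speed), max(cars$speed), length.out = 100)
lines(grid, predict(fit.ns, data.frame(speed = grid)), col = "red")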
Smoothing / local …
https://web.njit.edu/all_topics/Prog_Lang_Docs/html/library/modreg/html/00Index.html
http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf
Lab on Friday…
And reminder – Assignment 7 due on Apr. 29 – Friday 5pm
Next week – mixed models! i.e. optimizing…
Open lab on Fri. 29th (no new work)…