Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 13a, April 22, 2014
Boosting, dimension reduction and a preview of the return to Big Data
Weak models …
A weak learner: a classifier that is only slightly correlated with the true classification (it labels examples better than random guessing).
A strong learner: a classifier that is arbitrarily well correlated with the true classification.
Can a set of weak learners create a single strong learner?
Boosting … reducing bias in supervised learning
Most boosting algorithms iteratively learn weak classifiers with respect to a distribution over the training examples and add them to a final strong classifier.
–Each weak classifier is typically weighted in a way that is related to its accuracy.
After a weak learner is added, the data are reweighted: misclassified examples gain weight and correctly classified examples lose weight.
Thus, future weak learners focus more on the examples that previous weak learners misclassified.
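The reweighting loop described above can be sketched in a few lines of base R. This is a minimal AdaBoost-style illustration with decision stumps, not the algorithm used by any particular package; the function names (`fit_stump`, `predict_stump`, `adaboost`, `predict_adaboost`), the choice of 20 rounds, and the use of two iris species as a toy problem are all assumptions made for the example.

```r
# Minimal AdaBoost sketch with decision stumps (base R only).
# Labels y must be coded -1 / +1. All function names are illustrative.

# Weak learner: best single-variable threshold rule under example weights w.
fit_stump <- function(X, y, w) {
  best <- list(err = Inf)
  for (j in seq_len(ncol(X))) {
    for (thr in unique(X[, j])) {
      for (sgn in c(-1, 1)) {
        pred <- ifelse(X[, j] <= thr, sgn, -sgn)
        err <- sum(w[pred != y])               # weighted misclassification
        if (err < best$err) best <- list(err = err, j = j, thr = thr, sgn = sgn)
      }
    }
  }
  best
}

predict_stump <- function(s, X) ifelse(X[, s$j] <= s$thr, s$sgn, -s$sgn)

adaboost <- function(X, y, rounds = 20) {
  n <- nrow(X)
  w <- rep(1 / n, n)                           # start with uniform weights
  stumps <- vector("list", rounds)
  alpha <- numeric(rounds)
  for (m in seq_len(rounds)) {
    s <- fit_stump(X, y, w)
    err <- min(max(s$err, 1e-10), 1 - 1e-10)   # guard against log(0)
    alpha[m] <- 0.5 * log((1 - err) / err)     # vote weight tied to accuracy
    pred <- predict_stump(s, X)
    w <- w * exp(-alpha[m] * y * pred)         # misclassified gain weight,
    w <- w / sum(w)                            # correct ones lose weight
    stumps[[m]] <- s
  }
  list(stumps = stumps, alpha = alpha)
}

# Strong classifier: sign of the weighted vote of all stumps.
predict_adaboost <- function(model, X) {
  score <- rep(0, nrow(X))
  for (m in seq_along(model$alpha)) {
    score <- score + model$alpha[m] * predict_stump(model$stumps[[m]], X)
  }
  ifelse(score >= 0, 1, -1)
}

# Toy usage: versicolor (+1) versus virginica (-1)
d <- iris[iris$Species != "setosa", ]
X <- as.matrix(d[, 1:4])
y <- ifelse(d$Species == "versicolor", 1, -1)
model <- adaboost(X, y, rounds = 20)
first_stump_acc <- mean(predict_stump(model$stumps[[1]], X) == y)
train_acc <- mean(predict_adaboost(model, X) == y)
```

Comparing `first_stump_acc` with `train_acc` shows how far the weighted vote moves beyond a single weak learner on the training data; it says nothing about test error, which would need held-out data.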
Diamonds
require(ggplot2) # or load package first
data(diamonds)
head(diamonds) # look at the data!
#
ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()
ggplot(diamonds, aes(clarity)) + geom_bar() + facet_wrap(~ cut)
ggplot(diamonds) + geom_histogram(aes(x=price)) + geom_vline(xintercept=12000)
ggplot(diamonds, aes(clarity)) + geom_freqpoly(aes(group = cut, colour = cut))
Using diamonds… boost (glm)
> library(mboost)
> # Expensive is not a column of ggplot2's diamonds; it has to be created before this call
> mglmboost <- glmboost(as.factor(Expensive) ~ ., data=diamonds, family=Binomial(link="logit"))
> summary(mglmboost)
Generalized Linear Models Fitted via Gradient Boosting
Call:
glmboost.formula(formula = as.factor(Expensive) ~ ., data = diamonds, family = Binomial(link = "logit"))
Negative Binomial Likelihood
Loss function: {
f <- pmin(abs(f), 36) * sign(f)
p <- exp(f)/(exp(f) + exp(-f))
y <- (y + 1)/2
-y * log(p) - (1 - y) * log(1 - p)
}
Using diamonds… boost (glm)
> summary(mglmboost) #continued
Number of boosting iterations: mstop = 100
Step size: 0.1
Offset:
Coefficients:
NOTE: Coefficients from a Binomial model are half the size of coefficients from a model fitted via glm(..., family = 'binomial'). See Warning section in ?coef.mboost
(Intercept) carat clarity.L
attr(,"offset")
[1]
Selection frequencies:
carat (Intercept) clarity.L
bodyfat
The response variable is the body fat measured by DXA (DEXfat), which can be seen as the gold standard for measuring body fat. However, DXA measurements are too expensive and complicated for broad use. Anthropometric measurements such as waist or hip circumferences are, in comparison, very easy to obtain in a standard screening. A prediction formula based only on these measures could therefore be a valuable alternative with high clinical relevance for daily usage.
bodyfat
library(mboost)
data("bodyfat", package = "TH.data") # in current mboost versions the data set lives in TH.data
## regular linear model using three variables
lm1 <- lm(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat)
## Estimate same model by glmboost
glm1 <- glmboost(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat)
# We consider all available variables as potential predictors.
glm2 <- glmboost(DEXfat ~ ., data = bodyfat)
# or one could essentially call:
preds <- names(bodyfat[, names(bodyfat) != "DEXfat"]) ## names of predictors
fm <- as.formula(paste("DEXfat ~", paste(preds, collapse = "+"))) ## build formula
Compare linear models
> coef(lm1)
(Intercept) hipcirc kneebreadth anthro3a
> coef(glm1, off2int=TRUE) ## off2int adds the offset to the intercept
(Intercept) hipcirc kneebreadth anthro3a
Conclusion?
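To see why boosted and ordinary least-squares coefficients can be compared at all, it may help to write component-wise L2 boosting by hand. The sketch below is an illustration in base R under stated assumptions, not mboost's implementation: the function name `l2boost` is hypothetical, predictors are centered, the offset is the mean response, the step size is 0.1, and `mtcars` stands in for `bodyfat` so that no extra package is needed.

```r
# Component-wise L2 boosting by hand (base R); names are illustrative.
l2boost <- function(X, y, mstop = 100, nu = 0.1) {
  Xc <- scale(X, center = TRUE, scale = FALSE)    # centered predictors
  offset <- mean(y)                               # starting fit
  beta <- setNames(numeric(ncol(Xc)), colnames(Xc))
  fit <- rep(offset, length(y))
  for (m in seq_len(mstop)) {
    u <- y - fit                                  # residuals = negative gradient
    b <- drop(crossprod(Xc, u)) / colSums(Xc^2)   # LS slope for each predictor alone
    rss <- sapply(seq_along(b), function(j) sum((u - b[j] * Xc[, j])^2))
    j <- which.min(rss)                           # best-fitting single predictor
    beta[j] <- beta[j] + nu * b[j]                # small step on that coefficient only
    fit <- fit + nu * b[j] * Xc[, j]
  }
  # fold the offset and the centering back into an intercept
  intercept <- offset - sum(beta * colMeans(X))
  c("(Intercept)" = intercept, beta)
}

X <- as.matrix(mtcars[, c("wt", "hp", "disp")])
y <- mtcars$mpg
coef_lm <- coef(lm(mpg ~ wt + hp + disp, data = mtcars))
coef_early <- l2boost(X, y, mstop = 100)     # stopped early
coef_late <- l2boost(X, y, mstop = 20000)    # run until (numerically) converged
round(rbind(lm = coef_lm, boost_100 = coef_early, boost_20000 = coef_late), 4)
```

In this toy setting the long run reproduces the `lm` coefficients, while the early-stopped run can differ from them. That is one way to read the comparison on the slide: the number of iterations acts as a regularization parameter.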
> fm
DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth + anthro3a + anthro3b + anthro3c + anthro4
> coef(glm2, which = "") ## select all.
(Intercept) age waistcirc hipcirc elbowbreadth kneebreadth anthro3a anthro3b anthro3c anthro4
attr(,"offset")
[1]
plot(glm2, off2int = TRUE)
plot(glm2, ylim = range(coef(glm2, which = preds)))
> summary(bodyfat)
      age            DEXfat        waistcirc        hipcirc       elbowbreadth    kneebreadth       anthro3a
Min.   :19.00   Min.   :11.21   Min.   :        Min.   :        Min.   :5.200   Min.   :        Min.   :
1st Qu.:        1st Qu.:        1st Qu.:        1st Qu.:        1st Qu.:        1st Qu.:        1st Qu.:3.540
Median :56.00   Median :29.63   Median :        Median :        Median :6.500   Median :        Median :3.970
Mean   :50.86   Mean   :30.78   Mean   :        Mean   :        Mean   :6.508   Mean   :        Mean   :
3rd Qu.:        3rd Qu.:        3rd Qu.:        3rd Qu.:        3rd Qu.:        3rd Qu.:        3rd Qu.:4.155
Max.   :67.00   Max.   :62.02   Max.   :        Max.   :        Max.   :7.400   Max.   :        Max.   :4.680
    anthro3b        anthro3c        anthro4
Min.   :2.580   Min.   :2.050   Min.   :
1st Qu.:        1st Qu.:        1st Qu.:5.040
Median :4.390   Median :3.990   Median :5.530
Mean   :4.291   Mean   :3.886   Mean   :
3rd Qu.:        3rd Qu.:        3rd Qu.:5.840
Max.   :5.010   Max.   :4.620   Max.   :
Other forms of boosting
Gamboost = Generalized Additive Model: gradient boosting for optimizing arbitrary loss functions, where component-wise smoothing procedures are used as (univariate) base-learners.
> gam1 <- gamboost(DEXfat ~ bbs(hipcirc) + bbs(kneebreadth) + bbs(anthro3a), data = bodyfat)
> # Using plot() on a gamboost object automatically delivers the partial effects of the different base-learners:
> par(mfrow = c(1,3)) ## 3 plots in one device
> plot(gam1) ## get the partial effects
# bbs, bols, btree..
> gam2 <- gamboost(DEXfat ~ ., baselearner = "bbs", data = bodyfat, control = boost_control(trace = TRUE))
[ 1] risk:
[ 53]
Final risk:
> set.seed(123) ## set seed to make results reproducible
> cvm <- cvrisk(gam2) ## default method is 25-fold bootstrap cross-validation
> cvm
Cross-validated Squared Error (Regression)
gamboost(formula = DEXfat ~ ., data = bodyfat, baselearner = "bbs", control = boost_control(trace = TRUE))
Optimal number of boosting iterations: 33
> mstop(cvm) ## extract the optimal mstop
[1] 33
> gam2[ mstop(cvm) ] ## set the model automatically to the optimal mstop
Model-based Boosting
Call:
gamboost(formula = DEXfat ~ ., data = bodyfat, baselearner = "bbs", control = boost_control(trace = TRUE))
Squared Error (Regression)
Loss function: (y - f)^2
Number of boosting iterations: mstop = 33
Step size: 0.1
Offset:
Number of baselearners: 9
plot(cvm)
> names(coef(gam2)) ## displays the selected base-learners at iteration 33 (the mstop set above)
[1] "bbs(waistcirc, df = dfbase)" "bbs(hipcirc, df = dfbase)" "bbs(kneebreadth, df = dfbase)"
[4] "bbs(anthro3a, df = dfbase)" "bbs(anthro3b, df = dfbase)" "bbs(anthro3c, df = dfbase)"
[7] "bbs(anthro4, df = dfbase)"
> gam2[1000, return = FALSE] # return = FALSE just suppresses "print(gam2)"
[ 101] risk:
[ 153] risk:
[ 205] risk:
[ 257] risk:
[ 309] risk:
[ 361] risk:
[ 413] risk:
[ 465] risk:
[ 517] risk:
[ 569] risk:
[ 621] risk:
[ 673] risk:
[ 725] risk:
[ 777] risk:
[ 829] risk:
[ 881] risk:
[ 933] risk:
[ 985]
Final risk:
> names(coef(gam2)) ## displays the selected base-learners, now at iteration 1000
[1] "bbs(age, df = dfbase)" "bbs(waistcirc, df = dfbase)" "bbs(hipcirc, df = dfbase)"
[4] "bbs(elbowbreadth, df = dfbase)" "bbs(kneebreadth, df = dfbase)" "bbs(anthro3a, df = dfbase)"
[7] "bbs(anthro3b, df = dfbase)" "bbs(anthro3c, df = dfbase)" "bbs(anthro4, df = dfbase)"
> glm3 <- glmboost(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat, family = QuantReg(tau = 0.5), control = boost_control(mstop = 500))
> coef(glm3, off2int = TRUE)
(Intercept) hipcirc kneebreadth anthro3a
Compare to rpart
> library(rpart)
> fattree <- rpart(DEXfat ~ ., data=bodyfat)
> plot(fattree)
> text(fattree)
> labels(fattree)
[1] "root" "waistcirc =3.42" "hipcirc =101.3"
[7] "waistcirc>=88.4" "hipcirc =109.9"
cars
iris
cars
Optimizing
Coefficients:
(Intercept) speed
attr(,"offset")
[1]
Call:
glmboost.formula(formula = dist ~ speed, data = cars, control = boost_control(mstop = 1000), family = Laplace())
Coefficients:
(Intercept) speed
attr(,"offset")
[1]
Sparse matrix example
> coef(mod, which = which(beta > 0))
V306 V1052 V1090 V3501 V4808 V5473 V7929 V8333 V8799 V
attr(,"offset")
[1]
Aside: Boosting and SVM…
Remember “margins” from the SVM? Partitioning the “linear” or transformed space?
In boosting we are effectively (not explicitly) attempting to maximize the minimum margin of any training example.
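One common way to make the margin statement precise is given here as a sketch. The notation is not from the slides: it assumes labels $y_i \in \{-1, +1\}$, weak classifiers $h_m(x) \in \{-1, +1\}$, and non-negative vote weights $\alpha_m$ that are not all zero.

```latex
F(x) = \sum_{m=1}^{M} \alpha_m h_m(x), \qquad
\operatorname{margin}(x_i, y_i) = \frac{y_i \sum_{m=1}^{M} \alpha_m h_m(x_i)}{\sum_{m=1}^{M} \alpha_m} \in [-1, 1], \qquad
\text{minimum margin} = \min_{i} \operatorname{margin}(x_i, y_i)
```

Under these assumptions the margin of an example is positive exactly when the weighted vote classifies it correctly, and its size measures how decisive the vote is.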
Dimension reduction..
Principal component analysis (PCA) and metaPCA (in R)
Singular Value Decomposition
Feature selection, reduction
Clustering
Why?
–Curse of dimensionality – or – some subset of the data should not be used as it adds noise
What is it?
–Various methods to reach an optimal subset
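As a small illustration of how PCA and the singular value decomposition relate, the sketch below runs both on the built-in iris measurements using base R only. The choice of data set, the standardization of the variables, and the 95% variance cut-off are assumptions made for the example.

```r
# PCA with prcomp() and the same decomposition via svd() (base R).
X <- as.matrix(iris[, 1:4])

# PCA on centered, unit-variance variables
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# The same quantities from the SVD of the standardized data matrix
Z <- scale(X, center = TRUE, scale = TRUE)
sv <- svd(Z)
sdev_from_svd <- sv$d / sqrt(nrow(Z) - 1)   # singular values -> component std. devs

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Keep the smallest number of components reaching 95% of the variance
k <- which(cumsum(var_explained) >= 0.95)[1]
scores_k <- pca$x[, seq_len(k), drop = FALSE]   # reduced-dimension data

round(var_explained, 4)
dim(scores_k)
```

The standard deviations reported by `prcomp` match the scaled singular values, and the reduced data `scores_k` has fewer columns than `X` while retaining most of the variance.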
Feature selection
The goodness of a feature or feature subset depends on the measure used to evaluate it
Various measures
–Information measures
–Distance measures
–Dependence measures
–Consistency measures
–Accuracy measures
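To make one of these measures concrete, the following sketch ranks candidate features by a simple dependence measure, the absolute Pearson correlation with the response. The data set (`mtcars`, predicting `mpg`) and the decision to keep the top three features are assumptions for illustration; other measures from the list would generally give different rankings.

```r
# Rank features by a dependence measure: |correlation| with the response (base R).
predictors <- setdiff(names(mtcars), "mpg")

# One score per candidate feature
dep <- sapply(predictors, function(v) abs(cor(mtcars[[v]], mtcars$mpg)))

# Higher score = stronger linear dependence on the response
ranked <- sort(dep, decreasing = TRUE)
top3 <- names(ranked)[1:3]

round(ranked, 3)
top3
```

A univariate ranking like this ignores redundancy between features, which is one reason subset-based measures such as consistency or accuracy are also used.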
Libraries in R that you used…
MetaPCA (prcomp, metaPCA)
EDR (effective dimension reduction)
dr
Lab 6
prostate data (lab 7).
Demo (lab 8)
library(EDR) # effective dimension reduction
###install.packages("edrGraphicalTools")
###library(edrGraphicalTools)
demo(edr_ex1)
demo(edr_ex2)
demo(edr_ex3)
demo(edr_ex4)
Lab 9
library(dr)
data(ais)
# default fitting method is "sir"
s0 <- dr(LBM~log(SSF)+log(Wt)+log(Hg)+log(Ht)+log(WCC)+log(RCC)+
  log(Hc)+log(Ferr),data=ais)
# Refit, using a different function for slicing to agree with arc.
summary(s1 <- update(s0,slice.function=dr.slices.arc))
# Refit again, using save, with 10 slices; the default is max(8,ncol+3)
summary(s2<-update(s1,nslices=10,method="save"))
# Refit, using phdres; output is similar for phdy, but tests are not justifiable.
summary(s3<- update(s1,method="phdres"))
# fit using ire:
summary(s4 <- update(s1,method="ire"))
# fit using Sex as a grouping variable.
s5 <- update(s4,group=~Sex)
> s0
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) + log(RCC) + log(Hc) + log(Ferr), data = ais)
Estimated Basis Vectors for Central Subspace:
Dir1 Dir2 Dir3 Dir4
log(SSF)
log(Wt)
log(Hg)
log(Ht)
log(WCC)
log(RCC)
log(Hc)
log(Ferr)
Eigenvalues:
[1]
> summary(s1 <- update(s0,slice.function=dr.slices.arc))
Call:
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) + log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc)
Method:
sir with 11 slices, n = 202.
Slice Sizes:
Estimated Basis Vectors for Central Subspace:
Dir1 Dir2 Dir3 Dir4
log(SSF)
log(Wt)
log(Hg)
log(Ht)
log(WCC)
log(RCC)
log(Hc)
log(Ferr)
Dir1 Dir2 Dir3 Dir4
Eigenvalues
R^2(OLS|dr)
Large-sample Marginal Dimension Tests:
Stat df p.value
0D vs >= 1D
1D vs >= 2D
2D vs >= 3D
3D vs >= 4D
> summary(s2<-update(s1,nslices=10,method="save"))
Call:
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) + log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc, nslices = 10, method = "save")
Method:
save with 10 slices, n = 202.
Slice Sizes:
Estimated Basis Vectors for Central Subspace:
Dir1 Dir2 Dir3 Dir4
log(SSF)
log(Wt)
log(Hg)
log(Ht)
log(WCC)
log(RCC)
log(Hc)
log(Ferr)
Dir1 Dir2 Dir3 Dir4
Eigenvalues
R^2(OLS|dr)
Large-sample Marginal Dimension Tests:
Stat df(Nor) p.value(Nor) p.value(Gen)
0D vs >= 1D
1D vs >= 2D
2D vs >= 3D
3D vs >= 4D
S0 v. S2
S3 and S4
Remember - Assignment 2?
General assignment – read EPI_data, specify a new data subset, create data frames in R and save them into a database
In R Studio
–Install package – “rmongodb” (activate it)
–MongoDB -
–http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
We’ll revisit these
http://projects.apache.org/indexes/category.html#database
–Hadoop (MapReduce)
–Pig ( )
–HIVE ( )
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
https://cwiki.apache.org/confluence/display/Hive/Tutorial
https://cwiki.apache.org/confluence/display/Hive/LanguageManual
–Spark