
Peter Fox, Data Analytics – ITWS-4963/ITWS-6965, Week 13a, April 22, 2014: Boosting, dimension reduction and a preview of the return to Big Data

Reading

Weak models…
A weak learner: a classifier which is only slightly correlated with the true classification (it can label examples better than random guessing).
A strong learner: a classifier that is arbitrarily well-correlated with the true classification.
Can a set of weak learners create a single strong learner?

Boosting…
Boosting reduces bias in supervised learning. Most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier, typically weighted in a way related to each weak learner's accuracy.
After a weak learner is added, the data are reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight. Thus, future weak learners focus more on the examples that previous weak learners misclassified.
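To make the reweighting loop concrete, here is a minimal sketch of discrete AdaBoost using rpart stumps as weak learners. It illustrates the idea described above and is not the mboost code used later in these slides; the function name and defaults are assumptions, and the labels y are assumed to be coded -1/+1.

library(rpart)
adaboost_sketch <- function(X, y, M = 20) {
  # X: data frame of predictors; y: labels coded -1/+1
  df <- data.frame(X, y = factor(y))
  n <- length(y)
  w <- rep(1 / n, n)                       # start with uniform example weights
  stumps <- vector("list", M)
  alpha <- numeric(M)
  for (m in 1:M) {
    fit <- rpart(y ~ ., data = df, weights = w,
                 control = rpart.control(maxdepth = 1, cp = 0))   # a stump = weak learner
    pred <- ifelse(predict(fit, df, type = "class") == "1", 1, -1)
    err <- max(sum(w * (pred != y)), 1e-10)   # weighted training error (weights sum to 1)
    alpha[m] <- 0.5 * log((1 - err) / err)    # this learner's vote in the final classifier
    w <- w * exp(-alpha[m] * y * pred)        # misclassified examples gain weight, correct ones lose it
    w <- w / sum(w)                           # renormalise
    stumps[[m]] <- fit
  }
  list(stumps = stumps, alpha = alpha)
}

The final strong classifier is the sign of the alpha-weighted sum of the stump predictions; for instance, the function could be called with X = iris[1:100, 1:4] and y = ifelse(iris$Species[1:100] == "setosa", 1, -1).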

Diamonds
require(ggplot2) # or load the package first
data(diamonds)
head(diamonds) # look at the data!
# ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()
ggplot(diamonds, aes(clarity)) + geom_bar() + facet_wrap(~ cut)
ggplot(diamonds) + geom_histogram(aes(x=price)) + geom_vline(xintercept=12000)
ggplot(diamonds, aes(clarity)) + geom_freqpoly(aes(group = cut, colour = cut))


Using diamonds… boost (glm)
> mglmboost <- glmboost(as.factor(Expensive) ~ ., data = diamonds, family = Binomial(link = "logit"))
> summary(mglmboost)
Generalized Linear Models Fitted via Gradient Boosting
Call: glmboost.formula(formula = as.factor(Expensive) ~ ., data = diamonds, family = Binomial(link = "logit"))
Negative Binomial Likelihood
Loss function: {
    f <- pmin(abs(f), 36) * sign(f)
    p <- exp(f)/(exp(f) + exp(-f))
    y <- (y + 1)/2
    -y * log(p) - (1 - y) * log(1 - p)
}

Using diamonds… boost (glm)
> summary(mglmboost) # continued
Number of boosting iterations: mstop = 100
Step size: 0.1
Offset:
Coefficients:
NOTE: Coefficients from a Binomial model are half the size of coefficients from a model fitted via glm(..., family = 'binomial'). See Warning section in ?coef.mboost
(Intercept)  carat  clarity.L
attr(,"offset")
[1]
Selection frequencies:
carat  (Intercept)  clarity.L
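Note that Expensive is not a column of ggplot2's diamonds data, so it has to be created first. A minimal sketch of one way to set it up before the glmboost call above, assuming the cutoff is the 12000 line drawn in the earlier price histogram and that price itself is then dropped so it cannot trivially predict its own threshold (both are assumptions, not part of the original lab):

library(mboost)
library(ggplot2)
data(diamonds)
diamonds$Expensive <- as.integer(diamonds$price > 12000)   # hypothetical cutoff, matching the geom_vline above
d2 <- diamonds[, setdiff(names(diamonds), "price")]        # drop price before boosting
mglmboost <- glmboost(as.factor(Expensive) ~ ., data = d2, family = Binomial(link = "logit"))
summary(mglmboost)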


bodyfat
The response variable is body fat measured by DXA (DEXfat), which can be seen as the gold standard for measuring body fat. However, DXA measurements are too expensive and complicated for broad use. Anthropometric measurements such as waist or hip circumference are, by comparison, very easy to obtain in a standard screening. A prediction formula based only on these measures could therefore be a valuable alternative with high clinical relevance for daily use.


bodyfat
## regular linear model using three variables
lm1 <- lm(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat)
## Estimate the same model by glmboost
glm1 <- glmboost(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat)
# We consider all available variables as potential predictors.
glm2 <- glmboost(DEXfat ~ ., data = bodyfat)
# or one could essentially call:
preds <- names(bodyfat[, names(bodyfat) != "DEXfat"]) ## names of predictors
fm <- as.formula(paste("DEXfat ~", paste(preds, collapse = "+"))) ## build formula
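As a setup note, the bodyfat data used here ship with the TH.data package (older versions of mboost also provided them). A minimal, self-contained start for this part of the lab, plus a check that fitting the assembled formula fm reproduces glm2, might look like this sketch (the object name glm2b is an assumption):

library(mboost)
data("bodyfat", package = "TH.data")      # Garcia et al. body fat data
glm2b <- glmboost(fm, data = bodyfat)     # same model as glm2, built from the assembled formula
all.equal(coef(glm2), coef(glm2b))        # should agree with glm2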

Compare linear models
> coef(lm1)
(Intercept)  hipcirc  kneebreadth  anthro3a
> coef(glm1, off2int = TRUE)  ## off2int adds the offset to the intercept
(Intercept)  hipcirc  kneebreadth  anthro3a
Conclusion?

> fm
DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth + anthro3a + anthro3b + anthro3c + anthro4
> coef(glm2, which = "") ## select all
(Intercept)  age  waistcirc  hipcirc  elbowbreadth  kneebreadth  anthro3a  anthro3b  anthro3c  anthro4
attr(,"offset")
[1]

plot(glm2, off2int = TRUE)

plot(glm2, ylim = range(coef(glm2, which = preds)))

> summary(bodyfat)
(six-number summaries (Min., 1st Qu., Median, Mean, 3rd Qu., Max.) of age, DEXfat, waistcirc, hipcirc, elbowbreadth, kneebreadth, anthro3a, anthro3b, anthro3c and anthro4)

Other forms of boosting
Gamboost = Generalized Additive Model boosting: gradient boosting for optimizing arbitrary loss functions, where component-wise smoothing procedures are used as (univariate) base-learners.

> gam1 <- gamboost(DEXfat ~ bbs(hipcirc) + bbs(kneebreadth) + bbs(anthro3a), data = bodyfat)
> # Using plot() on a gamboost object automatically delivers the partial effects of the different base-learners:
> par(mfrow = c(1, 3)) ## 3 plots in one device
> plot(gam1) ## get the partial effects
# base-learners: bbs, bols, btree, …
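As an aside on the base-learner types mentioned in the comment above (bols gives a linear effect, bbs a penalized P-spline, btree a stump-like tree), here is a small hedged sketch of mixing linear and smooth base-learners in one gamboost formula; the pairing of variables with base-learners is purely illustrative:

gam1b <- gamboost(DEXfat ~ bols(age) + bbs(hipcirc) + bbs(kneebreadth), data = bodyfat)
par(mfrow = c(1, 3))
plot(gam1b)   # a linear partial effect for age, smooth effects for the other two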


> gam2 <- gamboost(DEXfat ~ ., baselearner = "bbs", data = bodyfat, control = boost_control(trace = TRUE))
[  1] risk:
[ 53] Final risk:
> set.seed(123) ## set seed to make results reproducible
> cvm <- cvrisk(gam2) ## default method is 25-fold bootstrap cross-validation

> cvm
Cross-validated Squared Error (Regression)
gamboost(formula = DEXfat ~ ., data = bodyfat, baselearner = "bbs", control = boost_control(trace = TRUE))
Optimal number of boosting iterations: 33

> mstop(cvm) ## extract the optimal mstop
[1] 33
> gam2[mstop(cvm)] ## set the model automatically to the optimal mstop
Model-based Boosting
Call: gamboost(formula = DEXfat ~ ., data = bodyfat, baselearner = "bbs", control = boost_control(trace = TRUE))
Squared Error (Regression)
Loss function: (y - f)^2
Number of boosting iterations: mstop = 33
Step size: 0.1
Offset:
Number of baselearners: 9

plot(cvm)

> names(coef(gam2)) ## displays the selected base-learners at iteration 30
[1] "bbs(waistcirc, df = dfbase)"  "bbs(hipcirc, df = dfbase)"    "bbs(kneebreadth, df = dfbase)"
[4] "bbs(anthro3a, df = dfbase)"   "bbs(anthro3b, df = dfbase)"   "bbs(anthro3c, df = dfbase)"
[7] "bbs(anthro4, df = dfbase)"
> gam2[1000, return = FALSE] # return = FALSE just suppresses "print(gam2)"
[ 101] risk:   [ 153] risk:   [ 205] risk:   ...   [ 933] risk:   [ 985] Final risk:

> names(coef(gam2)) ## displays the selected base-learners, now at iteration 1000
[1] "bbs(age, df = dfbase)"          "bbs(waistcirc, df = dfbase)"  "bbs(hipcirc, df = dfbase)"
[4] "bbs(elbowbreadth, df = dfbase)" "bbs(kneebreadth, df = dfbase)" "bbs(anthro3a, df = dfbase)"
[7] "bbs(anthro3b, df = dfbase)"     "bbs(anthro3c, df = dfbase)"   "bbs(anthro4, df = dfbase)"
> glm3 <- glmboost(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat, family = QuantReg(tau = 0.5), control = boost_control(mstop = 500))
> coef(glm3, off2int = TRUE)
(Intercept)  hipcirc  kneebreadth  anthro3a


Compare to rpart
> fattree <- rpart(DEXfat ~ ., data = bodyfat)
> plot(fattree)
> text(fattree)
> labels(fattree)
[1] "root"  "waistcirc< 88.4"  "waistcirc>=88.4"  (remaining split labels, with cutpoints 3.42, 101.3 and 109.9, garbled in the transcript)
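To put the tree and the boosted model side by side, a small hedged sketch of an in-sample comparison (illustrative only; a fair comparison would use cross-validation, as with cvrisk above):

library(rpart)
fattree <- rpart(DEXfat ~ ., data = bodyfat)
cor(predict(fattree, bodyfat), bodyfat$DEXfat)   # in-sample fit of the tree
cor(fitted(gam2), bodyfat$DEXfat)                # in-sample fit of the boosted GAM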


cars

iris

cars


Optimizing
Coefficients:
(Intercept)  speed
attr(,"offset")
[1]
Call: glmboost.formula(formula = dist ~ speed, data = cars, control = boost_control(mstop = 1000), family = Laplace())
Coefficients:
(Intercept)  speed
attr(,"offset")
[1]
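A hedged sketch reconstructing a fit of this kind on the built-in cars data; Laplace() gives boosted least-absolute-deviation regression, and the plotting step is an assumed addition for inspecting the fitted line:

library(mboost)
data(cars)
cars.gb <- glmboost(dist ~ speed, data = cars, control = boost_control(mstop = 1000), family = Laplace())
cf <- coef(cars.gb, off2int = TRUE)          # intercept (offset added back) and slope
plot(dist ~ speed, data = cars)
abline(a = cf[1], b = cf[2], col = "red")    # boosted LAD regression line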


Sparse matrix example
> coef(mod, which = which(beta > 0))
V306  V1052  V1090  V3501  V4808  V5473  V7929  V8333  V8799  V
attr(,"offset")
[1]
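The call that produced this output is not shown in the transcript; a sketch along the lines of the sparse, high-dimensional example in the glmboost documentation, which yields output of exactly this shape (the sizes, seed and sparsity level are taken from that example and are not specific to this lecture):

library(mboost)
library(Matrix)
set.seed(271828)                                 # assumed seed
n <- 100; p <- 10000; ptrue <- 10                # many predictors, few truly relevant
X <- Matrix(0, nrow = n, ncol = p)               # sparse design matrix
X[sample(1:(n * p), floor(n * p / 20))] <- runif(floor(n * p / 20))
beta <- numeric(p)
beta[sample(1:p, ptrue)] <- 10                   # the non-zero true coefficients
y <- drop(X %*% beta + rnorm(n, sd = 0.1))
mod <- glmboost(y = y, x = X, center = TRUE)     # matrix interface of glmboost
coef(mod, which = which(beta > 0))               # estimates for the truly relevant columns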


Aside: Boosting and SVM…
Remember "margins" from the SVM? Partitioning the "linear" or transformed space? In boosting we are effectively (though not explicitly) attempting to maximize the minimum margin of any training example.

Dimension reduction…
Principal component analysis (PCA) and metaPCA (in R)
Singular Value Decomposition (SVD)
Feature selection, feature reduction
Clustering
Why?
–Curse of dimensionality – or – some subset of the data should not be used as it only adds noise
What is it?
–Various methods to reach an optimal subset
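As a quick illustration of the first item, a minimal prcomp sketch on the iris measurements (the choice of dataset is only for illustration; the labs below use other data):

pr <- prcomp(iris[, 1:4], scale. = TRUE)   # centre and scale the four measurements
summary(pr)                                # proportion of variance explained per component
head(pr$x[, 1:2])                          # the data projected onto the first two PCs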

Feature selection
The goodness of a feature or feature subset depends on the evaluation measure used. Various measures:
–Information measures
–Distance measures
–Dependence measures
–Consistency measures
–Accuracy measures
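For example, a dependence measure can be turned into a very simple filter. A hedged sketch that ranks the bodyfat predictors by absolute correlation with the response (an illustrative ranking, not a full selection procedure):

data("bodyfat", package = "TH.data")
predictors <- setdiff(names(bodyfat), "DEXfat")
cors <- sapply(predictors, function(v) abs(cor(bodyfat[[v]], bodyfat$DEXfat)))
sort(cors, decreasing = TRUE)   # candidate features ordered by dependence on the response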

Libraries in R that you used…
–MetaPCA (prcomp, metaPCA)
–EDR (effective dimension reduction)
–dr

Lab 6

prostate data (lab 7)

Demo (lab 8)
library(EDR) # effective dimension reduction
### install.packages("edrGraphicalTools")
### library(edrGraphicalTools)
demo(edr_ex1)
demo(edr_ex2)
demo(edr_ex3)
demo(edr_ex4)

Lab 9
library(dr)
data(ais)
# default fitting method is "sir"
s0 <- dr(LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) + log(RCC) + log(Hc) + log(Ferr), data = ais)
# Refit, using a different function for slicing to agree with arc.
summary(s1 <- update(s0, slice.function = dr.slices.arc))
# Refit again, using save, with 10 slices; the default is max(8, ncol+3)
summary(s2 <- update(s1, nslices = 10, method = "save"))
# Refit using phdres; output is similar for phdy, but the tests differ for phd and are not justifiable.
summary(s3 <- update(s1, method = "phdres"))
# Fit using ire:
summary(s4 <- update(s1, method = "ire"))
# Fit using Sex as a grouping variable.
s5 <- update(s4, group = ~Sex)

> s0
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) + log(RCC) + log(Hc) + log(Ferr), data = ais)
Estimated Basis Vectors for Central Subspace: Dir1–Dir4, one row per predictor (log(SSF), log(Wt), log(Hg), log(Ht), log(WCC), log(RCC), log(Hc), log(Ferr))
Eigenvalues: [1]

> summary(s1 <- update(s0, slice.function = dr.slices.arc))
Call: dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) + log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc)
Method: sir with 11 slices, n = 202.
Slice Sizes:
Estimated Basis Vectors for Central Subspace (Dir1–Dir4), Eigenvalues and R^2(OLS|dr)
Large-sample Marginal Dimension Tests (Stat, df, p.value):
0D vs >= 1D
1D vs >= 2D
2D vs >= 3D
3D vs >= 4D

> summary(s2 <- update(s1, nslices = 10, method = "save"))
Call: dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) + log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc, nslices = 10, method = "save")
Method: save with 10 slices, n = 202.
Slice Sizes:
Estimated Basis Vectors for Central Subspace (Dir1–Dir4), Eigenvalues and R^2(OLS|dr)
Large-sample Marginal Dimension Tests (Stat, df(Nor), p.value(Nor), p.value(Gen)):
0D vs >= 1D
1D vs >= 2D
2D vs >= 3D
3D vs >= 4D

s0 vs. s2

s3 and s4


Remember Assignment 2?
General assignment: read EPI_data, specify a new data subset, create data frames in R, and save them into a database.
In R Studio:
–Install the package "rmongodb" (and activate it)
–MongoDB – http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
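A minimal, hedged sketch of the rmongodb workflow implied above, assuming a MongoDB server running on localhost; the namespace "epi.records" and the field names are hypothetical:

library(rmongodb)
mongo <- mongo.create(host = "localhost")                  # connect to the local server
if (mongo.is.connected(mongo)) {
  rec <- mongo.bson.from.list(list(country = "Sweden", EPI = 68.8))   # hypothetical row from an EPI data frame
  mongo.insert(mongo, "epi.records", rec)                  # insert into the assumed epi.records collection
  print(mongo.count(mongo, "epi.records"))                 # number of stored documents
  mongo.destroy(mongo)                                     # close the connection
}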

We'll revisit these: http://projects.apache.org/indexes/category.html#database
–Hadoop (MapReduce)
–Pig ( )
–HIVE ( )
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
https://cwiki.apache.org/confluence/display/Hive/Tutorial
https://cwiki.apache.org/confluence/display/Hive/LanguageManual
–Spark