Peter Fox, Data Analytics – ITWS-4600/ITWS-6600, Week 11a, April 12, 2016. Interpreting: MDS, DR, SVM, Factor Analysis; and Boosting.


1 Peter Fox, Data Analytics – ITWS-4600/ITWS-6600, Week 11a, April 12, 2016. Interpreting: MDS, DR, SVM, Factor Analysis; and Boosting

This?
library(EDR)        # effective dimension reduction
library(dr)
library(clustrd)
##### install.packages("edrGraphicalTools") ##### ?
library(edrGraphicalTools)
demo(edr_ex1)
demo(edr_ex2)
demo(edr_ex3)
demo(edr_ex4)
2

Some examples Lab8b_dr1_2016.R Lab8b_dr2_2016.R Lab8b_dr3_2016.R Lab8b_dr4_2016.R 3

Spellman 4

MDS: Lab8b_mds1_2016.R, Lab8b_mds2_2016.R, Lab8b_mds3_2016.R; http://gastonsanchez.com/blog/how-to/2013/01/23/MDS-in-R.html 5
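A minimal classical MDS sketch on the built-in eurodist road distances, which is presumably what the Eurodist slide below shows; cmdscale() is in base R (stats), so no extra packages are needed.

loc <- cmdscale(eurodist, k = 2)                               # 2-D configuration preserving the distances
plot(loc[, 1], -loc[, 2], type = "n", asp = 1, xlab = "", ylab = "")
text(loc[, 1], -loc[, 2], labels = rownames(loc), cex = 0.7)   # flip the y-axis so north points up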

Eurodist 6

You worked on these… Lab9b_svm1_2015.R –> Lab9b_svm11_2015.R, Lab9b_svm_rpart1_2016.R. Karatzoglou et al. – who worked on this, starting from page 9 (bottom)? 7

Ozone
> library(e1071)
> library(rpart)
> data(Ozone, package = "mlbench")
> # see ?Ozone in mlbench for the field codes
> ## split data into a train and test set
> index <- 1:nrow(Ozone)
> testindex <- sample(index, trunc(length(index)/3))
> testset <- na.omit(Ozone[testindex, -3])
> trainset <- na.omit(Ozone[-testindex, -3])
> svm.model <- svm(V4 ~ ., data = trainset, type = "C-classification", cost = 1000, gamma = )
> svm.pred <- predict(svm.model, testset[, -3])
> crossprod(svm.pred - testset[, 3]) / length(testindex)
8

Glass
library(e1071)
library(rpart)
data(Glass, package = "mlbench")
index <- 1:nrow(Glass)
testindex <- sample(index, trunc(length(index)/3))
testset <- Glass[testindex, ]
trainset <- Glass[-testindex, ]
svm.model <- svm(Type ~ ., data = trainset, cost = 100, gamma = 1)
svm.pred <- predict(svm.model, testset[, -10])
9

> table(pred = svm.pred, true = testset[,10])   # confusion matrix: predicted vs. true glass Type

Example Lab9b_svm1_2016.R
n <- 150             # number of data points
p <- 2               # dimension
sigma <- 1           # standard deviation of each distribution
meanpos <- 0         # centre of the distribution of positive examples
meanneg <- 3         # centre of the distribution of negative examples
npos <- round(n/2)   # number of positive examples
nneg <- n - npos     # number of negative examples
# Generate the positive and negative examples
xpos <- matrix(rnorm(npos*p, mean = meanpos, sd = sigma), npos, p)
xneg <- matrix(rnorm(nneg*p, mean = meanneg, sd = sigma), nneg, p)
x <- rbind(xpos, xneg)
# Generate the labels
y <- matrix(c(rep(1, npos), rep(-1, nneg)))
# Visualize the data
plot(x, col = ifelse(y > 0, 1, 2))
legend("topleft", c('Positive', 'Negative'), col = seq(2), pch = 1, text.col = seq(2))
11

Example 1 12

Train / test
ntrain <- round(n*0.8)        # number of training examples
tindex <- sample(n, ntrain)   # indices of training samples
xtrain <- x[tindex, ]
xtest  <- x[-tindex, ]
ytrain <- y[tindex]
ytest  <- y[-tindex]
istrain <- rep(0, n)
istrain[tindex] <- 1
# Visualize
plot(x, col = ifelse(y > 0, 1, 2), pch = ifelse(istrain == 1, 1, 2))
legend("topleft", c('Positive Train', 'Positive Test', 'Negative Train', 'Negative Test'),
       col = c(1, 1, 2, 2), pch = c(1, 2, 1, 2), text.col = c(1, 1, 2, 2))
13

Comparison of test classifier 14

Example ctd
library(kernlab)   # provides ksvm() and its accessors
svp <- ksvm(xtrain, ytrain, type = "C-svc", kernel = 'vanilladot', C = 100, scaled = c())
# General summary
svp
# Attributes that you can access
attributes(svp)   # did you look?
# For example, the support vectors
alpha(svp)
alphaindex(svp)
b(svp)   # remember b?
# Use the built-in function to pretty-plot the classifier
plot(svp, data = xtrain)
15

16

SVM for iris 17

SVM for Swiss 18

e.g. Probabilities…
library(kernlab)
data(promotergene)
## create test and training set
ind <- sample(1:dim(promotergene)[1], 20)
genetrain <- promotergene[-ind, ]
genetest  <- promotergene[ind, ]
## train a support vector machine
gene <- ksvm(Class ~ ., data = genetrain, kernel = "rbfdot",
             kpar = list(sigma = 0.015), C = 70, cross = 4, prob.model = TRUE)
## predict gene type probabilities on the test set
genetype <- predict(gene, genetest, type = "probabilities")
19

Result
> genetype   # a 20 x 2 matrix of class probabilities (columns "+" and "-"), one row per test sequence

kernlab notes: http://aquarius.tw.rpi.edu/html/DA/svmbasic_notes.pdf Some scripts: Lab9b_svm12_2016.R, Lab9b_svm13_2016.R 21

These: example_exploratoryFactorAnalysis.R on dataset_exploratoryFactorAnalysis.csv (on website) – …/tutorial-series-exploratory-factor.html (this was the example skipped over in lecture 10a) – http://stats.stackexchange.com/questions/1576/what-are-the-differences-between-factor-analysis-and-principal-component-analysis Do these: Lab10b_fa{1,2,4,5}_2016.R 22

Factor Analysis
library(psych)   # irt.fa(), score.irt(), and the iqitems/ability data
data(iqitems)
# data(ability)
ability.irt <- irt.fa(ability)
ability.scores <- score.irt(ability.irt, ability)
data(attitude)
cor(attitude)
# Compute eigenvalues and eigenvectors of the correlation matrix.
pfa.eigen <- eigen(cor(attitude))
pfa.eigen$values
# set a value for the number of factors (for clarity)
factors <- 2
# Extract and transform two components.
pfa.eigen$vectors[, 1:factors] %*%
  diag(sqrt(pfa.eigen$values[1:factors]), factors, factors)
23
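For comparison with the eigen-decomposition above, base R's factanal() fits a maximum-likelihood factor analysis directly; a short sketch on the same attitude data (not from the slides):

fa2 <- factanal(attitude, factors = 2, rotation = "varimax")   # ML factor analysis, 2 factors
print(fa2$loadings, cutoff = 0.3)                              # compare with the eigen-based loadings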

Glass
index <- 1:nrow(Glass)
testindex <- sample(index, trunc(length(index)/3))
testset <- Glass[testindex, ]
trainset <- Glass[-testindex, ]
cor(testset[, -10])   # correlations of the numeric predictors; column 10 (Type) is a factor
Factor Analysis?
24

Bootstrap aggregation (bagging) improves the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. It is usually applied to decision tree methods, but can be used with any type of method – bagging is a special case of the model averaging approach. Harder to interpret – why? (A hand-rolled sketch of the idea follows below.) 25
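The adabag examples that follow use the packaged bagging() function; conceptually, bagging is just "fit the same learner on bootstrap resamples and combine their predictions". A hand-rolled sketch with rpart trees on iris (illustrative only; the object names and the number of resamples B are made up):

library(rpart)
set.seed(1)
B <- 25                                                   # number of bootstrap samples
trees <- vector("list", B)
for (b in 1:B) {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]      # bootstrap resample of the rows
  trees[[b]] <- rpart(Species ~ ., data = boot)           # one tree per resample
}
# Combine the B trees by majority vote
votes <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged == iris$Species)                              # accuracy of the bagged ensemble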

Ozone: 100 bootstrap samples and their average (figure)

Bagging shows improvements for unstable procedures (Breiman, 1996), e.g. neural nets, classification and regression trees, and subset selection in linear regression … but it can mildly degrade the performance of stable methods such as K-nearest neighbors. 27

Bagging (bootstrapping aggregation)*
library(adabag)    # bagging() and predict.bagging()
library(rpart)
library(mlbench)
data(BreastCancer)
l <- length(BreastCancer[,1])
sub <- sample(1:l, 2*l/3)
BC.bagging <- bagging(Class ~ ., data = BreastCancer[,-1], mfinal = 20,
                      control = rpart.control(maxdepth = 3))
BC.bagging.pred <- predict.bagging(BC.bagging, newdata = BreastCancer[-sub, -1])
BC.bagging.pred$confusion   # Observed vs. Predicted Class (benign / malignant)
BC.bagging.pred$error
28

A little later
> data(BreastCancer)
> l <- length(BreastCancer[,1])
> sub <- sample(1:l, 2*l/3)
> BC.bagging <- bagging(Class ~ ., data = BreastCancer[,-1], mfinal = 20,
+                       control = rpart.control(maxdepth = 3))
> BC.bagging.pred <- predict.bagging(BC.bagging, newdata = BreastCancer[-sub, -1])
> BC.bagging.pred$confusion
                Observed Class
Predicted Class  benign  malignant
      benign          …          …
      malignant       7         78
> BC.bagging.pred$error
29

Bagging (Vehicle)
> data(Vehicle)
> l <- length(Vehicle[,1])
> sub <- sample(1:l, 2*l/3)
> Vehicle.bagging <- bagging(Class ~ ., data = Vehicle[sub, ], mfinal = 40,
+                            control = rpart.control(maxdepth = 5))
> Vehicle.bagging.pred <- predict.bagging(Vehicle.bagging, newdata = Vehicle[-sub, ])
> Vehicle.bagging.pred$confusion   # Observed vs. Predicted Class (bus, opel, saab, van)
> Vehicle.bagging.pred$error
30

Weak models … A weak learner is a classifier which is only slightly correlated with the true classification (it can label examples better than random guessing). A strong learner is a classifier that is arbitrarily well-correlated with the true classification. Can a set of weak learners create a single strong learner? (A decision-stump sketch follows below.) 31
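A decision stump (a depth-1 tree) is the canonical weak learner; a quick sketch comparing one with a full tree on iris (illustrative only, not from the slides):

library(rpart)
stump <- rpart(Species ~ ., data = iris, control = rpart.control(maxdepth = 1))
full  <- rpart(Species ~ ., data = iris)
mean(predict(stump, iris, type = "class") == iris$Species)   # weak: better than the 1/3 chance rate
mean(predict(full,  iris, type = "class") == iris$Species)   # a stronger single learner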

Boosting … reduces bias in supervised learning. Most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier – typically weighted in some way that is usually related to the weak learners' accuracy. After a weak learner is added, the data are reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight. Thus, future weak learners focus more on the examples that previous weak learners misclassified. 32
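A boosting counterpart to the earlier bagging calls, using adabag's AdaBoost-style boosting() on the same Vehicle train/test convention (a sketch; mfinal and maxdepth are arbitrary choices, not taken from the slides):

library(adabag)   # boosting() and predict.boosting()
library(rpart)
data(Vehicle, package = "mlbench")
l <- nrow(Vehicle)
sub <- sample(1:l, 2*l/3)
Vehicle.adaboost <- boosting(Class ~ ., data = Vehicle[sub, ], mfinal = 20,
                             control = rpart.control(maxdepth = 3))
Vehicle.adaboost.pred <- predict.boosting(Vehicle.adaboost, newdata = Vehicle[-sub, ])
Vehicle.adaboost.pred$confusion   # reweighted trees combined by a weighted vote
Vehicle.adaboost.pred$error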

33

Using diamonds… boost (glm)
> library(mboost)   # glmboost() and the Binomial() family
> mglmboost <- glmboost(as.factor(Expensive) ~ ., data = diamonds,
+                       family = Binomial(link = "logit"))
> summary(mglmboost)

     Generalized Linear Models Fitted via Gradient Boosting

Call:
glmboost.formula(formula = as.factor(Expensive) ~ ., data = diamonds,
    family = Binomial(link = "logit"))

     Negative Binomial Likelihood

Loss function: {
    f <- pmin(abs(f), 36) * sign(f)
    p <- exp(f)/(exp(f) + exp(-f))
    y <- (y + 1)/2
    -y * log(p) - (1 - y) * log(1 - p)
}
34

Using diamonds… boost (glm)
> summary(mglmboost)   # continued
Number of boosting iterations: mstop = 100
Step size: 0.1
Offset: …

Coefficients:
NOTE: Coefficients from a Binomial model are half the size of coefficients
  from a model fitted via glm(..., family = 'binomial').
  See Warning section in ?coef.mboost
(Intercept)       carat   clarity.L
          …           …           …
attr(,"offset")
[1] …

Selection frequencies:
      carat (Intercept)   clarity.L
          …           …           …
35
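Expensive is not a column of ggplot2's diamonds data, so it was presumably derived beforehand. A plausible reconstruction is sketched below; the price cutoff, and dropping price itself from the predictors, are assumptions rather than anything shown on the slides.

library(ggplot2)   # diamonds data
library(mboost)    # glmboost()
diamonds$Expensive <- as.integer(diamonds$price > 10000)   # assumed definition: price above a cutoff
dia <- diamonds[, setdiff(names(diamonds), "price")]       # drop price so the label is not trivially recoverable
mglmboost <- glmboost(as.factor(Expensive) ~ ., data = dia,
                      family = Binomial(link = "logit"))
summary(mglmboost)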

Cluster boosting Assessment of the clusterwise stability of a clustering of data, which can be cases x variables or dissimilarity data. The data is resampled using several schemes (bootstrap, subsetting, jittering, replacement of points by noise) and the Jaccard similarities of the original clusters to the most similar clusters in the resampled data are computed. The mean over these similarities is used as an index of the stability of a cluster (other statistics can be computed as well). 36

Cluster boosting Quite general clustering methods are possible, i.e. methods estimating or fixing the number of clusters, methods producing overlapping clusters or not assigning all cases to clusters (but declaring them as "noise"). In R, clustermethod = X is used to select the method, e.g. k-means. Lab on Friday… (iris, etc.) – a sketch follows below. 37
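The cluster-stability procedure described on these two slides is implemented in the fpc package as clusterboot(); a minimal k-means sketch on iris, ahead of the Friday lab (B and krange are arbitrary choices here):

library(fpc)
set.seed(1)
cb <- clusterboot(iris[, -5], B = 100, bootmethod = "boot",
                  clustermethod = kmeansCBI, krange = 3)
cb$bootmean   # mean Jaccard similarity per cluster = the stability index described above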

Example - bodyfat The response variable is the body fat measured by DXA (DEXfat), which can be seen as the gold standard for measuring body fat. However, DXA measurements are too expensive and complicated for broad use. Anthropometric measurements such as waist or hip circumference are, in comparison, very easy to obtain in a standard screening. A prediction formula based only on these measures could therefore be a valuable alternative with high clinical relevance for daily usage. 38

39

bodyfat
data("bodyfat", package = "TH.data")   # the bodyfat data ship with the TH.data package
## regular linear model using three variables
lm1 <- lm(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat)
## Estimate the same model by glmboost
glm1 <- glmboost(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat)
# We consider all available variables as potential predictors.
glm2 <- glmboost(DEXfat ~ ., data = bodyfat)
# or one could essentially call:
preds <- names(bodyfat[, names(bodyfat) != "DEXfat"])               ## names of predictors
fm <- as.formula(paste("DEXfat ~", paste(preds, collapse = "+")))   ## build formula
40
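The slide builds the formula fm but stops there; presumably the next step is to boost that formula too and inspect the coefficients (glm3 is a made-up name continuing the snippet above):

glm3 <- glmboost(fm, data = bodyfat)   # boost using all predictors via fm
coef(glm3, off2int = TRUE)             # should essentially reproduce coef(glm2) on the next slides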

Compare linear models
> coef(lm1)
(Intercept)     hipcirc kneebreadth    anthro3a
          …           …           …           …
> coef(glm1, off2int = TRUE)   ## off2int adds the offset to the intercept
(Intercept)     hipcirc kneebreadth    anthro3a
          …           …           …           …
Conclusion?
41

> fm
DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth +
    anthro3a + anthro3b + anthro3c + anthro4
> coef(glm2, which = "")   ## select all
 (Intercept)          age    waistcirc      hipcirc elbowbreadth  kneebreadth
           …            …            …            …            …            …
    anthro3a     anthro3b     anthro3c      anthro4
           …            …            …            …
attr(,"offset")
[1] …
42

plot(glm2, off2int = TRUE) 43

plot(glm2, ylim = range(coef(glm2, which = preds))) 44

45

Other forms of boosting Gamboost = Generalized Additive Model – gradient boosting for optimizing arbitrary loss functions, where component-wise smoothing procedures are utilized as (univariate) base-learners. 46

> gam1 <- gamboost(DEXfat ~ bbs(hipcirc) + bbs(kneebreadth) + bbs(anthro3a),
+                  data = bodyfat)
> # Using plot() on a gamboost object automatically delivers the partial effects
> # of the different base-learners:
> par(mfrow = c(1, 3))   ## 3 plots in one device
> plot(gam1)             ## get the partial effects
# bbs, bols, btree..
47

48

Compare to rpart
> fattree <- rpart(DEXfat ~ ., data = bodyfat)
> plot(fattree)
> text(fattree)
> labels(fattree)   # split labels of the fitted tree, e.g. "root", "waistcirc>=88.4", …
49

50

cars 51

iris 52

cars 53

54

55

Sparse matrix example
> coef(mod, which = which(beta > 0))
  V306  V1052  V1090  V3501  V4808  V5473  V7929  V8333  V8799    V…
     …      …      …      …      …      …      …      …      …     …
attr(,"offset")
[1] …
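The objects mod and beta are not defined anywhere in this deck; a rough reconstruction in the spirit of a sparse (mostly zero) coefficient vector recovered by glmboost's matrix interface is sketched below. All sizes, variable names, and coefficient values here are assumptions.

library(mboost)
set.seed(1)
n <- 100; p <- 10000                      # far more predictors than observations
X <- matrix(rnorm(n * p), n, p)           # design matrix
colnames(X) <- paste0("V", 1:p)
beta <- numeric(p)
beta[sample(p, 10)] <- 2                  # only ten truly non-zero effects
y <- as.numeric(X %*% beta + rnorm(n))
mod <- glmboost(x = X, y = y, control = boost_control(mstop = 200))   # matrix interface of glmboost
coef(mod, which = which(beta > 0))        # boosted estimates for the true non-zero effects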

57

Aside: Boosting and SVM… Remember “margins” from the SVM? Partitioning the “linear” or transformed space? In boosting we are effectively (not explicitly) attempting to maximize the minimum margin of any training example 58

Variants on boosting – loss fn
cars.gb <- blackboost(dist ~ speed, data = cars,
                      control = boost_control(mstop = 50))
### plot fit
plot(dist ~ speed, data = cars)
lines(cars$speed, predict(cars.gb), col = "red")
59

Blackboosting (cf. brown)
Gradient boosting for optimizing arbitrary loss functions where regression trees are utilized as base-learners.
> cars.gb

     Model-based Boosting

Call:
blackboost(formula = dist ~ speed, data = cars, control = boost_control(mstop = 50))

     Squared Error (Regression)

Loss function: (y - f)^2

Number of boosting iterations: mstop = 50
Step size: 0.1
Offset: …
Number of baselearners: 1
60