
1 Peter Fox Data Analytics – ITWS-4600/ITWS-6600 Week 7a, March 8, 2016 Decision trees, cross-validation

Contents 2

Numeric v. non-numeric 3

In R – data frames and types On input, R almost always sees categorical data as “strings” (type “character”). You can test whether a value is of a given type using the is.*() family (is.numeric(), is.factor(), etc.), and you can “coerce” it (i.e. change the type) using the matching as.*() function. To tell R you have categorical types (also called enumerated types), use R’s “factor” type…. Thus – as.factor() 4
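A minimal sketch (with a made-up character vector) of testing and coercing:
x <- c("red", "blue", "red", "green")  # read in as character
is.character(x)   # TRUE – R sees strings
is.factor(x)      # FALSE – not yet an enumerated type
x <- as.factor(x) # coerce to a categorical (factor) type
levels(x)         # "blue" "green" "red"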

In R factor(x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x), nmax = NA), ordered(x, ...), is.factor(x), is.ordered(x), as.factor(x), as.ordered(x), addNA(x, ifany = FALSE) levels – the values that x might have taken. labels – either an optional vector of labels for the levels (in the same order as levels after removing those in exclude), or a character string of length 1. exclude – values to be excluded when forming the set of levels. ordered – logical flag: should the levels be regarded as ordered? nmax – upper bound on the number of levels. ifany – (for addNA) only add an NA level if one is actually used. 5
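A small sketch of these arguments with made-up data:
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"), ordered = TRUE)
sizes
# [1] small  large  medium small
# Levels: small < medium < large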

Relate to the datasets… Abalone - Sex = {F, I, M} Eye color? EPI - GeoRegions? Sample v. population – levels and names IN the dataset versus all possible levels/names 6

Weighted KNN
require(kknn)
data(iris)
m <- dim(iris)[1]
val <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
iris.learn <- iris[-val,] # train
iris.valid <- iris[val,]  # test
iris.kknn <- kknn(Species~., iris.learn, iris.valid, distance = 1, kernel = "triangular")
# Possible choices are "rectangular" (which is standard unweighted knn), "triangular", "epanechnikov" (or beta(2,2)), "biweight" (or beta(3,3)), "triweight" (or beta(4,4)), "cos", "inv", "gaussian", "rank" and "optimal". 7

names(iris.kknn)
fitted.values – Vector of predictions.
CL – Matrix of classes of the k nearest neighbors.
W – Matrix of weights of the k nearest neighbors.
D – Matrix of distances of the k nearest neighbors.
C – Matrix of indices of the k nearest neighbors.
prob – Matrix of predicted class probabilities.
response – Type of response variable, one of continuous, nominal or ordinal.
distance – Parameter of Minkowski distance.
call – The matched call.
terms – The 'terms' object used. 8

Look at the output > head(iris.kknn$W) – a matrix of kernel weights for the k nearest neighbors, one column per neighbor ([,1] … [,7]); the numeric values did not survive this transcript. 9

Look at the output > head(iris.kknn$D) – the matching matrix of distances to the k nearest neighbors ([,1] … [,7]); numeric values lost in this transcript. 10

Look at the output > head(iris.kknn$C) – the matrix of training-set indices of the k nearest neighbors ([,1] … [,7]) – and > head(iris.kknn$prob) – the matrix of predicted class probabilities (columns setosa, versicolor, virginica); numeric values lost in this transcript. 11

Look at the output > head(iris.kknn$fitted.values) [1] virginica setosa versicolor setosa virginica virginica Levels: setosa versicolor virginica 12

Contingency tables fitiris <- fitted(iris.kknn) table(iris.valid$Species, fitiris) – rows are the true Species, columns the fitted classes (setosa, versicolor, virginica); the counts did not survive this transcript. A second table, fitiris2, uses kernel = "rectangular" (i.e. no weighting) for comparison. 13
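The diagonal of such a contingency table holds the correctly classified cases, so an overall accuracy can be read straight off it – a minimal sketch:
tt <- table(iris.valid$Species, fitiris)
sum(diag(tt)) / sum(tt)     # overall accuracy
1 - sum(diag(tt)) / sum(tt) # misclassification rate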

(Weighted) kNN Advantages –Robust to noisy training data (especially if we use the inverse square of the weighted distance as the “distance”) –Effective if the training data is large Disadvantages –Need to determine the value of the parameter K (number of nearest neighbors) –As a distance-based method, it is not clear which distance metric to use, nor which attributes to use, to produce the best results. Should we use all attributes or only certain attributes? 14

Additional factors Dimensionality – with too many dimensions the closest neighbors are too far away to be considered close Overfitting – closeness does not guarantee correct classification (e.g. noise or incorrect data, like a wrong street address -> wrong lat/lon) – beware of k=1! Correlated features – effectively double weighting Relative importance – including/excluding features 15

More factors Sparseness – standard distance/similarity measures (e.g. Jaccard) lose meaning when vectors have little or no overlap Errors – unintentional and intentional Computational complexity Sensitivity to distance metrics – especially with features on different scales (recall ages, versus impressions, versus clicks and especially binary values: gender, logged in/not) Does not account for changes over time Model updating as new data comes in 16
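One common mitigation for the scale sensitivity is to standardize the numeric features before computing distances – a sketch reusing the earlier iris split (kknn also offers its own scale argument):
iris.s <- iris
iris.s[, 1:4] <- scale(iris[, 1:4]) # zero mean, unit variance per column
iris.kknn.s <- kknn(Species ~ ., iris.s[-val, ], iris.s[val, ],
                    distance = 1, kernel = "triangular")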

Glass
library(e1071)
library(rpart)
data(Glass, package="mlbench")
index <- 1:nrow(Glass)
testindex <- sample(index, trunc(length(index)/3))
testset <- Glass[testindex,]
trainset <- Glass[-testindex,] 17

Now what? # now what happens?
> rpart.model <- rpart(Type ~ ., data = trainset)
> rpart.pred <- predict(rpart.model, testset[,-10], type = "class") 18

General idea behind trees Although the basic philosophy of all classifiers based on decision trees is identical, there are many possibilities for their construction. Among the key points in selecting an algorithm to build decision trees, some should be highlighted for their importance: –The criterion for choosing the feature to test at each node –How to partition the set of examples at a node –When to decide that a node is a leaf –The criterion for selecting the class to assign to each leaf 19

Some important advantages of decision trees can be pointed out, including: –They can be applied to any type of data –The final structure of the classifier is quite simple and can be stored and handled in a graceful manner –They handle conditional information very efficiently, subdividing the space into sub-spaces that are handled individually –They are normally robust and insensitive to misclassification in the training set –The resulting trees are usually quite understandable and can easily be used to obtain a better understanding of the phenomenon in question. This is perhaps the most important of all the advantages listed 20

Stopping – leaves on the tree A number of stopping conditions can be used to stop the recursive process. The algorithm stops when any one of the conditions is true: –All the samples belong to the same class, i.e. have the same label since the sample is already "pure" –Stop if most of the points are already of the same class. This is a generalization of the first approach, with some error threshold –There are no remaining attributes on which the samples may be further partitioned –There are no samples for the branch test attribute 21
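In rpart these stopping rules are exposed through rpart.control() – a sketch using its default values (illustrative, not tuned):
ctl <- rpart.control(minsplit = 20, # don't try to split nodes with < 20 samples
                     cp = 0.01,     # stop if a split doesn't improve fit by at least cp
                     maxdepth = 30) # cap the depth of any node
rpart.model2 <- rpart(Type ~ ., data = trainset, control = ctl)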

Recursive partitioning Recursive partitioning is a fundamental tool in data mining. It helps us explore the structure of a set of data, while developing easy-to-visualize decision rules for predicting a categorical (classification tree) or continuous (regression tree) outcome. The rpart programs build classification or regression models of a very general structure using a two-stage procedure; the resulting models can be represented as binary trees. 22

Recursive partitioning The tree is built by the following process: –First, the single variable is found which best splits the data into two groups ('best' will be defined later). The data is separated, and the process is then applied separately to each sub-group, recursively, until the subgroups either reach a minimum size or no improvement can be made. –The second stage of the procedure consists of using cross-validation to trim back the full tree. 23

Why are we careful doing this? Because we will USE these trees, i.e. apply them to make decisions about what things are and what to do with them! 24

> printcp(rpart.model) Classification tree: rpart(formula = Type ~ ., data = trainset) Variables actually used in tree construction: [1] Al Ba Mg RI Root node error: 92/143 = 0.64336 n= 143 (the CP table – columns CP, nsplit, rel error, xerror, xstd – did not survive this transcript) 25

plotcp(rpart.model) 26

> rsq.rpart(rpart.model) Classification tree: rpart(formula = Type ~ ., data = trainset) Variables actually used in tree construction: [1] Al Ba Mg RI Root node error: 92/143 = 0.64336 n= 143 (CP table values omitted in this transcript) Warning message: In rsq.rpart(rpart.model) : may not be applicable for this method 27

rsq.rpart 28

> print(rpart.model) n= 143
node), split, n, loss, yval, (yprob) – * denotes terminal node; the split thresholds, counts and class probabilities did not survive this transcript
1) root
  2) Ba<
    4) Al<
      8) RI>=
        16) RI< *
        17) RI>=
          34) RI>=
            68) Mg>= *
            69) Mg< *
          35) RI< *
      9) RI< *
    5) Al>=
      10) Mg>= *
      11) Mg< *
  3) Ba>= * 29

Tree plot 30 plot(x, uniform=FALSE, branch=1, compress=FALSE, nspace, margin=0, minbranch=.3, ...) > plot(rpart.model, compress=TRUE) > text(rpart.model, use.n=TRUE)

And if you are brave summary(rpart.model) … pages…. 31

Remember to LOOK at the data > names(Glass) [1] "RI" "Na" "Mg" "Al" "Si" "K" "Ca" "Ba" "Fe" "Type" > head(Glass) – the first six rows, with columns RI Na Mg Al Si K Ca Ba Fe Type (numeric values lost in this transcript) 32

rpart.pred > rpart.pred – a named factor of predicted Types for the test cases (the values did not survive this transcript) Levels: 1 2 3 5 6 7 33

plot(rpart.pred) 34

Cross-validation Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. I.e. predictive and prescriptive analytics… 35

Cross-validation In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (testing dataset). Sound familiar? 36

Cross-validation The goal of cross-validation is to define a dataset to "test" the model in the training phase (i.e., the validation dataset), in order to limit problems like overfitting, and to give an insight into how the model will generalize to an independent data set (i.e., an unknown dataset, for instance from a real problem), etc. 37

Common types of x-validation K-fold 2-fold (do you know this one?) Rep-random-subsample Leave-out-subsample Lab in a few weeks … to try these out 38

K-fold The original sample is randomly partitioned into k equal-size subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. Repeat the cross-validation process k times (folds), with each of the k subsamples used exactly once as the validation data. –The k results from the folds can then be averaged (usually) to produce a single estimation, as in the sketch below. 39
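A minimal hand-rolled k-fold sketch on the Glass training set from earlier (the fold assignment and error measure are illustrative):
k <- 10
folds <- sample(rep(1:k, length.out = nrow(trainset))) # random fold labels
errs <- sapply(1:k, function(i) {
  fit <- rpart(Type ~ ., data = trainset[folds != i, ])
  pred <- predict(fit, trainset[folds == i, ], type = "class")
  mean(pred != trainset$Type[folds == i]) # fold misclassification rate
})
mean(errs) # single averaged k-fold estimate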

Leave-out subsample As the name suggests, leave-one-out cross-validation (LOOCV) involves using a single observation from the original sample as the validation data, and the remaining observations as the training data, i.e. k = n-fold cross-validation. Leaving out > 1 observation relates to bootstrapping and jackknifing. 40

boot(strapping) Generate replicates of a statistic applied to data (parametric and nonparametric). –For the nonparametric bootstrap, possible methods are: the ordinary bootstrap, the balanced bootstrap, antithetic resampling, and permutation. For nonparametric multi-sample problems stratified resampling is used: –this is specified by including a vector of strata in the call to boot. –importance resampling weights may be specified. 41
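A small sketch with the boot package (an ordinary nonparametric bootstrap of a median; the choice of statistic and R = 1000 are illustrative):
library(boot)
med <- function(x, idx) median(x[idx]) # statistic(data, resampled indices)
b <- boot(data = iris$Sepal.Length, statistic = med, R = 1000)
b                         # original estimate, bias and standard error
boot.ci(b, type = "perc") # percentile bootstrap confidence interval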

Jackknifing Systematically recompute the statistic estimate, leaving out one or more observations at a time from the sample set From this new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of the statistic can be calculated. Often use log(variance) [instead of variance] especially for non-normal distributions 42
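A hand-rolled jackknife sketch for the same statistic (using the standard jackknife bias and variance formulas):
x <- iris$Sepal.Length
n <- length(x)
theta.hat <- median(x)
theta.i <- sapply(1:n, function(i) median(x[-i]))           # leave-one-out replicates
bias <- (n - 1) * (mean(theta.i) - theta.hat)               # jackknife bias estimate
v <- (n - 1) / n * sum((theta.i - mean(theta.i))^2)         # jackknife variance estimate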

Repeat-random-subsample Random split of the dataset into training and validation data. –For each such split, the model is fit to the training data, and predictive accuracy is assessed using the validation data. Results are then averaged over the splits. Note: for this method the results will vary if the analysis is repeated with different random splits. 43
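A sketch of repeated random subsampling on the Glass data (the 100 repeats and 70/30 split are illustrative choices):
errs <- replicate(100, {
  idx <- sample(nrow(Glass), size = round(0.7 * nrow(Glass))) # random 70/30 split
  fit <- rpart(Type ~ ., data = Glass[idx, ])
  pred <- predict(fit, Glass[-idx, ], type = "class")
  mean(pred != Glass$Type[-idx])
})
mean(errs) # averaged over the random splits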

Advantage? The advantage of K-fold over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. –10-fold cross-validation is commonly used The advantage of rep-random over k-fold cross validation is that the proportion of the training/validation split is not dependent on the number of iterations (folds). 44

Disadvantage The disadvantage of rep-random is that some observations may never be selected in the validation subsample, whereas others may be selected more than once. –i.e., validation subsets may overlap. 45

New dataset to work with trees
fitK <- rpart(Kyphosis ~ Age + Number + Start, method="class", data=kyphosis)
printcp(fitK) # display the results
plotcp(fitK) # visualize cross-validation results
summary(fitK) # detailed summary of splits
# plot tree
plot(fitK, uniform=TRUE, main="Classification Tree for Kyphosis")
text(fitK, use.n=TRUE, all=TRUE, cex=.8)
# create attractive postscript plot of tree
post(fitK, file = "kyphosistree.ps", title = "Classification Tree for Kyphosis")
# might need to convert to PDF (distill) 46

47 [plot: classification tree for Kyphosis]

48 > pfitK <- prune(fitK, cp = fitK$cptable[which.min(fitK$cptable[,"xerror"]),"CP"])
> plot(pfitK, uniform=TRUE, main="Pruned Classification Tree for Kyphosis")
> text(pfitK, use.n=TRUE, all=TRUE, cex=.8)
> post(pfitK, file = "ptree.ps", title = "Pruned Classification Tree for Kyphosis")

49 > library(party) # ctree() is in the party package
> fitK <- ctree(Kyphosis ~ Age + Number + Start, data=kyphosis)
> plot(fitK, main="Conditional Inference Tree for Kyphosis")

50 > plot(fitK, main="Conditional Inference Tree for Kyphosis",type="simple")

randomForest > require(randomForest) > fitKF <- randomForest(Kyphosis ~ Age + Number + Start, data=kyphosis) > print(fitKF) # view results Call: randomForest(formula = Kyphosis ~ Age + Number + Start, data = kyphosis) Type of random forest: classification Number of trees: 500 No. of variables tried at each split: 1 OOB estimate of error rate: 20.99% Confusion matrix: (rows/columns absent, present, with class.error – the counts did not survive this transcript) > importance(fitKF) # importance of each predictor MeanDecreaseGini for Age, Number, Start (values lost in this transcript) Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based on random samples of variables), classifying a case using each tree in this new "forest", and deciding a final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification). 51

Trees for the Titanic data(Titanic) rpart, ctree, hclust, etc. for Survived ~. 52
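Note Titanic ships as a 4-way contingency table rather than case data, so one hedged sketch is to expand it to one row per passenger before fitting:
data(Titanic)
tdf <- as.data.frame(Titanic)                      # Class, Sex, Age, Survived, Freq
tdf <- tdf[rep(seq_len(nrow(tdf)), tdf$Freq), 1:4] # one row per person
titanic.rpart <- rpart(Survived ~ ., data = tdf, method = "class")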

More on another dataset. # Regression Tree Example library(rpart) # build the tree fitM <- rpart(Mileage ~ Price + Country + Reliability + Type, method="anova", data=cu.summary) printcp(fitM) # display the results …. Root node error: …/60 = … n=60 (57 observations deleted due to missingness) (the CP table – CP, nsplit, rel error, xerror, xstd – did not survive this transcript) 53

Mileage… plotcp(fitM) # visualize cross-validation results summary(fitM) # detailed summary of splits 54

55 par(mfrow=c(1,2)) rsq.rpart(fitM) # visualize cross-validation results

# plot tree
plot(fitM, uniform=TRUE, main="Regression Tree for Mileage")
text(fitM, use.n=TRUE, all=TRUE, cex=.8)
# prune the tree
pfitM <- prune(fitM, cp= ) # from cptable
# plot the pruned tree
plot(pfitM, uniform=TRUE, main="Pruned Regression Tree for Mileage")
text(pfitM, use.n=TRUE, all=TRUE, cex=.8)
post(pfitM, file = "ptree2.ps", title = "Pruned Regression Tree for Mileage") 56

57 [plot: pruned regression tree for Mileage]

# Conditional Inference Tree for Mileage fit2M <- ctree(Mileage~Price + Country + Reliability + Type, data=na.omit(cu.summary)) 58

Enough of trees! 59

Assignment 6 on Website – later today Your term projects should fall within the scope of a data analytics problem of the type you have worked with in class/labs, or know of yourself – the bigger the data the better. This means that the work must go beyond just making lots of figures. You should develop the project to show you are thinking about and exploring the relationships and distributions within your data. Start with a hypothesis, think of a way to model and test the hypothesis, find or collect the necessary data, and do preliminary analysis, detailed modeling and a summary (interpretation). Grad students must develop two types of models. –Note: You do not have to come up with a positive result, i.e. disproving the hypothesis is just as good. Introduction (2%) % may change… Data Description (3%) Analysis (5%) Model Development (12%) Conclusions and Discussion (3%) Oral presentation (5%) (~5 mins) 60

Coming weeks Spring break – March 14 – 18 Your project proposals (Assignment 5) are on March 22/25. Come prepared (5% grade). Support Vector Machines = the “red pill”? 61