Decision Tree Dr. Jieh-Shan George YEH

Decision Tree Recursive partitioning is a fundamental tool in data mining. It helps us explore the structure of a set of data while developing easy-to-visualize decision rules for predicting a categorical (classification tree) or continuous (regression tree) outcome. A decision tree is an algorithm that can have a continuous or categorical dependent variable (DV) and independent variables (IV).

Decision Tree

Advantages to using trees Simple to understand and interpret. People are able to understand decision tree models after a brief explanation. Requires little data preparation. Other techniques often require data normalization, creation of dummy variables, and removal of blank values. Able to handle both numerical and categorical data.

Advantages to using trees Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily expressed with Boolean logic. Possible to validate a model using statistical tests, which makes it possible to account for the reliability of the model. Performs well on large data sets in a short time.

Some things to consider when coding the model… Splits: Gini or information. Type of DV (method): classification (class), regression (anova), count (poisson), survival (exp). Minimum number of observations for a split (minsplit). Minimum number of observations in a node (minbucket). Cross-validation (xval): used more in model building than in exploration. Complexity parameter (cp): this value is used for pruning; a smaller tree is perhaps less detailed, but with less error.
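A minimal sketch (not part of the original slides) of where these options appear in an rpart() call, using the iris data as a stand-in:

library(rpart)
iris_rpart <- rpart(Species ~ ., data = iris,
                    method = "class",                       # "anova", "poisson" or "exp" for other DV types
                    parms = list(split = "gini"),           # or split = "information"
                    control = rpart.control(minsplit = 20,  # minimum observations to attempt a split
                                            minbucket = 7,  # minimum observations in a terminal node
                                            xval = 10,      # number of cross-validations
                                            cp = 0.01))     # complexity parameter used for pruning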

R has many packages for similar/same endeavors: party; rpart (comes with R); C50; Cubist; rpart.plot (makes rpart plots much nicer).

Dataset iris The iris dataset has been used for classification in many research publications. It consists of 50 samples from each of three classes of iris flowers [Frank and Asuncion, 2010]. One class is linearly separable from the other two, while the latter are not linearly separable from each other. There are five attributes in the dataset:
– Sepal.Length in cm,
– Sepal.Width in cm,
– Petal.Length in cm,
– Petal.Width in cm, and
– Species: Iris Setosa, Iris Versicolour, and Iris Virginica.
Sepal.Length, Sepal.Width, Petal.Length and Petal.Width are used to predict the Species of flowers.

str(iris)
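A quick sanity check of the class balance described above (a small sketch, not in the original slides):

table(iris$Species)   # 50 observations in each of the three species
summary(iris)         # ranges of the four numeric attributes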

head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

CTREE: CONDITIONAL INFERENCE TREE

Conditional Inference Trees

Description: Recursive partitioning for continuous, censored, ordered, nominal and multivariate response variables in a conditional inference framework.

Usage:
ctree(formula, data, subset = NULL, weights = NULL, controls = ctree_control(), xtrafo = ptrafo, ytrafo = ptrafo, scores = NULL)

Arguments:
formula – a symbolic description of the model to be fit. Note that symbols like : and - will not work, and the tree will make use of all variables listed on the rhs of formula.
data – a data frame containing the variables in the model.
subset – an optional vector specifying a subset of observations to be used in the fitting process.
weights – an optional vector of weights to be used in the fitting process. Only non-negative integer valued weights are allowed.
controls – an object of class TreeControl, which can be obtained using ctree_control().
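For instance, the controls argument can be used to tune tree growth. A hedged sketch with a few ctree_control() options from the party package:

library(party)
ctrl <- ctree_control(mincriterion = 0.95,  # 1 - p-value required to implement a split
                      minsplit = 20,        # minimum sum of weights in a node to consider splitting
                      maxdepth = 3)         # limit the depth of the tree
iris_ctree_small <- ctree(Species ~ ., data = iris, controls = ctrl)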

Before modeling, the iris data is split below into two subsets: training (70%) and test (30%). The random seed is set to a fixed value below to make the results reproducible.

set.seed(1234)
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
trainData <- iris[ind==1,]
testData <- iris[ind==2,]

library(party)
# Species is the target variable and all other variables are independent variables.
myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
iris_ctree <- ctree(myFormula, data=trainData)

Prediction Table
# check the prediction
table(predict(iris_ctree), trainData$Species)
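The confusion matrix can also be summarized as a training accuracy; a brief sketch (not from the original slides):

tab <- table(predict(iris_ctree), trainData$Species)
sum(diag(tab)) / sum(tab)   # proportion of training samples classified correctly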

print(iris_ctree)

  Conditional inference tree with 4 terminal nodes

Response:  Species
Inputs:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations:  112

1) Petal.Length <= 1.9; criterion = 1
  2)* weights = 40
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1
    4) Petal.Length <= 4.4; criterion = 0.974
      5)* weights = 21
    4) Petal.Length > 4.4
      6)* weights = 19
  3) Petal.Width > 1.7
    7)* weights = 32

plot(iris_ctree)

plot(iris_ctree, type="simple")

# predict on test data
testPred <- predict(iris_ctree, newdata = testData)
table(testPred, testData$Species)

Issues on ctree() The current version of ctree() does not handle missing values well, in that an instance with a missing value may sometimes go to the left sub-tree and sometimes to the right. This might be caused by surrogate rules. When a variable exists in training data and is fed into ctree() but does not appear in the built decision tree, the test data must also have that variable to make prediction. Otherwise, a call to predict() would fail.
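One simple, if blunt, workaround for the missing-value issue is to drop incomplete rows before prediction; a sketch, purely illustrative since iris has no missing values:

completeTest <- na.omit(testData)   # drop rows with any missing value
predComplete <- predict(iris_ctree, newdata = completeTest)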

Issues on ctree() If the value levels of a categorical variable in the test data are different from those in the training data, prediction on the test data will also fail. One way to get around this issue is, after building a decision tree, to call ctree() again to build a new tree with data containing only those variables that appear in the first tree, and to explicitly set the levels of categorical variables in the test data to the levels of the corresponding variables in the training data, as in the sketch below.
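A sketch of that second workaround, using a hypothetical categorical predictor named color that exists in both data sets:

# align the factor levels in the test data with those seen during training (hypothetical column)
testData$color <- factor(testData$color, levels = levels(trainData$color))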

More info
# Edgar Anderson's Iris Data
help("iris")
# Conditional Inference Trees
help("ctree")
# Class "BinaryTree"
help("BinaryTree-class")
# Visualization of Binary Regression Trees
help("plot.BinaryTree")

RPART: RECURSIVE PARTITIONING AND REGRESSION TREES

Recursive partitioning for classification, regression and survival trees

data("bodyfat", package="TH.data")
dim(bodyfat)
set.seed(1234)
ind <- sample(2, nrow(bodyfat), replace=TRUE, prob=c(0.7, 0.3))
bodyfat.train <- bodyfat[ind==1,]
bodyfat.test <- bodyfat[ind==2,]

# train a decision tree
library(rpart)
myFormula <- DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth
bodyfat_rpart <- rpart(myFormula, data = bodyfat.train, control = rpart.control(minsplit = 10))
attributes(bodyfat_rpart)

print(bodyfat_rpart$cptable)
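The same information can be inspected graphically; plotcp() from the rpart package plots the cross-validated error against the complexity parameter, which helps when choosing a cp value for pruning:

plotcp(bodyfat_rpart)   # visual aid for picking cp before pruning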

print(bodyfat_rpart)

plot(bodyfat_rpart)
text(bodyfat_rpart, use.n=T)

Select the tree with the minimum prediction error:

opt <- which.min(bodyfat_rpart$cptable[,"xerror"])
cp <- bodyfat_rpart$cptable[opt, "CP"]
bodyfat_prune <- prune(bodyfat_rpart, cp = cp)
print(bodyfat_prune)
plot(bodyfat_prune)
text(bodyfat_prune, use.n=T)

After that, the selected tree is used to make predictions, and the predicted values are compared with the actual labels. The call abline(a=0, b=1) draws the diagonal line y = x. The predictions of a good model are expected to be equal or very close to their actual values, that is, most points should lie on or close to the diagonal line.

DEXfat_pred <- predict(bodyfat_prune, newdata=bodyfat.test)
xlim <- range(bodyfat$DEXfat)
plot(DEXfat_pred ~ DEXfat, data=bodyfat.test, xlab="Observed", ylab="Predicted", ylim=xlim, xlim=xlim)
abline(a=0, b=1)
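Beyond the scatter plot, the prediction quality can be summarized numerically; a brief sketch (not from the original slides):

cor(DEXfat_pred, bodyfat.test$DEXfat)               # correlation between predicted and observed
sqrt(mean((DEXfat_pred - bodyfat.test$DEXfat)^2))   # root mean squared error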

More info
# Recursive Partitioning and Regression Trees
help("rpart")
# Control for Rpart Fits
help("rpart.control")
# Prediction of Body Fat by Skinfold Thickness, Circumferences, and Bone Breadths
??TH.data::bodyfat

C5.0

C50

library(C50)
myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
iris_C5.0 <- C5.0(myFormula, data=trainData)
summary(iris_C5.0)
C5imp(iris_C5.0)
C5.0testPred <- predict(iris_C5.0, testData)
table(C5.0testPred, testData$Species)
predict(iris_C5.0, testData, type = "prob")
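C5.0 also supports boosting through its trials argument; a hedged sketch building a boosted model on the same training data:

iris_C5.0_boost <- C5.0(myFormula, data=trainData, trials=10)   # 10 boosting iterations
table(predict(iris_C5.0_boost, testData), testData$Species)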

More info
# C5.0 Decision Trees and Rule-Based Models
help("C5.0")
# Control for C5.0 Models
help("C5.0Control")
# Summaries of C5.0 Models
help("summary.C5.0")
# Variable Importance Measures for C5.0 Models
help("C5imp")

PLOT RPART MODELS. AN ENHANCED VERSION OF PLOT.RPART

rpart.plot

library(rpart.plot)
data(ptitanic)   # Titanic data
tree <- rpart(survived ~ ., data=ptitanic, cp=.02)   # cp=.02 because we want a small tree for the demo
rpart.plot(tree, main="default rpart.plot\n(type = 0, extra = 0)")
prp(tree, main="type = 4, extra = 6", type=4, extra=6, faclen=0)   # faclen=0 to print full factor names

rpart.plot

rpart.plot(tree, main="extra = 106, under = TRUE", extra=106, under=TRUE, faclen=0)

# the old way, for comparison
plot(tree, uniform=TRUE, compress=TRUE, branch=.2)
text(tree, use.n=TRUE, cex=.6, xpd=NA)   # cex is a guess, depends on your window size
title("rpart.plot for comparison", cex=.6)

rpart.plot(tree, box.col=3, xflip=FALSE)

More info
# Titanic data with passenger names and other details removed.
help("ptitanic")
# Plot an rpart model.
help("rpart.plot")
# Plot an rpart model. A superset of rpart.plot.
help("prp")