Decision Tree Dr. Jieh-Shan George YEH


1 Decision Tree Dr. Jieh-Shan George YEH jsyeh@pu.edu.tw

2 Decision Tree
Recursive partitioning is a fundamental tool in data mining. It helps us explore the structure of a data set while developing easy-to-visualize decision rules for predicting a categorical (classification tree) or continuous (regression tree) outcome. A decision tree is an algorithm that can have a continuous or categorical dependent variable (DV) and independent variables (IVs).
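A minimal sketch of that distinction, assuming base R plus the rpart package used later in these slides (the object names are illustrative):
library(rpart)
# Classification tree: the outcome Species is a factor, so rpart defaults to method = "class"
class_tree <- rpart(Species ~ ., data = iris)
# Regression tree: the outcome Sepal.Length is numeric, so rpart defaults to method = "anova"
reg_tree <- rpart(Sepal.Length ~ ., data = iris)
print(class_tree)
print(reg_tree)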

3 Decision Tree

4 Advantages to using trees
Simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
Requires little data preparation. Other techniques often require data normalization, creation of dummy variables, and removal of blank values.
Able to handle both numerical and categorical data.

5 Advantages to using trees
Uses a white-box model. If a given situation is observable in a model, the explanation for the condition is easily expressed in Boolean logic.
Possible to validate a model using statistical tests, which makes it possible to account for the reliability of the model.
Performs well on large data sets in a short time.

6 Some things to consider when coding the model…
Splits: Gini or information.
Type of DV (method): classification (class), regression (anova), count (poisson), survival (exp).
Minimum number of observations for a split (minsplit).
Minimum number of observations in a node (minbucket).
Cross-validation (xval). Used more in model building than in exploration.
Complexity parameter (cp). This value is used for pruning; a smaller tree is perhaps less detailed, but with less error.
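A hedged sketch of how these options map onto rpart() arguments; the values shown are arbitrary illustrations, not recommendations from the slides:
library(rpart)
fit <- rpart(Species ~ ., data = iris,
             method = "class",               # "anova", "poisson", or "exp" for other DV types
             parms = list(split = "gini"),   # or split = "information"
             control = rpart.control(minsplit = 20,   # min. observations to attempt a split
                                     minbucket = 7,   # min. observations in a terminal node
                                     xval = 10,       # number of cross-validations
                                     cp = 0.01))      # complexity parameter used for pruning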

7 R has many packages for similar/same endeavors
party.
rpart. Comes with R.
C50.
Cubist.
rpart.plot. Makes rpart plots much nicer.
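Except for rpart, none of these ship with base R, so a one-off install along these lines is assumed before running the later examples:
install.packages(c("party", "rpart.plot", "C50", "Cubist"))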

8 Dataset iris
The iris dataset has been used for classification in many research publications. It consists of 50 samples from each of three classes of iris flowers [Frank and Asuncion, 2010]. One class is linearly separable from the other two, while the latter are not linearly separable from each other. There are five attributes in the dataset:
– Sepal.Length in cm,
– Sepal.Width in cm,
– Petal.Length in cm,
– Petal.Width in cm, and
– Species: Iris Setosa, Iris Versicolour, and Iris Virginica.
Sepal.Length, Sepal.Width, Petal.Length and Petal.Width are used to predict the Species of flowers.
str(iris)

9 head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

10 CTREE: CONDITIONAL INFERENCE TREE http://cran.r-project.org/web/packages/party/party.pdf

11 Conditional Inference Trees
Description
Recursive partitioning for continuous, censored, ordered, nominal and multivariate response variables in a conditional inference framework.
Usage
ctree(formula, data, subset = NULL, weights = NULL, controls = ctree_control(), xtrafo = ptrafo, ytrafo = ptrafo, scores = NULL)
Arguments
formula: a symbolic description of the model to be fit. Note that symbols like : and - will not work, and the tree will make use of all variables listed on the rhs of formula.
data: a data frame containing the variables in the model.
subset: an optional vector specifying a subset of observations to be used in the fitting process.
weights: an optional vector of weights to be used in the fitting process. Only non-negative integer valued weights are allowed.
controls: an object of class TreeControl, which can be obtained using ctree_control.

12 Before modeling, the iris data is split below into two subsets: training (70%) and test (30%). The random seed is set to a fixed value to make the results reproducible.
set.seed(1234)
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
trainData <- iris[ind==1,]
testData <- iris[ind==2,]

13 library(party)
# Species is the target variable and all other variables are independent variables.
myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
iris_ctree <- ctree(myFormula, data=trainData)

14 Prediction Table
# check the prediction
table(predict(iris_ctree), trainData$Species)
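Not in the original slides: the same confusion matrix can also be summarised as a single training-set accuracy figure.
tab <- table(predict(iris_ctree), trainData$Species)
sum(diag(tab)) / sum(tab)   # proportion of correctly classified training instances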

15 print(iris_ctree)
Conditional inference tree with 4 terminal nodes

Response: Species
Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations: 112

1) Petal.Length <= 1.9; criterion = 1, statistic = 104.643
  2)* weights = 40
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 48.939
    4) Petal.Length <= 4.4; criterion = 0.974, statistic = 7.397
      5)* weights = 21
    4) Petal.Length > 4.4
      6)* weights = 19
  3) Petal.Width > 1.7
    7)* weights = 32

16 plot(iris_ctree)

17 plot(iris_ctree, type="simple")

18 # predict on test data
testPred <- predict(iris_ctree, newdata = testData)
table(testPred, testData$Species)

19 Issues on ctree()
The current version of ctree() does not handle missing values well: an instance with a missing value may sometimes go to the left sub-tree and sometimes to the right. This might be caused by surrogate rules.
When a variable exists in the training data and is fed into ctree() but does not appear in the built decision tree, the test data must still contain that variable to make predictions. Otherwise, a call to predict() would fail.

20 Issues on ctree()
If the levels of a categorical variable in the test data differ from those in the training data, prediction on the test data will also fail.
One way to get around the above issues is, after building a decision tree, to call ctree() again to build a new tree from data containing only those variables that appear in the first tree, and to explicitly set the levels of categorical variables in the test data to the levels of the corresponding variables in the training data (a sketch of this second step follows).
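A minimal sketch of that second step, assuming the trainData/testData split from the iris example; for each factor column it forces the test-data levels to match the training-data levels:
for (v in names(trainData)) {
  if (is.factor(trainData[[v]]) && v %in% names(testData)) {
    testData[[v]] <- factor(testData[[v]], levels = levels(trainData[[v]]))
  }
}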

21 More info
# Edgar Anderson's Iris Data
help("iris")
# Conditional Inference Trees
help("ctree")
# Class "BinaryTree"
help("BinaryTree-class")
# Visualization of Binary Regression Trees
help("plot.BinaryTree")

22 RPART: RECURSIVE PARTITIONING AND REGRESSION TREES http://cran.r-project.org/web/packages/rpart/rpart.pdf

23 Recursive partitioning for classification, regression and survival trees
data("bodyfat", package="TH.data")
dim(bodyfat)
set.seed(1234)
ind <- sample(2, nrow(bodyfat), replace=TRUE, prob=c(0.7, 0.3))
bodyfat.train <- bodyfat[ind==1,]
bodyfat.test <- bodyfat[ind==2,]
# train a decision tree
library(rpart)
myFormula <- DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth
bodyfat_rpart <- rpart(myFormula, data = bodyfat.train, control = rpart.control(minsplit = 10))
attributes(bodyfat_rpart)

24 print(bodyfat_rpart$cptable)

25 print(bodyfat_rpart)

26 plot(bodyfat_rpart)
text(bodyfat_rpart, use.n=T)

27 Select the tree with the minimum prediction error
opt <- which.min(bodyfat_rpart$cptable[,"xerror"])
cp <- bodyfat_rpart$cptable[opt, "CP"]
bodyfat_prune <- prune(bodyfat_rpart, cp = cp)
print(bodyfat_prune)
plot(bodyfat_prune)
text(bodyfat_prune, use.n=T)
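Not shown on the slide, but rpart's plotcp() gives a quick visual check of the same cross-validated error (xerror) against cp before choosing where to prune:
plotcp(bodyfat_rpart)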

28

29 After that, the selected tree is used to make predictions, and the predicted values are compared with the actual values. The function abline() draws a diagonal line. The predictions of a good model are expected to be equal or very close to the actual values; that is, most points should lie on or close to the diagonal line.

30 DEXfat_pred <- predict(bodyfat_prune, newdata=bodyfat.test)
xlim <- range(bodyfat$DEXfat)
plot(DEXfat_pred ~ DEXfat, data=bodyfat.test, xlab="Observed", ylab="Predicted", ylim=xlim, xlim=xlim)
abline(a=0, b=1)
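An extra check not on the slide: the same comparison can be reduced to a single number, for example the root-mean-square error between predicted and observed values.
rmse <- sqrt(mean((DEXfat_pred - bodyfat.test$DEXfat)^2))
print(rmse)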

31

32 More info
# Recursive Partitioning and Regression Trees
help("rpart")
# Control for Rpart Fits
help("rpart.control")
# Prediction of Body Fat by Skinfold Thickness, Circumferences, and Bone Breadths
??TH.data::bodyfat

33 C5.0 http://cran.r-project.org/web/packages/C50/C50.pdf

34 C50
library(C50)
myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
iris_C5.0 <- C5.0(myFormula, data=trainData)
summary(iris_C5.0)
C5imp(iris_C5.0)
C5.0testPred <- predict(iris_C5.0, testData)
table(C5.0testPred, testData$Species)
predict(iris_C5.0, testData, type = "prob")

35 More info
# C5.0 Decision Trees and Rule-Based Models
help("C5.0")
# Control for C5.0 Models
help("C5.0Control")
# Summaries of C5.0 Models
help("summary.C5.0")
# Variable Importance Measures for C5.0 Models
help("C5imp")

36 PLOT RPART MODELS. AN ENHANCED VERSION OF PLOT.RPART http://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf

37 rpart.plot
library(rpart.plot)
data(ptitanic)  # Titanic data
tree <- rpart(survived ~ ., data=ptitanic, cp=.02)  # cp=.02 because want small tree for demo
rpart.plot(tree, main="default rpart.plot\n(type = 0, extra = 0)")
prp(tree, main="type = 4, extra = 6", type=4, extra=6, faclen=0)  # faclen=0 to print full factor names

38 rpart.plot
rpart.plot(tree, main="extra = 106, under = TRUE", extra=106, under=TRUE, faclen=0)
# the old way for comparison
plot(tree, uniform=TRUE, compress=TRUE, branch=.2)
text(tree, use.n=TRUE, cex=.6, xpd=NA)  # cex is a guess, depends on your window size
title("rpart.plot for comparison", cex=.6)
rpart.plot(tree, box.col=3, xflip=FALSE)

39 More info
# Titanic data with passenger names and other details removed.
help("ptitanic")
# Plot an rpart model.
help("rpart.plot")
# Plot an rpart model. A superset of rpart.plot.
help("prp")

