1
Decision Trees SDSC Summer Institute 2012
Natasha Balac, Ph.D. © Copyright 2012, Natasha Balac
2
DECISION TREE INDUCTION
Method for approximating discrete-valued functions
- robust to noisy/missing data
- can learn non-linear relationships
- inductive bias towards shorter trees
© Copyright 2012, Natasha Balac
3
Decision trees “Divide-and-conquer” approach
- Nodes involve testing a particular attribute; the attribute value is compared to:
  - a constant
  - the value of another attribute
  - a function of one or more attributes
- Leaves assign a classification, a set of classifications, or a probability distribution to instances
- An unknown instance is routed down the tree
© Copyright 2012, Natasha Balac
4
Decision Tree Learning
Applications:
- medical diagnosis – ex. heart disease
- analysis of complex chemical compounds
- classifying equipment malfunction
- risk of loan applicants
- Boston housing project – price prediction
© Copyright 2012, Natasha Balac
5
DECISION TREE FOR THE CONCEPT “Sunburn”
Name  | Hair   | Height  | Weight | Lotion | Result
Sarah | blonde | average | light  | no     | sunburned (positive)
Dana  |        | tall    |        | yes    | none (negative)
Alex  | brown  | short   |        |        | none
Annie |        |         |        |        | sunburned
Emily | red    |         | heavy  |        |
Pete  |        |         |        |        |
John  |        |         |        |        |
Katie |        |         |        |        |
© Copyright 2012, Natasha Balac
6
DECISION TREE FOR THE CONCEPT “Sunburn”
© Copyright 2011, Natasha Balac
7
Medical Diagnosis and Prognosis
Identify patients who are at risk of dying within 30 days, among patients who have suffered a heart attack and survived at least the first 24 hours past admission (UCSD Medical Center, 1983).
Diagnosis is sometimes difficult: 1. history of characteristic chest pain; 2. indicative electrocardiograms; 3. characteristic elevations of enzymes that tend to be released by damaged heart muscle.
Data: 215 patients; 37 died, 178 did not; 19 variables included for each (noninvasive).
The learned tree:
- Is the minimum systolic blood pressure over the 24-hour period following admission to the hospital > 91?
  - <= 91: Class 2, early death
  - > 91: Is the age of the patient > 62.5?
    - <= 62.5: Class 1, survivors
    - > 62.5: Was there sinus tachycardia?
      - YES: Class 2, early death
      - NO: Class 1, survivors
Sinus tachycardia: present if the sinus node heart rate ever exceeded 100 beats per minute during the first 24 hours. The sinus node is the normal electrical pacemaker of the heart and is located in the right atrium.
Even though decision trees have been used for a wide variety of problems, they are best suited for classification problems that provide instances of data as attribute-value pairs and where the target function has discrete output values; they are not expressive enough for some problems.
Breiman et al., 1984
© Copyright 2012, Natasha Balac
8
Occam’s Razor “The world is inherently simple. Therefore the smallest decision tree that is consistent with the samples is the one that is most likely to identify unknown objects correctly” © Copyright 2012, Natasha Balac
9
Decisions Trees Representation
- Each internal node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf node assigns a classification
© Copyright 2012, Natasha Balac
10
When to Consider Decision Trees
Instances are represented by attribute-value pairs: a fixed set of attributes (e.g. Temperature) and their values (e.g. hot)
- Best scenario: each attribute takes a small number of disjoint possible values (e.g. hot, mild, cold)
Target function has discrete output values
- Simplest case: only two possible classes (Boolean classification)
- Disjunctive hypotheses may be required
Possibly noisy training data
- may contain errors, both in the classifications of the training examples and in the attribute values that describe them
- may contain missing attribute values
© Copyright 2012, Natasha Balac
11
Weather Data Set-Make the Tree!
© Copyright 2012, Natasha Balac
12
DECISION TREE FOR THE CONCEPT “Play Tennis”
© Copyright 2011, Natasha Balac [Mitchell, 1997]
13
Constructing decision trees
Normal procedure: top down, in recursive divide-and-conquer fashion
- First: an attribute is selected for the root node and a branch is created for each possible attribute value
- Then: the instances are split into subsets (one for each branch extending from the node)
- Finally: the procedure is repeated recursively for each branch, using only the instances that reach the branch
- The process stops if all instances have the same class
© Copyright 2012, Natasha Balac
14
Induction of Decision Trees
Top-down method. Main loop:
1. A ← the “best” decision attribute for the next node
2. Assign A as the decision (split) attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort the training examples to the leaf nodes
5. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
© Copyright 2012, Natasha Balac
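The loop above is essentially ID3. A minimal sketch of how it might look in Python follows; the names (build_tree, information_gain), the attribute-index representation of instances, and the nested-dictionary tree are illustrative assumptions, not code from the course.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting on attribute index attr."""
    n = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attr] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attributes):
    """Main loop above: pick the best attribute, split, recurse on each branch."""
    if len(set(labels)) == 1:              # training examples perfectly classified
        return labels[0]
    if not attributes:                     # nothing left to test: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[(best, value)] = build_tree([rows[i] for i in idx],
                                         [labels[i] for i in idx],
                                         [a for a in attributes if a != best])
    return tree
```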
15
Which is the best attribute?
© Copyright 2011, Natasha Balac
16
Attribute selection How to choose the best attribute?
- Goal: smallest tree
- Heuristic: the attribute that produces the “purest” nodes
- Impurity criterion: information gain
  - increases with the average purity of the subsets produced by the attribute split
  - choose the attribute that results in the greatest information gain
© Copyright 2012, Natasha Balac
17
Computing information
Information is measured in bits
Given a probability distribution, the information required to predict an event is the distribution’s entropy
Entropy gives the information required in bits (this can involve fractions of bits!)
Formula for computing the entropy:
entropy(p1, p2, …, pn) = −p1 log2(p1) − p2 log2(p2) − … − pn log2(pn)
© Copyright 2012, Natasha Balac
18
Expected information for attribute “Outlook”
“Outlook” = “Sunny”: info([2,3]) = 0.971 bits
“Outlook” = “Overcast”: info([4,0]) = 0 bits
“Outlook” = “Rainy”: info([3,2]) = 0.971 bits
Total expected information: (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
© Copyright 2012, Natasha Balac
19
Computing the information gain
Information gain: information before splitting − information after splitting
Gain(“Outlook”) = info([9,5]) − info([2,3],[4,0],[3,2]) = 0.940 − 0.693 = 0.247 bits
Information gain for the attributes from the weather data:
Gain(“Outlook”) = 0.247 bits
Gain(“Temp”) = 0.029 bits
Gain(“Humidity”) = 0.152 bits
Gain(“Windy”) = 0.048 bits
© Copyright 2012, Natasha Balac
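As a sanity check of these figures, the “Outlook” calculation can be reproduced in a few lines of Python. The info helper and variable names are illustrative; the per-branch class counts are the ones shown above.

```python
import math

def info(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

before = info([9, 5])                                  # 0.940 bits
branches = [[2, 3], [4, 0], [3, 2]]                    # sunny, overcast, rainy
n = sum(sum(b) for b in branches)
after = sum(sum(b) / n * info(b) for b in branches)    # 0.693 bits
print(round(before - after, 3))                        # Gain("Outlook") = 0.247
```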
20
Which is the best attribute?
© Copyright 2011, Natasha Balac
21
Further splits
Gain(“Temp”) = 0.571 bits
Gain(“Humidity”) = 0.971 bits
Gain(“Windy”) = 0.020 bits
© Copyright 2011, Natasha Balac
22
Final product © Copyright 2012, Natasha Balac
23
Purity Measure Desirable properties
Properties we require from a purity measure:
- When a node is pure, the measure should be zero
- When impurity is maximal (i.e. all classes equally likely), the measure should be maximal
- The measure should obey the multistage property (i.e. decisions can be made in several stages):
  measure([2,3,4]) = measure([2,7]) + (7/9) × measure([3,4])
Entropy is the only function that satisfies all these properties
© Copyright 2012, Natasha Balac
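A quick numeric check of the multistage property with the slide’s counts, using the same entropy-of-counts helper as in the earlier sketch (helper name is illustrative):

```python
import math

def info(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(round(info([2, 3, 4]), 4))                         # 1.5305
print(round(info([2, 7]) + (7 / 9) * info([3, 4]), 4))   # 1.5305 as well
```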
24
Highly-branching attributes
Attributes with a large number of values (example: ID code)
- Subsets more likely to be pure if there is a large number of values
- Information gain biased towards attributes with a large number of values
- Overfitting
© Copyright 2012, Natasha Balac
25
New version of Weather Data
© Copyright 2012, Natasha Balac
26
ID code attribute split
Info([9,5]) = 0.940 bits © Copyright 2012, Natasha Balac
27
Gain ratio: a modification of information gain that reduces its bias
- Takes the number and size of branches into account when choosing an attribute
- Takes the intrinsic information of a split into account
- Intrinsic information: entropy of the distribution of instances into branches (how much information do we need to tell which branch an instance belongs to?)
© Copyright 2012, Natasha Balac
28
Computing the gain ratio
Example: intrinsic information for ID code:
info([1,1,…,1]) = 14 × (−(1/14) × log2(1/14)) = 3.807 bits
The value of the attribute decreases as the intrinsic information gets larger
Definition of gain ratio: gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)
Example: gain_ratio(“ID code”) = 0.940 / 3.807 = 0.247
© Copyright 2012, Natasha Balac
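The same calculation in Python, as a sketch: the info helper is illustrative, and the Outlook figures assume the standard weather-data branch sizes of 5, 4 and 5.

```python
import math

def info(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# ID code: 14 branches of one instance each, so every branch is pure
gain_id = info([9, 5])                      # 0.940 bits
intrinsic_id = info([1] * 14)               # log2(14) = 3.807 bits
print(round(gain_id / intrinsic_id, 3))     # gain ratio("ID code") = 0.247

# Outlook: branch sizes 5, 4, 5
intrinsic_outlook = info([5, 4, 5])         # 1.577 bits
print(round(0.247 / intrinsic_outlook, 3))  # gain ratio("Outlook") = 0.157
```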
29
Gain ratio example © Copyright 2012, Natasha Balac
30
Avoid Overfitting How can we avoid Overfitting:
- Stop growing when a data split is not statistically significant
- Grow the full tree, then post-prune
How to select the best tree?
- Measure performance over the training data
- Measure performance over a separate validation data set
© Copyright 2012, Natasha Balac
31
Pruning
Pruning simplifies a decision tree to prevent overfitting to noise in the data. Two main pruning strategies:

Pre-pruning: stops growing a branch when information becomes unreliable
- Based on a statistical significance test: stop growing the tree when there is no statistically significant association between any attribute and the class at a particular node
- ID3 used a chi-squared test in addition to information gain: only statistically significant attributes were allowed to be selected by the information gain procedure
- Early stopping: pre-pruning may stop the growth process prematurely
  - Classic example: the XOR/parity problem, where no individual attribute exhibits any significant association with the class; the structure is only visible in the fully expanded tree, so pre-pruning will not expand the root node
  - XOR-type problems are not common in practice
- Pre-pruning is faster than post-pruning

Post-pruning: takes a fully-grown decision tree and discards unreliable parts
- Builds the full tree first and prunes it afterwards; attribute interactions are visible in the fully-grown tree
- Identifies sub-trees and nodes that are due to chance effects
- Two main operations: 1. sub-tree replacement; 2. sub-tree raising
- Possible strategies: error estimation, significance testing, MDL principle
- Estimating error rates: a pruning operation is performed if it does not increase the estimated error
  - Error on the training data is not a good estimator
  - One possibility: use a hold-out set for pruning (reduced-error pruning)
  - C4.5’s method: use the upper limit of a 25% confidence interval derived from the training data

Post-pruning is preferred in practice because of the early-stopping problem in pre-pruning.
© Copyright 2012, Natasha Balac
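As an illustration of one of the strategies listed above, here is a minimal sketch of reduced-error pruning on the dictionary-based tree produced by the earlier ID3 sketch. The representation and names are assumptions, and for brevity the replacement leaf takes the majority class of the hold-out instances reaching the node rather than of the training instances, so this is not C4.5’s method.

```python
from collections import Counter

def classify(tree, row, default=None):
    """Route a row down a dict-based tree (leaves are class labels)."""
    while isinstance(tree, dict):
        branch = next((sub for (attr, val), sub in tree.items() if row[attr] == val), None)
        if branch is None:
            return default
        tree = branch
    return tree

def errors(tree, rows, labels, default):
    """Number of hold-out instances the (sub)tree misclassifies."""
    return sum(classify(tree, r, default) != y for r, y in zip(rows, labels))

def reduced_error_prune(tree, rows, labels, default):
    """Bottom-up: collapse a subtree to a leaf if hold-out error does not increase."""
    if not isinstance(tree, dict):
        return tree
    pruned = {}
    for (attr, val), sub in tree.items():
        idx = [i for i, r in enumerate(rows) if r[attr] == val]
        pruned[(attr, val)] = reduced_error_prune(
            sub, [rows[i] for i in idx], [labels[i] for i in idx], default)
    # candidate leaf: majority class of the hold-out instances reaching this node
    leaf = Counter(labels).most_common(1)[0][0] if labels else default
    if errors(leaf, rows, labels, default) <= errors(pruned, rows, labels, default):
        return leaf
    return pruned
```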
32
Summary Algorithm for top-down induction of decision trees
- Probably the most extensively studied method of machine learning used in data mining
- Different criteria for attribute/test selection rarely make a large difference
- Different pruning methods mainly change the size of the resulting pruned tree
- C4.5 rules can be slow for large and noisy datasets
- Commercial version C5.0 – much faster and a bit more accurate
- C4.5 offers two parameters:
  - the confidence value (default 25%): lower values incur heavier pruning
  - a threshold on the minimum number of instances in the two most popular branches
© Copyright 2012, Natasha Balac
33
WEKA Tutorial © Copyright 2012 Natasha Balac
34
REGRESSION TREE INDUCTION
Why Regression trees? Ability to:
- predict a continuous variable
- model conditional effects
- model uncertainty
© Copyright 2012, Natasha Balac
35
Regression Trees Continuous goal variables
- Induction by means of an efficient recursive partitioning algorithm
- Uses linear regression to select internal nodes
Quinlan, 1992
© Copyright 2012, Natasha Balac
36
Regression trees Differences to decision trees:
- Splitting: minimizing intra-subset variation
- Pruning: numeric error measure
- Leaf node predicts the average class value of the training instances reaching that node
- Can approximate piecewise constant functions
- Easy to interpret and understand the structure
- Special kind: Model Trees
© Copyright 2012, Natasha Balac
37
Model trees RT with linear regression functions at each leaf
- Linear regression (LR) applied to the instances that reach a node after the full regression tree has been built
- Only a subset of the attributes is used for LR: the attributes occurring in the subtree (and possibly the attributes occurring on the path to the root)
- Fast: the overhead for LR is minimal because only a small subset of attributes is used in the tree
© Copyright 2012, Natasha Balac
38
Building the tree
Splitting criterion: standard deviation reduction (T is the portion of the data reaching the node):
SDR = sd(T) − Σi |Ti| / |T| × sd(Ti)
where T1, T2, … are the sets that result from splitting the node according to the chosen attribute
Termination criteria:
- Standard deviation becomes smaller than a certain fraction of the sd for the full training set (5%)
- Too few instances remain (< 4)
© Copyright 2012, Natasha Balac
39
Nominal attributes
- Nominal attributes are converted into binary attributes (that can be treated as numeric ones)
- Nominal values are sorted using the average class value
- If there are k values, k−1 binary attributes are generated
- It can be proven that the best split on one of the new attributes is the best binary split on the original
- M5' only does the conversion once
© Copyright 2012, Natasha Balac
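A rough sketch of this conversion; the function name nominal_to_binary and the “is the value beyond position i in the ordering” encoding are illustrative assumptions.

```python
from collections import defaultdict

def nominal_to_binary(values, targets):
    """Order the k nominal values by average target value, then emit k-1
    indicator attributes: is the value after position i in that ordering?"""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, y in zip(values, targets):
        sums[v] += y
        counts[v] += 1
    order = sorted(counts, key=lambda v: sums[v] / counts[v])  # by average class value
    rank = {v: i for i, v in enumerate(order)}
    k = len(order)
    return [[1 if rank[v] > i else 0 for i in range(k - 1)] for v in values]
```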
40
Pruning Model Trees Based on estimated absolute error of LR models
- Heuristic estimate for the smoothing calculation: p' = (n·p + k·q) / (n + k), where p' is the prediction passed up to the next node, p is the prediction passed up from the node below, q is the value predicted by the model at this node, n is the number of training instances in the node, and k is a smoothing constant
- Pruned by greedily removing terms to minimize the estimated error
- Heavy pruning allowed: a single LR model can replace a whole subtree
- Pruning proceeds bottom-up: the error for the LR model at an internal node is compared to the error for the subtree
(p. 203)
© Copyright 2012 Natasha Balac
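The smoothing heuristic written directly in code; this is a one-line sketch with variable names following the slide, not M5's implementation.

```python
def smooth(p, q, n, k):
    """Smoothed prediction passed up to the next node: p from below,
    q from this node's model, n instances at the node, k smoothing constant."""
    return (n * p + k * q) / (n + k)
```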
41
Building the tree Splitting criterion Termination criteria
Splitting criterion: standard deviation reduction
- T: the portion of the training data reaching the node; T1, T2, …: the sets that result from splitting the node on the chosen attribute
- Treating the SD of the class values in T as a measure of the error at the node, calculate the expected reduction in error from testing each attribute at the node
Termination criteria:
- Standard deviation becomes smaller than a certain fraction of the SD for the full training set (5%)
- Too few instances remain (< 4)
© Copyright 2012, Natasha Balac
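A small sketch of the SDR computation for one candidate split; the sd and sdr helper names are illustrative.

```python
import math

def sd(values):
    """(Population) standard deviation of a list of numeric class values."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def sdr(parent, subsets):
    """Standard deviation reduction: sd(T) minus the size-weighted sd of T1, T2, ..."""
    n = len(parent)
    return sd(parent) - sum(len(s) / n * sd(s) for s in subsets if s)
```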
42
Pseudo-code for M5’ Four methods:
- Main method: MakeModelTree()
- Method for splitting: split()
- Method for pruning: prune()
- Method that computes error: subtreeError()
We’ll briefly look at each method in turn
The linear regression method is assumed to perform attribute subset selection based on error
© Copyright 2012, Natasha Balac
43
MakeModelTree()
MakeModelTree(instances)
  SD = sd(instances)
  for each k-valued nominal attribute
    convert into k-1 synthetic binary attributes
  root = newNode
  root.instances = instances
  split(root)
  prune(root)
  printTree(root)
© Copyright 2012 Natasha Balac
44
split()
split(node)
  if sizeof(node.instances) < 4 or sd(node.instances) < 0.05*SD
    node.type = LEAF
  else
    node.type = INTERIOR
    for each attribute
      for all possible split positions of the attribute
        calculate the attribute's SDR
    node.attribute = attribute with maximum SDR
    split(node.left)
    split(node.right)
© Copyright 2012, Natasha Balac
45
prune()
prune(node)
  if node = INTERIOR then
    prune(node.leftChild)
    prune(node.rightChild)
    node.model = linearRegression(node)
    if subtreeError(node) > error(node) then
      node.type = LEAF
© Copyright 2012, Natasha Balac
46
subtreeError()
subtreeError(node)
  l = node.left; r = node.right
  if node = INTERIOR then
    return (sizeof(l.instances)*subtreeError(l)
            + sizeof(r.instances)*subtreeError(r)) / sizeof(node.instances)
  else
    return error(node)
© Copyright 2012, Natasha Balac
47
MULTI-VARIATE REGRESSION TREES*
- All the characteristics of a regression tree
- Capable of predicting two or more outcomes
- Example: activity and toxicity, monetary gain and time
Balac, Gaines ICML 2001
© Copyright 2012, Natasha Balac
48
MULTI-VARIATE REGRESSION TREE INDUCTION
[Example multi-variate regression tree: internal nodes test conditions on Var 1 through Var 4 (e.g. >0.5 AND >-3.61, <=-4.71 AND <=4.83); each leaf predicts both outcomes, e.g. Activity = 7.05, Toxicity = 0.173 or Activity = 7.39, Toxicity = 2.89.]
© Copyright 2012, Natasha Balac
49
CLUSTERING Basic idea: Group similar things together
- Unsupervised learning – useful when no other information is available
- K-means: partitioning instances into k disjoint clusters using a measure of similarity
- Applied when there is no class to be predicted but rather when the instances are to be divided into natural groups
- These clusters presumably reflect some mechanism at work in the domain from which instances are drawn, a mechanism that causes some instances to bear a stronger resemblance to one another than they do to the remaining instances
- Evaluation is problematic: usually done by inspection; but if clustering is treated as a density estimation problem, it can be evaluated on test data
© Copyright 2012 Natasha Balac
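A minimal k-means sketch to make the idea concrete: Lloyd's algorithm with random initial centres, points as numeric tuples. This is an illustrative sketch, not the tool used in the tutorial.

```python
import math
import random

def kmeans(points, k, iterations=100):
    """Alternate nearest-centre assignment and centre update."""
    centres = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assign each point to the nearest centre
            nearest = min(range(k), key=lambda j: math.dist(p, centres[j]))
            clusters[nearest].append(p)
        centres = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centres[j]
                   for j, cl in enumerate(clusters)]
    return centres, clusters

# Example: two obvious groups on a line
centres, clusters = kmeans([(1.0,), (1.2,), (0.8,), (8.0,), (8.3,), (7.9,)], k=2)
```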