Slide 1: Decision Tree Classifiers. Oliver Schulte, Machine Learning 726.
Slide 2: Overview. Model family by (child node, parent node) type:
  discrete child, discrete parent: maximum likelihood, decision trees
  discrete child, continuous parent: logit distribution (logistic regression)
  continuous child, discrete parent: conditional Gaussian (not discussed)
  continuous child, continuous parent: linear Gaussian (linear regression)
Slide 3: Decision Tree. A popular type of classifier that is easy to visualize, especially for discrete attributes but also usable with continuous ones. Learning is based on information theory.
Slide 4: Decision Tree Example.
Slide 5: Exercise. Find decision trees to represent: A OR B; A AND B; A XOR B; (A AND B) OR (C AND NOT D AND E).
Slide 6: Decision Tree Learning. Basic loop (a code sketch follows below):
1. A := the "best" decision attribute for the next node.
2. For each value of A, create a new descendant of the node.
3. Assign the training examples to the leaf nodes.
4. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes.
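A minimal sketch of this greedy loop, assuming discrete attributes and training examples given as (attribute-value dict, class label) pairs; the names `build_tree` and `choose` are illustrative, not from the slides, and `choose` stands in for whatever attribute-selection rule is used (e.g., information gain, slides 16-17).

```python
from collections import Counter

def build_tree(examples, attributes, choose):
    """Greedy decision-tree learning, following the basic loop above.

    examples   -- list of (dict attribute -> value, class label) pairs
    attributes -- attributes still available for splitting
    choose     -- function picking the 'best' attribute for a node
    """
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                  # perfectly classified: stop
        return labels[0]
    if not attributes:                         # nothing left to split on: majority label
        return Counter(labels).most_common(1)[0][0]
    a = choose(examples, attributes)           # step 1: best attribute
    tree = {a: {}}
    for v in {x[a] for x, _ in examples}:      # step 2: one descendant per value of A
        subset = [(x, y) for x, y in examples if x[a] == v]   # step 3: assign examples
        rest = [b for b in attributes if b != a]
        tree[a][v] = build_tree(subset, rest, choose)         # step 4: iterate
    return tree
```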
Slide 7: Entropy.
Slide 8: Uncertainty and Probability. The more "balanced" a probability distribution, the less information it conveys (e.g., about the class label). How do we quantify this? Information theory: entropy measures balance. For a sample S with proportion p+ of positive examples and p- of negative examples, Entropy(S) = -p+ log2(p+) - p- log2(p-).
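A small sketch of this two-class entropy, assuming labels are given as "+" and "-" strings; the function name `entropy` is ours.

```python
import math

def entropy(labels):
    """Entropy of a binary-labeled sample: -p+ log2(p+) - p- log2(p-),
    with the convention 0 * log2(0) = 0."""
    n = len(labels)
    p_pos = sum(1 for y in labels if y == "+") / n
    p_neg = 1 - p_pos
    return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

print(entropy(["+", "-", "+", "-"]))   # balanced sample: 1.0 bit
print(entropy(["+", "+", "+", "+"]))   # pure sample: 0.0 bits
```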
Slide 9: Entropy: General Definition. Entropy is an important quantity in coding theory, statistical physics, and machine learning. For a discrete random variable X, the general definition is H(X) = -Σ_x p(x) log2 p(x).
Slide 10: Intuition.
Slide 11: Entropy.
Slide 12: Coding Theory. Suppose X is discrete with 8 possible states ("messages"); how many bits are needed to transmit the state of X? Shannon's source coding theorem: an optimal code assigns a code word of length -log2 p(x) bits to each message X = x. If all states are equally likely, every message needs log2 8 = 3 bits.
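A sketch of the optimal code-length rule, assuming message probabilities are given as a dictionary; the distributions used here are illustrative, not from the slides.

```python
import math

def code_lengths(p):
    """Optimal code length for each message x is -log2 p(x) bits."""
    return {x: -math.log2(px) for x, px in p.items()}

# 8 equally likely states: every message needs log2(8) = 3 bits.
uniform = {x: 1 / 8 for x in "abcdefgh"}
print(code_lengths(uniform))

# Skewed distribution: frequent messages get shorter codes (the Zipf's-law idea of slide 14).
skewed = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
print(code_lengths(skewed))            # 1, 2, 3, 3 bits
expected = sum(p * l for p, l in zip(skewed.values(), code_lengths(skewed).values()))
print(expected)                        # expected length = entropy = 1.75 bits
```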
Slide 13: Another Coding Example.
Slide 14: Zipf's Law. General principle: frequent messages get shorter codes (e.g., abbreviations). This is the basis of information compression.
Slide 15: The Kullback-Leibler Divergence. Measures an information-theoretic "distance" between two distributions p and q: KL(p || q) = Σ_x p(x) [log2(1/q(x)) - log2(1/p(x))], i.e., the code length of x under the wrong distribution q minus its code length under the true distribution p, averaged over p.
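A sketch of this quantity written exactly as that difference of code lengths, assuming both distributions are dictionaries over the same states with q(x) > 0 wherever p(x) > 0; the example distributions are made up.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * (log2(1/q(x)) - log2(1/p(x)))
    = expected extra bits paid for coding with q instead of the true p."""
    return sum(px * (math.log2(1 / q[x]) - math.log2(1 / px))
               for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
print(kl_divergence(p, p))   # 0.0: no extra cost with the true distribution
print(kl_divergence(p, q))   # > 0: the wrong code wastes bits on average
```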
Slide 16: Information Gain.
Slide 17: Splitting Criterion. Splitting on an attribute changes the entropy. Intuitively, we want to split on the attribute that gives the greatest reduction in entropy, averaged over its attribute values. Gain(S, A) = expected reduction in entropy due to splitting on A: Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v), where S_v is the subset of S with A = v. (A code sketch follows below.)
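A sketch of Gain(S, A), using a multi-class generalization of the entropy function from slide 8's example; the tiny dataset and attribute names are invented for illustration only.

```python
import math

def entropy(labels):
    """H = -sum_c p(c) log2 p(c) over the class labels in the sample."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(examples, attribute):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    total = entropy([y for _, y in examples])
    n = len(examples)
    remainder = 0.0
    for v in {x[attribute] for x, _ in examples}:
        subset = [y for x, y in examples if x[attribute] == v]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder

# Toy data: 'wind' perfectly predicts the label, 'temp' tells us nothing.
S = [({"wind": "strong", "temp": "hot"},  "no"),
     ({"wind": "strong", "temp": "cool"}, "no"),
     ({"wind": "weak",   "temp": "hot"},  "yes"),
     ({"wind": "weak",   "temp": "cool"}, "yes")]
print(information_gain(S, "wind"))   # 1.0 bit
print(information_gain(S, "temp"))   # 0.0 bits
```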
Slide 18: Example.
Slide 19: PlayTennis.