Machine Learning. Chen Yu, Institute of Computer Science and Technology, Peking University, Information Security Engineering Research Center
Course Information. Instructor: Chen Yu, chen_yu@pku.edu.cn, Tel: 82529680. TA: Cheng Zaixing, Tel: 62763742, wataloo@hotmail.com. Course page: http://www.icst.pku.edu.cn/course/jiqixuexi/jqxx2011.mht
Ch3 Decision Tree Learning: An illustrative example; Basic algorithm; Issues in decision tree learning
Training Examples for Function PlayTennis
Decision Tree for PlayTennis. Each internal node tests one attribute X_i; each branch from an internal node selects one value of X_i; each leaf node predicts the target function Y (the prediction may sometimes be non-deterministic).
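To make this node structure concrete, here is a minimal Python sketch (not from the slides); the Node class, the classify helper, and the tree fragment below are illustrative assumptions based on the PlayTennis tree learned later in the chapter.

```python
class Node:
    """A decision-tree node: either a leaf carrying a prediction,
    or an internal node that tests a single attribute."""
    def __init__(self, attribute=None, prediction=None, children=None):
        self.attribute = attribute      # attribute tested here (None for a leaf)
        self.prediction = prediction    # predicted label (None for an internal node)
        self.children = children or {}  # attribute value -> child Node

def classify(node, example):
    """Follow the branch matching the example's value of each tested attribute."""
    while node.prediction is None:
        node = node.children[example[node.attribute]]
    return node.prediction

# Hypothetical PlayTennis tree with Outlook at the root (see later slides):
tree = Node(attribute="Outlook", children={
    "Overcast": Node(prediction="Yes"),
    "Sunny": Node(attribute="Humidity", children={
        "High": Node(prediction="No"), "Normal": Node(prediction="Yes")}),
    "Rain": Node(attribute="Wind", children={
        "Strong": Node(prediction="No"), "Weak": Node(prediction="Yes")}),
})
print(classify(tree, {"Outlook": "Sunny", "Humidity": "Normal"}))  # -> Yes
```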
Decision Tree for PlayTennis (2). The decision tree corresponds to an expression that is a disjunction of conjunctions of constraints on the attribute values of instances.
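The expression itself appears only as an image on the slide; for the standard PlayTennis tree (with Outlook at the root, as learned later in this chapter) it reads:

(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)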
When to Consider Decision Trees? Instances are represented by attribute-value pairs, such as (Temperature, hot), (Humidity, normal), etc. Extensions of the basic algorithm can handle real-valued attributes. The target function has discrete outputs. Disjunctive descriptions may be required. The training data may contain errors. The training data may contain missing attribute values.
Decision Tree Learning: An illustrative example; Basic algorithm; Issues in decision tree learning
Algorithms for Learning Decision Trees. Core algorithm: a top-down greedy search through the space of possible decision trees. Representative learning algorithms: ID3, C4.5, CART.
ID3 Algorithm. Central choice: which attribute should be tested at each node in the tree?
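The full ID3 pseudocode on the slide is not included in this transcript. Below is a minimal recursive sketch in Python, under the assumption that training examples are given as (attribute-dict, label) pairs; the names are illustrative, not the chapter's own code, and the leaf for an exhausted attribute set is decided by majority vote.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """Expected entropy reduction from splitting `examples` on `attribute`.
    `examples` is a list of (attribute_dict, label) pairs."""
    labels = [y for _, y in examples]
    remainder = 0.0
    for v in {attrs[attribute] for attrs, _ in examples}:
        subset = [y for attrs, y in examples if attrs[attribute] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, attributes):
    """Grow the tree top-down, greedily testing the highest-gain attribute.
    Returns a plain label (leaf) or {"attribute": A, "branches": {value: subtree}}."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:               # all examples agree: make a leaf
        return labels[0]
    if not attributes:                      # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))
    branches = {}
    for v in {attrs[best] for attrs, _ in examples}:
        subset = [(attrs, y) for attrs, y in examples if attrs[best] == v]
        branches[v] = id3(subset, [a for a in attributes if a != best])
    return {"attribute": best, "branches": branches}
```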
Criterion for Attribute Selection. ID3 makes this selection based on a statistical property called information gain, which measures how well a given attribute separates the training examples according to their target classification. Information gain is defined as a reduction in entropy (∆entropy).
Entropy. S: a collection of training examples containing positive and negative examples of some target concept. p_+: the proportion of positive examples in S. p_-: the proportion of negative examples in S.
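The defining formula is shown only as an image on the slide; the standard definition it refers to is

Entropy(S) = −p_+ log₂ p_+ − p_- log₂ p_-

with 0 · log₂ 0 taken to be 0.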
Entropy (2). Shannon's optimal coding theorem: given a class of signals, the coding scheme under which a random signal has the smallest expected code length assigns code lengths satisfying the relation shown on the slide. In light of Shannon's theorem, Entropy(S) is the expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (and this is also the minimum achievable number of bits).
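The relation itself is not included in this transcript; in the standard statement of the theorem, a signal that occurs with probability p is assigned a code of length about −log₂ p bits, so the minimum expected code length per signal is −Σ_i p_i log₂ p_i bits.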
Information Gain. The information gain Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as the expected reduction in entropy caused by sorting S on A.
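The defining formula is shown only as an image on the slide; the standard definition is

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

where S_v is the subset of S for which attribute A has value v.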
Learning a Decision Tree for PlayTennis. Gain(S, Outlook) = 0.246 (so Outlook becomes the root node), Gain(S, Humidity) = 0.151, Gain(S, Wind) = 0.048, Gain(S, Temperature) = 0.029.
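These values can be checked numerically. Below is a small self-contained script; the 14-example table it uses is an assumption (the standard PlayTennis data from Mitchell's textbook), since the table slide in this transcript contains only a title.

```python
from collections import Counter
from math import log2

# The 14 PlayTennis examples (Outlook, Temperature, Humidity, Wind, PlayTennis),
# assumed here from Mitchell's textbook because the table slide is image-only.
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),       ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),   ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),   ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr):
    col = ATTRS[attr]
    before = entropy([r[-1] for r in rows])
    remainder = 0.0
    for v in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return before - remainder

for a in ATTRS:
    print(a, round(gain(DATA, a), 3))
# Prints roughly: Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048
# (the slide's 0.246 and 0.151 reflect slightly different rounding).
```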
Learning the Decision Tree (2)
Learning the Decision Tree (3). For each non-terminal descendant node the learning process continues, excluding the attributes that have already been chosen at its ancestor nodes. The process stops when every node is terminal or no attributes are left. If some non-terminal node has no attributes left, the value of the learned function along the path from the root to that node is non-deterministic.
Final Decision Tree
Some Fun with Decision Trees: a decision tree learning applet at http://www.cs.ualberta.ca/~aixplore/
Hypothesis Space Searched by ID3
Expressive Power of Decision Trees. The "basis" boolean functions A ∧ B, A ∨ B, and ¬A can each be represented by a decision tree. Therefore any boolean function defined over a finite number of attributes, each taking a finite number of possible values, can be represented by a decision tree, e.g. AB ∨ CD(¬E).
Pluses and Minuses of ID3. The hypothesis space is a complete space of finite discrete-valued functions relative to the available attributes. ID3 outputs a single hypothesis rather than all hypotheses consistent with the training examples. There is no backtracking in its search, so it may converge to a locally optimal solution that is not globally optimal. The search is statistically based, making it robust to noisy data; by modifying the termination criterion, ID3 can accept hypotheses that imperfectly fit the training data.
Inductive Bias in ID3. Recall that the inductive bias is the set of assumptions that, together with the training examples, justifies the predictions of the learned function. ID3 outputs a single hypothesis rather than all hypotheses consistent with the training examples. Which one does it output? It prefers short trees, especially those with high-information-gain attributes near the root.
Ockham's Razor. Q: How do we choose from among multiple consistent hypotheses? Ockham's razor: prefer the simplest hypothesis consistent with the data. (William of Ockham)
Pros and Cons of Ockham's Razor. Pro: there are fewer short hypotheses than long ones, so a short hypothesis that fits the data is less likely to do so coincidentally. Con: there are many ways to define a small set of hypotheses, so what is special about a small set defined by hypothesis size? ...
Decision Tree Learning: An illustrative example; Basic algorithm; Issues in decision tree learning
Issues: Avoiding overfitting the data (how deep should a tree be grown?); Incorporating continuous-valued attributes; Attributes with many values; Unknown attribute values; Attributes with various costs. Addressing these issues takes us from ID3 to C4.5 (ID3 → C4.5).
Issues: Avoiding overfitting the data; Incorporating continuous-valued attributes; Attributes with many values; Unknown attribute values; Attributes with various costs
Overfitting in Decision Trees. Consider adding the noisy training example #15: (Sunny, Hot, Normal, Strong, PlayTennis = No). What does the new tree look like?
The Tree Becomes...
Overfitting. Consider the error of a hypothesis h over the training data, error_train(h), and over the entire distribution D of data, error_D(h). A hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that error_train(h) < error_train(h') but error_D(h) > error_D(h').
Overfitting in Decision Tree Learning. See http://www.cs.cmu.edu/~tom/mlbook.html for software and data for further experiments.
Avoiding Overfitting. How can we avoid overfitting? Either stop growing the tree when a data split is not statistically significant (as in the previous example), or grow the full tree and then post-prune it. How do we select the "best" tree? Measure performance over the training data and over a separate validation data set, or use a measure such as Minimum Description Length (MDL) that minimizes size(tree) + size(misclassifications(tree)).
Reduced-Error Pruning. Split the data into a training set and a validation set, then repeat until further pruning is harmful: 1. evaluate the impact on the validation set of pruning each possible node (i.e., removing the subtree rooted at that node and making the node a leaf); 2. greedily remove the node whose removal most improves validation-set accuracy. (A code sketch of this loop follows below.)
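A minimal sketch of this greedy loop, assuming the dict-based tree produced by the earlier ID3 sketch and training/validation sets given as lists of (attribute-dict, label) pairs; all names here are illustrative, not the chapter's own code.

```python
from collections import Counter
from copy import deepcopy

def predict(tree, attrs):
    """Classify with a dict-based tree (see the ID3 sketch): a leaf is a label,
    an internal node is {"attribute": A, "branches": {value: subtree}}."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(attrs[tree["attribute"]])  # unseen value -> None
    return tree

def accuracy(tree, examples):
    return sum(predict(tree, a) == y for a, y in examples) / len(examples)

def internal_paths(tree, path=()):
    """Yield the branch-value path leading to every internal (dict) node."""
    if isinstance(tree, dict):
        yield path
        for v, sub in tree["branches"].items():
            yield from internal_paths(sub, path + (v,))

def pruned_copy(tree, path, leaf_label):
    """Copy of the tree with the node at `path` replaced by a leaf."""
    if not path:
        return leaf_label
    new = deepcopy(tree)
    node = new
    for v in path[:-1]:
        node = node["branches"][v]
    node["branches"][path[-1]] = leaf_label
    return new

def majority_label(tree, path, train):
    """Majority training label among examples that reach the node at `path`
    (assumes the tree was grown from `train`, so the subset is non-empty)."""
    node, rows = tree, train
    for v in path:
        rows = [(a, y) for a, y in rows if a[node["attribute"]] == v]
        node = node["branches"][v]
    return Counter(y for _, y in rows).most_common(1)[0][0]

def reduced_error_prune(tree, train, validation):
    """Greedily prune whichever node most improves validation accuracy;
    stop as soon as no single pruning step helps (further pruning is harmful)."""
    while isinstance(tree, dict):
        base = accuracy(tree, validation)
        candidates = [pruned_copy(tree, p, majority_label(tree, p, train))
                      for p in internal_paths(tree)]
        best = max(candidates, key=lambda t: accuracy(t, validation))
        if accuracy(best, validation) <= base:
            return tree
        tree = best
    return tree
```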
Effect of Reduced-Error Pruning. What if data is limited? Try rule post-pruning!
Rule Post-Pruning. 1. Convert the tree to an equivalent set of rules (one rule for each path from the root to a leaf). 2. Prune each rule, independently of the others, by removing any precondition whose removal improves the rule's estimated accuracy. 3. Sort the final rules by their estimated accuracy and apply them in that order. This is perhaps the most frequently used method (e.g., in C4.5).
Example: Converting a Tree to Rules. There are 5 rules in total for the tree, e.g. "If Outlook = Sunny ∧ Humidity = High, then PlayTennis = No". This rule has two preconditions, and we would remove whichever precondition's removal most improves the estimated classification accuracy (keeping both if neither removal helps). (A sketch of the tree-to-rules conversion follows below.)
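A short sketch of step 1 of rule post-pruning, assuming the dict-based tree representation from the ID3 sketch earlier; the pruning of individual preconditions against estimated accuracy is omitted.

```python
def tree_to_rules(tree, conditions=()):
    """Convert a dict-based tree (as in the ID3 sketch) into rules: each rule is
    (list of (attribute, value) preconditions, predicted label), one per leaf."""
    if not isinstance(tree, dict):                 # leaf: one finished rule
        return [(list(conditions), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += tree_to_rules(subtree, conditions + ((tree["attribute"], value),))
    return rules

# For the PlayTennis tree this yields 5 rules, one per root-to-leaf path, e.g.
# ([("Outlook", "Sunny"), ("Humidity", "High")], "No")
```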
Issues: Avoiding overfitting the data; Incorporating continuous-valued attributes; Attributes with many values; Unknown attribute values; Attributes with various costs
Continuous-Valued Attributes. Consider the following example: we want to include the continuous-valued attribute Temperature in the PlayTennis training examples, and assume the examples have the (Temperature, PlayTennis) value pairs shown on the slide (the pairs are not included in this transcript; in the textbook example they are (40, No), (48, No), (60, Yes), (72, Yes), (80, Yes), (90, No)). Q: Instead of sorting on the discrete values of Temperature, can we choose a threshold c so that Temperature becomes a boolean attribute with only two cases, Temperature < c and otherwise? Criterion for choosing c: maximize the information gain.
Choosing the Threshold. 1. Sort the examples by their Temperature values. 2. Identify adjacent examples that differ in their PlayTennis value. 3. Generate a set of candidate thresholds c, each midway between the corresponding Temperature values; a c that maximizes information gain must always lie at such a boundary (Fayyad 1991). 4. Determine c by comparing the information gain of the candidates.
Choosing the Threshold (2). In the example we have two candidate thresholds: (48+60)/2 = 54 and (80+90)/2 = 85, and 54 turns out to be the better one (a sketch of this computation follows below). The approach can be extended to split a continuous-valued attribute into multiple intervals.
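A small sketch of the candidate-threshold search, using the assumed textbook values mentioned above; the function names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(pairs):
    """pairs: list of (numeric value, label). Candidate thresholds are midpoints
    between adjacent sorted values whose labels differ; return the candidate with
    the highest information gain, as (threshold, gain)."""
    pairs = sorted(pairs)
    base = entropy([y for _, y in pairs])
    best = None
    for (x1, y1), (x2, y2) in zip(pairs, pairs[1:]):
        if y1 == y2:
            continue                       # only class boundaries are candidates
        c = (x1 + x2) / 2
        below = [y for x, y in pairs if x < c]
        above = [y for x, y in pairs if x >= c]
        g = base - (len(below) * entropy(below) + len(above) * entropy(above)) / len(pairs)
        if best is None or g > best[1]:
            best = (c, g)
    return best

# Assumed textbook values (the slide's own table is not in this transcript):
temps = [(40, "No"), (48, "No"), (60, "Yes"), (72, "Yes"), (80, "Yes"), (90, "No")]
print(best_threshold(temps))   # -> (54.0, ...): threshold 54 beats 85
```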
Issues: Avoiding overfitting the data; Incorporating continuous-valued attributes; Attributes with many values; Unknown attribute values; Attributes with various costs
Attributes with Many Values. Problem: if an attribute has many values, Gain may select it, since the entropy at each of its child nodes will be quite small. Imagine carelessly choosing Date as an attribute, with values such as 2008-09-15. One fix: use the gain ratio instead.
Gain Ratio. SplitInformation is the entropy of S with respect to the values of A, rather than with respect to the target function values; consider the case of S being partitioned equally. The gain-ratio formula can be problematic if S is partitioned very unevenly; a better solution in that case is a distance-based measure instead of information gain (López de Mántaras).
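The slide's formulas appear only as images; the standard definitions are

SplitInformation(S, A) = − Σ_{i=1..c} (|S_i| / |S|) log₂ (|S_i| / |S|)
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

where S_1, ..., S_c are the subsets of S induced by the c values of A. If S is split into c equal parts, SplitInformation(S, A) = log₂ c, which penalizes many-valued attributes such as Date; if almost all of S falls under one value, SplitInformation approaches 0 and the ratio blows up, which is the uneven-partition problem mentioned above.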
Issues: Avoiding overfitting the data; Incorporating continuous-valued attributes; Attributes with many values; Unknown attribute values; Attributes with various costs
Unknown Attribute Values. What if some training example (x, f(x)) has an unknown value for attribute A? Use (x, f(x)) anyway and proceed with training: if node n tests A and (x, f(x)) is sorted to n, assign A(x) the most common value of A among the examples sorted to n, or the most common value of A among the examples sorted to n that have the same f(x).
Unknown Attribute Values (2). A more complex procedure for assigning A(x): assign a probability p_i to each possible value v_i of A, and pass the example down the i-th branch of node n with weight p_i. The same scheme can be used at classification time.
Issues: Avoiding overfitting the data; Incorporating continuous-valued attributes; Attributes with many values; Unknown attribute values; Attributes with various costs
Attributes with Various Costs. Consider medical diagnosis: the patient's attributes have different measurement costs, and we prefer a tree with low expected cost. We might also introduce a "cost" when the data are imbalanced. Various approaches replace information gain with a cost-sensitive measure: Tan & Schlimmer (1990) use Gain²(S, A) / Cost(A) for a robot perception task; Núñez (1988) proposes a related measure for medical diagnosis.
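Núñez's measure is not spelled out in this transcript; as given in Mitchell's textbook (reconstructed here, so treat it as an assumption about what the slide showed) it is

(2^Gain(S, A) − 1) / (Cost(A) + 1)^w

where w ∈ [0, 1] controls the relative importance of cost versus information gain.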
Summary. Decision tree learning provides a practical method for concept learning and for learning other discrete-valued functions. Three well-known algorithms are CART, ID3 (discussed in this chapter), and C4.5 (an improvement over ID3). ID3 searches a complete space of discrete-valued functions over discrete-valued attributes. The inductive bias implicit in ID3 includes a preference for smaller trees. Overfitting is an important issue in decision tree learning; one way to alleviate it is post-pruning.
HW 3.4 (10 pt, due Monday, Sept 26)