Decision Trees
2 Outline
- What is a decision tree?
- How to construct a decision tree?
  - What are the major steps in decision tree induction?
  - How to select the attribute to split the node?
- What are the other issues?
3 Classification by Decision Tree Induction
- Decision tree: a flow-chart-like tree structure
  - An internal node denotes a test on an attribute
  - A branch represents an outcome of the test
  - Leaf nodes represent class labels or class distributions
[Figure: example decision tree with root test Age? (branches <=30, 31…40, >40), followed by Student? (no/yes) and Credit? (fair/excellent) tests and NO/YES leaves]
4 Training Dataset

ID   Age    Income  Student  Credit     Buys_computer
P1   <=30   high    no       fair       no
P2   <=30   high    no       excellent  no
P3   31…40  high    no       fair       yes
P4   >40    medium  no       fair       yes
P5   >40    low     yes      fair       yes
P6   >40    low     yes      excellent  no
P7   31…40  low     yes      excellent  yes
P8   <=30   medium  no       fair       no
P9   <=30   low     yes      fair       yes
P10  >40    medium  yes      fair       yes
P11  <=30   medium  yes      excellent  yes
P12  31…40  medium  no       excellent  yes
P13  31…40  high    yes      fair       yes
P14  >40    medium  no       excellent  no

5 Output: A Decision Tree for “buy_computer”

Age?
  <=30   -> Student?
              no  -> NO
              yes -> YES
  31…40  -> YES
  >40    -> Credit?
              excellent -> NO
              fair      -> YES

7 Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
  - Attributes are categorical (continuous-valued attributes are discretized in advance)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all training examples are at the root
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
  - Examples are partitioned recursively based on the selected attributes
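The greedy procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names (`info`, `partition`, `gain`, `build`) and the nested-dict tree layout are assumptions made for the example, and information gain is used as the selection heuristic.

```python
import math
from collections import Counter

# Rows of the training dataset slide: age, income, student, credit, buys_computer.
ATTRS = ["age", "income", "student", "credit"]
ROWS = """
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
"""
DATA = [dict(zip(ATTRS + ["label"], line.split())) for line in ROWS.strip().splitlines()]

def info(rows):
    """Entropy (in bits) of the class labels in rows."""
    counts = Counter(r["label"] for r in rows)
    return sum(-c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def partition(rows, attr):
    """Group rows by their value of attr."""
    parts = {}
    for r in rows:
        parts.setdefault(r[attr], []).append(r)
    return parts

def gain(rows, attr):
    """Information gain obtained by splitting rows on attr."""
    parts = partition(rows, attr)
    return info(rows) - sum(len(p) / len(rows) * info(p) for p in parts.values())

def build(rows, attrs):
    """Top-down recursive construction; leaves are class labels."""
    labels = Counter(r["label"] for r in rows)
    if len(labels) == 1 or not attrs:        # pure node, or no attribute left to test
        return labels.most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    rest = [a for a in attrs if a != best]
    return {best: {v: build(p, rest) for v, p in partition(rows, best).items()}}

tree = build(DATA, ATTRS)
print(tree)
```

On this dataset the sketch selects age at the root, then student and credit in the two impure branches.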
9 Construction of A Decision Tree for “buy_computer”
- Root (test attribute not yet chosen): [P1,…,P14] Yes: 9, No: 5
11 Construction of A Decision Tree for “buy_computer”
- Root: Age? [P1,…,P14] Yes: 9, No: 5, with branches <=30, 31…40, and >40
14 Construction of A Decision Tree for “buy_computer”
- Root: Age? [P1,…,P14] Yes: 9, No: 5
  - <=30: [P1,P2,P8,P9,P11] Yes: 2, No: 3 -> ? (still to be split)
  - 31…40: [P3,P7,P12,P13] Yes: 4, No: 0 -> YES
  - >40: [P4,P5,P6,P10,P14] Yes: 3, No: 2 -> ? (still to be split)
15 Construction of A Decision Tree for “buy_computer”
- Root: Age? [P1,…,P14] Yes: 9, No: 5
  - <=30: [P1,P2,P8,P9,P11] Yes: 2, No: 3 -> Student?
    - no: [P1,P2,P8] Yes: 0, No: 3 -> NO
    - yes: [P9,P11] Yes: 2, No: 0 -> YES
  - 31…40: [P3,P7,P12,P13] Yes: 4, No: 0 -> YES
  - >40: [P4,P5,P6,P10,P14] Yes: 3, No: 2 -> ? (still to be split)
16 Construction of A Decision Tree for “buy_computer”
- Root: Age? [P1,…,P14] Yes: 9, No: 5
  - <=30: [P1,P2,P8,P9,P11] Yes: 2, No: 3 -> Student?
    - no: [P1,P2,P8] Yes: 0, No: 3 -> NO
    - yes: [P9,P11] Yes: 2, No: 0 -> YES
  - 31…40: [P3,P7,P12,P13] Yes: 4, No: 0 -> YES
  - >40: [P4,P5,P6,P10,P14] Yes: 3, No: 2 -> Credit?
    - excellent: [P6,P14] Yes: 0, No: 2 -> NO
    - fair: [P4,P5,P10] Yes: 3, No: 0 -> YES
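The partitions and class counts shown during construction can be reproduced with a short script. This is a sketch only: the helper names `split` and `summary` are made up for the example, and the income column is left out because the finished tree never tests it.

```python
from collections import Counter

# id, age, student, credit, buys_computer, copied from the training dataset slide.
ROWS = """
P1 <=30 no fair no
P2 <=30 no excellent no
P3 31…40 no fair yes
P4 >40 no fair yes
P5 >40 yes fair yes
P6 >40 yes excellent no
P7 31…40 yes excellent yes
P8 <=30 no fair no
P9 <=30 yes fair yes
P10 >40 yes fair yes
P11 <=30 yes excellent yes
P12 31…40 no excellent yes
P13 31…40 yes fair yes
P14 >40 no excellent no
"""
FIELDS = ["id", "age", "student", "credit", "label"]
DATA = [dict(zip(FIELDS, line.split())) for line in ROWS.strip().splitlines()]

def split(rows, attr):
    """Partition rows by their value of attr."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row)
    return groups

def summary(rows):
    """Return (ids, number of yes, number of no) for a partition."""
    counts = Counter(row["label"] for row in rows)
    return [row["id"] for row in rows], counts["yes"], counts["no"]

by_age = split(DATA, "age")                    # first split, at the root
young = split(by_age["<=30"], "student")       # second split, under age <= 30
senior = split(by_age[">40"], "credit")        # second split, under age > 40

for name, groups in [("age", by_age), ("student | age<=30", young), ("credit | age>40", senior)]:
    for value, rows in groups.items():
        print(name, value, summary(rows))
```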
19 Which Attribute is the Best?
- The best attribute is the one most useful for classifying examples
  - How useful is it? How well does it separate the examples? How pure are the resulting partitions?
- Information gain
  - An information-theoretic approach
  - Measures how well an attribute separates the training examples
  - Use the attribute with the highest information gain to split
  - The aim is to minimize the expected number of tests needed to classify a new tuple
20 Choosing an attribute
- Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
- Patrons? is a better choice than Type? (their information gains are computed on slide 31)
21 Information theory
- If there are n equally probable possible messages, then the probability p of each is 1/n
- The information conveyed by a message is $-\log_2 p = \log_2 n$
- E.g., if there are 16 messages, then $\log_2 16 = 4$ and we need 4 bits to identify/send each message
- In general, given a probability distribution $P = (p_1, p_2, \ldots, p_n)$, the information conveyed by the distribution (a.k.a. the entropy of P) is:
  $I(P) = -(p_1 \log_2 p_1 + p_2 \log_2 p_2 + \cdots + p_n \log_2 p_n)$
22 Information theory II
- Information conveyed by a distribution (a.k.a. the entropy of P):
  $I(P) = -(p_1 \log_2 p_1 + p_2 \log_2 p_2 + \cdots + p_n \log_2 p_n)$
- Examples:
  - If P is (0.5, 0.5) then I(P) is 1
  - If P is (0.67, 0.33), i.e. approximately (2/3, 1/3), then I(P) is 0.92
  - If P is (1, 0) then I(P) is 0 (taking $0 \log_2 0 = 0$)
- The more uniform the probability distribution, the greater its information: more information is conveyed by a message telling you which event actually occurred
- Entropy is the average number of bits per message needed to represent a stream of messages
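The entropy examples can be checked numerically. The sketch below assumes base-2 logarithms and the usual convention that a zero probability contributes nothing; the function name `entropy` is chosen for the example.

```python
import math

def entropy(probs):
    """I(P) in bits; terms with p == 0 are skipped (0 * log 0 is taken as 0)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                 # uniform over two outcomes
print(round(entropy([2 / 3, 1 / 3]), 2))   # the (0.67, 0.33) example
print(entropy([1.0, 0.0]))                 # a certain outcome carries no information
print(entropy([1 / 16] * 16))              # 16 equally likely messages
```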
23 Information for classification
- If a set S of records is partitioned into disjoint exhaustive classes $(C_1, C_2, \ldots, C_k)$ on the basis of the value of the class attribute, then the information needed to identify the class of an element of S is $\mathrm{Info}(S) = I(P)$, where P is the probability distribution of the partition:
  $P = (|C_1|/|S|, |C_2|/|S|, \ldots, |C_k|/|S|)$
[Figure: two partitions of S into classes C1, C2, C3, one labeled "High information" and the other "Low information"]
24 Information, Entropy, and Information Gain
- S contains $s_i$ tuples of class $C_i$ for $i = 1, \ldots, m$, with $s = s_1 + \cdots + s_m$
- Information measures “the amount of info” required to classify an arbitrary tuple:
  $I(s_1, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$
  where $s_i/s$ is the probability that an arbitrary tuple belongs to $C_i$
- Example: S contains 100 tuples, 25 belong to class $C_1$ and 75 belong to class $C_2$, so $I(25, 75) \approx 0.811$
25 Information, Entropy, and Information Gain
- Information reflects the “purity” of the data set
  - A low information value indicates high purity
  - A high information value indicates high diversity
- Example: S contains 100 tuples
  - 0 belong to class $C_1$ and 100 belong to class $C_2$: $I(0, 100) = 0$ (pure)
  - 50 belong to class $C_1$ and 50 belong to class $C_2$: $I(50, 50) = 1$ (maximally mixed)
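The three 100-tuple examples can be worked out by hand. The block below assumes base-2 logarithms and the convention $0 \log_2 0 = 0$, and writes $I(s_1, s_2)$ for the information of a set with $s_1$ tuples in $C_1$ and $s_2$ in $C_2$.

```latex
\begin{align*}
I(25, 75)  &= -\tfrac{25}{100}\log_2\tfrac{25}{100} - \tfrac{75}{100}\log_2\tfrac{75}{100}
            \approx 0.25 \cdot 2 + 0.75 \cdot 0.415 \approx 0.811 \\
I(0, 100)  &= -0 \cdot \log_2 0 - 1 \cdot \log_2 1 = 0 \\
I(50, 50)  &= -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1
\end{align*}
```

The pure set needs no information to classify a tuple, the evenly mixed set needs a full bit, and the 25/75 set falls in between.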
26 Information for classification II
- If we partition S w.r.t. attribute X into sets $\{T_1, T_2, \ldots, T_n\}$, then the information needed to identify the class of an element of S becomes the weighted average of the information needed to identify the class of an element of $T_i$, i.e. the weighted average of $\mathrm{Info}(T_i)$:
  $\mathrm{Info}(X, S) = \sum_{i=1}^{n} \frac{|T_i|}{|S|} \, \mathrm{Info}(T_i)$
[Figure: two partitions of S into classes C1, C2, C3, one labeled "High information" and the other "Low information"]
27 Information gain
- Consider the quantity Gain(X,S), defined as
  $\mathrm{Gain}(X, S) = \mathrm{Info}(S) - \mathrm{Info}(X, S)$
- This is the difference between the information needed to identify the class of an element of S and the information needed to identify it after the value of attribute X has been obtained; that is, the gain in information due to attribute X
- We can use this to rank attributes and to build decision trees in which each node holds the attribute with the greatest gain among the attributes not yet considered on the path from the root
- The intent of this ordering is:
  - To create small decision trees, so that records can be identified after only a few questions
  - To match a hoped-for minimality of the process represented by the records being considered (Occam’s Razor)
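As an illustration, Gain(X,S) can be computed for each attribute of the buys_computer training dataset. The (yes, no) counts below were tallied by hand from the training table, and the names `info`, `gain`, and `SPLITS` are assumptions of this sketch.

```python
import math

def info(counts):
    """Info of a set given its per-class tuple counts, in bits."""
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c)

def gain(splits):
    """splits maps attribute value -> (yes, no) counts; returns Info(S) - Info(X, S)."""
    class_totals = [sum(col) for col in zip(*splits.values())]
    n = sum(class_totals)
    info_x = sum(sum(c) / n * info(c) for c in splits.values())
    return info(class_totals) - info_x

# (yes, no) counts per attribute value, tallied from the 14 training tuples.
SPLITS = {
    "age":     {"<=30": (2, 3), "31…40": (4, 0), ">40": (3, 2)},
    "income":  {"high": (2, 2), "medium": (4, 2), "low": (3, 1)},
    "student": {"yes": (6, 1), "no": (3, 4)},
    "credit":  {"fair": (6, 2), "excellent": (3, 3)},
}

gains = {attr: gain(splits) for attr, splits in SPLITS.items()}
for attr, g in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{attr:8s} {g:.3f}")
```

Age has the largest gain, which is why it is chosen as the root test in the construction slides.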
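28 Information, Entropy, and Information Gain
- S contains $s_i$ tuples of class $C_i$ for $i = 1, \ldots, m$, and $s = |S|$
- Attribute A has values $\{a_1, a_2, \ldots, a_v\}$
- Let $s_{ij}$ be the number of tuples which belong to class $C_i$ and have value $a_j$ in attribute A
- Entropy of attribute A is
  $E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})$
- Information gained by branching on attribute A is
  $\mathrm{Gain}(A) = I(s_1, \ldots, s_m) - E(A)$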
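29 Information, Entropy, and Information Gain
- Let $T_j$ be the set of tuples having value $a_j$ in attribute A
  - $s_{1j} + \cdots + s_{mj} = |T_j|$
  - $I(s_{1j}, \ldots, s_{mj}) = I(T_j)$
- Entropy of attribute A is
  $E(A) = \sum_{j=1}^{v} \frac{|T_j|}{|S|} \, I(T_j)$
  where $|T_j|/|S|$ is the proportion of $T_j$ in S and $I(T_j)$ is the information of $T_j$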
30 Information, Entropy, and Information Gain
- S contains 100 tuples: 40 belong to class C1 (red) and 60 belong to class C2 (blue), so I(40,60) = 0.971
- Splitting S on attribute A:

  Branch  Tuples  Information
  A=a1    20      I(10,10) = 1
  A=a2    30      I(10,20) = 0.918
  A=a3    50      I(20,30) = 0.971

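Combining the branch values above gives the entropy of attribute A and its gain. The arithmetic below is a worked completion of the example, using $E(A)$ as the weighted average of the branch information values and $\mathrm{Gain}(A)$ as the difference from $I(40, 60)$.

```latex
\begin{align*}
E(A) &= \tfrac{20}{100}\, I(10,10) + \tfrac{30}{100}\, I(10,20) + \tfrac{50}{100}\, I(20,30) \\
     &= 0.2 \cdot 1 + 0.3 \cdot 0.918 + 0.5 \cdot 0.971 \approx 0.961 \\
\mathrm{Gain}(A) &= I(40,60) - E(A) \approx 0.971 - 0.961 = 0.010
\end{align*}
```

The gain is close to zero because each branch is nearly as mixed as S itself, so A would be a poor test attribute here.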
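31 Computing information gain
[Figure: the examples (6 Y, 6 N) split by Type into French, Italian, Thai, Burger, and by Patrons into Empty, Some, Full]
- I(S) = -(.5 log .5 + .5 log .5) = .5 + .5 = 1
- I(Pat, S) = 1/6 (0) + 1/3 (0) + 1/2 (-(2/3 log 2/3 + 1/3 log 1/3)) = 1/2 (2/3*.6 + 1/3*1.6) = .47
- I(Type, S) = 1/6 (1) + 1/6 (1) + 1/3 (1) + 1/3 (1) = 1
- Gain(Pat, S) = 1 - .47 = .53
- Gain(Type, S) = 1 - 1 = 0
(Logarithms are base 2; .6 and 1.6 are rounded values of -log 2/3 and -log 1/3.)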
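The same calculation can be repeated without rounding the logarithms. The per-branch (Y, N) counts below are an assumption: the figure is not recoverable from the text, so they were chosen only to match the fractions in the formulas (branch sizes 2, 4, 6 for Patrons and 2, 2, 4, 4 for Type, out of 12 examples).

```python
import math

def info(counts):
    """Information (bits) of a set given its per-class counts."""
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c)

def info_after_split(groups):
    """Weighted average information of the subsets produced by a split."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * info(g) for g in groups)

# ASSUMED (Y, N) counts per branch, consistent with the fractions on the slide.
patrons = [(0, 2), (4, 0), (2, 4)]              # Empty, Some, Full
rtype = [(1, 1), (1, 1), (2, 2), (2, 2)]        # French, Italian, Thai, Burger

info_s = info((6, 6))
gain_patrons = info_s - info_after_split(patrons)
gain_type = info_s - info_after_split(rtype)
print(round(gain_patrons, 3), round(gain_type, 3))
```

With exact logarithms the gain for Patrons comes out near 0.54 rather than .53; the small difference comes from rounding the logarithms to .6 and 1.6. The ranking of the two attributes is unchanged.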
32 Regarding the Definition of Entropy…
- Textbook page 134 (Eq. 3.6)
- Textbook page 287 (Eq. 7.2)
- Polymorphism: the term "entropy" is used for two different definitions
  - When entropy is defined on tuples, use Eq. 3.6
  - When entropy is defined on an attribute, use Eq. 7.2
33 How well does it work?
- In many case studies, decision trees have been at least as accurate as human experts
  - In a study on diagnosing breast cancer, humans classified the examples correctly 65% of the time; the decision tree classified 72% correctly
  - British Petroleum designed a decision tree for gas-oil separation on offshore oil platforms that replaced an earlier rule-based expert system
  - Cessna designed an airplane flight controller using 90,000 examples and 20 attributes per example
35 Extracting Classification Rules from Trees
- Represent knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
36 Examples of Classification Rules
[Figure: the decision tree for “buy_computer” with tests Age? (<=30, 31…40, >40), Student? (no/yes), and Credit? (fair/excellent)]
Classification rules:
1. IF age = “<=30” AND student = “no” THEN buys_computer = “no”
2. IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
3. IF age = “31…40” THEN buys_computer = “yes”
4. IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
5. IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
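Rule extraction amounts to walking every root-to-leaf path. The sketch below assumes the tree is stored as nested dictionaries of the form {attribute: {value: subtree or label}}; this layout and the names `extract_rules` and `format_rule` are illustrative choices. The tree literal follows the training data, where tuples with age > 40 and excellent credit (P6, P14) do not buy and those with fair credit (P4, P5, P10) do.

```python
def extract_rules(tree, conditions=()):
    """Yield one (conditions, label) pair per root-to-leaf path."""
    if not isinstance(tree, dict):            # a leaf holds the class prediction
        yield list(conditions), tree
        return
    (attr, branches), = tree.items()          # each internal node tests one attribute
    for value, subtree in branches.items():
        yield from extract_rules(subtree, conditions + ((attr, value),))

def format_rule(conditions, label):
    """Render the conjunction of attribute-value tests as an IF-THEN rule."""
    tests = " AND ".join(f'{attr} = "{value}"' for attr, value in conditions)
    return f'IF {tests} THEN buys_computer = "{label}"'

TREE = {
    "age": {
        "<=30": {"student": {"no": "no", "yes": "yes"}},
        "31…40": "yes",
        ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

rules = [format_rule(conds, label) for conds, label in extract_rules(TREE)]
for number, rule in enumerate(rules, start=1):
    print(f"{number}. {rule}")
```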
37 Avoid Over-fitting in Classification
- The generated tree may over-fit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - The result is poor accuracy for unseen samples
- Two approaches to avoiding over-fitting
  - Pre-pruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - It is difficult to choose an appropriate threshold
  - Post-pruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the “best pruned tree”
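One simple form of post-pruning is reduced-error pruning, sketched below. It is only one of several possible variants, not necessarily the one intended by the slide. The node layout (a dict with "attr", "majority", and "branches"), the example tree, and the held-out rows are all hypothetical.

```python
def classify(node, row):
    """Follow the tests down to a leaf; unseen values fall back to the node's majority class."""
    while isinstance(node, dict):
        node = node["branches"].get(row[node["attr"]], node["majority"])
    return node

def errors(node, rows):
    """Number of held-out rows that node misclassifies."""
    return sum(classify(node, r) != r["label"] for r in rows)

def prune(node, rows):
    """Bottom-up pruning; rows are the held-out examples that reach this node.
    Modifies the tree in place and returns the (possibly replaced) node."""
    if not isinstance(node, dict):
        return node
    for value, child in list(node["branches"].items()):
        subset = [r for r in rows if r[node["attr"]] == value]
        node["branches"][value] = prune(child, subset)
    leaf = node["majority"]
    # Replace the subtree by a leaf when that does not increase held-out errors.
    return leaf if errors(leaf, rows) <= errors(node, rows) else node

# Hypothetical over-grown tree: the credit test under student = "no" fits noise.
tree = {"attr": "student", "majority": "yes", "branches": {
    "yes": "yes",
    "no": {"attr": "credit", "majority": "no",
           "branches": {"fair": "no", "excellent": "yes"}},
}}

# Hypothetical held-out data, separate from the training data.
held_out = [
    {"student": "yes", "credit": "fair", "label": "yes"},
    {"student": "yes", "credit": "excellent", "label": "yes"},
    {"student": "no", "credit": "fair", "label": "no"},
    {"student": "no", "credit": "excellent", "label": "no"},
    {"student": "no", "credit": "excellent", "label": "no"},
    {"student": "no", "credit": "excellent", "label": "yes"},
]

pruned = prune(tree, held_out)
print(pruned)
```

In this example the credit subtree is collapsed to the leaf "no" because it makes two errors on the held-out rows while the leaf makes one; the root test is kept because removing it would increase the error count.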
38 Enhancements to basic decision tree induction
- Dynamic discretization for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation (the number of samples at a branch becomes too small to be statistically significant), repetition (an attribute is repeatedly tested along a branch), and replication (duplicate subtrees)
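A common way to discretize a continuous attribute on the fly is to pick the binary threshold with the highest information gain. The sketch below illustrates that idea under stated assumptions: the candidate thresholds are midpoints between consecutive distinct values, and the ages and labels are hypothetical.

```python
import math
from collections import Counter

def info(labels):
    """Entropy (bits) of a list of class labels."""
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) of the best binary split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    base = info(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:                          # no boundary between equal values
            continue
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * info(left) + len(right) * info(right)) / len(pairs)
        if base - weighted > best[1]:
            best = ((lo + hi) / 2, base - weighted)
    return best

# Hypothetical continuous ages with class labels.
ages = [22, 25, 28, 33, 36, 39, 45, 52]
buys = ["no", "no", "no", "yes", "yes", "yes", "yes", "yes"]

threshold, g = best_threshold(ages, buys)
print(threshold, round(g, 3))
```

The chosen threshold defines a new two-valued attribute (age <= threshold or not) that the basic algorithm can then treat as categorical.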
39 Classification in Large Databases
- Classification: a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
- Why decision tree induction in data mining?
  - Relatively fast learning speed (compared with other classification methods)
  - Convertible to simple and easy-to-understand classification rules
  - Can use SQL queries for accessing databases
  - Classification accuracy comparable with other methods
40 Scalable Decision Tree Induction Methods
- SLIQ (EDBT’96, Mehta et al.)
  - Builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB’96, J. Shafer et al.)
  - Constructs an attribute list data structure
- PUBLIC (VLDB’98, Rastogi & Shim)
  - Integrates tree splitting and tree pruning: stops growing the tree earlier
- RainForest (VLDB’98, Gehrke, Ramakrishnan & Ganti)
  - Separates the scalability aspects from the criteria that determine the quality of the tree
  - Builds an AVC-list (attribute, value, class label)
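The idea behind an (attribute, value, class label) list can be illustrated with aggregated counts collected in a single pass over the data. This is a sketch of the general idea only, not RainForest's actual data structure; the function name `avc_counts` and the rows are hypothetical.

```python
from collections import Counter

def avc_counts(rows, attributes, label="label"):
    """One pass over rows; returns counts keyed by (attribute, value, class label)."""
    counts = Counter()
    for row in rows:
        for attr in attributes:
            counts[(attr, row[attr], row[label])] += 1
    return counts

# Hypothetical rows; in a large database these would be streamed from disk.
rows = [
    {"age": "<=30", "student": "no", "label": "no"},
    {"age": "<=30", "student": "yes", "label": "yes"},
    {"age": ">40", "student": "no", "label": "yes"},
    {"age": ">40", "student": "yes", "label": "yes"},
]

counts = avc_counts(iter(rows), ["age", "student"])
for key, n in sorted(counts.items()):
    print(key, n)
```

These counts are all that an information gain calculation needs, so the individual tuples do not have to stay in memory while a split is being chosen.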
41 Summary
- What is a decision tree?
  - A flow-chart-like tree: internal nodes, branches, and leaf nodes
- How to construct a decision tree?
  - What are the major steps in decision tree induction?
    - Test attribute selection
    - Sample partition
  - How to select the attribute to split the node?
    - Select the attribute with the highest information gain
      - Calculate the information of the node
      - Calculate the entropy of the attribute
      - Calculate the difference between the information and the entropy
- What are the other issues?