
1 Inferring Decision Trees Using the Minimum Description Length Principle J. R. Quinlan and R. L. Rivest Information and Computation 80, 227-248, 1989

2 Introduction
- Minimum Description Length Principle: the best theory is the one which minimizes the sum of (1) the length of the theory, and (2) the length of the data when encoded using the theory as a predictor for the data.
- Goal: application of MDLP to the construction of decision trees from data.

3 Example (1)

No.  Outlook   Temperature  Humidity  Windy  Class
1    Sunny     Hot          High      False  N
2    Sunny     Hot          High      True   N
3    Overcast  Hot          High      False  P
4    Rain      Mild         High      False  P
5    Rain      Cool         Normal    False  P
6    Rain      Cool         Normal    True   N
7    Overcast  Cool         Normal    True   P
8    Sunny     Mild         High      False  N
9    Sunny     Cool         Normal    False  P
10   Rain      Mild         Normal    False  P
11   Sunny     Mild         Normal    True   P
12   Overcast  Mild         High      True   P
13   Overcast  Hot          Normal    False  P
14   Rain      Mild         High      True   N

4 Example (2)

5 Best Tree
- The best tree is the one with the smallest error rate when classifying previously unseen objects.
- An imperfect, smaller DT often achieves greater accuracy on new objects than one which perfectly classifies all the known objects: a perfect tree may be overly sensitive to statistical irregularities and idiosyncrasies of the given data set.
- It is generally not possible for a DT inference procedure to explicitly minimize the error rate on new examples, so a number of approximate measures are used; MDLP is one of them.

6 MDLP (1)
- The DT which minimizes this measure (the total description length) is proposed as the "best" DT to infer from the given data.
- Motivating problem: a communication problem based on the given data, with the goal of transmitting the fewest total bits.
- Our communication problem:
  - You and I each have a copy of the data set, but in your copy the last column (the class) is missing.
  - I must send you an exact description of the missing column using as few bits as possible.
  - Simplest technique: one bit per object, i.e. 14 bits for the example data.

7 MDLP (2)
- The more predictable the class of an object is from its attributes, the fewer bits I need to send.
- In general:
  1. Partition the set of objects into a number of subsets based on the attributes of the objects.
  2. Send you a description of this partition.
  3. Send you the most frequent (default) class associated with each subset.
  4. For each subset, send you a description of the exceptions to its default class.
- This pays off when there are few exceptions in each category.

8 MDLP (3)
- A decision tree is a natural and efficient way of partitioning the objects and associating a default class with each category.
- Best DT: the one for which the combined length of the description of the DT, plus the description of the exceptions, is as small as possible.

9 Bayesian Interpretation of the MDLP
- MDLP can be naturally viewed as a Bayesian MAP (maximum a posteriori) estimator.
- Notation:
  - T : a decision tree; t : the length (in bits) of the encoding of T
  - D : the data to be transmitted; d : the length (in bits) of the encoding of D
  - r : a fixed parameter, r > 1, controlling how quickly the probability decreases as the length t of a string increases.

10 Bayesian Interpretation (2)
- Prior probability of binary strings, by length:
  - the empty string has probability (1 - 1/r);
  - each of the strings 0 and 1 has probability (1 - 1/r)(1/2r);
  - in general, each string of length t has probability (1 - 1/r)(1/2r)^t.
- Two fixed parameters r_T > 1 and r_D > 1 play this role for trees and for data, respectively.
- The prior probability of the theory represented by the DT T, whose encoding has length t, is (1 - 1/r_T)(1/2r_T)^t.

11 Bayesian Interpretation (3)
- The probability of the observed data, given the theory: Pr(D | T) = (1 - 1/r_D)(1/2r_D)^d.
- The posterior probability of the theory is proportional to Pr(T) · Pr(D | T).

12 Bayesian Interpretation (4)
- The tree which minimizes t·c_T + d·c_D has maximum posterior probability, where c_T = lg(2 r_T) and c_D = lg(2 r_D) (see the reconstruction below).
- With r_T = r_D = 2, c_T = c_D = 2, so this is equivalent to minimizing t + d.
- If r_T is large, large trees T are penalized heavily, so a more compact tree will have maximum posterior probability.
- If r_D is large, exceptions are penalized heavily, so a large tree, which explains the given data most accurately, is likely to result.
- In what follows we assume r_T = r_D, so that c_T = c_D.
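The algebra connecting the string prior to the t·c_T + d·c_D objective is not spelled out in the transcript; the following is a hedged reconstruction in the slides' own notation (lg denotes log base 2):

```latex
% Length-based prior on trees and conditional probability of the data:
\Pr(T) = \Bigl(1-\tfrac{1}{r_T}\Bigr)\Bigl(\tfrac{1}{2r_T}\Bigr)^{t},
\qquad
\Pr(D \mid T) = \Bigl(1-\tfrac{1}{r_D}\Bigr)\Bigl(\tfrac{1}{2r_D}\Bigr)^{d}.

% Taking -lg of the unnormalized posterior \Pr(T)\,\Pr(D \mid T):
-\lg\bigl[\Pr(T)\,\Pr(D \mid T)\bigr]
   = t\,\lg(2r_T) + d\,\lg(2r_D) + \text{const}
   = t\,c_T + d\,c_D + \text{const},
\qquad c_T = \lg(2r_T),\; c_D = \lg(2r_D).

% With r_T = r_D = 2: c_T = c_D = 2, so the MAP tree minimizes t + d.
```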

13 Coding Strings of 0's and 1's
- Notation:
  - n : the length of the string
  - k : the number of 1 symbols; (n - k) : the number of 0 symbols
  - b : a known a priori upper bound on k (k ≤ b), e.g. b = n or b = (n + 1)/2
- The procedure:
  1. First transmit the value of k, using lg(b + 1) bits.
  2. Given k, there are only C(n, k) possible strings, so identifying the actual string requires lg C(n, k) bits.

14 Coding Strings (2)
- Total cost: L(n, k, b) = lg(b + 1) + lg C(n, k), a standard measure of the complexity of a binary string of length n containing exactly k 1's, where k ≤ b.
- Table I example: the class string N,N,P,P,P,N,P,N,P,P,P,P,P,N gives L(14, 9, 14) = lg(15) + lg(2002) = 14.874 bits (checked in the snippet below).
- L(n, k, b) can be approximated using Stirling's formula.
- The cost does not depend on the positions of the 1's, only on how many there are.
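A small Python check of this cost function (the function name L and the use of math.comb are mine; the formula and the 14.874 figure are from the slide):

```python
from math import comb, log2

def L(n, k, b):
    """L(n, k, b) = lg(b + 1) + lg C(n, k): bits to send k (bounded by b)
    and then identify which of the C(n, k) strings with k ones occurred."""
    return log2(b + 1) + log2(comb(n, k))

# Class column of Table I: nine P's among fourteen objects, bound b = n = 14.
print(round(L(14, 9, 14), 3))   # 14.874 bits
```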

15 Coding Strings (3)
- Quinlan's heuristic: the information content of a string of length n containing k P's is nH(k/n), where H is the binary entropy function.
- This is an under-approximation to L(n, k, b) and may result in a large decision tree.
- Generalization to several classes: k = k_1 + k_2 + ... + k_t.

16 Coding Sets of Strings
- Example (Table I), splitting on the attribute "Humidity":
  - High-humidity objects: N, N, P, P, N, P, N
  - Normal-humidity objects: P, N, P, P, P, P, P
- Coding the exceptions in each subset costs L(7, 3, 3) + L(7, 1, 3) = 11.937 bits, less than the 14 bits needed without using the attribute.
- The saving shows there is some relationship between the attribute and the class, but we still need to include the cost of describing the decision tree (the split) itself.
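The same check for the humidity split, written so the snippet stands alone:

```python
from math import comb, log2

L = lambda n, k, b: log2(b + 1) + log2(comb(n, k))   # same cost function as above

# High humidity: 7 objects, default class N, 3 exceptions (bound b = 3).
# Normal humidity: 7 objects, default class P, 1 exception (bound b = 3).
print(round(L(7, 3, 3) + L(7, 1, 3), 3))   # 11.937 bits, versus 14 bits unsplit
```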

17 Coding Decision Trees (1)
- The coding scheme gives smaller DTs shorter codewords than larger DTs.
- It is a recursive, top-down, depth-first procedure:
  - A leaf is encoded as "0" followed by an encoding of the default class for that leaf.
  - A tree which is not a leaf is encoded as "1", followed by the code for the attribute at its root, followed by the encodings of its subtrees, in order.

18 Coding Decision Trees (2)
- The example tree is encoded as: 1 Outlook 1 Humidity 0 N 0 P 0 P 1 Windy 0 N 0 P
- "Outlook" requires 2 bits (selecting the first attribute out of four).
- "Humidity" requires lg(3) bits (only three attributes remain), and likewise "Windy".
- Total for the example tree: 8 structure bits + 2 + 2·lg(3) attribute bits + 5 leaf-class bits = 18.170 bits (reproduced in the sketch below).
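A minimal Python sketch, not from the paper, that reproduces this cost for the example tree; the tuple representation of nodes, the function name tree_cost, and the attribute list are illustrative assumptions:

```python
from math import log2

def tree_cost(node, eligible, n_classes=2):
    """Bits to encode a (sub)tree under the scheme above: a leaf costs one
    structure bit plus lg(#classes) bits for its default class; a decision
    node costs one structure bit, plus lg(#eligible attributes) bits to name
    its attribute, plus the cost of its subtrees in order."""
    kind, label, children = node
    if kind == "leaf":
        return 1 + log2(n_classes)
    remaining = [a for a in eligible if a != label]
    return (1 + log2(len(eligible))
            + sum(tree_cost(child, remaining, n_classes) for child in children))

leaf = lambda cls: ("leaf", cls, [])

# The example tree: Outlook at the root; Humidity under "sunny",
# a P leaf under "overcast", Windy under "rain".
example = ("node", "Outlook", [
    ("node", "Humidity", [leaf("N"), leaf("P")]),
    leaf("P"),
    ("node", "Windy", [leaf("N"), leaf("P")]),
])

print(round(tree_cost(example, ["Outlook", "Temperature", "Humidity", "Windy"]), 3))  # ~18.170 bits
```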

19 Coding Decision Trees (3)
- For a uniform b-ary tree with n decision nodes and (b - 1)n + 1 leaves, the structure alone takes bn + 1 bits under this scheme.
- The number of b-ary trees with n internal nodes and (b - 1)n + 1 leaves is far smaller than 2^(bn+1) (it is the Fuss-Catalan number (1/(bn+1)) C(bn+1, n)), so the proposed coding scheme is not efficient for high-arity trees (compare the numbers in the sketch below).
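To see the inefficiency concretely, a short Python sketch comparing the bn + 1 structure bits with the information-theoretic minimum, lg of the number of tree shapes, assuming the Fuss-Catalan count mentioned above:

```python
from math import comb, log2

def fuss_catalan(b, n):
    """Number of b-ary trees with n internal nodes (Fuss-Catalan number)."""
    return comb(b * n + 1, n) // (b * n + 1)

b, n = 10, 5
structure_bits = b * n + 1                     # one marker bit per node in the scheme above
minimum_bits = log2(fuss_catalan(b, n))        # lg of the number of distinct tree shapes
print(structure_bits, round(minimum_bits, 1))  # 51 vs ~15.5
```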

20 Coding Decision Trees (4)
- Cost of describing the structure of the tree: suppose the tree has n nodes in total, of which k are decision nodes and n - k are leaves. The depth-first string of node/leaf markers can be coded at cost L(n, k, (n + 1)/2), since k < n - k when every test has arity at least two.
- Total tree description cost: add the cost of specifying the attribute name at each decision node and the cost of specifying the default class at each leaf.

21 Coding Exceptions
- The example tree partitions the objects into five subsets:
  - Sunny outlook & high humidity: N, N, N
  - Sunny outlook & normal humidity: P, P
  - Overcast outlook: P, P, P, P
  - Rainy outlook & windy: N, N
  - Rainy outlook & not windy: P, P, P
- All five subsets are pure, so the exceptions can be encoded at cost L(3, 0, 1) + L(2, 0, 1) + L(4, 0, 2) + L(2, 0, 1) + L(3, 0, 1) = 5.585 bits.
- Total cost for the communication problem: 18.170 + 5.585 = 23.755 bits.
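Checking the exception cost and the grand total with the same L function (the choice b = n // 2 for each subset is inferred from the figures on the slide):

```python
from math import comb, log2

L = lambda n, k, b: log2(b + 1) + log2(comb(n, k))   # same cost function as above

# The five subsets are pure, so each has k = 0 exceptions; bound b = n // 2 per subset.
sizes = [3, 2, 4, 2, 3]
exception_bits = sum(L(n, 0, n // 2) for n in sizes)
print(round(exception_bits, 3))            # 5.585 bits
print(round(18.170 + exception_bits, 3))   # 23.755 bits in total
```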

22 Coding Real-Valued Attributes
- Two approaches to finding a good "cut point":
  1. Use the values of the known objects: suppose that for the chosen attribute the n given objects take m ≤ n distinct values. Sort these m real values; a cut point can then be specified by its index i, using lg(m) bits.
  2. Use compactly described rational numbers as cut points.

23 Computing Good Decision Trees
- Cost of replacing a leaf with a decision node (sketched in code below):
  - Suppose there are A attributes in total and we replace a leaf at depth d with a decision node.
  - There are d' ≤ d attributes already tested on the path from the root to this leaf, so A - d' attributes are eligible; indicating which one is selected requires lg(A - d') bits.
  - If the selected attribute has v values, describing the additional tree structure costs 2v - 1 bits (one subset is split into v subsets).
  - Against this cost, measure the extent to which the exceptions can now be coded more efficiently.
  - If the savings exceed the cost of extending the tree, the extension is worthwhile.
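A hedged Python sketch of this local cost/benefit test. The function and variable names, and the use of b = n // 2 inside the exception cost, are illustrative assumptions rather than the paper's exact formulation:

```python
from math import comb, log2
from collections import Counter

def L(n, k, b):
    """Bits to code a length-n binary string with k ones, k bounded by b."""
    return log2(b + 1) + log2(comb(n, k))

def exception_cost(labels):
    """Cost of coding a subset's classes as a default class plus exceptions."""
    n = len(labels)
    if n == 0:
        return 0.0
    k = n - Counter(labels).most_common(1)[0][1]   # objects not in the majority class
    return L(n, k, n // 2)

def split_saving(labels, partitions, n_attrs, d_prime):
    """Net saving (bits) from replacing a leaf holding `labels` with a decision
    node splitting them into `partitions`; negative means the split costs more
    than it saves under this simplified accounting."""
    v = len(partitions)
    extension = log2(n_attrs - d_prime) + (2 * v - 1)   # attribute name + extra structure
    return exception_cost(labels) - sum(map(exception_cost, partitions)) - extension

# Illustration: a candidate split of the full class column on Humidity
# (A = 4 attributes, d' = 0 attributes tested so far).
high   = list("NNPPNPN")    # classes of the high-humidity objects
normal = list("PNPPPPP")    # classes of the normal-humidity objects
print(round(split_saving(high + normal, [high, normal], n_attrs=4, d_prime=0), 3))
```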

24 Two-Phase Process
- First phase: begin with a single leaf and repeatedly extend the tree until it is perfect or cannot be grown any further:
  1. Let x be a leaf whose corresponding category contains objects of more than one class and which can be replaced by a decision node.
  2. For each eligible attribute A, compute the total communication cost if x is split on A.
  3. Replace x with the decision node giving the least total communication cost.
- Second phase: the tree is repeatedly pruned back by replacing decision nodes with leaves, whenever this improves the total communication cost, until no further improvement is possible.

25 Attributes with a Large Number of Values
- An attribute with a large number of values is automatically penalized: it splits the set of objects into many subsets (e.g. the object number in the example table), so the cost of specifying the attribute and the resulting structure is not justified by the extra compression achieved in transmitting the class information.

26 Empirical Results
- Comparison with C4. MDLP provides a unified framework for both growing and pruning the decision tree.

Data set     MDLP                 C4
             Size   Error rate    Size   Error rate
Hypo         11     0.6%          11.0   0.55%
Discordant   15     1.9%          13.6   1.25%
LED          83     26.9%         56.0   28.1%
Credit       14     17.4%         32.5   16.1%
Endgame      15     17.9%         62.6   13.6%
Prob-Disj    17     20.5%         42.6   14.9%

27 Extensions (1)
- In the presence of noisy data, the significance of the existing dependencies of the class on the attributes is masked, so as the noise level increases the tree grown should decrease in size.
- To handle "training sets" which are especially representative of the concept being learned, associate a "frequency count" larger than one with each input object.
  - If the DT is required to classify every object correctly, these counts can be set to a large value.
  - The saving that can be realized by using a perfect DT increases as the counts increase.

28 Extensions (2)
- Experiment: replicate the data set some number c of times.
- If the communication is separated into t "tree bits" and d "data bits", the total cost becomes t + cd.
- c reflects our a priori understanding of the representativeness or completeness of the given data set.

29 Extensions (3)

Data set     c = 1               c = 2               c = 8
             Size   Error rate   Size   Error rate   Size   Error rate
Hypo         11     0.6%         15     0.6%         19     0.5%
Discordant   15     1.9%         23     1.8%         37     1.1%
LED          83     26.9%        93     27.0%        95     27.0%
Credit       14     17.4%        11     13.5%        76     17.0%
Endgame      15     17.9%        35     11.5%        71     12.0%
Prob-Disj    17     20.5%        45     13.5%        69     13.0%

