Learning, page 19 CSI 4106, Winter 2005 Learning decision trees A concept can be represented as a decision tree, built from examples, as in this problem of estimating credit risk by considering four features of a potential creditor. Such data can be derived from the history of credit applications.
Learning decision trees (2) At every level, one feature value is selected.
Learning decision trees (3) Usually many decision trees are possible, with varying average cost of classification. Not all features must be included.
Learning decision trees (4)
The ID3 algorithm (its latest industrial-strength implementation is called C5.0)
1. If all examples are in the same class, build a leaf with this class. (If, for example, we have no historical data that record low or moderate risk, we can only learn that everything is high-risk.)
2. Otherwise, if no more features can be used, build a leaf with a disjunction of the classes of the examples. (We might have data that only allow us to distinguish low risk from high and moderate risk.)
3. Otherwise, select a feature for the root; partition the examples on the values of this feature; recursively build the decision trees for all partitions; attach them to the root.
This is a greedy algorithm: a form of hill climbing.
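The three cases of the algorithm can be sketched in a few lines of Python. This is a minimal illustration, not the C5.0 implementation: the function name `build_tree`, the nested-dictionary tree representation, the optional `choose` callback, and the toy data are all assumptions made for this example. Without a `choose` callback the sketch simply takes the first remaining feature.

```python
from collections import defaultdict

def build_tree(examples, features, choose=None):
    """examples: list of (feature_dict, class_label); features: list of names.

    Returns a class label (leaf) or a dict {feature: {value: subtree}}.
    """
    classes = {label for _, label in examples}
    # Case 1: all examples are in the same class -> leaf with this class.
    if len(classes) == 1:
        return classes.pop()
    # Case 2: no more features -> leaf with a disjunction of the classes.
    if not features:
        return " or ".join(sorted(classes))
    # Case 3: select a feature, partition the examples on its values, recurse.
    # Assumption: with no selection function we take the first feature.
    feature = choose(examples, features) if choose else features[0]
    remaining = [f for f in features if f != feature]
    partitions = defaultdict(list)
    for attrs, label in examples:
        partitions[attrs[feature]].append((attrs, label))
    return {feature: {value: build_tree(part, remaining, choose)
                      for value, part in partitions.items()}}

# Hypothetical toy data (not the credit table from the slides).
toy = [
    ({"income": "low", "debt": "high"}, "high"),
    ({"income": "low", "debt": "low"}, "high"),
    ({"income": "high", "debt": "high"}, "moderate"),
    ({"income": "high", "debt": "low"}, "low"),
    ({"income": "high", "debt": "low"}, "moderate"),
]
tree = build_tree(toy, ["income", "debt"])
print(tree)
```

Passing a `choose` function is where a selection criterion would plug in; the greedy choice is made once per node and never revisited.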
Learning decision trees (5) Two partially constructed decision trees.
Learning decision trees (6)
We saw that the same data can be turned into different trees. The question is which trees are better. Essentially, the choice of the feature for the root is important. We want to select a feature that gives the most information.
Information in a set of disjoint classes $C = \{c_1, \ldots, c_n\}$ is defined by this formula:
$$I(C) = -\sum_{i=1}^{n} p(c_i) \log_2 p(c_i)$$
where $p(c_i)$ is the probability that an example is in class $c_i$. The information is measured in bits.
Learning decision trees (7)
Let us consider our credit risk data. There are 14 examples in three classes: 6 examples have high risk, 3 have moderate risk, 5 have low risk. Assuming every example is equally likely, the class probabilities are as follows:
p(high) = 6/14, p(moderate) = 3/14, p(low) = 5/14
Information contained in this partition:
$$I(\mathrm{RISK}) = -\tfrac{6}{14}\log_2\tfrac{6}{14} - \tfrac{3}{14}\log_2\tfrac{3}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 1.531 \text{ bits}$$
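The information formula is easy to check numerically. The sketch below assumes a helper named `information` (a name chosen for this example) that takes a list of class probabilities, and applies it to the class counts 6, 3 and 5 out of 14 given above.

```python
from math import log2

def information(probabilities):
    """Sum of -p * log2(p) over the class probabilities, in bits.

    Classes with probability 0 contribute nothing and are skipped.
    """
    return sum(-p * log2(p) for p in probabilities if p > 0)

# Credit risk classes: 6 high, 3 moderate, 5 low out of 14 examples.
i_risk = information([6 / 14, 3 / 14, 5 / 14])
print(round(i_risk, 3))

# Two sanity checks: two equally likely classes carry 1 bit, one class carries 0.
print(information([0.5, 0.5]))
print(information([1.0]))
```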
Learning decision trees (8)
Let feature F be at the root, and let $e_1, \ldots, e_m$ be the partitions of the examples on this feature. Information needed to build a tree for partition $e_i$ is $I(e_i)$. Expected information needed to build the whole tree is a weighted average of the $I(e_i)$.
Let $|s|$ be the cardinality of set $s$, and let $\bigcup_j e_j$ be the union of all partitions, that is, the set of all examples. Expected information is defined by this formula:
$$E(F) = \sum_{i=1}^{m} \frac{|e_i|}{\left|\bigcup_j e_j\right|} \, I(e_i)$$
Learning decision trees (9)
In our data, there are three partitions based on income:
$e_1 = \{1, 4, 7, 11\}$, $|e_1| = 4$, $I(e_1) = 0.0$. All examples have high risk, so $I(e_1) = -1 \log_2 1$.
$e_2 = \{2, 3, 12, 14\}$, $|e_2| = 4$, $I(e_2) = 1.0$. Two examples have high risk, two have moderate: $I(e_2) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2}$.
$e_3 = \{5, 6, 8, 9, 10, 13\}$, $|e_3| = 6$, $I(e_3) \approx 0.65$, since $I(e_3) = -\tfrac{1}{6}\log_2\tfrac{1}{6} - \tfrac{5}{6}\log_2\tfrac{5}{6}$.
The expected information to complete the tree using income as the root feature is this:
$$E(\mathrm{INCOME}) = \tfrac{4}{14} \cdot 0.0 + \tfrac{4}{14} \cdot 1.0 + \tfrac{6}{14} \cdot 0.65 \approx 0.564 \text{ bits}$$
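The weighted average can be reproduced from the per-partition class counts alone. In this sketch each partition is written as a list of class counts taken from the text (4 of one class; 2 and 2; 1 and 5); the function names are assumptions made for the example.

```python
from math import log2

def information(counts):
    """Information, in bits, of a partition given its class counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def expected_information(partitions):
    """Weighted average of I(e_i), each weighted by |e_i| / (number of examples)."""
    n = sum(sum(counts) for counts in partitions)
    return sum(sum(counts) / n * information(counts) for counts in partitions)

# Income partitions e1, e2, e3 as class counts.
income_partitions = [[4], [2, 2], [1, 5]]
e_income = expected_information(income_partitions)
print(round(information([1, 5]), 2))  # I(e3)
print(round(e_income, 3))             # E(INCOME)
```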
Learning decision trees (10)
Now we define the information gain from selecting feature F for tree-building, given a set of classes C:
$$G(F) = I(C) - E(F)$$
For our sample data and for F = income, we get this:
G(INCOME) = I(RISK) - E(INCOME) ≈ 1.531 - 0.564 bits = 0.967 bits.
Our analysis will be complete, and our choice clear, after we have similarly considered the remaining three features. The values are as follows: G(COLLATERAL) ≈ bits, G(DEBT) ≈ bits, G(CREDIT HISTORY) ≈ bits. Each is smaller than G(INCOME). That is, we should choose INCOME as the criterion in the root of the best decision tree that we can construct.
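As a rough check of the gain for income, the sketch below computes G(F) directly from (feature value, class) pairs. The partition labels "e1", "e2", "e3" stand in for the income brackets, which are not named in the text. The class make-up of e3 (1 moderate, 5 low) is inferred from the overall counts of 6 high, 3 moderate and 5 low. The function names are assumptions made for this example.

```python
from collections import Counter, defaultdict
from math import log2

def information(labels):
    """Information, in bits, of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def gain(pairs):
    """G(F) = I(C) - E(F) for a list of (feature_value, class_label) pairs."""
    groups = defaultdict(list)
    for value, label in pairs:
        groups[value].append(label)
    expected = sum(len(g) / len(pairs) * information(g) for g in groups.values())
    return information([label for _, label in pairs]) - expected

# Income partitions: e1 = 4 high; e2 = 2 high, 2 moderate; e3 = 1 moderate, 5 low.
income = ([("e1", "high")] * 4
          + [("e2", "high")] * 2 + [("e2", "moderate")] * 2
          + [("e3", "moderate")] * 1 + [("e3", "low")] * 5)
g_income = gain(income)
print(round(g_income, 3))
```

Computed without rounding the intermediate values, the gain comes out at about 0.966 bits; subtracting the rounded figures 1.531 and 0.564 gives 0.967. Running `gain` once per feature and taking the maximum is one way to fill in the "select a feature" step of ID3.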
Explanation-based learning
A target concept: The learning system finds an “operational” definition of this concept, expressed in terms of some primitives. The target concept is represented as a predicate.
A training example: This is an instance of the target concept. It takes the form of a set of simple facts, not all of them necessarily relevant to the theory.
A domain theory: This is a set of rules, usually in predicate logic, that can explain how the training example fits the target concept.
Operationality criteria: These are the predicates (features) that should appear in an effective definition of the target concept.
Explanation-based learning (2)
A classic example: a theory and an instance of a cup. A cup is a container for liquids that can be easily lifted. It has some typical parts, such as a handle and a bowl. Bowls, the actual containers, must be concave. Because a cup can be lifted, it should be light. And so on.
The target concept is cup( X ). The domain theory has five rules.
liftable( X ) ∧ holds_liquid( X ) → cup( X )
part( Z, W ) ∧ concave( W ) ∧ points_up( W ) → holds_liquid( Z )
light( X ) ∧ part( X, handle ) → liftable( X )
small( A ) → light( A )
made_of( A, feathers ) → light( A )
Explanation-based learning (3)
The training example lists nine facts (some of them are not relevant).
cup( obj1 )
small( obj1 )
part( obj1, handle )
owns( bob, obj1 )
part( obj1, bottom )
part( obj1, bowl )
points_up( bowl )
concave( bowl )
color( obj1, red )
Operationality criteria require a definition in terms of structural properties of objects (part, points_up, small, concave).
Explanation-based learning (4) Step 1: prove the target concept using the training example
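Step 1 can be illustrated with a small backward chainer over the cup theory. This is only a sketch under stated assumptions: atoms are tuples, strings starting with an uppercase letter are variables, and the helper names (`unify`, `prove`, `provable`) are invented for this example. The fact cup( obj1 ) is deliberately left out of the fact list, since it is the goal to be proved.

```python
import itertools

# Domain theory as (head, body) pairs; uppercase strings are variables.
RULES = [
    (("cup", "X"), [("liftable", "X"), ("holds_liquid", "X")]),
    (("holds_liquid", "Z"), [("part", "Z", "W"), ("concave", "W"), ("points_up", "W")]),
    (("liftable", "X"), [("light", "X"), ("part", "X", "handle")]),
    (("light", "A"), [("small", "A")]),
    (("light", "A"), [("made_of", "A", "feathers")]),
]
# Training example without cup(obj1), which is what we want to prove.
FACTS = [
    ("small", "obj1"), ("part", "obj1", "handle"), ("owns", "bob", "obj1"),
    ("part", "obj1", "bottom"), ("part", "obj1", "bowl"),
    ("points_up", "bowl"), ("concave", "bowl"), ("color", "obj1", "red"),
]
_fresh = itertools.count()

def is_var(term):
    return term[:1].isupper()

def walk(term, subst):
    # Follow variable bindings until a constant or an unbound variable.
    while is_var(term) and term in subst:
        term = subst[term]
    return term

def unify(a, b, subst):
    """Return an extended copy of subst that makes atoms a and b equal, or None."""
    if a[0] != b[0] or len(a) != len(b):
        return None
    subst = dict(subst)
    for x, y in zip(a[1:], b[1:]):
        x, y = walk(x, subst), walk(y, subst)
        if x == y:
            continue
        if is_var(x):
            subst[x] = y
        elif is_var(y):
            subst[y] = x
        else:
            return None
    return subst

def rename(rule):
    # Give the rule's variables fresh names so separate uses do not clash.
    n = next(_fresh)
    def ren(atom):
        return (atom[0],) + tuple(f"{t}_{n}" if is_var(t) else t for t in atom[1:])
    return ren(rule[0]), [ren(atom) for atom in rule[1]]

def prove(goals, subst, facts, rules=RULES):
    """Depth-first backward chaining; yields one substitution per proof found."""
    if not goals:
        yield subst
        return
    first, rest = goals[0], goals[1:]
    for fact in facts:
        extended = unify(first, fact, subst)
        if extended is not None:
            yield from prove(rest, extended, facts, rules)
    for rule in rules:
        head, body = rename(rule)
        extended = unify(first, head, subst)
        if extended is not None:
            yield from prove(body + rest, extended, facts, rules)

def provable(goal, facts):
    return any(True for _ in prove([goal], {}, facts))

print(provable(("cup", "obj1"), FACTS))
```

The proof uses only small, part, concave and points_up facts; owns, color and part( obj1, bottom ) play no role, which matches the remark that some facts are not relevant.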
Explanation-based learning (5) Step 2: generalize the proof. Constants from the domain theory, for example handle, are not generalized.
Explanation-based learning (6)
Step 3: Take the definition “off the tree”, only the root and the leaves.
In our example, we get this rule:
small( X ) ∧ part( X, handle ) ∧ part( X, W ) ∧ concave( W ) ∧ points_up( W ) → cup( X )
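Because the learned rule mentions only operational predicates, it can be tested directly against a set of facts, with no chaining through the domain theory. The sketch below shows one way to do that; the function name `is_cup`, the tuple encoding of facts, and the second object `obj2` are assumptions made for this example.

```python
def is_cup(x, facts):
    """Check small(X), part(X, handle), part(X, W), concave(W), points_up(W)."""
    if ("small", x) not in facts or ("part", x, "handle") not in facts:
        return False
    # Candidate bindings for W: every part of x.
    parts = [f[2] for f in facts if f[0] == "part" and f[1] == x]
    return any(("concave", w) in facts and ("points_up", w) in facts
               for w in parts)

facts = {
    ("small", "obj1"), ("part", "obj1", "handle"), ("owns", "bob", "obj1"),
    ("part", "obj1", "bottom"), ("part", "obj1", "bowl"),
    ("points_up", "bowl"), ("concave", "bowl"), ("color", "obj1", "red"),
    # Hypothetical second object: small, with a bowl but no handle.
    ("small", "obj2"), ("part", "obj2", "bowl"),
}
print(is_cup("obj1", facts))
print(is_cup("obj2", facts))
```

Note that handle stays a constant in the rule while obj1 and bowl have been replaced by the variables X and W, in line with step 2.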