Decision Trees and Rule Induction


Decision Trees and Rule Induction Kurt Driessens with slides stolen from Evgueni Smirnov and Hendrik Blockeel

Overview. Concepts, instances, hypothesis space; decision trees; decision rules.

Concepts - Classes

Instances & Representation. How to represent information about instances: attribute-value pairs, where values can be symbolic or numeric. Example instance 1: head = triangle, body = round, color = blue, legs = short, holding = balloon, smiling = false. Example instance 2: head = round, body = square, color = red, legs = long, holding = knife, smiling = true.

More Advanced Representations. Sequences: DNA, stock market, patient evolution. Structures: graphs (computer networks, Internet sites), trees (HTML/XML documents, natural language). Relational databases: molecules, complex problems. In this course: attribute-value.

Hypothesis Space H

Learning task H

Induction of decision trees. What are decision trees? How can they be induced automatically? Topics: top-down induction of decision trees, avoiding overfitting, a few extensions.

What are decision trees? Cf. guessing a person using only yes/no questions: ask some question; depending on the answer, ask a new question; continue until the answer is known. A decision tree tells you which question to ask, depending on the outcome of previous questions, and gives you the answer in the end. It is usually not used for guessing an individual, but for predicting some property (e.g., classification).

Example decision tree 1. Play tennis or not? (depending on weather conditions). Each internal node tests an attribute, each branch corresponds to an attribute value, and each leaf assigns a classification:
Outlook = Sunny: test Humidity (High: No, Normal: Yes)
Outlook = Overcast: Yes
Outlook = Rainy: test Wind (Strong: No, Weak: Yes)

Example decision tree 2. Tree for predicting whether a C-section is necessary. Leaves are not pure here; the ratio of positives to negatives is given instead, e.g. [3+, 29-] = .11+ .89-, [8+, 22-] = .27+ .73-, [55+, 35-] = .61+ .39-. Internal nodes test attributes such as Fetal_Presentation, Previous_Csection and Primiparous.

Representation power. Trees can represent any Boolean function, i.e., also disjunctive concepts (in contrast with VS: conjunctive concepts). E.g., "A or B" becomes a tree that tests A (true: pos) and, on the false branch, tests B. Trees can also allow for noise: non-pure leaves give posterior class probabilities.

Classification, Regression and Clustering. Classification trees represent a function X -> C with C discrete (like the decision trees we just saw); hence they can be used for concept learning. Regression trees predict numbers in the leaves; they can use a constant (e.g., the mean), a linear regression model, or ... Clustering trees just group examples in the leaves. Most (but not all) decision tree research in data mining focuses on classification trees.

Top-Down Induction of Decision Trees. Basic algorithm for TDIDT (based on ID3; a more formal version follows later): start with the full data set; find the test that partitions the examples as well as possible (examples with the same class, or otherwise similar examples, are put together); for each outcome of the test, create a child node; move the examples to the children according to the outcome of the test; repeat the procedure for each child that is not "pure". Main questions: how to decide which test is "best", and when to stop the procedure.

Example problem. Is this drink going to make me ill, or not?

Data set: 8 classified instances

Observation 1: Shape is important

Observation 2: For some shapes, Colour is important

The decision tree. Shape at the root; for the shapes where it matters, a second test on Colour (orange vs. non-orange).

Finding the best test (for classification). Find the test for which the children are as "pure" as possible. The purity measure is borrowed from information theory: entropy, a measure of "missing information", related to the minimum number of bits needed to represent the missing information. Given a set S with instances belonging to class i with probability p_i:
Entropy(S) = - Σ_i p_i log2(p_i)

Entropy. Entropy as a function of p, for 2 classes: the curve is 0 at p = 0 and p = 1 and reaches its maximum of 1 bit at p = 0.5.

Information gain. Heuristic for choosing a test in a node: choose the test that, on average, provides the most information about the class; this is the test that, on average, reduces the class entropy the most. The entropy reduction differs according to the outcome of the test, so we use the expected reduction of entropy, called the information gain: Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) Entropy(S_v), with v ranging over the outcomes of test A.

Example. Assume S has 9 + and 5 - examples (E = 0.940); partition according to the Wind or Humidity attribute.
Humidity: High gives S: [3+, 4-] with E = 0.985; Normal gives S: [6+, 1-] with E = 0.592.
Wind: Weak gives S: [6+, 2-] with E = 0.811; Strong gives S: [3+, 3-] with E = 1.0.
Gain(S, Humidity) = .940 - (7/14) .985 - (7/14) .592 = 0.151
Gain(S, Wind) = .940 - (8/14) .811 - (6/14) 1.0 = 0.048
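As a small illustration (not part of the original slides), the sketch below reproduces these numbers directly from the class counts; the splits [3+, 4-] / [6+, 1-] for Humidity and [6+, 2-] / [3+, 3-] for Wind are taken from the slide above.

import math

def entropy(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Expected entropy reduction when a set with parent_counts is split into child_counts."""
    n = sum(parent_counts)
    expected = sum(sum(child) / n * entropy(child) for child in child_counts)
    return entropy(parent_counts) - expected

print(f"Entropy(S)        = {entropy([9, 5]):.3f}")                             # 0.940
print(f"Gain(S, Humidity) = {information_gain([9, 5], [[3, 4], [6, 1]]):.3f}")  # 0.152 (the slide's 0.151 uses rounded intermediate entropies)
print(f"Gain(S, Wind)     = {information_gain([9, 5], [[6, 2], [3, 3]]):.3f}")  # 0.048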

Hypothesis space search in TDIDT. The hypothesis space H is the set of all trees. H is searched in a hill-climbing fashion, from simple to complex; a single tree is maintained; there is no backtracking.

Inductive bias in TDIDT. Note: for, e.g., Boolean attributes, H is complete: each concept can be represented! Given n attributes, we can keep on adding tests until all attributes have been tested. So what about the inductive bias? There is clearly no "restriction bias", but there is a preference bias: some hypotheses in H are preferred over others. In this case: a preference for short trees with informative attributes at the top.

Occam’s Razor Preference for simple models over complex models is quite generally used in data mining Similar principle in science: Occam’s Razor roughly: do not make things more complicated than necessary Reasoning, in the case of decision trees: more complex trees have higher probability of overfitting the data set

Avoiding Overfitting. The phenomenon of overfitting: by making a model more and more complicated, we keep improving it on the training set, but this increases the risk of modeling noise and coincidences in the data set and may actually harm the predictive power of the theory on unseen cases. Cf. fitting a curve with too many parameters.

Overfitting: example. (Scatter plot of + and - examples; an overly specific decision boundary carves out an area with probably wrong predictions.)

Overfitting: effect on predictive accuracy. Typical phenomenon when overfitting: accuracy on the training data keeps increasing, while accuracy on an unseen validation set starts decreasing. (Plot: accuracy versus size of tree, with one curve for accuracy on training data and one for accuracy on unseen data; overfitting starts about where the unseen-data curve turns down.)

How to avoid overfitting? Option 1: stop adding nodes to the tree when overfitting starts occurring; this requires a stopping criterion. Option 2: do not bother about overfitting while growing the tree; after the tree has been built, prune it back again.

Stopping criteria. How do we know when overfitting starts? (1) Use a validation set, i.e. data not considered when choosing the best test: when accuracy goes down on the validation set, stop adding nodes to this branch. (2) Use a statistical significance test: is the change in class distribution significant? (χ²-test) In other words: does the test yield a clearly better situation? (3) MDL: the minimal description length principle; an entirely correct theory = tree + corrections for misclassifications; minimize size(theory) = size(tree) + size(misclassifications(tree)). Cf. Occam's razor.

Post-pruning trees. After learning the tree, start pruning branches away. For all nodes in the tree: estimate the effect of pruning the tree at this node on the predictive accuracy, e.g. on a validation set; prune the node that gives the greatest improvement; continue until there are no more improvements. This constitutes a second search in the hypothesis space.

Reduced Error Pruning. (Plot: accuracy versus size of tree, for training data and unseen data, with the effect of pruning indicated.)

Turning trees into rules. From a tree, a rule set can be derived: each path from root to leaf in the tree yields one if-then rule. Advantages of such rule sets: they may increase comprehensibility (a disjunctive concept definition), and they can be pruned more flexibly: in one rule, a single condition can be removed (vs. a tree, where removing a node removes the whole subtree), and one rule can be removed entirely.

Rules from trees: example. For the play-tennis tree shown earlier (Outlook, Humidity, Wind):
if Outlook = Sunny and Humidity = High then No
if Outlook = Sunny and Humidity = Normal then Yes
...
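A hedged sketch (not from the slides) of this root-to-leaf conversion, assuming the tree is stored as nested dicts of the form {attribute: {value: subtree-or-leaf}}; run on the play-tennis tree, it prints the two rules listed above plus the remaining three.

# Tree as nested dicts: {attribute: {value: subtree_or_leaf}}
play_tennis = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rainy": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def tree_to_rules(tree, conditions=()):
    """Each root-to-leaf path becomes one if-then rule."""
    if not isinstance(tree, dict):  # leaf: emit the accumulated rule
        yield (list(conditions), tree)
        return
    (attribute, branches), = tree.items()
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))

for conds, label in tree_to_rules(play_tennis):
    body = " and ".join(f"{a} = {v}" for a, v in conds)
    print(f"if {body} then {label}")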

Pruning rules. Possible method: convert the tree to rules, then prune each rule independently: remove conditions that do not harm the accuracy of the rule. Finally, sort the rules (e.g., most accurate rule first). More on this later.
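A minimal, hedged sketch of the "prune each rule independently" step, assuming rules are (conditions, label) pairs as in the sketch above, that examples are dicts with a 'class' key, and that accuracy is estimated on a held-out validation set.

def covers(conditions, instance):
    return all(instance.get(a) == v for a, v in conditions)

def rule_accuracy(conditions, label, examples):
    covered = [e for e in examples if covers(conditions, e)]
    if not covered:
        return 0.0
    return sum(e["class"] == label for e in covered) / len(covered)

def prune_rule(conditions, label, validation):
    """Greedily drop conditions as long as validation accuracy does not decrease."""
    conditions = list(conditions)
    improved = True
    while improved and conditions:
        improved = False
        current = rule_accuracy(conditions, label, validation)
        for cond in list(conditions):
            shorter = [c for c in conditions if c != cond]
            if rule_accuracy(shorter, label, validation) >= current:
                conditions, improved = shorter, True
                break
    return conditions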

Handling missing values. What if the result of a test is unknown for an example, e.g. because the value of the attribute is unknown? Some possible solutions when training: guess the value, i.e. just take the most common value (among all examples, among the examples in this node / class, ...); or assign the example partially to the different branches (e.g., it counts for 0.7 in the yes subtree and 0.3 in the no subtree). When using the tree for prediction: combine the predictions of the different branches.

High Branching Factors. Attributes with continuous domains (numbers): we cannot have a different branch for each possible outcome, so we allow, e.g., a binary test of the form Temperature < 20. The evaluation is the same as before, but we need to generate the threshold value (e.g. 20), for instance by just trying all reasonable values. Attributes with many discrete values: these get an unfair advantage over attributes with few values (a question with many possible answers is more informative than a yes/no question). To compensate, divide the gain by the "maximum potential gain" SI. Gain Ratio: GR(S, A) = Gain(S, A) / SI(S, A), with split information SI(S, A) = - Σ_i (|S_i| / |S|) log2(|S_i| / |S|), where i ranges over the different results of test A.
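A small sketch (not from the slides) of this correction, applied to the Humidity/Wind example from earlier; the gain values 0.151 and 0.048 and the branch sizes 7/7 and 8/6 are taken from that slide.

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def split_information(branch_sizes):
    # SI(S, A) is just the entropy of the partition sizes |Si| / |S|
    return entropy(branch_sizes)

def gain_ratio(gain, branch_sizes):
    return gain / split_information(branch_sizes)

print(f"GR(S, Humidity) = {gain_ratio(0.151, [7, 7]):.3f}")  # SI = 1.0, so GR = 0.151
print(f"GR(S, Wind)     = {gain_ratio(0.048, [8, 6]):.3f}")  # SI = 0.985, so GR = 0.049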

Why concave functions? Assume a node with size n, entropy E and proportion of positives p is split into 2 nodes with (n1, E1, p1) and (n2, E2, p2). We have p = (n1/n) p1 + (n2/n) p2, and the new average entropy E' = (n1/n) E1 + (n2/n) E2 is therefore found by linear interpolation between (p1, E1) and (p2, E2) at p. Gain = the difference in height between (p, E) and (p, E'). (Plot: the concave entropy curve with the points (p1, E1), (p, E), (p2, E2) and the interpolated E' marked; the vertical gap between E and E' is the gain.)

Generic TDIDT algorithm. Many different algorithms for top-down induction of decision trees exist. What do they have in common, and where do they differ? We look at a generic algorithm: a general framework for TDIDT algorithms with several "parameter procedures"; instantiating them yields a specific algorithm. This summarizes the previously discussed points and puts them into perspective.

Generic TDIDT algorithm
function TDIDT(E: set of examples) returns tree;
    T' := grow_tree(E);
    T := prune(T');
    return T;
function grow_tree(E: set of examples) returns tree;
    T := generate_tests(E);
    t := best_test(T, E);
    P := partition induced on E by t;
    if stop_criterion(E, P)
    then return leaf(info(E))
    else
        for all Ej in P: tj := grow_tree(Ej);
        return node(t, {(j, tj)});

For classification... prune: e.g. reduced-error pruning, ... generate_tests: Attr = val, Attr < val, ... (for numeric attributes: generate the value val). best_test: Gain, Gain Ratio, ... stop_criterion: MDL, significance test (e.g. χ²-test), ... info: the most frequent class ("mode"). Popular systems: C4.5 (Quinlan 1993), C5.0.
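Below is a minimal, hedged Python sketch of the generic recursion instantiated for classification: information gain as best_test, the majority class as info, and a trivial stop criterion (pure node or no attributes left), with no pruning. The data format (a list of dicts with a 'class' key), the tiny data set, and all function names are illustrative assumptions, not the slides' own notation.

import math
from collections import Counter

def entropy(examples):
    counts = Counter(e["class"] for e in examples)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    partitions = {}
    for e in examples:
        partitions.setdefault(e[attribute], []).append(e)
    expected = sum(len(p) / len(examples) * entropy(p) for p in partitions.values())
    return entropy(examples) - expected

def majority_class(examples):
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def grow_tree(examples, attributes):
    # stop_criterion: pure node or no attributes left; info: majority class
    if len(set(e["class"] for e in examples)) == 1 or not attributes:
        return majority_class(examples)
    # best_test: the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(examples, a))
    branches = {}
    remaining = [a for a in attributes if a != best]
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]
        branches[value] = grow_tree(subset, remaining)
    return {best: branches}

# Tiny, made-up data set just to exercise the recursion
data = [
    {"Outlook": "Sunny", "Wind": "Weak", "class": "No"},
    {"Outlook": "Sunny", "Wind": "Strong", "class": "No"},
    {"Outlook": "Overcast", "Wind": "Weak", "class": "Yes"},
    {"Outlook": "Rainy", "Wind": "Weak", "class": "Yes"},
    {"Outlook": "Rainy", "Wind": "Strong", "class": "No"},
]
print(grow_tree(data, ["Outlook", "Wind"]))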

For regression... change best_test: e.g. minimize the average variance; info: the mean; stop_criterion: significance test (e.g., F-test), ... (Example: the set {1, 3, 4, 7, 8, 12} can be split by A1 into {1, 4, 12} and {3, 7, 8}, or by A2 into {1, 3, 7} and {4, 8, 12}.)
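A hedged sketch (not from the slides) that scores the two candidate splits above by the weighted average variance of their parts; the split with the lower value is the one a regression tree would prefer.

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def average_variance(parts):
    """Weighted average variance of a partition of the parent's values."""
    n = sum(len(p) for p in parts)
    return sum(len(p) / n * variance(p) for p in parts)

parent = [1, 3, 4, 7, 8, 12]
print(f"parent  : {variance(parent):.2f}")
print(f"A1 split: {average_variance([[1, 4, 12], [3, 7, 8]]):.2f}")
print(f"A2 split: {average_variance([[1, 3, 7], [4, 8, 12]]):.2f}")  # lower average variance wins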

Model trees. Make predictions using linear regression models in the leaves. info: a regression model (y = a x1 + b x2 + c). best_test: ? Plain variance is simple but not so good (the M5 approach); the residual variance after model construction is better but computationally expensive (the RETIS approach). stop_criterion: significant reduction of variance.

When to Consider Decision Trees Each instance consists of an attribute array with discrete values (e.g. outlook/sunny, etc..) The classification is over discrete values (e.g. yes/no ) It is okay to have disjunctive descriptions – each path in the tree represents a disjunction of attribute combinations. Any Boolean function can be represented! It is okay for the training data to contain errors – decision trees are robust to classification errors in the training data. It is okay for the training data to contain missing values – decision trees can be used even if instances have missing attributes.

Summary. Decision trees are a practical method for concept learning. TDIDT is a greedy search through a complete hypothesis space, so its bias is a search-based (preference) bias only. Overfitting is an important issue. A large number of extensions of the basic algorithm exist that handle overfitting, missing values, numerical values, etc.

Induction of Rule Sets What are decision rules? Induction of predictive rules Sequential covering approaches Learn-one-rule procedure Pruning

Decision Rules. Another popular representation for concept definitions: if-then rules of the form IF <conditions> THEN belongs to concept. They can be more compact and easier to interpret than trees. How can we learn such rules? By learning trees and converting them to rules, or with specific rule-learning methods ("sequential covering").

Decision Boundaries. (Scatter plot of + and - examples; two rectangular regions of positives are captured by the two rules.) if A and B then pos; if C and D then pos.

Sequential Covering Approaches. Also called the "separate-and-conquer" approach, versus trees, which "divide-and-conquer". General principle: learn a rule set one rule at a time. Learn one rule that has high accuracy (when it predicts something, it should be correct) and any coverage (it does not have to make a prediction for all examples, just for some of them). Mark the covered examples: these have been taken care of; now focus on the rest. Repeat this until all examples are covered.

Sequential Covering
function LearnRuleSet(Target, Attrs, Examples, Threshold):
    LearnedRules := {}
    Rule := LearnOneRule(Target, Attrs, Examples)
    while performance(Rule, Examples) > Threshold, do
        LearnedRules := LearnedRules ∪ {Rule}
        Examples := Examples \ {examples classified correctly by Rule}
        Rule := LearnOneRule(Target, Attrs, Examples)
    sort LearnedRules according to performance
    return LearnedRules

Learning One Rule. To learn one rule, perform a greedy search, either top-down or bottom-up. Top-down: start with the maximally general rule (maximal coverage but low accuracy), add literals one by one, and gradually maximize accuracy without sacrificing coverage (using some heuristic). Bottom-up: start with a maximally specific rule (minimal coverage but maximal accuracy), remove literals one by one, and gradually maximize coverage without sacrificing accuracy (using some heuristic).

Learning One Rule
function LearnOneRule(Target, Attrs, Examples):
    NewRule := "IF true THEN pos"
    NewRuleNeg := Neg
    while NewRuleNeg not empty, do
        // add a new literal to the rule
        Candidates := generate candidate literals
        BestLit := argmax_{L ∈ Candidates} performance(Specialise(NewRule, L))
        NewRule := Specialise(NewRule, BestLit)
        NewRuleNeg := {x ∈ Neg | x covered by NewRule}
    return NewRule
function Specialise(Rule, Lit):
    let Rule = "IF conditions THEN pos"
    return "IF conditions and Lit THEN pos"
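A compact, hedged Python sketch of LearnRuleSet / LearnOneRule for the attribute-value case, learning rules for a single positive class. The data format (dicts with a 'class' key), the accuracy-based performance function, and the stopping threshold are illustrative assumptions, not the slides' notation.

def covers(conditions, instance):
    return all(instance.get(a) == v for a, v in conditions)

def accuracy(conditions, examples, target="pos"):
    covered = [e for e in examples if covers(conditions, e)]
    if not covered:
        return 0.0
    return sum(e["class"] == target for e in covered) / len(covered)

def learn_one_rule(attributes, examples, target="pos"):
    """Top-down search: start from 'IF true THEN pos' and greedily add literals."""
    conditions = []
    negatives = [e for e in examples if e["class"] != target]
    while negatives:
        candidates = {(a, e[a]) for e in examples for a in attributes} - set(conditions)
        if not candidates:
            break
        best = max(candidates,
                   key=lambda lit: accuracy(conditions + [lit], examples, target))
        conditions.append(best)
        negatives = [e for e in negatives if covers(conditions, e)]
    return conditions

def learn_rule_set(attributes, examples, target="pos", threshold=0.5):
    """Sequential covering: learn one rule, remove what it covers, repeat."""
    rules, remaining = [], list(examples)
    while any(e["class"] == target for e in remaining):
        rule = learn_one_rule(attributes, remaining, target)
        if accuracy(rule, remaining, target) <= threshold:
            break
        rules.append(rule)
        remaining = [e for e in remaining if not covers(rule, e)]
    return rules

# usage (with hypothetical attribute names and data): learn_rule_set(["A", "B", "C", "D"], examples)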

Illustration. (Scatter plot of + and - examples; the rule is specialised step by step: IF true THEN pos, then IF A THEN pos, then IF A & B THEN pos, each step covering fewer negatives.)

Illustration (continued). (After the examples covered by IF A & B THEN pos are removed, a second rule is grown in the same way: IF true THEN pos, then IF C THEN pos, then IF C & D THEN pos.)

Bottom-up vs. Top-down. Bottom-up search typically yields more specific rules; top-down search typically yields more general rules.

Heuristics. When is a rule "good"? High accuracy; somewhat less important: high coverage. Possible evaluation functions: accuracy p / (p + n), with p = #positives and n = #negatives covered by the rule; a variant of accuracy, the m-estimate (p + mq) / (p + n + m), a weighted mean between the accuracy on the covered examples and an a priori estimate q of the true accuracy, with weight m; entropy, which treats pos and neg more symmetrically.
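A small, hedged numeric sketch of the m-estimate versus raw accuracy; the counts below and the prior q = 0.5 are made-up values chosen only to show how m pulls small-sample estimates toward the prior.

def rule_accuracy(p, n):
    return p / (p + n)

def m_estimate(p, n, q, m):
    """Weighted mean of observed accuracy and a prior estimate q, with weight m."""
    return (p + m * q) / (p + n + m)

# A rule covering 3 examples (all positive) vs. one covering 30 examples (27 positive)
for p, n in [(3, 0), (27, 3)]:
    print(f"p={p}, n={n}: accuracy = {rule_accuracy(p, n):.2f}, "
          f"m-estimate = {m_estimate(p, n, q=0.5, m=5):.2f}")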

Example-driven Top-down Rule Induction. Example: the AQ algorithms (Michalski et al.). For a given class C: as long as there are uncovered examples for C, pick one such example e, consider H_e = {rules that cover this example}, and search top-down in H_e to find the best rule. This gives a much more efficient search, since the hypothesis spaces H_e are much smaller than H (the set of all rules), but it is less robust with respect to noise: what if a noisy example is picked? Some restarts may be necessary.

Illustration: not example-driven. Looking for a good rule of the format "IF A = ... THEN pos". (Scatter plot of + and - examples grouped by the value of A: a, b, c, d.) First candidate: If A = a then pos.

Illustration: not example-driven (continued). Next candidate: If A = b then pos.

Illustration: not example-driven (continued). Next candidate: If A = c then pos.

Illustration: not example-driven (continued). Last candidate: If A = d then pos.

Illustration: example-driven. Try only rules that cover the seed "+", which has A = b. Hence A = b is a reasonable test and A = a is not. We do not try all 4 alternatives in this case, just one.

How to Arrange the Rules. According to the order in which they have been learned; according to their accuracy; or unordered, in which case we need a strategy for applying the rules: e.g., if an instance is covered by conflicting rules, use the rule with the higher training accuracy; if an instance is not covered by any rule, it is assigned the majority class.
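A brief, hedged sketch of applying an ordered rule list: the first matching rule fires, and a default (e.g. majority) class is used when no rule covers the instance. The rule representation matches the earlier sketches and is an assumption, not the slides' notation; the two example rules come from the rules-from-trees slide.

def covers(conditions, instance):
    return all(instance.get(a) == v for a, v in conditions)

def classify(instance, rules, default):
    """rules: ordered list of (conditions, label); the first matching rule wins."""
    for conditions, label in rules:
        if covers(conditions, instance):
            return label
    return default

rules = [([("Outlook", "Sunny"), ("Humidity", "High")], "No"),
         ([("Outlook", "Overcast")], "Yes")]
print(classify({"Outlook": "Overcast", "Humidity": "High"}, rules, default="Yes"))  # matches rule 2: Yes
print(classify({"Outlook": "Rainy", "Humidity": "Normal"}, rules, default="Yes"))   # no rule covers it: default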

Approaches to Avoiding Overfitting Pre-pruning: stop learning the decision rules before they reach the point where they perfectly classify the training data Post-pruning: allow the decision rules to overfit the training data, and then post-prune the rules.

Post-Pruning
1. Split the instances into a Growing Set and a Pruning Set;
2. Learn a set SR of rules using the Growing Set;
3. Find the best simplification BSR of SR;
4. while Accuracy(BSR, Pruning Set) > Accuracy(SR, Pruning Set) do
   4.1 SR = BSR;
   4.2 Find the best simplification BSR of SR;
5. return BSR;

Incremental Reduced Error Pruning. (Diagram contrasting the data splits used by ordinary post-pruning, D1 / D2 / D3, with the re-splits used by incremental reduced error pruning, D1 / D21, D22 / D3.)

Incremental Reduced Error Pruning
1. Split the Training Set into a Growing Set and a Validation Set;
2. Learn a rule R using the Growing Set;
3. Prune the rule R using the Validation Set;
4. if performance(R, Training Set) > Threshold
   4.1 Add R to the Set of Learned Rules;
   4.2 Remove from the Training Set the instances covered by R;
   4.3 go to 1;
5. else return the Set of Learned Rules

Summary Points Decision rules are easier for human comprehension than decision trees. Decision rules have simpler decision boundaries than decision trees. Decision rules are learned by sequential covering of the training instances.