Learning, page 19 CSI 4106, Winter 2005 Learning decision trees A concept can be represented as a decision tree, built from examples, as in this problem of estimating credit risk by considering four features of a potential creditor. Such data can be derived from the history of credit applications.

Learning, page 20 CSI 4106, Winter 2005 Learning decision trees (2) At every level of the tree, one feature is selected, and each branch below it corresponds to one of that feature's values.

Learning, page 21 CSI 4106, Winter 2005 Learning decision trees (3) Usually many different decision trees are consistent with the same data, and they vary in the average cost of classifying an example. Not all features need to be included in a tree.

Learning, page 22 CSI 4106, Winter 2005 Learning decision trees (4) The ID3 algorithm (its latest industrial-strength implementation is called C5.0):
If all examples are in the same class, build a leaf with this class. (If, for example, we have no historical data that record low or moderate risk, we can only learn that everything is high-risk.)
Otherwise, if no more features can be used, build a leaf with a disjunction of the classes of the examples. (We might have data that only allow us to distinguish low risk from high and moderate risk.)
Otherwise, select a feature for the root; partition the examples on the values of this feature; recursively build the decision trees for all partitions; attach them to the root. (This is a greedy algorithm: a form of hill climbing.)
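A minimal Python sketch of this recursion (an illustration, not the course's own code). The example representation, a dict mapping feature names to values plus a class label, and the gain(examples, feature, label) scoring function, sketched with the formulas on the next slides, are assumptions made here.

```python
# Sketch of the ID3 recursion described on this slide.
def id3(examples, features, gain, label="risk"):
    classes = {e[label] for e in examples}
    if len(classes) == 1:
        # Case 1: all examples are in the same class -> leaf with this class.
        return ("leaf", classes)
    if not features:
        # Case 2: no more features can be used -> leaf with a disjunction of classes.
        return ("leaf", classes)
    # Case 3 (greedy step): select the most informative feature for the root,
    # partition the examples on its values, and build the subtrees recursively.
    best = max(features, key=lambda f: gain(examples, f, label))
    remaining = [f for f in features if f != best]
    branches = {value: id3([e for e in examples if e[best] == value], remaining, gain, label)
                for value in {e[best] for e in examples}}
    return ("node", best, branches)
```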

Learning, page 23 CSI 4106, Winter 2005 Learning decision trees (5) Two partially constructed decision trees.

Learning, page 24 CSI 4106, Winter 2005 Learning decision trees (6) We saw that the same data can be turned into different trees. The question is which trees are better. Essentially, the choice of the feature for the root is important: we want to select a feature that gives the most information. Information in a set of disjoint classes C = {c_1, ..., c_n} is defined by this formula:
I(C) = Σ_i -p(c_i) log2 p(c_i)
where p(c_i) is the probability that an example is in class c_i. The information is measured in bits.
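As a small sketch (not from the slides), this formula can be computed from a list of class labels, estimating each p(c_i) as a relative frequency; the function name and representation are illustrative choices.

```python
from collections import Counter
from math import log2

def information(labels):
    """I(C) = sum over classes c_i of -p(c_i) * log2 p(c_i), with p(c_i)
    estimated as the relative frequency of c_i in the list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())
```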

Learning, page 25 CSI 4106, Winter 2005 Learning decision trees (7) Let us consider our credit risk data. The RISK feature has three values, and there are 14 examples: 6 examples have high risk, 3 have moderate risk, 5 have low risk. Assuming that every example is equally likely, the class probabilities are p(high) = 6/14, p(moderate) = 3/14, p(low) = 5/14. Information contained in this partition:
I(RISK) = -(6/14) log2(6/14) - (3/14) log2(3/14) - (5/14) log2(5/14) ≈ 1.531 bits
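The helper sketched above reproduces this value from the class counts:

```python
# 6 high-risk, 3 moderate-risk and 5 low-risk examples out of 14.
risk_labels = ["high"] * 6 + ["moderate"] * 3 + ["low"] * 5
print(round(information(risk_labels), 3))  # 1.531 (bits)
```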

Learning, page 26 CSI 4106, Winter 2005 Learning decision trees (8) Let feature F be at the root, and let e_1, ..., e_m be the partitions of the examples on the values of this feature. The information needed to build a tree for partition e_i is I(e_i). The expected information needed to build the whole tree is a weighted average of the I(e_i), with each partition weighted by its share of the examples. Let |s| be the cardinality of set s, and let N = |e_1| + ... + |e_m| be the total number of examples. Expected information is defined by this formula:
E(F) = Σ_i (|e_i| / N) * I(e_i)
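A matching sketch, again over lists of class labels (one list per partition), reusing information() from the sketch above:

```python
def expected_information(partitions):
    """E(F) = sum over partitions e_i of (|e_i| / N) * I(e_i), where N is the
    total number of examples across all partitions."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * information(p) for p in partitions)
```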

Learning, page 27 CSI 4106, Winter 2005 Learning decision trees (9) In our data, there are three partitions based on income:
e_1 = {1, 4, 7, 11}, |e_1| = 4, I(e_1) = 0.0 (all examples have high risk, so I(e_1) = -1 log2 1 = 0).
e_2 = {2, 3, 12, 14}, |e_2| = 4, I(e_2) = 1.0 (two examples have high risk and two have moderate risk: I(e_2) = -1/2 log2 1/2 - 1/2 log2 1/2).
e_3 = {5, 6, 8, 9, 10, 13}, |e_3| = 6, I(e_3) ≈ 0.65 (one example has moderate risk and five have low risk: I(e_3) = -1/6 log2 1/6 - 5/6 log2 5/6).
The expected information to complete the tree using income as the root feature is:
E(INCOME) = 4/14 * 0.0 + 4/14 * 1.0 + 6/14 * 0.65 ≈ 0.564 bits
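The same number falls out of the sketches above; the class make-up of each partition follows from the overall counts:

```python
# Income partitions, by class label (sizes 4, 4 and 6, as on the slide).
e1 = ["high"] * 4                     # examples {1, 4, 7, 11}:        I(e1) = 0.0
e2 = ["high"] * 2 + ["moderate"] * 2  # examples {2, 3, 12, 14}:       I(e2) = 1.0
e3 = ["moderate"] + ["low"] * 5       # examples {5, 6, 8, 9, 10, 13}: I(e3) ~ 0.65
print(round(expected_information([e1, e2, e3]), 3))  # 0.564 (bits)
```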

Learning, page 28 CSI 4106, Winter 2005 Learning decision trees (10) Now we define the information gain from selecting feature F for tree-building, given a set of classes C:
G(F) = I(C) - E(F)
For our sample data and for F = income, we get this: G(INCOME) = I(RISK) - E(INCOME) ≈ 1.531 bits - 0.564 bits = 0.967 bits. Our analysis will be complete, and our choice clear, after we have similarly considered the remaining three features. The values are as follows: G(COLLATERAL) ≈ 0.206 bits, G(DEBT) ≈ 0.063 bits, G(CREDIT HISTORY) ≈ 0.266 bits. That is, we should choose INCOME as the criterion in the root of the best decision tree that we can construct.
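To close the loop with the ID3 sketch earlier, here is a gain helper in the same illustrative representation (example dicts), reusing information(), expected_information(), risk_labels, e1, e2 and e3 from the previous sketches:

```python
def information_gain(examples, feature, label="risk"):
    """G(F) = I(C) - E(F) over examples given as dicts of feature -> value."""
    labels = [e[label] for e in examples]
    partitions = {}
    for e in examples:
        partitions.setdefault(e[feature], []).append(e[label])
    return information(labels) - expected_information(list(partitions.values()))

# Checking the income figure without intermediate rounding:
print(round(information(risk_labels) - expected_information([e1, e2, e3]), 3))
# 0.966; the slide's 0.967 comes from rounding 1.531 - 0.564.
```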

Learning, page 29 CSI 4106, Winter 2005 Explanation-based learning
A target concept: the learning system finds an "operational" definition of this concept, expressed in terms of some primitives. The target concept is represented as a predicate.
A training example: an instance of the target concept. It takes the form of a set of simple facts, not all of them necessarily relevant to the theory.
A domain theory: a set of rules, usually in predicate logic, that can explain how the training example fits the target concept.
Operationality criteria: the predicates (features) that should appear in an effective definition of the target concept.

Learning, page 30 CSI 4106, Winter 2005 Explanation-based learning (2) A classic example: a theory and an instance of a cup. A cup is a container for liquids that can be easily lifted. It has some typical parts, such as a handle and a bowl. The bowl, the actual container, must be concave. Because a cup can be lifted, it should be light. And so on. The target concept is cup(X). The domain theory has five rules:
liftable(X) ∧ holds_liquid(X) → cup(X)
part(Z, W) ∧ concave(W) ∧ points_up(W) → holds_liquid(Z)
light(X) ∧ part(X, handle) → liftable(X)
small(A) → light(A)
made_of(A, feathers) → light(A)

Learning, page 31 CSI 4106, Winter 2005 Explanation-based learning (3) The training example lists nine facts (some of them are not relevant):
cup(obj1), small(obj1), part(obj1, handle), owns(bob, obj1), part(obj1, bottom), part(obj1, bowl), points_up(bowl), concave(bowl), color(obj1, red)
Operationality criteria require a definition in terms of structural properties of objects (part, points_up, small, concave).

Learning, page 32 CSI 4106, Winter 2005 Explanation-based learning (4) Step 1: prove the target concept using the training example

Learning, page 33 CSI 4106, Winter 2005 Explanation-based learning (5) Step 2: generalize the proof. Constants from the domain theory, for example handle, are not generalized.

Learning, page 34 CSI 4106, Winter 2005 Explanation-based learning (6) Step 3: take the definition "off the tree", keeping only the root and the leaves of the generalized proof. In our example, we get this rule:
small(X) ∧ part(X, handle) ∧ part(X, W) ∧ concave(W) ∧ points_up(W) → cup(X)
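As a small illustration (not part of the original slides), the learned operational rule can be written as an ordinary predicate over the training example's facts; the tuple encoding of the facts below is an assumption made for this sketch.

```python
# The nine facts of the training example as tuples (predicate, arguments...).
facts = {
    ("cup", "obj1"), ("small", "obj1"), ("part", "obj1", "handle"),
    ("owns", "bob", "obj1"), ("part", "obj1", "bottom"), ("part", "obj1", "bowl"),
    ("points_up", "bowl"), ("concave", "bowl"), ("color", "obj1", "red"),
}

def cup(x, facts):
    """small(X) and part(X, handle) and part(X, W) and concave(W) and points_up(W) -> cup(X)"""
    parts_of_x = [f[2] for f in facts if f[0] == "part" and f[1] == x]
    has_open_part = any(("concave", w) in facts and ("points_up", w) in facts
                        for w in parts_of_x)
    return ("small", x) in facts and ("part", x, "handle") in facts and has_open_part

print(cup("obj1", facts))  # True: obj1 satisfies the learned operational definition
```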