Induction of Decision Trees (IDT) CSE 335/435
Resources:
– http://www.cs.ualberta.ca/~aixplore/learning/DecisionTrees/
– http://www.aaai.org/AITopics/html/trees.html

Recap from Previous Class
Two main motivations for using decision trees:
– As an analysis tool for large amounts of data (Giant Card)
– As a way around the knowledge acquisition problem of expert systems (Gasoil example)
We discussed several properties of decision trees:
– They can represent any Boolean function (i.e., table): F: A1 × A2 × … × An → {Yes, No}
– Each Boolean function (i.e., table) may have several decision trees

Example (restaurant domain)
Ex'ple  Bar  Fri  Hun  Pat   Alt  Type     wait
x1      no   no   yes  some  yes  French   yes
x4      no   yes  yes  full  yes  Thai     yes
x5      no   yes  no   full  yes  French   no
x6      yes  no   yes  some  no   Italian  yes
x7      yes  no   no   none  no   Burger   no
x8      no   no   yes  some  no   Thai     yes
x9      yes  yes  no   full  no   Burger   no
x10     yes  yes  yes  full  yes  Italian  no
x11     no   no   no   none  no   Thai     no

Example of a Decision Tree
[Figure: a tree that tests Bar? at the root and then Fri, Hun, Pat, Alt, … down each branch]
Possible Algorithm:
1. Pick an attribute A randomly
2. Make a child for every possible value of A
3. Repeat from Step 1 for every child until all attributes are exhausted
4. Label the leaves according to the cases
Problem: the resulting tree could be very deep
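Below is a minimal Python sketch of this naive procedure (an illustration, not code from the slides). It assumes each example is a dictionary of attribute values whose classification is stored under the key "wait", and it represents a tree as an (attribute, {value: subtree}) tuple with plain labels at the leaves; the function names are made up for this sketch.

```python
import random

def majority(examples, target="wait"):
    """Most common label among the examples (majority vote)."""
    labels = [ex[target] for ex in examples]
    return max(set(labels), key=labels.count)

def build_random_tree(examples, attributes, target="wait"):
    """Steps 1-4 above: split on an arbitrarily chosen attribute at every node."""
    labels = {ex[target] for ex in examples}
    if len(labels) == 1:                  # all cases agree: label the leaf
        return labels.pop()
    if not attributes:                    # attributes exhausted: majority vote
        return majority(examples, target)
    attr = random.choice(attributes)      # Step 1: pick an attribute A randomly
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in {ex[attr] for ex in examples}:   # Step 2: one child per value
        subset = [ex for ex in examples if ex[attr] == value]
        branches[value] = build_random_tree(subset, remaining, target)  # Step 3
    return (attr, branches)
```

Because the attribute order is arbitrary, nothing prevents the tree from testing many irrelevant attributes before it can label a leaf, which is exactly the "very deep tree" problem noted above.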

Example of a Decision Tree (II)
[Figure: a hand-built tree for the restaurant domain; it tests Patrons? at the root (none → no, some → yes), and along the "full" branch tests WaitEstimate?, Alternate?, Reservation?, Bar?, Fri/Sat?, Hungry?, and Raining? before reaching yes/no leaves]
Nice: the resulting tree is optimal.

Optimality Criterion for Decision Trees
We want to reduce the average number of questions that are asked. But how do we measure this for a tree T?
How about using the height: T is better than T' if height(T) < height(T')?
– Doesn't work. It is easy to show a counterexample in which height(T) = height(T') but T asks fewer questions on average than T'.
Better: the average path length, APL(T), of the tree T. Let L1, …, Ln be the n leaves of a decision tree T.
– APL(T) = (depth(L1) + depth(L2) + … + depth(Ln)) / n, where depth(Li) is the number of questions asked on the path from the root to Li
Homework
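As a concrete illustration (a sketch under assumptions, not part of the slides), APL(T) can be computed directly for trees stored in the (attribute, {value: subtree}) representation used in the earlier sketch; the function names are made up here.

```python
def leaf_depths(tree, depth=0):
    """Depths of all leaves of a tree given as (attribute, {value: subtree});
    anything that is not a tuple is treated as a leaf label."""
    if not isinstance(tree, tuple):
        return [depth]
    _, branches = tree
    depths = []
    for subtree in branches.values():
        depths.extend(leaf_depths(subtree, depth + 1))
    return depths

def apl(tree):
    """Average path length: mean depth over all leaves."""
    depths = leaf_depths(tree)
    return sum(depths) / len(depths)

# Example: one root question plus a second question on one branch
t = ("Patrons?", {"none": "no", "some": "yes",
                  "full": ("Hungry?", {"yes": "yes", "no": "no"})})
print(apl(t))  # (1 + 1 + 2 + 2) / 4 = 1.5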

Construction of Optimal Decision Trees is NP-Complete
Homework:
– Formulate this problem as a decision problem
– Show that the decision problem is in NP
– Design an algorithm solving the decision problem; compute the complexity of this algorithm
– Discuss what makes the decision problem so difficult
– (Optional: if you do this you will be exempt from two homework assignments of your choice: sketch a proof in class that this problem is NP-hard)

Induction
[Figure: the restaurant table (x1, x4, x5, …, x11) shown next to a pattern, with arrows labeled "database" and "induction"]
Databases: given a pattern, what are the data that match this pattern?
Induction: given data, what is the pattern that matches these data?

Learning: The Big Picture
Two forms of learning:
– Supervised: the input and output of the learning component can be perceived (for example, a friendly teacher provides the correct outputs)
– Unsupervised: there is no hint about the correct answers of the learning component (for example, finding clusters in data)

Inductive Learning
An example has the form (x, f(x))
Inductive task: Given a collection of examples, called the training set, for a function f, return a function h that approximates f (h is called the hypothesis)
[Figure: two different hypotheses fitting the same data points, which contain noise/error]
There is no way to know which of these two hypotheses is a better approximation of f. A preference for one over the other is called a bias.

Examples and Training Sets in IDT
– Each row in the table (i.e., each entry of the Boolean function) is an example
– All rows together form the training set
– If the classification of the example is "yes", we say that the example is positive; otherwise we say that the example is negative
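For concreteness, here is one possible way (an assumption of this sketch, not prescribed by the slides) to store such a training set in Python: the attribute names follow the table above, and the "wait" field holds the classification that makes an example positive or negative.

```python
# Two rows of the restaurant table: x1 (positive) and x11 (negative)
training_set = [
    {"Bar": "no", "Fri": "no", "Hun": "yes", "Pat": "some",
     "Alt": "yes", "Type": "French", "wait": "yes"},   # positive example
    {"Bar": "no", "Fri": "no", "Hun": "no", "Pat": "none",
     "Alt": "no", "Type": "Thai", "wait": "no"},       # negative example
]

positives = [ex for ex in training_set if ex["wait"] == "yes"]
negatives = [ex for ex in training_set if ex["wait"] == "no"]
```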

Induction of Decision Trees
Objective: find a concise decision tree that agrees with the examples
The guiding principle we are going to use is Ockham's razor: the most likely hypothesis is the simplest one that is consistent with the examples
Problem: finding the smallest decision tree is NP-complete
However, with simple heuristics we can find a small decision tree (an approximation)

Induction of Decision Trees: Algorithm
Algorithm:
1. Initially all examples are in the same group
2. Select the attribute that makes the most difference (i.e., for each of the values of the attribute, most of the examples are either positive or negative)
3. Group the examples according to each value of the selected attribute
4. Repeat from Step 1 within each group (recursive call)
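A Python sketch of Steps 1-4 (an illustration under assumptions, not the definitive implementation): examples are dictionaries as in the earlier sketch, domains maps each attribute to its possible values, and choose is a pluggable attribute-selection function, for instance a random pick for the naive version or an information-gain-based chooser once Gain is defined below.

```python
def majority(labels):
    """Majority vote over a list of labels."""
    return max(set(labels), key=labels.count)

def induce_tree(examples, attributes, domains, choose, target="wait", default=None):
    """Recursive decision-tree induction following Steps 1-4 above.
    choose(examples, attributes) picks the attribute that 'makes the most
    difference'; domains[a] lists the possible values of attribute a."""
    if not examples:                      # value never observed:
        return default                    #   use the majority vote of the parent
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:             # all examples agree: make a leaf
        return labels[0]
    if not attributes:                    # noisy data, no attributes left:
        return majority(labels)           #   majority vote of the examples
    attr = choose(examples, attributes)   # Step 2: most discriminating attribute
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in domains[attr]:           # Step 3: group examples by value
        subset = [ex for ex in examples if ex[attr] == value]
        branches[value] = induce_tree(subset, remaining, domains, choose,
                                      target, default=majority(labels))
    return (attr, branches)               # Step 4 happens via the recursive calls
```

The two fallback cases (an empty subset, and no attributes left) anticipate the "IDT: Some Issues" slide below.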

IDT: Example
Let's compare two candidate attributes: Patrons and Type. Which is a better attribute?
Patrons?
– none: x7(-), x11(-)
– some: x1(+), x3(+), x6(+), x8(+)
– full: x4(+), x12(+), x2(-), x5(-), x9(-), x10(-)
Type?
– french: x1(+), x5(-)
– italian: x6(+), x10(-)
– burger: x3(+), x12(+), x7(-), x9(-)
– thai: x4(+), x8(+), x2(-), x11(-)

IDT: Example (cont'd)
We now select the best attribute for discerning among the examples in the "full" branch: x4(+), x12(+), x2(-), x5(-), x9(-), x10(-)
[Figure: Patrons? with branches none → no, some → yes, full → Hungry?; Hungry? splits into no → {x5(-), x9(-)} and yes → {x4(+), x12(+), x2(-), x10(-)}]

IDT: Example (cont'd)
By continuing in the same manner we obtain:
[Figure: Patrons? (none → no, some → yes, full → Hungry?); Hungry? (no → no, yes → Type?); Type? (french → yes, italian → no, burger → yes, thai → Fri/Sat?); Fri/Sat? (no → no, yes → yes)]

IDT: Some Issues
– Sometimes we arrive at a node with no examples. This means that this combination of attribute values has not been observed. We simply assign as its label the majority vote of its parent's examples.
– Sometimes we arrive at a node with both positive and negative examples and no attributes left. This means that there is noise in the data. We simply assign as its label the majority vote of the examples.

How Well Does IDT Work?
This means: how well does h approximate f?
Empirical evidence:
1. Collect a large set of examples
2. Divide it into two sets: the training set and the test set
3. Learn a tree from the training set and measure the percentage of examples in the test set that are classified correctly
4. Repeat Steps 2 and 3 for different sizes of randomly selected training sets
The next slide shows a sketch of the resulting graph for the restaurant (meal) domain
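The methodology above can be sketched in a few lines of Python (an assumed illustration: build_tree stands for an induction routine such as the earlier sketch, examples are dictionaries with the classification under "wait", and each training size m is taken to be smaller than the total number of examples).

```python
import random

def classify(tree, example):
    """Walk an (attribute, {value: subtree}) tree down to a leaf label."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

def learning_curve(examples, build_tree, sizes, trials=20, target="wait"):
    """For each training-set size m, estimate the percentage of test examples
    classified correctly, averaged over several random splits (Steps 1-4 above)."""
    curve = []
    for m in sizes:
        total = 0.0
        for _ in range(trials):
            random.shuffle(examples)                  # Step 2: random split
            train, test = examples[:m], examples[m:]
            tree = build_tree(train)                  # learn from the training set
            total += sum(classify(tree, ex) == ex[target]
                         for ex in test) / len(test)  # Step 3: fraction correct
        curve.append((m, 100 * total / trials))       # % correct at this size
    return curve
```

Plotting the (size, % correct) pairs returned by learning_curve gives the kind of graph sketched on the next slide.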

How Well Does IDT Work? (II)
[Figure: learning curve, with training set size on the x-axis and % correct on the test set on the y-axis]
As the training set grows, the prediction quality improves (for this reason these kinds of curves are called happy curves)

Selection of a Good Attribute: Information Gain Theory
Suppose that I flip a "fair" coin:
– What is the probability that it will come up heads? 0.5
– How much information do you gain when it falls? 1 bit
Suppose that I flip a "totally unfair" coin (it always comes up heads):
– What is the probability that it will come up heads? 1
– How much information do you gain when it falls? 0

Selection of a Good Attribute: Information Gain Theory (II)
Suppose that I flip a "very unfair" coin (99% of the time it comes up heads):
– What is the probability that it will come up heads? 0.99
– How much information do you gain when it falls? A fraction of a bit
In general, the information provided by an event decreases as the probability of that event increases. Information gain of an event e (Shannon and Weaver, 1949):
I(e) = log2(1/p(e))

Selection of a Good Attribute: Information Gain Theory (III)
If the possible answers vi have probabilities p(vi), then the information content of the actual answer is given by:
I(p(v1), p(v2), …, p(vn)) = p(v1)I(v1) + p(v2)I(v2) + … + p(vn)I(vn)
                          = p(v1)log2(1/p(v1)) + p(v2)log2(1/p(v2)) + … + p(vn)log2(1/p(vn))
Examples:
– Information content with the fair coin: I(1/2, 1/2) = 1
– Information content with the totally unfair coin: I(1, 0) = 0
– Information content with the very unfair coin: I(1/100, 99/100) ≈ 0.08
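A small Python check of these three examples (a sketch; it uses the standard convention that a zero-probability answer contributes zero information, and the function name is illustrative).

```python
from math import log2

def information(probabilities):
    """I(p1, ..., pn) = sum of p_i * log2(1 / p_i); p_i = 0 contributes 0."""
    return sum(p * log2(1 / p) for p in probabilities if p > 0)

print(information([0.5, 0.5]))    # fair coin: 1.0 bit
print(information([1.0, 0.0]))    # totally unfair coin: 0.0 bits
print(information([0.01, 0.99]))  # very unfair coin: ~0.08 bits
```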

Selection of a Good Attribute
For decision trees: suppose that the training set has p positive examples and n negative examples. The information content expected in a correct answer is:
I(p/(p+n), n/(p+n))
We can now measure how much information is still needed after testing an attribute A:
– Suppose that A has v values. Thus, if E is the training set, E is partitioned into subsets E1, …, Ev.
– Each subset Ei has pi positive and ni negative examples

Selection of a Good Attribute (II)
– Each subset Ei has pi positive and ni negative examples
– If we go along the i-th branch, the information content of that branch is given by: I(pi/(pi+ni), ni/(pi+ni))
Probability that an example takes the i-th value of attribute A:
P(A,i) = (pi+ni)/(p+n)
Bits of information needed to classify an example after testing attribute A:
Remainder(A) = P(A,1) I(p1/(p1+n1), n1/(p1+n1)) + P(A,2) I(p2/(p2+n2), n2/(p2+n2)) + … + P(A,v) I(pv/(pv+nv), nv/(pv+nv))

Selection of a Good Attribute (III)
The information gain of testing an attribute A:
Gain(A) = I(p/(p+n), n/(p+n)) – Remainder(A)
Example (restaurant):
– Gain(Patrons) = ?
– Gain(Type) = ?
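A hedged Python sketch of this computation: the per-branch (pi, ni) counts are read off the Patrons-vs-Type slide above (6 positive and 6 negative examples in total), and the function names are illustrative rather than prescribed by the slides.

```python
from math import log2

def information(probabilities):
    """I(p1, ..., pn); a zero probability contributes zero information."""
    return sum(p * log2(1 / p) for p in probabilities if p > 0)

def gain(p, n, branch_counts):
    """Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A), where branch_counts is a
    list of (p_i, n_i) pairs, one per value of attribute A."""
    remainder = sum((pi + ni) / (p + n) *
                    information([pi / (pi + ni), ni / (pi + ni)])
                    for pi, ni in branch_counts)
    return information([p / (p + n), n / (p + n)]) - remainder

# Counts per value, taken from the comparison slide above
print(gain(6, 6, [(0, 2), (4, 0), (2, 4)]))          # Patrons: none/some/full, ~0.541 bits
print(gain(6, 6, [(1, 1), (1, 1), (2, 2), (2, 2)]))  # Type: french/italian/burger/thai, 0.0 bits
```

Passing a chooser that maximizes gain to the induction sketch shown earlier turns it into the usual information-gain-based algorithm.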

Homework
Assignment:
1. Compute Gain(Patrons) and Gain(Type)

Next Class (CSE 495)
Assignment:
1. We will not prove formally that obtaining the smallest decision tree is NP-complete; instead, explain what makes this problem so difficult for a deterministic computer.
2. What is the complexity of the algorithm shown in Slide 11, assuming that the selection of the attribute in Step 2 is done by the information gain formula of Slide 23?
3. Why does this algorithm not necessarily produce the smallest decision tree?