1 Chapter 18 Learning from Observations Decision tree examples Additional source used in preparing the slides: Jean-Claude Latombe’s CS121 slides: robotics.stanford.edu/~latombe/cs121

2 Decision Trees A decision tree classifies an object by testing its values for certain properties. We are trying to learn a structure that determines class membership after a sequence of questions. This structure is a decision tree.
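As a concrete (hypothetical) illustration of such a structure, a decision tree can be represented as question nodes with class-labeled leaves; the names below are mine, not from the slides:

```python
# A minimal sketch of a decision tree: internal nodes hold a test and one
# subtree per answer, leaves hold a class label.

class Leaf:
    def __init__(self, label):
        self.label = label

class Node:
    def __init__(self, test, branches):
        self.test = test          # function: object -> branch key
        self.branches = branches  # dict: branch key -> subtree

def classify(tree, obj):
    # Follow test outcomes down the tree until a leaf is reached.
    while isinstance(tree, Node):
        tree = tree.branches[tree.test(obj)]
    return tree.label

# Example with one yes/no question (illustrative only).
tree = Node(lambda o: o["see_flukes"],
            {True: Leaf("has flukes"), False: Leaf("no flukes")})
print(classify(tree, {"see_flukes": True}))  # -> "has flukes"
```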

3 Reverse engineered decision tree of the whale watcher expert system

[Figure: a decision tree whose nodes test "see flukes?", "see dorsal fin?", "size?" (very large / medium), "blow forward?", "size? (large / very small)", and "blows? (1 or 2)", with leaf classes blue whale, sperm whale, humpback whale, bowhead whale, narwhal whale, gray whale, and right whale. The remaining branch continues on the next slide.]

4 Reverse engineered decision tree of the whale watcher expert system (cont'd)

[Figure: continuation of the tree from the previous slide; nodes test "see dorsal fin?", "size?", "blow?", "dorsal fin tall and pointed?", and "dorsal fin and blow visible at the same time?", with leaf classes killer whale, northern bottlenose whale, sei whale, and fin whale.]

5 What might the original data look like?

6 The search problem Given a table of observable properties, search for a decision tree that correctly represents the data (assuming that the data is noise-free), and is as small as possible. What does the search tree look like?

7 Predicate as a Decision Tree

The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree:

[Figure: test A? (False → False); if True, test B? (False → True); if True, test C? (True → True, False → False).]

Example: a mushroom is poisonous iff it is yellow and small, or yellow, big and spotted.
x is a mushroom
CONCEPT = POISONOUS
A = YELLOW
B = BIG
C = SPOTTED
D = FUNNEL-CAP
E = BULKY
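The same predicate can be written directly as the tree's sequence of tests. A minimal sketch (not from the slides; function and argument names are mine):

```python
# Slide 7's predicate CONCEPT(x) <=> A(x) and (not B(x) or C(x)),
# written as the equivalent sequence of decision-tree tests.

def concept(a: bool, b: bool, c: bool) -> bool:
    """Evaluate CONCEPT by walking the decision tree."""
    if not a:      # A? -- the False branch is an immediate False leaf
        return False
    if not b:      # B? -- the False branch is a True leaf (not-B holds)
        return True
    return c       # C? -- the leaf is C's value

# Mushroom reading: A=YELLOW, B=BIG, C=SPOTTED.
assert concept(True, False, False)      # yellow and small -> poisonous
assert concept(True, True, True)        # yellow, big, spotted -> poisonous
assert not concept(False, True, True)   # not yellow -> not poisonous
```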

8 Training Set

[Table not preserved in the transcript: 13 examples (numbered 1-13) giving truth values for the predicates A-E and for CONCEPT.]

9 Possible Decision Tree

[Figure: a decision tree consistent with the training set; it tests D at the root, then C, E, B, E, A, A, and A at inner nodes, with True/False leaves.]

10 Possible Decision Tree (cont'd)

The tree on the previous slide computes
CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (C ∧ (B ∨ ((E ∧ ¬A) ∨ A)))
while the much smaller tree that tests A?, B?, C? computes
CONCEPT ⇔ A ∧ (¬B ∨ C)

KIS ("keep it simple") bias → build the smallest decision tree.
Finding the smallest tree is a computationally intractable problem → use a greedy algorithm.

11 Getting Started

The distribution of the training set is:
True: 6, 7, 8, 9, 10, 13
False: 1, 2, 3, 4, 5, 11, 12

12 Getting Started

The distribution of the training set is:
True: 6, 7, 8, 9, 10, 13
False: 1, 2, 3, 4, 5, 11, 12

Without testing any observable predicate, we could report that CONCEPT is False (majority rule), with an estimated probability of error P(E) = 6/13.

13 Getting Started

The distribution of the training set is:
True: 6, 7, 8, 9, 10, 13
False: 1, 2, 3, 4, 5, 11, 12

Without testing any observable predicate, we could report that CONCEPT is False (majority rule), with an estimated probability of error P(E) = 6/13.

Assuming that we will include only one observable predicate in the decision tree, which predicate should we test to minimize the probability of error?

14 How to compute the probability of error

[Figure: test A. True branch: CONCEPT True for 6, 7, 8, 9, 10, 13 and False for 11, 12; False branch: CONCEPT False for 1, 2, 3, 4, 5.]

If we test only A, we will report that CONCEPT is True if A is True (majority rule) and False otherwise. The estimated probability of error is:
Pr(E) = (8/13) × (2/8) + (5/13) × (0/5) = 2/13
8/13 is the probability of getting True for A, and 2/8 is the probability that the report is incorrect (on that branch we always report True).

15 How to compute the probability of error (cont'd)

[Figure: the same split on A as on the previous slide.]

If we test only A, we will report that CONCEPT is True if A is True (majority rule) and False otherwise. The estimated probability of error is:
Pr(E) = (8/13) × (2/8) + (5/13) × (0/5) = 2/13
5/13 is the probability of getting False for A, and 0 is the probability that the report is incorrect (on that branch we always report False).

16 Assume It's A

[Figure: the same split on A as on the previous slides.]

If we test only A, we will report that CONCEPT is True if A is True (majority rule) and False otherwise. The estimated probability of error is:
Pr(E) = (8/13) × (2/8) + (5/13) × 0 = 2/13

17 Assume It's B

[Figure: test B. True branch: CONCEPT True for 9, 10 and False for 2, 3, 11, 12; False branch: CONCEPT True for 6, 7, 8, 13 and False for 1, 4, 5.]

If we test only B, we will report that CONCEPT is False if B is True and True otherwise. The estimated probability of error is:
Pr(E) = (6/13) × (2/6) + (7/13) × (3/7) = 5/13

18 Assume It's C

[Figure: test C. True branch: CONCEPT True for 6, 8, 9, 10, 13 and False for 1, 3, 4; False branch: CONCEPT True for 7 and False for 2, 5, 11, 12.]

If we test only C, we will report that CONCEPT is True if C is True and False otherwise. The estimated probability of error is:
Pr(E) = (8/13) × (3/8) + (5/13) × (1/5) = 4/13

19 Assume It's D

[Figure: test D. True branch: CONCEPT True for 7, 10, 13 and False for 3, 5; False branch: CONCEPT True for 6, 8, 9 and False for 1, 2, 4, 11, 12.]

If we test only D, we will report that CONCEPT is True if D is True and False otherwise. The estimated probability of error is:
Pr(E) = (5/13) × (2/5) + (8/13) × (3/8) = 5/13

20 Assume It's E

[Figure: test E. True branch: CONCEPT True for 8, 9, 10, 13 and False for 1, 3, 5, 12; False branch: CONCEPT True for 6, 7 and False for 2, 4, 11.]

If we test only E, we will report that CONCEPT is False, independent of the outcome. The estimated probability of error is:
Pr(E) = (8/13) × (4/8) + (5/13) × (2/5) = 6/13

21 So, the best predicate to test is A

Pr(error) for each predicate:
If A: 2/13
If B: 5/13
If C: 4/13
If D: 5/13
If E: 6/13
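The per-predicate errors above all follow one pattern: each branch is labeled with its majority class, so it contributes its minority fraction, weighted by how often that branch is reached. A small sketch of that computation (my own; the (true, false) counts per branch are read off slides 14-20):

```python
from fractions import Fraction

def majority_rule_error(branches):
    """branches: list of (n_true, n_false) pairs, one per branch.
    The majority rule errs exactly on the minority class of each branch."""
    total = sum(t + f for t, f in branches)
    return sum(Fraction(min(t, f), total) for t, f in branches)

# Branch counts (CONCEPT True, CONCEPT False) from the slides:
splits = {
    "A": [(6, 2), (0, 5)],   # -> 2/13
    "B": [(2, 4), (4, 3)],   # -> 5/13
    "C": [(5, 3), (1, 4)],   # -> 4/13
    "D": [(3, 2), (3, 5)],   # -> 5/13
    "E": [(4, 4), (2, 3)],   # -> 6/13
}
for name, branches in splits.items():
    print(name, majority_rule_error(branches))  # A is smallest: 2/13
```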

22 Choice of Second Predicate

[Figure: the A = False branch is a False leaf; in the A = True branch we test C. C True: CONCEPT True for 6, 8, 9, 10, 13; C False: CONCEPT True for 7 and False for 11, 12.]

The majority rule gives the probability of error Pr(E|A) = 1/8 on the A = True branch, and Pr(E) = 1/13 overall.

23 Choice of Third Predicate

[Figure: in the A = True, C = False branch we test B. B True: CONCEPT False for 11, 12; B False: CONCEPT True for 7.]

24 Final Tree

[Figure: the learned tree tests A? (False → False); if True, tests C? (True → True); if C is False, tests B? (True → False, False → True).]

CONCEPT ⇔ A ∧ (C ∨ ¬B)

25 What happens if there is noise in the training set?

The part of the algorithm shown below handles this:

if attributes is empty then return MODE(examples)

Consider a very small (but inconsistent) training set:

A | classification
T | T
F | F
F | T

[Figure: the resulting tree tests A?; the True branch is a True leaf, and in the False branch, where the examples disagree, MODE returns the majority classification.]
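A sketch of the MODE(examples) fallback (my own; MODE is the majority-class function named in the algorithm fragment above):

```python
from collections import Counter

def mode(examples):
    """examples: list of class labels; return the most frequent one.
    Used as the leaf value when no attributes remain but examples disagree."""
    return Counter(examples).most_common(1)[0][0]

# The inconsistent A = False examples from the slide's tiny training set:
print(mode(["F", "T"]))  # a tie; Counter keeps insertion order -> "F"
```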

26 Using Information Theory Rather than minimizing the probability of error, learning procedures try to minimize the expected number of questions needed to decide if an object x satisfies CONCEPT. This minimization is based on a measure of the “quantity of information” that is contained in the truth value of an observable predicate.
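The slides do not spell the measure out, but the standard choice is the entropy of the class distribution. A sketch, assuming the usual formula I = -p·log2(p) - n·log2(n) over the class proportions p and n:

```python
from math import log2

def information(p: int, n: int) -> float:
    """Bits needed, on average, to announce the class of one example,
    given p positive and n negative examples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:
            q = count / total
            result -= q * log2(q)
    return result

print(information(6, 7))  # the 6-True / 7-False training set: ~0.996 bits
```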

27 Issues in learning decision trees

If data for some attribute is missing and is hard to obtain, it might be possible to extrapolate, or to use the value "unknown."
If some attributes have continuous values, groupings might be used.
If the data set is too large, one might use bagging to select a sample from the training set, use boosting to assign each instance a weight reflecting its importance, or divide the sample set into subsets and train on one while testing on the others (see the sketch below).
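As an illustration of the bagging idea mentioned above, here is a sketch (mine, not from the slides) of its sampling step: each tree is trained on a same-size sample drawn with replacement:

```python
import random

def bootstrap_sample(examples, rng=random):
    """Draw len(examples) items with replacement (a bootstrap sample)."""
    return [rng.choice(examples) for _ in examples]

data = list(range(1, 14))       # the 13 example indices from the slides
print(bootstrap_sample(data))   # e.g. [3, 3, 7, 12, 1, ...]
```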

28 Inductive bias

Usually the space of learning algorithms is very large.
Consider learning a classification of bit strings:
- A classification is simply a subset of all possible bit strings.
- If there are n bits, there are 2^n possible bit strings.
- If a set has m elements, it has 2^m possible subsets.
- Therefore there are 2^(2^n) possible classifications (for n = 50, larger than the number of molecules in the universe).
We need additional heuristics (assumptions) to restrict the search space.
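The counting argument is easy to check directly; a quick sketch:

```python
# With n bits there are 2**n strings, and a classification is any subset of
# them, so there are 2**(2**n) classifications. The count explodes at once.

for n in range(1, 6):
    strings = 2 ** n
    classifications = 2 ** strings
    print(n, strings, classifications)
# n=5 already gives 2**32 = 4294967296 classifications; n=50 gives 2**(2**50).
```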

29 Inductive bias (cont'd)

Inductive bias refers to the assumptions that a machine learning algorithm uses during the learning process.
One kind of inductive bias is Occam's Razor: assume that the simplest consistent hypothesis about the target function is actually the best.
Another kind is syntactic bias: assume a pattern defines the class of all matching strings:
- "nr" for the cards
- {0, 1, #} for bit strings

30 Inductive bias (cont'd)

Note that syntactic bias restricts the concepts that can be learned:
- If we use "nr" for card subsets, "all red cards except King of Diamonds" cannot be learned.
- If we use {0, 1, #} for bit strings, "1##0" represents {1110, 1100, 1010, 1000}, but no single pattern can represent all strings of even parity (where the number of 1s is even, including zero).
The tradeoff between expressiveness and efficiency is typical.
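A small sketch (mine) of the {0, 1, #} pattern language: '#' is a wildcard, so a pattern denotes exactly the set of bit strings it matches:

```python
from itertools import product

def expand(pattern: str):
    """Enumerate every bit string the pattern stands for:
    '#' can be either '0' or '1'; '0' and '1' match only themselves."""
    options = ["01" if ch == "#" else ch for ch in pattern]
    return {"".join(bits) for bits in product(*options)}

print(sorted(expand("1##0")))  # ['1000', '1010', '1100', '1110']

# No single pattern captures even parity: '00' and '11' both have even
# parity, but any pattern matching both must be '##', which also matches '01'.
```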

31 Inductive bias (cont'd)

Some representational biases include:
- Conjunctive bias: restrict learned knowledge to conjunctions of literals
- Limitations on the number of disjuncts
- Feature vectors: tables of observable features
- Decision trees
- Horn clauses
- BBNs
There is also work on programs that change their bias in response to data, but most programs assume a fixed inductive bias.

32 Two formulations for learning

Inductive:
- Hypothesis fits data
- Statistical inference
- Requires little prior knowledge
- Syntactic inductive bias

Analytical:
- Hypothesis fits domain theory
- Deductive inference
- Learns from scarce data
- Bias is the domain theory

Decision tree and version space learners are "similarity-based." Prior knowledge is important; it might be one of the reasons for humans' ability to generalize from as few as a single training instance. Prior knowledge can guide the search through the unlimited space of generalizations that can be produced from training examples.

33 An example: META-DENDRAL

META-DENDRAL learns rules for DENDRAL. Recall that DENDRAL infers the structure of organic molecules from their chemical formula and mass spectrographic data.
META-DENDRAL constructs an explanation of the site of a cleavage using:
- the structure of a known compound
- the mass and relative abundance of the fragments produced by spectrography
- a "half-order" theory (e.g., double and triple bonds do not break; only fragments larger than two carbon atoms show up in the data)
These explanations are used as examples for constructing general rules.