Machine Learning: Symbol-based


9 Machine Learning: Symbol-based
9.0 Introduction
9.1 A Framework for Symbol-based Learning
9.2 Version Space Search
9.3 The ID3 Decision Tree Induction Algorithm
9.4 Inductive Bias and Learnability
9.5 Knowledge and Learning
9.6 Unsupervised Learning
9.7 Reinforcement Learning
9.8 Epilogue and References
9.9 Exercises
Additional sources used in preparing the slides:
Jean-Claude Latombe’s CS121 (Introduction to Artificial Intelligence) lecture notes, http://robotics.stanford.edu/~latombe/cs121/2003/home.htm (version spaces, decision trees)
Tom Mitchell’s machine learning notes (explanation based learning)

Chapter Objectives
Learn about several “paradigms” of symbol-based learning
Learn about the issues in implementing and using learning algorithms
The agent model: can learn, i.e., can use prior experience to perform better in the future

A learning agent
[Figure: a learning agent interacting with the environment through sensors and actuators; components shown: critic, learning element, KB]

A general model of the learning process

A learning game with playing cards
I would like to show what a full house is. I give you examples which are/are not full houses:
6 6 6 9 9 is a full house
6 6 6 6 9 is not a full house
3 3 3 6 6 is a full house
1 1 1 6 6 is a full house
Q Q Q 6 6 is a full house
1 2 3 4 5 is not a full house
1 1 3 4 5 is not a full house
1 1 1 4 5 is not a full house
1 1 1 4 4 is a full house
If you haven’t guessed already, a full house is three of a kind and a pair of another kind.

Intuitively, I’m asking you to describe a set. This set is the concept I want you to learn. This is called inductive learning, i.e., learning a generalization from a set of examples. Concept learning is a typical inductive learning problem: given examples of some concept, such as “cat,” “soybean disease,” or “good stock investment,” we attempt to infer a definition that will allow the learner to correctly recognize future instances of that concept.

Supervised learning This is called supervised learning because we assume that there is a teacher who classified the training data: the learner is told whether an instance is a positive or negative example of a target concept.

Supervised learning?
This definition might seem counterintuitive. If the teacher knows the concept, why doesn’t s/he tell us directly and save us all the work?
The teacher only knows the classification of each instance; the learner has to find out the rule behind the classification.
Imagine an online store: there is a lot of data concerning whether a customer returns to the store. The information is there in terms of attributes and whether the customers come back or not. However, it is up to the learning system to characterize the concept, e.g.,
If a customer bought more than 4 books, s/he will return.
If a customer spent more than $50, s/he will return.

Rewarded card example
Deck of cards, with each card designated by [r,s], its rank and suit, and some cards “rewarded”
Background knowledge in the KB:
((r=1) ∨ … ∨ (r=10)) ⇔ NUM(r)
((r=J) ∨ (r=Q) ∨ (r=K)) ⇔ FACE(r)
((s=S) ∨ (s=C)) ⇔ BLACK(s)
((s=D) ∨ (s=H)) ⇔ RED(s)
Training set: REWARD([4,C]) ∧ REWARD([7,C]) ∧ REWARD([2,S]) ∧ ¬REWARD([5,H]) ∧ ¬REWARD([J,S])

Rewarded card example
Training set: REWARD([4,C]) ∧ REWARD([7,C]) ∧ REWARD([2,S]) ∧ ¬REWARD([5,H]) ∧ ¬REWARD([J,S])
Card     In the target set?
[4,C]    yes
[7,C]    yes
[2,S]    yes
[5,H]    no
[J,S]    no
Possible inductive hypothesis h: h = (NUM(r) ∧ BLACK(s) ⇔ REWARD([r,s]))

Learning a predicate
Set E of objects (e.g., cards, drinking cups, writing instruments)
Goal predicate CONCEPT(X), where X is an object in E, that takes the value True or False (e.g., REWARD, MUG, PENCIL, BALL)
Observable predicates A(X), B(X), … (e.g., NUM, RED, HAS-HANDLE, HAS-ERASER)
Training set: values of CONCEPT for some combinations of values of the observable predicates
Find a representation of CONCEPT of the form CONCEPT(X) ⇔ A(X) ∧ (B(X) ∨ C(X))

How can we do this?
Go with the most general hypothesis possible: “any card is a rewarded card”. This will cover all the positive examples, but will not be able to eliminate any negative examples.
Go with the most specific hypothesis possible: “the rewarded cards are [4,C], [7,C], [2,S]”. This will correctly sort all the examples in the training set, but it is overly specific and will not be able to sort any new examples.
But the above two are good starting points.

Version space algorithm
What we want to do is start with the most general and the most specific hypotheses, and
when we see a positive example, we minimally generalize the most specific hypothesis;
when we see a negative example, we minimally specialize the most general hypothesis.
When the most general hypothesis and the most specific hypothesis are the same, the algorithm has converged: this is the target concept.

Pictorially
[Figure: instances marked + (positive), - (negative), and ? (not yet classified), with the boundary of G and the boundary of S drawn around them; the potential target concepts lie between the two boundaries]

Hypothesis space
When we shrink G, or enlarge S, we are essentially conducting a search in the hypothesis space.
A hypothesis is any sentence h of the form CONCEPT(X) ⇔ A(X) ∧ (B(X) ∨ C(X)), where the right hand side is built with observable predicates.
The set of all hypotheses is called the hypothesis space, or H.
A hypothesis h agrees with an example if it gives the correct value of CONCEPT.

Size of the hypothesis space
n observable predicates give 2^n entries in the truth table.
A hypothesis corresponds to any subset of the truth-table entries (the entries for which CONCEPT is True), so there are 2^(2^n) hypotheses to choose from: BIG!
n = 6 gives 2^64 ≈ 1.8 x 10^19. BIG!
Generate-and-test won’t work.
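The count above can be checked with a few lines of Python. This is only an illustrative sketch; the function name `hypothesis_count` is made up for this example and does not come from the slides.

```python
def hypothesis_count(n: int) -> int:
    """Number of Boolean functions over n Boolean observable predicates."""
    rows = 2 ** n        # entries in the truth table
    return 2 ** rows     # each entry can independently be True or False


# Print the growth for a few small n, including the n = 6 case from the slide.
for n in (1, 2, 3, 6):
    print(n, hypothesis_count(n))
```

Even at n = 6 the space is far too large to enumerate, which is why the slides turn to a structured search instead of generate-and-test.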

Simplified Representation for the card problem
For simplicity, we represent a concept by rs, with:
r = a, n, f, 1, …, 10, j, q, k
s = a, b, r, ♣, ♠, ♦, ♥
For example:
n♠ represents: NUM(r) ∧ (s=♠) ⇔ REWARD([r,s])
aa represents: ANY-RANK(r) ∧ ANY-SUIT(s) ⇔ REWARD([r,s])

Extension of an hypothesis The extension of an hypothesis h is the set of objects that verifies h. For instance, the extension of f is: {j, q, k}, and the extension of aa is the set of all cards.

More general/specific relation Let h1 and h2 be two hypotheses in H h1 is more general than h2 iff the extension of h1 is a proper superset of the extension of h2 For instance, aa is more general than f, f is more general than q, fr and nr are not comparable

More general/specific relation (cont’d) The inverse of the “more general” relation is the “more specific” relation The “more general” relation defines a partial ordering on the hypotheses in H
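The extension and the “more general” relation can be sketched in Python for the card representation. This is a minimal sketch under stated assumptions: a hypothesis is written as a pair (rank symbol, suit symbol), and the single-suit letters `c`, `s`, `d`, `h` are stand-ins chosen for this example, as are the function names.

```python
from itertools import product

RANKS = [str(i) for i in range(1, 11)] + ["j", "q", "k"]
SUITS = ["c", "s", "d", "h"]  # assumed letters: clubs, spades, diamonds, hearts

# Class symbols from the simplified representation: a = any, n = number,
# f = face for ranks; a = any, b = black, r = red for suits.
RANK_CLASSES = {"a": set(RANKS), "n": set(RANKS[:10]), "f": {"j", "q", "k"}}
SUIT_CLASSES = {"a": set(SUITS), "b": {"c", "s"}, "r": {"d", "h"}}


def extension(h):
    """Set of cards (rank, suit) that satisfy hypothesis h = (r, s)."""
    r, s = h
    ranks = RANK_CLASSES.get(r, {r})   # a single rank stands for itself
    suits = SUIT_CLASSES.get(s, {s})   # a single suit stands for itself
    return set(product(ranks, suits))


def more_general(h1, h2):
    """h1 is more general than h2 iff ext(h1) is a proper superset of ext(h2)."""
    return extension(h1) > extension(h2)


print(len(extension(("a", "a"))))                # size of the full deck
print(more_general(("a", "a"), ("f", "s")))      # aa vs. face cards of one suit
print(more_general(("f", "r"), ("n", "r")),
      more_general(("n", "r"), ("f", "r")))      # fr and nr are not comparable
```

Because the relation is defined through proper supersets, two hypotheses with overlapping or disjoint extensions (such as fr and nr) are simply unrelated, which is what makes the ordering partial.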

A subset of the partial order for cards
[Figure: a lattice of hypotheses with aa at the top; other nodes shown include na, ab, nb, 4a, 4b]

G-Boundary / S-Boundary of V
An hypothesis in V is most general iff no hypothesis in V is more general
G-boundary G of V: Set of most general hypotheses in V
An hypothesis in V is most specific iff no hypothesis in V is more specific
S-boundary S of V: Set of most specific hypotheses in V

Example: The starting hypothesis space
[Figure: the partial order of hypotheses, with aa at the top and S consisting of the individual cards (1, …, 4, …, k)]

4 is a positive example
We replace every hypothesis in S whose extension does not contain 4 by its generalization set.
The generalization set of a hypothesis h is the set of the hypotheses that are immediately more general than h.
[Figure: the partial order of hypotheses, with the generalization set of 4 marked]

7 is the next positive example
Minimally generalize the most specific hypothesis set.
We replace every hypothesis in S whose extension does not contain 7 by its generalization set.
[Figure: the partial order of hypotheses]

7 is positive (cont’d)
Minimally generalize the most specific hypothesis set.
[Figure: the partial order of hypotheses]

5 is a negative example
Minimally specialize the most general hypothesis set.
[Figure: the partial order of hypotheses, with the specialization set of aa marked]

5 is negative (cont’d)
Minimally specialize the most general hypothesis set.
[Figure: the partial order of hypotheses]

After 3 examples (2 positive, 1 negative)
G and S, and all hypotheses in between, form exactly the version space.
1. If an hypothesis between G and S disagreed with an example x, then an hypothesis in G or S would also disagree with x, and hence would have been removed.
2. If there were an hypothesis not in this set which agreed with all examples, then it would have to be either no more specific than any member of G (but then it would be in G) or no more general than some member of S (but then it would be in S).
[Figure: the version space between G and S in the partial order]

Do 8, 6, j satisfy CONCEPT? At this stage ab nb a No n Yes Maybe Do 8, 6, j satisfy CONCEPT?

2 is the next positive example
Minimally generalize the most specific hypothesis set.
[Figure: the remaining version space]

j is the next negative example
Minimally specialize the most general hypothesis set.
[Figure: the remaining version space, with nodes ab and nb]

Result
Positive examples: 4, 7, 2. Negative examples: 5, j.
nb: NUM(r) ∧ BLACK(s) ⇔ REWARD([r,s])

The version space algorithm
Begin
  Initialize G to the most general hypothesis;
  Initialize S to the first positive training instance;
  N is the set of all negative instances seen so far;
  For each example x
    If x is positive, then (G,S) ← POSITIVE-UPDATE(G,S,x)
    else (G,S) ← NEGATIVE-UPDATE(G,S,x);
    If G = S and both are singletons, then the algorithm has found a single concept that is consistent with all the data, and the algorithm halts;
    If G and S become empty, then there is no concept that covers all the positive instances and none of the negative instances
End

The version space algorithm (cont’d)
POSITIVE-UPDATE(G,S,x)
Begin
  Delete all members of G that fail to match x;
  For every s ∈ S, if s does not match x, replace s with its most specific generalizations that match x;
  Delete from S any hypothesis that is more general than some other hypothesis in S;
  Delete from S any hypothesis that is neither more specific than nor equal to a hypothesis in G; (different than the textbook)
End;

The version space algorithm (cont’d)
NEGATIVE-UPDATE(G,S,x)
Begin
  Delete all members of S that match x;
  For every g ∈ G that matches x, replace g with its most general specializations that do not match x;
  Delete from G any hypothesis that is more specific than some other hypothesis in G;
  Delete from G any hypothesis that is neither more general than nor equal to a hypothesis in S; (different than the textbook)
End;

Comments on Version Space Learning (VSL)
It is a bi-directional search. One direction is specific to general and is driven by positive instances. The other direction is general to specific and is driven by negative instances.
It is an incremental learning algorithm. The examples do not have to be given all at once (as opposed to learning decision trees).
The version space is meaningful even before it converges.
The order of examples matters for the speed of convergence.
As is, it cannot tolerate noise (misclassified examples): the version space might collapse.

Examples and near misses for the concept “arch”

More on generalization operators
Replacing constants with variables. For example, color(ball, red) generalizes to color(X, red)
Dropping conditions from a conjunctive expression. For example, shape(X, round) ∧ size(X, small) ∧ color(X, red) generalizes to shape(X, round) ∧ color(X, red)

More on generalization operators (cont’d)
Adding a disjunct to an expression. For example, shape(X, round) ∧ size(X, small) ∧ color(X, red) generalizes to shape(X, round) ∧ size(X, small) ∧ (color(X, red) ∨ color(X, blue))
Replacing a property with its parent in a class hierarchy. If we know that primary_color is a superclass of red, then color(X, red) generalizes to color(X, primary_color)

Another example sizes = {large, small} colors = {red, white, blue} shapes = {sphere, brick, cube} object (size, color, shape) If the target concept is a “red ball,” then size should not matter, color should be red, and shape should be sphere If the target concept is “ball,” then size or color should not matter, shape should be sphere.

A portion of the concept space

Learning the concept of a “red ball”
G: { obj(X, Y, Z) }
S: { }
positive: obj(small, red, sphere)
G: { obj(X, Y, Z) }
S: { obj(small, red, sphere) }
negative: obj(small, blue, sphere)
G: { obj(large, Y, Z), obj(X, red, Z), obj(X, white, Z), obj(X, Y, brick), obj(X, Y, cube) }
S: { obj(small, red, sphere) }
delete from G every hypothesis that is neither more general than nor equal to a hypothesis in S
G: { obj(X, red, Z) }
S: { obj(small, red, sphere) }

Learning the concept of a “red ball” (cont’d)
G: { obj(X, red, Z) }
S: { obj(small, red, sphere) }
positive: obj(large, red, sphere)
G: { obj(X, red, Z) }
S: { obj(X, red, sphere) }
negative: obj(large, red, cube)
G: { obj(small, red, Z), obj(X, red, sphere), obj(X, red, brick) }
S: { obj(X, red, sphere) }
delete from G every hypothesis that is neither more general than nor equal to a hypothesis in S
G: { obj(X, red, sphere) }
S: { obj(X, red, sphere) }
converged to a single concept
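The “red ball” trace can be reproduced with a small Python sketch of the version space updates. It is a simplified illustration, not the textbook code: hypotheses are tuples over (size, color, shape) with "?" standing for a variable, the first example is assumed to be positive, and all function names are invented for this example.

```python
ANY = "?"  # stands for a variable such as X, Y, or Z
DOMAINS = (
    ("large", "small"),
    ("red", "white", "blue"),
    ("sphere", "brick", "cube"),
)


def matches(h, x):
    """True if hypothesis h covers instance x."""
    return all(a == ANY or a == b for a, b in zip(h, x))


def covers(h1, h2):
    """True if h1 is more general than or equal to h2."""
    return all(a == ANY or a == b for a, b in zip(h1, h2))


def generalize(s, x):
    """Most specific generalization of s that matches x."""
    return tuple(a if a == b else ANY for a, b in zip(s, x))


def specializations(g, x):
    """Most general specializations of g that do not match x."""
    out = []
    for i, a in enumerate(g):
        if a == ANY:
            for v in DOMAINS[i]:
                if v != x[i]:
                    out.append(g[:i] + (v,) + g[i + 1:])
    return out


def candidate_elimination(examples):
    (first, label), rest = examples[0], examples[1:]
    assert label, "this sketch expects the first example to be positive"
    G = {(ANY,) * len(DOMAINS)}
    S = {first}  # in this conjunctive language S stays a singleton
    for x, positive in rest:
        if positive:
            G = {g for g in G if matches(g, x)}
            S = {generalize(s, x) for s in S}
            S = {s for s in S if any(covers(g, s) for g in G)}
        else:
            S = {s for s in S if not matches(s, x)}
            new_G = set()
            for g in G:
                new_G.update(specializations(g, x) if matches(g, x) else [g])
            # keep only hypotheses more general than or equal to some s in S
            new_G = {g for g in new_G if any(covers(g, s) for s in S)}
            # drop hypotheses that are more specific than another member of G
            G = {g for g in new_G
                 if not any(h != g and covers(h, g) for h in new_G)}
    return G, S


examples = [
    (("small", "red", "sphere"), True),
    (("small", "blue", "sphere"), False),
    (("large", "red", "sphere"), True),
    (("large", "red", "cube"), False),
]
G, S = candidate_elimination(examples)
print("G:", G)
print("S:", S)
```

With the four examples from the slides, both boundaries end at the single hypothesis with any size, color red, and shape sphere.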

LEX: a program that learns heuristics
Learns heuristics for symbolic integration problems.
Typical transformations used in performing integration include:
OP1: ∫ r f(x) dx → r ∫ f(x) dx
OP2: ∫ u dv → uv - ∫ v du
OP3: 1 * f(x) → f(x)
OP4: ∫ (f1(x) + f2(x)) dx → ∫ f1(x) dx + ∫ f2(x) dx
A heuristic tells when an operator is particularly useful:
If a problem state matches ∫ x transcendental(x) dx, then apply OP2 with bindings u = x, dv = transcendental(x) dx

A portion of LEX’s hierarchy of symbols

The overall architecture
A generalizer that uses candidate elimination to find heuristics
A problem solver that produces traces of problem solutions
A critic that produces positive and negative instances from a problem trace (the credit assignment problem)
A problem generator that produces new candidate problems

A version space for OP2 (Mitchell et al.,1983)

Comments on LEX
The evolving heuristics are not guaranteed to be admissible. The solution path found by the problem solver may not actually be a shortest path solution.
The problem generator is the least developed part of the program.
Empirical studies:
  before training: 5 problems solved in an average of 200 steps
  training: 12 problems
  after training: 5 problems solved in an average of 20 steps

More comments on VSL
Still lots of research going on.
It uses breadth-first search, which might be inefficient:
  one might need to use beam search to prune hypotheses from G and S if they grow excessively;
  another alternative is to use inductive bias and restrict the concept language.
How to address the noise problem? Maintain several G and S sets.

Decision Trees A decision tree allows a classification of an object by testing its values for certain properties check out the example at: www.aiinc.ca/demos/whale.html The learning problem is similar to concept learning using version spaces in the sense that we are trying to identify a class using the observable properties. It is different in the sense that we are trying to learn a structure that determines class membership after a sequence of questions. This structure is a decision tree.

Reverse engineered decision tree of the whale watcher expert system
[Figure: decision tree, split over two slides. Tests shown include: see flukes?, see dorsal fin?, size?, blow forward?, blows?, blow?, dorsal fin and blow visible at the same time?, dorsal fin tall and pointed? Leaves: blue whale, sperm whale, humpback whale, gray whale, right whale, bowhead whale, narwhal whale, killer whale, northern bottlenose whale, sei whale, fin whale]

What does the original data look like?

The search problem Given a table of observable properties, search for a decision tree that correctly represents the data (assuming that the data is noise-free), and is as small as possible. What does the search tree look like?

Comparing VSL and learning DTs
A hypothesis learned in VSL can be represented as a decision tree. Consider the predicate that we used as a VSL example: NUM(r) ∧ BLACK(s) ⇔ REWARD([r,s])
The following decision tree represents it:
NUM?
  True: BLACK?
    True: True
    False: False
  False: False

Predicate as a Decision Tree
The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree:
A?
  False: False
  True: B?
    False: True
    True: C?
      True: True
      False: False
Example: A mushroom is poisonous iff it is yellow and small, or yellow, big and spotted
x is a mushroom
CONCEPT = POISONOUS
A = YELLOW
B = BIG
C = SPOTTED
D = FUNNEL-CAP
E = BULKY

Training Set
[Table: examples 1 to 13 (Ex. #), each giving the True/False values of the observable predicates and of CONCEPT]

Possible Decision Tree
[Figure: a possible decision tree for the training set]
CONCEPT ⇔ (D ∧ (E ∨ A)) ∨ (C ∧ (B ∨ ((E ∧ A) ∨ A)))
[Figure: a smaller decision tree testing A, B, C]
CONCEPT ⇔ A ∧ (¬B ∨ C)
KIS bias → Build smallest decision tree
Computationally intractable problem → greedy algorithm

Getting Started
The distribution of the training set is:
True: 6, 7, 8, 9, 10, 13
False: 1, 2, 3, 4, 5, 11, 12
Without testing any observable predicate, we could report that CONCEPT is False (majority rule) with an estimated probability of error P(E) = 6/13.
Assuming that we will only include one observable predicate in the decision tree, which predicate should we test to minimize the probability of error?

Assume It’s A
A = T: True: 6, 7, 8, 9, 10, 13; False: 11, 12
A = F: True: none; False: 1, 2, 3, 4, 5
If we test only A, we will report that CONCEPT is True if A is True (majority rule) and False otherwise.
The estimated probability of error is: Pr(E) = (8/13)x(2/8) + (5/13)x0 = 2/13

Assume It’s B
B = T: True: 9, 10; False: 2, 3, 11, 12
B = F: True: 6, 7, 8, 13; False: 1, 4, 5
If we test only B, we will report that CONCEPT is False if B is True and True otherwise.
The estimated probability of error is: Pr(E) = (6/13)x(2/6) + (7/13)x(3/7) = 5/13

Assume It’s C
C = T: True: 6, 8, 9, 10, 13; False: 1, 3, 4
C = F: True: 7; False: 1, 5, 11, 12
If we test only C, we will report that CONCEPT is True if C is True and False otherwise.
The estimated probability of error is: Pr(E) = (8/13)x(3/8) + (5/13)x(1/5) = 4/13

Assume It’s D
D = T: True: 7, 10, 13; False: 3, 5
D = F: True: 6, 8, 9; False: 1, 2, 4, 11, 12
If we test only D, we will report that CONCEPT is True if D is True and False otherwise.
The estimated probability of error is: Pr(E) = (5/13)x(2/5) + (8/13)x(3/8) = 5/13

Assume It’s E
E = T: True: 8, 9, 10, 13; False: 1, 3, 5, 12
E = F: True: 6, 7; False: 2, 4, 11
If we test only E, we will report that CONCEPT is False, independent of the outcome.
The estimated probability of error is unchanged: Pr(E) = (8/13)x(4/8) + (5/13)x(2/5) = 6/13
So, the best predicate to test is A.
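The five error estimates can be recomputed from the branch counts on the slides. The sketch below is only an illustration: the `COUNTS` table lists, for each predicate, the (True, False) example counts in the predicate-true branch and then in the predicate-false branch, as read off the slides, and the function name `majority_error` is invented for this example.

```python
from fractions import Fraction

# (number of True examples, number of False examples) per branch:
# first the branch where the predicate is True, then where it is False.
COUNTS = {
    "A": [(6, 2), (0, 5)],
    "B": [(2, 4), (4, 3)],
    "C": [(5, 3), (1, 4)],
    "D": [(3, 2), (3, 5)],
    "E": [(4, 4), (2, 3)],
}


def majority_error(branches):
    """Estimated error when each branch reports its majority class."""
    total = sum(t + f for t, f in branches)
    wrong = sum(min(t, f) for t, f in branches)  # minority examples are misclassified
    return Fraction(wrong, total)


errors = {p: majority_error(b) for p, b in COUNTS.items()}
best = min(errors, key=errors.get)

for p, e in errors.items():
    print(p, e)
print("best predicate:", best)
```

Summing the minority counts over the branches is equivalent to the weighted form on the slides, since each term (n_branch/13) x (minority/n_branch) reduces to minority/13.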

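Choice of Second Predicate
Within the branch A = T, test C:
C = T: True: 6, 8, 9, 10, 13; False: none
C = F: True: 7; False: 11, 12
The majority rule gives the probability of error Pr(E|A) = 1/8 and Pr(E) = 1/13

Choice of Third Predicate
Within the branch A = T, C = F, test B:
B = T: True: none; False: 11, 12
B = F: True: 7; False: none

Final Tree
A?
  False: False
  True: C?
    True: True
    False: B?
      True: False
      False: True
CONCEPT ⇔ A ∧ (C ∨ ¬B)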
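As a quick check, the final tree can be written as nested tests and compared with the formula A ∧ (C ∨ ¬B) over all truth assignments. This is an illustrative sketch; the function name `final_tree` is made up here.

```python
from itertools import product


def final_tree(a: bool, b: bool, c: bool) -> bool:
    """Walk the learned tree: test A, then C, then B."""
    if not a:
        return False      # A false: CONCEPT is False
    if c:
        return True       # A true, C true: CONCEPT is True
    return not b          # A true, C false: True only when B is false


# Compare the tree with the formula on all 8 truth assignments.
agree = all(
    final_tree(a, b, c) == (a and (c or not b))
    for a, b, c in product([False, True], repeat=3)
)
print("tree matches formula:", agree)
```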

Learning a decision tree
function induce_tree (example_set, properties)
begin
  if all entries in example_set are in the same class
    then return a leaf node labeled with that class
  else if properties is empty
    then return a leaf node labeled with the disjunction of all classes in example_set
  else begin
    select a property, P, and make it the root of the current tree;
    delete P from properties;
    for each value, V, of P
    begin
      create a branch of the tree labeled with V;
      let partition_V be the elements of example_set with value V for property P;
      call induce_tree (partition_V, properties), attach result to branch V
    end
  end
end
If property P is Boolean, the partition will contain two sets, one with property P true and one with P false.

What happens if there is noise in the training set?
The part of the algorithm shown below handles this:
  if properties is empty then return leaf node labeled with disjunction of all classes in example_set
Consider a very small (but inconsistent) training set:
A    classification
T    T
F    F
F    T
The resulting tree:
A?
  True: True
  False: False ∨ True
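The induce_tree procedure and its behavior on the inconsistent training set can be sketched in Python. Several choices here are assumptions made for illustration: examples are (attribute dict, class) pairs, the property to test is simply the first one in the list (the slides leave the selection rule open), only values that actually occur in the examples get a branch, and a disjunction of classes is represented as a frozenset.

```python
def induce_tree(examples, properties):
    """Return a class label, a frozenset of classes, or (property, branches)."""
    classes = {label for _, label in examples}
    if len(classes) == 1:
        return classes.pop()            # all examples are in the same class
    if not properties:
        return frozenset(classes)       # disjunction of all remaining classes
    prop, rest = properties[0], properties[1:]   # assumed rule: take the first
    branches = {}
    for value in sorted({attrs[prop] for attrs, _ in examples}):
        partition = [(a, c) for a, c in examples if a[prop] == value]
        branches[value] = induce_tree(partition, rest)
    return (prop, branches)


# The very small, inconsistent training set from the slide.
noisy = [
    ({"A": True}, True),
    ({"A": False}, False),
    ({"A": False}, True),
]
tree = induce_tree(noisy, ["A"])
print(tree)
```

The A = False branch runs out of properties while still holding both classes, so it ends in a leaf labeled with the disjunction of False and True, as on the slide.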

Using Information Theory Rather than minimizing the probability of error, most existing learning procedures try to minimize the expected number of questions needed to decide if an object x satisfies CONCEPT. This minimization is based on a measure of the “quantity of information” that is contained in the truth value of an observable predicate and is explained in Section 9.3.2. We will skip the technique given there and use the “probability of error” approach.

Assessing performance
[Figure: typical learning curve; x-axis: size of training set; y-axis: % correct on test set (up to 100)]

The evaluation of ID3 in chess endgame

Other issues in learning decision trees If data for some attribute is missing and is hard to obtain, it might be possible to extrapolate or use “unknown.” If some attributes have continuous values, groupings might be used. If the data set is too large, one might use bagging to select a sample from the training set. Or, one can use boosting to assign a weight showing importance to each instance. Or, one can divide the sample set into subsets and train on one, and test on others.

Explanation based learning Idea: can learn better when the background theory is known Use the domain theory to explain the instances taught Generalize the explanation to come up with a “learned rule”

Example
We would like the system to learn what a cup is, i.e., we would like it to learn a rule of the form: premise(X) → cup(X)
Assume that we have a domain theory:
liftable(X) ∧ holds_liquid(X) → cup(X)
part(Z,W) ∧ concave(W) ∧ points_up(W) → holds_liquid(Z)
light(Y) ∧ part(Y,handle) → liftable(Y)
small(A) → light(A)
made_of(A,feathers) → light(A)
The training example is the following:
cup(obj1) small(obj1) part(obj1,handle) owns(bob,obj1) part(obj1,bottom) part(obj1,bowl) points_up(bowl) concave(bowl) color(obj1,red)

First, form a specific proof that obj1 is a cup
cup(obj1)
  liftable(obj1)
    light(obj1)
      small(obj1)
    part(obj1, handle)
  holds_liquid(obj1)
    part(obj1, bowl)
    concave(bowl)
    points_up(bowl)

Second, analyze the explanation structure to generalize it

Third, adopt the generalized proof
cup(X)
  liftable(X)
    light(X)
      small(X)
    part(X, handle)
  holds_liquid(X)
    part(X, W)
    concave(W)
    points_up(W)

The EBL algorithm
Initialize hypothesis = { }
For each positive training example not covered by hypothesis:
1. Explain how the training example satisfies the target concept, in terms of the domain theory
2. Analyze the explanation to determine the most general conditions under which this explanation (proof) holds
3. Refine the hypothesis by adding a new rule, whose premises are the above conditions, and whose consequent asserts the target concept
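Steps 1 and 2 can be sketched for the cup example with a tiny backward chainer in Python. This is a simplified illustration under several assumptions, not the method used in the textbook's code: atoms are tuples, capitalized strings are variables, the target fact cup(obj1) is left out of the fact list so it must be proved, and a second "general" substitution is threaded through the same proof so that only bindings forced by the rules are kept. All function names are invented here.

```python
import itertools

RULES = [  # (head, body) pairs from the domain theory of the cup example
    (("cup", "X"), [("liftable", "X"), ("holds_liquid", "X")]),
    (("holds_liquid", "Z"), [("part", "Z", "W"), ("concave", "W"), ("points_up", "W")]),
    (("liftable", "Y"), [("light", "Y"), ("part", "Y", "handle")]),
    (("light", "A"), [("small", "A")]),
    (("light", "A"), [("made_of", "A", "feathers")]),
]
FACTS = [  # the training example, without the target cup(obj1)
    ("small", "obj1"), ("part", "obj1", "handle"), ("owns", "bob", "obj1"),
    ("part", "obj1", "bottom"), ("part", "obj1", "bowl"),
    ("points_up", "bowl"), ("concave", "bowl"), ("color", "obj1", "red"),
]
_fresh = itertools.count()


def is_var(t):
    return t[:1].isupper()


def walk(t, s):
    while is_var(t) and t in s:
        t = s[t]
    return t


def unify(a, b, s):
    """Unify two atoms under substitution s; return a new dict or None."""
    if a[0] != b[0] or len(a) != len(b):
        return None
    s = dict(s)
    for x, y in zip(a[1:], b[1:]):
        x, y = walk(x, s), walk(y, s)
        if x == y:
            continue
        if is_var(x):
            s[x] = y
        elif is_var(y):
            s[y] = x
        else:
            return None
    return s


def rename(rule):
    """Give the rule's variables fresh names for this use."""
    n = next(_fresh)
    fix = lambda atom: (atom[0],) + tuple(f"{t}_{n}" if is_var(t) else t for t in atom[1:])
    return fix(rule[0]), [fix(b) for b in rule[1]]


def prove(spec, gen, s, gs):
    """Prove spec from FACTS/RULES; gs records only rule-forced bindings."""
    for fact in FACTS:
        s1 = unify(spec, fact, s)
        if s1 is not None:
            yield s1, gs, [gen]          # facts are leaves of the explanation
    for rule in RULES:
        head, body = rename(rule)
        s1, gs1 = unify(head, spec, s), unify(head, gen, gs)
        if s1 is not None and gs1 is not None:
            yield from prove_all(body, s1, gs1)


def prove_all(atoms, s, gs):
    if not atoms:
        yield s, gs, []
        return
    for s1, gs1, leaves in prove(atoms[0], atoms[0], s, gs):
        for s2, gs2, more in prove_all(atoms[1:], s1, gs1):
            yield s2, gs2, leaves + more


def explain(spec_goal, gen_goal):
    """Return the generalized leaf conditions of the first proof found."""
    _, gs, leaves = next(prove(spec_goal, gen_goal, {}, {}))
    strip = lambda t: t.split("_")[0] if is_var(t) else t  # drop rename suffix
    return [(leaf[0],) + tuple(strip(walk(t, gs)) for t in leaf[1:]) for leaf in leaves]


premises = explain(("cup", "obj1"), ("cup", "X"))
print(premises, "->", ("cup", "X"))
```

The constant handle survives because it appears in a rule, while bowl is replaced by a variable because it came only from the training facts; properties such as owns and color never enter the proof and are dropped.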

Wait a minute!
Isn’t this “just a restatement of what the learner already knows?”
Not really. It is
  a theory-guided generalization from examples
  an example-guided operationalization of theories
Even if you know all the rules of chess, you get better if you play more.
Even if you know the basic axioms of probability, you get better as you solve more probability problems.

Comments on EBL Note that the “irrelevant” properties of obj1 were disregarded (e.g., color is red, it has a bottom) Also note that “irrelevant” generalizations were sorted out due to its goal-directed nature Allows justified generalization from a single example Generality of result depends on domain theory Still requires multiple examples Assumes that the domain theory is correct (error-free)---as opposed to approximate domain theories which we will not cover. This assumption holds in chess and other search problems. It allows us to assume explanation = proof.

Two formulations for learning
Inductive
  Given: Instances; Hypotheses; Target concept; Training examples of the target concept
  Determine: Hypotheses consistent with the training examples
Analytical
  Given: Instances; Hypotheses; Target concept; Training examples of the target concept; Domain theory for explaining examples
  Determine: Hypotheses consistent with the training examples and the domain theory

Two formulations for learning (cont’d)
Inductive: Hypothesis fits data; Statistical inference; Requires little prior knowledge; Syntactic inductive bias
Analytical: Hypothesis fits domain theory; Deductive inference; Learns from scarce data; Bias is domain theory
DT and VS learners are “similarity-based”.
Prior knowledge is important. It might be one of the reasons for humans’ ability to generalize from as few as a single training instance.
Prior knowledge can guide in a space of an unlimited number of generalizations that can be produced by training examples.

An example: META-DENDRAL
Learns rules for DENDRAL. Remember that DENDRAL infers the structure of organic molecules from their chemical formula and mass spectrographic data.
Meta-DENDRAL constructs an explanation of the site of a cleavage using
  the structure of a known compound,
  the mass and relative abundance of the fragments produced by spectrography,
  a “half-order” theory (e.g., double and triple bonds do not break; only fragments larger than two carbon atoms show up in the data).
These explanations are used as examples for constructing general rules.

Analogical reasoning Idea: if two situations are similar in some respects, then they will probably be in others Define the source of an analogy to be a problem solution. It is a theory that is relatively well understood. The target of an analogy is a theory that is not completely understood. Analogy constructs a mapping between corresponding elements of the target and the source.

Example: atom/solar system analogy
The source domain contains:
yellow(sun)
blue(earth)
hotter-than(sun,earth)
causes(more-massive(sun,earth), attract(sun,earth))
causes(attract(sun,earth), revolves-around(earth,sun))
The target domain that the analogy is intended to explain includes:
more-massive(nucleus, electron)
revolves-around(electron, nucleus)
The mapping is: sun → nucleus and earth → electron
The extension of the mapping leads to the inference:
causes(more-massive(nucleus,electron), attract(nucleus,electron))
causes(attract(nucleus,electron), revolves-around(electron,nucleus))
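The extension of the mapping can be sketched as a substitution applied to the source facts. This is a minimal illustration: facts are nested tuples, the names `transfer`, `SOURCE`, and `MAPPING` are invented for this example, and carrying over only the causes(...) relations is an assumption made here to reproduce the inference on the slide.

```python
SOURCE = [
    ("yellow", "sun"),
    ("blue", "earth"),
    ("hotter-than", "sun", "earth"),
    ("causes", ("more-massive", "sun", "earth"), ("attract", "sun", "earth")),
    ("causes", ("attract", "sun", "earth"), ("revolves-around", "earth", "sun")),
]
MAPPING = {"sun": "nucleus", "earth": "electron"}


def transfer(term, mapping):
    """Rewrite a (possibly nested) term by replacing mapped constants."""
    if isinstance(term, tuple):
        return tuple(transfer(t, mapping) for t in term)
    return mapping.get(term, term)


# Assumption for this sketch: only causal relations are carried over.
inferred = [transfer(fact, MAPPING) for fact in SOURCE if fact[0] == "causes"]
for fact in inferred:
    print(fact)
```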

A typical framework
Retrieval: Given a target problem, select a potential source analog.
Elaboration: Derive additional features and relations of the source.
Mapping and inference: Mapping of source attributes into the target domain.
Justification: Show that the mapping is valid.