Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Wednesday, 11 April 2007 William H. Hsu Department of Computing and Information Sciences, KSU Readings: Chapter 10, Mitchell Intro to Rule Learning Lecture 33 of 42

Lecture Outline
Readings: Sections , Mitchell; Section 21.4, Russell and Norvig
Suggested Exercises: 10.5, Mitchell
Induction as Inverse of Deduction
–Problem of inductive learning revisited
–Operators for automated deductive inference
    Resolution rule for deduction
    First-order predicate calculus (FOPC) and resolution theorem proving
–Inverting resolution
    Propositional case
    First-order case
Inductive Logic Programming (ILP)
–Cigol
–Progol

Induction as Inverted Deduction: Design Principles
Recall: Definition of Induction
–Induction: finding h such that $(\forall \langle x_i, f(x_i) \rangle \in D)\ (B \wedge h \wedge x_i) \vdash f(x_i)$
    $A \vdash B$ means A logically entails B
    $x_i$ is the ith target instance
    $f(x_i)$ is the target function value for example $x_i$ (data set $D = \{\langle x_i, f(x_i) \rangle\}$)
    Background knowledge B (e.g., inductive bias in inductive learning)
Idea
–Design inductive algorithm by inverting operators for automated deduction
–Same deductive operators as used in theorem proving
[Figure: Deductive System for Inductive Learning. Inputs to the theorem prover: training examples, new instance, and the assertion {c ∈ H} (inductive bias made explicit). Output: classification of new instance (or "Don't Know").]

Induction as Inverted Deduction: Example
Deductive Query
–"Pairs of people ⟨u, v⟩ such that u is a child of v"
–Relations (predicates)
    Child (target predicate)
    Father, Mother, Parent, Male, Female
Learning Problem
–Formulation
    Concept learning: target function f is Boolean-valued, i.e., a target predicate
–Components
    Target function value $f(x_i)$: Child (Bob, Sharon)
    $x_i$: Male (Bob), Female (Sharon), Father (Sharon, Bob)
    B: {Parent (x, y) ← Father (x, y). Parent (x, y) ← Mother (x, y).}
–What satisfies $(\forall \langle x_i, f(x_i) \rangle \in D)\ (B \wedge h \wedge x_i) \vdash f(x_i)$?
    $h_1$: Child (u, v) ← Father (v, u). (doesn't use B)
    $h_2$: Child (u, v) ← Parent (v, u). (uses B)

Perspectives on Learning and Inference
Jevons (1874)
–First published insight that induction can be interpreted as inverted deduction
–"Induction is, in fact, the inverse operation of deduction, and cannot be conceived to exist without the corresponding operation, so that the question of relative importance cannot arise. Who thinks of asking whether addition or subtraction is the more important process in arithmetic? But at the same time much difference in difficulty may exist between a direct and inverse operation; … it must be allowed that inductive investigations are of a far higher degree of difficulty and complexity than any questions of deduction…"
Aristotle (circa 330 B.C.)
–Early views on learning from observations (examples) and interplay between induction and deduction
–"… scientific knowledge through demonstration [i.e., deduction] is impossible unless a man knows the primary immediate premises… we must get to know the primary premises by induction; for the method by which even sense-perception implants the universal is inductive…"

Induction as Inverted Deduction: Operators
Deductive Operators
–Have mechanical operators (F) for finding logically entailed conclusions (C)
–F(A, B) = C where $A \wedge B \vdash C$
    A, B, C: logical formulas
    F: deduction algorithm
–Intuitive idea: apply deductive inference (aka sequent) rules to A, B to generate C
Inductive Operators
–Need operators O to find inductively inferred hypotheses (h, "primary premises")
–O(B, D) = h where $(\forall \langle x_i, f(x_i) \rangle \in D)\ (B \wedge h \wedge x_i) \vdash f(x_i)$
    B, D, h: logical formulas describing observations
    O: induction algorithm

Induction as Inverted Deduction: Advantages and Disadvantages
Advantages (Pros)
–Subsumes earlier idea of finding h that "fits" training data
–Domain theory B helps define meaning of "fitting" the data: $B \wedge h \wedge x_i \vdash f(x_i)$
–Suggests algorithms that search H guided by B
    Theory-guided constructive induction [Donoho and Rendell, 1995]
    aka Knowledge-guided constructive induction [Donoho, 1996]
Disadvantages (Cons)
–Doesn't allow for noisy data
    Q: Why not?
    A: Consider what $(\forall \langle x_i, f(x_i) \rangle \in D)\ (B \wedge h \wedge x_i) \vdash f(x_i)$ stipulates: every example must be entailed, so a single mislabeled example rules out an otherwise good h
–First-order logic gives a huge hypothesis space H
    Overfitting…
    Intractability of calculating all acceptable h's

Deduction: Resolution Rule
Intuitive Idea
–Suppose we know $P \vee L$ and $\neg L \vee R$
–Can infer: $P \vee R$
–Use this to reason over logical statements (propositions, first-order clauses)
Resolution Rule
–Sequent rule
–1. Given: initial clauses $C_1$ and $C_2$, find literal L from clause $C_1$ such that $\neg L$ occurs in clause $C_2$
–2. Form the resolvent C by including all literals from $C_1$ and $C_2$, except L and $\neg L$
    Set of literals occurring in conclusion C is $C = (C_1 - \{L\}) \cup (C_2 - \{\neg L\})$
    $\cup$ denotes set union, "-" denotes set difference
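The resolution rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: clauses are modeled as frozensets of strings, a leading "~" is an assumed convention for negation, and the proposition names are chosen only for the example.

```python
def negate(literal):
    """Return the complement of a literal; a leading '~' marks negation."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def resolve(c1, c2):
    """Return every resolvent of clauses c1 and c2.

    For each literal L in c1 whose complement ~L occurs in c2,
    the resolvent is (c1 - {L}) | (c2 - {~L}).
    """
    resolvents = []
    for lit in sorted(c1):
        if negate(lit) in c2:
            resolvents.append((c1 - {lit}) | (c2 - {negate(lit)}))
    return resolvents


c1 = frozenset({"PassExam", "~KnowMaterial"})
c2 = frozenset({"KnowMaterial", "~Study"})
result = resolve(c1, c2)
print(result)
```

Here the only complementary pair is ~KnowMaterial / KnowMaterial, so the single resolvent contains PassExam and ~Study.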

Inverting Resolution: Example
[Figure: the same three clauses shown twice, once for each direction of inference]
Resolution: from $C_1$ and $C_2$, derive C
    $C_1$: Pass-Exam ∨ ¬Know-Material
    $C_2$: Know-Material ∨ ¬Study
    C: Pass-Exam ∨ ¬Study
Inverse Resolution: from C and $C_1$, derive $C_2$
    C: Pass-Exam ∨ ¬Study
    $C_1$: Pass-Exam ∨ ¬Know-Material
    $C_2$: Know-Material ∨ ¬Study

Inverted Resolution: Propositional Logic
Problem Definition
–Given: initial clauses $C_1$ and C
–Return: clause $C_2$ such that C is resolvent of $C_1$ and $C_2$
Intuitive Idea
–Reason from consequent and partial set of premises to unknown premises
–The unknown premise plays the role of the hypothesis
Inverted Resolution Procedure
–1. Find literal L that occurs in $C_1$ but not in C
–2. Form second clause $C_2$ by including the following literals: $C_2 = (C - (C_1 - \{L\})) \cup \{\neg L\}$
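The two-step procedure above can be sketched directly. As in any sketch of this kind, the representation is an assumption: clauses are frozensets of strings and "~" marks negation. The function returns one candidate per eligible literal L; other valid choices of the second clause exist (for example, ones that also repeat literals of the first clause), so this is the smallest candidate rather than the only one.

```python
def negate(literal):
    """Return the complement of a literal; a leading '~' marks negation."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def inverse_resolve(c, c1):
    """Return candidate clauses c2 such that c is a resolvent of c1 and c2."""
    candidates = []
    for lit in sorted(c1 - c):                    # step 1: L in c1 but not in c
        c2 = (c - (c1 - {lit})) | {negate(lit)}   # step 2: build c2
        candidates.append(c2)
    return candidates


c = frozenset({"PassExam", "~Study"})
c1 = frozenset({"PassExam", "~KnowMaterial"})
candidates = inverse_resolve(c, c1)
print(candidates)
```

With these inputs the only literal of c1 missing from c is ~KnowMaterial, so the single candidate contains KnowMaterial and ~Study.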

Quick Review: First-Order Predicate Calculus (FOPC)
Components of FOPC Formulas: Quick Intro to Terminology
–Constants: e.g., John, Kansas, 42
–Variables: e.g., Name, State, x
–Predicates: e.g., Father-Of, Greater-Than
–Functions: e.g., age, cosine
–Term: constant, variable, or function(term)
–Literals (atoms): Predicate(term) or negation (e.g., Greater-Than (age (John), 42))
–Clause: disjunction of literals with implicit universal quantification
–Horn clause: at most one positive literal ($H \vee \neg L_1 \vee \neg L_2 \vee \ldots \vee \neg L_n$)
FOPC: Representation Language for First-Order Resolution
–aka First-Order Logic (FOL)
–Applications
    Resolution using Horn clauses: logic programming (Prolog)
    Automated deduction (deductive inference), theorem proving
–Goal: learn first-order rules by inverting first-order resolution

First-Order Resolution
Intuitive Idea: Same as for Propositional Resolution
Resolution Rule
–Sequent rule: also same as for propositional resolution
–1. Given: initial clauses $C_1$ and $C_2$, find literal $L_1$ from clause $C_1$, literal $L_2$ from clause $C_2$, and substitution $\theta$ such that $L_1\theta = \neg L_2\theta$ (found using unification)
–2. Form the resolvent C by including all literals from $C_1\theta$ and $C_2\theta$, except $L_1\theta$ and $L_2\theta$
    Set of literals occurring in conclusion C is $C = (C_1 - \{L_1\})\theta \cup (C_2 - \{L_2\})\theta$
    $\cup$ denotes set union, "-" denotes set difference
    Substitution $\theta$ applied to sentences with matched literals removed
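Step 1 depends on unification to find the substitution. The following is a minimal sketch of a unifier, under assumptions that are not from the lecture: variables are strings starting with "?", constants are other strings, and compound terms or atoms are tuples of the form (name, arg1, ...). It returns a substitution as a dict, or None when no unifier exists.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def walk(term, theta):
    """Follow variable bindings in theta until reaching an unbound term."""
    while is_var(term) and term in theta:
        term = theta[term]
    return term


def occurs(var, term, theta):
    """Occurs check: does var appear inside term under theta?"""
    term = walk(term, theta)
    if term == var:
        return True
    if isinstance(term, tuple):
        return any(occurs(var, arg, theta) for arg in term[1:])
    return False


def unify(a, b, theta=None):
    """Return a substitution that makes a and b identical, or None."""
    theta = {} if theta is None else theta
    a, b = walk(a, theta), walk(b, theta)
    if a == b:
        return theta
    if is_var(a):
        return None if occurs(a, b, theta) else {**theta, a: b}
    if is_var(b):
        return None if occurs(b, a, theta) else {**theta, b: a}
    if (isinstance(a, tuple) and isinstance(b, tuple)
            and len(a) == len(b) and a[0] == b[0]):
        for x, y in zip(a[1:], b[1:]):
            theta = unify(x, y, theta)
            if theta is None:
                return None
        return theta
    return None


theta = unify(("Father", "?x", "Tom"), ("Father", "Shannon", "Tom"))
print(theta)
```

To resolve two clauses, one would unify a literal of the first clause with the complement of a literal of the second, then apply the resulting substitution to the remaining literals.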

Inverted Resolution: First-Order Logic
Problem Definition
–As for inverted propositional resolution
    Given: initial clauses $C_1$ and C
    Return: clause $C_2$ such that C is resolvent of $C_1$ and $C_2$
–Difference: must find, apply substitutions and inverse substitutions
Inverted Resolution Procedure
–1. Find literal $L_1$ that occurs in $C_1$ but doesn't match any literal in C
–2. Form second clause $C_2$ by including the following literals: $C_2 = (C - (C_1 - \{L_1\})\theta_1)\theta_2^{-1} \cup \{\neg L_1 \theta_1 \theta_2^{-1}\}$

Inverse Resolution Algorithm (Cigol): Example
[Figure: two inverse resolution steps, read from the observed example upward]
–Step 1: from C = GrandChild (Bob, Shannon) and $C_1$ = Father (Shannon, Tom), with inverse substitution {Shannon / x}, obtain GrandChild (Bob, x) ∨ ¬Father (x, Tom)
–Step 2: from that clause and Father (Tom, Bob), with inverse substitution {Bob / y, Tom / z}, obtain GrandChild (y, x) ∨ ¬Father (x, z) ∨ ¬Father (z, y)
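The first step of this example can be worked through with the first-order inverse resolution formula $C_2 = (C - (C_1 - \{L_1\})\theta_1)\theta_2^{-1} \cup \{\neg L_1 \theta_1 \theta_2^{-1}\}$. The derivation below is a sketch that assumes the choices $L_1 = \mathit{Father}(\mathit{Shannon}, \mathit{Tom})$, $\theta_1 = \{\}$ and $\theta_2^{-1} = \{\mathit{Shannon}/x\}$; other choices of inverse substitution would give other candidate clauses.

```latex
\begin{align*}
C   &= \mathit{GrandChild}(\mathit{Bob}, \mathit{Shannon}), \qquad
C_1 = \mathit{Father}(\mathit{Shannon}, \mathit{Tom}), \qquad
L_1 = \mathit{Father}(\mathit{Shannon}, \mathit{Tom}) \\
(C_1 - \{L_1\})\theta_1 &= \{\} \\
(C - \{\})\,\theta_2^{-1} &= \mathit{GrandChild}(\mathit{Bob}, x) \\
\neg L_1 \theta_1 \theta_2^{-1} &= \neg \mathit{Father}(x, \mathit{Tom}) \\
C_2 &= \mathit{GrandChild}(\mathit{Bob}, x) \vee \neg \mathit{Father}(x, \mathit{Tom})
\end{align*}
```

Resolving $C_2$ with $C_1$ under the substitution $\{\mathit{Shannon}/x\}$ recovers C, which is the check that the inverse step was valid.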

Progol
Problem: Searching Resolution Space Results in Combinatorial Explosion
Solution Approach
–Reduce explosion by generating most specific acceptable h
–Conduct general-to-specific search (cf. Find-G, CN2 Learn-One-Rule)
Procedure
–1. User specifies H by stating predicates, functions, and forms of arguments allowed for each
–2. Progol uses sequential covering algorithm
    FOR each $\langle x_i, f(x_i) \rangle$ DO
        Find most specific hypothesis $h_i$ such that $B \wedge h_i \wedge x_i \vdash f(x_i)$
    Actually, considers only entailment within k steps
–3. Conduct general-to-specific search bounded by specific hypothesis $h_i$, choosing hypothesis with minimum description length

Learning First-Order Rules: Numerical versus Symbolic Approaches
Numerical Approaches
–Method 1: learning classifiers and extracting rules
    Simultaneous covering: decision trees, ANNs
    NB: extraction methods may not be simple enumeration of model
–Method 2: learning rules directly using numerical criteria
    Sequential covering algorithms and search
    Criteria: MDL (information gain), accuracy, m-estimate, other heuristic evaluation functions
Symbolic Approaches
–Invert forward inference (deduction) operators
    Resolution rule
    Propositional and first-order variants
–Issues
    Need to control search
    Ability to tolerate noise (contradictions): paraconsistent reasoning

Learning Disjunctive Sets of Rules
Method 1: Rule Extraction from Trees
–Learn decision tree
–Convert to rules
    One rule per root-to-leaf path
    Recall: can post-prune rules (drop pre-conditions to improve validation set accuracy)
Method 2: Sequential Covering
–Idea: greedily (sequentially) find rules that apply to (cover) instances in D
–Algorithm
    Learn one rule with high accuracy, any coverage
    Remove positive examples (of target attribute) covered by this rule
    Repeat

Sequential Covering: Algorithm
Algorithm Sequential-Covering (Target-Attribute, Attributes, D, Threshold)
–Learned-Rules ← {}
–New-Rule ← Learn-One-Rule (Target-Attribute, Attributes, D)
–WHILE Performance (New-Rule, D) > Threshold DO
    Learned-Rules += New-Rule  // add new rule to set
    D.Remove-Covered-By (New-Rule)  // remove examples covered by New-Rule
    New-Rule ← Learn-One-Rule (Target-Attribute, Attributes, D)
–Sort-By-Performance (Learned-Rules, Target-Attribute, D)
–RETURN Learned-Rules
What Does Sequential-Covering Do?
–Learns one rule, New-Rule
–Takes out every example in D to which New-Rule applies (every covered example)
–Repeats while the newest rule still performs above Threshold

Learn-One-Rule: (Beam) Search for Preconditions
[Figure: general-to-specific search tree over rule preconditions]
Level 0 (most general rule)
    IF {} THEN Play-Tennis = Yes
Level 1 (one precondition)
    IF {Humidity = Normal} THEN Play-Tennis = Yes
    IF {Wind = Strong} THEN Play-Tennis = No
    IF {Wind = Light} THEN Play-Tennis = Yes
    IF {Humidity = High} THEN Play-Tennis = No
    …
Level 2 (specializations of IF {Humidity = Normal})
    IF {Humidity = Normal, Outlook = Sunny} THEN Play-Tennis = Yes
    IF {Humidity = Normal, Wind = Strong} THEN Play-Tennis = Yes
    IF {Humidity = Normal, Wind = Light} THEN Play-Tennis = Yes
    IF {Humidity = Normal, Outlook = Rain} THEN Play-Tennis = Yes
    …

Learn-One-Rule: Algorithm
Algorithm Sequential-Covering (Target-Attribute, Attributes, D)
–Learned-Rules ← {}
–Pos ← D.Positive-Examples()
–Neg ← D.Negative-Examples()
–WHILE NOT Pos.Empty() DO  // learn new rule
    New-Rule ← Learn-One-Rule (Target-Attribute, Attributes, D)
    Learned-Rules.Add-Rule (New-Rule)
    Pos.Remove-Covered-By (New-Rule)
–RETURN (Learned-Rules)
Algorithm Learn-One-Rule (Target-Attribute, Attributes, D)
–New-Rule ← most general rule possible
–New-Rule-Neg ← D.Negative-Examples()
–WHILE NOT New-Rule-Neg.Empty() DO  // specialize New-Rule
    1. Candidate-Literals ← Generate-Candidates()  // all possible new constraints; NB: rank by Performance()
    2. Best-Literal ← argmax over L ∈ Candidate-Literals of Performance (Specialize-Rule (New-Rule, L), Target-Attribute, D)
    3. New-Rule.Add-Precondition (Best-Literal)  // add the best one
    4. New-Rule-Neg ← New-Rule-Neg.Filter-By (New-Rule)  // keep only negatives still covered
–RETURN (New-Rule)
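The two procedures above can be sketched together in Python. This is a simplified illustration under stated assumptions, not the lecture's implementation: rules are dicts of attribute-value preconditions that predict the target is true, Performance is taken to be sample accuracy, the search is greedy (beam width 1), and the four-example data set is made up for the demonstration.

```python
def covers(rule, example):
    """A rule is a dict of attribute -> required value (its preconditions)."""
    return all(example[a] == v for a, v in rule.items())


def accuracy(rule, examples, target):
    """Sample accuracy n_c / n over the examples the rule covers."""
    covered = [e for e in examples if covers(rule, e)]
    if not covered:
        return 0.0
    return sum(e[target] for e in covered) / len(covered)


def learn_one_rule(examples, attributes, target):
    """Greedy general-to-specific search for one rule predicting target = True."""
    rule = {}  # most general rule: no preconditions
    while any(covers(rule, e) and not e[target] for e in examples):
        candidates = [
            {**rule, a: v}
            for a in attributes if a not in rule
            for v in sorted({e[a] for e in examples})
        ]
        if not candidates:
            break  # no attribute left to specialize on
        rule = max(candidates, key=lambda r: accuracy(r, examples, target))
    return rule


def sequential_covering(examples, attributes, target):
    """Learn rules one at a time, removing the positives each rule covers."""
    rules, remaining = [], list(examples)
    while any(e[target] for e in remaining):
        rule = learn_one_rule(remaining, attributes, target)
        if not any(covers(rule, e) and e[target] for e in remaining):
            break  # no progress; stop rather than loop forever
        rules.append(rule)
        remaining = [e for e in remaining
                     if not (covers(rule, e) and e[target])]
    return rules


# Hypothetical toy data in the spirit of the Play-Tennis example.
data = [
    {"Humidity": "Normal", "Wind": "Light", "Play": True},
    {"Humidity": "Normal", "Wind": "Strong", "Play": True},
    {"Humidity": "High", "Wind": "Light", "Play": True},
    {"Humidity": "High", "Wind": "Strong", "Play": False},
]
rules = sequential_covering(data, ["Humidity", "Wind"], "Play")
print(rules)
```

On this data the first rule found is Humidity = Normal, which covers two positives; once those are removed, Wind = Light covers the remaining positive. Replacing `max` over single candidates with the k best candidates at each step would turn the greedy search into a beam search.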

Learn-One-Rule: Subtle Issues
How Does Learn-One-Rule Implement Search?
–Effective approach: Learn-One-Rule organizes H in same general fashion as ID3
–Difference
    Follows only most promising branch in tree at each step
    Only one attribute-value pair (versus splitting on all possible values)
–General-to-specific search (depicted in figure)
    Problem: greedy depth-first search susceptible to local optima
    Solution approach: beam search (rank by performance, always expand k best)
    Easily generalizes to multi-valued target functions (how?)
Designing Evaluation Function to Guide Search
–Performance (Rule, Target-Attribute, D)
–Possible choices
    Entropy (i.e., information gain) as for ID3
    Sample accuracy: $n_c / n$, where $n_c$ = number of correct rule predictions and n = total number of predictions
    m-estimate: $(n_c + mp) / (n + m)$, where m = weight and p = prior of rule RHS
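A short numeric comparison makes the difference between sample accuracy and the m-estimate concrete. The counts below are invented for illustration; the function names are not from the lecture.

```python
def sample_accuracy(n_c, n):
    """Fraction of the rule's predictions that are correct."""
    return n_c / n


def m_estimate(n_c, n, p, m):
    """Accuracy estimate that blends in prior p with weight m."""
    return (n_c + m * p) / (n + m)


# A rule that is right on both of the only two examples it covers.
raw = sample_accuracy(2, 2)
smoothed = m_estimate(2, 2, p=0.5, m=2)
print(raw, smoothed)
```

With only two covered examples the sample accuracy is 1.0, while the m-estimate pulls the score toward the prior (0.75 here). Setting m = 0 recovers sample accuracy, and larger m demands more evidence before a rule scores well.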

Variants of Rule Learning Programs
Sequential or Simultaneous Covering of Data?
–Sequential: isolate components of hypothesis (e.g., search for one rule at a time)
–Simultaneous: whole hypothesis at once (e.g., search for whole tree at a time)
General-to-Specific or Specific-to-General?
–General-to-specific: add preconditions, Find-G
–Specific-to-general: drop preconditions, Find-S
Generate-and-Test or Example-Driven?
–Generate-and-test: search through syntactically legal hypotheses
–Example-driven: Find-S, Candidate-Elimination, Cigol (next time)
Post-Pruning of Rules?
–Recall (Lecture 5): very popular overfitting recovery method
What Statistical Evaluation Method?
–Entropy
–Sample accuracy (aka relative frequency)
–m-estimate of accuracy

First-Order Rules
What Are First-Order Rules?
–Well-formed formulas (WFFs) of first-order predicate calculus (FOPC)
–Sentences of first-order logic (FOL)
–Example (recursive)
    Ancestor (x, y) ← Parent (x, y).
    Ancestor (x, y) ← Parent (x, z) ∧ Ancestor (z, y).
Components of FOPC Formulas: Quick Intro to Terminology
–Constants: e.g., John, Kansas, 42
–Variables: e.g., Name, State, x
–Predicates: e.g., Father-Of, Greater-Than
–Functions: e.g., age, cosine
–Term: constant, variable, or function(term)
–Literals (atoms): Predicate(term) or negation (e.g., Greater-Than (age(John), 42))
–Clause: disjunction of literals with implicit universal quantification
–Horn clause: at most one positive literal ($H \vee \neg L_1 \vee \neg L_2 \vee \ldots \vee \neg L_n$)

Learning First-Order Rules
Why Do That?
–Can learn sets of rules such as
    Ancestor (x, y) ← Parent (x, y).
    Ancestor (x, y) ← Parent (x, z) ∧ Ancestor (z, y).
–General-purpose (Turing-complete) programming language PROLOG
    Programs are such sets of rules (Horn clauses)
    Inductive logic programming (next time): kind of program synthesis
Caveat
–Arbitrary inference using first-order rules is semi-decidable
    Recursively enumerable but not recursive (reduction to halting problem $L_H$)
    Compare: resolution theorem-proving; arbitrary queries in Prolog
–Generally, may have to restrict power
    Inferential completeness
    Expressive power of Horn clauses
    Learning part

First-Order Rule: Example
Prolog (FOPC) Rule for Classifying Web Pages
–[Slattery, 1997]
–Course (A) ← Has-Word (A, "instructor"), not Has-Word (A, "good"), Link-From (A, B), Has-Word (B, "assign"), not Link-From (B, C).
–Train: 31/31, test: 31/34
How Are Such Rules Used?
–Implement search-based (inferential) programs
–References
    Chapters 1-10, Russell and Norvig
    Online resources at

First-Order Inductive Learning (FOIL): Algorithm
Algorithm FOIL (Target-Predicate, Predicates, D)
–Learned-Rules ← {}
–Pos ← D.Filter-By (Target-Predicate)  // examples for which it is true
–Neg ← D.Filter-By (Not (Target-Predicate))  // examples for which it is false
–WHILE NOT Pos.Empty() DO  // learn new rule
    New-Rule ← Learn-One-First-Order-Rule (Target-Predicate, Predicates, D)
    Learned-Rules.Add-Rule (New-Rule)
    Pos.Remove-Covered-By (New-Rule)
–RETURN (Learned-Rules)
Algorithm Learn-One-First-Order-Rule (Target-Predicate, Predicates, D)
–New-Rule ← the rule that predicts Target-Predicate with no preconditions
–New-Rule-Neg ← D.Filter-By (Not (Target-Predicate))
–WHILE NOT New-Rule-Neg.Empty() DO  // specialize New-Rule
    1. Candidate-Literals ← Generate-Candidates()  // all possible new literals, based on Predicates
    2. Best-Literal ← argmax over L ∈ Candidate-Literals of FOIL-Gain (L, New-Rule, Target-Predicate, D)
    3. New-Rule.Add-Precondition (Best-Literal)  // add the best one
    4. New-Rule-Neg ← New-Rule-Neg.Filter-By (New-Rule)  // keep only negatives still covered
–RETURN (New-Rule)

Specializing Rules in FOIL
Learning Rule: $P(x_1, x_2, \ldots, x_k) \leftarrow L_1 \wedge L_2 \wedge \ldots \wedge L_n$.
Candidate Specializations
–Add new literal to get more specific Horn clause
–Form of literal
    $Q(v_1, v_2, \ldots, v_r)$, where at least one of the $v_i$ in the created literal must already exist as a variable in the rule
    $\mathit{Equal}(x_j, x_k)$, where $x_j$ and $x_k$ are variables already present in the rule
    The negation of either of the above forms of literals
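The three forms of candidate literal can be enumerated mechanically. The sketch below makes several assumptions for illustration: predicates are given as a name-to-arity dict, fresh variable names are supplied by the caller, and a literal is encoded as a tuple (is_positive, name, args). The predicate names Father and Female are placeholders.

```python
from itertools import product


def candidate_literals(rule_vars, predicates, new_vars):
    """Enumerate candidate literals for specializing a rule.

    rule_vars:  list of variables already present in the rule
    predicates: dict mapping predicate name -> arity
    new_vars:   fresh variable names the new literal may introduce
    """
    pool = list(rule_vars) + list(new_vars)
    positive = []
    # Form 1: Q(v1, ..., vr) with at least one variable already in the rule.
    for name, arity in predicates.items():
        for args in product(pool, repeat=arity):
            if any(a in rule_vars for a in args):
                positive.append((name, args))
    # Form 2: Equal(xj, xk) over pairs of variables already in the rule.
    for i, a in enumerate(rule_vars):
        for b in rule_vars[i + 1:]:
            positive.append(("Equal", (a, b)))
    # Form 3: the negation of each literal of forms 1 and 2.
    return ([(True, n, a) for n, a in positive]
            + [(False, n, a) for n, a in positive])


literals = candidate_literals(["x", "y"], {"Father": 2, "Female": 1}, ["z"])
print(len(literals))
```

Even this tiny setting (two rule variables, two predicates, one new variable) yields 22 candidates, which hints at why the branching factor of FOIL's search grows quickly with predicate arity and the number of variables.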

Information Gain in FOIL
Function FOIL-Gain (L, R, Target-Predicate, D)
$$\text{FOIL-Gain}(L, R) = t \left( \lg \frac{p_1}{p_1 + n_1} - \lg \frac{p_0}{p_0 + n_0} \right)$$
Where
–L: candidate literal to add to rule R
–$p_0$: number of positive bindings of R
–$n_0$: number of negative bindings of R
–$p_1$: number of positive bindings of R + L
–$n_1$: number of negative bindings of R + L
–t: number of positive bindings of R also covered by R + L
Note
–$-\lg (p_0 / (p_0 + n_0))$ is optimal number of bits to indicate the class of a positive binding covered by R
–Compare: entropy (information gain) measure in ID3
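The gain measure from Quinlan's FOIL, $t \, (\lg \frac{p_1}{p_1 + n_1} - \lg \frac{p_0}{p_0 + n_0})$, is easy to compute once the binding counts are known. The sketch below takes the counts as plain integers; returning 0.0 when the specialized rule has no positive bindings is an assumed convention for the undefined logarithm, and the example counts are invented.

```python
import math


def foil_gain(p0, n0, p1, n1, t):
    """FOIL-Gain for adding a literal L to rule R.

    p0, n0: positive / negative bindings of R
    p1, n1: positive / negative bindings of R + L
    t:      positive bindings of R still covered by R + L
    """
    if p1 == 0:
        return 0.0  # assumed convention: no positives left, no gain
    before = math.log2(p0 / (p0 + n0))
    after = math.log2(p1 / (p1 + n1))
    return t * (after - before)


# R covers 4 positive and 4 negative bindings; adding L keeps 2 positives
# and excludes every negative.
gain = foil_gain(p0=4, n0=4, p1=2, n1=0, t=2)
print(gain)
```

Each of the two surviving positive bindings needed 1 bit to encode before the literal was added and 0 bits afterwards, so the gain is 2 bits. The factor t is what rewards literals that keep many positives rather than merely being pure.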

FOIL: Learning Recursive Rule Sets
Recursive Rules
–So far: ignored possibility of recursive WFFs
    New literals added to rule body could refer to target predicate itself
    i.e., predicate occurs in rule head
–Example
    Ancestor (x, y) ← Parent (x, z) ∧ Ancestor (z, y).
    Rule: IF Parent (x, z) ∧ Ancestor (z, y) THEN Ancestor (x, y)
Learning Recursive Rules from Relations
–Given: appropriate set of training examples
–Can learn using FOIL-based search
    Requirement: Ancestor ∈ Predicates (symbol is member of candidate set)
    Recursive rules still have to outscore competing candidates at FOIL-Gain
–NB: how to ensure termination? (well-founded ordering, i.e., no infinite recursion)
–[Quinlan, 1990; Cameron-Jones and Quinlan, 1993]
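One way to see what a recursive rule set such as Ancestor denotes, and why evaluation over a finite set of facts need not loop forever, is bottom-up (forward-chaining) evaluation to a fixed point. This sketch is an illustration only: it is not how FOIL or Prolog evaluates rules, and the Parent facts are hypothetical.

```python
def ancestors(parent_facts):
    """Least fixed point of the two Ancestor rules.

    Ancestor(x, y) <- Parent(x, y)
    Ancestor(x, y) <- Parent(x, z) and Ancestor(z, y)
    """
    ancestor = set(parent_facts)  # base rule
    changed = True
    while changed:                # repeat until no new facts are derived
        changed = False
        for (x, z) in parent_facts:
            for (z2, y) in list(ancestor):
                if z == z2 and (x, y) not in ancestor:
                    ancestor.add((x, y))
                    changed = True
    return ancestor


parents = {("Shannon", "Tom"), ("Tom", "Bob")}
closure = ancestors(parents)
print(sorted(closure))
```

The loop stops because only finitely many pairs can be formed from the constants in the facts, and each pass either adds a new pair or ends the iteration. Top-down evaluation of the same rules, as in Prolog, has no such guarantee in general, which is the termination concern raised above.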

FOIL: Summary
Extends Sequential-Covering Algorithm
–Handles case of learning first-order rules similar to Horn clauses
–Result: more powerful rules for performance element (automated reasoning)
General-to-Specific Search
–Adds literals (predicates and negations over functions, variables, constants)
–Can learn sets of recursive rules
    Caveat: might learn infinitely recursive rule sets
    Has been shown to successfully induce recursive rules in some cases
Overfitting
–If no noise, might keep adding new literals until rule covers no negative examples
–Solution approach: tradeoff (heuristic evaluation function on rules)
    Accuracy, coverage, complexity
    FOIL-Gain: an MDL function
    Overfitting recovery in FOIL: post-pruning

Terminology
Induction and Deduction
–Induction: finding h such that $(\forall \langle x_i, f(x_i) \rangle \in D)\ (B \wedge h \wedge x_i) \vdash f(x_i)$
–Inductive learning: B is background knowledge (inductive bias, etc.)
–Developing inverse deduction operators
    Deduction: finding entailed logical statements F(A, B) = C where $A \wedge B \vdash C$
    Inverse deduction: finding hypotheses O(B, D) = h where $(\forall \langle x_i, f(x_i) \rangle \in D)\ (B \wedge h \wedge x_i) \vdash f(x_i)$
–Resolution rule: deductive inference rule ($P \vee L,\ \neg L \vee R \vdash P \vee R$)
    Propositional logic: boolean terms, connectives ($\wedge$, $\vee$, $\neg$, $\rightarrow$)
    First-order predicate calculus (FOPC): well-formed formulas (WFFs), aka clauses (defined over literals, connectives, implicit quantifiers)
–Inverse entailment: inverse of resolution operator
Inductive Logic Programming (ILP)
–Cigol: ILP algorithm that uses inverse entailment
–Progol: sequential covering (general-to-specific search) algorithm for ILP

Summary Points
Induction as Inverse of Deduction
–Problem of induction revisited
    Definition of induction
    Inductive learning as specific case
    Role of induction, deduction in automated reasoning
–Operators for automated deductive inference
    Resolution rule (and operator) for deduction
    First-order predicate calculus (FOPC) and resolution theorem proving
–Inverting resolution
    Propositional case
    First-order case (inverse entailment operator)
Inductive Logic Programming (ILP)
–Cigol: inverse entailment (very susceptible to combinatorial explosion)
–Progol: sequential covering, general-to-specific search using inverse entailment
Next Week: Knowledge Discovery in Databases (KDD), Final Review