Extending Bayesian Logic Programs for Plan Recognition and Machine Reading Sindhu V. Raghavan Advisor: Raymond Mooney PhD Proposal May 12 th, 2011 1.

Extending Bayesian Logic Programs for Plan Recognition and Machine Reading Sindhu V. Raghavan Advisor: Raymond Mooney PhD Proposal May 12 th, 2011 1

$ cd my-dir $ cp test1.txt test-dir $ rm test1.txt $ cd my-dir $ cp test1.txt test-dir $ rm test1.txt What task is the user performing? move-file Which files and directories are involved? test1.tx and test-dir Plan Recognition in Intelligent User Interfaces 2 Can the task be performed more efficiently? Data is relational in nature - several files and directories and several relations between them

Characteristics of Real World Data  Relational or structured data  Several entities in the domain  Several relations between entities  Do not always follow the i.i.d assumption  Presence of noise or uncertainty  Uncertainty in the types of entities  Uncertainty in the relations 3 Traditional approaches like first-order logic or probabilistic models can handle either structured data or uncertainty, but not both.

Statistical Relational Learning (SRL)  Integrates first-order logic and probabilistic graphical models [Getoor and Taskar, 2007] –Combines strengths of both first-order logic and probabilistic models  SRL formalisms –Stochastic Logic Programs (SLPs) [Muggleton, 1996] –Probabilistic Relational Models (PRMs) [Friedman et al., 1999] –Bayesian Logic Programs (BLPs) [Kersting and De Raedt, 2001] –Markov Logic Networks (MLNs) [Richardson and Dominogs, 2006] 4

Statistical Relational Learning (SRL)  Integrates first-order logic and probabilistic graphical models [Getoor and Taskar, 2007] –Combines strengths of both first-order logic and probabilistic models  SRL formalisms –Stochastic Logic Programs (SLPs) [Muggleton, 1996] –Probabilistic Relational Models (PRMs) [Friedman et al., 1999] –Bayesian Logic Programs (BLPs) [Kersting and De Raedt, 2001] –Markov Logic Networks (MLNs) [Richardson and Dominogs, 2006] 5

Bayesian Logic Programs (BLPs)  Integrate first-order logic and Bayesian networks  Why BLPs?  Efficient grounding mechanism that includes only those variables that are relevant to the query  Easy to extend by incorporating any type of logical inference to construct networks  Well suited for capturing causal relations in data 6

Objectives 7 BLPs for Plan Recognition BLPs for Machine Reading

Objectives 8 Plan recognition involves predicting the top-level plan of an agent based on its actions BLPs for Machine Reading

Objectives 9 BLPs for Plan Recognition Machine Reading involves automatic extraction of knowledge from natural language text

Outline Motivation  Background  First-order logic  Logical Abduction  Bayesian Logic Programs (BLPs)  Completed Work  Part 1 – Extending BLPs for Plan Recognition  Part 2 – Extending BLPs for Machine Reading  Proposed Work  Conclusions 10

First-order Logic  Terms  Constants – individual entities like anna, bob  Variables – placeholders for objects like X, Y  Predicates  Relations over entities like worksFor, capitalOf  Literal – predicate or its negation applied to terms  Atom – Positive literal like worksFor(X,Y)  Ground literal – literal with no variables like worksFor(anna,bob)  Clause – disjunction of literals  Horn clause has at most one positive literal  Definite clause has exactly one positive literal 11

First-order Logic  Quantifiers  Universal quantification - true for all objects in the domain  Existential quantification - true for some objects in the domain  Logical Inference  Forward Chaining– For every implication p  q, if p is true, then q is concluded to be true  Backward Chaining – For a query literal q, if an implication p  q is present and p is true, then q is concluded to be true, otherwise backward chaining tries to prove p 12

Logical Abduction  Abduction  Process of finding the best explanation for a set of observations  Given  Background knowledge, B, in the form of a set of (Horn) clauses in first-order logic  Observations, O, in the form of atomic facts in first-order logic  Find  A hypothesis, H, a set of assumptions (atomic facts) that logically entail the observations given the theory: B  H  O  Best explanation is the one with the fewest assumptions 13

Bayesian Logic Programs (BLPs) [Kersting and De Raedt, 2001]  Set of Bayesian clauses a|a 1,a 2,....,a n  Definite clauses that are universally quantified  Head of the clause - a  Body of the clause - a 1, a 2, …, a n  Range-restricted, i.e variables{head} variables{body}  Associated conditional probability table (CPT) o P(head|body)  Bayesian predicates a, a 1, a 2, …, a n have finite domains  Combining rule like noisy-or for mapping multiple CPTs into a single CPT. 14

Logical Inference in BLPs SLD Resolution 15 BLP lives(james,yorkshire). lives(stefan,freiburg). neighborhood(james). tornado(yorkshire). burglary(X) | neighborhood(X). alarm(X) | burglary(X). alarm(X) | lives(X,Y), tornado(Y). Query alarm(james) Proof alarm(james) Example from Ngo and Haddawy, 1997

Logical Inference in BLPs SLD Resolution 16 BLP lives(james,yorkshire). lives(stefan,freiburg). neighborhood(james). tornado(yorkshire). burglary(X) | neighborhood(X). alarm(X) | burglary(X). alarm(X) | lives(X,Y), tornado(Y). Query alarm(james) Proof alarm(james) burglary(james) Example from Ngo and Haddawy, 1997

Logical Inference in BLPs SLD Resolution 17 BLP lives(james,yorkshire). lives(stefan,freiburg). neighborhood(james). tornado(yorkshire). burglary(X) | neighborhood(X). alarm(X) | burglary(X). alarm(X) | lives(X,Y), tornado(Y). Query alarm(james) Proof alarm(james) burglary(james) neighborhood(james) Example from Ngo and Haddawy, 1997

Logical Inference in BLPs SLD Resolution 21 BLP lives(james,yorkshire). lives(stefan,freiburg). neighborhood(james). tornado(yorkshire). burglary(X) | neighborhood(X). alarm(X) | burglary(X). alarm(X) | lives(X,Y), tornado(Y). Query alarm(james) Proof alarm(james) burglary(james) neighborhood(james) lives(james,Y) tornado(Y) Example from Ngo and Haddawy, 1997

Logical Inference in BLPs SLD Resolution 22 BLP lives(james,yorkshire). lives(stefan,freiburg). neighborhood(james). tornado(yorkshire). burglary(X) | neighborhood(X). alarm(X) | burglary(X). alarm(X) | lives(X,Y), tornado(Y). Query alarm(james) Proof alarm(james) burglary(james) neighborhood(james) lives(james,Y) tornado(Y) lives(james,yorkshire) Example from Ngo and Haddawy, 1997

Logical Inference in BLPs SLD Resolution 23 BLP lives(james,yorkshire). lives(stefan,freiburg). neighborhood(james). tornado(yorkshire). burglary(X) | neighborhood(X). alarm(X) | burglary(X). alarm(X) | lives(X,Y), tornado(Y). Query alarm(james) Proof alarm(james) burglary(james) neighborhood(james) lives(james,Y) tornado(Y) lives(james,yorkshire) tornado(yorkshire) Example from Ngo and Haddawy, 1997

Logical Inference in BLPs SLD Resolution 24 BLP lives(james,yorkshire). lives(stefan,freiburg). neighborhood(james). tornado(yorkshire). burglary(X) | neighborhood(X). alarm(X) | burglary(X). alarm(X) | lives(X,Y), tornado(Y). Query alarm(james) Proof alarm(james) burglary(james) neighborhood(james) lives(james,Y) tornado(Y) lives(james,yorkshire) tornado(yorkshire) Example from Ngo and Haddawy, 1997

Bayesian Network Construction 25 alarm(james) burglary(james) neighborhood(james) lives(james,yorkshire)tornado(yorkshire) Each ground atom becomes a node (random variable) in the Bayesian network Edges are added from ground atoms in the body of a clause to the ground atom in the head Specify probabilistic parameters using the CPTs associated with Bayesian clauses

Bayesian Network Construction 26 alarm(james) burglary(james) neighborhood(james) lives(james,yorkshire)tornado(yorkshire) Each ground atom becomes a node (random variable) in the Bayesian network Edges are added from ground atoms in the body of a clause to the ground atom in the head Specify probabilistic parameters using the CPTs associated with Bayesian clauses

Bayesian Network Construction 27 alarm(james) burglary(james) neighborhood(james) lives(james,yorkshire) tornado(yorkshire) Each ground atom becomes a node (random variable) in the Bayesian network Edges are added from ground atoms in the body of a clause to the ground atom in the head Specify probabilistic parameters using the CPTs associated with Bayesian clauses

Bayesian Network Construction 28 alarm(james) burglary(james) neighborhood(james) lives(james,yorkshire)tornado(yorkshire) Each ground atom becomes a node (random variable) in the Bayesian network Edges are added from ground atoms in the body of a clause to the ground atom in the head Specify probabilistic parameters using the CPTs associated with Bayesian clauses Use combining rule to combine multiple CPTs into a single CPT lives(stefan,freiburg) ✖

Probabilistic Inference and Learning  Probabilistic inference  Marginal probability given evidence  Most Probable Explanation (MPE) given evidence  Learning [Kersting and De Raedt, 2008]  Parameters o Expectation Maximization o Gradient-ascent based learning  Structure o Hill climbing search through the space of possible structures 29

Part 1 Extending BLPs for Plan Recognition [Raghavan and Mooney, 2010] 30

Plan Recognition  Predict an agent’s top-level plans based on the observed actions  Abductive reasoning involving inference of cause from effect  Applications  Story Understanding o Recognize character’s motives or plans based on its actions to answer questions about the story  Strategic planning o Predict other agents’ plans so as to work co-operatively  Intelligent User Interfaces o Predict the task that the user is performing so as to provide valuable tips to perform the task more efficiently 31

Related Work  First-order logic based approaches [Kautz and Allen, 1986; Ng and Mooney, 1992]  Knowledge base of plans and actions  Default reasoning or logical abduction to predict the best plan based on the observed actions  Unable to handle uncertainty in data or estimate likelihood of alternative plans  Probabilistic graphical models [Charniak and Goldman, 1989; Huber et al., 1994; Pynadath and Wellman, 2000; Bui, 2003; Blaylock and Allen, 2005]  Encode the domain knowledge using Bayesian networks, abstract hidden Markov models, or statistical n-gram models  Unable to handle relational/structured data 32

Related Work  Markov Logic based approaches [Kate and Mooney, 2009; Singla and Mooney, 2011]  Logical inference in MLNs is deductive in nature  MLN-PC [Kate and Mooney, 2009] o Add reverse implications to handle abductive reasoning o Does not scale to large domains  MLN-HC [Singla and Mooney, 2011] o Improves over MLN-PC by adding hidden causes o Still does not scale to large domains  MLN-HCAM [Singla and Mooney, 2011] uses logical abduction to constrain grounding o Incorporates ideas from the BLP approach for plan recognition [Raghavan and Mooney, 2010] o MLN-HCAM performs better than MLN-PC and MLN-HC 33

BLPs for Plan Recognition  Why BLPs ?  Directed model captures causal relationships well  Efficient grounding process results in smaller networks, unlike in MLNs  SLD resolution is deductive inference, used for predicting observed actions from top-level plans  Plan recognition is abductive in nature and involves predicting the top-level plan from observed actions 34 BLPs cannot be used as is for plan recognition

Extending BLPs for Plan Recognition 35 BLPs Logical Abduction BALPs BALPs – Bayesian Abductive Logic Programs

Logical Abduction in BALPs  Given  A set of observation literals O = {O 1, O 2,….O n } and a knowledge base KB  Compute a set abductive proofs of O using Stickel’s abduction algorithm [Stickel 1988]  Backchain on each O i until it is proved or assumed  A literal is said to be proved if it unifies with a fact or the head of some rule in KB, otherwise it is said to be assumed  Construct a Bayesian network using the resulting set of proofs as in BLPs. 36

Example – Intelligent User Interfaces  Top-level plans predicates  copy-file, move-file, remove-file  Actions predicates  cp, rm  Knowledge Base (KB)  cp(filename,destdir) | copy-file(filename,destdir)  cp(filename,destdir) | move-file(filename,destdir)  rm(filename) | move-file(filename,destdir)  rm(filename) | remove-file(filename)  Observed actions  cp(Test1.txt, Mydir)  rm(Test1.txt) 37

Abductive Inference 38 copy-file(Test1.txt,Mydir) cp(Test1.txt,Mydir) cp(filename,destdir) | copy-file(filename,destdir) Assumed literal

Abductive Inference 39 copy-file(Test1.txt,Mydir) cp(Test1.txt,Mydir) move-file(Test1.txt,Mydir) cp(filename,destdir) | move-file(filename,destdir) Assumed literal

Abductive Inference 40 copy-file(Test1.txt,Mydir) cp(Test1.txt,Mydir) move-file(Test1.txt,Mydir) rm(filename) | move-file(filename,destdir) rm(Test1.txt) Match existing assumption

Abductive Inference 41 copy-file(Test1.txt,Mydir) cp(Test1.txt,Mydir) move-file(Test1.txt,Mydir) rm(filename) | remove-file(filename) rm(Test1.txt) remove-file(Test1) Assumed literal

Structure of Bayesian network 42 copy-file(Test1.txt,Mydir) cp(Test1.txt,Mydir) move-file(Test1.txt,Mydir) rm(Test1.txt) remove-file(Test1)

Probabilistic Inference  Specifying probabilistic parameters  Noisy-and o Specify the CPT for combining the evidence from conjuncts in the body of the clause  Noisy-or o Specify the CPT for combining the evidence from disjunctive contributions from different ground clauses with the same head o Models “explaining away”  Noisy-and and noisy-or models reduce the number of parameters learned from data 43

Probabilistic Inference 44 copy-file(Test1,txt,Mydir) cp(Test1.txt,Mydir) move-file(Test1,txt,Mydir) rm(Test1.txt) remove-file(Test1) 4 parameters

Probabilistic Inference 45 copy-file(Test1,txt,Mydir) cp(Test1.txt,Mydir) move-file(Test1,txt,Mydir) rm(Test1.txt) remove-file(Test1) 2 parameters θ1θ1 θ2θ2 θ3θ3 θ4θ4 Noisy models require parameters linear in the number of parents

Probabilistic Inference  Most Probable Explanation (MPE)  For multiple plans, compute MPE, the most likely combination of truth values to all unknown literals given this evidence  Marginal Probability  For single top level plan prediction, compute marginal probability for all instances of plan predicate and pick the instance with maximum probability  When exact inference is intractable, SampleSearch [Gogate and Dechter, 2007], an approximate inference algorithm for graphical models with deterministic constraints is used 46

Probabilistic Inference 47 copy-file(Test1,txt,Mydir) cp(Test1.txt,Mydir) move-file(Test1,txt,Mydir) rm(Test1.txt) remove-file(Test1) Noisy-or

Probabilistic Inference 48 copy-file(Test1,txt,Mydir) cp(Test1.txt,Mydir) move-file(Test1,txt,Mydir) rm(Test1.txt) remove-file(Test1) Noisy-or Evidence

Probabilistic Inference 49 copy-file(Test1,txt,Mydir) cp(Test1.txt,Mydir) move-file(Test1,txt,Mydir) rm(Test1.txt) remove-file(Test1) Noisy-or Evidence Query variables

Probabilistic Inference 50 copy-file(Test1,txt,Mydir) cp(Test1.txt,Mydir) move-file(Test1,txt,Mydir) rm(Test1.txt) remove-file(Test1) Noisy-or Evidence Query variables TRUE FALSE MPE

Probabilistic Inference 51 copy-file(Test1,txt,Mydir) cp(Test1.txt,Mydir) move-file(Test1,txt,Mydir) rm(Test1.txt) remove-file(Test1) Noisy-or Evidence Query variables TRUE FALSE MPE

Parameter Learning  Learn noisy-or/noisy-and parameters using the EM algorithm adapted for BLPs [Kersting and De Raedt, 2008]  Partial observability  In plan recognition domain, data is partially observable  Evidence is present only for observed actions and top- level plans; sub-goals, noisy-or, and noisy-and nodes are not observed  Simplify learning problem  Learn noisy-or parameters only  Used logical-and instead of noisy-and to combine evidence from conjuncts in the body of a clause 52

Experimental Evaluation  Monroe (Strategic planning)  Linux (Intelligent user interfaces)  Story Understanding (Story understanding) 53

Monroe and Linux [Blaylock and Allen, 2005]  Task  Monroe involves recognizing top level plans in an emergency response domain (artificially generated using HTN planner)  Linux involves recognizing top-level plans based on linux commands  Single correct plan in each example  Data 54 No. examples Avg. observations / example Total top-level plan predicates Total observed action predicates Monroe100010.1910 30 Linux4576.119 43

Monroe and Linux  Methodology  Manually encoded the knowledge base  Learned noisy-or parameters using EM  Computed marginal probability for plan instances  Systems compared  BALPs  MLN-HCAM [Singla and Mooney, 2011] o MLN-PC and MLN-HC do not run on Monroe and Linux due to scaling issues  Blaylock and Allen’s system [Blaylock and Allen, 2005]  Performance metric  Convergence score - measures the fraction of examples for which the plan schema was predicted correctly 55

Results on Monroe 56 94.2 * Convergence Score BALPs MLN-HCAMBlaylock & Allen * - Differences are statistically significant wrt BALPs

Results on Linux 57 Convergence Score BALPs MLN-HCAMBlaylock & Allen 36.1 * * - Differences are statistically significant wrt BALPs

Story Understanding [Charniak and Goldman, 1991; Ng and Mooney, 1992]  Task  Recognize character’s top level plans based on actions described in narrative text  Multiple top-level plans in each example  Data  25 examples in development set and 25 examples in test set  12.6 observations per example  8 top-level plan predicates 58

Story Understanding  Methodology  Knowledge base was created for ACCEL [Ng and Mooney, 1992]  Parameters set manually o Insufficient number of examples in the development set to learn parameters  Computed MPE to get the best set of plans  Systems compared  BALPs  MLN-HCAM [Singla and Mooney, 2011] o Best performing MLN model  ACCEL-Simplicity [Ng and Mooney, 1992]  ACCEL-Coherence [Ng and Mooney, 1992] o Specific for Story Understanding 59

Results on Story Understanding 60 * - Differences are statistically significant wrt BALPs

Results on Story Understanding 61 * - Differences are statistically significant wrt BALPs * *

Summary of BALPs for Plan Recognition  Extend BLPs for plan recognition by employing logical abduction to construct Bayesian networks  Automatic learning of model parameters using EM  Empirical results on all benchmark datasets demonstrate advantages over existing methods 62

Part 2 Extending BLPs for Machine Reading 63

Machine Reading  Machine reading involves automatic extraction of knowledge from natural language text  Information extraction (IE) systems extract factual information like entities and relations between entities that occur in text [Cohen, 1999; Craven et al., 2000; Bunescu and Mooney, 2007; Etzioni et al, 2008]  Extracted information can be used for answering queries automatically 64

Limitations of IE Systems  Extract information that is stated explicitly in text  Natural language text is not necessarily complete  Commonsense information is not explicitly stated in text  Well known facts are omitted from the text  Missing information cannot be inferred from text  Human brain performs deeper inference using commonsense knowledge  IE systems have no access to commonsense knowledge  Errors in extraction process result in some facts not being extracted 65

Our Objective  Improve performance of IE  Learn general knowledge or commonsense information in the form of rules using the facts extracted by the IE system  Infer additional facts that are not stated explicitly in text using the learned rules 66

Example 67 “Malaysian Prime Minister Mahathir Mohamad Wednesday announced for the first time that he has appointed his deputy Abdullah Ahmad Badawi as his successor.’’

Example 68 “Malaysian Prime Minister Mahathir Mohamad Wednesday announced for the first time that he has appointed his deputy Abdullah Ahmad Badawi as his successor." Extracted facts nationState(malaysian) Person(mahathir-mohamad) isLedBy(malaysian,mahathir-mohamad) Extracted facts nationState(malaysian) Person(mahathir-mohamad) isLedBy(malaysian,mahathir-mohamad)

Example 69 “Malaysian Prime Minister Mahathir Mohamad Wednesday announced for the first time that he has appointed his deputy Abdullah Ahmad Badawi as his successor." Query isLedBy(X,Y) Query isLedBy(X,Y) Extracted facts nationState(malaysian) Person(mahathir-mohamad) isLedBy(malaysian,mahathir-mohamad) Extracted facts nationState(malaysian) Person(mahathir-mohamad) isLedBy(malaysian,mahathir-mohamad)

Example 70 “Malaysian Prime Minister Mahathir Mohamad Wednesday announced for the first time that he has appointed his deputy Abdullah Ahmad Badawi as his successor." Extracted facts nationState(malaysian) Person(mahathir-mohamad) isLedBy(malaysian,mahatir-mohamad) Extracted facts nationState(malaysian) Person(mahathir-mohamad) isLedBy(malaysian,mahatir-mohamad) Query isLedBy(X,Y) Query isLedBy(X,Y) X = malaysian Y = mahathir-mohamad X = malaysian Y = mahathir-mohamad

Example 71 “Malaysian Prime Minister Mahathir Mohamad Wednesday announced for the first time that he has appointed his deputy Abdullah Ahmad Badawi as his successor." Query citizenOf(mahathir-mohamad,Y) Query citizenOf(mahathir-mohamad,Y) Extracted facts nationState(malaysian) Person(mahathir-mohamad) isLedBy(malaysian,mahathir-mohamad) Extracted facts nationState(malaysian) Person(mahathir-mohamad) isLedBy(malaysian,mahathir-mohamad)

Example 72 “Malaysian Prime Minister Mahathir Mohamad Wednesday announced for the first time that he has appointed his deputy Abdullah Ahmad Badawi as his successor." Extracted facts nationState(malaysian) Person(mahathir-mohamad) isLedBy(malaysian,mahatir-mohamad) Extracted facts nationState(malaysian) Person(mahathir-mohamad) isLedBy(malaysian,mahatir-mohamad) Query citizenOf(mahathir-mohamad,Y) Query citizenOf(mahathir-mohamad,Y) Y = ?

Related Work  Learning propositional rules [Nahm and Mooney, 2000]  Learn propositional rules from the output of an IE system on computer-related job postings  Perform deductive inference to infer new facts  Learning first-order rules [Carlson at el., 2010, Schoenmackers et al., 2010; Doppa et al., 2010]  Carlson et al. and Doppa et al. modify existing rule learners like FOIL and FARMER to learn probabilistic rules o Perform deductive inference to infer additional facts  Schoenmackers et al. develop a new first-order rule learner based on statistical relevance o Use MLN based approach to infer additional facts 73

Our Approach  Use an off-the shelf IE system to extract facts  Learn commonsense knowledge from the extracted facts in the form of first-order-rules  Infer additional facts based on the learned rules using BLPs  Pure logical deduction results in inferring a large number of facts  Inference using BLPs is probabilistic in nature, i.e inferred facts are assigned probabilities. Probabilities can be used to filter out facts with high confidence from the rest  Efficient grounding mechanism in BLPs enables our approach to scale to natural language text 74

Extending BLPs for Machine Reading  SLD resolution does not result in inference of new facts  Given a query, SLD resolution will generated deductive proofs that prove the query  Employ forward chaining to infer additional facts  Forward chain on extracted facts to infer new facts  Use the deductive proofs from forward chaining to construct Bayesian network for probabilistic inference 75

Learning First-order Rules 76 “ Barack Obama is the current President of USA……. Obama was born on August 4, 1961, in Hawaii, USA……. ’’ Extracted facts nationState(USA) Person(BarackObama) isLedBy(USA,BarackObama) hasBirthPlace(BarackObama,USA) citizenOf(BarackObama,USA) Extracted facts nationState(USA) Person(BarackObama) isLedBy(USA,BarackObama) hasBirthPlace(BarackObama,USA) citizenOf(BarackObama,USA)

Learning Patterns from Extracted Facts nationState(USA) ∧ isLedBy(USA,BarackObama)  citizenOf(BarackObama,USA) nationState(USA) ∧ isLedBy(USA,BarackObama)  hasBirthPlace(BarackObama,USA) hasBirthPlace(BarackObama,USA)  citizenOf(BarackObama,USA). 77

Generalizing Patterns to Rules 78 nationState(Y) ∧ isLedBy(Y,X)  citizenOf(X,Y) nationState(Y) ∧ isLedBy(Y,X)  hasBirthPlace(X,Y) hasBirthPlace(X,Y)  citizenOf(X,Y) nationState(Y) ∧ isLedBy(Y,X)  citizenOf(X,Y) nationState(Y) ∧ isLedBy(Y,X)  hasBirthPlace(X,Y) hasBirthPlace(X,Y)  citizenOf(X,Y)

Inductive Logic Programming (ILP) [Muggleton, 1992] 79 ILP Rule Learner ILP Rule Learner Target relation citizenOf(X,Y) Positive instances citizenOf(BarackObama, USA) citizenOf(GeorgeBush, USA) citizenOf(IndiraGandhi,I ndia). Negative instances citizenOf(BarackObama, India) citizenOf(GeorgeBush, India) citizenOf(IndiraGandhi,U SA). KB hasBirthPlace(BarackObama,USA) person(BarackObama) nationState(USA) nationState(India). Rules nationState(Y) ∧ isLedBy(Y,X)  citizenOf(X,Y). Rules nationState(Y) ∧ isLedBy(Y,X)  citizenOf(X,Y).

Example 80 Extracted facts nationState(malaysian) Person(mahathir-mohamad) isLedBy(malaysian,mahathir-mohamad) Extracted facts nationState(malaysian) Person(mahathir-mohamad) isLedBy(malaysian,mahathir-mohamad) Learned rule nationState(B) ∧ isLedBy(B,A)  citizenOf(A,B) Learned rule nationState(B) ∧ isLedBy(B,A)  citizenOf(A,B) “Malaysian Prime Minister Mahathir Mohamad Wednesday announced for the first time that he has appointed his deputy Abdullah Ahmad Badawi as his successor.’’

Inference of Additional Facts 81 nationState(B) ∧ isLedBy(B,A)  citizenOf(A,B) nationState(malaysian ) isLedBy(malaysian,mahathir- mohamad) citizenOf(mahathir-mohamad, malaysian)

Experimental Evaluation  Data  DARPA’s intelligent community (IC) data set  Consists of news articles on politics, terrorism, and others international events  2000 documents, split into training and test in the ratio 9:1  Approximately 30,000 sentences  Information Extraction  Information extraction using SIRE, IBM’s information extraction system [Florin et al., 2004] 82

Experimental Evaluation  Learning first-order rules using LIME [McCreath and Sharma, 1998]  Learns rules from only positive instances  Handles noise in data  Models  BLP-Pos-Only o Learn rules using only positive instances  BLP-Pos-Neg o Learn rules using both positive and negative instances o Apply closed world assumption to automatically generate negative instances  BLP-All o Includes rules learned from BLP-Pos-Only and BLP-Pos- Neg 83

Experimental Evaluation  Specifying probabilistic parameters  Manually specified parameters  Logical-and model to combine evidence from conjuncts in the body of the clause  Set noisy-or parameters to 0.9  Set priors to maximum likelihood estimates 84

Experimental Evaluation  Methodology  Learned rules for 13 relations including hasCitizenship, hasMember, attendedSchool  Baseline o Perform pure logical deduction to infer all possible additional facts  BLP-0.9 o Facts inferred using the BLP approach with marginal probability greater or equal to 0.9  BLP-0.95 o Facts inferred using the BLP approach with marginal probability greater or equal to 0.95 85

Performance Evaluation  Precision  Evaluate if the inferred facts are correct, i.e if they can be inferred from the document o No ground truth information available o Perform manual evaluation o Sample 20 documents randomly from the test set and evaluate inferred facts manually for all models 86

Precision (preliminary results) Model BLP-ALLBLP-Pos-NegBLP-Pos-Only Baseline – Purely logical approach [No. facts inferred] 3813751849 Precision 24.9416.0031.80 87

Precision (preliminary results) Model BLP-ALLBLP-Pos-NegBLP-Pos-Only Baseline – Purely logical approach [No. facts inferred] 3813751849 Precision 24.9416.0031.80 BLP-0.9 [No. facts inferred] 1208661158 Precision 34.6813.6335.49 88

Precision (preliminary results) Model BLP-ALLBLP-Pos-NegBLP-Pos-Only Baseline – Purely logical approach [No. facts inferred] 3813751849 Precision 24.9416.0031.80 BLP-0.9 [No. facts inferred] 1208661158 Precision 34.6813.6335.49 BLP-0.95 [No. of facts inferred] 5610 Precision 91.07100Nil 89

Performance Evaluation  Estimated recall  For each target relation, eliminate instances of it from the extracted facts in the test set  Try to infer the eliminated instances correctly based on the remaining facts using the BLP approach o For each threshold point (0.1 to 1), compute fraction of eliminated instances that were inferred correctly o True recall cannot be calculated because of lack of ground truth 90

Estimated Recall (preliminary results) 91 Estimated recall Confidence threshold

First-order Rules from LIME  politicalParty(A) ^ employs(A,B)  hasMemberPerson(A,B)  building(B) ^ eventLocation(A,B) ^ bombing(A)  thingPhysicallyDamaged(A,B)  employs(A,B)  hasMember(A,B)  citizenOf(A,B)  hasCitizenship(A,B)  nationState(B) ^ person(A) ^ employs(B,A)  hasBirthPlace(A,B) 95

Proposed Work  Learning parameters automatically from data  EM or gradient-ascent based algorithm adapted for BLPs by Kersting and De Raedt [2008]  Evaluation  Human evaluation for inferred facts using Amazon Mechanical Turk [Callison-Burch and Dredze, 2010]  Use inferred facts to answer queries constructed for evaluation in the DARPA sponsored machine reading project  Comparison to existing approaches like MLNs  Develop a structure learner for incomplete and uncertain data 96

Future Extensions  Multiple predicate learning  Learn rules for multiple relations at the same time  Discriminative parameter learning for BLPs  EM and gradient-ascent based algorithms for learning BLP parameters optimize likelihood of the data  No algorithm that learns parameters discriminatively for BLPs has been developed to date  Comparison of BALPs to other probabilistic logics for plan recognition  Comparison to PRISM [Sato, 1995], Poole’s Horn Abduction [Poole, 1993], Abductive Stochastic Logic Programs [Tammadoni- Nezhad et al., 2006] 97

Conclusions  Extended BLPs to abductive plan recognition  Enhance BLPs with logical abduction, resulting in Bayesian Abductive Logic Programs (BALPs)  BALPs outperform state-of-the-art approaches to plan recognition on several benchmark data sets  Extended BLPs to machine reading  Preliminary results on the IC data set are encouraging  Perform an extensive evaluation in immediate future  Develop a first-order rule learner that learns from incomplete and uncertain data in future 98

Questions 99

Backup 100

Timeline  Additional experiments comparing MLN-HCAM and BALPs for plan recognition [Target completion – August 2011]  Use Sample Search to perform inference in MLN-HCAM  Evaluation of BLPs for Machine Reading [Target completion – November 2011]  Learn parameters automatically  Evaluation using Amazon Mechanical Turk  Evaluation of BLPs for answering queries  Comparison of BLPs with MLNs for machine reading  Developing structure learner from positive and uncertain examples [Target completion – May 2012]  Proposed defense and graduation [August 2012] 101

Completeness in First-order Logic  Completeness - If a sentence is entailed by a KB, then it is possible to find the proof that entails it  Entailment in first-order logic is semidecidable, i.e it is not possible to know if a sentence is entailed by a KB or not  Resolution is complete in first-order logic  If a set of sentences is unsatisfiable, then it is possible to find a contradiction 102

Forward chaining  For every implication p  q, if p is true, then q is concluded to be true  Results in addition of a new fact to KB  Efficient, but incomplete  Inference can explode and forward chaining may never terminate  Addition of new facts might result in rules being satisfied  It is data-driven, not goal-driven  Might result in irrelevant conclusions 103

Backward chaining  For a query literal q, if an implication p  q is present and p is true, then q is concluded to be true, otherwise backward chaining tries to prove p  Efficient, but not complete  May never terminate, might get stuck in infinite loop  Exponential  Goal-driven 104

Herbrand Model Semantics  Herbrand universe  All constants in the domain  Herbrand base  All ground atoms atoms over Herbrand universe  Herbrand interpretation  A set of ground atoms from Herbrand base that are true  Herbrand model  Herbrand interpretation that satisfies all clauses in the knowledge base 105

Advantages of SRL models over vanilla probabilistic models  Compactly represent domain knowledge in first- order logic  Employ logical inference to construct ground networks  Enables parameter sharing 106

Parameter sharing in SRL 107 father(john) parent(john) father(mary) parent(mary) father(alice) parent(alice) dummy θ1θ1 θ1θ1 θ2θ2 θ2θ2 θ3θ3 θ3θ3

Parameter sharing in SRL father(X)  parent(X) 108 father(john) parent(john) father(mary) parent(mary) father(alice) parent(alice) dummy θ θ θ θ θ θ θ θ

Noisy-and Model  Several causes c i have to occur simultaneously if event e has to occur  c i fails to trigger e with probability p i  inh accounts for some unknown cause due to which e has failed to trigger P(e) = (I – inh) Π i (1-p i )^(1-c i ) 109

Noisy-or Model  Several causes c i cause event e has to occur  c i independently triggers e with probability p i  leak accounts for some unknown cause due to which e has triggered P(e) = 1 – [(I – inh) Π i (1-p i )^(1-c i )]  Models explaining away  If there are several causes of an event, and if there is evidence for one of the causes, then the probability that the other causes have caused the event goes down 110

Noisy-and And Noisy-or Models 111 alarm(james) burglary(james) neighborhood(james) lives(james,yorkshire)tornado(yorkshire)

Noisy-and And Noisy-or Models 112 alarm(james) burglary(james) neighborhood(james) lives(james,yorkshire)tornado(yorkshire) dummy1 dummy2 Noisy/logical-and Noisy-or

Inference in BLPs [Kersting and De Raedt, 2001]  Logical inference  Given a BLP and a query, SLD resolution is used to construct proofs for the query  Bayesian network construction  Each ground atom is a random variable  Edges are added from ground atoms in the body to the ground atom in head  CPTs specified by the conditional probability distribution for the corresponding clause  P(X) = P(X i | Pa(X i ))  Probabilistic inference  Marginal probability given evidence  Most Probable Explanation (MPE) given evidence 113

Learning in BLPs [Kersting and De Raedt, 2008]  Parameter learning  Expectation Maximization  Gradient-ascent based learning  Both approaches optimize likelihood  Structure learning  Hill climbing search through the space of possible structures  Initial structure obtained from CLAUDIEN [De Raedt and Dehaspe, 1997]  Learns from only positive examples 114

Expectation Maximization for BLPs/BALPs Perform logical inference to construct a ground Bayesian network for each example Let r denote rule, X denote a node, and Pa(X) denote parents of X E Step The inner sum is over all groundings of rule r M Step 115 * * * From SRL tutorial at ECML 07 115

Decomposable Combining Rules  Express different influences using separate nodes  These nodes can be combined using a deterministic function 116

Combining Rules 117 alarm(james) burglary(james) neighborhood(james) lives(james,yorkshire)tornado(yorkshire) dummy1 dummy2 Logical-and Noisy-or

Decomposable Combining Rules 118 alarm(james) burglary(james) neighborhood(james) lives(james,yorkshire)tornado(yorkshire) dummy1 dummy2 Logical-and Noisy-or dummy1-newdummy2 -new Logical-or

BLPs vs. PLPs  Differences in representation  In BLPs, Bayesian atoms take finite set of values, but in PLPs, each atom is logical in nature and it can take true or false  Instead of having neighborhood(x) = bad, in PLPs, we have neighborhood(x,bad)  To compute probability of a query alarm(james), PLPs have to construct one proof tree for all possible values for all predicates  Inference is cumbersome  BLPs subsume PLPs 119

BLPs vs. Poole's Horn Abduction  Differences in representation  For example, if P(x) and R(x) are two competing hypothesis, then either P(x) could be true or R(x) could be true  Prior probabilities of P(x) and R(x) should sum to 1  Restrictions of these kind are are not these in BLPs  PLPs and hence BLPs are more flexible and have a richer representation 120

BLPs vs. PRMs  BLPs subsume PRMs  PRMs use entity-relationship models to represent knowledge and they use KBMC-like construction to construct a ground Bayesian network  Each attribute becomes a random variable in the ground network and relations over the entities are logical constraints  In BLP, each attribute becomes a Bayesian atom and relations become logical atoms  Aggregator functions can be transformed into combining rules 121

BLPs vs. RBNs  BLPs subsume RBNs  In RBNs, each node in BN is a predicate and probability formulae are used to specify probabilities  Combining rules can be used to represent these probability formulae in BLPs. 122

BALPs vs. BLPs  Like BLPs, BALPs use logic programs as templates for constructing Bayesian networks  Unlike BLPs, BALPs uses logical abduction instead of deduction to construct the network 123

Monroe [Blaylock and Allen, 2005]  Task  Recognize top level plans in an emergency response domain (artificially generated using HTN planner)  Plans include set-up-shelter, clear-road-wreck, provide- medical-attention  Single correct plan in each example  Domain consists of several entities and sub-goals  Test the ability to scale to large domains  Data  Contains 1000 examples 124

Monroe  Methodology  Knowledge base constructed based on the domain knowledge encoded in planner  Learn noisy-or parameters using EM  Compute marginal probability for instances of top level plans and pick the one with the highest marginal probability  Systems compared o BALPs o MLN-HCAM [Singla and Mooney, 2011] o Blaylock and Allen’s system [Blaylock and Allen, 2005]  Convergence score - measures the fraction of examples for which the plan schema was predicted correctly 125

Learning Results - Monroe 126 MWMW-StartRand-Start Conv Score98.4 Acc-10079.16 79.86 Acc-7546.0644.6344.73 Acc-5020.6720.2619.7 Acc-257.27.3310.46

Linux [Blaylock and Allen, 2005]  Task  Recognize top level plans based on Linux commands  Human users asked to perform tasks in Linux and commands were recorded  Top-level plans include find-file-by-ext, remove-file-by-ext, copy-file, move-file  Single correct plan in each example  Tests the ability to handle noise in data o Users indicate success even when they have not achieved the task correctly o Some top-level plans like find-file-by-ext and file-file-by- name have identical actions  Data  Contains 457 examples 127

Linux  Methodology  Knowledge base constructed based on the knowledge of Linux commands  Learn noisy-or parameters using EM  Compute marginal probability for instances of top level plans and pick the one with the highest marginal probability  Systems compared o BALPs o MLN-HCAM [Singla and Mooney, 2011] o Blaylock and Allen’s system [ Blaylock and Allen, 2005]  Convergence score - measures the fraction of examples for which the plan schema was predicted correctly o 128

Learning Results - Linux 129 Accuracy Partial Observability MWMW-StartRand-Start Conv Score39.8246.641.57

Story Understanding [Charniak and Goldman, 1991; Ng and Mooney, 1992]  Task  Recognize character’s top level plans based on actions described in narrative text  Logical representation of actions literals provided  Top-level plans include shopping, robbing, restaurant dining, partying  Multiple top-level plans in each example  Tests the ability to predict multiple plans  Data  25 development examples  25 test examples 130

Story Understanding  Methodology  Knowledge base constructed for ACCEL by Ng and Mooney [1992]  Insufficient number of examples to learn parameters o Noisy-or parameters set to 0.9 o Noisy-and parameters set to 0.9 o Priors tuned on development set  Compute MPE to get the best set of plans  Systems compared o BALPs o MLN-HCAM [Singla and Mooney, 2011] o ACCEL-Simplicity [Ng and Mooney, 1992 ] o ACCEL-Coherence [Ng and Mooney, 1992] –Specific for Story Understanding 131

Other Applications of BALPs  Medical diagnosis  Textual entailment  Computational biology  Inferring gene relations based on the output of micro-array experiments  Any application that requires abductive reasoning 132

ACCEL [Ng and Mooney, 1992]  First-order logic based system for plan recognition  Simplicity metric selects explanations that have the least number of assumptions  Coherence metric selects explanations that connect maximum number of observations  Measures explanatory coherence  Specific to text interpretation 133

System by Blaylock and Allen [2005]  Statistical n-gram models to predict plans based on observed actions  Performs plan recognition in two phases  Predicts the plan schema first  Predicts arguments based on the predicted schema 134

Machine Reading  Machine reading involves automatic extraction of knowledge from natural language text  Approaches to machine reading  Extract factual information like entities and relations between entities that occur in text [Cohen, 1999; Craven et al., 2000; Bunescu and Mooney, 2007; Etzioni et al, 2008]  Extract commonsense knowledge about the domain [Nahm and Mooney, 2000; Carlson at el., 2010, Schoenmackers et al., 2010; Doppa et al., 2010] 135

Natural Language Text  Natural language text is not necessarily complete  Commonsense information is not explicitly stated in text  Well known facts are omitted from the text  Missing information has to be inferred from text  Human brain performs deeper inference using commonsense knowledge  IE systems have no access to commonsense knowledge 136 IE systems are limited to extracting information stated explicitly in text

IC ontology  57 entities  79 relations 137

Estimated Precision (preliminary results) Model BLP-ALLBLP-Pos-NegBLP-Pos-Only Baseline [No. facts inferred] 38,1511,51018,501 Estimated precision 24.94 (951/3813)* 16.00 (12/75)* 31.80 (588/1849)* 138 Total facts extracted – 12,254 * - x out of y inferred facts were correct. These inferred facts are from the 20 randomly sampled documents from test set for manual evaluation

Estimated Precision (preliminary results) Model BLP-ALLBLP-Pos-NegBLP-Pos-Only Baseline [No. facts inferred] 38,1511,51018,501 Estimated precision 24.94 (951/3813)* 16.00 (12/75)* 31.80 (588/1849)* BLP-0.9 [No. facts inferred] 12,82388011,717 Estimated precision 34.68 (419/1208)* 13.63 (9/66)* 35.49 (411/1158)* 139 Total facts extracted – 12,254 * - x out of y inferred facts were correct. These inferred facts are from the 20 randomly sampled documents from test set for manual evaluation

Estimated Precision (preliminary results) Model BLP-ALLBLP-Pos-NegBLP-Pos-Only Baseline [No. facts inferred] 38,1511,51018,501 Estimated precision 24.94 (951/3813)* 16.00 (12/75)* 31.80 (588/1849)* BLP-0.9 [No. facts inferred] 12,82388011,717 Estimated precision 34.68 (419/1208)* 13.63 (9/66)* 35.49 (411/1158)* BLP-0.95 [No. of facts inferred] 82611949 Estimated precision 91.07 (51/56)* 100 (1/1)* Nil (0/0)* 140 Total facts extracted – 12,254 * - x out of y inferred facts were correct. These inferred facts are from the 20 randomly sampled documents from test set for manual evaluation

Estimated Recall 141 “Barack Obama is the current President of USA……. Obama was born on August 4, 1961, in Hawaii, USA……." Extracted facts nationState(USA) hasBirthPlace(BarackObama,USA) Person(BarackObama) citizenOf(BarackObama,USA) isLedBy(USA,BarackObama) Extracted facts nationState(USA) hasBirthPlace(BarackObama,USA) Person(BarackObama) citizenOf(BarackObama,USA) isLedBy(USA,BarackObama) Rule hasBirthPlace(A,B)  citizenOf(A,B) Rule hasBirthPlace(A,B)  citizenOf(A,B)

Estimated Recall for citizenOf 142 Extracted facts nationState(USA) hasBirthPlace(BarackObama,USA) Person(BarackObama) citizenOf(BarackObama,USA) isLedBy(USA,BarackObama) Extracted facts nationState(USA) hasBirthPlace(BarackObama,USA) Person(BarackObama) citizenOf(BarackObama,USA) isLedBy(USA,BarackObama) hasBirthPlace(A,B)  citizenOf(A,B) hasBirthPlace(BarackObama,USA) citizenOf(BarackObama,USA) Estimated Recall – 1/1

Estimated Recall for hasBirthPlace 143 Extracted facts nationState(USA) hasBirthPlace(BarackObama,USA) Person(BarackObama) citizenOf(BarackObama,USA) isLedBy(USA,BarackObama) Extracted facts nationState(USA) hasBirthPlace(BarackObama,USA) Person(BarackObama) citizenOf(BarackObama,USA) isLedBy(USA,BarackObama) hasBirthPlace(A,B)  citizenOf(A,B) Estimated Recall – 0/1 ? ?

Proposed Work  Learn first-order rules from incomplete and uncertain data  Facts extracted from natural language text are incomplete o Existing rule/structure learners assume that data is complete  Extractions from IE system have associated confidence scores o Existing structure learners cannot use these extraction probabilities  Absence of negative instances o Most rule learners require both positive and negative instances o Closed world assumption does not hold for machine reading 144

Extending Bayesian Logic Programs for Plan Recognition and Machine Reading Sindhu V. Raghavan Advisor: Raymond Mooney PhD Proposal May 12 th, 2011 1.

Similar presentations

Presentation on theme: "Extending Bayesian Logic Programs for Plan Recognition and Machine Reading Sindhu V. Raghavan Advisor: Raymond Mooney PhD Proposal May 12 th, 2011 1."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Extending Bayesian Logic Programs for Plan Recognition and Machine Reading Sindhu V. Raghavan Advisor: Raymond Mooney PhD Proposal May 12 th, 2011 1.

Similar presentations

Presentation on theme: "Extending Bayesian Logic Programs for Plan Recognition and Machine Reading Sindhu V. Raghavan Advisor: Raymond Mooney PhD Proposal May 12 th, 2011 1."— Presentation transcript:

Similar presentations

About project

Feedback