Montague meets Markov: Combining Logical and Distributional Semantics

Montague meets Markov: Combining Logical and Distributional Semantics
Raymond J. Mooney, Katrin Erk, Islam Beltagy
University of Texas at Austin

Logical AI Paradigm
Represents knowledge and data in a binary symbolic logic such as FOPC.
+ Rich representation that handles arbitrary sets of objects, with properties, relations, quantifiers, etc.
− Unable to handle uncertain knowledge and probabilistic reasoning.

Probabilistic AI Paradigm
Represents knowledge and data as a fixed set of random variables with a joint probability distribution.
+ Handles uncertain knowledge and probabilistic reasoning.
− Unable to handle arbitrary sets of objects, with properties, relations, quantifiers, etc.

Statistical Relational Learning (SRL) SRL methods attempt to integrate methods from predicate logic (or relational databases) and probabilistic graphical models to handle structured, multi-relational data.

SRL Approaches (A Taste of the “Alphabet Soup”)
Stochastic Logic Programs (SLPs) (Muggleton, 1996)
Probabilistic Relational Models (PRMs) (Koller, 1999)
Bayesian Logic Programs (BLPs) (Kersting & De Raedt, 2001)
Markov Logic Networks (MLNs) (Richardson & Domingos, 2006)
Probabilistic Soft Logic (PSL) (Kimmig et al., 2012)

SRL Methods Based on Probabilistic Graphical Models
BLPs use definite-clause logic (Prolog programs) to define abstract templates for large, complex Bayesian networks (i.e., directed graphical models).
MLNs use full first-order logic to define abstract templates for large, complex Markov networks (i.e., undirected graphical models).
PSL uses logical rules to define templates for Markov nets with real-valued propositions to support efficient inference.
McCallum’s FACTORIE uses an object-oriented programming language to define large, complex factor graphs.
Goodman & Tenenbaum’s CHURCH uses a functional programming language to define large, complex generative models.

Markov Logic Networks [Richardson & Domingos, 2006]
A set of weighted clauses in first-order predicate logic; a larger weight indicates a stronger belief that the clause should hold.
MLNs are templates for constructing Markov networks for a given set of constants.
MLN Example: Friends & Smokers

Example: Friends & Smokers
Two constants: Anna (A) and Bob (B)
Ground atoms: Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), Cancer(B)

Probability of a possible world:
P(X = x) = (1/Z) exp( Σ_i w_i n_i(x) )
where w_i is the weight of formula i, n_i(x) is the number of true groundings of formula i in x, and Z is the normalization constant.
A possible world becomes exponentially less likely as the total weight of all the grounded clauses it violates increases.
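To make the formula concrete, here is a minimal Python sketch (not the authors' code) that grounds a single assumed clause, Smokes(x) ⇒ Cancer(x) with an assumed weight of 1.5, for the two constants A and B, and scores every possible world by brute-force enumeration:

import itertools
import math

constants = ["A", "B"]
# ground atoms: Smokes(A), Smokes(B), Cancer(A), Cancer(B)
atoms = [f"Smokes({c})" for c in constants] + [f"Cancer({c})" for c in constants]

def n_true_groundings(world):
    """Count true groundings of Smokes(x) => Cancer(x) in a world (dict atom -> bool)."""
    return sum(1 for c in constants
               if (not world[f"Smokes({c})"]) or world[f"Cancer({c})"])

w = 1.5  # assumed clause weight, not taken from the slides
# enumerate all 2^4 possible worlds and score them: P(x) = (1/Z) exp(w * n(x))
worlds = [dict(zip(atoms, vals)) for vals in itertools.product([False, True], repeat=len(atoms))]
scores = [math.exp(w * n_true_groundings(world)) for world in worlds]
Z = sum(scores)                      # normalization constant
probs = [s / Z for s in scores]      # probability of each world

# worlds that violate more groundings of the clause get exponentially less mass
best = max(range(len(worlds)), key=lambda i: probs[i])
print(worlds[best], probs[best])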

MLN Inference Infer probability of a particular query given a set of evidence facts. P(Cancer(Anna) | Friends(Anna,Bob), Smokes(Bob)) Use standard algorithms for inference in graphical models such as Gibbs Sampling or belief propagation.
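A rough, brute-force sketch of the query above (real systems use Gibbs sampling, MC-SAT, or belief propagation on the ground network rather than enumeration). The two clauses and their weights, 1.5 for Smokes(x) ⇒ Cancer(x) and 1.1 for Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y)), are the usual Friends & Smokers illustration and are assumptions here, not values taken from the slides:

import itertools, math

C = ["Anna", "Bob"]
atoms = ([f"Smokes({x})" for x in C] + [f"Cancer({x})" for x in C] +
         [f"Friends({x},{y})" for x in C for y in C])

def score(world):  # exp( sum_i w_i * n_i(world) ), assumed weights 1.5 and 1.1
    n1 = sum((not world[f"Smokes({x})"]) or world[f"Cancer({x})"] for x in C)
    n2 = sum((not world[f"Friends({x},{y})"]) or (world[f"Smokes({x})"] == world[f"Smokes({y})"])
             for x in C for y in C)
    return math.exp(1.5 * n1 + 1.1 * n2)

evidence = {"Friends(Anna,Bob)": True, "Smokes(Bob)": True}
worlds = [dict(zip(atoms, v)) for v in itertools.product([False, True], repeat=len(atoms))]
worlds = [w for w in worlds if all(w[a] == v for a, v in evidence.items())]

num = sum(score(w) for w in worlds if w["Cancer(Anna)"])
den = sum(score(w) for w in worlds)
print("P(Cancer(Anna) | evidence) =", num / den)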

MLN Learning
Learning weights for an existing set of clauses: EM, max-margin, and online methods.
Learning logical clauses (a.k.a. structure learning): Inductive Logic Programming methods, top-down and bottom-up MLN clause learning, and online MLN clause learning.

Strengths of MLNs
Fully subsumes first-order predicate logic: just give infinite weight to all clauses.
Fully subsumes probabilistic graphical models: can represent any joint distribution over an arbitrary set of discrete random variables.
Can utilize prior knowledge in both symbolic and probabilistic forms.
Large existing base of open-source software (Alchemy).

Weaknesses of MLNs
Inherits the computational intractability of general methods for both logical and probabilistic inference and learning:
Inference in FOPC is semi-decidable.
Exact inference in general graphical models is #P-complete.
Just producing the “ground” Markov net can cause a combinatorial explosion (see the short calculation below).
Current “lifted” inference methods do not help with reasoning over many kinds of nested quantifiers.
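The grounding blow-up is easy to quantify: a clause with k distinct logical variables has n**k ground instances over n constants. A tiny illustrative calculation, with arbitrarily chosen domain sizes:

def num_groundings(n_constants: int, n_variables: int) -> int:
    """Ground instances of a single clause with n_variables distinct variables."""
    return n_constants ** n_variables

for n in (10, 100, 1000):
    # e.g. Friends(x,y) => (Smokes(x) <=> Smokes(y)) has two variables
    print(n, "constants ->", num_groundings(n, 2), "ground clauses for a 2-variable clause")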

PSL: Probabilistic Soft Logic [Kimmig, Bach, Broecheler, Huang & Getoor, NIPS 2012]
A probabilistic logic framework designed with efficient inference in mind.
Input: a set of weighted first-order logic rules and a set of evidence, just as in BLPs or MLNs.
MPE inference is a linear-programming problem that can efficiently draw probabilistic conclusions.

PSL vs. MLN
PSL: Atoms have continuous truth values in the interval [0,1]. Inference finds the truth values of all atoms that best satisfy the rules and evidence (MPE inference: Most Probable Explanation). A linear optimization problem.
MLN: Atoms have Boolean truth values {0, 1}. Inference finds the probability of atoms given the rules and evidence, i.e., the conditional probability of a query atom given the evidence. A combinatorial counting problem.

PSL Example
Weighted first-order logic rules (e.g., that friends and spouses tend to vote for the same candidate).
Evidence:
I(friend(John,Alex)) = 1
I(spouse(John,Mary)) = 1
I(votesFor(Alex,Romney)) = 1
I(votesFor(Mary,Obama)) = 1
Inference:
I(votesFor(John,Obama)) = 1
I(votesFor(John,Romney)) = 0

PSL’s Interpretation of Logical Connectives
Łukasiewicz relaxation of AND, OR, NOT:
I(ℓ1 ∧ ℓ2) = max {0, I(ℓ1) + I(ℓ2) − 1}
I(ℓ1 ∨ ℓ2) = min {1, I(ℓ1) + I(ℓ2)}
I(¬ℓ1) = 1 − I(ℓ1)
Distance to satisfaction of an implication ℓ1 → ℓ2 (satisfied iff I(ℓ1) ≤ I(ℓ2)):
d = max {0, I(ℓ1) − I(ℓ2)}
Example:
I(ℓ1) = 0.3, I(ℓ2) = 0.9 ⇒ d = 0
I(ℓ1) = 0.9, I(ℓ2) = 0.3 ⇒ d = 0.6
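These operators translate directly into code; the small sketch below just reproduces the two example values above:

def l_and(a, b):  return max(0.0, a + b - 1.0)   # Lukasiewicz AND
def l_or(a, b):   return min(1.0, a + b)          # Lukasiewicz OR
def l_not(a):     return 1.0 - a                  # Lukasiewicz NOT

def distance_to_satisfaction(body, head):
    # the rule body -> head is satisfied iff I(body) <= I(head)
    return max(0.0, body - head)

print(round(distance_to_satisfaction(0.3, 0.9), 2))  # 0.0: rule fully satisfied
print(round(distance_to_satisfaction(0.9, 0.3), 2))  # 0.6, as in the example above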

PSL Probability Distribution
p(I) = (1/Z) exp( − Σ_r w_r d_r(I) )
where I is a possible continuous truth assignment, the sum is over all rules r, w_r is the weight of rule r, d_r(I) is the rule’s distance to satisfaction under I, and Z is the normalization constant.

PSL Inference
MPE inference (Most Probable Explanation): find the interpretation that maximizes the PDF, i.e., that minimizes the weighted sum of distances to satisfaction Σ_r w_r d_r(I).
Since each distance to satisfaction is a linear function of the truth values, this is a linear optimization problem.
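As a concrete, heavily simplified sketch, the voting example can be written as a linear program with scipy.optimize.linprog. The three ground rules and their weights below are hypothetical stand-ins (the slide does not show the actual rule set): a weak "friends vote alike" rule, a strong "spouses vote alike" rule, and a rule that a person votes for at most one candidate. Each rule gets a slack variable equal to its distance to satisfaction, and the objective is the weighted sum of slacks:

import numpy as np
from scipy.optimize import linprog

# variables: x0 = I(votesFor(John,Obama)), x1 = I(votesFor(John,Romney)),
#            s0, s1, s2 = distance-to-satisfaction slack of each ground rule
# evidence (truth value 1): friend(John,Alex), votesFor(Alex,Romney),
#                           spouse(John,Mary), votesFor(Mary,Obama)
weights = [2.0, 8.0, 10.0]                      # hypothetical rule weights
c = np.array([0.0, 0.0] + weights)              # minimize sum_r w_r * s_r

# each row encodes  I(body_r) - I(head_r) <= s_r  with the Lukasiewicz AND in the body
A_ub = np.array([
    [0.0, -1.0, -1.0,  0.0,  0.0],   # friend rule:       1 - x1 <= s0
    [-1.0, 0.0,  0.0, -1.0,  0.0],   # spouse rule:       1 - x0 <= s1
    [1.0,  1.0,  0.0,  0.0, -1.0],   # at-most-one rule:  x0 + x1 - 1 <= s2
])
b_ub = np.array([-1.0, -1.0, 1.0])
bounds = [(0, 1), (0, 1), (0, None), (0, None), (0, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print("votesFor(John,Obama)  =", round(res.x[0], 3))   # -> 1.0 under these weights
print("votesFor(John,Romney) =", round(res.x[1], 3))   # -> 0.0 under these weights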

Semantic Representations
Formal semantics: uses first-order logic; deep but brittle.
Distributional semantics: statistical method; robust but shallow.
Combining both: represent meaning using a probabilistic logic (Markov Logic Networks or Probabilistic Soft Logic) and generate soft inference rules from distributional semantics.

System Architecture [Garrette et al. 2011, 2012; Beltagy et al., 2013]
Pipeline: each input sentence (Sent1, Sent2) is parsed by BOXER into a logical form (LF1, LF2); a vector space feeds the distributional rule constructor, which fills a rule base; MLN/PSL inference over the logical forms and the rule base produces the result.
BOXER [Bos et al., 2004]: maps sentences to logical form.
Distributional rule constructor: generates relevant soft inference rules based on distributional similarity.
MLN/PSL: probabilistic inference.
Result: degree of entailment or semantic similarity score (depending on the task).

Markov Logic Networks [Richardson & Domingos, 2006]
Weighted first-order clauses as templates for a ground Markov network: with the two constants Anna (A) and Bob (B), ground the atoms Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), Cancer(B) and infer P(Cancer(Anna) | Friends(Anna,Bob), Smokes(Bob)).

Recognizing Textual Entailment (RTE)
Premise: “A man is cutting pickles”
∃x,y,z. man(x) ∧ cut(y) ∧ agent(y, x) ∧ pickles(z) ∧ patient(y, z)
Hypothesis: “A guy is slicing cucumber”
∃x,y,z. guy(x) ∧ slice(y) ∧ agent(y, x) ∧ cucumber(z) ∧ patient(y, z)
Inference: Pr(Hypothesis | Premise) gives the degree of entailment.

Distributional Lexical Rules
For all pairs of words (a, b) where a is in S1 and b is in S2, add a soft rule relating the two:
∀x. a(x) → b(x) | wt(a, b),  where wt(a, b) = f( cos(a, b) )
Premise: “A man is cutting pickles”
Hypothesis: “A guy is slicing cucumber”
∀x. man(x) → guy(x) | wt(man, guy)
∀x. cut(x) → slice(x) | wt(cut, slice)
∀x. pickle(x) → cucumber(x) | wt(pickle, cucumber)
∀x. man(x) → cucumber(x) | wt(man, cucumber)
∀x. pickle(x) → guy(x) | wt(pickle, guy)
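A small sketch of how such lexical rules could be generated. The 4-dimensional word vectors below are made up for illustration, and f is left as the identity, whereas the real system maps cosine similarity to an MLN/PSL rule weight:

import numpy as np

vectors = {                      # hypothetical distributional vectors
    "man":      np.array([0.8, 0.1, 0.3, 0.0]),
    "guy":      np.array([0.7, 0.2, 0.3, 0.1]),
    "cut":      np.array([0.1, 0.9, 0.2, 0.4]),
    "slice":    np.array([0.2, 0.8, 0.1, 0.5]),
    "pickle":   np.array([0.0, 0.3, 0.9, 0.2]),
    "cucumber": np.array([0.1, 0.2, 0.8, 0.3]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lexical_rules(premise_words, hypothesis_words, f=lambda sim: sim):
    """Yield (rule, weight) pairs; f maps cosine similarity to a rule weight."""
    for a in premise_words:
        for b in hypothesis_words:
            wt = f(cosine(vectors[a], vectors[b]))
            yield f"forall x. {a}(x) -> {b}(x)", wt

for rule, wt in lexical_rules(["man", "cut", "pickle"], ["guy", "slice", "cucumber"]):
    print(f"{rule}   | {wt:.2f}")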

Distributional Phrase Rules
Premise: “A boy is playing”
Hypothesis: “A little kid is playing”
Need rules for phrases:
∀x. boy(x) → little(x) ∧ kid(x) | wt(boy, "little kid")
Compute vectors for phrases using vector addition [Mitchell & Lapata, 2010]: "little kid" = little + kid
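The same construction extends to phrases by composing vectors additively; again the 3-dimensional vectors below are invented purely for illustration:

import numpy as np

little = np.array([0.2, 0.7, 0.1])
kid    = np.array([0.6, 0.3, 0.2])
boy    = np.array([0.7, 0.4, 0.1])

little_kid = little + kid            # "little kid" = little + kid

cos = float(np.dot(boy, little_kid) / (np.linalg.norm(boy) * np.linalg.norm(little_kid)))
print("wt(boy, 'little kid') = f(", round(cos, 2), ")")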

Paraphrase Rules [by Cuong Chau]
Generate inference rules from pre-compiled paraphrase collections like Berant et al. [2012], e.g., “X solves Y” ⇒ “X finds a solution to Y” | w

Evaluation (RTE using MLNs)
Datasets: RTE-1, RTE-2, RTE-3; each dataset has 800 training pairs and 800 testing pairs.
Use multiple parses to reduce the impact of misparses.

Evaluation (RTE using MLNs) [by Cuong Chau]
Logic-only baseline: Bos & Markert [2005]; the KB is WordNet.

                       RTE-1   RTE-2   RTE-3
Bos & Markert [2005]   0.52    –       –
MLN                    0.57    0.58    0.55
MLN-multi-parse        0.56    0.58    0.57
MLN-paraphrases        0.60    0.60    0.60

Semantic Textual Similarity (STS)
Rate the semantic similarity of two sentences on a 0 to 5 scale.
Gold standards are averaged over multiple human judgments.
Evaluate by measuring correlation to the human ratings.

S1                            S2                              score
A man is slicing a cucumber   A guy is cutting a cucumber     5
A man is slicing a cucumber   A guy is cutting a zucchini     4
A man is slicing a cucumber   A woman is cooking a zucchini   3
A man is slicing a cucumber   A monkey is riding a bicycle    1
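The evaluation itself is just a Pearson correlation between the system scores and the averaged gold judgments, e.g. (with placeholder numbers, not results from the paper):

import numpy as np

gold   = np.array([5.0, 4.0, 3.0, 1.0])   # averaged human judgments
system = np.array([4.6, 4.1, 2.5, 0.8])   # hypothetical system output

pearson_r = float(np.corrcoef(gold, system)[0, 1])
print(f"Pearson r = {pearson_r:.2f}")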

Softening Conjunction for STS
Premise: “A man is driving”
∃x,y. man(x) ∧ drive(y) ∧ agent(y, x)
Hypothesis: “A man is driving a bus”
∃x,y,z. man(x) ∧ drive(y) ∧ agent(y, x) ∧ bus(z) ∧ patient(y, z)
Break the sentence into “mini-clauses”, then combine their evidence using an “averaging combiner” [Natarajan et al., 2010]. The hypothesis becomes (see the sketch below):
∀x,y,z. man(x) ∧ agent(y, x) → result()
∀x,y,z. drive(y) ∧ agent(y, x) → result()
∀x,y,z. drive(y) ∧ patient(y, z) → result()
∀x,y,z. bus(z) ∧ patient(y, z) → result()
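A rough sketch of the mini-clause construction. This is hedged: in the actual system the averaging combiner operates inside the MLN, not as the toy average shown here, and the evidence values are assumed for illustration:

def mini_clause(antecedent_literals):
    """Build one mini-clause rule string from a subset of the hypothesis literals."""
    return "forall x,y,z. " + " & ".join(antecedent_literals) + " -> result()"

clauses = [mini_clause(["man(x)", "agent(y,x)"]),
           mini_clause(["drive(y)", "agent(y,x)"]),
           mini_clause(["drive(y)", "patient(y,z)"]),
           mini_clause(["bus(z)", "patient(y,z)"])]
for c in clauses:
    print(c)

# suppose the premise "A man is driving" supports the first three mini-clauses but not the bus one
evidence = [1.0, 1.0, 1.0, 0.0]
print("result() =", sum(evidence) / len(evidence))   # 0.75: graded instead of all-or-nothing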

Evaluation (STS using MLN)
Microsoft video description corpus (SemEval 2012): short video descriptions.

System                                                  Pearson r
Our system with no distributional rules (logic only)    0.52
Our system with lexical rules                           0.60
Our system with lexical and phrase rules                0.63

PSL: Probabilistic Soft Logic [Kimmig, Bach, Broecheler, Huang & Getoor, NIPS 2012]
MLN inference is very slow. PSL is a probabilistic logic framework designed with efficient inference in mind: inference is a linear program.

STS using PSL: Conjunction
The Łukasiewicz relaxation of AND is very restrictive:
I(ℓ1 ∧ ℓ2) = max {0, I(ℓ1) + I(ℓ2) − 1}
Replace AND with a weighted average:
I(ℓ1 ∧ … ∧ ℓn) = w_avg( I(ℓ1), …, I(ℓn) )
Learning the weights is future work; for now they are equal.
Inference: the weighted average is a linear function, so the optimization problem is unchanged.
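The difference is easy to see numerically; a quick sketch comparing the Łukasiewicz conjunction with the (currently unweighted) average on a hypothesis where one conjunct is unsupported:

def lukasiewicz_and(values):
    """n-ary Lukasiewicz conjunction: max{0, sum - (n - 1)}."""
    return max(0.0, sum(values) - (len(values) - 1))

def weighted_average(values, weights=None):
    weights = weights or [1.0] * len(values)     # equal weights for now
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

truths = [0.9, 0.8, 0.9, 0.0]           # one missing conjunct (e.g. no "bus" in the premise)
print(lukasiewicz_and(truths))           # 0.0  -> the whole hypothesis is judged false
print(weighted_average(truths))          # 0.65 -> graded similarity survives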

Evaluation (STS using PSL)
msr-vid: Microsoft video description corpus (SemEval 2012), short video description sentences.
msr-par: Microsoft paraphrase corpus (SemEval 2012), long news sentences.
SICK: SemEval 2014.

                          msr-vid  msr-par  SICK
vec-add (dist. only)      0.78     0.24     0.65
vec-mul (dist. only)      0.76     0.12     0.62
MLN (logic + dist.)       0.63     0.16     0.47
PSL-no-DIR (logic only)   0.74     0.46     0.68
PSL (logic + dist.)       0.79     0.53     0.70
PSL+vec-add (ensemble)    0.83     0.49     0.71

Evaluation (STS using PSL)

                        msr-vid    msr-par    SICK
PSL time/pair           8s         30s        10s
MLN time/pair           1m 31s     11m 49s    4m 24s
MLN timeouts (10 min)   9%         97%        36%