Natural Language Semantics: Combining Logical and Distributional Methods using Probabilistic Logic. Raymond J. Mooney, Katrin Erk, Islam Beltagy, Stephen Roller, Pengxiang Cheng. University of Texas at Austin.

Logical AI Paradigm: Represents knowledge and data in a binary symbolic logic such as FOPC. + Rich representation that handles arbitrary sets of objects, with properties, relations, logical connectives, and quantifiers. – Unable to handle uncertain knowledge and probabilistic reasoning.

Logical Semantics for Language: Richard Montague (1970) developed a formal method for mapping natural language to FOPC using Church's lambda calculus of functions and the fundamental principle of semantic compositionality: recursively computing the meaning of each syntactic constituent from the meanings of its sub-constituents. Later called "Montague Grammar" or "Montague Semantics".
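
To make the compositional idea concrete, here is a minimal sketch (not from the slides) of Montague-style composition, using Python lambdas as stand-ins for lambda-calculus terms; the tiny lexicon and the predicate names sleep and love are illustrative.

```python
# Minimal sketch of Montague-style compositional semantics.
# Word meanings are functions (lambda terms); sentence meaning is built by
# function application, mirroring the syntactic structure.

# Lexicon: proper names denote constants, verbs denote functions over them.
john   = "john"
mary   = "mary"
sleeps = lambda x: f"sleep({x})"               # intransitive verb: e -> t
loves  = lambda y: lambda x: f"love({x},{y})"  # transitive verb: e -> (e -> t)

# "John sleeps"      => apply [[sleeps]] to [[John]]
print(sleeps(john))        # sleep(john)

# "John loves Mary"  => apply [[loves]] to [[Mary]], then to [[John]]
print(loves(mary)(john))   # love(john,mary)
```

In full Montague semantics, quantified noun phrases such as "every man" denote higher-order functions that take the verb meaning as an argument; the sketch shows only the simplest case of function application.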

Interesting Book on Montague: See Aifric Campbell's (2009) novel The Semantics of Murder for a fictionalized account of his mysterious death in 1971 (homicide or homoerotic asphyxiation?).

Semantic Parsing: Mapping a natural-language sentence to a detailed representation of its complete meaning in a fully formal language that: – Has a rich ontology of types, properties, and relations. – Supports automated reasoning or execution.

Geoquery: A Database Query Application. A query application for a U.S. geography database containing about 800 facts [Zelle & Mooney, 1996]. Example: the question "What is the smallest state by area?" is semantically parsed into the query answer(x1,smallest(x2,(state(x1),area(x1,x2)))), which is executed against the database to produce the answer: Rhode Island.
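
As a rough illustration only (the actual Geoquery system executes Prolog queries against its database), the sketch below evaluates the meaning of the parsed query over a tiny hand-built fact table; the state areas are approximate and the helper name is made up.

```python
# Toy sketch: evaluating the logical query
#   answer(x1, smallest(x2, (state(x1), area(x1, x2))))
# against a tiny fact base. Areas (square miles) are approximate.
state_area = {"rhode island": 1545, "delaware": 2489,
              "texas": 268596, "alaska": 665384}

def answer_smallest_state_by_area():
    # smallest(x2, (state(x1), area(x1, x2))): pick the state x1 whose area x2 is minimal
    return min(state_area, key=state_area.get)

print(answer_smallest_state_by_area())   # -> rhode island
```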

Composing Meanings from Parse Trees: "What is the capital of Ohio?" [Parse-tree figure: the logical form of each constituent is computed from its children by function application, e.g. "Ohio" → stateid('ohio'), "of Ohio" → loc_2(stateid('ohio')), "the capital of Ohio" → capital(loc_2(stateid('ohio'))), and the full question → answer(capital(loc_2(stateid('ohio'))))].

Distributional (Vector-Space) Lexical Semantics: Represent word meanings as points (vectors) in a (high-dimensional) Euclidean space. Dimensions encode aspects of the context in which the word appears (e.g. how often it co-occurs with another specific word). Semantic similarity is defined as distance between points in this semantic space. Many specific mathematical models exist for computing dimensions and similarity; the first (1990) was Latent Semantic Analysis (LSA).
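
A minimal sketch of the idea with invented co-occurrence counts: each word is a vector of counts of context words, and semantic similarity is measured by the cosine of the angle between vectors (closely related to distance in the space).

```python
import math

# Toy co-occurrence vectors over three context dimensions; the counts are invented.
# Context words:        ("pet", "eat", "compute")
vectors = {
    "dog":      (55, 30,  1),
    "cat":      (60, 25,  2),
    "computer": ( 2,  1, 70),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda w: math.sqrt(sum(a * a for a in w))
    return dot / (norm(u) * norm(v))

print(cosine(vectors["dog"], vectors["cat"]))       # close to 1: similar meanings
print(cosine(vectors["dog"], vectors["computer"]))  # close to 0: dissimilar
```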

Sample Lexical Vector Space (reduced to 2 dimensions). [Scatter-plot figure showing words such as dog, cat, man, woman, bottle, cup, water, rock, computer, and robot, with semantically similar words appearing near each other.]

Issues with Distributional Semantics: How to compose meanings of larger phrases and sentences from lexical representations? (Many recent proposals involve matrices, tensors, etc.) None of the proposals for compositionality capture the full representational or inferential power of FOPC (Grefenstette, 2013). My impassioned reaction to this work: "You can't cram the meaning of a whole %&!$# sentence into a single $&!#* vector!"

Limits of Distributional Representations: How would a distributional approach represent and answer complex questions requiring aggregation of data? Given IMDB or FreeBase data, answer the question: "Did Woody Allen make more movies with Diane Keaton or Mia Farrow?" Answer: Mia Farrow (12 vs. 7).
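
Such aggregation is trivial over symbolic, structured data. The sketch below counts co-appearances in an invented mini film table (not real IMDB or FreeBase records), just to make explicit the counting step that a purely vector-based representation has no obvious way to perform.

```python
from collections import Counter

# Invented toy data: (film, director, co-star) -- not actual IMDB/FreeBase records.
films = [
    ("Film A", "Woody Allen", "Diane Keaton"),
    ("Film B", "Woody Allen", "Mia Farrow"),
    ("Film C", "Woody Allen", "Mia Farrow"),
    ("Film D", "Woody Allen", "Diane Keaton"),
    ("Film E", "Woody Allen", "Mia Farrow"),
]

# Count how often each co-star appears in a Woody Allen film in this toy table.
counts = Counter(costar for _, director, costar in films if director == "Woody Allen")
print(counts.most_common(1)[0])   # the co-star with the most shared films here
```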

Using Distributional Semantics with Standard Logical Form: Recent work on unsupervised semantic parsing (Poon & Domingos, 2009) and work by Lewis and Steedman (2013) automatically create an ontology of predicates by clustering based on distributional information. But they do not allow gradedness and uncertainty in the final semantic representation and inference.

Probabilistic AI Paradigm: Represents knowledge and data as a fixed set of random variables with a joint probability distribution. + Handles uncertain knowledge and probabilistic reasoning. – Unable to handle arbitrary sets of objects, with properties, relations, quantifiers, etc.

Statistical Relational Learning (SRL) SRL methods attempt to integrate methods from predicate logic (or relational databases) and probabilistic graphical models to handle structured, multi-relational data.

SRL Approaches (A Taste of the "Alphabet Soup"): Stochastic Logic Programs (SLPs) (Muggleton, 1996); Probabilistic Relational Models (PRMs) (Koller, 1999); Bayesian Logic Programs (BLPs) (Kersting & De Raedt, 2001); Markov Logic Networks (MLNs) (Richardson & Domingos, 2006); Probabilistic Soft Logic (PSL) (Kimmig et al., 2012).

Formal Semantics for Natural Language using Probabilistic Logical Form: Represent the meaning of natural language in a formal probabilistic logic (Beltagy et al., 2013, 2014, 2015): "Montague meets Markov".

Markov Logic Networks [Richardson & Domingos, 2006]: A set of weighted clauses in first-order predicate logic. Larger weight indicates stronger belief that the clause should hold. MLNs are templates for constructing Markov networks for a given set of constants. MLN Example: Friends & Smokers.
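
The Friends & Smokers formulas appear only as images in the original slides; the sketch below spells out the standard Richardson & Domingos example with illustrative weights and shows the "template" idea by grounding the clauses for a given set of constants.

```python
from itertools import product

# Standard Friends & Smokers MLN (weights here are illustrative):
#   w1: Smokes(x) => Cancer(x)
#   w2: Friends(x, y) => (Smokes(x) <=> Smokes(y))
weighted_clauses = [
    (1.5, "Smokes({x}) => Cancer({x})"),
    (1.1, "Friends({x},{y}) => (Smokes({x}) <=> Smokes({y}))"),
]

constants = ["A", "B"]   # Anna and Bob

# Grounding: substitute every combination of constants for the variables.
for weight, template in weighted_clauses:
    variables = ["x"] if "{y}" not in template else ["x", "y"]
    for values in product(constants, repeat=len(variables)):
        grounding = template.format(**dict(zip(variables, values)))
        print(weight, grounding)
```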

Example: Friends & Smokers. Two constants: Anna (A) and Bob (B).

Example: Friends & Smokers. Two constants: Anna (A) and Bob (B). [Ground Markov network figure, built up over three slides: nodes for the ground atoms Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), with edges added between atoms that appear together in a ground clause.]

Probability of a possible world x: P(X = x) = (1/Z) exp(Σ_i w_i n_i(x)), where w_i is the weight of formula i and n_i(x) is the number of true groundings of formula i in x. A possible world becomes exponentially less likely as the total weight of all the grounded clauses it violates increases.
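
A minimal sketch of this computation for the Friends & Smokers example (illustrative weights; the normalizer Z is computed by brute-force enumeration, which is feasible here only because there are just eight ground atoms):

```python
import itertools, math

constants = ["A", "B"]
# Ground atoms of the Friends & Smokers MLN for constants Anna (A) and Bob (B).
atoms = ([f"Smokes({c})" for c in constants] +
         [f"Cancer({c})" for c in constants] +
         [f"Friends({x},{y})" for x in constants for y in constants])

def n_true_groundings(world):
    """Count true groundings of each formula in a world (dict atom -> bool)."""
    n1 = sum((not world[f"Smokes({x})"]) or world[f"Cancer({x})"]     # Smokes(x) => Cancer(x)
             for x in constants)
    n2 = sum((not world[f"Friends({x},{y})"]) or
             (world[f"Smokes({x})"] == world[f"Smokes({y})"])         # Friends(x,y) => (Smokes(x) <=> Smokes(y))
             for x in constants for y in constants)
    return n1, n2

w1, w2 = 1.5, 1.1   # illustrative clause weights

def unnormalized(world):
    n1, n2 = n_true_groundings(world)
    return math.exp(w1 * n1 + w2 * n2)

worlds = [dict(zip(atoms, values))
          for values in itertools.product([False, True], repeat=len(atoms))]
Z = sum(unnormalized(w) for w in worlds)

# Probability of one particular possible world:
example = {a: False for a in atoms}
example.update({"Smokes(A)": True, "Cancer(A)": True, "Friends(A,B)": True})
print(unnormalized(example) / Z)
```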

MLN Inference: Infer the probability of a particular query given a set of evidence facts, e.g. P(Cancer(Anna) | Friends(Anna,Bob), Smokes(Bob)). Use standard algorithms for inference in graphical models, such as Gibbs sampling or belief propagation.
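
A compact, self-contained Gibbs-sampling sketch for the example query above, again with illustrative Friends & Smokers weights; for a network this small exact enumeration would work just as well, and real MLN systems use more sophisticated samplers such as MC-SAT.

```python
import math, random

random.seed(0)
constants = ["A", "B"]
w1, w2 = 1.5, 1.1   # illustrative weights for the two Friends & Smokers clauses

def total_weight(world):
    """Sum of weights of satisfied ground clauses in a world (dict atom -> bool)."""
    t = sum(w1 for x in constants
            if (not world[f"Smokes({x})"]) or world[f"Cancer({x})"])
    t += sum(w2 for x in constants for y in constants
             if (not world[f"Friends({x},{y})"]) or
                (world[f"Smokes({x})"] == world[f"Smokes({y})"]))
    return t

atoms = ([f"Smokes({c})" for c in constants] + [f"Cancer({c})" for c in constants] +
         [f"Friends({x},{y})" for x in constants for y in constants])

# Evidence: Friends(A,B) and Smokes(B) are true; all other atoms are sampled.
evidence = {"Friends(A,B)": True, "Smokes(B)": True}
world = {a: evidence.get(a, random.random() < 0.5) for a in atoms}
query, samples, true_count = "Cancer(A)", 5000, 0

for step in range(samples):
    for a in atoms:
        if a in evidence:
            continue
        # Resample atom a from its conditional distribution given the rest of the world.
        weights = []
        for value in (True, False):
            world[a] = value
            weights.append(math.exp(total_weight(world)))
        p_true = weights[0] / (weights[0] + weights[1])
        world[a] = random.random() < p_true
    true_count += world[query]

print(true_count / samples)   # estimate of P(Cancer(A) | Friends(A,B), Smokes(B))
```

Resampling each non-evidence atom from its full conditional given the rest of the world is the textbook Gibbs scheme; only the clauses containing the atom actually affect the conditional, which is what makes more efficient implementations possible.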

Strengths of MLNs: Fully subsumes first-order predicate logic (just give infinite weight to all clauses). Fully subsumes probabilistic graphical models (can represent any joint distribution over an arbitrary set of discrete random variables). Can utilize prior knowledge in both symbolic and probabilistic forms. Existing open-source software (Alchemy, Tuffy).

Weaknesses of MLNs: Inherits the computational intractability of general methods for both logical and probabilistic inference and learning: inference in FOPC is semi-decidable, and exact inference in general graphical models is #P-complete. Just producing the "ground" Markov net can cause a combinatorial explosion. Current "lifted" inference methods do not help reasoning with many kinds of nested quantifiers.

Semantic Representations: Formal semantics uses first-order logic; it is deep but brittle. Distributional semantics is a statistical method; it is robust but shallow. Combine both logical and distributional semantics: represent meaning using a probabilistic logic (Markov Logic Networks (MLN) or Probabilistic Soft Logic (PSL)) and generate soft inference rules from distributional semantics.

System Architecture [Garrette et al. 2011, 2012; Beltagy et al., 2013, 2014, 2015]: [Pipeline diagram: Sent1 and Sent2 → BOXER → logical forms LF1 and LF2; a distributional rule constructor builds a rule base from a vector space; MLN/PSL inference produces the result.] BOXER (Bos, et al. 2004): CCG-based parser that maps sentences to logical form. Distributional rule constructor: generates relevant soft inference rules based on distributional similarity. MLN/PSL: probabilistic inference. Result: degree of entailment or semantic similarity score (depending on the task).

Recognizing Textual Entailment (RTE): Premise: "A man is cutting a pickle" → ∃x,y,z [man(x) ∧ cut(y) ∧ agent(y, x) ∧ pickle(z) ∧ patient(y, z)]. Hypothesis: "A guy is slicing a cucumber" → ∃x,y,z [guy(x) ∧ slice(y) ∧ agent(y, x) ∧ cucumber(z) ∧ patient(y, z)]. Inference: Pr(Hypothesis | Premise), the degree of entailment.

Distributional Lexical Rules: For all pairs of words (a, b) where a is in S1 and b is in S2, add a soft rule relating the two: ∀x a(x) → b(x) | wt(a, b), where wt(a, b) = f(cos(a, b)). Premise: "A man is cutting a pickle". Hypothesis: "A guy is slicing a cucumber". Example rules: ∀x man(x) → guy(x) | wt(man, guy); ∀x cut(x) → slice(x) | wt(cut, slice); ∀x pickle(x) → cucumber(x) | wt(pickle, cucumber); ∀x man(x) → cucumber(x) | wt(man, cucumber); ∀x pickle(x) → guy(x) | wt(pickle, guy); …
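
A minimal sketch of this rule generation with invented word vectors; here the weight function f is just the raw cosine, whereas the actual system maps similarities to MLN weights more carefully.

```python
import math
from itertools import product

# Invented word vectors; in practice these come from a distributional model.
vec = {
    "man": (0.9, 0.1, 0.2), "cut": (0.2, 0.8, 0.1), "pickle": (0.1, 0.3, 0.9),
    "guy": (0.8, 0.2, 0.1), "slice": (0.3, 0.7, 0.2), "cucumber": (0.2, 0.2, 0.8),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

premise_words    = ["man", "cut", "pickle"]
hypothesis_words = ["guy", "slice", "cucumber"]

# One weighted soft rule per (premise word, hypothesis word) pair.
for a, b in product(premise_words, hypothesis_words):
    wt = cosine(vec[a], vec[b])          # stand-in for f(cos(a, b))
    print(f"{wt:.2f}  forall x: {a}(x) -> {b}(x)")
```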

Rules from WordNet: Extract "hard" rules from WordNet.
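
As an illustration of what such hard rules can look like (this is not the system's actual rule extractor), the sketch below uses NLTK's WordNet interface to turn synonym and hypernym relations into rules; it assumes NLTK and its WordNet data are installed (nltk.download('wordnet')).

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def wordnet_rules(word):
    """Emit hard rules from WordNet synonymy and hypernymy for one noun."""
    rules = set()
    for synset in wn.synsets(word, pos=wn.NOUN):
        for lemma in synset.lemma_names():                # synonyms: bidirectional rules
            if lemma != word:
                rules.add(f"forall x: {word}(x) <-> {lemma}(x)")
        for hyper in synset.hypernyms():                  # hypernyms: one-directional rules
            for lemma in hyper.lemma_names():
                rules.add(f"forall x: {word}(x) -> {lemma}(x)")
    return sorted(rules)

for rule in wordnet_rules("cucumber"):
    print(rule)   # e.g. rules of the form cucumber(x) -> vegetable(x)
```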

Rules from Paraphrase Databases (PPDB): Translate paraphrase rules to logic, e.g. "person riding a bike" → "biker". Learn a scaling factor that maps PPDB weights to MLN weights so as to maximize performance on training data.
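
A rough sketch of the scaling-factor idea; the train_accuracy function below is a placeholder for re-running the full RTE system on the training data, and the candidate range and example PPDB score are made up.

```python
# Hypothetical sketch: choose a scale s so that MLN rule weight = s * ppdb_score.

def train_accuracy(scale):
    # Placeholder objective with a peak near 2.0; the real system would re-run
    # MLN inference on the training pairs using rule weights scale * ppdb_score.
    return 1.0 - abs(scale - 2.0) / 10.0

candidates = [i / 10.0 for i in range(1, 51)]   # try scales 0.1 .. 5.0
best_scale = max(candidates, key=train_accuracy)
print(best_scale)

ppdb_score = 3.2                                # example PPDB confidence (made up)
mln_weight = best_scale * ppdb_score
print(mln_weight)
```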

Entailment Rule Construction: An alternative to constructing rules for all word pairs: construct a specific rule just sufficient to allow entailing the Hypothesis from the Premise, using a version of resolution theorem proving. Construct a weight for this rule using distributional information.

Sample Lexical Entailment Rule Construction: Premise: "A groundhog sat on a hill." → ∃x,y,z [groundhog(x) ∧ sat(y) ∧ agent(y, x) ∧ on(y,z) ∧ hill(z)]. Hypothesis: "A woodchuck sat on a hill." → ∃x,y,z [woodchuck(x) ∧ sat(y) ∧ agent(y, x) ∧ on(y,z) ∧ hill(z)]. Constructed Rule: ∀x [groundhog(x) → woodchuck(x)].

Sample Phrasal Entailment Rule Construction: Premise: "A person solved a problem." → ∃x,y,z [person(x) ∧ solved(y) ∧ agent(y, x) ∧ patient(y,z) ∧ problem(z)]. Hypothesis: "A person found a solution to a problem." → ∃x,y,z,w [person(x) ∧ found(y) ∧ agent(y, x) ∧ patient(y,w) ∧ solution(w) ∧ to(y,z) ∧ problem(z)]. Constructed Rule: ∀x,y [solved(y) ∧ patient(y,x) → ∃w,z (found(y) ∧ patient(y,w) ∧ solution(w) ∧ to(y,z))].

Entailment Rule Classifier: Use distributional information to recognize lexical relationships (e.g. synonymy, hypernymy, meronymy) (Baroni et al., 2012; Roller et al., 2014). Train a supervised classifier to recognize semantic relationships using distributional (and other) features of the words. For phrasal entailment rules, use features from the compositional distributional representation of the phrases (Paperno et al., 2014). For SICK RTE, classify rules as entails, contradicts, or neutral.
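
A minimal sketch of such a classifier with invented vectors and labels (real systems use much richer distributional and resource-based features): concatenate the two word vectors and their difference, then train a multi-class classifier over entails / contradicts / neutral.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented 4-d word vectors and rule labels, purely for illustration.
vec = {
    "groundhog": [0.90, 0.10, 0.30, 0.20], "woodchuck": [0.88, 0.12, 0.28, 0.22],
    "pickle":    [0.20, 0.80, 0.10, 0.40], "cucumber":  [0.25, 0.75, 0.15, 0.35],
    "man":       [0.50, 0.20, 0.90, 0.10], "banana":    [0.10, 0.90, 0.20, 0.60],
}
pairs  = [("groundhog", "woodchuck"), ("pickle", "cucumber"), ("man", "banana")]
labels = ["entails", "entails", "neutral"]

def features(a, b):
    u, v = np.array(vec[a]), np.array(vec[b])
    return np.concatenate([u, v, u - v])        # lhs vector, rhs vector, and their difference

X = np.array([features(a, b) for a, b in pairs])
clf = LogisticRegression(max_iter=1000).fit(X, labels)

print(clf.predict([features("groundhog", "woodchuck")]))   # likely ['entails'] on this toy data
```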

Lexical Rule Features.

Phrasal Rule Features.

Employing Multiple CCG Parsers: Boxer relies on the C&C CCG parser, which frequently makes mistakes. EasyCCG (Lewis & Steedman, 2014) is a newer CCG parser that makes fewer (different) mistakes. MultiParse integrates both parse results into the RTE inference process.

Experimental Evaluation: SICK RTE Task. SICK (Sentences Involving Compositional Knowledge) is a SemEval 2014 task. The RTE task is to classify pairs of sentences as: Entailment, Contradiction, or Neutral.

SICK RTE Results (test accuracy):
MLN Logic: 73.37
MLN Logic + PPDB: 76.33
MLN Logic + PPDB + WordNet: 78.40
MLN Logic + PPDB + WordNet + MultiParse: 80.37
MLN Logic + PPDB + WordNet + MultiParse + Distributional Rules + Remembered Training Entailment Rules: 84.94
Competition Winner (Lai & Hockenmaier, 2014)

Future Work: Improve inference efficiency for MLNs by exploiting the latest work in "lifted inference". Improve logical form construction using the latest methods in semantic parsing. Improve the entailment rule classifier. Improve the distributional representation of phrases. Enable question answering by developing efficient constructive existential theorem proving in MLNs.

Conclusions: Traditional logical and distributional approaches to natural language semantics have complementary strengths and weaknesses. These competing approaches can be combined using a probabilistic logic (e.g. MLNs) as a uniform semantic representation. This allows easy integration of additional knowledge sources and parsers, and yields state-of-the-art results on the SICK RTE challenge.

Questions? See the recent in-review journal paper available on arXiv: Representing Meaning with a Combination of Logical Form and Vectors. I. Beltagy, S. Roller, P. Cheng, K. Erk & R. J. Mooney. arXiv preprint [cs.CL].