1 Natural Language Semantics: Combining Logical and Distributional Methods using Probabilistic Logic
Raymond J. Mooney, Katrin Erk, Islam Beltagy
University of Texas at Austin

2 Logical AI Paradigm
Represents knowledge and data in a binary symbolic logic such as FOPC.
+ Rich representation that handles arbitrary sets of objects, with properties, relations, quantifiers, etc.
− Unable to handle uncertain knowledge and probabilistic reasoning.

3 Logical Semantics for Language
Richard Montague (1970) developed a formal method for mapping natural language to FOPC using Church's lambda calculus of functions and the principle of semantic compositionality: the meaning of each syntactic constituent is computed recursively from the meanings of its sub-constituents. Later called "Montague Grammar" or "Montague Semantics."

4 Interesting Book on Montague
See Aifric Campbell’s (2009) novel The Semantics of Murder for a fictionalized account of his mysterious death in 1971 (homicide or homoerotic asphyxiation??).

5 Semantic Parsing
Mapping a natural-language sentence to a detailed representation of its complete meaning in a fully formal language that:
Has a rich ontology of types, properties, and relations.
Supports automated reasoning or execution.

6 Geoquery: A Database Query Application
Query application for a U.S. geography database containing about 800 facts [Zelle & Mooney, 1996].
Question: "What is the smallest state by area?"
Semantic parsing produces the query: answer(x1, smallest(x2, (state(x1), area(x1, x2))))
Query execution returns the answer: Rhode Island
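To make the "execution" step concrete, here is a minimal Python sketch (not the actual Geoquery system) that evaluates the logical form above against a hypothetical two-state facts table:

```python
# Toy sketch of executing a Geoquery-style logical form against a tiny
# facts table. The database contents and helper below are illustrative,
# not the actual Geoquery implementation.
state_area = {            # hypothetical facts: state -> area (square miles)
    "texas": 268596,
    "rhode island": 1545,
}

def answer_smallest_state_by_area():
    # answer(x1, smallest(x2, (state(x1), area(x1, x2)))):
    # choose the state x1 whose area x2 is minimal.
    return min(state_area, key=state_area.get)

print(answer_smallest_state_by_area())  # -> 'rhode island'
```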

7 Distributional (Vector-Space) Lexical Semantics
Represent word meanings as points (vectors) in a (high-dimensional) Euclidean space.
Dimensions encode aspects of the context in which the word appears (e.g., how often it co-occurs with another specific word).
Semantic similarity is defined as distance between points in this semantic space.
Many specific mathematical models exist for computing dimensions and similarity; the first such model (1990) was Latent Semantic Analysis (LSA).
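As a minimal illustration of the idea (toy vectors, not a trained model), cosine similarity over co-occurrence-style vectors:

```python
import numpy as np

# Toy co-occurrence-style vectors (made-up numbers, not a trained model).
vectors = {
    "cup":    np.array([0.9, 0.1, 0.0]),
    "bottle": np.array([0.8, 0.2, 0.1]),
    "dog":    np.array([0.1, 0.9, 0.3]),
}

def cosine(u, v):
    # Similarity = closeness of the two points in the semantic space.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vectors["cup"], vectors["bottle"]))  # high: similar contexts
print(cosine(vectors["cup"], vectors["dog"]))     # low: different contexts
```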

8 Sample Lexical Vector Space (reduced to 2 dimensions)
(Figure: words such as bottle, cup, water, dog, cat, computer, robot, woman, man, and rock plotted as points in a 2-D semantic space, with similar words near each other.)

9 Issues with Distributional Semantics
How to compose meanings of larger phrases and sentences from lexical representations? (many recent proposals…)
None of the proposals for compositionality captures the full representational or inferential power of FOPC (Grefenstette, 2013).
"You can't cram the meaning of a whole %&!$# sentence into a single $&!#* vector!"

10 Using Distributional Semantics with Standard Logical Form
Recent work on unsupervised semantic parsing (Poon & Domingos, 2009) and work by Lewis and Steedman (2013) automatically create an ontology of predicates by clustering them using distributional information.
But they do not allow gradedness and uncertainty in the final semantic representation and inference.

11 Probabilistic AI Paradigm
Represents knowledge and data as a fixed set of random variables with a joint probability distribution.
+ Handles uncertain knowledge and probabilistic reasoning.
− Unable to handle arbitrary sets of objects, with properties, relations, quantifiers, etc.

12 Statistical Relational Learning (SRL)
SRL methods attempt to integrate methods from predicate logic (or relational databases) and probabilistic graphical models to handle structured, multi-relational data.

13 SRL Approaches (A Taste of the "Alphabet Soup")
Stochastic Logic Programs (SLPs) (Muggleton, 1996)
Probabilistic Relational Models (PRMs) (Koller, 1999)
Bayesian Logic Programs (BLPs) (Kersting & De Raedt, 2001)
Markov Logic Networks (MLNs) (Richardson & Domingos, 2006)
Probabilistic Soft Logic (PSL) (Kimmig et al., 2012)

14 Formal Semantics for Natural Language using Probabilistic Logical Form
Represent the meaning of natural language in a formal probabilistic logic (Beltagy et al., 2013, 2014): Markov Logic Networks (MLNs) and Probabilistic Soft Logic (PSL).
"Montague meets Markov"

15 Markov Logic Networks [Richardson & Domingos, 2006]
Set of weighted clauses in first-order predicate logic; a larger weight indicates a stronger belief that the clause should hold.
MLNs are templates for constructing Markov networks for a given set of constants.
MLN Example: Friends & Smokers

16 Example: Friends & Smokers
Two constants: Anna (A) and Bob (B)

17 Example: Friends & Smokers
Two constants: Anna (A) and Bob (B). Ground atoms: Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), Cancer(B)

18–19 Example: Friends & Smokers
(Figures: the ground Markov network constructed over these atoms for the two constants.)

20 Probability of a possible world
P(X = x) = (1/Z) exp( Σ_i w_i n_i(x) )
where w_i is the weight of formula i, n_i(x) is the number of true groundings of formula i in x, and Z is the normalization constant.
A possible world becomes exponentially less likely as the total weight of all the grounded clauses it violates increases.
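A minimal sketch of this formula on the Friends & Smokers example, assuming the two canonical clauses Smokes(x) ⇒ Cancer(x) and Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y)) with made-up weights; this is just the arithmetic of the equation, not Alchemy:

```python
import itertools, math

constants = ["A", "B"]   # Anna and Bob
weights = (1.5, 1.1)     # made-up weights for the two clauses

def n_true_groundings(world):
    """Count true groundings of each clause in a possible world.
    `world` maps ground atoms such as ('Smokes', 'A') to True/False.
    Clause 1: Smokes(x) => Cancer(x).
    Clause 2: Friends(x,y) => (Smokes(x) <=> Smokes(y))."""
    n1 = sum((not world[("Smokes", x)]) or world[("Cancer", x)]
             for x in constants)
    n2 = sum((not world[("Friends", x, y)]) or
             (world[("Smokes", x)] == world[("Smokes", y)])
             for x, y in itertools.product(constants, repeat=2))
    return (n1, n2)

def unnormalized_prob(world):
    # P(X = x) is proportional to exp( sum_i w_i * n_i(x) ).
    return math.exp(sum(w * n for w, n in zip(weights, n_true_groundings(world))))

# One possible world: nobody smokes, nobody has cancer, no friendships.
world = {("Smokes", x): False for x in constants}
world.update({("Cancer", x): False for x in constants})
world.update({("Friends", x, y): False
              for x, y in itertools.product(constants, repeat=2)})
print(unnormalized_prob(world))  # every grounding satisfied: exp(2*1.5 + 4*1.1)
```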

21 MLN Inference
Infer the probability of a particular query given a set of evidence facts, e.g., P(Cancer(Anna) | Friends(Anna,Bob), Smokes(Bob)).
Use standard algorithms for inference in graphical models, such as Gibbs sampling or belief propagation.

22 MLN Learning
Learning weights for an existing set of clauses: EM, max-margin, on-line methods.
Learning logical clauses (a.k.a. structure learning): Inductive Logic Programming methods; top-down and bottom-up MLN clause learning; on-line MLN clause learning.

23 Strengths of MLNs
Fully subsumes first-order predicate logic: just give infinite weight to all clauses.
Fully subsumes probabilistic graphical models: can represent any joint distribution over an arbitrary set of discrete random variables.
Can utilize prior knowledge in both symbolic and probabilistic forms.
Large existing base of open-source software (Alchemy).

24 Weaknesses of MLNs
Inherits the computational intractability of general methods for both logical and probabilistic inference and learning: inference in FOPC is semi-decidable, and exact inference in general graphical models is #P-complete.
Just producing the "ground" Markov net can cause a combinatorial explosion.
Current "lifted" inference methods do not help reasoning with many kinds of nested quantifiers.

25 PSL: Probabilistic Soft Logic [Kimmig, Bach, Broecheler, Huang & Getoor, NIPS 2012]
Probabilistic logic framework designed with efficient inference in mind.
Input: a set of weighted first-order logic rules and a set of evidence, just as in BLP or MLN.
MPE inference is a linear-programming problem that can efficiently draw probabilistic conclusions.

26 PSL vs. MLN
PSL: Atoms have continuous truth values in the interval [0,1]. Inference finds the truth values of all atoms that best satisfy the rules and evidence (MPE inference: Most Probable Explanation). A linear optimization problem.
MLN: Atoms have Boolean truth values {0, 1}. Inference finds the probability of atoms given the rules and evidence, i.e., the conditional probability of a query atom given evidence. A combinatorial counting problem.

27 PSL Example
First-order logic weighted rules: (shown as a figure on the slide)
Evidence: I(friend(John,Alex)) = 1, I(spouse(John,Mary)) = 1, I(votesFor(Alex,Romney)) = 1, I(votesFor(Mary,Obama)) = 1
Inference: I(votesFor(John, Obama)) = 1, I(votesFor(John, Romney)) = 0

28 PSL’s Interpretation of Logical Connectives
Łukasiewicz relaxation of AND, OR, NOT:
I(ℓ1 ∧ ℓ2) = max {0, I(ℓ1) + I(ℓ2) – 1}
I(ℓ1 ∨ ℓ2) = min {1, I(ℓ1) + I(ℓ2)}
I(¬ℓ1) = 1 – I(ℓ1)
Distance to satisfaction of an implication ℓ1 → ℓ2, which is satisfied iff I(ℓ1) ≤ I(ℓ2):
d = max {0, I(ℓ1) − I(ℓ2)}
Example: I(ℓ1) = 0.3, I(ℓ2) = 0.3 ⇒ d = 0; I(ℓ1) = 0.9, I(ℓ2) = 0.3 ⇒ d = 0.6
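These definitions translate directly into code; a short sketch, with the slide's example values filled in:

```python
# Lukasiewicz relaxations used by PSL, written out directly from the slide.
def l_and(a, b):                    # I(l1 AND l2)
    return max(0.0, a + b - 1.0)

def l_or(a, b):                     # I(l1 OR l2)
    return min(1.0, a + b)

def l_not(a):                       # I(NOT l1)
    return 1.0 - a

def dist_to_satisfaction(a, b):     # for the rule l1 -> l2
    return max(0.0, a - b)

print(dist_to_satisfaction(0.3, 0.3))            # 0.0: rule satisfied
print(round(dist_to_satisfaction(0.9, 0.3), 2))  # 0.6: rule violated by 0.6
```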

29 PSL Probability Distribution
PDF over continuous truth assignments:
p(I) = (1/Z) exp( − Σ_{r ∈ R} λ_r · d_r(I) )
where I is a possible continuous truth assignment, R is the set of all rules, λ_r is the weight of rule r, d_r(I) is its distance to satisfaction, and Z is the normalization constant.

30 PSL Inference
MPE inference (Most Probable Explanation): find the interpretation that maximizes the PDF, i.e., that minimizes the weighted sum of distances to satisfaction.
Since each distance to satisfaction is a linear function, this is a linear optimization problem.
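A toy sketch of MPE inference as a linear program using scipy, with two illustrative single-literal rules; the rules, weights, and evidence truth values are invented for the example:

```python
from scipy.optimize import linprog

# Toy PSL-style MPE inference as a linear program. Illustrative rules:
#   weight 2: evidence1 -> q       with I(evidence1) = 0.9
#   weight 1: evidence2 -> NOT q   with I(evidence2) = 0.8
# Unknowns: the truth value q and the distances d1, d2 to satisfaction.
# Variable order: [q, d1, d2]; minimize 2*d1 + 1*d2.
c = [0, 2, 1]
A_ub = [[-1, -1, 0],   # d1 >= 0.9 - q        (rule 1)
        [ 1,  0, -1]]  # d2 >= 0.8 - (1 - q)  (rule 2, via I(NOT q) = 1 - q)
b_ub = [-0.9, 0.2]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * 3)
print(res.x)    # the optimum sides with the higher-weighted rule: q ~ 0.9
print(res.fun)  # total weighted distance to satisfaction ~ 0.7
```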

31 Semantic Representations
Formal semantics: uses first-order logic; deep but brittle.
Distributional semantics: statistical; robust but shallow.
Combining both: represent meaning using a probabilistic logic (Markov Logic Networks or Probabilistic Soft Logic) and generate soft inference rules from distributional semantics.

32 System Architecture [Garrette et al. 2011, 2012; Beltagy et al., 2013]
(Pipeline figure: Sent1 and Sent2 are mapped by BOXER to logical forms LF1 and LF2; a vector space feeds the distributional rule constructor, which produces a rule base; MLN/PSL inference combines these to produce the result.)
BOXER [Bos et al., 2004]: maps sentences to logical form.
Distributional rule constructor: generates relevant soft inference rules based on distributional similarity.
MLN/PSL: probabilistic inference.
Result: degree of entailment or semantic similarity score (depending on the task).

33 Recognizing Textual Entailment (RTE)
Premise: "A man is cutting a pickle"
∃x,y,z [man(x) ∧ cut(y) ∧ agent(y, x) ∧ pickle(z) ∧ patient(y, z)]
Hypothesis: "A guy is slicing a cucumber"
∃x,y,z [guy(x) ∧ slice(y) ∧ agent(y, x) ∧ cucumber(z) ∧ patient(y, z)]
Inference: Pr(Hypothesis | Premise) gives the degree of entailment.

34 Distributional Lexical Rules
For all pairs of words (a, b) where a is in S1 and b is in S2, add a soft rule relating the two:
∀x. a(x) → b(x) | wt(a, b), where wt(a, b) = f( cos(a, b) )
Premise: "A man is cutting pickles"  Hypothesis: "A guy is slicing cucumber"
∀x. man(x) → guy(x) | wt(man, guy)
∀x. cut(x) → slice(x) | wt(cut, slice)
∀x. pickle(x) → cucumber(x) | wt(pickle, cucumber)
∀x. man(x) → cucumber(x) | wt(man, cucumber)
∀x. pickle(x) → guy(x) | wt(pickle, guy)
… and so on for every remaining word pair.
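A sketch of the distributional rule constructor for this example; the vectors and the weight mapping f are illustrative stand-ins, not the system's actual resources:

```python
import numpy as np

# Sketch of the distributional rule constructor: for every (premise word,
# hypothesis word) pair, emit a soft rule weighted by cosine similarity.
# The vectors and the weight mapping f below are illustrative stand-ins.
vectors = {
    "man": np.array([0.9, 0.2, 0.1]),    "guy": np.array([0.85, 0.25, 0.1]),
    "cut": np.array([0.1, 0.9, 0.2]),    "slice": np.array([0.15, 0.85, 0.2]),
    "pickle": np.array([0.2, 0.1, 0.9]), "cucumber": np.array([0.25, 0.1, 0.85]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def lexical_rules(premise_words, hypothesis_words, f=lambda sim: max(sim, 0.0)):
    rules = []
    for a in premise_words:
        for b in hypothesis_words:
            wt = f(cosine(vectors[a], vectors[b]))
            rules.append((f"forall x. {a}(x) -> {b}(x)", wt))
    return rules

for rule, wt in lexical_rules(["man", "cut", "pickle"], ["guy", "slice", "cucumber"]):
    print(f"{rule} | {wt:.2f}")
```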

35 Distributional Phrase Rules
Premise: "A boy is playing"  Hypothesis: "A little kid is playing"
Need rules for phrases: ∀x. boy(x) → little(x) ∧ kid(x) | wt(boy, "little kid")
Compute vectors for phrases using vector addition [Mitchell & Lapata, 2010]: "little kid" = little + kid
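A small sketch of the additive composition step, with toy vectors; the weight of the phrase rule would come from the cosine between boy and the composed little + kid vector:

```python
import numpy as np

# Additive phrase composition [Mitchell & Lapata, 2010]: the vector for
# "little kid" is little + kid. Toy vectors for illustration only.
little = np.array([0.7, 0.1, 0.3])
kid    = np.array([0.2, 0.8, 0.1])
boy    = np.array([0.3, 0.9, 0.1])

little_kid = little + kid   # phrase vector

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Weight for the phrase rule  forall x. boy(x) -> little(x) AND kid(x)
print(cosine(boy, little_kid))
```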

36 Paraphrase Rules
Generate inference rules from pre-compiled paraphrase collections like Berant et al. [2012], e.g., "X solves Y" ⇒ "X finds a solution to Y" | w

37 Evaluation (RTE using MLNs)
Datasets: RTE-1, RTE-2, RTE-3; each dataset has 800 training pairs and 800 testing pairs.
Use multiple parses to reduce the impact of misparses.

38 Evaluation (RTE using MLNs)
Logic-only baseline: Bos & Markert [2005]; KB is WordNet.
(Table: accuracy on RTE-1, RTE-2, and RTE-3 for Bos & Markert [2005] (RTE-1 only), MLN, MLN-multi-parse, and MLN-paraphrases; values not shown.)

39 Enhancing MLN inference for the RTE task
Query Formula (QF): an inference algorithm that computes probabilities of complete formulas, not just individual ground atoms. Pr(Q | R) is the ratio between the Z of the MLN with and without Q added as a hard clause; SampleSearch is used to estimate Z.
Modified Closed-World assumption (MCW): removes unnecessary ground atoms from the ground network. All ground atoms are False by default unless they are reachable from the evidence.
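A toy sketch of the reachability idea behind the modified closed-world assumption, simplified to single-literal rules that preserve their arguments; this is not the exact algorithm from the talk:

```python
from collections import deque

def reachable_ground_atoms(evidence, rules):
    """Toy version of the modified closed-world assumption: keep only ground
    atoms reachable from the evidence; everything else defaults to False and
    is dropped from the ground network.
    evidence: set of ground atoms such as ('man', ('X1',)).
    rules: (body_pred, head_pred) pairs for single-literal rules like
           man(x) -> guy(x), which pass their arguments through unchanged."""
    reachable = set(evidence)
    queue = deque(evidence)
    while queue:
        pred, args = queue.popleft()
        for body_pred, head_pred in rules:
            if pred == body_pred and (head_pred, args) not in reachable:
                reachable.add((head_pred, args))
                queue.append((head_pred, args))
    return reachable

evidence = {("man", ("X1",)), ("cut", ("E1",))}
rules = [("man", "guy"), ("cut", "slice")]
print(reachable_ground_atoms(evidence, rules))
# Only guy(X1) and slice(E1) are added; unrelated atoms never enter the network.
```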

40 Evaluation (RTE using enhanced MLN inference)
Using the SICK dataset from SemEval 2014.
System       Accuracy   CPU time       Timeouts (30 min)
mln          –          2 min 27 sec   96%
mln+qf       69%        1 min 51 sec   30%
mln+mcw      66%        10 sec         –
mln+qf+mcw   72%        7 sec          –

41 Semantic Textual Similarity (STS)
Rate the semantic similarity of two sentences on a 0 to 5 scale. Gold standards are averaged over multiple human judgments. Evaluate by measuring correlation to human ratings.
S1: "A man is slicing a cucumber"  S2: "A guy is cutting a cucumber"  score: 5
S1: "A man is slicing a cucumber"  S2: "A guy is cutting a zucchini"  score: 4
S1: "A man is slicing a cucumber"  S2: "A woman is cooking a zucchini"  score: 3
S1: "A man is slicing a cucumber"  S2: "A monkey is riding a bicycle"  score: 1

42 Softening Conjunction for STS
Premise: "A man is driving"  ∃x,y. man(x) ∧ drive(y) ∧ agent(y, x)
Hypothesis: "A man is driving a bus"  ∃x,y,z. man(x) ∧ drive(y) ∧ agent(y, x) ∧ bus(z) ∧ patient(y, z)
Break the hypothesis into "mini-clauses", then combine their evidence using an "averaging combiner" [Natarajan et al., 2010]. The hypothesis becomes:
∀x,y,z. man(x) ∧ agent(y, x) → result()
∀x,y,z. drive(y) ∧ agent(y, x) → result()
∀x,y,z. drive(y) ∧ patient(y, z) → result()
∀x,y,z. bus(z) ∧ patient(y, z) → result()
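A small numerical sketch of the averaging combiner on this example; the per-mini-clause evidence values are illustrative:

```python
# Numerical sketch of the averaging combiner: each mini-clause contributes
# evidence for result(), and the contributions are averaged rather than
# conjoined. The per-clause truth values below are illustrative.
mini_clause_evidence = {
    "man(x) & agent(y, x)":     1.0,  # supported by the premise
    "drive(y) & agent(y, x)":   1.0,  # supported by the premise
    "drive(y) & patient(y, z)": 0.0,  # the premise mentions no patient
    "bus(z) & patient(y, z)":   0.0,  # the premise says nothing about a bus
}

def averaging_combiner(evidence):
    return sum(evidence.values()) / len(evidence)

print(averaging_combiner(mini_clause_evidence))  # 0.5: graded partial match
```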

43 Evaluation (STS using MLN)
Microsoft video description corpus (SemEval 2012): short video descriptions.
System                                                 Pearson r
Our system with no distributional rules [logic only]   0.52
Our system with lexical rules                          –
Our system with lexical and phrase rules               –

44 PSL: Probabilistic Soft Logic [Kimmig, Bach, Broecheler, Huang & Getoor, NIPS 2012]
MLN inference is very slow. PSL is a probabilistic logic framework designed with efficient inference in mind: inference is a linear program.

45 STS using PSL - Conjunction
The Łukasiewicz relaxation of AND is very restrictive: I(ℓ1 ∧ ℓ2) = max {0, I(ℓ1) + I(ℓ2) – 1}
Replace AND with a weighted average: I(ℓ1 ∧ … ∧ ℓn) = w_avg( I(ℓ1), …, I(ℓn) )
Learning the weights is future work; for now, they are equal.
Inference: the weighted average is a linear function, so there is no change in the optimization problem.
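A quick comparison of the two conjunction operators on toy truth values, showing why the weighted average is less harsh:

```python
# Comparing Lukasiewicz AND with the weighted average on toy truth values.
def lukasiewicz_and(values):
    truth = 1.0
    for v in values:
        truth = max(0.0, truth + v - 1.0)
    return truth

def weighted_average(values, weights=None):
    weights = weights or [1.0] * len(values)  # equal weights, as on the slide
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

vals = [0.9, 0.8, 0.7]
print(round(lukasiewicz_and(vals), 2))   # 0.4 -- penalties accumulate harshly
print(round(weighted_average(vals), 2))  # 0.8 -- graded, better for similarity
```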

46 Evaluation (STS using PSL)
Datasets: msr-vid (Microsoft video description corpus, SemEval 2012; short video description sentences), msr-par (Microsoft paraphrase corpus, SemEval 2012; long news sentences), SICK (SemEval 2014).
(Table: Pearson correlations on msr-vid, msr-par, and SICK for vec-add (dist. only), vec-mul (dist. only), MLN (logic + dist.), PSL-no-DIR (logic only), and PSL (logic + dist.); values not shown.)

47 Evaluation (STS using PSL)
                        msr-vid      msr-par      SICK
PSL time/pair           – s          30 s         10 s
MLN time/pair           1 m 31 s     11 m 49 s    4 m 24 s
MLN timeouts (10 min)   9%           97%          36%

48 Future Work
Improve inference efficiency for MLNs by exploiting the latest work on "lifted inference".
Improve parsing into logical form using the latest improvements in Boxer and semantic parsing.
Improve extraction and distributional representation of phrases.
Use asymmetric distributional similarity.
Add additional knowledge sources to the MLN: WordNet, PPDB.

49 Conclusions
Traditional logical and distributional approaches to natural language semantics each have significant limitations and weaknesses.
These competing approaches can be combined using a probabilistic logic (e.g., MLN, PSL) as a uniform semantic representation.
We have promising initial results using MLNs for RTE and PSL for STS.

