1
Two Aspects of the Problem of Natural Language Inference
Hi, I’m Bill MacCartney, and I’m going to be talking about two aspects of the problem of natural language inference. But first I want to thank you for giving me the opportunity to speak to you today. It’s an honor and a privilege, and I hope you’ll find it interesting. [>] Bill MacCartney NLP Group Stanford University 8 October 2008
2
Two talks for the price of one!
Both concern the problem of natural language inference (NLI) Modeling semantic containment and exclusion in NLI Presented at Coling-08, won best paper award A computational model of natural logic for NLI Doesn’t solve all NLI problems, but handles an interesting subset Depends on alignments from other sources A phrase-based model of alignment for NLI To be presented at EMNLP-08 Addresses the problem of alignment for NLI & relates it to MT Made possible by annotated data produced here at MSR Today’s talk is really a two-part talk. Each part concerns a different aspect of the problem of natural language inference, or NLI, which I’ll define in a moment. …
3
Bill MacCartney and Christopher D. Manning
Modeling Semantic Containment and Exclusion in Natural Language Inference The first part of the talk is about modeling semantic containment and exclusion for natural language inference. This is joint work with my advisor, Chris Manning. [>] Bill MacCartney and Christopher D. Manning NLP Group Stanford University 8 October 2008
4
Natural language inference (NLI)
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion Natural language inference (NLI) Aka recognizing textual entailment (RTE) Does premise P justify an inference to hypothesis H? An informal, intuitive notion of inference: not strict logic Emphasis on variability of linguistic expression P Every firm polled saw costs grow more than expected, even after adjusting for inflation. H Every big company in the poll reported cost increases. yes (with “Some” in place of “Every”: no) Natural language inference -- also known as recognizing textual entailment -- is the problem of determining whether a premise P justifies an inference to a hypothesis H. This is an informal, intuitive notion of inference: the emphasis is on short, local inference steps and variability of linguistic expression, rather than long chains of formal reasoning. Here’s an example. The premise is … while the hypothesis is … and this is a valid inference. I want to make two observations about this example: [!] First, if the quantifier were “some” instead of “every”, the inference would NOT be valid, because it could be that only SMALL firms saw costs grow; And second, it would be difficult or impossible to translate these sentences fully and accurately into formal logic. The importance of these facts will become clear in a moment. Natural language inference is necessary to the ultimate goal of full natural language understanding, and can also enable more immediate applications, including semantic search, question answering, and others. Necessary to goal of natural language understanding (NLU) Can also enable semantic search, question answering, …
5
NLI: a spectrum of approaches
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion NLI: a spectrum of approaches [Diagram: a spectrum of approaches. At the robust-but-shallow end: lexical/semantic overlap (Jijkoun & de Rijke 2005), patterned relation extraction (Romano et al. 2006), semantic graph matching (Hickl et al. 2006, MacCartney et al. 2006). At the deep-but-brittle end: FOL & theorem proving (Bos & Markert 2006). Natural logic (this work) sits in between. Solution?] Work on natural language inference has explored a broad spectrum of approaches. [!] At one end of the spectrum are approaches based on lexical or semantic overlap, pattern-based relation extraction, or approximate matching of predicate-argument structure. Such approaches are robust and broadly effective, but imprecise, and are easily confounded by inferences involving negation, quantifiers, and other phenomena -- including the example on the previous slide. [!] At the other end of the spectrum, we have approaches which rely on translation to FOL and theorem proving. Such approaches have the power and precision we’re looking for, but they tend to founder on the many well-known difficulties involved in accurately translating natural language to FOL. [!] In this work, we explore a different point on the spectrum, by developing a computational model of natural logic, which I’ll define in a moment. [>] Problem (for the deep approaches): hard to translate NL to FOL: idioms, anaphora, ellipsis, intensionality, tense, aspect, vagueness, modals, indexicals, reciprocals, propositional attitudes, scope ambiguities, anaphoric adjectives, non-intersective adjectives, temporal & causal relations, unselective quantifiers, adverbs of quantification, donkey sentences, generic determiners, comparatives, phrasal verbs, … Problem (for the shallow approaches): imprecise; easily confounded by negation, quantifiers, conditionals, factive & implicative verbs, etc.
6
Outline Introduction A Theory of Natural Logic The NatLog System
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion Outline Introduction A Theory of Natural Logic The NatLog System Experiments with FraCaS Experiments with RTE Conclusion Here’s the outline of the talk. First I’ll talk about the theoretical foundations of natural logic. Then I’ll introduce our computational model of natural logic, the NatLog system. Then I’ll describe experiments with two different data sets: the FraCaS test suite, and the RTE data. And then I’ll conclude. [>]
7
What is natural logic? (≠ natural deduction)
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion What is natural logic? (≠ natural deduction) Characterizes valid patterns of inference via surface forms precise, yet sidesteps difficulties of translating to FOL A long history traditional logic: Aristotle’s syllogisms, scholastics, Leibniz, … modern natural logic begins with Lakoff (1970) van Benthem & Sánchez Valencia: monotonicity calculus Nairn et al. (2006): an account of implicatives & factives We introduce a new theory of natural logic extends monotonicity calculus to account for negation & exclusion incorporates elements of Nairn et al.’s model of implicatives So what is natural logic? The term was introduced by Lakoff, who defined natural logic as a logic whose vehicle of inference is natural language. That is, it characterizes valid patterns of reasoning in terms of surface forms. It thus permits us to do precise reasoning, while sidestepping the myriad difficulties of full semantic interpretation. Natural logic has a very long history, stretching back to the syllogisms of Aristotle. It was revived in the 1980s as the monotonicity calculus of van Benthem & Sanchez Valencia. Also, the account of implicatives & factives developed by Nairn et al. at PARC arguably belongs to the natural logic tradition, though it wasn’t presented as such. In this work, we present a new theory of natural logic which extends the monotonicity calculus to account for negation & exclusion, and also incorporates elements of Nairn’s model of implicatives. Over the next few slides, I’ll sketch this model, but at a very high level. For more details, please see the paper. [>]
8
7 basic entailment relations
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion 7 basic entailment relations (each illustrated on the slide with a Venn diagram):
P = Q (equivalence): couch = sofa
P ⊏ Q (forward entailment, strict): crow ⊏ bird
P ⊐ Q (reverse entailment): European ⊐ French
P ^ Q (negation, exhaustive exclusion): human ^ nonhuman
P | Q (alternation, non-exhaustive exclusion): cat | dog
P _ Q (cover, exhaustive non-exclusion): animal _ nonhuman
P # Q (independence): hungry # hippo
First, we propose an inventory of 7 mutually exclusive basic entailment relations. This slide is kind of important, because these relations -- and the symbols I’ve chosen to represent them -- will reappear throughout the rest of the talk. The relations are defined by analogy with set relations, and they include representations of both semantic containment and semantic exclusion. The seven relations are first: equivalence, forward entailment, and reverse entailment -- these are pretty self-explanatory, and these are the containment relations; then: negation, …, alternation, …, and cover, …; and finally, independence, which covers all other cases. It’s important to note that these relations are defined for expressions of every semantic type: not merely sentences, but also common nouns, adjectives, transitive and intransitive verbs, temporal and locative modifiers, quantifiers, and so on. Relations are defined for all semantic types: tiny ⊏ small, hover ⊏ fly, kick ⊏ strike, this morning ⊏ today, in Beijing ⊏ in China, everyone ⊏ someone, all ⊏ most ⊏ some
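To make the set-theoretic analogy concrete, here is a small illustrative sketch (not part of the NatLog system): expressions are modeled by their extensions over a toy universe, and the seven relations fall out of containment, disjointness, and exhaustiveness. The universe and extensions below are invented for illustration.

```python
# Illustrative sketch: the seven basic entailment relations, viewed
# set-theoretically. An expression is modeled by its extension (a set of
# entities) within a finite universe.

def basic_relation(x, y, universe):
    """Return the basic entailment relation between extensions x and y."""
    x, y, universe = set(x), set(y), set(universe)
    disjoint = not (x & y)              # no overlap
    exhaustive = (x | y) == universe    # together they cover the universe
    if x == y:
        return "="      # equivalence
    if x < y:
        return "<"      # forward entailment (strict containment), i.e. ⊏
    if x > y:
        return ">"      # reverse entailment, i.e. ⊐
    if disjoint and exhaustive:
        return "^"      # negation (exhaustive exclusion)
    if disjoint:
        return "|"      # alternation (non-exhaustive exclusion)
    if exhaustive:
        return "_"      # cover (exhaustive non-exclusion)
    return "#"          # independence

U = {"tabby", "poodle", "sparrow", "stone"}
print(basic_relation({"tabby"}, {"poodle"}, U))                      # '|' (cat | dog, schematically)
print(basic_relation({"tabby", "poodle", "sparrow"}, {"stone"}, U))  # '^' (animate ^ inanimate)
print(basic_relation({"sparrow"}, {"sparrow", "tabby"}, U))          # '<' (crow ⊏ bird, schematically)
```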
9
Entailment & semantic composition
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion Entailment & semantic composition Ordinarily, semantic composition preserves entailment relations: eat pork ⊏ eat meat, big bird | big fish But many semantic functions behave differently: tango ⊏ dance refuse to tango ⊐ refuse to dance French | German not French _ not German We categorize functions by how they project entailment a generalization of monotonicity classes, implication signatures e.g., not has projectivity {=:=, ⊏:⊐, ⊐:⊏, ^:^, |:_, _:|, #:#} e.g., refuse has projectivity {=:=, ⊏:⊐, ⊐:⊏, ^:|, |:#, _:#, #:#} The next question is, how are entailment relations affected by semantic composition? How do the entailments of a compound expression depend on the entailments of its parts? In the most common case, semantic composition preserves entailment relations. So “eat pork” entails “eat meat”, and “big bird” excludes “big fish”. But many semantic functions behave differently. For example, “refuse” projects forward entailment as reverse entailment, so that “refuse to tango” is entailed by “refuse to dance”. And “not” projects exclusion as exhaustion, so that “not French” stands in the cover relation to “not German”. In our model, we categorize semantic functions according to how they project each of the seven basic entailment relations. This is a generalization of both the three monotonicity classes of the monotonicity calculus, and the nine implication signatures of Nairn et al. For example, “not” and “refuse” are alike in projecting equivalence as equivalence and independence as independence, and in swapping forward and reverse entailments. But whereas “not” projects exclusion as exhaustion, “refuse” projects it as independence. Thus “refuse to tango” and “refuse to waltz” are independent. [>]
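A sketch of how projectivity could be encoded as lookup tables. The maps for “not” and “refuse” are exactly the ones given on the slide; ASCII symbols < and > stand in for ⊏ and ⊐, and the default row (an upward-monotone, exclusion-preserving context) is an assumption for illustration.

```python
# Sketch of projectivity as lookup tables. Relation symbols:
# '=' equivalence, '<' forward, '>' reverse, '^' negation,
# '|' alternation, '_' cover, '#' independence.

PROJECTIVITY = {
    "not":     {"=": "=", "<": ">", ">": "<", "^": "^", "|": "_", "_": "|", "#": "#"},
    "refuse":  {"=": "=", "<": ">", ">": "<", "^": "|", "|": "#", "_": "#", "#": "#"},
    "DEFAULT": {r: r for r in "=<>^|_#"},   # ordinary composition preserves relations
}

def project(functor, relation):
    """Entailment relation between f(x) and f(y), given the relation between x and y."""
    return PROJECTIVITY.get(functor, PROJECTIVITY["DEFAULT"])[relation]

print(project("refuse", "<"))  # '>' : tango ⊏ dance, so refuse to tango ⊐ refuse to dance
print(project("not", "|"))     # '_' : French | German, so not French _ not German
```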
10
Projecting entailment relations upward
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion Projecting entailment relations upward If two compound expressions differ by a single atom, their entailment relation can be determined compositionally Assume idealized semantic composition trees Propagate entailment relation between atoms upward, according to projectivity class of each node on path to root [Figure: idealized semantic composition trees for “Nobody can enter without a shirt” and “Nobody can enter without clothes”, with the lexical relation shirt ⊏ clothes propagated node by node up to the roots.] A typology of projectivity allows us to determine the entailments of a compound expression compositionally, by projecting lexical entailment relations upward through a semantic composition tree. Consider this example: if nobody can enter without a shirt, then it follows that nobody can enter without clothes. To explain this compositionally, assume that we have idealized semantic composition trees, representing the compositional structure of the semantics of these sentences. We begin from the lexical entailment relation between “shirt” and “clothes”: shirt forward-entails clothes But “without” is downward-monotone, so “without a shirt” is entailed by “without clothes”. This modifier is applied to “enter”, and then is modified by “can”, which is upward-monotone. But “nobody” is downward-monotone, so we get another reversal, and we find a forward entailment relation between the sentences, as expected. [>]
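A minimal sketch of the upward projection step, simplified to monotonicity only (so it only models the flipping of ⊏ and ⊐, not the full projectivity of exclusion relations). The chain of operators on the path from the edited word to the root is assumed to be known.

```python
# Simplified sketch: project a containment relation upward through the
# operators on the path from the edited atom to the root.

FLIP = {"<": ">", ">": "<"}   # downward-monotone contexts swap ⊏ and ⊐

def project_upward(lexical_relation, path_to_root):
    """path_to_root: monotonicity marks ('up'/'down') of each operator, bottom to top."""
    rel = lexical_relation
    for monotonicity in path_to_root:
        if monotonicity == "down":
            rel = FLIP.get(rel, rel)   # '=' and '#' are unaffected
    return rel

# "Nobody can enter without a shirt" vs. "... without clothes":
# shirt < clothes, projected through without (down), can (up), nobody (down).
print(project_upward("<", ["down", "up", "down"]))   # '<' : forward entailment between the sentences
```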
11
A (weak) inference procedure
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion A (weak) inference procedure Find sequence of edits connecting P and H Insertions, deletions, substitutions, … Determine lexical entailment relation for each edit Substitutions: depends on meaning of substituends: cat | dog Deletions: ⊏ by default: red socks ⊏ socks But some deletions are special: not ill ^ ill, refuse to go | go Insertions are symmetric to deletions: ⊐ by default Project up to find entailment relation across each edit Join entailment relations across sequence of edits à la Tarski’s relation algebra Now we come to the third element of the theory, which builds on the preceding to prove a hypothesis from a premise. Suppose we can find a sequence of atomic edits which transforms the premise into the hypothesis. These could be insertions, deletions, substitutions, or more complex edit operations. We begin by determining a lexical entailment relation for each atomic edit. For substitutions, this depends on the relation between the meanings of the substituends. Deletions ordinarily generate the forward entailment relation, but some lexical items have special behavior. For example, deleting “not” generates the negation relation. Insertions are symmetric to deletions. Next, we project each lexical entailment relation upward through a semantic composition tree, as on the previous slide, to determine the entailment relation across each atomic edit. Finally, we join these atomic entailment relations across the sequence of edits, as in Tarskian relation algebra, to obtain our final answer. [>]
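A sketch of the joining step. Only the handful of join-table entries needed for the talk's examples are filled in (the full table is in the paper); anything else conservatively falls back to independence, which is a simplification. The usage line joins the atomic relations of the Jimmy Dean running example introduced a couple of slides later, as read off the step-4 and step-5 walkthroughs.

```python
# Sketch of relation joining (à la relation algebra), with a partial table.

JOIN = {
    ("<", "<"): "<",   # forward ⋈ forward = forward
    (">", ">"): ">",   # reverse ⋈ reverse = reverse
    ("|", "^"): "<",   # e.g. fish | human, human ^ nonhuman  =>  fish ⊏ nonhuman
}

def join(r1, r2):
    if r1 == "=":
        return r2
    if r2 == "=":
        return r1
    return JOIN.get((r1, r2), "#")   # simplification: unknown combinations become independence

def join_all(relations):
    """Join atomic entailment relations across a sequence of edits, left to right."""
    result = "="
    for r in relations:
        result = join(result, r)
    return result

# Atomic relations for the eight edits of the running example (as narrated later):
print(join_all(["=", "|", "=", "^", "<", "=", "<", "<"]))   # '<' : P forward-entails H
```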
12
The NatLog system
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion The NatLog system [Diagram: pipeline from an NLI problem through five stages: 1. linguistic analysis, 2. alignment, 3. lexical entailment classification, 4. entailment projection, 5. entailment joining, yielding a prediction.] OK, let’s switch gears and talk about what we built. The NatLog system is a computational model of natural logic. It consists of five stages. In the following slides, I’ll talk about each of the five stages in turn. [>]
13
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion Running example P Jimmy Dean refused to move without blue jeans. H James Dean didn’t dance without pants yes OK, the example is contrived, but it compactly exhibits containment, exclusion, and implicativity But first -- to illustrate the operation of the system, I’m going to use a running example, shown here. The example is quite contrived, but it compactly exhibits the three phenomena I’m trying to model: containment, exclusion, and implicativity. So the premise is, “Jimmy Dean refused to move without blue jeans” and the hypothesis is, “James Dean didn’t dance without pants” and this is a valid inference. [>]
14
Step 1: Linguistic analysis
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion Step 1: Linguistic analysis Tokenize & parse input sentences (future: & NER & coref & …) Identify items w/ special projectivity & determine scope Problem: PTB-style parse tree ≠ semantic structure! Solution: specify scope in PTB trees using Tregex [Levy & Andrew 06] Example category definition: category: –/o implicatives; examples: refuse, forbid, prohibit, …; scope: S complement; pattern: __ > (/VB.*/ > VP $. S=arg); projectivity: {=:=, ⊏:⊐, ⊐:⊏, ^:|, |:#, _:#, #:#} [Figure: PTB-style parse tree for “Jimmy Dean refused to move without blue jeans”, with the scopes and monotonicity marks of “refused” and “without” indicated.] In the first stage, we do linguistic pre-processing. We begin by tokenizing and parsing the input sentences, using the Stanford parser, a broad-coverage statistical parser trained on the Penn Treebank. But the most important task at this stage is to identify any semantic functions with non-default projectivity, and to compute their scope, in order to determine the effective projectivity at each token. What makes this tricky is that the phrase structure trees produced by the parser may not correspond exactly to the semantic structure of the sentence. If we had idealized semantic composition trees [!], then determining effective projectivity would be easy. Since we don’t, we use a somewhat awkward workaround. [!] We define categories of items with special projectivity, and for each category we specify its default scope in phrase structure trees using a tree-pattern language called Tregex, which is similar to Tgrep. This enables us to identify the constituents over which the projective properties should be applied, and thereby to compute the final effective projectivity at each token. [>]
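As a simplified, hypothetical stand-in for the Tregex-based scope machinery, the sketch below just marks the effective monotonicity of each token, assuming the operators and their scopes have already been identified as flat token spans (in the real system, scopes come from tree patterns over the parse).

```python
# Simplified sketch of the scope-marking step: given operators with known
# monotonicity and scope (token spans), compute the effective monotonicity
# at each token. Nested downward operators cancel out.

def effective_monotonicity(tokens, operators):
    """operators: list of (token_index, monotonicity, scope_start, scope_end)."""
    marks = ["up"] * len(tokens)                 # default: upward monotone
    for _, mono, start, end in operators:
        if mono == "down":
            for i in range(start, end):
                marks[i] = "down" if marks[i] == "up" else "up"
    return marks

tokens = "nobody can enter without a shirt".split()
operators = [(0, "down", 1, 6),    # 'nobody' is downward monotone over the rest
             (3, "down", 4, 6)]    # 'without' is downward monotone over its object NP
print(list(zip(tokens, effective_monotonicity(tokens, operators))))
# 'shirt' ends up in an effectively upward-monotone context (the two inversions cancel)
```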
15
Step 2: Alignment Alignment as sequence of atomic phrase edits
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion Step 2: Alignment [Table: alignment of P “Jimmy Dean refused to move without blue jeans” with H “James Dean did n’t dance without pants” as a sequence of eight atomic edits, indexed 1–8, each of type SUB, DEL, INS, or MAT.] Alignment as sequence of atomic phrase edits Ordering of edits defines path through intermediate forms Need not correspond to sentence order Decomposes problem into atomic inference problems We haven’t (yet) invested much effort here Experimental results use alignments from other sources In the second stage, we establish an alignment between the premise and hypothesis, represented by a sequence of atomic edits over spans of word tokens. I’ve shown an alignment for our running example here. We use four types of edit: deletion, insertion, substitution, and match. The edits are ordered, and this ordering defines a path from premise to hypothesis through intermediate forms -- but the ordering need not correspond to sentence order, as it does in this example. Thus, the alignment effectively decomposes the inference problem into a sequence of atomic inference problems, one for each atomic edit. Alignment will be the subject of the second part of the talk. [>]
16
Step 3: Lexical entailment classification
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion Step 3: Lexical entailment classification Goal: predict entailment relation for each edit, based solely on lexical features, independent of context Approach: use lexical resources & machine learning Feature representation: WordNet features: synonymy (=), hyponymy (⊏/⊐), antonymy (|) Other relatedness features: Jiang-Conrath (WN-based), NomBank Fallback: string similarity (based on Levenshtein edit distance) Also lexical category, quantifier category, implication signature Decision tree classifier Trained on 2,449 hand-annotated lexical entailment problems E.g., SUB(gun, weapon): ⊏, SUB(big, small): |, DEL(often): ⊏ The next stage is the heart of the system: lexical entailment classification. Here we try to predict an entailment relation for each atomic edit, based solely on the features of the lexical items involved, independent of the surrounding context, such as falling under a downward-monotone operator. We do this by exploiting available resources on lexical semantics, and applying machine learning. Our feature representation includes: semantic relatedness information based on WordNet, NomBank, and other lexical resources; string and lemma similarity scores; and information about lexical categories, including special-purpose categories for quantifiers and implicatives. We use a decision tree classifier, trained on about 2,500 hand-annotated lexical entailment problems, like the examples shown here. [>]
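A toy sketch of the classifier setup: each edit is mapped to a small numeric feature vector and a decision tree predicts a basic entailment relation. The features, the training examples, and the use of scikit-learn here are illustrative assumptions, not the system's actual feature set or implementation.

```python
# Toy sketch (invented data) of lexical entailment classification with a decision tree.

from sklearn.tree import DecisionTreeClassifier

# features: [is_deletion, wn_hyponym, wn_hypernym, wn_antonym, string_sim]
X = [
    [0, 1, 0, 0, 0.2],   # SUB(gun, weapon): hyponym substitution
    [0, 0, 1, 0, 0.2],   # SUB(weapon, gun): hypernym substitution
    [0, 0, 0, 1, 0.3],   # SUB(big, small): exclusion
    [1, 0, 0, 0, 0.0],   # DEL(often): generic modifier deletion
    [0, 0, 0, 0, 0.9],   # SUB(Jimmy Dean, James Dean): high string similarity
]
y = ["<", ">", "|", "<", "="]

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[0, 1, 0, 0, 0.15]]))   # another hyponym substitution: likely '<'
```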
17
Step 3: Lexical entailment classification
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion Step 3: Lexical entailment classification [Table: the running-example alignment, with two added rows: the lexical features generated for each edit (e.g., strsim = 0.67, implic: –/o, cat:aux, cat:neg, hypo, hyper) and the lexical entailment relation predicted for each edit; the per-edit predictions are spelled out in the notes below.] So, back to the running example. I’ve added rows which show features generated for each edit, and the lexical entailment relation predicted from these features. The 1st edit is a substitution; string similarity is high, so we predict equivalence. In the 2nd edit, we delete an implicative, “refuse”; the model knows that deleting implicatives in this category generates the alternation relation. The 3rd edit inserts an auxiliary verb; since auxiliaries are more or less semantically vacuous, the model predicts equivalence. The 4th edit inserts a negation; this generates the negation relation. The 5th edit is a substitution; WordNet tells us that these are hyponyms, so we predict reverse entailment. The 6th edit is a match: equivalence. The 7th edit is the deletion of a generic modifier; by default, this generates forward entailment. Finally, the 8th edit is a hypernym substitution: forward entailment. [>]
18
Step 4: Entailment projection
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion Step 4: Entailment projection [Table: the running-example alignment, now with rows for the effective projectivity (monotonicity) at the locus of each edit and the resulting atomic entailment relation across each edit; the inserted negation and “without” each trigger an inversion.] The fourth stage is entailment projection. I covered this earlier: it means projecting lexical entailment relations upward by taking account of the projective properties of the surrounding context. I’m going to simplify things a bit here by only considering upward and downward monotonicity. I’ve added two new rows. The first row shows the effective monotonicity at the locus of each edit. Everything is upward monotone until we insert the negation, after which the next two edits occur in a downward-monotone context. But “without” creates another inversion, so the last two edits occur in an upward-monotone context. The last row shows [!] how the lexical entailment relations are projected into atomic entailment relations; that is, entailment relations across each atomic edit. The only interesting case is [!] HERE, where a reverse entailment is changed into a forward entailment by a downward-monotone context. [>]
19
Step 5: Entailment joining
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion Step 5: Entailment joining [Table: the running-example alignment, with a final row showing the cumulative join of the atomic entailment relations from left to right; the rightmost cell is the final answer.] For example: fish | human, and human ^ nonhuman, so fish ⊏ nonhuman. The final stage is entailment joining, in which we combine atomic entailment relations one-by-one to obtain our final answer. [!] We start at the left with equivalence, and the first couple of joins are quite intuitive: [!] equivalence joined with alternation yields alternation, and [!] alternation joined with equivalence yields alternation again. [!] The next one is more interesting: alternation joined with negation yields forward entailment. That may not be immediately obvious, but it makes sense if you think about it for a bit. [!] For example, “fish” alternates with “human”, and “human” negates “nonhuman”, so “fish” forward-entails “nonhuman”. [!] After that, we’re just joining forward entailment with itself [!] or with equivalence, so forward entailment is preserved [!] [!] all the way through, [!] and that’s our final answer -- and it’s the CORRECT answer for this problem. [>]
20
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion The FraCaS test suite FraCaS: a project in computational semantics [Cooper et al. 96] 346 “textbook” examples of NLI problems 3 possible answers: yes, no, unknown (not balanced!) 55% single-premise, 45% multi-premise (excluded) P At most ten commissioners spend time at home. H At most ten commissioners spend a lot of time at home. yes Dumbo is a large animal. Dumbo is a small animal. no Smith believed that ITEL had won the contract in 1992. ITEL won the contract in 1992. unk In order to evaluate our system, we used the FraCaS test suite, which came out of a mid-90s project in computational semantics. It contains 346 problems which look like they could have come out of a textbook on formal semantics. FraCaS involves 3-way classification: it distinguishes contradiction from mere non-entailment. In this work, we consider only problems which contain a single premise. Here are a few example problems. The 1st one inserts a restrictive modifier in a downward-monotone context. The 2nd involves predicates which stand in the alternation relation. The 3rd involves a non-factive verb with a clausal complement. [>]
21
Results on FraCaS
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion Results on FraCaS (183 single-premise problems; precision and recall are for the yes class):
most common class: prec 55.7, rec 100.0
MacCartney & Manning 07: prec 68.9, rec 60.8, acc 59.6
this work: prec 89.3, rec 65.7, acc 70.5 (27% error reduction)
Here are results for a baseline classifier, our system last year, and our current system. The columns indicate the number of problems, precision and recall for the YES class, and accuracy. [!] Overall, we’ve made good progress since last year, achieving a 27% reduction in error, and [!] reaching almost 90% in precision. [>]
22
Results on FraCaS (by section)
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion Results on FraCaS, by section (#, prec %, rec %, acc %):
most common class: 183, 55.7, 100.0
MacCartney & Manning 07: 68.9, 60.8, 59.6
this work: 89.3, 65.7, 70.5
1 Quantifiers: 44, 95.2, 97.7
2 Plurals: 24, 90.0, 64.3, 75.0
3 Anaphora: 6, 60.0, 50.0
4 Ellipsis: 25, 5.3, 24.0
5 Adjectives: 15, 71.4, 83.3, 80.0
6 Comparatives: 16, 88.9, 81.3
7 Temporal: 36, 85.7, 70.6, 58.3
8 Verbs: 66.7, 62.5
9 Attitudes
sections 1, 2, 5, 6, 9: 108, 90.4, 85.5, 87.0
27% error reduction; in largest category, all but one correct; high accuracy in sections most amenable to natural logic; high precision even outside areas of expertise
What’s more interesting is the breakdown by section. The FraCaS problems are divided into nine sections, each focused on a different category of semantic phenomena. [!] In the section on quantifiers, which is both the largest and the most amenable to natural logic, we answer all but one problem correctly. [!] In fact, performance is good on all the sections where we expect NatLog to have relevant expertise. Our average accuracy on the five sections most amenable to natural logic is 87%. Not surprisingly, we make little headway with things like anaphora and ellipsis, but even here, [!] precision is high -- the system rarely predicts entailment when none exists. Of course, this doesn’t constitute a proper evaluation on unseen test data -- but on the other hand, the system was never trained on FraCaS data -- only on lexical entailment problems -- and it’s had no opportunity to learn biases implicit in the data. Our main goal in testing on FraCaS is to evaluate the representational and inferential adequacy of our model of natural logic, and from that perspective, the results are encouraging.
23
The RTE3 test suite Somewhat more “natural”, but not ideal for NatLog
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion The RTE3 test suite Somewhat more “natural”, but not ideal for NatLog Many kinds of inference not addressed by NatLog: paraphrase, temporal reasoning, relation extraction, … Big edit distance ⇒ propagation of errors from atomic model P As leaders gather in Argentina ahead of this weekend’s regional talks, Hugo Chávez, Venezuela’s populist president is using an energy windfall to win friends and promote his vision of 21st-century socialism. H Hugo Chávez acts as Venezuela’s president. yes Democrat members of the Ways and Means Committee, where tax bills are written and advanced, do not have strong small business voting records. Democrat members had strong small business voting records. no Since the FraCaS test suite is not well-known, we also wanted to do an evaluation using the familiar RTE data. Relative to FraCaS, the RTE problems are more natural-seeming, and the premises are much longer, averaging 35 words rather than 11. The RTE problems are not an ideal match to the strengths of the NatLog system. First, RTE includes many kinds of inference not addressed in natural logic, such as paraphrase, temporal reasoning, and relation extraction. Second, in most RTE problems, the edit distance between premise and hypothesis is quite large. More atomic edits means a greater chance that prediction errors made by the atomic entailment model will propagate, via entailment composition, to the system’s final output. Here are a couple of example problems. The first example is not a good match to the strengths of the NatLog system -- it’s essentially a relation extraction problem, and the NatLog system is thrown off by the introduction of the words “acts as” in the hypothesis. The second example is a much better fit for NatLog -- it hinges on recognizing that deleting a negation yields a contradiction, and NatLog gets this problem right. [>]
24
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion Results on RTE3: NatLog (each data set contains 800 problems; columns: % yes, prec %, rec %, acc %):
Stanford RTE, dev: 50.2, 68.7, 67.0, 67.2
Stanford RTE, test: 50.0, 61.8, 60.2, 60.5
NatLog, dev: 22.5, 73.9, 32.4, 59.2
NatLog, test: 26.4, 70.1, 36.1, 59.4
Accuracy is unimpressive, but precision is relatively high Strategy: hybridize with Stanford RTE system As in Bos & Markert 2006 But NatLog makes positive prediction far more often (~25% vs. 4%) Here are results on the RTE3 development and test sets for the Stanford RTE system -- a broad coverage RTE system -- and for NatLog. For each system, I show the percentage of problems answered YES, along with precision and recall for the YES class, and accuracy. Not surprisingly, [!] the overall accuracy of the NatLog system is unimpressive. [!] But NatLog achieves relatively high precision -- over 70% -- on its YES predictions. This suggests a strategy of hybridizing the high-precision, low-recall NatLog system with the broad-coverage Stanford system. Bos & Markert pursued a similar strategy in their 2006 system based on first-order logic and theorem proving. However, that system was able to make a positive prediction in only about 4% of cases. [!] NatLog makes positive predictions far more often, in about 25% of cases.
25
Results on RTE3: hybrid system
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion Results on RTE3: hybrid system (each data set contains 800 problems; columns: % yes, prec %, rec %, acc %):
Stanford RTE, dev: 50.2, 68.7, 67.0, 67.2
Stanford RTE, test: 50.0, 61.8, 60.2, 60.5
NatLog, dev: 22.5, 73.9, 32.4, 59.2
NatLog, test: 26.4, 70.1, 36.1, 59.4
Hybrid, dev: 56.0, 69.2, 75.2, 70.0
Hybrid, test: 54.5, 64.4, 68.5, 64.5
4% gain (significant, p < 0.05)
The results are quite satisfying. As we hoped, hybridization yields substantial gains. On the RTE3 test set, the hybrid system attained an accuracy 4% better than the Stanford system alone, corresponding to an extra 32 questions answered correctly. [>] [Unfortunately, the gain cannot be attributed to NatLog’s success in handling complicated inferences involving inversions of monotonicity which are the staple of natural logic. Indeed, such inferences are rather rare in the RTE data. Instead, NatLog seems to have gained primarily through a more precise handling of conceptual broadening in ordinary upward monotone contexts.]
26
Conclusion: what natural logic can’t do
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion Conclusion: what natural logic can’t do Not a universal solution for NLI Many types of inference not amenable to natural logic Paraphrase: Eve was let go = Eve lost her job Verb/frame alternation: he drained the oil ⊏ the oil drained Relation extraction: Aho, a trader at UBS… ⊏ Aho works for UBS Common-sense reasoning: the sink overflowed ⊏ the floor got wet etc. Also, has a weaker proof theory than FOL Can’t explain, e.g., de Morgan’s laws for quantifiers: Not all birds fly = Some birds don’t fly In summary, I want to emphasize that we are NOT proposing natural logic as a universal solution to the problem of natural language inference. There are many important kinds of inference which are simply not amenable to the natural logic approach, including … Moreover, natural logic has less deductive power than first-order logic. [>]
27
Conclusion: what natural logic can do
Introduction • A Theory of Natural Logic • The NatLog System • Experiments with FraCaS • Experiments with RTE • Conclusion Conclusion: what natural logic can do Natural logic enables precise reasoning about containment, exclusion, and implicativity, while sidestepping the difficulties of translating to FOL. The NatLog system successfully handles a broad range of such inferences, as demonstrated on the FraCaS test suite. Ultimately, open-domain NLI is likely to require combining disparate reasoners, and a facility for natural logic is a good candidate to be a component of such a system. But, natural logic enables precise reasoning about semantic containment, exclusion, and implicativity, while sidestepping the difficulties of full semantic interpretation, and it’s therefore able to explain a broad range of such inferences, as demonstrated on the FraCaS test suite. A full solution to natural language inference will ultimately require combining disparate reasoners, and natural logic is likely to be an important part of such a solution. [>]
28
A Phrase-Based Model of Alignment for Natural Language Inference
OK, the second part of the talk concerns a phrase-based model of alignment for natural language inference. This is joint work with Michel Galley and Chris Manning. [>] Bill MacCartney, Michel Galley, and Christopher D. Manning Stanford University 8 October 2008
29
Natural language inference (NLI) (aka RTE)
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion Natural language inference (NLI) (aka RTE) Does premise P justify an inference to hypothesis H? An informal notion of inference; variability of linguistic expression P In 1963, JFK was assassinated during a visit to Dallas. H Kennedy was killed in yes Like MT, NLI depends on a facility for alignment I.e., linking corresponding words/phrases in two related sentences Alignment addressed variously by current NLI systems Implicit alignment: NLI via lexical overlap [Glickman et al. 05, Jijkoun & de Rijke 05] Implicit alignment: NLI as proof search [Tatu & Moldovan 07, Bar-Haim et al. 07] Explicit alignment ⇒ entailment classification [Marsi & Kramer 05, MacCartney et al. 06] I’ve already introduced the NLI task, but I’d like to make an observation about the example shown here. In order to recognize that “Kennedy was killed” can be inferred from “JFK was assassinated”, one must first recognize the correspondence between “Kennedy” and “JFK”, and between “killed” and “assassinated”. Consequently, most current approaches to NLI depend, implicitly or explicitly, on a facility for alignment, that is, establishing links between corresponding entities and predicates in P and H. Different systems do this in different ways. Approaches based on measuring lexical overlap implicitly align each word in H to the word in P with which it is most similar. In approaches which formulate NLI as analogous to proof search, the alignment is implicit in the steps of the proof. But increasingly, the most successful systems make the alignment problem explicit, and then use alignment to drive entailment classification.
30
Contributions of this paper
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion Contributions of this paper In this paper, we: Undertake the first systematic study of alignment for NLI Existing NLI aligners use idiosyncratic methods, are poorly documented, use proprietary data Propose a new model of alignment for NLI: MANLI Uses a phrase-based alignment representation Exploits external lexical resources Capitalizes on new supervised training data Examine the relation between alignment in NLI and MT Can existing MT aligners be applied in the NLI setting?
31
NLI alignment vs. MT alignment
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion NLI alignment vs. MT alignment Alignment is familiar in MT, with extensive literature Can tools & techniques of MT alignment transfer to NLI? Doubtful — NLI alignment differs in several respects: Monolingual: can exploit resources like WordNet Asymmetric: P often longer & has content unrelated to H Cannot assume semantic equivalence NLI aligner must accommodate frequent unaligned content Little training data available MT aligners use unsupervised training on massive amounts of bitext NLI aligners must rely on supervised training & much less data The alignment problem is familiar in machine translation, and the MT community has developed not only an extensive literature, but also standard, proven tools for alignment. Can off-the-shelf MT aligners be applied to NLI? There is reason to be doubtful -- alignment for NLI differs from alignment for MT in several key respects. 1. It is monolingual, opening the door to utilizing abundant (monolingual) sources of information on semantic relatedness 2. It is intrinsically asymmetric: P is often much longer than H, and commonly contains phrases or clauses which have no counterpart in H. 3. Indeed, one cannot assume even approximate semantic equivalence -- usually a given in MT. Because NLI problems include both valid and invalid inferences, the semantic content of P and H can diverge substantially. NLI aligners must accommodate frequent unaligned content. 4. Little training data is available. MT aligners typically use unsupervised training on massive amounts of bitext. No such data is available for NLI. NLI aligners must depend on smaller amounts of supervised data, supplemented by lexical resources. (MT aligners can use dictionaries but aren’t designed to harness other sources of information on semantic relatedness.)
32
The MSR RTE2 alignment data
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion The MSR RTE2 alignment data Previously, little supervised data Now, MSR gold alignments for RTE2 [Brockett 2007] dev & test sets, 800 problems each Token-based, but many-to-many allows implicit alignment of phrases 3 independent annotators 3 of 3 agreed on 70% of proposed links 2 of 3 agreed on 99.7% of proposed links merged using majority rule In the past, research on alignment for NLI has been hampered by a paucity of high-quality, publicly available training data. Happily, that picture has begun to change, THANKS TO YOU GUYS. Last year, MSR (Microsoft Research) released a data set containing gold-standard alignments for the RTE2 development and test sets, containing 800 problems each. The alignment representation is token-based, but many-to-many, and thus allows implicit alignment of phrases. I’ve shown an example here. [READ P & H.] Two things to note here: (1) “In most Pacific countries” is unaligned -- you wouldn’t see that in MT alignment; and (2) “very few” has been aligned to “poorly represented” -- an implicit phrase alignment. Each problem was annotated independently by 3 people, and inter-annotator agreement was very high: all 3 agreed on 70% of proposed links, and 2 of the 3 agreed on more than 99% of proposed links, attesting to the high quality of the data. For this work, we merged the three annotations into a single gold standard using majority rule. [MSR only: Following a convention common in MT, the annotations contain both SURE and POSSIBLE links. In this work, we have ignored POSSIBLE links, embracing an argument made by Fraser & Marcu that their use has impeded progress in MT, and that SURE-only annotation is to be preferred.]
33
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion The MANLI aligner A new model of alignment for natural language inference Phrase-based representation Feature-based scoring function Decoding using simulated annealing Perceptron learning Now I’d like to tell you about a new model of alignment for natural language inference: the MANLI system. I know, it’s a funny name -- you might feel a little silly if you have to say it out loud. But the system itself is very straightforward. It has four components: It uses a phrase-based representation of alignment, and a linear, feature-based scoring function. It performs decoding using a simulated annealing strategy. And it uses a version of averaged perceptron for weight training. Let me tell you more about each of these components in turn.
34
Phrase-based alignment representation
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion Phrase-based alignment representation Represent alignments by sequence of phrase edits: EQ, SUB, DEL, INS DEL(In1) DEL(most2) DEL(Pacific3) DEL(countries4) DEL(there5) EQ(are6, are2) SUB(very7 few8, poorly3 represented4) EQ(women9, Women1) EQ(in10, in5) EQ(parliament11, parliament6) EQ(.12, .7) First, we use a representation of alignment which is phrase-based. We represent an alignment by a sequence of phrase edits of four different types: An EQ edit connects a phrase in P with an equal (by word lemmas) phrase in H. A SUB edit connects a phrase in P with an unequal phrase in H. A DEL edit covers an unaligned phrase in P. An INS edit covers an unaligned phrase in H. I’ve shown an example here, and the interesting edit is the SUB which connects “very few” with “poorly represented”. This representation is constrained to be one-to-one at the phrase level, but it can be many-to-many at the token level. In fact, this is the chief motivation for the phrase-based representation: we can align “very few” and “poorly represented” as units, without being forced to make an arbitrary choice as to which word goes with which word. Also, our scoring function can make use of lexical resources which have information about semantic relatedness of multi-word phrases, not just individual words. For the purpose of model training (but NOT for the evaluations I’ll tell you about later), we converted the token-based MSR data into this phrase-based representation. One-to-one at phrase level (but many-to-many at token level) Avoids arbitrary alignment choices; can use phrase-based resources For training (only!), converted MSR data to this form
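One way the phrase-based representation could be encoded (an illustrative sketch, not the system's actual data structures), using the 1-based token indices from the slide's example.

```python
# Sketch of a phrase-based alignment: a sequence of phrase edits, one-to-one
# at the phrase level but many-to-many at the token level.

from dataclasses import dataclass

@dataclass
class PhraseEdit:
    kind: str              # 'EQ', 'SUB', 'DEL', or 'INS'
    p_tokens: tuple = ()   # 1-based token indices in P (empty for INS edits)
    h_tokens: tuple = ()   # 1-based token indices in H (empty for DEL edits)

alignment = [
    PhraseEdit("DEL", (1,)), PhraseEdit("DEL", (2,)), PhraseEdit("DEL", (3,)),
    PhraseEdit("DEL", (4,)), PhraseEdit("DEL", (5,)),   # "In most Pacific countries there"
    PhraseEdit("EQ",  (6,),  (2,)),                     # are / are
    PhraseEdit("SUB", (7, 8), (3, 4)),                  # very few ~ poorly represented
    PhraseEdit("EQ",  (9,),  (1,)),                     # women / Women
    PhraseEdit("EQ",  (10,), (5,)),                     # in / in
    PhraseEdit("EQ",  (11,), (6,)),                     # parliament / parliament
    PhraseEdit("EQ",  (12,), (7,)),                     # . / .
]
print(sum(e.kind == "SUB" for e in alignment), "SUB edit among", len(alignment), "edits")
```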
35
A feature-based scoring function
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion A feature-based scoring function Score edits as linear combination of features, then sum: Edit type features: EQ, SUB, DEL, INS Phrase features: phrase sizes, non-constituents Lexical similarity feature: max over similarity scores WordNet: synonymy, hyponymy, antonymy, Jiang-Conrath Distributional similarity à la Dekang Lin Various measures of string/lemma similarity Contextual features: distortion, matching neighbors For scoring alignments, we use a very simple linear feature-based scoring function. The score for an alignment is the sum of the scores of the edits it contains -- including deletions and insertions -- and the score for an edit is the dot product of a weight vector and a feature vector. We use several types of features. First, we have a group of features which encode the type of the edit. Next, we have features which encode the sizes of the phrases involved in the edit, and whether these phrases are non-constituents (in a syntactic parse). For SUB edits, a very important feature represents the lexical similarity of the substituends, as a real value between 0 and 1. We compute this as a max over a number of component functions, some based on external lexical resources. This includes manually constructed lexical resources, such as WordNet, and also automatically constructed resources, such as a measure of distributional similarity in a very large corpus. An MT aligner is basically inducing distributional similarity from massive amounts of bitext; we’re getting it from an external lexical resource. We also use various measures of string and lemma similarity. Finally, high lexical similarity doesn’t necessarily mean good match, esp. if sentences contain multiple occurrences of the same word (e.g. function words). To remedy this, we introduce contextual features. Distortion features measures difference between relative positions of words within their respective sentences. Matching neighbors indicates whether tokens before and after aligned pair are equal or similar. [>]
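A minimal sketch of the scoring function just described: the alignment score is the sum of per-edit scores, each a dot product of a weight vector and a feature vector. The specific feature names and weights below are invented for illustration; they are not the system's real feature set.

```python
# Sketch: linear, feature-based scoring of an alignment (sum over edits of w . f).

def edit_score(features, weights):
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def alignment_score(edit_feature_dicts, weights):
    """edit_feature_dicts: one feature dict per edit, including DELs and INSs."""
    return sum(edit_score(f, weights) for f in edit_feature_dicts)

weights = {"type=SUB": -0.5, "type=DEL": -1.0, "lexical_sim": 3.0, "matching_neighbors": 0.5}
edits = [
    {"type=SUB": 1.0, "lexical_sim": 0.8, "matching_neighbors": 1.0},  # very few ~ poorly represented
    {"type=DEL": 1.0},                                                 # an unaligned P phrase
]
print(alignment_score(edits, weights))   # 2.4 + (-1.0) = 1.4
```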
36
Decoding using simulated annealing
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion Decoding using simulated annealing [Diagram: start with an empty alignment, then repeat 100 times: generate successors, score them, smooth/sharpen the distribution P(A) ← P(A)^(1/T), sample, and lower the temperature T ← 0.9 T.] Decoding is made more complex by use of phrase-based alignments. With a token-based representation, decoding is trivial, since each token can be aligned independently of its neighbors. But with a phrase-based representation, every aligned phrase pair must be consistent with its neighbors w.r.t. phrase segmentation. To address this problem, we use a stochastic local search based on simulated annealing. Here’s how it works. We start with an empty alignment, and then we generate a set of successors. To do this, we generate every possible edit up to some maximum size, and then we generate a successor by adding the edit to the current alignment and removing any edits which conflict with it. Then we score the successors, using our scoring function, and convert the scores into a probability distribution. Next, we smooth or sharpen the distribution by raising it to a power which depends on a temperature parameter. The temperature starts off high, so that we’re smoothing the distribution, which helps to ensure that we explore the space of possibilities. In later iterations, we’ll make the distribution sharper and sharper, so that we converge. Then we sample a new alignment, which may or may not be the most likely one, and we lower the temperature, and repeat the process … 100 times. This might seem like it would be slow, but with clever use of memoization, the average RTE problem takes only about 2 seconds to align. No guarantee of optimality, but guess scores at least as high as gold for >99% of problems.
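A hedged sketch of the annealing loop just described. The successor generator and the scoring function are left abstract, and the usage example is a toy stand-in problem rather than alignment decoding.

```python
# Sketch of simulated-annealing decoding: score successors, turn the scores
# into a distribution, sharpen it as the temperature drops, sample, repeat.

import math, random

def anneal(initial, successors, score, iterations=100, temperature=10.0, cooling=0.9):
    current = initial
    for _ in range(iterations):
        candidates = successors(current)
        scores = [score(c) for c in candidates]
        top = max(scores)
        # softmax over scores; dividing by the temperature smooths early, sharpens late
        weights = [math.exp((s - top) / temperature) for s in scores]
        current = random.choices(candidates, weights=weights)[0]
        temperature *= cooling
    return current

# Toy stand-in problem: move around a list of numbers, preferring larger values.
values = [1, 5, 3, 9, 2]
best = anneal(0,
              lambda i: [i, max(i - 1, 0), min(i + 1, len(values) - 1)],
              lambda i: values[i])
print(values[best])   # usually 9 once the temperature is low
```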
37
Perceptron learning of feature weights
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion Perceptron learning of feature weights We use a variant of averaged perceptron [Collins 2002] Initialize weight vector w = 0, learning rate R0 = 1 For training epoch i = 1 to 50: For each problem Pj, Hj with gold alignment Ej: Set Êj = ALIGN(Pj, Hj, w) Set w = w + Ri (Φ(Ej) – Φ(Êj)) Set w = w / ‖w‖2 (L2 normalization) Set w[i] = w (store weight vector for this epoch) Set Ri = 0.8 Ri–1 (reduce learning rate) Throw away weight vectors from first 20% of epochs Return average weight vector To tune the parameters of the model, we use an adaptation of the averaged perceptron algorithm, which has proven successful on a range of NLP tasks. After initializing w to 0, we perform 50 training epochs. In each epoch, we iterate through the training data, updating the weight vector at each training example according to the difference between the features of the target alignment and the features of the alignment produced by the decoder using the current weight vector. The size of the update is controlled by a learning rate which decreases over time. At the end of each epoch, the weight vector is normalized and stored. The final result is the average of the stored weight vectors, omitting vectors from a fixed proportion of epochs at the beginning of the run (which tend to be of poor quality). Training runs on the RTE2 development set required about 20 hours. Training runs require about 20 hours (for 800 problems)
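A near-direct transcription of the slide's pseudocode into Python, as a sketch only: the decoder (align) and the feature extractor (phi, returning a dict of feature values) are assumed to be supplied, and the sparse-dict bookkeeping is my own choice rather than the system's.

```python
# Sketch of the averaged-perceptron training loop from the slide.

def train(problems, align, phi, epochs=50, rate=1.0, decay=0.8, burn_in=0.2):
    w = {}
    stored = []
    for epoch in range(epochs):
        for premise, hypothesis, gold in problems:
            guess = align(premise, hypothesis, w)
            # w = w + R_i * (phi(gold) - phi(guess))
            for feat, value in phi(gold).items():
                w[feat] = w.get(feat, 0.0) + rate * value
            for feat, value in phi(guess).items():
                w[feat] = w.get(feat, 0.0) - rate * value
        norm = sum(v * v for v in w.values()) ** 0.5 or 1.0
        w = {f: v / norm for f, v in w.items()}        # L2 normalization
        stored.append(dict(w))                         # store this epoch's weights
        rate *= decay                                  # reduce learning rate
    kept = stored[int(burn_in * epochs):]              # discard early, low-quality epochs
    features = {f for wv in kept for f in wv}
    return {f: sum(wv.get(f, 0.0) for wv in kept) / len(kept) for f in features}
```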
38
Evaluation on MSR data We evaluate several systems on MSR data
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion Evaluation on MSR data We evaluate several systems on MSR data Baseline, GIZA++ & Cross-EM, Stanford RTE, MANLI How well do they recover gold-standard alignments? We report per-link precision, recall, and F1 Note that AER = 1 – F1 For MANLI, two tokens are considered to be aligned iff they are contained within phrases which are aligned We also report exact match rate What proportion of guessed alignments match gold exactly? OK, let’s talk about evaluation. Over the next several slides, I’ll present evaluations of several alignment systems on the MSR RTE alignment data. Specifically, we’ll look at a baseline aligner, two well-known MT aligners, and two NLI aligners: the Stanford RTE aligner and MANLI. To evaluate each aligner’s ability to recover the gold standard alignments, we’ll report per-link precision, recall, and F1. In the MT community, it’s more conventional to report alignment error rate, but since we’re using SURE-only annotations, AER is just 1 - F1. Also, we’re using the original, token-based version of the MSR data. In evaluating MANLI, we’ll consider two tokens to be aligned iff they are contained within phrases which are aligned. Finally, we’re also going to report the exact match rate: what proportion of guessed alignments matched the gold exactly?
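A small sketch of the per-link metrics just described, treating an alignment as a set of (P-token, H-token) index pairs; the index pairs in the usage line are made up.

```python
# Sketch: per-link precision, recall, F1 (so AER = 1 - F1 with sure-only links),
# plus the exact-match rate over a collection of problems.

def prf1(guessed_links, gold_links):
    guessed, gold = set(guessed_links), set(gold_links)
    tp = len(guessed & gold)
    p = tp / len(guessed) if guessed else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def exact_match_rate(guesses, golds):
    return sum(set(g) == set(t) for g, t in zip(guesses, golds)) / len(golds)

print(prf1({(1, 1), (2, 2), (3, 4)}, {(1, 1), (2, 2), (3, 3)}))   # roughly (0.67, 0.67, 0.67)
```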
39
Baseline: bag-of-words aligner
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion Baseline: bag-of-words aligner (P %, R %, F1 %, E %):
Bag-of-words, RTE2 dev: P 57.8, R 81.2, F1 67.5, E 3.5; RTE2 test: P 62.1, R 82.6, F1 70.9, E 5.3
Match each H token to most similar P token: [cf. Glickman et al. 2005] As a baseline, we’ll use a very simple alignment algorithm inspired by the lexical entailment model of Glickman et al. This just matches each token in H with the token in P to which it is most similar, according to a lexical similarity function. We use a simple lexical similarity function based on the string edit distance between two word lemmas, shown here. Here are some initial results. As I described earlier, I show the precision, recall, F1, and exact match rate for both the development and test sets of RTE2. Despite the simplicity of this alignment model, recall is surprisingly good -- above 80%. Its precision, however, is mediocre---chiefly because, by design, it aligns every hypothesis token with some premise token. The model could surely be improved by allowing it to leave some hypothesis tokens unaligned, but we didn’t pursue this. Surprisingly good recall, despite extreme simplicity But very mediocre precision, F1, & exact match rate Main problem: aligns every token in H
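A sketch of the baseline, using difflib's string similarity as a stand-in for the paper's lemma edit-distance measure, and the MSR example sentences quoted earlier in the talk.

```python
# Sketch of the bag-of-words baseline: align each hypothesis token to the
# premise token with the highest lexical similarity (every H token is aligned,
# which is exactly why precision suffers).

from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def bag_of_words_align(premise_tokens, hypothesis_tokens):
    links = []
    for j, h in enumerate(hypothesis_tokens):
        i = max(range(len(premise_tokens)), key=lambda k: similarity(premise_tokens[k], h))
        links.append((i, j))
    return links

P = "In most Pacific countries there are very few women in parliament .".split()
H = "Women are poorly represented in parliament .".split()
print(bag_of_words_align(P, H))
```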
40
MT aligners: GIZA++ & Cross-EM
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion MT aligners: GIZA++ & Cross-EM Why not use off-the-shelf MT aligner for NLI? Run GIZA++ via Moses, with default parameters Asymmetric alignments in both directions Then symmetrize using INTERSECTION heuristic Initial results are very poor: 56% F1 Doesn’t even align equal words Remedy: add lexicon of equal words as extra training data Do similar experiments with Berkeley Cross-EM aligner Given the importance of alignment for NLI, and the availability of standard, proven tools for MT alignment, an obvious question presents itself: why not use an off-the-shelf MT aligner for NLI? Although we’ve argued that this is unlikely to succeed, to our knowledge, we are the first to investigate the matter empirically. [Though (Dolan et al. 04) explored using an MT aligner to identify paraphrases.] We did experiments with the best-known MT aligner, GIZA++, running it via the Moses toolkit, with default parameters. We generated asymmetric alignments in both directions, and then performed symmetrization using the well-known INTERSECTION heuristic. The initial results were very poor: it aligned words seemingly at random, not even aligning equal words. Because GIZA++ is designed for cross-lingual use, it does not consider word equality between source and target sentences. To remedy this, we supplied GIZA++ with a lexicon, using a trick common in MT: we supplemented the training data with synthetic data consisting of matched pairs of equal words. This gives GIZA++ a better chance of learning that, e.g., “man” should align with “man”. This resulted in a big boost in recall, and a smaller gain in precision. As an additional comparison, we ran similar experiments with the Cross-EM aligner from Berkeley.
41
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion Results: MT aligners (P %, R %, F1 %, E %):
Bag-of-words, RTE2 dev: P 57.8, R 81.2, F1 67.5, E 3.5; RTE2 test: P 62.1, R 82.6, F1 70.9, E 5.3
GIZA++, RTE2 dev: P 83.0, R 66.4, F1 72.1, E 9.4; RTE2 test: P 85.1, R 69.1, F1 74.8, E 11.3
Cross-EM, RTE2 dev: P 67.6, R 80.1, E 1.3; RTE2 test: P 70.3, R 81.0, F1 74.1, E 0.8
Similar F1, but GIZA++ wins on precision, Cross-EM on recall Both do best with lexicon & INTERSECTION heuristic Also tried UNION, GROW, GROW-DIAG, GROW-DIAG-FINAL, GROW-DIAG-FINAL-AND, and asymmetric alignments All achieve better recall, but much worse precision & F1 Problem: too little data for unsupervised learning Need to compensate by exploiting external lexical resources So here are the results. This is based on using the lexicon and the INTERSECTION heuristic. Both MT aligners do about the same on F1, but GIZA++ attains better precision, while Cross-EM gets better recall. Both do significantly better than the bag-of-words baseline, especially on precision, though bag-of-words actually does slightly better on recall. We also tried using alternate symmetrization heuristics, and asymmetric alignments, but everything we tried did much worse than the INTERSECTION heuristic on F1. Qualitatively, both MT aligners do a good job of aligning equal words (when using a lexicon), but they continue to align most other word pairs apparently at random. This is not too surprising: the basic problem is that the quantity of data is just far too small for unsupervised learning of word correspondences. A successful NLI aligner will need to exploit supervised training data, and will need access to additional sources of knowledge about lexical relatedness.
42
The Stanford RTE aligner
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion The Stanford RTE aligner Token-based alignments: map from H tokens to P tokens Phrase alignments not directly representable But named entities, collocations collapsed in pre-processing Exploits external lexical resources WordNet, LSA, distributional similarity, string sim, … Syntax-based features to promote aligning corresponding predicate-argument structures Decoding & learning similar to MANLI A better comparison is thus to an alignment system expressly designed for NLI. For this purpose, we used the alignment component of the Stanford RTE system. The Stanford system represents alignments as a map from hypothesis tokens to premise tokens. Phrase alignments are not directly representable, although the effect can be approximated by a pre-processing step which collapses multi-token named entities and certain collocations into single tokens. The scoring function exploits a variety of sources of information about lexical relatedness, and also includes syntax-based features intended to promote the alignment of similar predicate-argument structures. Decoding and learning are handled in similar fashion to MANLI.
43
Results: Stanford RTE aligner
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion Results: Stanford RTE aligner

                         RTE2 dev                       RTE2 test
System           P %    R %    F1 %   E %       P %    R %    F1 %   E %
Bag-of-words     57.8   81.2   67.5   3.5       62.1   82.6   70.9   5.3
GIZA++           83.0   66.4   72.1   9.4       85.1   69.1   74.8   11.3
Cross-EM         67.6   80.1   —      1.3       70.3   81.0   74.1   0.8
Stanford RTE *   81.1   75.8   78.4   0.5       82.7   —      79.1   0.3

* includes (generous) correction for missed punctuation

Better F1 than MT aligners, but recall lags precision
Stanford does poor job aligning function words
13% of links in gold are prepositions & articles
Stanford misses 67% of these (MANLI only 10%)
Also, Stanford fails to align multi-word phrases
peace activists ~ protestors, hackers ~ non-authorized personnel

Here are results for the Stanford aligner. It outperforms the MT aligners on F1, but recall is substantially lower than precision, and that’s even after applying a correction which generously ignores all recall errors involving punctuation, which is systematically ignored by the Stanford system. Error analysis reveals that the Stanford aligner does a poor job of aligning function words. About 13% of the aligned pairs in the MSR data are matching prepositions or articles; the Stanford aligner misses about two-thirds of such pairs. (By contrast, MANLI misses only 10% of such pairs.) While function words matter less in inference than nouns and verbs, they are not irrelevant, and because sentences often contain multiple instances of a particular function word, matching them properly is by no means trivial. Finally, the Stanford aligner is handicapped by its token-based alignment representation, often failing (partly or completely) to align multi-word phrases such as “peace activists” with “protesters”, or “hackers” with “non-authorized personnel”.
44
Results: MANLI aligner
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion Results: MANLI aligner

                         RTE2 dev                       RTE2 test
System           P %    R %    F1 %   E %       P %    R %    F1 %   E %
Bag-of-words     57.8   81.2   67.5   3.5       62.1   82.6   70.9   5.3
GIZA++           83.0   66.4   72.1   9.4       85.1   69.1   74.8   11.3
Cross-EM         67.6   80.1   —      1.3       70.3   81.0   74.1   0.8
Stanford RTE     81.1   75.8   78.4   0.5       82.7   —      79.1   0.3
MANLI            83.4   85.5   84.4   21.7      85.4   —      85.3   21.3

MANLI outperforms all others on every measure
F1: 10.5% higher than GIZA++, 6.2% higher than Stanford
Good balance of precision & recall
Matched >20% exactly

Now here are the results for the MANLI aligner. MANLI was found to outperform all other aligners evaluated on every measure, achieving F1 10.5% higher than GIZA++ and 6.2% higher than Stanford, even after the punctuation correction. It also achieves a good balance of precision and recall, and matched the gold standard exactly more than 20% of the time.
45
MANLI results: discussion
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion MANLI results: discussion Three factors contribute to success: Lexical resources: jail ~ prison, prevent ~ stop, injured ~ wounded Contextual features enable matching function words Phrases: death penalty ~ capital punishment, abdicate ~ give up But phrases help less than expected! If we set max phrase size = 1, we lose just 0.2% in F1 Recall errors: room to improve 40%: need better lexical resources: conservation ~ protecting, organization ~ agencies, bone fragility ~ osteoporosis Precision errors harder to reduce function words (49%), be (21%), punct (7%), equal lemmas (18%) Three factors seem to have contributed most to MANLI's success. First, MANLI is able to outperform the MT aligners principally because it is able to leverage lexical resources to identify the similarity between pairs of words such as … Second, MANLI's contextual features enable it to do better than the Stanford aligner at matching function words. Third, MANLI gains a marginal advantage because its phrase-based representation of alignment permits it to properly align phrase pairs such as … However, the phrase-based representation contributed far less than we had hoped. Setting MANLI's maximum phrase size to 1 caused F1 to fall by just 0.2%. We don’t interpret this to mean that phrases are not useful -- instead, we think it shows that we have failed to fully exploit the advantages of the phrase-based representation, chiefly because we lack lexical resources providing good information on similarity of multi-word phrases. Error analysis suggests that there is ample room for improvement. A large proportion of recall errors (perhaps 40%) occur because the lexical similarity function assigns too low a value to pairs of words or phrases which are clearly similar, such as ... Precision errors may be harder to reduce. These errors are dominated by cases where we mistakenly align two equal function words, two forms of the verb “to be”, two equal punctuation marks, or two words or phrases of other types having equal lemmas. Such errors often occur because the aligner is forced to choose between nearly equivalent alternatives, so they may be difficult to eliminate.
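As a hedged illustration of the kind of lexical-resource lookup referred to here, the sketch below uses WordNet path similarity (via NLTK) as a stand-in for MANLI's actual lexical similarity function, which combines several resources; the function name and the best-over-all-synsets scoring choice are assumptions.

    from nltk.corpus import wordnet as wn

    def wordnet_similarity(word1, word2):
        """Best path similarity over all synset pairs; 0.0 if no path is found."""
        best = 0.0
        for s1 in wn.synsets(word1):
            for s2 in wn.synsets(word2):
                sim = s1.path_similarity(s2)
                if sim is not None and sim > best:
                    best = sim
        return best

    print(wordnet_similarity("jail", "prison"))    # high: near-synonyms in WordNet
    print(wordnet_similarity("jail", "banana"))    # much lower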
46
Can aligners predict RTE answers?
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion Can aligners predict RTE answers? We’ve been evaluating against gold-standard alignments But alignment is just one component of an NLI system Does a good alignment indicate a valid inference? Not necessarily: negations, modals, non-factives & implicatives, … But alignment score can be strongly predictive And many NLI systems rely solely on alignment Using alignment score to predict RTE answers: Predict YES if score > threshold Tune threshold on development data Evaluate on test data We've evaluated the ability of aligners to recover gold-standard alignments. But since alignment is just one component of the NLI problem, we might also examine the impact of different aligners on the ability to recognize valid inferences. Does a high-scoring alignment indicate a valid inference? Well, there's more to inferential validity than close lexical or structural correspondence: negations, modals, non-factive and implicative verbs, and other linguistic constructs can affect validity in ways hard to capture in alignment. Nevertheless, alignment score can be a strong predictor of inferential validity, and many NLI systems rely entirely on some measure of alignment quality to predict validity. If an aligner generates real-valued alignment scores, we can use the RTE data to test its ability to predict inferential validity with the following simple method. For a given RTE problem, we predict YES if its alignment score exceeds a given threshold and NO otherwise. We tune the threshold to maximize accuracy on the RTE2 development set, and then measure performance on the RTE2 test set using the same threshold.
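A minimal sketch of the thresholding scheme just described, assuming alignment scores and YES/NO labels are available as parallel lists; names are illustrative.

    def tune_threshold(dev_scores, dev_labels):
        """Pick the threshold maximizing accuracy on the dev problems.
        dev_labels: True for YES (valid inference), False for NO."""
        def accuracy(t):
            return sum((s > t) == y for s, y in zip(dev_scores, dev_labels)) / len(dev_labels)
        candidates = sorted(set(dev_scores)) + [min(dev_scores) - 1.0]
        return max(candidates, key=accuracy)

    def predict(scores, threshold):
        return ["YES" if s > threshold else "NO" for s in scores]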
47
Results: predicting RTE answers
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion Results: predicting RTE answers

                            RTE2 dev              RTE2 test
System                      Acc %   AvgP %        Acc %   AvgP %
Bag-of-words                61.3    61.5          57.9    58.9
Stanford RTE                63.1    64.9          60.9    59.2
MANLI                       59.3    69.0          60.3    61.0
RTE2 entries (average)      —       —             58.5    59.1
LCC [Hickl et al. 2006]     —       —             75.4    80.8

No NLI aligner rivals top LCC system
But, Stanford & MANLI beat average entry for RTE2
Many NLI systems could benefit from better alignments!

Here are results for several NLI aligners, along with some results for complete RTE systems, including the LCC system (the top performer at RTE2) and an average of all systems participating in RTE2. I show accuracy and average precision in predicting answers for the RTE2 development and test sets. While none of the aligners rivals the performance of the LCC system, all achieve respectable results, and the Stanford and MANLI aligners outperform the average RTE2 entry. Thus, even if alignment quality does not determine inferential validity, many NLI systems could be improved by harnessing a well-designed NLI aligner.
48
Related work
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion Related work Lots of past work on phrase-based MT But most systems extract phrases from word-aligned data Despite assumption that many translations are non-compositional Recent work jointly aligns & weights phrases [Marcu & Wong 02, DeNero et al. 06, Birch et al. 06, DeNero & Klein 08] However, this is of limited applicability to the NLI task MANLI uses phrases only when words aren’t appropriate MT uses longer phrases to realize more dependencies (e.g. word order, agreement, subcategorization) MT systems don’t model word insertions & deletions Given the extensive literature on phrase-based MT, it may be helpful to situate our phrase-based alignment model in relation to past work. Phrase-based MT systems usually apply phrase extraction heuristics to word-aligned training data, which stands at odds with the key assumption in phrase-based systems that many translations are non-compositional. More recently, several authors have presented more unified phrase-based systems that jointly align and weight phrases. But we would argue that this work is of limited applicability to our problem. In MANLI, we use phrases only when word alignments are not appropriate, and longer phrases are not needed to achieve good alignment quality. But MT phrase alignment benefits from using longer phrases whenever possible, since this helps to realize more dependencies among translated words (e.g., word order, agreement, subcategorization). Also, MT phrase alignment systems don't model word insertions and deletions, as MANLI does. In the example I showed before, MANLI can just skip "In most Pacific countries there", while an MT phrase-based model would presumably align "In most Pacific countries there are" to "Women are".
49
Conclusion
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion Conclusion MT aligners not directly applicable to NLI They rely on unsupervised learning from massive amounts of bitext They assume semantic equivalence of P & H MANLI succeeds by: Exploiting (manually & automatically constructed) lexical resources Accommodating frequent unaligned phrases Phrase-based representation shows potential But not yet proven: need better phrase-based lexical resources [!] That’s it -- thanks very much! :-) Thanks! Questions?
50
END
51
Outline
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion Outline Introduction NLI alignment vs. MT alignment The MSR data The MANLI aligner Evaluating aligners on the MSR data Using alignment to predict RTE answers Conclusion Here’s the outline of the talk. First I’ll talk about... Then I’ll... And then I’ll conclude. [>]
52
The MSR RTE2 alignment data
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion The MSR RTE2 alignment data Previously, little supervised data Now, MSR gold alignments for RTE2 dev & test sets, each with 800 P, H pairs Token-based, but many-to-many allows implicit alignment of phrases SURE vs. POSSIBLE: we use only SURE 3 independent annotators 3 of 3 agreed on 70% of proposed links 2 of 3 agreed on 99.7% of proposed links merged using majority rule In the past, research on alignment for NLI has been hampered by a paucity of high-quality, publicly available training data. But that picture has begun to change. Last year, Microsoft Research released a data set containing gold-standard alignments for the RTE2 development and test sets, containing 800 problems each. The alignment representation is token-based, but many-to-many, and thus allows implicit alignment of phrases. I’ve shown an example here. [READ P & H.] Two things to note here: (1) “In most Pacific countries” is unaligned -- you wouldn’t see that in MT alignment; and (2) “very few” has been aligned to “poorly represented” -- an implicit phrase alignment. Following a convention common in MT, every link is marked as either SURE or POSSIBLE. In this work, we ignore the POSSIBLE links, embracing the argument made by Fraser & Marcu 07 that their use has impeded progress in MT, and that SURE-only annotation is to be preferred. Each problem was annotated independently by 3 people, and inter-annotator agreement was very high: all 3 agreed on 70% of proposed links, and 2 of the 3 agreed on more than 99% of proposed links. For this work, we merged the three annotations into a single gold standard using majority rule.
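Assuming each annotation is represented as a set of SURE links, the majority-rule merge described here might be sketched as follows (a link is kept iff at least two of the three annotators proposed it); the data below is invented for illustration.

    from collections import Counter

    def majority_merge(annotations, quorum=2):
        """Keep each link proposed by at least `quorum` annotators."""
        counts = Counter(link for ann in annotations for link in ann)
        return {link for link, n in counts.items() if n >= quorum}

    ann1 = {(0, 0), (1, 1), (2, 3)}
    ann2 = {(0, 0), (1, 1)}
    ann3 = {(0, 0), (2, 3), (4, 4)}
    print(majority_merge([ann1, ann2, ann3]))   # {(0, 0), (1, 1), (2, 3)}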
53
Baseline: bag-of-words aligner
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion Baseline: bag-of-words aligner Match each H token to the most similar P token [cf. Glickman et al. 2005] Can also generate an alignment score Exceedingly simple, but surprisingly robust As a baseline, we’ll use a very simple alignment algorithm inspired by the lexical entailment model of Glickman et al. This just matches each token in H with the token in P to which it is most similar, according to a lexical similarity function. We use a simple lexical similarity function based on the string edit distance between two word lemmas, shown here. This model can also generate an alignment score. We define the score for a specific hypothesis token to be the log of its similarity with the premise token to which it is aligned, and the score for the complete alignment to be the sum of the scores of the hypothesis tokens, weighted by IDF scores (so that common words get less weight), and normalized by the length of the hypothesis. Although the model is exceedingly simple, its performance is surprisingly robust.
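Since the slide's formulas did not survive extraction, here is a hedged sketch of the baseline as described in the notes: each hypothesis token is aligned to its most similar premise token, and the alignment score is an IDF-weighted, length-normalized sum of log similarities. The particular string-similarity function and weighting details below are stand-ins, not the exact model.

    import math
    from difflib import SequenceMatcher

    def lex_sim(lemma1, lemma2):
        """Stand-in lexical similarity in (0, 1]; the actual model uses a
        function of string edit distance over lemmas."""
        return max(SequenceMatcher(None, lemma1, lemma2).ratio(), 1e-4)

    def bow_align(premise, hypothesis, idf):
        """Align each hypothesis token to its most similar premise token and
        return the links plus an IDF-weighted, length-normalized score."""
        links, total = [], 0.0
        for j, h in enumerate(hypothesis):
            i = max(range(len(premise)), key=lambda k: lex_sim(premise[k], h))
            links.append((i, j))
            total += idf.get(h, 1.0) * math.log(lex_sim(premise[i], h))
        return links, total / len(hypothesis)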
54
Results: bag-of-words aligner
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion Results: bag-of-words aligner

                         RTE2 dev                       RTE2 test
System           P %    R %    F1 %   E %       P %    R %    F1 %   E %
Bag-of-words     57.8   81.2   67.5   3.5       62.1   82.6   70.9   5.3

Good recall, despite simplicity of model
But very mediocre precision, F1, & exact match rate
Main problem: aligns every token in H

Here are some initial results. As I described earlier, I show the precision, recall, F1, and exact match rate for both the development and test sets of RTE2. Despite the simplicity of this alignment model, recall is pretty good, above 80%. Its precision, however, is mediocre, chiefly because, by design, it aligns every hypothesis token with some premise token. The model could surely be improved by allowing it to leave some hypothesis tokens unaligned, but this was not pursued.
55
Results: Stanford RTE aligner
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion Results: Stanford RTE aligner

                         RTE2 dev                       RTE2 test
System           P %    R %    F1 %   E %       P %    R %    F1 %   E %
Bag-of-words     57.8   81.2   67.5   3.5       62.1   82.6   70.9   5.3
GIZA++           83.0   66.4   72.1   9.4       85.1   69.1   74.8   11.3
Cross-EM         67.6   80.1   —      1.3       70.3   81.0   74.1   0.8
Stanford RTE     81.1   61.2   69.7   0.5       82.7   —      —      0.3

Disappointing, especially the poor recall!
Explanation: Stanford ignores punctuation (worth 15%)
But punctuation matters little in inference
So, let's ignore these errors…

The initial evaluation is quite disappointing, and the low recall figures (around 60%) are particularly noteworthy. But a partial explanation is readily available: by design, the Stanford system ignores punctuation. Since punctuation tokens account for about 15% of the aligned pairs in the MSR data, this sharply reduces measured recall. However, since punctuation matters little in inference, such recall errors probably should be forgiven…
56
Results: Stanford RTE aligner
Introduction • NLI vs. MT • The MSR Data • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion Results: Stanford RTE aligner

                           RTE2 dev                       RTE2 test
System             P %    R %    F1 %   E %       P %    R %    F1 %   E %
Bag-of-words       57.8   81.2   67.5   3.5       62.1   82.6   70.9   5.3
GIZA++             83.0   66.4   72.1   9.4       85.1   69.1   74.8   11.3
Cross-EM           67.6   80.1   —      1.3       70.3   81.0   74.1   0.8
Stanford RTE       81.1   61.2   69.7   0.5       82.7   —      —      0.3
  ignoring punct.  —      75.8   78.4   —         —      —      79.1   —

Better, but recall is still rather low
Stanford does poor job aligning function words
13% of links in gold are prepositions & articles
Stanford misses 67% of these (MANLI only 10%)
Also, Stanford fails to align multi-word phrases
peace activists ~ protestors, hackers ~ non-authorized personnel

Here we show adjusted statistics for the Stanford system in which all recall errors involving punctuation are (generously) ignored. But even after this adjustment, the recall figures are unimpressive. Error analysis reveals that the Stanford aligner does a poor job of aligning function words. About 13% of the aligned pairs in the MSR data are matching prepositions or articles; the Stanford aligner misses about 67% of such pairs. (By contrast, MANLI misses only 10% of such pairs.) While function words matter less in inference than nouns and verbs, they are not irrelevant, and because sentences often contain multiple instances of a particular function word, matching them properly is by no means trivial. Finally, the Stanford aligner is handicapped by its token-based alignment representation, often failing (partly or completely) to align multi-word phrases such as “peace activists” with “protesters”, or “hackers” with “non-authorized personnel”.
57
Performance on MSR RTE2 data
System                        Data    P %    R %    F1 %   Exact %
Bag-of-words                  dev     57.8   81.2   67.5   3.5
  (baseline)                  test    62.1   82.6   70.9   5.3
GIZA++                        dev     83.0   66.4   72.1   —
  (using lex, INTERSECTION)   test    85.1   69.1   74.8   —
Cross-EM                      dev     67.6   80.1   —      —
                              test    70.3   81.0   74.1   —
Stanford RTE                  dev     81.1   61.2   69.7   0.5
                              test    82.7   —      —      0.3
  (punct. corr.)              dev     —      75.8   78.4   —
                              test    —      —      79.1   —
MANLI                         dev     83.4   85.5   84.4   21.7
  (this work)                 test    85.4   —      85.3   21.3

[>] (each data set contains 800 problems)
58
Performance on MSR RTE2 data
System                       Data    P %    R %    F1 %   E %
Bag-of-words baseline        dev     57.8   81.2   67.5   3.5
                             test    62.1   82.6   70.9   5.3
GIZA++ (lex, INTERSECTION)   dev     83.0   66.4   72.1   —
                             test    85.1   69.1   74.8   —
Cross-EM (lex, INTERSECTION) dev     67.6   80.1   —      —
                             test    70.3   81.0   74.1   —
Stanford RTE                 dev     81.1   61.2   69.7   0.5
                             test    82.7   —      —      0.3
Stanford punct. corr.        dev     —      75.8   78.4   —
                             test    —      —      79.1   —
MANLI (this work)            dev     83.4   85.5   84.4   21.7
                             test    85.4   —      85.3   21.3

[>] (each data set contains 800 problems)