The Second PASCAL Recognising Textual Entailment Challenge
Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, Idan Szpektor
Bar Ilan, CELCT, ITC-irst, Microsoft Research, MITRE
Variability of Semantic Expression
- Model variability as relations between text expressions:
  - Equivalence: expr1 ⇔ expr2
  - Entailment: expr1 ⇒ expr2 (more general)
- Example expressions:
  - Dow ends up
  - Dow climbs 255
  - The Dow Jones Industrial Average closed up 255
  - Stock market hits a record high
  - Dow gains 255 points
  - All major stock markets surged
Applied Textual Entailment: Definition
- Directional relation between two text fragments, Text (t) and Hypothesis (h):
  - “t entails h (t ⇒ h) if, typically, a human reading t would infer that h is most likely true”
- Operational (applied) definition:
  - As in NLP applications
  - Assuming common background knowledge
Why textual entailment?
- Unified modeling of semantic inference
  - As required by various applications (IR, IE, QA, MDS)
- Text-to-text mapping
  - Independent of concrete semantic representation
Goals for RTE-2
- Support research progress
- More “realistic” examples
  - Input from common benchmarks
  - Output from real systems
  - Shows the potential of entailment to improve performance across applications
- Improve data collection and annotation
  - Revised and expanded guidelines
  - Most pairs triply annotated
- Provide linguistic processing
The RTE-2 Dataset
Overview
- 1600 pairs: 800 development; 800 test
- Followed RTE-1 setting
  - t is 1-2 sentences, h is one (shorter) sentence
  - 50%-50% positive-negative split in all subtasks
- Focused on primary applications
  - IE, IR, QA, (Multi-document) Summarization
Collecting IE pairs
- Motivation: a sentence containing a target relation should entail an instantiated template.
- Pairs were generated in several ways:
  - Outputs of IE systems: for ACE-2004 and MUC-4 relations
  - Manually: for ACE-2004 and MUC-4 relations; for additional relations in news domain
Collecting IR pairs
- Motivation: relevant documents should entail a given “propositional” query.
- Hypotheses are propositional IR queries, adapted and simplified from TREC and CLEF
- Texts selected from documents retrieved by different search engines
Collecting QA pairs
- Motivation: a passage containing the answer slot filler should entail the corresponding answer statement.
- QA systems were given TREC and CLEF questions.
- Hypothesis generated by “plugging” the system answer term into the affirmative form of the question
- Texts correspond to the candidate answer passages
Collecting SUM (MDS) pairs
- Motivation: identifying redundant phrases
- Using web document clusters and system summary
- Picking sentences having high lexical overlap with summary
- In final pairs:
  - Texts are original sentences (usually from summary)
  - Hypotheses:
    - Positive pairs: simplify h until entailed by t
    - Negative pairs: simplify h similarly
Creating the final dataset
- Average pairwise inter-judge agreement: 89.2%
  - Average Kappa 0.78 (substantial agreement)
  - Better than RTE-1
- Removed 18.2% of pairs due to disagreement (3-4 judges)
- Disagreement example:
  - (t) Women are under-represented at all political levels...
  - (h) Women are poorly represented in parliament.
- Additional review removed 25.5% of pairs: too difficult / vague / redundant
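The agreement figures above can be related to each other through a chance-corrected agreement measure. The following is a minimal sketch, assuming two judges, binary YES/NO labels, and Cohen's kappa; the slides do not say which kappa variant was used, and the judgment lists are made-up illustration data, not RTE-2 annotations.

```python
from collections import Counter

def observed_agreement(a, b):
    """Fraction of items on which two judges gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = observed_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability that both judges independently pick the same label.
    p_e = sum((count_a[lab] / n) * (count_b[lab] / n) for lab in set(a) | set(b))
    if p_e == 1.0:
        return 1.0  # both judges used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical judgments for 10 pairs (illustration only).
judge_1 = ["YES", "YES", "NO", "NO", "YES", "NO", "YES", "NO", "YES", "NO"]
judge_2 = ["YES", "YES", "NO", "NO", "YES", "NO", "YES", "NO", "NO", "NO"]

print(observed_agreement(judge_1, judge_2))
print(cohen_kappa(judge_1, judge_2))
```

If chance agreement were exactly 0.5 (an assumption suggested by the 50%-50% split, not stated in the slides), an observed agreement of 89.2% would give (0.892 - 0.5) / 0.5 ≈ 0.78, which is consistent with the reported average Kappa.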
RTE-2 Systems
Submissions
- 23 groups: 35% growth compared to RTE-1
- 41 runs
- 13 groups participated for the first time (RTE-1 + RTE-2 = 30 groups overall)

Country        Number of Groups
USA            9
Italy          3.5
Spain          3
Netherlands    2
UK             1.5
Australia      1
Canada         1
Ireland        1
Germany        1
Methods and Approaches
- Measure similarity between t and h (coverage of h by t):
  - Lexical overlap (unigram, N-gram, subsequence)
  - Lexical substitution (WordNet, statistical)
  - Syntactic matching/transformations
  - Lexical-syntactic variations (“paraphrases”)
  - Semantic role labeling and matching
  - Global similarity parameters (e.g. negation, modality)
- Cross-pair similarity
- Detect mismatch (for non-entailment)
- Logical inference
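The simplest measure in the list above, lexical overlap as coverage of h by t, can be sketched as follows. This is an illustrative sketch only, not any participant's system: the tokenizer is deliberately crude, the function names are hypothetical, and the hypothesis string is invented for the example (the text is one of the expressions from the earlier slide).

```python
import re

def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def coverage(t, h, n=1):
    """Fraction of h's distinct n-grams that also occur in t."""
    h_grams = set(ngrams(tokenize(h), n))
    if not h_grams:
        return 0.0
    t_grams = set(ngrams(tokenize(t), n))
    return len(h_grams & t_grams) / len(h_grams)

t = "The Dow Jones Industrial Average closed up 255"
h = "Dow closed up 255"  # hypothetical hypothesis, for illustration

print(coverage(t, h, n=1))  # unigram coverage
print(coverage(t, h, n=2))  # bigram coverage
```

Here every word of h occurs in t, so unigram coverage is complete, while bigram coverage is lower because "Dow closed" is not contiguous in t. A system would typically compare such a score against a threshold or pass it on as a feature.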
Dominant approach: Supervised Learning
- Features model both similarity and mismatch
- Train on development set and auxiliary t-h corpora
- Pipeline: (t, h) → features (lexical, n-gram, syntactic, semantic, global) → feature vector → classifier → YES / NO
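The pipeline on this slide (pair, features, feature vector, classifier, YES/NO) can be sketched end to end in a few lines. This is a toy sketch under loud assumptions: the two features (word coverage and a negation mismatch flag), the nearest-centroid classifier, and the four training pairs are all invented for illustration. Participating systems used far richer features and their own choice of learners.

```python
import re

NEGATIONS = {"not", "no", "never"}  # assumed, tiny negation lexicon

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def features(t, h):
    """Feature vector for a (t, h) pair: [word coverage of h by t, negation mismatch]."""
    t_tok, h_tok = set(tokenize(t)), set(tokenize(h))
    cov = len(h_tok & t_tok) / len(h_tok) if h_tok else 0.0
    mismatch = float(bool(t_tok & NEGATIONS) != bool(h_tok & NEGATIONS))
    return [cov, mismatch]

def train_centroids(examples):
    """examples: list of (t, h, label). Returns {label: mean feature vector}."""
    sums, counts = {}, {}
    for t, h, label in examples:
        vec = features(t, h)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def classify(centroids, t, h):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    vec = features(t, h)
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Hypothetical "development set" (illustration only).
dev = [
    ("Dow gains 255 points", "Dow gains points", "YES"),
    ("Stock market hits a record high", "Stock market hits a record", "YES"),
    ("The stock market did not rise", "The stock market rose", "NO"),
    ("Dow climbs 255", "All major stock markets surged", "NO"),
]
model = train_centroids(dev)

print(classify(model, "The Dow Jones Industrial Average closed up 255", "Dow closed up 255"))
print(classify(model, "Dow did not close up", "Dow closed up"))
```

The second feature illustrates the point that features model mismatch as well as similarity: high word overlap alone would wrongly favor YES for the negated pair.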
Evaluation Measures
- Main task: classification
  - Compare to entailment judgment
  - Evaluation criterion: accuracy
  - Baseline: 60%
    - Simple lexical overlap system, used as baseline in [Zanzotto et al.]
- Secondary task: ranking
  - Sorted by entailment confidence
  - Evaluation criterion: average precision
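Both evaluation criteria are short to state in code. The sketch below assumes the usual definition of average precision for a ranked list (the mean of the precision values measured at the rank of each positive pair); the function names and the example labels are hypothetical.

```python
def accuracy(predictions, gold):
    """Fraction of pairs whose predicted judgment matches the gold judgment."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def average_precision(ranked_gold):
    """ranked_gold: gold labels (True = entailment), sorted by decreasing
    system confidence that the pair is an entailment."""
    hits, total = 0, 0.0
    for rank, is_positive in enumerate(ranked_gold, start=1):
        if is_positive:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0

# Hypothetical system output (illustration only).
predictions = ["YES", "NO", "YES", "NO"]
gold = ["YES", "NO", "NO", "NO"]
print(accuracy(predictions, gold))

ranked_gold = [True, True, False, True, False, False]
print(average_precision(ranked_gold))
```

Accuracy only looks at the final YES/NO decisions, while average precision rewards systems that place true entailments near the top of the confidence ranking, so the two measures can order systems differently.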
Results

First Author (Group)       Accuracy       Average Precision
Hickl (LCC)                75.4%          80.8%
Tatu (LCC)                 73.8%          71.3%
Zanzotto (Milan & Rome)    63.9%          64.4%
Adams (Dallas)             62.6%          62.8%
Bos (Rome & Leeds)         61.6%          66.9%
11 groups                  58.1%-60.5%
7 groups                   52.9%-55.6%

Average: 60%
Median: 59%
Analysis
- For the first time: deep methods (semantic/syntactic/logical) clearly outperform shallow methods (lexical/n-gram)
  - Cf. Kevin Knight’s invited talk at EACL, titled “Isn’t Linguistic Structure Important, Asked the Engineer”
- Still, most systems based on deep analysis did not score significantly better than the lexical baseline
Why?
- System reports point in two directions:
  - Lack of knowledge (syntactic transformation rules, paraphrases, lexical relations, etc.)
  - Lack of training data
- It seems that systems that coped better with these issues performed best:
  - Hickl et al.: acquisition of large entailment corpora for training
  - Tatu et al.: large knowledge bases (linguistic and world knowledge)
Open Questions
- Are knowledge and training data more important than inference/matching method?
- Or perhaps, given more knowledge and training data, the difference between inference methods will become more apparent?
Per-task analysis

Task     Average Accuracy    Best Result
SUM      67.9%               84.5%
IR       60.8%               74.5%
QA       58.2%               70.5%
IE       52.2%               73.0%
Total    59.8%               75.4%

- Some systems trained per-task
Some suggested research directions
- Acquiring larger entailment corpora
- Beyond parameter tuning: discovering needed linguistic and world knowledge
- Manual knowledge engineering for concise knowledge
  - E.g. syntactic transformations, logical axioms
- Further exploration of global information
- Principled framework for fusing information levels
  - Are we happy with bags of features?
Conclusions
- RTE-2 introduced a more realistic dataset, based mostly on system outputs
- Participation shows growing interest in the textual entailment framework
- Accuracy improvements are very encouraging
- Many interesting new ideas and approaches
Acknowledgments
- Funding: PASCAL Network of Excellence
- PASCAL challenges program managers: Michele Sebag, Florence d’Alche-Buc, Steve Gunn
- Workshop local organizer: Rodolfo Delmonte
- Contributing systems:
  - IE: NYU, IBM, ITC-irst
  - QA: AnswerBus, LCC
  - IR: Google, Yahoo, MSN
  - SUM: NewsBlaster (Columbia), NewsInEssence (U. Michigan)
- Datasets: TREC, TREC-QA, CLEF, MUC, ACE
- Annotation: Malky Rabinowitz, Dana Mills, Ruthie Mandel, Errol Hayman, Vanessa Sandrini, Alessandro Vallin, Elizabeth Lima, Jeff Stevenson, Amy Muia, The Butler Hill Group
- Advice: Dan Roth
- Special thanks: Oren Glickman

Enjoy the workshop!