The Second PASCAL Recognising Textual Entailment Challenge
Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, Idan Szpektor
Bar-Ilan University, CELCT, ITC-irst, Microsoft Research, MITRE

Variability of Semantic Expression
Model variability as relations between text expressions:
- Equivalence: expr1 ⇔ expr2
- Entailment: expr1 ⇒ expr2 (the more general relation)
Example phrasings of the same underlying event: "Dow ends up" / "Dow climbs 255" / "The Dow Jones Industrial Average closed up 255" / "Dow gains 255 points" / "Stock market hits a record high" / "All major stock markets surged"

Applied Textual Entailment: Definition
Directional relation between two text fragments, Text (t) and Hypothesis (h):
t entails h (t ⇒ h) if, typically, a human reading t would infer that h is most likely true.
Operational (applied) definition:
- As in NLP applications
- Assuming common background knowledge

Why textual entailment?
- Unified modeling of semantic inference, as required by various applications (IR, IE, QA, MDS)
- Text-to-text mapping
- Independent of any concrete semantic representation

Goals for RTE-2
- Support research progress
- More "realistic" examples:
  - Input from common benchmarks
  - Output from real systems
  - Shows entailment's potential to improve performance across applications
- Improve data collection and annotation:
  - Revised and expanded guidelines
  - Most pairs triply annotated
- Provide linguistic processing

The RTE-2 Dataset

Overview
- 1600 pairs: 800 development, 800 test
- Followed the RTE-1 setting: t is 1-2 sentences, h is one (shorter) sentence
- 50%-50% positive-negative split in all subtasks
- Focused on primary applications: IE, IR, QA, (multi-document) Summarization
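
For concreteness, each example in the released dataset is an XML pair carrying the gold judgment and the application subtask it came from. The minimal sketch below, using only Python's standard library, shows the general shape; treat the exact attribute names as illustrative rather than as a specification of the release format.

    import xml.etree.ElementTree as ET

    # Illustrative corpus fragment; attribute names are assumptions.
    SAMPLE = """<entailment-corpus>
      <pair id="1" entailment="YES" task="IR">
        <t>The Dow Jones Industrial Average closed up 255.</t>
        <h>Dow ends up.</h>
      </pair>
    </entailment-corpus>"""

    def load_pairs(xml_text):
        """Yield (id, task, text, hypothesis, label) tuples from an RTE-style corpus."""
        root = ET.fromstring(xml_text)
        for pair in root.iter("pair"):
            yield (pair.get("id"), pair.get("task"),
                   pair.findtext("t").strip(), pair.findtext("h").strip(),
                   pair.get("entailment"))

    for pid, task, t, h, label in load_pairs(SAMPLE):
        print(pid, task, label, "| t:", t, "| h:", h)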

Collecting IE pairs
Motivation: a sentence containing a target relation should entail an instantiated template.
Pairs were generated in several ways:
- Outputs of IE systems, for ACE-2004 and MUC-4 relations
- Manually, for ACE-2004 and MUC-4 relations and for additional relations in the news domain

Collecting IR pairs
Motivation: relevant documents should entail a given "propositional" query.
- Hypotheses are propositional IR queries, adapted and simplified from TREC and CLEF
- Texts were selected from documents retrieved by different search engines

Collecting QA pairs
Motivation: a passage containing the answer slot filler should entail the corresponding answer statement.
- QA systems were given TREC and CLEF questions
- Hypotheses were generated by "plugging" the system's answer term into the affirmative form of the question
- Texts correspond to the candidate answer passages
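
As a toy illustration of the "plugging" step (a hypothetical simplification; the actual pairs were built from TREC/CLEF questions and real QA system output):

    # Hypothetical simplification: turn a question into an affirmative
    # template with a slot, then plug in a candidate answer term.
    def make_hypothesis(affirmative_template, answer):
        """E.g. 'X invented the telephone' + a QA system's answer term."""
        return affirmative_template.replace("X", answer)

    question = "Who invented the telephone?"
    template = "X invented the telephone"   # affirmative form, written by hand here
    answer = "Alexander Graham Bell"        # candidate answer from a QA system
    print(make_hypothesis(template, answer))
    # -> "Alexander Graham Bell invented the telephone"
    # The text t is then the candidate answer passage the system retrieved.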

Collecting SUM (MDS) pairs
Motivation: identifying redundant phrases.
- Used web document clusters and a system summary
- Picked sentences having high lexical overlap with the summary
In the final pairs:
- Texts are original sentences (usually from the summary)
- Hypotheses: for positive pairs, h was simplified until entailed by t; for negative pairs, h was simplified similarly
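
The overlap-based sentence picking can be sketched as follows. The tokenization and the idea of ranking candidates by overlap are assumptions; the slides do not spell out the exact procedure or threshold.

    def lexical_overlap(sentence, summary):
        """Fraction of the sentence's word types that also appear in the summary."""
        s = set(sentence.lower().split())
        ref = set(summary.lower().split())
        return len(s & ref) / len(s) if s else 0.0

    summary = "the storm forced thousands of residents to evacuate the coast"
    candidates = [
        "thousands of residents were forced to evacuate as the storm neared the coast",
        "the mayor gave a press conference on tuesday",
    ]
    # Keep high-overlap sentences as candidate material for SUM pairs.
    for c in sorted(candidates, key=lambda c: lexical_overlap(c, summary), reverse=True):
        print(round(lexical_overlap(c, summary), 2), c)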

Creating the final dataset
- Average pairwise inter-judge agreement: 89.2%
- Average Kappa: 0.78 (substantial agreement), better than RTE-1
- Removed 18.2% of pairs due to disagreement (among 3-4 judges)
- Disagreement example:
  (t) Women are under-represented at all political levels...
  (h) Women are poorly represented in parliament.
- An additional review removed 25.5% of pairs that were too difficult, vague, or redundant
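
For reference, Cohen's kappa corrects raw agreement for agreement expected by chance, kappa = (P_o - P_e) / (1 - P_e), and 0.78 falls in the commonly cited "substantial agreement" band. A minimal sketch with made-up YES/NO judgments from two annotators (scikit-learn assumed):

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical judgments from two annotators over ten pairs.
    judge_a = ["YES", "YES", "NO", "YES", "NO", "NO", "YES", "YES", "NO", "YES"]
    judge_b = ["YES", "YES", "NO", "YES", "NO", "YES", "YES", "YES", "NO", "YES"]

    raw = sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)
    print("raw agreement:", raw)                          # 0.9
    print("kappa:", cohen_kappa_score(judge_a, judge_b))  # ~0.78, chance-corrected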

RTE-2 Systems

Submissions
- 23 groups, a 35% growth compared to RTE-1
- 41 runs
- 13 groups participated for the first time (30 distinct groups across RTE-1 and RTE-2)

Country        Groups
USA            9
Italy          3.5
Spain          3
Netherlands    2
UK             1.5
Australia      1
Canada         1
Ireland        1
Germany        1

(The fractional counts presumably reflect a group with affiliations in two countries.)

Methods and Approaches
- Measure similarity between t and h (coverage of h by t):
  - Lexical overlap (unigram, N-gram, subsequence)
  - Lexical substitution (WordNet, statistical)
  - Syntactic matching/transformations
  - Lexical-syntactic variations ("paraphrases")
  - Semantic role labeling and matching
  - Global similarity parameters (e.g., negation, modality)
- Cross-pair similarity
- Detect mismatch (for non-entailment)
- Logical inference

Dominant approach: Supervised Learning
- Features model both similarity and mismatch
- Train on the development set and auxiliary t-h corpora
- Pipeline: (t, h) → features (lexical, n-gram, syntactic, semantic, global) → feature vector → classifier → YES/NO
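
A minimal sketch of this pipeline, with a two-feature stand-in for the full feature set. Everything here (features, training pairs, classifier choice) is illustrative, not any participant's actual system; scikit-learn is assumed.

    import re
    from sklearn.linear_model import LogisticRegression

    def tokens(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def features(t, h):
        """Toy feature vector: lexical coverage of h by t, plus a length ratio."""
        t_w, h_w = tokens(t), tokens(h)
        coverage = len(h_w & t_w) / len(h_w) if h_w else 0.0
        return [coverage, len(h_w) / max(len(t_w), 1)]

    # Hypothetical development pairs (t, h, label); real systems trained on the
    # 800-pair development set plus auxiliary t-h corpora.
    dev = [
        ("Dow gains 255 points.", "The Dow fell sharply.", 0),
        ("The Dow Jones closed up 255.", "The Dow Jones closed up.", 1),
        ("Stock markets surged on Monday.", "Markets fell sharply.", 0),
        ("Women are poorly represented in parliament.", "Women are poorly represented.", 1),
    ]
    clf = LogisticRegression().fit(
        [features(t, h) for t, h, _ in dev], [y for _, _, y in dev])

    print(clf.predict([features("The Dow Jones Industrial Average closed up 255.",
                                "The Dow Jones closed up.")]))  # expect [1]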

Evaluation Measures
- Main task: classification
  - Compared against the gold entailment judgment
  - Evaluation criterion: accuracy
  - Baseline: 60%, from a simple lexical overlap system, used as a baseline in [Zanzotto et al.]
- Secondary task: ranking
  - Pairs sorted by entailment confidence
  - Evaluation criterion: average precision
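
Both criteria are straightforward to compute from a system's hard decisions plus its confidence scores; the sketch below uses scikit-learn with made-up outputs.

    from sklearn.metrics import accuracy_score, average_precision_score

    # Hypothetical gold labels and system output: a hard YES/NO decision
    # plus a confidence score used to rank pairs for the secondary task.
    gold       = [1, 0, 1, 1, 0, 0, 1, 0]
    predicted  = [1, 0, 1, 0, 0, 1, 1, 0]
    confidence = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]

    print("accuracy:", accuracy_score(gold, predicted))                     # 0.75
    print("average precision:", average_precision_score(gold, confidence))  # 0.95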

Results

First Author (Group)      Accuracy       Average Precision
Hickl (LCC)               75.4%          80.8%
Tatu (LCC)                73.8%          71.3%
Zanzotto (Milan & Rome)   63.9%          64.4%
Adams (Dallas)            62.6%          62.8%
Bos (Rome & Leeds)        61.6%          66.9%
11 groups                 58.1%-60.5%    -
7 groups                  52.9%-55.6%    -

Average accuracy: 60%; median: 59%

Analysis
- For the first time, deep methods (semantic/syntactic/logical) clearly outperform shallow methods (lexical/n-gram)
  - Cf. Kevin Knight's EACL invited talk, "Isn't Linguistic Structure Important, Asked the Engineer"
- Still, most systems based on deep analysis did not score significantly better than the lexical baseline

Why?
System reports point in two directions:
- Lack of knowledge (syntactic transformation rules, paraphrases, lexical relations, etc.)
- Lack of training data
It seems that the systems that coped best with these issues performed best:
- Hickl et al.: acquisition of large entailment corpora for training
- Tatu et al.: large knowledge bases (linguistic and world knowledge)

Open Questions
- Are knowledge and training data more important than the inference/matching method?
- Or, given more knowledge and training data, will the differences between inference methods become more apparent?

Per-task analysis

Task    Average Accuracy    Best Result
SUM     67.9%               84.5%
IR      60.8%               74.5%
QA      58.2%               70.5%
IE      52.2%               73.0%
Total   59.8%               75.4%

Some systems were trained per-task.

Some suggested research directions
- Acquiring larger entailment corpora
- Beyond parameter tuning: discovering the needed linguistic and world knowledge
- Manual knowledge engineering for concise knowledge (e.g., syntactic transformations, logical axioms)
- Further exploration of global information
- A principled framework for fusing information levels: are we happy with bags of features?

Conclusions
- RTE-2 introduced a more realistic dataset, based mostly on system outputs
- Participation shows growing interest in the textual entailment framework
- Accuracy improvements are very encouraging
- Many interesting new ideas and approaches

Acknowledgments
- Funding: PASCAL Network of Excellence
- PASCAL challenges program managers: Michele Sebag, Florence d'Alche-Buc, Steve Gunn
- Workshop local organizer: Rodolfo Delmonte
- Contributing systems:
  - IE: NYU, IBM, ITC-irst
  - QA: AnswerBus, LCC
  - IR: Google, Yahoo, MSN
  - SUM: NewsBlaster (Columbia), NewsInEssence (U. Michigan)
- Datasets: TREC, TREC-QA, CLEF, MUC, ACE
- Annotation: Malky Rabinowitz, Dana Mills, Ruthie Mandel, Errol Hayman, Vanessa Sandrini, Allesandro Valin, Elizabeth Lima, Jeff Stevenson, Amy Muia, the Butler Hill Group
- Advice: Dan Roth
- Special thanks: Oren Glickman

Enjoy the workshop!