Baselines for Recognizing Textual Entailment
Ling 541 Final Project
Terrence Szymanski

What is Textual Entailment?
Informally: a text T entails a hypothesis H if the meaning of H can be inferred from the meaning of T.
Example:
- T: Profits nearly doubled to nearly $1.8 billion.
- H: Profits grew to nearly $1.8 billion.
- Entailment holds (is true).

Types of Entailment
For many entailments, H is simply a paraphrase of all or part of T.
Other entailments are less obvious:
- T: Jorma Ollila joined Nokia in 1985 and held a variety of key management positions before taking the helm in 1992.
- H: Jorma Ollila is the CEO of Nokia.
There is roughly 95% human agreement on entailment judgments.

The PASCAL RTE Challenge
First challenge held in 2005 (RTE1):
- 16 entries.
- System performances ranged from 50% to 59% accuracy.
- Wide array of approaches, using word overlap, synonymy/word distance, statistical lexical relations, dependency tree matching, and more.
The second challenge (RTE2) is underway.

What is BLEU?
BLEU was designed as a metric to measure the accuracy of machine-generated translations by comparing them to human-generated gold standards.
Scores are based on n-gram overlap (typically for n = 1, 2, 3, and 4), with a penalty for overly brief translations.
Can it be applied to RTE?
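To make the n-gram overlap idea concrete, here is a minimal sketch of the clipped n-gram precision that BLEU is built on, assuming whitespace tokenization; the function names are illustrative, and the example pair is the one from the earlier slide.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(test_tokens, ref_tokens, n):
    """Fraction of n-grams in the test string that also occur in the reference,
    with counts clipped to the reference counts (as in BLEU)."""
    test_counts = ngrams(test_tokens, n)
    ref_counts = ngrams(ref_tokens, n)
    if not test_counts:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in test_counts.items())
    return matched / sum(test_counts.values())

t = "Profits nearly doubled to nearly $1.8 billion.".split()
h = "Profits grew to nearly $1.8 billion.".split()
print(ngram_precision(h, t, 1))  # unigram precision of H against T, about 0.83
```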

Using the BLEU Algorithm for RTE
Proposed by Pérez & Alfonseca in RTE1:
- Use the traditional BLEU algorithm to capture n-gram overlap between T-H pairs.
- Find a cutoff score such that a BLEU score above the cutoff implies a TRUE entailment (otherwise FALSE).
- Roughly 50% accuracy: a simple baseline.
However, intuitively, the BLEU algorithm is not ideal for RTE:
- BLEU was designed for evaluating MT systems.
- BLEU could be adjusted to better suit the RTE task.
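Below is a hedged sketch of that baseline, assuming NLTK's sentence_bleu as the scorer (any standard BLEU implementation would do): T is treated as the reference, H as the candidate, and the pair is labelled TRUE whenever the score clears a cutoff. The zero cutoff anticipates the one found later for unmodified BLEU, and the example pair is from the slides.

```python
from nltk.translate.bleu_score import sentence_bleu

def bleu_entails(text, hypothesis, cutoff=0.0):
    """Predict TRUE entailment when BLEU(H against T) exceeds the cutoff."""
    score = sentence_bleu([text.split()], hypothesis.split())
    return score > cutoff

t = "Profits nearly doubled to nearly $1.8 billion."
h = "Profits grew to nearly $1.8 billion."
print(bleu_entails(t, h))  # True: the pair shares enough n-grams to clear a zero cutoff
```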

Modifying the BLEU Algorithm
Entailments are normally short, so it does not make sense to penalize them for being short.
BLEU uses a geometric mean to average the n-gram overlap for n = 1, 2, 3, and 4:
- If any value of n produces a zero score, the entire score is nullified.
Therefore: modify the algorithm to drop the brevity penalty and use a linear weighted average instead.

Modifying the BLEU Algorithm
Original BLEU: $\mathrm{BLEU} = b \cdot \exp\left( \sum_{i=1}^{N} w_i \log \frac{c_{\text{test,ref}}}{c_{\text{test}}} \right)$
Modified BLEU: $\mathrm{BLEU}_{\text{mod}} = \sum_{i=1}^{N} w_i \cdot \frac{c_{\text{test,ref}}}{c_{\text{test}}}$
where:
- $w_i$ is the weighting factor (universally set to 1/N),
- $b$ is the brevity factor (see paper for details),
- $c_{\text{test,ref}}$ is the count of n-grams appearing in both test and ref, and
- $c_{\text{test}}$ is the total count of n-grams appearing in test.
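A short sketch of the modified score, under the assumption that the n-gram precisions for n = 1..4 are combined with a plain average (w_i = 1/N) and the brevity factor is dropped, as described above; the function names and example pair are illustrative.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_bleu(test_tokens, ref_tokens, max_n=4):
    """Linear average of clipped n-gram precisions for n = 1..max_n; no brevity penalty."""
    total = 0.0
    for n in range(1, max_n + 1):
        test = ngram_counts(test_tokens, n)
        ref = ngram_counts(ref_tokens, n)
        matched = sum(min(count, ref[gram]) for gram, count in test.items())
        precision = matched / sum(test.values()) if test else 0.0
        total += precision / max_n  # w_i = 1/N
    return total

t = "Profits nearly doubled to nearly $1.8 billion.".split()
h = "Profits grew to nearly $1.8 billion.".split()
print(round(modified_bleu(h, t), 3))  # about 0.57, a graded score rather than all-or-nothing
```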

Performance Comparison
Ran both the unmodified and the modified BLEU algorithm on the RTE1 data sets:
- Used the development set to obtain the cutoff score.
- Used the test set as the evaluation data.

Cutoff Score for BLEU
The unmodified algorithm produces a high percentage of zero scores (67%).
Not surprisingly, the cutoff score is zero!

Cutoff Score for BLEU
Two equivalent cutoff scores were found: zero and one higher value. Both offer 53.8% accuracy, but the zero cutoff was used because it is a natural candidate.

Cutoff Score for Modified BLEU
Modified BLEU produces a continuum of scores, unlike the original BLEU.
We therefore need to find the optimal cutoff score, the one that maximizes accuracy on the development set (see the sketch below).
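One straightforward way to pick it, sketched below, is to sweep every score observed on the development set as a candidate threshold and keep the one with the highest accuracy against the gold TRUE/FALSE labels; the scores and labels in the example are made up for illustration.

```python
def best_cutoff(dev_scores, dev_labels):
    """Return (cutoff, accuracy) maximizing accuracy of the rule 'score > cutoff => TRUE'."""
    best = (0.0, 0.0)
    for candidate in sorted(set(dev_scores)):
        predictions = [score > candidate for score in dev_scores]
        accuracy = sum(p == g for p, g in zip(predictions, dev_labels)) / len(dev_labels)
        if accuracy > best[1]:
            best = (candidate, accuracy)
    return best

scores = [0.05, 0.18, 0.25, 0.40, 0.31]    # toy modified-BLEU scores on development pairs
labels = [False, False, True, True, True]  # toy gold entailment labels
print(best_cutoff(scores, labels))         # (0.18, 1.0): predict TRUE when score > 0.18
```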

Cutoff Score for Modified BLEU
The optimal cutoff score is found to be 0.221.

Validity of cutoff scores?
The original BLEU seems to have a good natural cutoff score of zero.
The modified BLEU optimal cutoff varies depending on the data set, although the value found here is acceptable (further data may be needed for optimization; the cutoff may also be task-specific).

Results on RTE1 Data
Original BLEU:
- Development set: cutoff score = zero, accuracy = 53.8%
- Test set: accuracy = 52.0%
Modified BLEU:
- Development set: cutoff score = 0.221, accuracy = 57.8%
- Test set: accuracy = 53.8%

Results on RTE2 Data
Original BLEU:
- Development set: cutoff score = zero, accuracy = 56.0%
- Test set: results pending
Modified BLEU:
- Development set: accuracy = 60.4% at one cutoff score; accuracy = 61.4% at a cutoff of 0.25
- Test set: results pending
The RTE2 test set will be released in January.

Comparison of Results
Accuracy scores for four systems: original BLEU, modified BLEU, Pérez & Alfonseca's implementation of BLEU, and the best submission to the RTE1 Challenge, compared on the RTE1 development set, the RTE1 test set, and the RTE2 development set (n/a where a system has no reported score). Modified BLEU is better than the other versions of BLEU, but nowhere near the best system performance.

End Results
The modified BLEU algorithm outperforms the original BLEU algorithm for RTE:
- Consistent 2-4% increase in accuracy.
Does this mean that modified BLEU is a candidate system for RTE applications?

NO: BLEU is a baseline algorithm
"Don't climb a tree to get to the moon."
BLEU (and other n-gram based methods) are good baselines, but they lack the potential for future improvement. Example:
- T: It is not the case that John likes ice cream.
- H: John likes ice cream.
- Perfect n-gram overlap, but the entailment is FALSE.
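The failure mode is easy to reproduce; the snippet below uses plain unigram overlap as a stand-in for any n-gram score, together with the 0.221 cutoff found earlier, and is purely illustrative.

```python
t = "It is not the case that John likes ice cream".lower().split()
h = "John likes ice cream".lower().split()

overlap = sum(1 for word in h if word in t) / len(h)
print(overlap)           # 1.0: every word of H appears in T
print(overlap > 0.221)   # True: an overlap-based system wrongly predicts entailment
```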

Future Improvements
There is potential to add word-similarity enhancements, such as synonym substitution.
Rather than thinking of these as enhancements to the BLEU algorithm, we should think of the BLEU algorithm as a baseline for measuring the benefit offered by such improvements, i.e. the performance of BLEU vs. the performance of BLEU after synonym substitution.
This lets us evaluate the benefit synonym substitution can have on a larger RTE system.
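As a rough illustration of how such an enhancement could be measured against the baseline, the sketch below maps synonyms in T and H onto a common form with a toy, hand-built table before computing a simple overlap score; the table, names, and scoring function are hypothetical and not part of any submitted system.

```python
SYNONYMS = {"grew": "doubled", "rose": "doubled"}   # toy synonym table, for illustration only

def normalize(tokens):
    """Map each token to its canonical synonym, if one is listed."""
    return [SYNONYMS.get(token, token) for token in tokens]

def unigram_overlap(test_tokens, ref_tokens):
    reference = set(ref_tokens)
    return sum(1 for token in test_tokens if token in reference) / len(test_tokens)

t = "Profits nearly doubled to nearly $1.8 billion.".split()
h = "Profits grew to nearly $1.8 billion.".split()
print(unigram_overlap(h, t))                        # baseline score, about 0.83
print(unigram_overlap(normalize(h), normalize(t)))  # 1.0 after synonym substitution
```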

Conclusions
The BLEU algorithm can be modified to better suit the RTE task:
- Modifications are theory-motivated: eliminate the brevity penalty, use a linear rather than geometric mean.
- Performance benefits: modified BLEU consistently has 2-4% higher accuracy.
Still, BLEU is only a baseline algorithm:
- It lacks the capacity to incorporate future developments.
- It can be used to measure the performance benefits of various enhancements.