Evaluating Algorithms for GRE (Going beyond Toy Domains)
Ielka van der Sluis, Albert Gatt, Kees van Deemter
University of Aberdeen, Scotland, UK
RANLP, Borovets, 27-29 Sept. 2007

Outline
– GRE: Generation of Referring Expressions
– TUNA project: corpus and annotation
– Evaluation of algorithms
  – Furniture domain
  – People domain
– [Evaluation in the real world: STEC]

TUNA project (ended Feb. 2007)
TUNA: Towards a UNified Algorithm for generating referring expressions.
1. Extend coverage of GRE algorithms (plurals, negation, gradable properties, …)
2. Improve empirical foundations of GRE
Focus on:
– Content Determination
– "first mention" NPs (no anaphora!)

TUNA results
Elsewhere:
– Reference to sets (e.g., Gatt 2006, 2007)
– Gradable/vague properties (van Deemter 2006)
– Pointing (van der Sluis & Krahmer 2007)
– Large domains (Paraboni et al. 2007)
This talk: empirical issues
– Testing classic algorithms
– Method: compute similarity to human-generated NPs

Method (overview)
An elicitation experiment leads to a transparent corpus of referring expressions:
– referent and distractors are known
– domain attributes are known
Transparent corpora can be used for many purposes.
This talk: compare some classic algorithms by
– giving each algorithm the same input as the human subjects
– computing how similar the algorithm's output is to the subjects' output
– counting semantic content only
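For concreteness, a minimal sketch of what one entry of such a transparent corpus might look like. The record type and field names here are hypothetical illustrations; the actual TUNA corpus uses its own annotation scheme.

```python
# Hypothetical record illustrating a "transparent" corpus entry:
# the referent, the distractors, and the annotated domain attributes
# are all explicitly known for every elicited description.
from dataclasses import dataclass

@dataclass
class Trial:
    referent: dict            # e.g. {'type': 'desk', 'colour': 'green', 'size': 'large'}
    distractors: list[dict]   # the other domain objects, same representation
    human_descriptions: list[set]  # each description as a set of (attribute, value) pairs
```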

Elicitation experiment
Furniture (simple domain):
– TYPE, COLOUR, SIZE, ORIENTATION
People (complex domain):
– nine annotated properties in total
Location:
– vertical location (Y-DIMENSION)
– horizontal location (X-DIMENSION)
Example descriptions:
– "the green desk facing backwards"
– "the sofa and the desk which are red"
– "the young man with a white shirt"
– "the man with the funny haircut"
– "the man on the left"
– "the chair in the top right"

Furniture trial

People trial

Corpus setup
Each corpus was carefully balanced, e.g. between singulars and plurals.
Between-subjects design:
– -Location: subjects discouraged from using locative expressions
– +Location: subjects not discouraged
– -FaultCritical: subjects could correct their utterances
– +FaultCritical: subjects could not correct their utterances
After discounting outliers and (self-reported) non-fluent speakers, 45 subjects were left.

Experiment design: Furniture (-Location)
18 trials (C = Colour, O = Orientation, S = Size):
– 1 referent: minimal identification uses {c}, {o}, {s}, {c,o}, {c,s}, or {o,s} [6 trials]
– 2 "similar" referents: {c}, {o}, {s}, {c,o}, {c,s}, or {o,s} [6 trials]
– 2 "dissimilar" referents: {c}, {o}, {s}, {c,o}, {c,s}, or {o,s} [6 trials]

Classic GRE algorithms
– Full Brevity (FB; Dale 1989): generation of a minimal description
– Greedy Algorithm (GR; Dale 1989): always add the property that removes the most distractors
– Incremental Algorithm (IA; Dale & Reiter 1995): add the next useful property from an ordered list of properties (the "Preference Order", PO); a sketch follows below
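For concreteness, here is a minimal Python sketch of the IA's content-determination step. The representation of domain objects as attribute-value dicts is an assumption of this sketch, not the authors' implementation.

```python
# Minimal sketch of the Incremental Algorithm (Dale & Reiter 1995),
# restricted to content determination. Domain objects are assumed to
# be dicts mapping attribute names to values.

def incremental_algorithm(referent, distractors, preference_order):
    description = []                 # chosen (attribute, value) pairs
    remaining = list(distractors)    # distractors not yet ruled out
    for attribute in preference_order:
        value = referent.get(attribute)
        if value is None:
            continue                 # referent has no value for this attribute
        # A property is "useful" if it rules out at least one distractor.
        if any(d.get(attribute) != value for d in remaining):
            description.append((attribute, value))
            remaining = [d for d in remaining if d.get(attribute) == value]
        if not remaining:
            break
    # Following the talk, TYPE is always included, useful or not.
    if referent.get('type') and ('type', referent['type']) not in description:
        description.append(('type', referent['type']))
    # None signals that no distinguishing description was found.
    return description if not remaining else None
```

For example, with preference order ['colour', 'orientation', 'size'], a large red sofa among a small red desk and a large green chair comes out as {colour: red, size: large, type: sofa}, i.e. "the large red sofa".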

Other evaluation studies
– Jordan 2000, Jordan & Walker 2005: more than just identification (Jordan 2000)
– Siddharthan & Copestake 2004: references in linguistic context
– Gupta & Stent 2005: realisation mixed with Content Determination
– Viethen & Dale 2006: only Colour and Location

Other evaluation studies
General limitations:
– limited numbers of subjects/referents
– few attempts at balancing the corpus (e.g., Viethen & Dale 2006 let subjects decide what to refer to)
– IA: no teasing apart of preference orders

Extensions to the classics
Plurality (van Deemter 2002):
– extend each algorithm to search through disjunctions of increasing length
Location (van Deemter 2006):
– locatives treated as gradable: "the leftmost table/person"
– e.g., suppose the referent x is located in column 3
  ⇒ "x is left of column 4", "x is left of column 5", …
  ⇒ "x is right of column 2", "x is right of column 1", …
  (see the sketch below)
Type:
– people tend to use TYPE (Dale & Reiter 1995)
– here: all algorithms added TYPE
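A minimal sketch of this gradable treatment of location, assuming a hypothetical five-column grid; the function name and representation are illustrative, not from the paper.

```python
# Turn an object's column index into gradable X-DIMENSION properties:
# one inequality per other column, as in "x is left of column 4".
def locative_properties(column: int, n_columns: int = 5) -> list[tuple[str, str]]:
    props = []
    for other in range(1, n_columns + 1):
        if column < other:
            props.append(('x-dimension', f'left of column {other}'))
        elif column > other:
            props.append(('x-dimension', f'right of column {other}'))
    return props

# An object in column 3 of 5 is "left of" columns 4 and 5,
# and "right of" columns 1 and 2:
print(locative_properties(3))
```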

Evaluation aims
Hypothesis in Dale & Reiter 1995:
– the IA resembles human output most
Our main questions:
– Is this true?
– How important are the IA's parameters (the PO)?
More generally, assess the 'quality' of classic GRE algorithms:
– calculate the average match between the description generated by an algorithm and the descriptions produced by people (for the same referent)

Evaluation metric
Dice coefficient:

    dice(A, B) = 2 × |A ∩ B| / (|A| + |B|)

i.e., twice the number of properties the two descriptions share, divided by their total number of properties.
– A coefficient of 1 indicates identical sets; 0 means no properties in common.
– We also used Dice to measure agreement between the annotators of the corpus.
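The metric is a one-liner over property sets; a sketch, using the same assumed (attribute, value) representation as above:

```python
def dice(a: set, b: set) -> float:
    """Dice coefficient between two sets of (attribute, value) properties."""
    if not a and not b:
        return 1.0  # two empty descriptions are trivially identical
    return 2 * len(a & b) / (len(a) + len(b))

# E.g. a system description sharing two of a human description's three properties:
human = {('type', 'desk'), ('colour', 'green'), ('orientation', 'back')}
system = {('type', 'desk'), ('colour', 'green')}
assert abs(dice(human, system) - 0.8) < 1e-9  # 2*2 / (3+2)
```

Averaging these scores over all human descriptions of the same referent, and over all trials, yields the per-algorithm scores reported in the results below.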

(Assumptions behind Dice)
– Deletion of a property is slightly worse than addition of a property. (Against a three-property human description, omitting one property scores 2·2/(3+2) = 0.80, while adding a spurious fourth scores 2·3/(3+4) ≈ 0.86.)
– The discriminatory power of a description does not matter.
– All properties are equidistant.
See Gatt & van Deemter 2007, "Content Determination in GRE: evaluating the evaluator".

Evaluation (I): Furniture
Which preference orders for the IA? Psycholinguistic evidence:
– COLOUR >> {ORIENTATION, SIZE} (Pechmann 1989; Eikmeyer & Ahlsen 1996; Belke & Meyer 2002)
– Y-DIMENSION >> X-DIMENSION (Bryant et al. 1992; Arts 2004)
Split data: +LOCATION vs -LOCATION.
This talk: focus on -LOCATION (approx. 800 descriptions).
Compare algorithms to a randomized IA (RAND).

Furniture: -LOCATION
[Results chart; surviving labels: "Significant", "FB/GR"]

Beyond Toy Domains
More on the Furniture corpus: Gatt et al. (ENLG 2007).
With complex real-world objects:
– many different attributes can be used
– the number of POs explodes
– few psycholinguistic precedents
People domain attributes:
– {hasBeard, hasGlasses, age, hasTie, hasSuit, hasShirt, hasHair, hairColour, orientation}
– 9 attributes, so 9! = 362,880 possible POs

IA: Preference Orders for the People Domain
Little psycholinguistic evidence for choosing between all possible POs.
Focus on the most frequent attributes: G = hasGlasses, B = hasBeard, H = hasHair, C = hairColour.
– Assumption: H and B must precede C
– This leaves us with eight POs: {GBHC, GHBC, HBGC, HBCG, HGBC, BHGC, BHCG, BGHC} (see the snippet below)
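These eight orders are exactly the permutations of {G, B, H, C} in which both H and B precede C, which can be checked mechanically; an illustrative snippet, not from the paper:

```python
from itertools import permutations

# All orders of G, B, H, C in which both H and B come before C.
orders = [''.join(p) for p in permutations('GBHC')
          if p.index('H') < p.index('C') and p.index('B') < p.index('C')]
assert sorted(orders) == sorted(
    ['GBHC', 'GHBC', 'HBGC', 'HBCG', 'HGBC', 'BHGC', 'BHCG', 'BGHC'])
```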

Preference Orders and frequency
For attributes other than {G, C, H, B}, we let corpus frequency determine the order.

    Attribute     Mean (std)   Sum
    type
    hasGlasses
    hasBeard
    hairColour
    hasHair
    orientation   .2173
    age           .1034
    hasTie        .0412
    hasSuit       .014
    hasShirt      .013

E.g., IA-GBHC uses ⟨type, G, B, H, C, age, hasTie, hasSuit, hasShirt⟩ as its PO.

Results: People Domain
[Results chart; surviving labels: "IA-BASE", "GR", "Significant", "Significant by subjects"]

Results: People domain
IA-BASE performs very badly now.
So much for the best IAs, which start with {B, H, G, C} and end with the remaining attributes in frequency order. Some of these did much worse:
– IA-BHCG had Dice = 0.6, making it significantly worse (by subjects) than GR!

Summary
– The People domain gives much lower Dice scores than the Furniture domain.
– The difference between "good" and "bad" POs was enormous in the People domain.

Summary
– The "Incremental Algorithm" (IA) is not an algorithm but a class of algorithms.
– The best IA beats all other algorithms, but the worst is very bad...
– GR performs remarkably well.
How to choose a suitable PO?
– Furniture: few attributes; psycholinguistic precedent. Still, there is variation.
– People: more attributes; no precedents. Variation even greater!

Discussion
Suppose you want to build a GRE algorithm for a new and complex domain, for which no transparent corpus is available.
– Psycholinguistic principles are unlikely to help you much.
– If the corpus is also not balanced, then frequency doesn't say much either…

Other uses of this method: STEC
– Summer 2007: first NLG Shared Task Evaluation Challenge (STEC)
– STEC involved GRE only, focussing on Content Determination
– 22 GRE algorithms were submitted and evaluated (6 teams)
– Reported at the UCNLG+MT workshop, Copenhagen, Sept. 2007

Other uses of this corpus: STEC
Each algorithm was compared with the TUNA corpus (minus a 40% training set):
– both Furniture and People domains
– Dice measured "humanlikeness"
– singulars only
Each algorithm was also tested in terms of identification time (by human readers).

Other uses of this corpus: STEC
Future STECs:
– beyond "first mention"
– beyond Content Determination
– more hearer-oriented experiments

STEC results
1. The more minimal the descriptions generated by these 22 systems were, the worse their Dice scores were.

No relation between humanlikeness and identification time:
– the best system in terms of Dice was worst-but-one in terms of identification time.
More research is needed on the different criteria for judging NLG output.

Thank you

Annotator agreement
Semantic markup was applied manually to all descriptions in the corpus. Two annotators were given a stratified random sample. Comparison used Dice.

                              mean    mode
    Furniture    (A/B)        0.89    1 (71.1%)
      Annotator A (A/us)      0.93    1 (74.4%)
      Annotator B (B/us)      0.92    1 (73%)
    People       (A/B)        0.89    1 (70%)
      Annotator A (A/us)      0.84    1 (41.1%)
      Annotator B (B/us)      0.78    1 (36.3%)