Corpus-based evaluation of Referring Expression Generation. Albert Gatt, Ielka van der Sluis, Kees van Deemter. Department of Computing Science, University of Aberdeen.


Corpus-based evaluation of Referring Expression Generation. Albert Gatt, Ielka van der Sluis, Kees van Deemter. Department of Computing Science, University of Aberdeen.

Focus of this talk
● Generation of Referring Expressions (GRE)
● A very big part of this is Content Determination:
  ◦ Knowledge Base + intended referent (R) → search for distinguishing properties
  ◦ "Description" = a semantic representation
● Evaluation challenges:
  ◦ Semantically intensive
  ◦ Pragmatic issues: identify, inform, signal agreement... (cf. Jordan 2000, ...)
  ◦ "Human gold standard": one and only one standard per input?
  ◦ Evaluation metric: an all-or-none affair?

Outline of our proposal
● A large corpus of descriptions (2000+), constructed via a controlled experiment. Part of the TUNA Project.
● Semantic annotation.
● Balance.
● Expressive variety.
● Related proposals on human gold standards:
  ◦ M. Walker: Language Productivity Assumption
  ◦ J. Viethen: GRE resources are difficult to obtain from naturally occurring text.

Corpora and NLG: Transparency
● Requirements for a GRE evaluation corpus:
  ◦ Semantic transparency: linguistic realisation + semantic representation + domain
  ◦ Pragmatic transparency: human intention = algorithmic intention
● These requirements ensure that a match between the output of content determination and a corpus instance is assessed on a level playing field.
● Perhaps the same can be said of other Content Determination tasks.

Example
● "the large red sofa"
● "the large, bright red settee"
● "the red couch which is larger than the rest"
● All of the above are co-extensive.
● An algorithm may generate a logical form that "means" all of the above.
● Corpus annotation should indicate that all realisations of the same property denote that property.
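The shared logical form behind co-extensive realisations can be pictured as a set of attribute-value pairs. The sketch below is illustrative only: the attribute names and the dictionary layout are assumptions, not the corpus's actual schema.

```python
# A semantic representation for a referring expression modelled as a set of
# attribute-value pairs. All three surface realisations map to the same set
# (attribute names here are illustrative assumptions).
large_red_sofa = frozenset({
    ("type", "sofa"),
    ("colour", "red"),
    ("size", "large"),
})

# Different realisations of the same content:
realisations = [
    "the large red sofa",
    "the large, bright red settee",
    "the red couch which is larger than the rest",
]

# A semantically transparent corpus pairs each realisation with this single
# logical form, so matching happens at the semantic level, not the string level.
annotated = {text: large_red_sofa for text in realisations}
assert len(set(annotated.values())) == 1
```

Matching at the level of attribute sets is what lets an evaluation treat "settee", "sofa" and "couch" as expressing the same property.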

Corpora and NLG: Balance
● Corpora are sources of exclusively positive evidence.
  ◦ If C is not in the corpus, should the generator avoid it?
● Frequency of occurrence:
  ◦ If C' is very frequent, should the generator always use it? (Only if we know that C' is produced to the exclusion of other interesting possibilities.)
● So there is a trade-off between:
  ◦ ecological validity
  ◦ adequacy for the evaluation task
● Partial solution: an experimental design that generates a balanced corpus.

Example (cont'd.)
● Relevant variables:
  ◦ When are A and A' used when not required?
  ◦ When are A and A' omitted when required?
● Ideal setting: A and A' are (not) required in an equal number of instances.
● The same argument applies to, e.g., communicative setting.
● Hypothesis: incremental algorithms with preference order A >> A' are better than A' >> A.
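The hypothesis concerns preference-order-driven content determination in the style of Dale and Reiter's Incremental Algorithm. The sketch below is a minimal, simplified version under that reading; the entities, attribute names and preference order are invented for illustration and are not the implementation evaluated in the talk.

```python
# Minimal sketch of an Incremental Algorithm for content determination
# (after Dale & Reiter 1995). Entities and preference order are invented
# for illustration.

def incremental_algorithm(referent, distractors, preference_order):
    """Select attribute-value pairs that distinguish `referent` from
    `distractors`, trying attributes in the given preference order."""
    description = []
    remaining = list(distractors)
    for attr in preference_order:
        value = referent.get(attr)
        if value is None:
            continue
        # Keep the attribute only if it rules out at least one distractor.
        if any(d.get(attr) != value for d in remaining):
            description.append((attr, value))
            remaining = [d for d in remaining if d.get(attr) == value]
        if not remaining:
            break
    return description  # may be non-minimal: that is the point of "incremental"

# Example domain: the referent is a large red sofa among two distractors.
target = {"type": "sofa", "colour": "red", "size": "large"}
others = [
    {"type": "sofa", "colour": "blue", "size": "large"},
    {"type": "desk", "colour": "red", "size": "small"},
]

# Preference order A >> A' corresponds to trying, e.g., colour before size.
print(incremental_algorithm(target, others, ["type", "colour", "size"]))
# [('type', 'sofa'), ('colour', 'red')]
```

Changing the preference order changes which distinguishing description is produced, which is exactly what a balanced corpus lets one evaluate.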

The TUNA Reference Corpus
● The corpus meets the transparency and balance requirements.
● Different domains (of different complexity):
  ◦ A domain of simple furniture objects: 4 attributes + horizontal and vertical location
  ◦ A domain of real b&w photographs of people: 9 attributes + horizontal and vertical location
● Different communicative situations:
  ◦ Fault-critical
  ◦ Non-fault-critical
● Different kinds of attributes:
  ◦ Absolute properties (e.g. colour, baldness)
  ◦ Gradable properties (e.g. size, relative position)
● Different numbers of referents:
  ◦ Reference to individuals ("the red sofa")
  ◦ Reference to sets ("the red and blue sofas")

Web-based corpus collection experiment

With (limited) feedback…

Design
● Balance within subjects:
  ◦ Content: for each attribute combination, there are equal numbers of domains in which the combination is minimally required to distinguish the referents.
  ◦ Cardinality: number of plural and singular references.
● Between subjects:
  ◦ Fault-critical vs. non-fault-critical communicative situation.
  ◦ Use of location.

Corpus annotation ● Domain representation makes all attributes of all domain entities explicit.

Corpus annotation
● Two-level annotation for descriptions:
  ◦ ATTRIBUTE tags mark up description segments with the domain information they express.
  ◦ The DESCRIPTION tag allows compilation of a logical form from the description.
● Example: "the large settee at oblique angle", with each segment (large / settee / at oblique angle) marked up with the attribute it expresses.
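Compiling a logical form from such an annotation amounts to collecting the attribute-value pairs on the mark-up. The sketch below assumes a simplified XML layout with DESCRIPTION and ATTRIBUTE tags and invented attribute names and values; the corpus's actual schema may differ.

```python
# Compile a logical form (a set of attribute-value pairs) from an annotated
# description. The XML layout, attribute names and values are assumptions
# for illustration, not the corpus's actual schema.
import xml.etree.ElementTree as ET

annotated = """
<DESCRIPTION>
  the <ATTRIBUTE name="size" value="large">large</ATTRIBUTE>
  <ATTRIBUTE name="type" value="sofa">settee</ATTRIBUTE>
  <ATTRIBUTE name="orientation" value="oblique">at oblique angle</ATTRIBUTE>
</DESCRIPTION>
"""

def logical_form(xml_string):
    """Collect (name, value) pairs from every ATTRIBUTE element."""
    root = ET.fromstring(xml_string)
    return {(a.get("name"), a.get("value")) for a in root.iter("ATTRIBUTE")}

print(sorted(logical_form(annotated)))
# [('orientation', 'oblique'), ('size', 'large'), ('type', 'sofa')]
```

Because the surface string "settee" is annotated with the domain value "sofa", the compiled logical form is directly comparable with an algorithm's output.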

How feasible is this annotation?
● Evaluation with 2 independent annotators using the same annotation manual.
● Very high inter-annotator agreement:
  ◦ Furniture domain: ca. 75% perfect agreement; mean Dice coefficient 0.92.
  ◦ People domain: ca. 40% perfect agreement; mean Dice coefficient 0.84.

State of the corpus

              -Loc   +Loc
  Furniture    300    300
  People       270    270
  Total: 1140

● Descriptions were collected in both fault-critical (+FC) and non-fault-critical (-FC) conditions.
● Furniture: fully annotated; evaluation shows high inter-annotator agreement.
● People: annotation in progress.
● The corpus is currently available on demand and will be in the public domain by May 2007.

Current uses of the corpus
● Two evaluations, comparing some standard GRE algorithms on singulars and plurals.
● Basic procedure:
  ◦ Run the algorithm over a domain.
  ◦ Compile a logical form from a corpus description.
  ◦ Estimate the degree of match between description and algorithm output.
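One natural way to estimate the degree of match between two logical forms is the Dice coefficient, the same metric reported for inter-annotator agreement. The attribute sets below are invented for illustration; the talk does not specify that this exact metric is used for the algorithm comparison.

```python
# Degree of match between an algorithm's output and a corpus description,
# both represented as sets of attribute-value pairs. Dice is one natural
# choice of metric; the example sets are invented for illustration.

def dice(a, b):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|), in [0, 1]."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

algorithm_output = {("type", "sofa"), ("colour", "red")}
corpus_description = {("type", "sofa"), ("colour", "red"), ("size", "large")}

print(round(dice(algorithm_output, corpus_description), 2))  # 0.8
```

A set-based metric like this replaces the all-or-none view of matching criticised earlier: partial overlap earns partial credit.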

Future uses
● Machine learning approaches to GRE: the corpus contains a mapping between linguistic and semantic representations.
● Extending the remit of GRE to cover realisation and lexicalisation, exploiting the realisation-semantics mapping.
● Investigation of the impact of communicative setting on algorithm performance.
● Comparison of corpus-evaluation outcomes with task-oriented (reader) evaluation.

Conclusion
● NLG is not only about surface linguistic form; many choices are made at a different level.
● Evaluation of Content Determination requires adequate resources. Our arguments are strongly related to those of J. Viethen and M. Walker.
● We argue that evaluation in such tasks is more reliable if resources are semantically and pragmatically transparent and balanced.
● This obviously makes the evaluation exercise more expensive, but it ultimately pays off.

Further info

Design: between subjects
● Fault-critical vs. non-fault-critical. Instructions shown to participants:
  ◦ "Our program will eventually be used in situations where it is crucial that it understand descriptions accurately, with no option to correct mistakes…"
  ◦ vs. "If the computer misunderstands your description and removes the wrong objects, you can point out the right objects for it, by clicking on the pictures with the red borders."
● +Location vs. -Location:
  ◦ Row/column of each object determined randomly at runtime.
  ◦ This increases domain variation and offsets the more determinate nature of other attribute combinations.
  ◦ Some participants could use location, others couldn't.
  ◦ We considered location a good candidate for a gradable property.