Corpus-based evaluation of Referring Expression Generation
Albert Gatt, Ielka van der Sluis, Kees van Deemter
Department of Computing Science, University of Aberdeen
Focus of this talk
● Generation of Referring Expressions (GRE)
● A very big part of this is Content Determination:
  Knowledge Base + intended referent (R)
  Search for distinguishing properties
  "Description" = a semantic representation
● Evaluation challenges:
  Semantically intensive
  Pragmatic issues: identify, inform, signal agreement... (cf. Jordan 2000, ...)
  "Human gold standard": one and only one standard per input?
  Evaluation metric: an all-or-none affair?
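The content determination task above can be made concrete with a minimal sketch: given a knowledge base and an intended referent, search for a property set whose extension is exactly the referent. The knowledge base, entity names, and properties below are invented for illustration; this brute-force search is one simple way to realise the task, not the algorithm discussed in the talk.

```python
from itertools import combinations

# Toy knowledge base: entity -> set of properties (invented for illustration).
KB = {
    "e1": {"type:sofa", "colour:red", "size:large"},
    "e2": {"type:sofa", "colour:blue", "size:large"},
    "e3": {"type:desk", "colour:red", "size:small"},
}

def distinguishing_description(kb, referent):
    """Search for a smallest property set that is true of the referent
    and of no other entity (a distinguishing description)."""
    props = sorted(kb[referent])
    for size in range(1, len(props) + 1):
        for combo in combinations(props, size):
            chosen = set(combo)
            # The description's extension: all entities satisfying every chosen property.
            extension = {e for e, ps in kb.items() if chosen <= ps}
            if extension == {referent}:
                return chosen
    return None  # the referent cannot be distinguished in this domain

print(distinguishing_description(KB, "e1"))  # {'colour:red', 'size:large'}
```

The returned set is a semantic representation, not a surface string: realisation is a separate step, which is precisely why evaluation of content determination needs semantically transparent resources.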
Outline of our proposal
● Large corpus of descriptions (2000+), constructed via a controlled experiment, as part of the TUNA Project.
● Semantic annotation.
● Balance.
● Expressive variety.
● Related proposals on human gold standards:
  M. Walker: Language Productivity Assumption
  J. Viethen: GRE resources are difficult to obtain from naturally occurring text.
Corpora and NLG: Transparency
● Requirements for a GRE evaluation corpus:
  Semantic transparency: linguistic realisation + semantic representation + domain
  Pragmatic transparency: human intention = algorithmic intention
● These requirements ensure that the output of content determination is matched against a corpus instance on a level playing field.
● Perhaps the same can be said of other Content Determination tasks.
Example
  "the large red sofa"
  "the large, bright red settee"
  "the red couch which is larger than the rest"
● All of the above are co-extensive.
● An algorithm may generate a logical form that "means" all of the above.
● Corpus annotation should indicate that all realisations of the same property denote that property.
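The requirement above can be sketched as a small mapping from surface realisations to the properties they denote. The lexicon below is hypothetical, invented purely to illustrate how different realisations of the same property collapse to one logical form.

```python
# Hypothetical realisation lexicon: several surface forms all denote
# the same semantic property, as the annotation scheme requires.
REALISES = {
    "sofa": "type:sofa",
    "settee": "type:sofa",
    "couch": "type:sofa",
    "red": "colour:red",
    "bright red": "colour:red",
    "large": "size:large",
    "larger than the rest": "size:large",
}

def properties(segments):
    """Map a description's annotated segments to the properties they denote."""
    return {REALISES[s] for s in segments}

# Two different realisations, one underlying logical form:
d1 = properties(["large", "red", "sofa"])
d2 = properties(["large", "bright red", "settee"])
print(d1 == d2)  # True: both denote {size:large, colour:red, type:sofa}
```

With such a mapping in the annotation, a match between an algorithm's logical form and a corpus description no longer depends on surface wording.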
Corpora and NLG: Balance
● Corpora are sources of exclusively positive evidence: if C is not in the corpus, should the generator avoid it?
● Frequency of occurrence: if C' is very frequent, should the generator always use it? (Only if we know that C' is produced to the exclusion of other interesting possibilities.)
● So there is a trade-off between:
  ecological validity
  adequacy for the evaluation task
● Partial solution: an experimental design that generates a balanced corpus.
Example (cont'd)
● Relevant variables:
  When are A and A' used when not required?
  When are A and A' omitted when required?
● Ideal setting: A and A' are (not) required in an equal number of instances.
● The same argument holds for, e.g., communicative setting.
● Hypothesis: incremental algorithms with preference order A >> A' are better than those with A' >> A.
The TUNA Reference Corpus
● The corpus meets the transparency and balance requirements.
● Different domains (of different complexity):
  A domain of simple furniture objects: 4 attributes + horizontal and vertical location
  A domain of real b&w photographs of people: 9 attributes + horizontal and vertical location
● Different communicative situations:
  Fault-critical
  Non-fault-critical
● Different kinds of attributes:
  Absolute properties (e.g. colour, baldness)
  Gradable properties (e.g. size, relative position)
● Different numbers of referents:
  Reference to individuals ("the red sofa")
  Reference to sets ("the red and blue sofas")
Web-based corpus collection experiment
With (limited) feedback…
Design
● Balance within subjects:
  Content: for each attribute combination, there are equal numbers of domains in which the combination is minimally required to distinguish the referents.
  Cardinality: number of plural and singular references.
● Between subjects:
  Fault-critical vs. non-fault-critical communicative situation.
  Use of location.
Corpus annotation
● The domain representation makes all attributes of all domain entities explicit.
Corpus annotation
● Two-level annotation of descriptions:
  Segment-level tags mark up description segments with the domain information they express.
  A description-level tag allows compilation of a logical form from the whole description.
● Example: "the large settee at oblique angle", with annotated segments "large", "settee", "at oblique angle".
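A minimal sketch of how a logical form can be compiled from such two-level mark-up. The tag and attribute names below are illustrative only, not the corpus's actual annotation schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical mark-up in the spirit of the two-level scheme: a description-level
# tag wraps segment-level tags carrying the domain information they express.
# Tag names and attribute names are invented for illustration.
annotated = """
<description>
  the
  <segment attribute="size" value="large">large</segment>
  <segment attribute="type" value="settee">settee</segment>
  at
  <segment attribute="orientation" value="oblique">oblique angle</segment>
</description>
"""

def compile_logical_form(xml_string):
    """Collect (attribute, value) pairs from the segment tags,
    yielding a logical form for the whole description."""
    root = ET.fromstring(xml_string)
    return {(seg.get("attribute"), seg.get("value"))
            for seg in root.iter("segment")}

print(sorted(compile_logical_form(annotated)))
```

The compiled set of attribute-value pairs is directly comparable to the output of a content determination algorithm, which is what makes the corpus semantically transparent.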
How feasible is this annotation?
● Evaluation with 2 independent annotators using the same annotation manual.
● Very high inter-annotator agreement:
  Furniture domain: ca. 75% perfect agreement; mean Dice coefficient 0.92
  People domain: ca. 40% perfect agreement; mean Dice coefficient 0.84
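The Dice coefficient used above measures the overlap between two annotators' attribute sets: 2|A ∩ B| / (|A| + |B|). A minimal implementation, with hypothetical annotations for illustration:

```python
def dice(a, b):
    """Dice coefficient between two sets of semantic attributes:
    2*|A & B| / (|A| + |B|), from 0 (disjoint) to 1 (identical)."""
    if not a and not b:
        return 1.0  # two empty annotations agree trivially
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical annotations of the same description by two annotators.
annotator_1 = {"type:sofa", "colour:red", "size:large"}
annotator_2 = {"type:sofa", "colour:red"}

print(dice(annotator_1, annotator_2))  # 2*2 / (3+2) = 0.8
```

Unlike all-or-none perfect agreement, Dice gives partial credit when annotators overlap on most attributes, which is why both figures are reported.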
State of the corpus
             -Loc   +Loc
  Furniture   300    300
  People      270    270
  Total: 1140 descriptions, collected in both -FC and +FC conditions.
● Furniture: fully annotated; evaluation shows high inter-annotator agreement.
● People: annotation in progress.
● The corpus is currently available on demand and will be in the public domain by May 2007.
Current uses of the corpus
● Two evaluations, comparing some standard GRE algorithms on singulars and plurals.
● Basic procedure:
  Run the algorithm over a domain.
  Compile a logical form from a corpus description.
  Estimate the degree of match between the description and the algorithm's output.
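The basic procedure above can be sketched as a small evaluation loop. The (algorithm output, corpus logical form) pairs below are invented; in a real evaluation the first element would come from running a GRE algorithm over a domain and the second from a corpus annotation.

```python
def dice(a, b):
    """Dice coefficient: 2*|A & B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 1.0

# Hypothetical (algorithm output, corpus logical form) pairs, for illustration.
pairs = [
    ({"colour:red", "type:sofa"},               {"colour:red", "type:sofa"}),
    ({"colour:red", "type:sofa"},               {"size:large", "type:sofa"}),
    ({"size:large", "colour:red", "type:sofa"}, {"size:large", "type:sofa"}),
]

# Report both an all-or-none score and a graded overlap score.
perfect = sum(1 for out, gold in pairs if out == gold) / len(pairs)
mean_dice = sum(dice(out, gold) for out, gold in pairs) / len(pairs)
print(f"perfect match: {perfect:.2f}, mean Dice: {mean_dice:.2f}")
# prints "perfect match: 0.33, mean Dice: 0.77"
```

Reporting a graded score alongside perfect match addresses the earlier worry that an evaluation metric need not be an all-or-none affair.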
Future uses
● Machine learning approaches to GRE: the corpus contains a mapping between linguistic and semantic representations.
● Extending the remit of GRE to cover realisation and lexicalisation, exploiting the realisation-semantics mapping.
● Investigating the impact of communicative setting on algorithm performance.
● Comparing the outcomes of corpus evaluation to task-oriented (reader) evaluation.
Conclusion
● NLG is not only about surface linguistic form: many choices are made at a different level.
● Evaluation of Content Determination requires adequate resources. Our arguments are strongly related to those of J. Viethen and M. Walker.
● We argue that evaluation in such tasks is more reliable if resources are semantically and pragmatically transparent, and balanced.
● This obviously makes the evaluation exercise more expensive, but ultimately pays off.
Further info
http://www.csd.abdn.ac.uk/research/tuna/corpus
Design: between subjects
● Fault-critical vs. non-fault-critical instructions:
  "Our program will eventually be used in situations where it is crucial that it understand descriptions accurately, with no option to correct mistakes…"
  vs.
  "If the computer misunderstands your description and removes the wrong objects, you can point out the right objects for it by clicking on the pictures with the red borders."
● +Location vs. -Location:
  The row/column of each object is determined randomly at runtime. This increases domain variation and offsets the more determinate nature of other attribute combinations.
  Some people could use location, others could not.
  We considered location a good candidate for a gradable property.