1 Evaluating Language Processing Applications and Components PROPOR’03 Faro Robert Gaizauskas Natural Language Processing Group Department of Computer Science University of Sheffield www.dcs.shef.ac.uk/~robertg

2 June 27, 2003 Propor03, Faro Outline of Talk n Introduction: Perspectives on Evaluation in HLT n Technology Evaluation: Terms and Definitions n An Extended Example: The Cub Reporter Scenario –Application Evaluation: Question Answering and Summarisation –Component Evaluation: Time and Event Recognition n Ingredients of a Successful Evaluation n Snares and Delusions n Conclusions

3 June 27, 2003 Propor03, Faro Perspectives on Evaluation … n Three key stakeholders in the evaluation process: –Users (user-centred evaluation) –Researchers/technology developers (technology evaluation) –Funders (programme evaluation) n Users are concerned to accomplish a task for which HLT is just a tool –User evaluation requires evaluating the system in its operational setting –Does the HLT system allow users to accomplish their task better? –Keeping in mind that n The technology may be transformational n Faults may be due to non-system problems or interaction effects in the setting

4 June 27, 2003 Propor03, Faro … Perspectives on Evaluation … n Researchers are concerned with models and techniques for carrying out language processing tasks n For them evaluation is a key part of the empirical method –A model/system represents a hypothesis about how a language-related input may be translated into an output –Evaluation = hypothesis testing [Diagram: model output and human output for the same input are compared in an evaluation, whose results feed back into model refinement]

5 June 27, 2003 Propor03, Faro … Perspectives on Evaluation n Funders are concerned to determine whether R & D funds have been well spent n Programme evaluation may rely on –User-centred evaluation –Technology evaluation –Competitive evaluation –Assessment of social impact n The rest of this talk will concentrate on “technology evaluation”, or, more aptly, the empirical method for HLT

6 June 27, 2003 Propor03, Faro Outline of Talk n Introduction: Perspectives on Evaluation in HLT n Technology Evaluation: Terms and Definitions n An Extended Example: The Cub Reporter Scenario –Application Evaluation: Question Answering and Summarisation –Component Evaluation: Time and Event Recognition n Ingredients of a Successful Evaluation n Snares and Delusions n Conclusions

7 June 27, 2003 Propor03, Faro Technology Evaluation … n Helpful to make a few further initial distinctions … n Important to distinguish tasks or functions from the systems which carry them out n Tasks/functions may be broken into subtasks/subfunctions and systems into subsystems (or components) n There need not be an isomorphism between these decompositions n Tasks, specified independently of any system or class of systems, are the proper subject of evaluation

8 June 27, 2003 Propor03, Faro …Technology Evaluation … n User-visible tasks are tasks where input and output have functional significance for a system user –E.g. machine translation, speech recognition n User-transparent tasks are tasks where input and output do not have such significance –E.g. part-of-speech tagging, parsing, mapping to logical form n Usually user-transparent tasks are components of higher level user-visible tasks n Will refer to “user-visible tasks” as application tasks and “user-transparent tasks” as component tasks

9 June 27, 2003 Propor03, Faro … Technology Evaluation … Applications: n Machine translation (DARPA MT) n Speech recognition (CSR, LVCSR, Broadcast News) n Spoken language understanding (ATIS) n Information Retrieval (TREC, Amaryllis, CLEF) n Information Extraction (MUC, ACE, IREX) n Summarisation (SUMMAC, DUC) n Question Answering (TREC-QA) n … Components: n Parsers (Parseval) n Morphology n POS Tagging (GRACE) n Coreference (MUC) n Word Sense Disambiguation (SENSEVAL) n …

10 June 27, 2003 Propor03, Faro …Technology Evaluation … n Evaluation scenarios may be defined for both application tasks and component tasks n Each sort of evaluation faces characteristic challenges n Component task evaluation is difficult because –No universally agreed set of “components” or intermediate representations composing the human language processing system – i.e. theory dependence (e.g. grammatical formalisms) –Collecting annotated resources is difficult because n They must be created, unlike, e.g. source and target language texts, full texts and summaries, etc. which can be found n Their creation relies on a small number of expensive experts –Users and funders (and sometimes scientists!) need convincing

11 June 27, 2003 Propor03, Faro …Technology Evaluation … n Application task evaluation also faces difficulties n Note that an application task technology evaluation is NOT a user-centred evaluation – no specific setting is assumed n An application task evaluation may use –Intrinsic criteria: how well does a system perform on the task it was designed to carry out? –Extrinsic criteria: how well does a system enable a user to complete a task? n May approximate user-centred evaluation depending on reality of the setting of the user task

12 June 27, 2003 Propor03, Faro Application Task Technology Evaluation vs User-Centred Evaluation: Example n Information Extraction (IE) is the task of populating a structured DB with information from free text pertaining to predefined scenarios of interest n E.g. extract info about management succession events – events involving persons moving in or out of positions in organizations n Technology evaluation of this task was carried out in MUC-6 using a controlled corpus, task definition and scoring metrics/software n The participating systems produced structured templates which were scored by the organisers

13 June 27, 2003 Propor03, Faro Application Task Technology Evaluation vs User-Centred Evaluation: Example (cont)
Source text: BURNS FRY Ltd. (Toronto) -- Donald Wright, 46 years old, was named executive vice president and director of fixed income at this brokerage firm. Mr. Wright resigned as president of Merrill Lynch Canada Inc., a unit of Merrill Lynch & Co., to succeed Mark Kassirer, 48, who left Burns Fry last month. A Merrill Lynch spokeswoman said it hasn't named a successor to Mr. Wright, who is expected to begin his new position by the end of the month.
Extracted template:
DOC_NR: "9404130062" CONTENT: succession event
SUCCESSION_EVENT: SUCCESSION_ORG: Burns Fry Ltd. POST: "executive vice president" VACANCY_REASON: OTH_UNK
IN_AND_OUT (outgoing): IO_PERSON: Mark Kassirer NEW_STATUS: OUT ON_THE_JOB: NO
IN_AND_OUT (incoming): IO_PERSON: Donald Wright NEW_STATUS: IN ON_THE_JOB: NO OTHER_ORG: Merrill Lynch Canada Inc. REL_OTHER_ORG: OUTSIDE_ORG
ORGANIZATION: ORG_NAME: "Burns Fry Ltd." ORG_ALIAS: "Burns Fry" ORG_DESCRIPTOR: "this brokerage firm" ORG_TYPE: COMPANY ORG_LOCALE: Toronto CITY ORG_COUNTRY: Canada
ORGANIZATION: ORG_NAME: "Merrill Lynch Canada Inc." ORG_ALIAS: "Merrill Lynch" ORG_DESCRIPTOR: "a unit of Merrill Lynch & Co." ORG_TYPE: COMPANY
PERSON: PER_NAME: "Mark Kassirer"
PERSON: PER_NAME: "Donald Wright" PER_ALIAS: "Wright" PER_TITLE: "Mr."

14 June 27, 2003 Propor03, Faro Application Task Technology Evaluation vs User-Centred Evaluation: Example (cont) n The MUC evaluations are application task technology evaluations – results are of interest to users, but the resulting systems cannot be directly used by them n Interestingly, designing a system that users can actually use leads to insights into the task and into the limitations of technology evaluation n To explore the deployability of IE technology we carried out a 2-year project with GlaxoSmithKline: TRESTLE – Text Retrieval, Extraction and Summarisation Technology for Large Enterprises n Key insights were: –The importance of good interface design –The importance of allowing for imperfect IE – one click from “truth”

15 June 27, 2003 Propor03, Faro TRESTLE Interface

16 June 27, 2003 Propor03, Faro TRESTLE Interface

17 June 27, 2003 Propor03, Faro …Technology Evaluation n Both application task and component task technology evaluation face challenges n However, both are essential elements in the empirical methodology for progressing HLT n Researchers and developers need to fight hard for support for well-designed evaluation programmes n However, they must bear in mind that users and funders will be sceptical and that translating results from evaluations into usable systems is not trivial

18 June 27, 2003 Propor03, Faro Outline of Talk n Introduction: Perspectives on Evaluation in HLT n Technology Evaluation: Terms and Definitions n An Extended Example: The Cub Reporter Scenario –Application Evaluation: Question Answering and Summarisation –Component Evaluation: Time and Event Recognition n Ingredients of a Successful Evaluation n Snares and Delusions n Conclusions

19 June 27, 2003 Propor03, Faro The Cub Reporter Scenario n “Cub reporter” = junior reporter whose job is to gather background and check facts n The electronic cub reporter is a vision of how question answering (QA) and summarisation technology could be brought together – first identified by the TREC QA Roadmap group n Sheffield (Computer Science and Journalism) has a 6 person-year effort funded by the UK EPSRC to investigate the scenario in conjunction with the UK Press Association (the UK's premier newswire service)

20 June 27, 2003 Propor03, Faro 05:33 20/05/03: Page 1 (HHH) CITY Glaxo Background GROWING ANGER OVER BOSSES' PAY By Paul Sims, PA News The shareholders' revolt at yesterday's annual general meeting of GlaxoSmithKline was the latest in a line of protests against the ``fat cat'' salaries of Britain's top executives. Last night's revolt is the first time a FTSE 100 member has faced such a revolt since companies became obliged to submit their remuneration report to shareholder vote. But shareholders from a number of companies had already denounced massive payouts to directors presiding over plummeting share values on a struggling stock market. Earlier this month, a third of shareholders' votes went against Royal & Sun Alliance's remuneration report … A similar protest at Shell last month attracted 23% of the vote … And at Barclays Bank, three out of 10 of the larger shareholders registered dissent over rewards for top executives … 20:05 19/05/03: Page 1 (HHH) CITY Glaxo Pharmaceuticals giant GlaxoSmithKline tonight suffered an unprecedented defeat at its annual general meeting when shareholders won a vote against a multi- million pound pay and rewards package for executives. Cub Reporter Scenario: Example Snap 20:05 19/05/03: Page 1 (HHH) CITY Glaxo Pharmaceuticals giant GlaxoSmithKline tonight suffered an unprecedented defeat at its annual general meeting when shareholders won a vote against a multi- million pound pay and rewards package for executives. How have shareholders in other companies recently voted on executive pay? How have shareholders in GlaxosmithKline voted on executive pay in the past? Questions Archive (10 years of PA Text) Question Generation Question Answering Answers and Answer Source Documents Multidocument Summarisation Background

21 June 27, 2003 Propor03, Faro Cub Reporter Scenario: Evaluation n What role will evaluation play in the project? n Application task technology evaluation –Question answering (TREC QA) –Summarisation (DUC) n Component task technology evaluation –Part of speech tagging (Penn Tree Bank; BNC) –Parsing –Time and event recognition (TimeML) n User-centred evaluation –Observation of journalists performing controlled task with and without the developed system

22 June 27, 2003 Propor03, Faro Application Task Evaluation: TREC QA Track n Aim is to move beyond document retrieval (traditional search engines) to retrieving information – answers – rather than documents n The Text REtrieval Conferences (TREC) started in 1992 to stimulate research in information retrieval by providing: –Standardised task definitions –Standardised resources (corpora, human relevance judgments) –Standardised metrics (e.g. recall and precision) and evaluation procedures –An annual competition and forum for reporting results n TREC has added and removed variant tasks over the years –In 1999 a task (“track”) on open domain question answering was added

23 June 27, 2003 Propor03, Faro The TREC QA Track: Task Definition (TREC 8/9) n Inputs: –4GB newswire texts (from the TREC text collection) –File of natural language questions (200 TREC-8/700 TREC-9), e.g. Where is the Taj Mahal? How tall is the Eiffel Tower? Who was Johnny Mathis’ high school track coach? n Outputs: –Five ranked answers per question, including pointer to source document n 50 byte category n 250 byte category –Up to two runs per category per site n Limitations: –Each question has an answer in the text collection –Each answer is a single literal string from a text (no implicit or multiple answers)

24 June 27, 2003 Propor03, Faro The TREC QA Track: Task Definition (TREC2001) n Subtrack 1: Main (similar to previous years) –NIL is a valid response – an answer is no longer guaranteed to exist in the corpus –Questions (500) are more ‘real’ – drawn from MSNSearch and AskJeeves logs –More definition-type questions – What is an atom? n Subtrack 2: List –System must assemble an answer from multiple documents –Questions specify the number of instances to retrieve: What are 9 novels written by John Updike? –Response is an unordered set of the target number of [doc-id answer] pairs –Target number of instances guaranteed to exist in the collection n Subtrack 3: Context –System must return a ranked list of response pairs for each question in a series: How many species of spider are there? How many are poisonous to humans? What percentage of spider bites in the US are fatal? –Evaluated using reciprocal rank –Answers guaranteed to exist + later questions independently answerable

25 June 27, 2003 Propor03, Faro The TREC QA Track: Task Definition (TREC2002) n New text collection – the AQUAINT collection –AP newswire (1998-2000), New York Times newswire (1998-2000), Xinhua News Agency (English portion, 1996-2000) –Approximately 1,033,000 documents / 3 gigabytes of text n Subtrack 1: Main, similar to previous years but: –No definition-type questions –One answer per question only –Exact matches only (no 50/250 byte strings) –Principal metric is the confidence weighted score n Subtrack 2: List –As for TREC2001 n No context track

26 June 27, 2003 Propor03, Faro TREC QA Track 2003 n Two tasks: main task and passage task n Main task features 3 sorts of questions: –Factoid (400-450); exact answers; answers not guaranteed –List (25-50); exact answers; answers guaranteed –Definition (25-50); answers guaranteed n Definition questions ask for “a set of interesting and salient information items about a person, organization, or thing” n Questions are tagged as to type n Passage task –Relaxes the requirement of exact answers for factoid questions –Answers may be up to 250 bytes in length n Same corpus (AQUAINT) as TREC2002

27 June 27, 2003 Propor03, Faro The TREC QA Track: Metrics and Scoring n Principal metric for TREC8-10 was Mean Reciprocal Rank (MRR) –Correct answer at rank 1 scores 1 –Correct answer at rank 2 scores 1/2 –Correct answer at rank 3 scores 1/3 –… Sum over all questions and divide by the number of questions n More formally: $MRR = \frac{1}{N}\sum_{i=1}^{N} r_i$, where N = # questions and $r_i$ = reciprocal of the best (lowest) rank assigned by the system at which a correct answer is found for question i, or 0 if no correct answer is found n Judgements made by human judges based on the answer string alone (lenient evaluation) and by reference to documents (strict evaluation)
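For concreteness, a minimal Python sketch of MRR follows (an illustration, not the official TREC scoring software); `best_ranks` is an assumed input giving, for each question, the best rank at which a correct answer appeared, or None if none did.

```python
def mean_reciprocal_rank(best_ranks):
    """MRR over a set of questions.

    best_ranks: for each question, the best (lowest) rank at which a
    correct answer was returned, or None if no correct answer appeared.
    """
    total = sum(1.0 / r for r in best_ranks if r is not None)
    return total / len(best_ranks)

# Three questions: correct at rank 1, correct at rank 3, no correct answer
print(mean_reciprocal_rank([1, 3, None]))  # (1 + 1/3 + 0) / 3 = 0.444...
```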

28 June 27, 2003 Propor03, Faro The TREC QA Track: Metrics and Scoring n For list questions –each list judged as a unit –evaluation measure is accuracy: # distinct instances returned / target # instances n The principal metric for TREC2002 was the Confidence Weighted Score: $CWS = \frac{1}{Q}\sum_{i=1}^{Q}\frac{\#\ \text{correct in first } i \text{ ranks}}{i}$, where Q is the number of questions and answers are ordered by decreasing system confidence
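A minimal sketch of the confidence weighted score, assuming `judgements` is a list of booleans (one per question, True if the answer was judged correct) already sorted by decreasing system confidence; this is illustrative, not the NIST scorer.

```python
def confidence_weighted_score(judgements):
    """judgements: correctness of each answer, ordered by decreasing
    system confidence. Rewards putting answers you are sure of first."""
    Q = len(judgements)
    correct_so_far = 0
    score = 0.0
    for i, correct in enumerate(judgements, start=1):
        if correct:
            correct_so_far += 1
        score += correct_so_far / i   # fraction correct in the first i ranks
    return score / Q

print(confidence_weighted_score([True, False, True]))  # (1/1 + 1/2 + 2/3) / 3 = 0.72...
```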

29 June 27, 2003 Propor03, Faro The TREC QA Track: Metrics and Scoring n A system's overall score will be: 1/2*factoid-score + 1/4*list-score + 1/4*definition-score n A factoid answer is judged one of: correct, non-exact, unsupported, incorrect. Factoid-score is the % of factoid answers judged correct n List answers are treated as sets of factoid answers or “instances”. Instance recall and precision are defined as: IR = # instances judged correct & distinct / |final answer set| IP = # instances judged correct & distinct / # instances returned The overall list score is then the F1 measure: F = (2*IP*IR)/(IP+IR) n Definition answers are scored based on the number of “essential” and “acceptable” information “nuggets” they contain – see the track definition for details
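As a sketch only (the function names and inputs are assumptions, not the official track scorer), the list and overall scores described above could be computed as:

```python
def list_f_score(num_correct_distinct, num_returned, final_answer_set_size):
    """F1 over instance precision (IP) and instance recall (IR) for one list question."""
    ip = num_correct_distinct / num_returned if num_returned else 0.0
    ir = num_correct_distinct / final_answer_set_size if final_answer_set_size else 0.0
    return 2 * ip * ir / (ip + ir) if (ip + ir) else 0.0

def overall_score(factoid_score, list_score, definition_score):
    """TREC 2003 weighting: half factoid, a quarter each for list and definition."""
    return 0.5 * factoid_score + 0.25 * list_score + 0.25 * definition_score

# e.g. 6 correct distinct instances out of 10 returned, 8 in the final answer set
print(list_f_score(6, 10, 8))           # IP = 0.6, IR = 0.75, F = 0.667
print(overall_score(0.5, 0.4, 0.3))     # 0.425
```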

30 June 27, 2003 Propor03, Faro TREC QA Track Evaluation: Observations n Track has evolved –Each year has presented a more challenging task –Metrics have changed –Bad ideas have been identified and discarded n Task definitions developed/modified in conjunction with the participants n While scoring involves human judges, automatic scoring procedures which approximate human judges have been developed –Allows evaluation to be repeated outside formal evaluation –Supports development of supervised machine learning n Support from academic and industrial participants

31 June 27, 2003 Propor03, Faro TREC QA Track Evaluation: Stimulations … n Most QA systems follow a two-part architecture: 1. Use an IR component with the (preprocessed) question as query to retrieve documents/passages from the overall collection which are likely to contain answers 2. Use an answer extraction component to extract answers from the highest ranked documents/passages retrieved in step 1 n Clearly the performance of step 1 places an upper bound on the performance of step 2 n What is the most appropriate way to assess the performance of step 1? –Conventional metrics for evaluating IR systems are recall and precision –Not directly useful in the QA context …

32 June 27, 2003 Propor03, Faro TREC QA Track Evaluation: Stimulations … n Intuitively, want to know –% of questions which have an answer in the top n ranks of returned documents/passages, i.e. how far down the ranking to go –# of answer instances in the top n ranks, i.e. redundancy n Let –Q be the question set –D the document (or passage) collection –$A_{D,q} \subseteq D$ the subset of D which contains correct answers for $q \in Q$ –$R_{D,q,n}$ be the n top-ranked documents (or passages) in D retrieved given question $q$ n Define: $coverage(Q,D,n) = \frac{|\{q \in Q : R_{D,q,n} \cap A_{D,q} \neq \emptyset\}|}{|Q|}$ and $redundancy(Q,D,n) = \frac{\sum_{q \in Q} |R_{D,q,n} \cap A_{D,q}|}{|Q|}$
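A minimal Python sketch of the coverage and redundancy metrics, under the assumption that `retrieved[q]` is the ranked list of documents (or passages) returned for question q and `answer_bearing[q]` is the set of those known to contain a correct answer (names are illustrative):

```python
def coverage(questions, retrieved, answer_bearing, n):
    """Proportion of questions with at least one answer-bearing document in the top n ranks."""
    hits = sum(1 for q in questions if set(retrieved[q][:n]) & answer_bearing[q])
    return hits / len(questions)

def redundancy(questions, retrieved, answer_bearing, n):
    """Average number of answer-bearing documents per question in the top n ranks."""
    total = sum(len(set(retrieved[q][:n]) & answer_bearing[q]) for q in questions)
    return total / len(questions)

# Toy example with two questions and document ids
retrieved = {"q1": ["d3", "d7", "d9"], "q2": ["d2", "d4", "d8"]}
answer_bearing = {"q1": {"d7", "d9"}, "q2": {"d5"}}
print(coverage(["q1", "q2"], retrieved, answer_bearing, 2))    # 0.5 (only q1 covered in top 2)
print(redundancy(["q1", "q2"], retrieved, answer_bearing, 2))  # 0.5 (one answer doc for q1, none for q2)
```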

33 June 27, 2003 Propor03, Faro TREC QA Track Evaluation: Stimulations … n Given these new metrics we can now ask new questions about, e.g., different passage retrieval approaches (passage segmentation before or after the initial query?)

34 June 27, 2003 Propor03, Faro TREC QA Track Evaluation: Stimulations … n Conclusion: evaluations lead to new questions about components, which in turn lead to new metrics

35 June 27, 2003 Propor03, Faro Outline of Talk n Introduction: Perspectives on Evaluation in HLT n Technology Evaluation: Terms and Definitions n An Extended Example: The Cub Reporter Scenario –Application Evaluation: Question Answering and Summarisation –Component Evaluation: Time and Event Recognition n Ingredients of a Successful Evaluation n Snares and Delusions n Conclusions

36 June 27, 2003 Propor03, Faro Component Task Evaluation: Time and Event Recognition n Answering questions, extracting information from text, and summarising documents all presuppose sensitivity to the temporal location and ordering of events in text: When did the war between Iran and Iraq end? When did John Sununu travel to a fundraiser for John Ashcroft? How many Tutsis were killed by Hutus in Rwanda in 1994? Who was Secretary of Defense during the Gulf War? What was the largest U.S. military operation since Vietnam? When did the astronauts return from the space station on the last shuttle flight?

37 June 27, 2003 Propor03, Faro Time and Event Recognition n To address this task a significant effort has recently been made to create –An agreed standard annotation for times, events and temporal relations in text (TimeML) –A corpus of texts annotated according to this standard (TimeBank) –An annotation tool to support manual creation of the annotations –Automated tools to assist in the annotation process (time and event taggers) n This effort was made via a 9-month workshop (Jan-Sep 2002) called TERQAS: Time and Event Recognition for Question Answering Systems, sponsored by ARDA n See www.time2002.org n Significantly informed by earlier work by Andrea Setzer at Sheffield

38 June 27, 2003 Propor03, Faro TERQAS: Organisation n TERQAS was organised into 6 working groups: –TimeML Definition and Specification –Algorithm Review and Development –Article Corpus Collection Development –Query Corpus Development and Classification –TIMEBANK Annotation –TimeML and Algorithm Evaluation

39 June 27, 2003 Propor03, Faro TERQAS: Participants –James Pustejovsky, PI –Rob Gaizauskas –Graham Katz –Bob Ingria –José Castaño –Inderjeet Mani –Antonio Sanfilippo –Dragomir Radev –Patrick Hanks –Marc Verhagen –Beth Sundheim –Andrea Setzer –Jerry Hobbs –Bran Boguraev –Andy Latto –John Frank –Lisa Ferro –Marcia Lazo –Roser Saurí –Anna Rumshisky –David Day –Luc Belanger –Harry Wu –Andrew See

40 June 27, 2003 Propor03, Faro TERQAS: Outcomes n Creation of a robust markup language for temporal expressions, event expressions, and the relations between them (TimeML 1.0) n Guidelines for Annotation n Creation of a Gold Standard annotated according to this language (TIMEBANK) n Creation of a Suite of Algorithms (T3PO) for recognizing: –Temporal Expressions –Event Expressions –Signals –Link Construction n Development of a Text Segmented Closure Algorithm n Creation of a Semi-graphical Annotation Tool for speeding up annotation of dependency-rich texts (SGAT) n Query Database Creation Tool n Guidelines for Creating a Corpus of Questions n Initial Scoring and Inter-annotator Evaluation Setup

41 June 27, 2003 Propor03, Faro TimeML: The Conceptual and Linguistic Basis n TimeML presupposes the following temporal entities and relations: n Events are taken to be situations that occur or happen, either punctual or lasting for a period of time. They are generally expressed by means of tensed or untensed verbs, nominalisations, adjectives, predicative clauses, or prepositional phrases. n Times may be points, intervals, or durations. They may be referred to by fully specified or underspecified temporal expressions, or intensionally specified expressions. n Relations can hold between events, and between events and times. They can be temporal, subordinate, or aspectual relations.

42 June 27, 2003 Propor03, Faro TimeML: Annotating Events n Events are marked up by annotating a representative of the event expression, usually the head of the verb phrase. n The attributes of events are a unique identifier, the event class, tense, and aspect. n Fully annotated example: All 75 passengers <EVENT eid="e1" class="OCCURRENCE" tense="PAST" aspect="NONE">died</EVENT> n See the full TimeML spec for handling of events conveyed by nominalisations or stative adjectives.
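For illustration, such markup can be processed with standard XML tooling; the fragment below uses the attribute names just listed, with attribute values that are assumptions about this particular example rather than taken from TimeBank.

```python
import xml.etree.ElementTree as ET

# A TimeML-style fragment wrapped in a dummy <s> element so it parses as XML
fragment = ('<s>All 75 passengers <EVENT eid="e1" class="OCCURRENCE" '
            'tense="PAST" aspect="NONE">died</EVENT></s>')

for event in ET.fromstring(fragment).iter("EVENT"):
    print(event.text, event.attrib)
    # died {'eid': 'e1', 'class': 'OCCURRENCE', 'tense': 'PAST', 'aspect': 'NONE'}
```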

43 June 27, 2003 Propor03, Faro TimeML: Annotating Times n Annotation of times is designed to be as compatible with the TIMEX2 time expression annotation guidelines as possible. n Fully annotated example for a straightforward time expression: <TIMEX3 tid="t1" type="DATE" value="1966-07">July 1966</TIMEX3> n Additional attributes are used, e.g., to anchor relative time expressions and to supply functions for computing absolute time values (last week).

44 June 27, 2003 Propor03, Faro TimeML: Annotating Signals n The SIGNAL tag is used to annotate sections of text, typically function words, that indicate how temporal objects are to be related to each other. n Also used to mark polarity indicators such as not, no, none, etc., as well as indicators of temporal quantification such as twice, three times, and so forth. n Signals have only one attribute, a unique identifier. n Fully annotated example: Two days the attack …

45 June 27, 2003 Propor03, Faro TimeML: Annotating Relations (1) n To annotate the different types of relations that can hold between events and events and times, the LINK tag has been introduced. n There are three types of LINKs: TLINKs, SLINKs, and ALINKs, each of which has temporal implications. n A TLINK or Temporal Link represents the temporal relationship holding between events or between an event and a time. n It establishes a link between the involved entities making explicit whether their relationship is: before, after, includes, is_included, holds, simultaneous, immediately after, immediately before, identity, begins, ends, begun by, ended by.

46 June 27, 2003 Propor03, Faro TimeML: Annotating Relations (2) n An SLINK or Subordination Link is used for contexts introducing relations between two events, or an event and a signal. –SLINKs are of one of the following sorts: Modal, Factive, Counter-factive, Evidential, Negative evidential, Negative. n An ALINK or Aspectual Link represents the relationship between an aspectual event and its argument event. –The aspectual relations encoded are: initiation, culmination, termination, continuation.

47 June 27, 2003 Propor03, Faro Annotating Relations (3) Annotated examples: n TLINK: John taught on Monday – the event taught is temporally linked to the time Monday (is_included) n SLINK: John said he taught – said introduces an evidential subordination link to taught n ALINK: John started to read – started marks the initiation of the reading event

48 June 27, 2003 Propor03, Faro Comparing TimeML Annotations n Time and event annotations may be compared as are other text tagging annotations (e.g. named entities, template elements) n Problem: semantically identical temporal relations can be annotated in multiple ways n Solution: the deductive closure of the temporal relations is computed and compared for two different annotations of the same text (similar to the solution for the MUC coreference task) – e.g. if one annotator marks A before B and B before C while another also marks the relation between A and C explicitly, the closures agree n Provides the basis for defining Precision and Recall metrics for evaluation and inter-annotator agreement
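A minimal sketch of the idea, under strong simplifying assumptions (only a single transitive relation such as "before" is closed; real TimeML closure reasons over the full set of relation types):

```python
def closure(relations):
    """Transitive closure of a set of (a, b) pairs read as 'a before b'."""
    closed = set(relations)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closed):
            for (c, d) in list(closed):
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed

def precision_recall(system_rels, gold_rels):
    """Compare two annotations of the same text after closing both."""
    sys_c, gold_c = closure(system_rels), closure(gold_rels)
    correct = len(sys_c & gold_c)
    return (correct / len(sys_c) if sys_c else 0.0,
            correct / len(gold_c) if gold_c else 0.0)

# One annotator marks A<B and B<C; the other also marks A<C explicitly:
# both closures contain the same three relations, so P = R = 1.0
print(precision_recall({("A", "B"), ("B", "C")}, {("A", "B"), ("B", "C"), ("A", "C")}))
```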

49 June 27, 2003 Propor03, Faro An Earlier Pilot Study n Based on Andrea Setzer's annotation scheme, which was taken as the starting point for TimeML n trial corpus: 6 New York Times newspaper texts (1996) n each text annotated by 2-3 annotators n a gold standard annotation was created n questions: –are the guidelines comprehensive and unambiguous? –how much genuine disagreement is there? –is it feasible to annotate a larger corpus? n calculated recall and precision of annotators against the gold standard

50 June 27, 2003 Propor03, Faro Pilot Study Annotation Procedure n annotation takes place in stages: 1. annotate events and times 2. annotate explicit temporal relations 3. annotate ‘obvious’ implicit temporal relations 4. annotate less obvious implicit temporal relations by inference and by interactively soliciting information from the annotator n TimeBank annotation follows more or less the same procedure –Events and times automatically annotated in a 1st pass and corrected by a human

51 June 27, 2003 Propor03, Faro Pilot Study Inter-Annotator Results two sets of results: 1. agreement on entities (events and times): 77% recall and 81% precision; agreement on attributes: 60% recall and 64% precision 2. agreement on temporal relations: 40% recall and 68% precision

52 June 27, 2003 Propor03, Faro Outline of Talk n Introduction: Perspectives on Evaluation in HLT n Technology Evaluation: Terms and Definitions n An Extended Example: The Cub Reporter Scenario –Application Evaluation: Question Answering and Summarisation –Component Evaluation: Time and Event Recognition n Ingredients of a Successful Evaluation n Snares and Delusions n Conclusions

53 June 27, 2003 Propor03, Faro Ingredients of a Successful (Technology) Evaluation n Resources (= money) –Human (people to make it work) –Data (copyright issues, distribution) n Well-defined task –Can humans independently perform the task at an acceptable level (e.g. > 80%) when given only a written specification? n Controlled challenge –Should be hard enough to attract interest and stimulate new approaches –Not so hard as to require too much effort from participants or to be impossible to specify n Metrics –Need to capture intuitively significant aspects of system performance –Need to experiment with new metrics (they lead to asking different questions)

54 June 27, 2003 Propor03, Faro Ingredients of a Successful (Technology) Evaluation n Participants –There must be a community willing to take part –Is the task so hard that participants need funding to do it? n Reusability –Evaluation should result in resources – data (raw and annotated), scoring software, guidelines – that can be reused n Extensibility –Should be possible to ramp up the challenge year-on-year so as to capitalise on the resources invested and the community which has been established

55 June 27, 2003 Propor03, Faro Snares and Delusions n Diminishing returns –Evaluations can lead to participants getting obsessed with reducing error/increasing performance by fractions of a percent n Pseudo-science –Because we’re measuring something it must be science n Significance –Are differences between participants’ systems statistically significant? n Users care

56 June 27, 2003 Propor03, Faro Conclusions n A successful technology evaluation should advance the field n Advances can include: –Better models of human language processing (e.g. better applications and components) –Creation of reusable resources and materials –Creation of a community of researchers focused on the same task –Creation of an ethos of repeatability and transparency (i.e. good scientific practice)

57 June 27, 2003 Propor03, Faro The End I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it and cannot express it in numbers your knowledge is of a meager and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be. Lord Kelvin, Popular Lectures and Addresses, (1889), vol 1. p. 73.

