Evaluation of NLP Systems
Martin Hassel
KTH NADA, Royal Institute of Technology
100 44 Stockholm
+46-8-790 66 34



Why Evaluation?
General aspects: to measure progress
Commercial aspects: to ensure consumer satisfaction
Scientific aspects: good science

What Is Good Science?
Induction: testing against a data subset considered to fairly represent the complete possible data set
Popper's theory of falsifiability: for an assertion to be falsifiable, it must in principle be possible to make an observation or perform a physical experiment that would show the assertion to be false

Evaluation Schemes
Intrinsic: measures the system in and of itself
Extrinsic: measures the efficiency and acceptability of the generated summaries in some task; requires "user" interaction

Stages of Development
Early: intrinsic evaluation on component level
Mid: intrinsic evaluation on system level
Late: extrinsic evaluation on system level

Manual Evaluation
Human judges
+ Semantically based assessment
– Subjective
– Time consuming
– Expensive

Semi-Automatic Evaluation
Task-based evaluation (extrinsic)
+ Measures the summary's utility
– Subjective interpretation of questions and answers
Keyword association (intrinsic)
+ No annotation required
– Shallow; allows for "good guesses"

Automatic Evaluation
Comparison to gold standard:
Sentence recall (intrinsic)
+ Cheap and repeatable
– Does not distinguish between different summaries
Vocabulary test (intrinsic)
+ Useful for key-phrase summaries
– Sensitive to word order differences and negation
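The gold-standard comparison can be sketched in a few lines. This is a minimal illustration, not any particular evaluation toolkit: sentence recall is simply the fraction of gold-standard sentences that the system summary recovers (the example sentences are invented).

```python
def sentence_recall(system_summary, gold_summary):
    """Fraction of gold-standard sentences also present in the system summary."""
    gold = set(gold_summary)
    if not gold:
        return 0.0
    return len(gold & set(system_summary)) / len(gold)

gold = ["The cat sat.", "It rained all day.", "The match was cancelled."]
system = ["It rained all day.", "The match was cancelled.", "Fans went home."]
print(sentence_recall(system, gold))  # 2 of 3 gold sentences recovered
```

Note how the set-based comparison illustrates the weakness listed above: two summaries containing the same sentences in different orders, or paraphrases of the gold sentences, score identically or not at all.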

Corpora
A body of data considered to represent "reality" in a balanced way
Sampling
Raw format vs. annotated data

Corpora can be… a part-of-speech tagged data collection:
Arrangör       nn.utr.sin.ind.nom
var            vb.prt.akt.kop
Järfälla       pm.gen
naturförening  nn.utr.sin.ind.nom
där            ha
Margareta      pm.nom
är             vb.prs.akt.kop
medlem         nn.utr.sin.ind.nom
.              mad
("The organizer was the Järfälla nature society, where Margareta is a member.")
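A reader for such a tagged collection can be very small. The sketch below assumes one token per line, the word followed by its tag separated by whitespace (the function name and sample lines are illustrative, not part of any standard corpus API):

```python
def read_pos_corpus(lines):
    """Parse whitespace-separated word/tag lines into (word, tag) pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        word, tag = line.split(None, 1)
        pairs.append((word, tag))
    return pairs

sample = ["Arrangör nn.utr.sin.ind.nom", "var vb.prt.akt.kop"]
print(read_pos_corpus(sample))
```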

Corpora can be… a parse tree data collection:
(S (NP-SBJ (NNP W.R.) (NNP Grace))
   (VP (VBZ holds)
       (NP (NP (CD three))
           (PP (IN of)
               (NP (NP (NNP Grace) (NNP Energy) (POS 's))
                   (CD seven) (NN board) (NNS seats)))))
   (. .))

Corpora can be… an RST tree data collection:
(SATELLITE (SPAN 4 19) (REL2PAR ELABORATION-ADDITIONAL)
  (SATELLITE (SPAN 4 7) (REL2PAR CIRCUMSTANCE)
    (NUCLEUS (LEAF 4) (REL2PAR CONTRAST)
      (TEXT _!The package was termed excessive by the Bush administration,_!))
    (NUCLEUS (SPAN 5 7) (REL2PAR CONTRAST)
      (NUCLEUS (LEAF 5) (REL2PAR SPAN)
        (TEXT _!but it also provoked a struggle with influential California lawmakers_!)))))

Corpora can be… a collection of sound samples

Widely Accepted Corpora & Metrics
Pros
+ Well-defined origin and context
+ Well-established evaluation schemes
+ Inter-system comparability
Cons
– Optimizing for a specific data set
– May establish a common "truth"

Ethics
Informants must be informed and should be anonymous
Data should be preserved for ten years

Upper and Lower Bounds
Baselines
Serve as the lower limit; it is common to have several baselines
Inter-assessor agreement
Low inter-assessor agreement demands comparison against several "sources"
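Inter-assessor agreement is commonly quantified with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch (the label values "rel"/"irr" are invented for illustration):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two assessors labeling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items the assessors label identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each assessor's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["rel", "rel", "irr", "rel", "irr", "irr"]
b = ["rel", "irr", "irr", "rel", "irr", "rel"]
print(round(cohens_kappa(a, b), 3))  # → 0.333
```

A kappa near 0 means the assessors barely beat chance, which is exactly the situation where evaluating against a single "source" is unreliable and several references are needed.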

Conferences & Campaigns
TREC (Text REtrieval Conference): Information Retrieval/Extraction and TDT
CLEF (Cross-Language Evaluation Forum): Information Retrieval on texts in European languages
DUC (Document Understanding Conference): Automatic Text Summarization
SENSEVAL: Word Sense Disambiguation
ATIS (Air Travel Information System): DARPA Spoken Language Systems
…and a few more