Slide 1: NLG Shared Tasks: Let's try it and see what happens
Ehud Reiter (Univ of Aberdeen)

Slide 2: Contents
• General comments
• Geneval proposal

Slide 3: Good Points of Shared Tasks
• Compare different approaches
• Encourage people to interact more
• Reduce NLG "barriers to entry"
• Better understanding of evaluation

Slide 4: Bad Points
• May narrow the focus of the community
  » Did IR ignore web search because of TREC?
• May encourage incremental research instead of new ideas

Slide 5: My Opinion
• Let's give it a try
• But I suspect one-off exercises are better than a series
  » Many people think MUC, DUC, etc. were very useful initially but became less scientifically exciting over time

Slide 6: Practical Issues
• Domain/task?
  » Need something which several (6?) groups are interested in
• Evaluation technique?
  » Avoid techniques that are biased
    – e.g., some automatic metrics may favour statistical systems

Slide 7: Geneval
• A proposal to evaluate NLG evaluation
  » Core idea: evaluate a set of systems with similar input/output functionality in many different ways, and see how well the different evaluation techniques correlate
  » Anja Belz and Ehud Reiter
  » We hope to submit to EPSRC (roughly similar to the NSF in the US) soon

Slide 8: NLG Evaluation
• Many types
  » Task-based, human ratings, BLEU-like metrics, etc.
• Little consensus on the best technique
  » i.e., which is most appropriate for a given context
• Poorly understood

Slide 9: Some Open Questions
• How well do different evaluation types correlate?
  » e.g., does BLEU predict human ratings?
• Are there biases?
  » e.g., are statistical NLG systems over- or under-rated by some techniques?
• What is the best design?
  » Number of subjects, subject expertise, number (and quality) of reference texts, etc.

Slide 10: Belz and Reiter (2006)
• Evaluated several systems for generating wind statements in weather forecasts, using both human judgements and BLEU-like metrics (the correlation analysis is sketched below)
• Found OK (not wonderful) correlation, but also some biases
• Geneval: do this on a much larger scale
  » More domains, more systems, more evaluation techniques (including new ones), etc.
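
A minimal sketch of this kind of system-level correlation analysis, assuming scipy is available; all scores below are invented placeholders, not results from Belz and Reiter (2006):

```python
# Correlating an automatic metric with human ratings across systems.
from scipy.stats import pearsonr, spearmanr

bleu_scores   = [0.42, 0.35, 0.51, 0.28, 0.47]  # invented metric scores, one per system
human_ratings = [3.8, 3.1, 4.2, 2.9, 3.6]       # invented mean human ratings

r, _   = pearsonr(bleu_scores, human_ratings)   # linear agreement
rho, _ = spearmanr(bleu_scores, human_ratings)  # rank (ordering) agreement
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```

A high rank correlation would suggest the metric can stand in for human ratings when comparing systems; a low one would not.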

Slide 11: Geneval: Possible Domains
• Weather forecasts (not wind statements)
  » Use the SumTime corpus
• Referring expressions
  » Use the Prodigy-Grec or Tuna corpus
• Medical summaries
  » Use the Babytalk corpus
• Statistical summaries
  » Use the Atlas corpus

Slide 12: Geneval: Evaluation Techniques
• Human task-based
  » e.g., referential success
• Human ratings
  » Likert scales vs. preference judgements; expert vs. non-expert subjects
• Automatic metrics based on reference texts
  » BLEU, ROUGE, METEOR, etc. (see the sketch below)
• Automatic metrics without reference texts
  » MT T and X scores, text length
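
As an illustration of the reference-based metrics above, a minimal BLEU computation using NLTK's implementation; the forecast strings are invented examples, not SumTime corpus data:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Invented wind-statement examples (not real corpus texts).
references = [
    "SSW 10 15 INCREASING 20 25 BY EVENING".split(),
    "SSW 10 15 RISING 20 25 LATER".split(),
]
hypothesis = "SSW 10 15 INCREASING 20 25 LATER".split()

# Smoothing avoids a zero score when some higher-order n-gram has no match.
smooth = SmoothingFunction().method1
score = sentence_bleu(references, hypothesis, smoothing_function=smooth)
print(f"BLEU = {score:.3f}")
```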

Slide 13: Geneval: New Techniques
• We would also like to explore and develop new evaluation techniques
  » Post-edit-based human evaluations? (one possible formulation is sketched below)
  » Automatic metrics which look at semantic features?
  » Open to suggestions for other ideas!
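
One way a post-edit-based evaluation might be operationalised (my own illustrative sketch, not a committed Geneval design): have a human post-edit each generated text until it is acceptable, then score the system by the number of word-level edits required, normalised by text length; fewer edits means a better system:

```python
def word_edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance over word tokens."""
    dist = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dist[0] = dist[0], i
        for j, wb in enumerate(b, 1):
            prev, dist[j] = dist[j], min(
                dist[j] + 1,          # delete wa
                dist[j - 1] + 1,      # insert wb
                prev + (wa != wb),    # substitute (free if the words match)
            )
    return dist[-1]

# Invented example: system output vs. a hypothetical post-edited version.
generated   = "SSW 10 15 RISING 20 25 SOON".split()
post_edited = "SSW 10 15 INCREASING 20 25 BY EVENING".split()
edits = word_edit_distance(generated, post_edited)
print(f"post-edit cost = {edits / len(post_edited):.2f}")  # lower is better
```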

Slide 14: We Would Like Systems to Be Contributed
• The study would be better if other people contributed systems
  » We supply the data sets and corpora, and carry out the evaluations
  » So you can focus 100% on your great new algorithmic ideas!

Slide 15: Geneval from a STEC Perspective
• Sort of like a shared-task evaluation challenge (STEC)?
  » If people contribute systems based on our data sets and corpora
  » But results will be anonymised
    – Only the developer of system X knows how well X did
  » One-off exercises, not repeated
  » Multiple evaluation techniques
• We hope the data sets will reduce barriers to entry

Slide 16: Geneval
• Please let Anja or me know if
  » You have general comments, and/or
  » You have a suggestion for an additional evaluation technique, and/or
  » You might be interested in contributing a system