MT Evaluation: What to do and what not to do
Eduard Hovy
Information Sciences Institute, University of Southern California

My background: MT Eval is painful
- Psychological damage: the DARPA MT program
- Killer competition: the TREC IR and QA contests
- Painstaking details: the ISLE collaboration

Best and worst
- Best: you describe your needs and your environment, and provide some sample text… and the evaluation system gives you a report of all the systems' performances!
- Worst: a repeat of the DARPA experience: expensive, stressful, and inconclusive.

The worst thing an evaluator can do
Apply some arbitrary measure:
- without understanding what it measures,
- without knowing when it is applicable,
- without understanding the accuracy or reliability of the metric.
What do we need? A systematic outline of the metrics, their appropriateness, and their scope.
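To make "understanding a metric" concrete: before adopting any automatic measure, one can at least check how well it tracks human judgments on a sample. The sketch below is not from the talk; the scores, the 1-5 adequacy scale, and the 0.6 threshold are all invented for illustration.

```python
# Sanity-check an automatic MT metric against human judgments before trusting it.
# All scores and the correlation threshold below are hypothetical illustrations.
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-segment scores for the same eight translations:
metric_scores = [0.71, 0.42, 0.88, 0.35, 0.60, 0.79, 0.50, 0.66]
human_adequacy = [4.0, 2.5, 4.5, 2.0, 3.5, 4.0, 3.0, 3.5]  # 1-5 scale

r = pearson(metric_scores, human_adequacy)
print(f"metric-human correlation: r = {r:.2f}")
if r < 0.6:  # arbitrary illustrative threshold
    print("Weak agreement with humans: don't rely on this metric alone.")
```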

Do-It-Yourself Evaluations
Procedure:
1. Characterize your translation purposes and operational processes.
2. Moving downward, locate their points in the taxonomies.
3. Find their evaluation metrics.
4. Apply them to your MT system(s).
De/recomposable scores:
5. Recombine the scores by moving back up the taxonomy.
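A minimal sketch of steps 2-5, under strong simplifying assumptions: the taxonomy below is a tiny invented stand-in for the real ISLE taxonomy, the leaf scores are made up, and parent scores are recomposed as plain equal-weight means (a real evaluation would likely need purpose-specific weights).

```python
# Toy version of the DIY procedure: leaf nodes of a quality taxonomy carry
# metric scores; parent scores are recomposed by moving back up the tree.
# The taxonomy shape, leaf scores, and equal-weight averaging are simplifying
# assumptions, not the actual ISLE structure.

taxonomy = {
    "translation quality": {
        "fidelity": {
            "adequacy": 0.72,         # e.g., averaged human adequacy judgments
            "informativeness": 0.65,  # e.g., comprehension-test score
        },
        "fluency": {
            "grammaticality": 0.80,   # e.g., error-count-based score
            "style": 0.58,            # e.g., human style rating
        },
    }
}

def recompose(node):
    """Step 5: recombine leaf metric scores upward through the taxonomy."""
    if isinstance(node, dict):
        return sum(recompose(child) for child in node.values()) / len(node)
    return node  # leaf: a metric score in [0, 1]

for name, subtree in taxonomy.items():
    print(f"{name}: {recompose(subtree):.3f}")
# -> translation quality: 0.688 (mean of fidelity 0.685 and fluency 0.690)
```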

Work required
The ISLE taxonomy is there, more or less. BUT: the actual metrics have not yet been characterized systematically:
- inter-tester agreement,
- robustness over different domains and genres,
- expense and difficulty,
- independence from, or relationship to, other metrics.
What's needed: a project of evaluation evaluations.
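As one example of what such an "evaluation evaluation" might measure: inter-tester agreement on a human metric can be quantified with Cohen's kappa, which discounts the agreement expected by chance. The sketch below uses invented judgments from two hypothetical testers.

```python
# One piece of an "evaluation evaluation": quantify inter-tester agreement.
# Cohen's kappa corrects raw agreement for agreement expected by chance.
# The two testers' adequacy labels below are invented for illustration.
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical judgments."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in ca.keys() | cb.keys()) / n ** 2
    return (observed - expected) / (1 - expected)

tester1 = ["good", "good", "bad", "good", "bad", "good", "bad", "good"]
tester2 = ["good", "bad", "bad", "good", "bad", "good", "good", "good"]

print(f"kappa = {cohens_kappa(tester1, tester2):.2f}")
# A low kappa means the human measure itself is unreliable,
# regardless of which MT system it is applied to.
```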