TAP-ET: TRANSLATION ADEQUACY AND PREFERENCE EVALUATION TOOL
Mark Przybocki, Kay Peterson, Sébastien Bronsart
LREC 2008, Marrakech, Morocco, May 29, 2008

Outline
- Background
  - NIST Open MT evaluations
  - Human assessment of MT
- NIST's TAP-ET tool
  - Software design & implementation
  - Assessment tasks
- Example: MT08
- Conclusions & future directions

NIST Open MT Evaluations
- Purpose: to advance the state of the art of MT technology
- Method: evaluations at regular intervals since 2002; open to all who wish to participate; multiple language pairs, two training conditions
- Metrics: automatic metrics (primary: BLEU) and human assessments

Human Assessment of MT
Uses:
- Accepted standard for measuring MT quality
- Validation of automatic metrics
- System error analysis
Challenges:
- Labor-intensive in both set-up and execution
- Time limitations mean assessing fewer systems and less data
- Assessor consistency
- Choice of assessment protocols

NIST Open MT Human Assessment: History
- 2002– (organized by LDC):
  - Funding: funded (paid assessors)
  - System inclusion criteria: chosen to span a range of BLEU scores
  - Assessment tasks: Adequacy (5-point scale) [1]; Fluency (5-point scale) [1]
- Currently (organized by NIST):
  - Funding: not funded (volunteer assessors)
  - System inclusion criteria: participants' decision
  - Assessment tasks: Adequacy (7-point scale plus Yes/No global decision); Preference (3-way decision)

[1] "Assessment of Fluency and Adequacy in Translations," LDC.

Opportunity knocks…
The new assessment model provided an opportunity for human assessment research:
- Application design: how do we best accommodate the requirements of an MT human assessment evaluation?
- Assessment tasks: what exactly are we to measure, and how?
- Documentation and assessor training procedures: how do we maximize the quality of assessors' judgments?

NIST’s TAP-ET Tool Translation Adequacy and Preference Evaluation Tool PHP/MySQL application Allows quick and easy setup of a human assessments evaluation Accommodates centralized data with distributed judges Flexible to accommodate uses besides NIST evaluations Freely available Aims to address previous perceived weaknesses Lack of guidelines and training for assessors Unclear definition of scale labels Insufficient granularity on multipoint scales May LREC 2008 Marrakech, Morocco

TAP-ET: Implementation Basics
- Administrative interface: evaluation set-up (data and assessor accounts); progress monitoring
- Assessor interface: tool usage instructions; assessment instructions and guidelines; training set; evaluation tasks
- Adjudication interface: allows adjudication over pairs of judgments; helps identify and correct assessment errors; assists in identifying "adrift" assessors (a sketch of this logic follows below)
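
The adjudication step can be illustrated with a short sketch. This is not TAP-ET's actual PHP/MySQL implementation; the judgment-record layout (segment id, assessor id, 7-point score) and the disagreement thresholds are assumptions made for illustration.

    from collections import defaultdict

    def pair_judgments(judgments):
        """Group (segment_id, assessor_id, score) records by segment."""
        by_segment = defaultdict(list)
        for segment_id, assessor_id, score in judgments:
            by_segment[segment_id].append((assessor_id, score))
        return by_segment

    def flag_for_adjudication(judgments, max_gap=2):
        """Flag segments whose two 7-point scores differ by more than max_gap."""
        return [seg for seg, pair in pair_judgments(judgments).items()
                if len(pair) == 2 and abs(pair[0][1] - pair[1][1]) > max_gap]

    def adrift_assessors(judgments, threshold=1.5):
        """Flag assessors whose mean disagreement with their partner is high."""
        gaps = defaultdict(list)
        for pair in pair_judgments(judgments).values():
            if len(pair) == 2:
                (a1, s1), (a2, s2) = pair
                gap = abs(s1 - s2)
                gaps[a1].append(gap)
                gaps[a2].append(gap)
        return [a for a, g in gaps.items() if sum(g) / len(g) > threshold]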

Assessment Tasks
- Adequacy: measures the semantic adequacy of a system translation compared to a reference translation
- Preference: measures which of two system translations is preferable when compared to a reference translation

Assessment Tasks: Adequacy
- Comparison of one reference translation and one system translation
- Word matches are highlighted as a visual aid (sketched below)
- Decisions:
  - Q1: "quantitative" (7-point scale)
  - Q2: "qualitative" (Yes/No)
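
A minimal sketch of the word-match highlighting, assuming simple case-insensitive exact word matching; the tool's actual matching and display rules are not described in this talk.

    import re

    def highlight_matches(reference, system_output):
        """Mark system-translation words that also occur in the reference."""
        ref_words = {w.lower() for w in re.findall(r"\w+", reference)}
        marked = []
        for token in system_output.split():
            word = re.sub(r"\W+", "", token).lower()
            marked.append(f"*{token}*" if word in ref_words else token)
        return " ".join(marked)

    print(highlight_matches("The ministers met in Cairo on Monday.",
                            "Ministers meeting in Cairo Monday."))
    # -> *Ministers* meeting *in* *Cairo* *Monday.*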

Assessment Tasks: Preference
- Comparison of two system translations for one reference segment
- Decision: preference for either system, or no preference

Example: NIST Open MT08
- Arabic to English; 9 systems; 21 assessors (randomly assigned to data)
- Assessment data:
  - Documents: 26
  - Segments: 206 (full docs) for Adequacy; 104 (first 4 per doc) for Preference
  - Assessors: 2 per system translation (Adequacy); 2 per system translation pair (Preference); a hypothetical assignment sketch follows below
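
A hypothetical sketch of the "2 assessors per system translation" setup, assuming two distinct assessors are drawn at random for every (system, segment) pair; the actual MT08 assignment procedure is not specified in this talk.

    import random

    def assign_double_assessment(systems, segments, assessors, seed=0):
        """Assign two distinct assessors to every (system, segment) pair."""
        rng = random.Random(seed)
        return {(sys_id, seg_id): rng.sample(assessors, 2)
                for sys_id in systems for seg_id in segments}

    plan = assign_double_assessment(
        systems=range(9), segments=range(206),
        assessors=[f"assessor{i:02d}" for i in range(21)])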

Adequacy Test, Q1: Inter-Judge Agreement [chart]
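
The agreement statistic plotted on this slide is not reproduced in the transcript, but one common way to summarize inter-judge agreement on the 7-point Q1 scale is exact and within-one-point agreement over doubly judged segments; a sketch under that assumption:

    def agreement(scores_a, scores_b):
        """Exact and within-one-point agreement between two judges' scores."""
        pairs = list(zip(scores_a, scores_b))
        exact = sum(a == b for a, b in pairs) / len(pairs)
        within_one = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
        return exact, within_one

    # Toy example (not MT08 data):
    print(agreement([7, 6, 5, 3, 2], [7, 5, 5, 1, 2]))  # -> (0.6, 0.8)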

Adequacy Test, Q1: Correlation with Automatic Metrics [chart; one outlier annotated as a rule-based system]

Adequacy Test, Q1: Correlation with Automatic Metrics [chart]
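
The system-level comparison these slides imply correlates mean human adequacy (Q1) scores with an automatic metric such as BLEU. A sketch with placeholder numbers, not MT08 results:

    from statistics import mean

    def pearson(xs, ys):
        """Pearson correlation coefficient between two equal-length lists."""
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    bleu = [0.45, 0.41, 0.38, 0.33, 0.29]   # placeholder system BLEU scores
    adequacy = [5.8, 5.5, 5.1, 4.2, 3.9]    # placeholder mean Q1 scores
    print(round(pearson(bleu, adequacy), 3))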

Adequacy Test, Q1: Scale Coverage
Coverage of the 7-point scale by 3 systems with high, medium, and low system BLEU scores:

    Adequacy score   Q2 = Yes   Q2 = No   Total
    7 (All)          12.9%      1.2%      14.1%
    6                13.1%      10.0%     23.1%
    5                6.0%       12.0%     18.0%
    4 (Half)         -          %
    3                -          %
    2                -          %
    1 (None)         -          %

Adequacy Test, Q2: Scores by Genre [chart]

Preference Test: Scores [chart]

Conclusions & Future Directions
- Continue improving human assessments as an important measure of MT quality and a means of validating automatic metrics:
  - What exactly are we measuring that we want automatic metrics to correlate with?
  - What questions are the most meaningful to ask?
  - How do we achieve better inter-rater agreement?
- Continue post-test analyses:
  - What are the most insightful analyses of results?
  - An adjudicated "gold" score vs. statistics over many assessors?
- Incorporate user feedback into tool design and assessment tasks