Towards the Use of Linguistic Information in Automatic MT Evaluation Metrics Thesis Proposal Elisabet Comelles Supervisors: Irene Castellon and Victoria Arranz
Outline Introduction State of the Art Discussion of MT Evaluation Metrics Hypothesis & Objective Methodology & Schedule
Introduction Quick access to Multilingual Information Need for quick translation Sharp increase in the number of MT Systems Need for evaluation of those MT Systems Evaluation needs to be quick and reliable
Introduction The most widely used current Evaluation Metrics show problems New approaches to Evaluation using linguistic information: –Syntactic info –Semantic info Our scenario: –Comparison between already existing systems –Direction of translation to test: English-Spanish
State of the Art MT closely linked to MT Evaluation Purpose of the evaluation methods: –Error analysis –System comparison Chronologically: 1. Human MT Evaluation 2. Automatic MT Evaluation
State of the Art Types of MT Evaluation Focused on Context: –Context-based Evaluation (FEMTI) Evaluates suitability of the MT Technology & the MT System for the user’s purpose Parameters of analysis: functionality, reliability, usability, efficiency, maintainability, portability, cost, etc. Focused on Quantity & Quality: –Human Evaluation and Automatic Evaluation
State of the Art Types of MT Evaluation Human Evaluation: –Several approaches: Fidelity (ALPAC report) Intelligibility (ALPAC report) Comprehensive evaluation of informativeness (ARPA) Quality panel evaluation Adequacy and Fluency (Semantics and Syntax) Preferred Translation Required Post-Editing
State of the Art Types of MT Evaluation Human Evaluation: –Advantage: human evaluators can evaluate the overall quality of the system –Disadvantages: Time-consuming Expensive Subjective
State of the Art Types of MT Evaluation Automatic Evaluation: –Approaches: Based on Lexical Matching Based on Syntax Based on Semantics
State of the Art Types of MT Evaluation Based on Lexical Matching: –Dominant approach to Automatic MT Evaluation –Looks for lexical similarities between MT output and reference translations –Types: Edit Distance Measures (WER) Precision-oriented Measures (BLEU) Recall-oriented Measures (ROUGE) Measure balancing Precision & Recall (GTM)
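To make the lexical-matching families above concrete, here is a minimal sketch of two of the underlying ideas: a word-level edit distance (the basis of WER) and clipped n-gram precision and recall (the building blocks behind precision-oriented and recall-oriented measures). It is an illustration only, not the official implementation of WER, BLEU, ROUGE or GTM; it assumes whitespace tokenisation and a single reference, and the function names `wer` and `ngram_precision_recall` are hypothetical.

```python
from collections import Counter


def wer(hypothesis, reference):
    """Word error rate: word-level edit distance divided by reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the hypothesis prefix seen so far and ref[:j]
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # drop a hypothesis word
                            curr[j - 1] + 1,      # add a missing reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1] / len(ref)


def ngram_precision_recall(hypothesis, reference, n=1):
    """Clipped n-gram precision and recall against a single reference."""
    hyp, ref = hypothesis.split(), reference.split()
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Multiset intersection: each n-gram is credited at most as often as it
    # occurs in the reference.
    overlap = sum((hyp_ngrams & ref_ngrams).values())
    precision = overlap / max(sum(hyp_ngrams.values()), 1)
    recall = overlap / max(sum(ref_ngrams.values()), 1)
    return precision, recall


if __name__ == "__main__":
    hyp = "the cat sat on mat"
    ref = "the cat sat on the mat"
    print(wer(hyp, ref))
    print(ngram_precision_recall(hyp, ref, n=1))
```

In the toy pair above, the hypothesis drops one word: every hypothesis word is found in the reference, so precision stays perfect, while recall and WER register the omission. This is one reason a measure that balances precision and recall can behave differently from a purely precision-oriented one.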
State of the Art Types of MT Evaluation Based on Syntax –Recently developed –Focused on the syntax of the output sentence –Types: Constituency Parsing Dependency Parsing Combination of both analyses (Liu & Gildea 2005)
State of the Art Types of MT Evaluation Based on Semantics: –Recently developed –Focused on the semantics of the output sentence –Types: NEs: Quality over NEs (NEE) Semantic Roles: Similarities over Semantic Roles (SR)
Discussion of MT Evaluation Metrics Human Evaluation: –Advantage: Allows evaluation of overall quality –Disadvantages: Time-consuming Expensive Subjective
Discussion of MT Evaluation Metrics Automatic Evaluation: –Advantages: Fast Not expensive Objective Updatable –Disadvantages?
Discussion of MT Evaluation Metrics Automatic Metrics based on Lexical Matching: –Great advance in MT Research in the last decade –Widely accepted & used by the SMT research community –BLEU is the most used Automatic Metric –Criticized by those not developing SMT systems –Usually depend on translation references –Only take into account lexical similarities & disregard syntax –Biased
Discussion of MT Evaluation Metrics Automatic Metrics based on Syntax: –Good improvement –Work at sentence level –Only focused on Syntax –What about meaning? Automatic metrics based on Semantics: –Good improvement –Only NEs & Semantic Roles –NEs not too relevant –Need further development –Only focused on meaning, what about syntax?
Discussion of MT Evaluation Metrics Discussion of Automatic Metrics: –Each metric focuses on a partial aspect of quality Strongly biased evaluations Unfair comparison between systems Overtuning of the system –Need for integration of metrics Parametric vs. Non-parametric Evaluation of the quality of a metric combination Human likeness Human acceptability
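The parametric vs. non-parametric contrast above can be illustrated with a small sketch. It assumes that each individual metric has already been normalised to the range 0 to 1, that a non-parametric combination is a plain uniform average, and that a parametric combination is a weighted average whose weights would have to be tuned separately. The metric names, scores and weights below are invented for illustration and do not come from any experiment.

```python
def combine_uniform(scores):
    """Non-parametric combination: every metric contributes equally."""
    if not scores:
        raise ValueError("at least one metric score is required")
    return sum(scores.values()) / len(scores)


def combine_weighted(scores, weights):
    """Parametric combination: a weighted average with externally tuned weights."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * value for name, value in scores.items())
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical, already-normalised scores for one MT system.
    scores = {"lexical": 0.6, "syntactic": 0.4, "semantic": 0.8}
    # Hypothetical weights; in practice these would need to be tuned.
    weights = {"lexical": 1.0, "syntactic": 2.0, "semantic": 1.0}
    print(combine_uniform(scores))
    print(combine_weighted(scores, weights))
```

The uniform variant has nothing to tune, so it cannot be overfitted to a particular test set; the weighted variant can emphasise the more informative metrics but introduces parameters that must be estimated.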
Hypothesis & Objective Hypothesis: Adding new linguistic information will improve the performance of Automatic Metrics Main Objective: Proposing a new Automatic Evaluation Metric based on linguistic information.
Hypothesis & Objective Secondary Objectives: –Explore linguistic information: Syntactic info: POS, shallow parsing, chunking, full parsing, dependency parsing, constituency parsing, etc. Semantic info: Semantic Roles, semantic features, WordNet, FrameNet, Lexical Semantics, etc. –Look for linguistic resources suitable for computational processing –Look for publicly available linguistic resources –Explore the appropriate way to combine this information
Methodology & Schedule 4 stages: –Stage 1 (year 1 & 2): Bibliographic research and analysis: –Detailed exploration and analysis of Automatic Evaluation Metrics –Detailed exploration, analysis and selection of the adequate linguistic information. –Exploration of the feasibility and availability of the linguistic resources needed –Stage 2 (year 1 & 2): Selection of the evaluation corpus
Methodology & Schedule –Stage 3 (year 3): Experiments on how to combine this linguistic information and the automatic evaluation metrics Evaluation of our metric combination based on either human likeness or human acceptability. –Stage 4 (year 4): Analysis & discussion of the results obtained Summary of the findings and reflection on the results obtained Proposal of a new evaluation metric
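As a rough illustration of how a metric combination might be evaluated in Stage 3, the sketch below assumes that human acceptability is read as the correlation between metric scores and human assessments of the same translations. That reading is an assumption made for this example, not a definition taken from the slides, and the numbers are placeholders rather than results. Pearson correlation is written out by hand so the example has no dependencies.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder values: one metric score and one human rating per translation.
    metric_scores = [0.31, 0.45, 0.52, 0.70]
    human_ratings = [2, 3, 3, 5]
    print(pearson(metric_scores, human_ratings))
```

A value close to 1 would indicate that the metric ranks translations much as the human judges do; a rank-based correlation could be substituted if only the ordering matters.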