MODL5003 Principles and applications of MT


Automatic methods of MT evaluation
Lecture, 20/03/2006
MODL5003 Principles and applications of machine translation
Bogdan Babych <bogdan@comp.leeds.ac.uk>

Overview
- Aspects of MT evaluation
- Text quality evaluation
- Advantages / disadvantages of automatic techniques
- Methods of automatic evaluation
- Validation of automatic scores
- Challenges
- Recent developments

1. Aspects of MT evaluation (1) (Hutchins & Somers, 1992: 161-174)
- Text quality (important for developers, users and managers)
- Extendibility (developers)
- Operational capabilities of the system (users)
- Efficiency of use (companies, managers, freelance translators)

Aspects of MT evaluation (2)
- Text quality
  - can be done manually and automatically
  - the central issue in MT quality…
- Extendibility = architectural considerations:
  - adding new language pairs
  - extending lexical / grammatical coverage
  - developing new subject domains: “improvability” and “portability” of the system

Aspects of MT evaluation (3)
- Operational capabilities of the system
  - user interface
  - dictionary update: cost / performance, etc.
- Efficiency of use
  - is there an increase in productivity?
  - the cost of buying / tuning / integrating into the workflow / maintaining / training personnel
  - how much money can be saved for the company / department?

2. Text quality evaluation (TQE) – issues (1/2)
- Quality evaluation vs. error identification / analysis
- Black-box vs. glass-box evaluation
- Error correction on the user side
  - dictionary updating
  - do-not-translate lists, etc.

2. Text quality evaluation (TQE) – issues (2/2)
- Multiple quality parameters & their relations
  - fidelity (adequacy)
  - fluency (intelligibility, clarity)
  - style
  - informativeness…
  - Are these parameters completely independent? Or is intelligibility a pre-condition for adequacy or style?
- Granularity of evaluation
  - different for different purposes: individual sentences; texts; corpora of similar documents; the average performance of an MT system

3. Advantages of automatic evaluation
- Low cost
- Objective character of the evaluated parameters
  - reproducibility
  - comparability across texts (relative difficulty for MT) and across evaluations

… & Disadvantages
- Need for “calibration” with human scores
  - interpretation in terms of human quality parameters is not clear
- Do not account for all quality dimensions
  - hard to find good measures for certain quality parameters
- Reliable only for homogeneous systems
  - the results for non-native human translation, knowledge-based MT output and statistical MT output may be non-comparable

4. Methods of automatic evaluation
- Automatic evaluation is more recent: the first methods appeared in the late 1990s
- Performance methods
  - measuring the performance of some system which uses degraded MT output
- Reference proximity methods
  - measuring the distance between MT output and a “gold standard” translation

4.1 Performance methods
- A pragmatic approach to MT, similar to performance-based human evaluation:
  “…can someone using the translation carry out the instructions as well as someone using the original?” (Hutchins & Somers, 1992: 163)
- Different from human performance evaluation:
  1. tasks are carried out by an automated system
  2. parameter(s) of the output are automatically computed

…automated systems used & parameters computed
- Parser (automatic syntactic analyser)
  - computing the average depth of syntactic trees (Rajman and Hartley, 2000)
- Named Entity Recognition system (a system which finds proper names, e.g. names of organisations…)
  - number of extracted organisation names
- Information Extraction (filling a database: events, participants of events)
  - computing the ratio of correctly filled database fields

Performance-based methods: an example (1/2)
- Open-source NER system for English (ANNIE): www.gate.ac.uk
- The number of extracted organisation names gives an indication of Adequacy
  - ORI: … le chef de la diplomatie égyptienne
  - HT: the <Title>Chief</Title> of the <Organization>Egyptian Diplomatic Corps</Organization>
  - MT-Systran: the <JobTitle>chief</JobTitle> of the Egyptian diplomacy

Performance-based methods: an example (2/2)
- Count extracted organisation names
  - the number will be bigger for better systems
  - biggest for human translations
- Other types of proper names do not correspond to such differences in quality:
  - person names
  - location names
  - dates, numbers, currencies …

NE recognition on MT output

Performance-based methods: interpretation
- Built on prior assumptions about natural-language properties:
  - sentence structure is always connected;
  - MT errors more frequently destroy relevant contexts than create spurious contexts;
  - difficulties for automatic tools are proportional to relative “quality” (the amount of MT degradation)
- Be careful with prior assumptions: what is worse for the human user may be better for an automatic system

Example 1
- ORI: “Il a été fait chevalier dans l'ordre national du Mérite en mai 1991”
- HT: “He was made a Chevalier in the National Order of Merit in May, 1991.”
- MT-Systran: “It was made <JobTitle>knight</JobTitle> in the national order of the Merit in May 1991”.
- MT-Candide: “He was knighted in the national command at Merite in May, 1991”.

Example 2: parser-based score (X-score)
- The Xerox shallow parser XELDA produces annotated dependency trees; it identifies 22 types of dependencies
- Example: “The Ministry of Foreign Affairs echoed this view”
  - SUBJ(Ministry, echoed)
  - DOBJ(echoed, view)
  - NN(Foreign, Affairs)
  - NNPREP(Ministry, of, Affairs)

Example 2 (contd.)
- “a hearing that lasted more than 2 hours” → RELSUBJ(hearing, lasted)
- “a public program that has already been agreed on” → RELSUBJPASS(program, agreed)
- “to examine the effects as possible” → PADJ(effects, possible)
- “brightly coloured doors” → ADVADJ(brightly, coloured)
- X-score = (#RELSUBJ + #RELSUBJPASS – #PADJ – #ADVADJ)

4.2 Reference proximity methods
- Assumption of Reference Proximity (ARP): “…the closer the machine translation is to a professional human translation, the better it is” (Papineni et al., 2002: 311)
- Finding a distance between two texts:
  - minimal edit distance
  - N-gram distance
  - …

Minimal edit distance
- Minimal number of editing operations needed to transform text1 into text2:
  - deletions (sequence xy changed to x)
  - insertions (x changed to xy)
  - substitutions (x changed to y)
  - transpositions (sequence xy changed to yx)
- Algorithm by Wagner and Fischer (1974)
- Edit-distance implementation for MT evaluation: the RED method (Akiba Y., K. Imamura and E. Sumita, 2001)

Problem with edit distance: legitimate translation variation
- ORI: De son côté, le département d'Etat américain, dans un communiqué, a déclaré: ‘Nous ne comprenons pas la décision’ de Paris.
- HT-Expert: For its part, the American Department of State said in a communique that ‘We do not understand the decision’ made by Paris.
- HT-Reference: For its part, the American State Department stated in a press release: We do not understand the decision of Paris.
- MT-Systran: On its side, the American State Department, in an official statement, declared: ‘We do not include/understand the decision’ of Paris.

Legitimate translation variation (LTV) …contd.
- To which human translation should we compute the edit distance?
- Is it possible to integrate both human translations into a reference set?

N-gram distance
- The number of common words (evaluating lexical choices);
- The number of common sequences of 2, 3, 4 … N words (evaluating word order):
  - 2-word sequences (bigrams)
  - 3-word sequences (trigrams)
  - 4-word sequences (four-grams)
  - … N-word sequences (N-grams)
- N-grams allow us to compute several parameters…

Proximity to human reference
- MT “Systran”: The 38 heads of undertaking put in examination in the file were the subject of hearings […] in the tread of "political" confrontation.
- Human translation “Expert”: The 38 heads of companies questioned in the case had been heard […] following the "political" confrontation.
- MT “Candide”: The 38 counts of company put into consideration in the case had the object of hearings […] in the path of confrontal "political."


Matches of N-grams
- N-grams found in both the MT output and the HT reference: true hits
- N-grams in the HT reference but not in the MT output: omissions
- N-grams in the MT output but not in the HT reference: false hits

Matches of N-grams (contd.)
- Human text +, MT +: true hits
- Human text +, MT –: omissions → recall (avoiding omissions)
- Human text –, MT +: false hits → precision (avoiding false hits)

Precision and Recall
- Precision = how accurate is the answer? (“Don’t guess, wrong answers are deducted!”)
- Recall = how complete is the answer? (“Guess if not sure!”, don’t miss anything!)


Precision (P) and Recall (R): Organisation names

N-grams: Union and Intersection
- Union of the reference N-gram sets ~ Precision
- Intersection of the reference N-gram sets ~ Recall

Translation variation and N-grams
- N-gram distance to multiple human reference translations
- Precision on the union of the N-gram sets in HT1, HT2, HT3…
  - N-grams in all independent human translations taken together, with repetitions removed
- Recall on the intersection of the N-gram sets
  - N-grams common to all sets – only repeated N-grams! (most stable across different human translations)

Human and automated scores
- Empirical observations:
  - precision on the union gives an indication of Fluency
  - recall on the intersection gives an indication of Adequacy
  - automated Adequacy evaluation is less accurate – a harder problem
- Currently the most successful N-gram proximity metric is the BLEU evaluation measure (Papineni et al., 2002) – BiLingual Evaluation Understudy

BLEU evaluation measure
- Computes precision on the union of N-grams
- Accurately predicts Fluency
- Produces scores in the range [0, 1]
- Usage:
  - download and extract the Perl script “bleu.pl”
  - prepare the MT output and reference translations in separate *.txt files
  - type at the command prompt: perl bleu-1.03.pl -t mt.txt -r ht.txt

BLEU evaluation measure (contd.)
- Texts may be surrounded by tags, e.g.: <DOC doc_ID="1" sys_ID="orig"> … </DOC>
- Different reference translations: <DOC doc_ID="1" sys_ID="orig">, <DOC doc_ID="1" sys_ID="ref2">, <DOC doc_ID="1" sys_ID="ref3">
- Paragraphs may be surrounded by tags, e.g.: <seg id="1"> … </seg>

5. Validation of automatic scores
- Automatic scores have to be validated: are they meaningful, i.e. do they predict any human evaluation measures, e.g. Fluency, Adequacy, Informativeness?
- Agreement between human and automated scores is measured by Pearson’s correlation coefficient r
  - a number in the range [–1, 1]
  - –1 < r < –0.5: strong negative correlation
  - 0.5 < r < +1: strong positive correlation
  - –0.5 < r < 0.5: no correlation or weak correlation

Pearson’s correlation coefficient r in Excel (e.g. with the CORREL or PEARSON worksheet functions)

HumanSc = Slope * AutomatedSc + Intercept

6. Challenges
- Multi-dimensionality
  - no single measure of MT quality
  - some quality measures are harder to automate
- Evaluating the usefulness of imperfect MT
  - different needs of automatic systems and human users
  - human users have publication (dissemination) in mind, but MT is primarily used for understanding (assimilation)

7. Recent developments: N-gram distance
- Paraphrasing instead of multiple reference translations
- More weight to more “important” words – those relatively more frequent in a given text (Babych & Hartley, ACL 2004)
- Relations between different human scores
- Accounting for dynamic quality criteria

“Salience” weighting
- tf_{i,j} – frequency of word w_i in document j
- df_i – number of documents in the collection that contain w_i
- N – total number of documents in the collection
- Term frequency / inverse document frequency:
  tf.idf(i,j) = (1 + log(tf_{i,j})) * log(N / df_i)
- “Salience” score


IE-based MT evaluation: analysis of improvement
- Systran: higher term-frequency weights:
  - heads: tf.idf = 4.605; S = 4.614
  - confrontation: tf.idf = 5.937; S = 3.890
- Candide: less salient unigrams:
  - case: tf.idf = 3.719; S = 2.199
  - had: tf.idf = 0.562; S = 0.000
