
Presentation transcript:

Carnegie Mellon

Goal
Recycle non-expert post-editing efforts to:
- Refine translation rules automatically
- Improve overall translation quality

Proposed approach
- User-friendly online GUI: the Translation Correction Tool
>> non-expert bilingual speakers (abstract away from MT system details)
>> MT error classification specifically tailored to elicit the most information possible with the least linguistic terminology
- Active Learning to obtain minimal pairs and do feature detection
- Rule Refinement operations to automatically modify translation rules

AVENUE System
- Rule-based MT system
- Rapid development of MT
- Resource-poor languages
Requirements: a small number of non-expert bilingual speakers to translate and align the elicitation corpus (Probst et al. 2001)
Goal: learn and refine translation rules automatically

The Translation Correction Tool v.01

MT error classification
A radically different approach to MT evaluation: instead of end-users, translation experts or developers, it needs to be tailored for non-expert bilingual users.
Hypothesis:
>> non-expert bilingual users can accurately detect an error in the machine-translated sentence, given the source language sentence and, optionally, some context.
>> they can probably also indicate which other word(s) in the target sentence give the clue about why there is an error. Example: in agreement errors, the word that the incorrect word needs to agree with.

English-Spanish User Studies
Purpose: threefold
>> test naïve users' ability to detect and classify MT errors
>> assess GUI usefulness and user-friendliness
>> assess appropriateness of the MT error classification
- 32 English sentences extracted from the AVENUE elicitation corpus
- The transfer MT system included a hand-crafted grammar with 12 rules and 442 lexical entries

Correction Example with the TCTool
Output from MT system:
SL: i saw you yesterday
TL: vi tu ayer
AL: ((2,1),(3,2),(4,3))
Users need to correct the Spanish translation so that words are in the right form and in the right order. Note that an alignment is missing from “I” to “vi”, so users should also add an alignment between these two words.

Actual user statistics
- 29 users completed all 32 sentences.
- 83% of users were from Spain.
- 2/3 had no background in Linguistics.
- 75% had a graduate degree and 25% a Bachelor's degree.
- Average translations fixed: 26.6 (out of 32)
- Average duration: 1 h 30 min >> ~3 minutes per translation
- Duration range: 28 min to 4 h 18 min

Measuring user accuracy
Gold standard: 10 users' log files (~300 files)
>> interested in high precision at the expense of lower recall.
- User corrections were not always consistent with other users' corrections.
- Most of the time, when the final translations differed from the gold standard, they were still correct.
- On average, users produced only 2.5 translations that were worse than the gold standard (out of 26.6).
- Users got most alignments correct.

Usability questionnaire
- 82% said the TCTool is user-friendly.
- 100% said it is easy to determine if a sentence translation is correct, but only 88% felt that determining the source of errors is easy.
- Users did not read most of the tutorial (23 pages).

Conclusions
The TCTool is an online tool that elicits guided and structured user feedback on translations generated by a transfer-based MT system, with the ultimate goal of automatically improving the translation rules.
The first English-Spanish user study shows that users can detect errors with high accuracy (89%), but have a harder time classifying errors given the MT error classification above (72%).
In general, most of the problems users had were due to not having read the instructions and tutorial.

The Translation Correction Tool: English-Spanish user studies
Ariadna Font Llitjós and Jaime Carbonell
Language Technologies Institute, CMU

Abstract
Machine translation systems should improve with feedback from post-editors, but none do, beyond statistical and example-based MT improving marginally when the corrected translation is added to the parallel training data. Rule-based systems to date improve only via manual debugging. In contrast, we introduce a largely automated method for capturing more information from the human post-editor so that corrections may be performed automatically to translation grammar rules and lexical entries. This paper focuses on the information capture phase and reports on an experiment with English-Spanish translation.

Version 01 has 5 CGI scripts in Perl and 1 in JavaScript, which together produce a total of 8 different HTML pages. This simplified data flow diagram shows how the core of the TCTool works.

Set of possible actions to correct a sentence using the TCTool
- modify a word >> set of error types associated with it
- add a word
- delete a word
- drag a word into a different position (change word order)
- add an alignment
- delete an alignment

Future Work
>> Interactive dynamic tutorial
Need higher precision in error classification:
>> Refine the MT error classification as shown in the snapshot on the right.
>> examples added
>> drop-down menu added
>> Analyze all user feedback to see how we can automate the rule refinement process.

Acknowledgements
This research was funded in part by NSF grant number IIS NSF. We would also like to thank Kenneth Sim and Patrick Milholl for the implementation of the JavaScript.

References
Flanagan, M. Error Classification for MT Evaluation. Proceedings of AMTA 94, pp.
Imamura, K., Sumita, E. and Matsumoto, Y. Feedback cleaning of Machine Translation Rules Using Automatic Evaluation. ACL-03: 41st Annual Meeting of the Association for Computational Linguistics, pp.
Menezes, A. and Richardson, S. A best-first alignment algorithm for automatic extraction of transfer mappings from bilingual corpora. Workshop on Example-Based Machine Translation, in MT Summit VIII, pp.
Papineni, K., Roukos, S. and Ward, T. Maximum Likelihood and Discriminative Training of Direct Translation Models. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP-98), pp.
Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L. and Peterson, E. Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 workshop at MT Summit.
Probst, K., Levin, L., Peterson, E., Lavie, A. and Carbonell, J. MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. Machine Translation, Special Issue on Embedded MT, 17(4).
Su, K., Chang, J. and Una Hsu, Y. A corpus-based statistics-oriented two-way design for parameterized MT systems: Rationale, Architecture and Training issues. TMI-95, 6th Theoretical and Methodological Issues in Machine Translation, pp.
White, J.S., O'Connell, T. and O'Mara, F. The ARPA MT Evaluation Methodologies: Evaluation, Lessons, and Future Approaches. Proceedings of AMTA 94, pp.
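
The SL/TL/AL listing in the correction example appears to pair 1-based source word positions with 1-based target word positions. Under that assumption, the following minimal Python sketch shows how the missing alignment for "i" could be detected and how the "add an alignment" action could be applied. The function names are hypothetical illustrations and are not part of the TCTool, which the poster describes as Perl CGI scripts plus JavaScript.

```python
from typing import List, Set, Tuple

# An alignment link is (source position, target position), both 1-based.
# This reading of the AL notation is an assumption based on the example.
Alignment = Set[Tuple[int, int]]


def unaligned_source_words(source: List[str], alignment: Alignment) -> List[Tuple[int, str]]:
    """Return (position, word) for every source word with no alignment link."""
    aligned_positions = {s for s, _ in alignment}
    return [(i, w) for i, w in enumerate(source, start=1) if i not in aligned_positions]


def add_alignment(alignment: Alignment, source: List[str], target: List[str],
                  s: int, t: int) -> Alignment:
    """Hypothetical 'add an alignment' action: return a new set with link (s, t)."""
    if not (1 <= s <= len(source) and 1 <= t <= len(target)):
        raise ValueError(f"alignment ({s},{t}) is out of range")
    return alignment | {(s, t)}


# Data copied from the correction example.
source = "i saw you yesterday".split()
target = "vi tu ayer".split()
alignment: Alignment = {(2, 1), (3, 2), (4, 3)}

missing = unaligned_source_words(source, alignment)
print(missing)  # [(1, 'i')]

# The user adds the link from "i" (source 1) to "vi" (target 1).
fixed = add_alignment(alignment, source, target, 1, 1)
print(sorted(fixed))  # [(1, 1), (2, 1), (3, 2), (4, 3)]
```

Returning a new set instead of mutating the original keeps the MT system's output available for comparison with the user's corrected alignment.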