Language Technologies Institute Carnegie Mellon University

Machine Translation Post-Editing Study Project
Kent State Project Meeting
Alon Lavie
Language Technologies Institute
Carnegie Mellon University
8 June 2011

Meeting Goals
- Work out details of a summer pilot project on MT post-editing involving CMU and Kent State
- Discuss long-term research goals and possible funding opportunities
- Identify concrete target program for a research grant proposal

Long-term Project Goals
Technology Goal: design MT systems that are most useful and productive for use by human translators as a CAT tool
Project Goals:
- Develop an in-depth understanding of the characteristics of MT post-editing within commercially-relevant settings
- Develop measures for quantifying the suitability of MT systems to the task of MT post-editing
- Explore advanced methods for optimizing MT for post-editing, and for integrating MT into CAT environments

Research Questions
- What types of MT errors are easy for human translators to correct, and what types are difficult? Can we create a taxonomy of such errors?
- How do these error characteristics of MT systems vary across different MT approaches and technologies (e.g. "rule-based" systems vs. "statistical" systems)?
- How do these error characteristics vary for different target languages and language pairs?
- How do these error characteristics differ between "generic" MT systems (such as Google) and MT systems that are directly adapted to domain and client data?
- How should translations produced by MT be presented and displayed to translators most effectively for post-editing? Should poor MT translations be filtered out so as not to confuse translators?
- Can we design measures that better capture the post-editing "difficulty" of MT output? If so, can we use these measures to produce MT output that is easier for translators to post-edit?
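As a point of reference for the last question, one simple baseline for post-editing effort is the word-level edit distance between the raw MT output and its post-edited version, normalized by the length of the post-edited text. The sketch below is an illustration only, not a measure proposed in these slides: the function names are hypothetical, tokenization is plain whitespace splitting, and, unlike TER-style metrics, it does not model block shifts.

```python
def word_edit_distance(mt_output: str, post_edited: str) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    hyp = mt_output.split()
    ref = post_edited.split()
    # prev[j] holds the distance between the hypothesis words seen so far
    # and the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def post_edit_rate(mt_output: str, post_edited: str) -> float:
    """Edits per post-edited word; 0.0 means the MT output was left unchanged."""
    ref_len = len(post_edited.split())
    if ref_len == 0:
        return 0.0 if not mt_output.split() else float("inf")
    return word_edit_distance(mt_output, post_edited) / ref_len


if __name__ == "__main__":
    mt = "the printer is connect to computer"
    pe = "the printer is connected to the computer"
    print(word_edit_distance(mt, pe), round(post_edit_rate(mt, pe), 3))
```

A count of edits like this says nothing about how hard each edit was for the translator, which is exactly the gap the research question points at; it is useful mainly as a baseline to compare richer measures against.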

Pilot Project Goals
- Collect preliminary data that supports developing a solid scientific research agenda for a long-term research project
- Become familiar with the task and challenges involved
- Develop an effective working relationship between the MT research team at CMU and the translation studies research team at Kent State

Commercially-Relevant Setting
Research should be framed in a commercially-relevant setting, where MT has been shown to produce significant gains in translator productivity, so that outcomes have immediate impact on the translation industry
Main characteristics of such settings:
- Commercially-relevant domain and data
- MT and TMs integrated within a common CAT editing environment for human translators (e.g. TRADOS)
- Domain and/or client-adapted MT as opposed to “generic” MT engines (e.g. Google)
Probably too complex and difficult to create a complete commercial setup for the summer pilot project, so simplify to the minimum required in order to collect meaningful data

Proposed Setting for Pilot
- Domain: Computer Hardware and Software documentation and software localization
- Language-Pair: English-Spanish
- In what direction? English-to-Spanish? Spanish-to-English? Both?
- No Translation Memories or integration of MT with TMs
- Simple GUI for MT error classification and MT post-editing

Proposed MT Systems
Domain-specific statistical MT system can be developed by Safaba Translation Solutions
- Data: About 4 million TUs (60 million words) of domain-specific training data that Safaba has acquired from the TAUS Data Association (TDA)
- System can be trained and ready for use within a couple of weeks
- Will be made available online for remote access and connection
Use two other MT systems as comparisons for the study:
- Google: “generic” (unadapted) high-quality SMT system
- BabelFish/SYSTRAN: “generic” (unadapted) rule-based MT system
Is this too much?

Proposed Pilot Study
Task-1: collect data on high-level classification of MT utility for post-editing. Translators classify MT-translated segments into one of three categories:
1. MT translation does not require any post-editing (perfect)
2. MT translation requires post-editing and can be post-edited
3. MT translation is unintelligible and cannot be effectively post-edited
Task-2: analysis of the data collected:
- Inter- and intra-coder agreement levels
- Distributional analysis
- Variation across type of MT and other controlled variables
Task-3: Perform a more detailed classification of the data from category-2 into types of error and their difficulty
Task-4: Perform actual post-editing of data from category-2, with time and end-quality measurements
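The agreement analysis in Task-2 could start from a chance-corrected statistic such as Cohen's kappa, computed over the three Task-1 categories for a pair of coders (inter-coder) or for one coder's two passes over the same segments (intra-coder). The following is a minimal sketch under stated assumptions: the slides do not name an agreement statistic, and the category labels and function name here are hypothetical.

```python
from collections import Counter

# Hypothetical short labels for the three Task-1 categories.
CATEGORIES = ("perfect", "editable", "unintelligible")


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of category labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of segments given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each coder's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    coder_1 = ["perfect", "editable", "editable", "unintelligible", "perfect", "editable"]
    coder_2 = ["perfect", "editable", "unintelligible", "unintelligible", "editable", "editable"]
    print(round(cohen_kappa(coder_1, coder_2), 3))
```

Kappa handles two coders at a time; with more than two translators per segment, a multi-rater statistic such as Fleiss' kappa would be the natural replacement.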

Preliminary Tasks
Selection of documents for the pilot study
- Domain relevant data from online resources
- Preferably with target human translations
- Controlling for document and segment difficulty (and length)?
- Who does this, and how soon?
Creation of the required user interfaces
- Design and develop simple online interfaces
- Testing
Identifying and selecting translator subjects
- Do you have students and are they available?
- IRB

Discussion…

Grant Opportunities
NSF:
- NSF Information and Intelligent Systems (IIS) Core Programs: rg=IIS&from=fund
  Medium-size Projects: Proposals due 9/15/2011
- Cyber-Enabled Discovery and Innovation (CDI) program: _key=nsf11502
  Next deadline is unclear
  Highly competitive
- Grant Opportunities for Academic Liaison with Industry (GOALI) program: rg=IIS&from=fund
  This program accepts proposals anytime, but the funding level is unclear.
Other US Government Funding Sources, such as NSA, NVTC
