Enhanced Infrastructure for Creation & Collection of Translation Resources Zhiyi Song, Stephanie Strassel (speaker), Gary Krug, Kazuaki Maeda.


1 Enhanced Infrastructure for Creation & Collection of Translation Resources Zhiyi Song, Stephanie Strassel (speaker), Gary Krug, Kazuaki Maeda

2 Introduction
- LDC develops large-scale parallel text corpora for sponsored research programs
  - Manual creation of parallel text by human translators
  - Harvesting and aligning potential parallel documents from known repositories and the web
- Recent expansion in scope and variety
  - Requires improvements in quality, efficiency and cost-effectiveness

3 Context for Resource Creation
- Previous focus primarily Chinese, Arabic newswire (NW)
- Current focus on "unstructured" data
  - Broadcast News (BN) and Broadcast Conversation (BC)
  - Weblogs, Newsgroups (WB)
  - Handwritten document images of many types (VAR)
- New linguistic varieties
  - Eight language pairs in the LCTL program
  - Colloquial Arabic varieties for some projects
- New evaluation requirements
  - Multiple human translations, adjudication of multiple translations
  - Translation alternatives for ambiguous source text
  - Translation post-editing

4 Recent translation efforts

5 Manual Translation Pipeline
[Flowchart. Box labels as extracted: data pool; select audio; select text; selected web data; transcription and segmentation; segment into sentence units; convert to translator-friendly format; translation; QC; validate; convert to release format; release package (source text, translated text)]

6 Manual Translation
- Commercial agencies vetted, trained by LDC
- Required to use LDC's project-specific guidelines
  - Accuracy and fidelity over fluency
  - General principles, language-specific requirements
  - Rules for named entities, disfluencies, emoticons, etc.
  - Requirements for formatting and validation
  - Multiple examples of preferred translation
- Separate guidelines for specialized tasks
  - Post-editing machine translation output
  - Translation alternatives
  - Translation of novel single sentences
  - Translation of handwritten document images

7 Translation QC
- All translations undergo additional QC at LDC
  - Typically 10% of training data, 100% of evaluation data reviewed
- Standardized QC rating system deducts points for each type of error
  - QC report including score, examples sent to translators
  - Failing score requires re-translation of full data set
- QC process facilitated by customized QCTrans GUI
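The point-deduction scheme on this slide can be illustrated with a minimal sketch. The slides do not give the actual error categories, weights, or passing threshold, so `DEDUCTIONS`, `PASSING_SCORE`, and both function names below are hypothetical and only show the general shape of such a rating system.

```python
from collections import Counter

# Hypothetical deduction weights per error type; the real values are not in the slides.
DEDUCTIONS = {"mistranslation": 4, "omission": 3, "grammar": 1, "formatting": 1}
PASSING_SCORE = 80  # assumed threshold, not taken from the source


def qc_score(errors, start=100):
    """Return (score, counts) after deducting points for each logged error."""
    counts = Counter(errors)
    lost = sum(DEDUCTIONS[kind] * n for kind, n in counts.items())
    return max(start - lost, 0), counts


def needs_retranslation(score):
    """A failing score sends the full data set back for re-translation."""
    return score < PASSING_SCORE


# Errors a reviewer might log while checking a sample of one delivery.
score, counts = qc_score(["omission", "grammar", "grammar", "mistranslation"])
print(score, dict(counts), needs_retranslation(score))
```

The per-type counts returned alongside the score are the kind of detail a QC report could pass back to the translator together with examples.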

8 QCTrans GUI

9 Translation Project Management
- Translation database is core management tool
  - Document ID, language, genre, token count, LDC file server path
  - Data set information including project, phase, partition, restrictions
  - Translator assignment, due date, status, QC score, payment info
- Backend to LDC Translator Extranet
  - Translators access and submit assignments, validate submissions, view QC reports, generate invoices, check payment status
- Queries support status tracking but also assignment generation, data selection, cross-project coordination
  - What translation assignments are pending delivery this week?
  - What is average QC score for this translator on Chinese BC?
  - List Arabic NW files from 2007 that have never been released as GALE training data and are not part of any project's eval set
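Two of the example queries above can be sketched against a toy database. The actual LDC schema is not shown in the slides, so the `assignment` table, its column names, the sample rows, and the helper functions below are all invented for illustration; only the field list (document ID, language, genre, token count, translator, due date, status, QC score) comes from the slide.

```python
import sqlite3

# In-memory database with a hypothetical single-table schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE assignment (
        doc_id TEXT, language TEXT, genre TEXT, token_count INTEGER,
        translator TEXT, due_date TEXT, status TEXT, qc_score REAL
    )
    """
)
# Made-up sample rows, not real LDC data.
rows = [
    ("doc001", "Chinese", "BC", 1200, "agency_a", "2010-05-18", "delivered", 92.0),
    ("doc002", "Chinese", "BC", 900, "agency_a", "2010-05-20", "delivered", 88.0),
    ("doc003", "Arabic", "NW", 700, "agency_b", "2010-05-21", "pending", None),
    ("doc004", "Chinese", "NW", 500, "agency_a", "2010-06-02", "pending", None),
]
conn.executemany("INSERT INTO assignment VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)


def pending_between(start, end):
    """Assignments still pending with a due date in [start, end] (ISO dates)."""
    cur = conn.execute(
        "SELECT doc_id FROM assignment "
        "WHERE status = 'pending' AND due_date BETWEEN ? AND ? "
        "ORDER BY due_date",
        (start, end),
    )
    return [row[0] for row in cur]


def average_qc(translator, language, genre):
    """Average QC score for one translator on one language and genre."""
    cur = conn.execute(
        "SELECT AVG(qc_score) FROM assignment "
        "WHERE translator = ? AND language = ? AND genre = ?",
        (translator, language, genre),
    )
    return cur.fetchone()[0]


print(pending_between("2010-05-17", "2010-05-23"))
print(average_qc("agency_a", "Chinese", "BC"))
```

Storing due dates as ISO 8601 strings keeps the `BETWEEN` comparison correct under plain text ordering, which is a common choice with SQLite.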

10 LDC Translation Database

11 Parallel text harvesting
- Manual translation supplemented by harvesting and alignment of potential parallel text
  - Harvest text from multilingual sites, e.g. newswire providers
  - Standardize markup format
  - Use BITS document mapping module to find likely parallel documents
  - Use Champollion to find sentence alignments
- High yields in GALE program
  - 82,000 Arabic-English document pairs
  - 67,000 Chinese-English document pairs
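To give a feel for the sentence alignment step, here is a small dynamic-programming sketch. It is not Champollion: that tool is described as relying on lexical matching, whereas this stand-in uses only character lengths as the cost, in the spirit of classic length-based aligners. The function name `align`, the `skip_penalty` value, and the set of allowed groupings are assumptions made for the example.

```python
def align(src, tgt, skip_penalty=20):
    """Align two sentence lists by dynamic programming over character lengths.

    Returns a list of (source_indices, target_indices) groups. Allowed
    groupings are 1-1, 1-0, 0-1, 2-1 and 1-2.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                if di == 0 or dj == 0:
                    # Leaving a sentence unaligned costs a flat penalty.
                    step = skip_penalty
                else:
                    # Otherwise the cost is the length mismatch of the group.
                    s_len = sum(len(s) for s in src[i:ni])
                    t_len = sum(len(t) for t in tgt[j:nj])
                    step = abs(s_len - t_len)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)

    # Walk the back-pointers from the end to recover the chosen groups.
    groups = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return groups[::-1]


source = ["aaaa", "bbbbbbbbbb"]
target = ["AAAA", "BBBBB", "CCCCC"]
for s_idx, t_idx in align(source, target):
    print([source[k] for k in s_idx], "<->", [target[k] for k in t_idx])
```

A length-only cost is easy to follow but weak on noisy web data, which is one reason a lexicon-aware aligner is the more plausible choice for harvested text.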

12 Conclusion
- Robust, flexible translation infrastructure to support multiple, distinct, concurrent projects
- Much of this infrastructure freely available from LDC
  - Task specifications, guidelines available for all projects: http://projects.ldc.upenn.edu/gale/Translation/
  - QCTrans GUI slated for free, open-source distribution
- Many resulting parallel text corpora already in LDC Catalog
- Newly emerging data sets to be added over time

13 Recent corpora
Catalog Number  Title
LDC2007T23      GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1
LDC2008T08      GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2
LDC2008T18      GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3
LDC2007T24      GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1
LDC2008T09      GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2
LDC2009T02      GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1
LDC2009T06      GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2
LDC2008T02      GALE Phase 1 Arabic Blog Parallel Text
LDC2008T06      GALE Phase 1 Chinese Blog Parallel Text
LDC2009T03      GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1
LDC2009T09      GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2
LDC2009T15      GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1
LDC2010T03      GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2

14 Acknowledgements
- This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this paper does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

