Enhanced Infrastructure for Creation & Collection of Translation Resources Zhiyi Song, Stephanie Strassel (speaker), Gary Krug, Kazuaki Maeda.

Introduction  LDC develops large scale parallel text corpora for sponsored research programs Manual creation of parallel text by human translators Harvesting, aligning potential parallel documents from known repositories and the web  Recent expansion in scope and variety Requiring improvements in quality, efficiency and cost-effectiveness

Context for Resource Creation
- Previous focus primarily Chinese, Arabic newswire (NW)
- Current focus on "unstructured" data
  - Broadcast News (BN) and Broadcast Conversation (BC)
  - Weblogs, Newsgroups (WB)
  - Handwritten document images of many types (VAR)
- New linguistic varieties
  - Eight language pairs in the LCTL program
  - Colloquial Arabic varieties for some projects
- New evaluation requirements
  - Multiple human translations, adjudication of multiple translations
  - Translation alternatives for ambiguous source text
  - Translation post-editing

Recent translation efforts

Manual Translation Pipeline
[Flowchart; box labels: data pool; select audio; select text; selected web data; segment into sentence units; convert to release format; source text; translated text; validate; release package; convert to translator-friendly format; translation; QC; transcription and segmentation]

Manual Translation  Commercial agencies vetted, trained by LDC  Required to use LDC's project-specific guidelines Accuracy and fidelity over fluency General principles, language-specific requirements Rules for named entities, disfluencies, emoticons, etc. Requirements for formatting and validation Multiple examples of preferred translation  Separate guidelines for specialized tasks Post-editing machine translation output Translation alternatives Translation of novel single sentences Translation of handwritten document images

Translation QC  All translations undergo additional QC at LDC Typically 10% of training data, 100% of evaluation data reviewed  Standardized QC rating system deducts points for each type of error QC report including score, examples sent to translators Failing score requires re-translation of full data set  QC process facilitated by customized TransQC GUI

QCTrans GUI

Translation Project Management
- Translation database is core management tool
  - Document ID, language, genre, token count, LDC file server path
  - Data set information including project, phase, partition, restrictions
  - Translator assignment, due date, status, QC score, payment info
- Backend to LDC Translator Extranet
  - Translators access and submit assignments, validate submissions, view QC reports, generate invoices, check payment status
- Queries support status tracking but also assignment generation, data selection, cross-project coordination
  - What translation assignments are pending delivery this week?
  - What is average QC score for this translator on Chinese BC?
  - List Arabic NW files from 2007 that have never been released as GALE training data and are not part of any project's eval set
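
Two of the example queries above can be sketched against a toy table. This is an illustration only: the transcript lists the kinds of fields the database tracks but not its actual schema or engine, so the table name, column names, SQLite backend and demo rows here are all assumptions.

```python
import sqlite3

# HYPOTHETICAL schema; column names paraphrase the fields listed on the slide.
SCHEMA = """
CREATE TABLE assignments (
    doc_id      TEXT PRIMARY KEY,
    language    TEXT,
    genre       TEXT,
    token_count INTEGER,
    project     TEXT,
    partition   TEXT,
    translator  TEXT,
    due_date    TEXT,   -- ISO date, so string comparison orders correctly
    status      TEXT,
    qc_score    REAL
);
"""

# Invented demo rows, not real LDC data.
DEMO_ROWS = [
    ("doc001", "Chinese", "BC", 1200, "GALE", "training", "agencyA", "2024-01-03", "pending", None),
    ("doc002", "Chinese", "BC", 900, "GALE", "training", "agencyA", "2023-12-20", "delivered", 90.0),
    ("doc003", "Chinese", "BC", 1100, "GALE", "training", "agencyA", "2023-12-22", "delivered", 80.0),
    ("doc004", "Arabic", "NW", 700, "GALE", "evaluation", "agencyB", "2024-01-05", "pending", None),
    ("doc005", "Arabic", "NW", 650, "GALE", "training", "agencyB", "2024-02-01", "pending", None),
]


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding the demo rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO assignments VALUES (?,?,?,?,?,?,?,?,?,?)", DEMO_ROWS)
    return conn


def pending_between(conn: sqlite3.Connection, start: str, end: str) -> list[str]:
    """Which assignments are pending delivery within the given date range?"""
    cur = conn.execute(
        "SELECT doc_id FROM assignments "
        "WHERE status = 'pending' AND due_date BETWEEN ? AND ? ORDER BY doc_id",
        (start, end),
    )
    return [row[0] for row in cur]


def average_qc(conn: sqlite3.Connection, translator: str, language: str, genre: str):
    """Average QC score for one translator on one language and genre."""
    cur = conn.execute(
        "SELECT AVG(qc_score) FROM assignments "
        "WHERE translator = ? AND language = ? AND genre = ?",
        (translator, language, genre),
    )
    return cur.fetchone()[0]  # AVG skips NULL scores; None if no scored rows
```

Storing due dates as ISO strings keeps the "this week" query a plain range comparison; a production system would more likely use a native date type.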

LDC Translation Database

Parallel text harvesting
- Manual translation supplemented by harvesting and alignment of potential parallel text
  - Harvest text from multilingual sites
    - E.g. newswire providers
  - Standardize markup format
  - Use BITS document mapping module to find likely parallel documents
  - Use Champollion to find sentence alignments
- High yields in GALE program
  - 82,000 Arabic-English document pairs
  - 67,000 Chinese-English document pairs
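
To make the document-mapping step concrete, here is a toy sketch that pairs documents by bilingual-lexicon overlap. It is not the BITS module and makes no claim about how BITS or Champollion work internally; the scoring rule, the threshold, the function names and the tiny lexicon are all assumptions chosen for illustration.

```python
def overlap_score(src_tokens, tgt_tokens, lexicon):
    """Fraction of source tokens whose lexicon translation occurs in the target."""
    if not src_tokens:
        return 0.0
    tgt = set(tgt_tokens)
    hits = sum(1 for tok in src_tokens if lexicon.get(tok) in tgt)
    return hits / len(src_tokens)


def map_documents(src_docs, tgt_docs, lexicon, threshold=0.5):
    """For each source document, keep its best-scoring target if above threshold.

    src_docs and tgt_docs map document IDs to token lists. The threshold is a
    HYPOTHETICAL cut-off, not a value taken from the presentation.
    """
    pairs = []
    for src_id, src_tokens in src_docs.items():
        best_id, best = None, 0.0
        for tgt_id, tgt_tokens in tgt_docs.items():
            score = overlap_score(src_tokens, tgt_tokens, lexicon)
            if score > best:
                best_id, best = tgt_id, score
        if best_id is not None and best >= threshold:
            pairs.append((src_id, best_id, best))
    return pairs


# Invented demo data (Spanish-English, purely for readability).
LEXICON = {"casa": "house", "perro": "dog", "rojo": "red"}
SRC = {"s1": ["casa", "rojo"], "s2": ["perro", "azul"]}
TGT = {"t1": ["the", "red", "house"], "t2": ["a", "dog", "barks"]}
PAIRS = map_documents(SRC, TGT, LEXICON)
```

Document pairs found this way would then go to a sentence aligner; that step is not sketched here.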

Conclusion  Robust, flexible translation infrastructure to support multiple, distinct, concurrent projects  Much of this infrastructure freely available from LDC Task specifications, guidelines available for all projects QCTrans GUI slated for free, open-source distribution  Many resulting parallel text corpora already in LDC Catalog  Newly emerging data sets to be added over time

Recent corpora

Catalog Number   Title
LDC2007T23       GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1
LDC2008T08       GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2
LDC2008T18       GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3
LDC2007T24       GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1
LDC2008T09       GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2
LDC2009T02       GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1
LDC2009T06       GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2
LDC2008T02       GALE Phase 1 Arabic Blog Parallel Text
LDC2008T06       GALE Phase 1 Chinese Blog Parallel Text
LDC2009T03       GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1
LDC2009T09       GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2
LDC2009T15       GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1
LDC2010T03       GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2

Acknowledgements  This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR The content of this paper does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.