1
Design of a Multi-lingual MT for Real-time Broadcast Captioning
Course Project for 11-731
Ying Zhang (Joy)
Joy@cs.cmu.edu
Advisor: Eric Nyberg
April 18th, 2001
2
Project Description
A broadcasting company wishes to translate the captioning for their show. Translations must be provided from English to multiple target languages. Real-time, high-accuracy translations are required. If the captions are poorly formed, then they need not be translated, but the customer would like you to consider teaching a controlled language to the captioners, so that high-quality translation can be achieved.
3
Domain Analysis (Cont.)
Special requirements
– Translating spoken language
– The system must perform in real time
– The system cannot rely on pre-editing or post-editing
– The system should provide positive information to users as much as possible
4
Domain Analysis
Characteristics amenable to MT
– The domain is narrow
– It is possible to build a large-scale monolingual corpus
– It is not necessary to translate every utterance in the broadcast
Well-defined discourse structure (greetings, etc.)
No correspondence in another culture (“the bulls outrunning the bears today on Wall Street”)
5
Domain Analysis-Data(1)
Fixed Patterns
000084 WWW.MEDIACAPTIONING.COM [CLOSED CAPTIONING
000085 PROVIDED BY BELL ATLANTIC, THE
000089 HEART OF COMMUNICATION] >>
000090 FROM CNN IN WASHINGTON, THIS
000091 IS "WORLDVIEW." I'M BERNARD
000094 SHAW. >> AND I'M JUDY WOODRUFF. >>>
000095 TALKS BETWEEN PROTESTANT AND
6
Domain Analysis-Data(2)
Idioms, Phrases and Acronyms
000218 SCHEDULE? >> THE DISARMAMENT OF ALL
000220 PARAMILITARIES INCLUDING THE IRA
000223 OR THE INSTITUTION OF A
000225 CABINET FOR THE NEW NORTHERN
000227 IRELAND ASSEMBLY INCLUDING SINN FEIN
000270 ALL THIS INDICATE THAT THE
000271 ORIGINAL GOOD FRIDAY AGREEMENT WAS
000276 JUST NOT REALISTIC? >> NO,
7
Domain Analysis-Data(3)
Sentence Boundary
000112 PARAMILITARIES MUST DISARM. MR. BLAIR AND
00140 BEFORE TOO LONG JUDY, YOU CAN SEE
000141 BEHIND ME THAT PEOPLE ARE
000143 STILL AT WORK HERE ALMOST 24 HOURS
000145 AFTER THE DEADLINE PASSED
000147 WHERE THERE'S LIFE, THERE'S HOPE
000149 AND WHEN THERE'S TALK, I GUESS
000151 THERE IS LIFE. WE'RE TOLD BY
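The excerpt above shows why sentence boundaries are hard in caption data: sentences run across caption lines, and a period after "MR." does not end a sentence. The following is only a minimal sketch of one way to reassemble sentences, not the project's actual method. It assumes each caption line starts with a numeric ID, treats `>>` and `>>>` as turn markers, and uses a small hypothetical abbreviation list (`ABBREVIATIONS`).

```python
import re

# Hypothetical abbreviation list; a real system would need a much fuller one.
ABBREVIATIONS = {"MR.", "MRS.", "MS.", "DR."}

def strip_id(line):
    """Drop the leading numeric caption ID, e.g. '000112 TEXT' -> 'TEXT'."""
    return re.sub(r"^\d+\s+", "", line.strip())

def split_sentences(caption_lines):
    """Join caption lines, then split on . ? ! unless the token is an abbreviation.

    Returns (complete_sentences, unfinished_remainder).
    """
    tokens = " ".join(strip_id(line) for line in caption_lines).split()
    sentences, current = [], []
    for tok in tokens:
        if tok in (">>", ">>>"):  # assumed turn markers: close the open sentence
            if current:
                sentences.append(" ".join(current))
                current = []
            continue
        current.append(tok)
        if tok[-1] in ".?!" and tok not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    return sentences, " ".join(current)
```

The unfinished remainder would be held back until the next caption lines arrive, which is one reason a short broadcast delay helps.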
8
Assumptions (1)
Partial translation is acceptable
– Users may know some English, although their vocabulary may be limited
– Users have visual information
– Users may have background information on the topic
– Provide only positive information to the user; do not translate everything unless confident
9
Assumptions (2)
A 10-second delay is acceptable
10:01:12 am 10:01:22 am
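The two timestamps on this slide are 10 seconds apart. As a rough sketch of how such a delay budget could be enforced, the code below computes when a caption must go on air and falls back to the English source if no translation is ready by then. The fallback rule and the function names (`air_time`, `caption_to_show`) are assumptions made for illustration, not part of the original design.

```python
from datetime import datetime, timedelta

DELAY = timedelta(seconds=10)  # delay budget stated on the slide

def air_time(received):
    """Time at which a caption received at `received` must be shown."""
    return received + DELAY

def caption_to_show(source, translation, received, finished):
    """Use the translation only if it was finished within the delay budget.

    Assumption: when the translation is missing or late, the untranslated
    source caption is shown instead.
    """
    if translation is not None and finished <= air_time(received):
        return translation
    return source
```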
10
Risk Factors
Technical risks
Business risks
11
Technical Risks
Performance constraints
– Real-time
– High quality, even if partial translation is acceptable
Interface with hardware and software in the broadcasting system
Specialized user interface if a human translator works together with MT
The domain of news broadcasting may be too wide to be covered by current MT technology
12
Business Risks
If the quality or real-time requirement cannot be met, the customer will not accept the product
The population of potential customers who need partial translation results is not large enough
Human translators provided with transcribed captions can translate them in real time
The sales force does not think it can sell the translation service
13
Technical Rationale
Multi-engine machine translation system (the requirement for multiple target languages cannot be satisfied yet)
Automatically update the corpus/lexicon from news sources
Provide only positive information: untranslated text carries zero information, while a wrong translation carries negative information!
– Translate only chunks with high confidence
– Translate only simple structures; leave the conjunctions and prepositions of complex structures untranslated
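The "positive information only" rule can be illustrated with a confidence gate: a chunk is replaced by its translation only when the engine's confidence clears a threshold, and is otherwise left in English. This is a sketch under stated assumptions. The `translate` callable, the 0.8 threshold, and the toy lexicon with its Spanish entries and scores are all made up for illustration; the slides do not specify how confidence is computed.

```python
def partial_translate(chunks, translate, threshold=0.8):
    """Replace a chunk with its translation only when confidence >= threshold.

    `translate` is a hypothetical callable returning (target_text, confidence).
    Low-confidence chunks are passed through untranslated.
    """
    output = []
    for chunk in chunks:
        target, confidence = translate(chunk)
        output.append(target if confidence >= threshold else chunk)
    return output

# Toy lexicon with made-up confidence scores, for illustration only.
TOY_LEXICON = {
    "GOOD EVENING": ("BUENAS NOCHES", 0.95),
    "THE BULLS OUTRUNNING THE BEARS": ("LOS TOROS SUPERAN A LOS OSOS", 0.20),
}

def toy_translate(chunk):
    # Unknown chunks get confidence 0.0, so they are always left in English.
    return TOY_LEXICON.get(chunk, (chunk, 0.0))
```

With this toy setup, the greeting is translated while the culture-specific idiom and the unknown chunk stay in English, so the viewer never sees a low-confidence guess.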
14
System Architecture
Nyberg and Mitamura (1997), "A Real-Time MT System for Translating Broadcast Captions," Proceedings of MT Summit VI
15
Extracting Lexicon/Phrase
The lexicon and phrases used in the news domain change rapidly
Comparable corpora exist
Extract the lexicon/phrases from comparable corpora
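One common idea for extracting a lexicon from comparable (non-parallel) corpora is to compare the contexts in which words occur: a word and its translation tend to co-occur with seed words that are translations of each other. The sketch below is a heavily simplified illustration in that spirit, not the method used in this project or in the cited work. The seed dictionary, the sentence-level co-occurrence window, and all function names are assumptions.

```python
import math
from collections import Counter

def context_vector(word, sentences, seeds):
    """Count how often each seed word co-occurs with `word` in the same sentence."""
    counts = Counter()
    for sent in sentences:
        if word in sent:
            counts.update(t for t in sent if t in seeds and t != word)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_translation(src_word, src_sents, tgt_sents, seed_dict, candidates):
    """Pick the target candidate whose seed-word context best matches the source word's."""
    src_vec = context_vector(src_word, src_sents, set(seed_dict))
    mapped = Counter()
    for seed, count in src_vec.items():
        mapped[seed_dict[seed]] += count  # project source seeds onto target seeds
    tgt_seeds = set(seed_dict.values())
    scored = [(cosine(mapped, context_vector(c, tgt_sents, tgt_seeds)), c) for c in candidates]
    return max(scored)[1]
```

Sentences are given as lists of tokens, and `seed_dict` maps known source words to their target translations. A real system would also need frequency weighting and far more data than any toy example.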
16
Comparable corpus
17
Plan Overview
Augmenting rules for the news domain
Constructing a bilingual corpus
Research on extracting a lexicon from comparable corpora
Adjusting the chart manager for partial translation
Research on the effects of partial translation
Training EBMT
18
Resources
Existing KBMT system
Existing EBMT software
Transcribed caption (monolingual) data
Dictionary
19
Bibliography
Nyberg and Mitamura, 1997, "A Real-Time MT System for Translating Broadcast Captions," Proceedings of MT Summit VI
David Turcato, "A Unified Example-Based and Lexical Approach to Machine Translation," TMI 99
Pascale Fung, "A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-parallel Corpora," Lecture Notes in Artificial Intelligence, Springer, vol. 1529, pp. 1-17
20
Thanks!
Questions?