New Directions in Machine Translation: Introduction
陳惠群, Academia Sinica, Institute of Linguistics / Institute of Information Science
10/22/ Why MT Matters
Economics
  –Costs / Quality / Turnaround
  –Many MT developers, customers, and sponsors have already invested heavily over many years.
Politics
  –Multi-lingual Countries / Minority Languages
Intelligence Gathering
  –Governments / Companies / Individuals
Research
  –AI / CS / Linguistics / Psychology / and so on
Recent Trends
PC-based MT Systems
Online MT Services, MT on Demand
  – , Web pages, Uploads
Sub-language MT Systems
Dialog-based (Speech-to-Speech) MT Systems
Computer-Assisted Translation
Classifying MT Systems
Operations
  Fully-Automatic MT
  Semi-automatic MT
  Computer-Assisted Translation (CAT-Tools)
Input
  Unrestricted Texts
  Restricted Texts (e.g. Technical Manuals) / MT in mind
  Sub-languages / Controlled languages
Quality
  High / Low / Acceptable / Applicable / Readable
  How to evaluate an MT system?
Strategies (see next page)
MT Strategies
Fundamentals
  Direct Translation MT
  Transfer-based MT
  Interlingua MT
Linguists vs. Empiricists
New Strategies
  Knowledge-based MT
  Example-based MT
  Statistics-based MT
  Hybrid MT
    –Japanese manufacturers know well that a single linguistic theory cannot lead to a good MT system. They realize that a huge amount of language phenomena must be processed in an ad-hoc manner. (M. Nagao)
Direct MT
Simple syntactic analysis (disambiguation)
Bilingual lexicon (word-by-word translation)
Re-ordering rules
[Diagram: Source Text → Target Text]
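The direct approach can be illustrated with a minimal sketch: a bilingual lexicon lookup followed by one local re-ordering rule. The English-French lexicon entries, the part-of-speech tags, and the single ADJ-NOUN rule are hypothetical and chosen only for illustration; a real direct system would carry a far larger lexicon and rule set, and would disambiguate before lookup.

```python
# Toy direct MT: dictionary lookup plus one local re-ordering rule.
# The lexicon entries and the rule are hypothetical illustrations.

# Bilingual lexicon: English word -> (French word, part-of-speech tag)
LEXICON = {
    "the": ("le", "DET"),
    "red": ("rouge", "ADJ"),
    "book": ("livre", "NOUN"),
    "falls": ("tombe", "VERB"),
}


def translate_direct(sentence):
    """Translate word by word, then apply the re-ordering rule."""
    # Step 1: word-by-word lookup; unknown words pass through tagged UNK.
    tagged = [LEXICON.get(w, (w, "UNK")) for w in sentence.lower().split()]

    # Step 2: re-ordering rule ADJ NOUN -> NOUN ADJ.
    i = 0
    while i < len(tagged) - 1:
        if tagged[i][1] == "ADJ" and tagged[i + 1][1] == "NOUN":
            tagged[i], tagged[i + 1] = tagged[i + 1], tagged[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return " ".join(word for word, _ in tagged)


print(translate_direct("The red book falls"))  # le livre rouge tombe
```

Because no structure beyond adjacent tags is consulted, anything the lexicon or the rules do not cover is passed through unchanged.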
Transfer-based MT
[Diagram: Source Text (ST) → ST analysis → ST Structure → structure transfer → TT Structure → TT generation → Target Text (TT)]
ST analysis uses: SL grammar & lexicon
Structure transfer uses: SL-TL lexicon & transfer rules
TT generation uses: TL grammar & lexicon
SL - source language; TL - target language
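The three stages (analysis, structure transfer, generation) can be sketched as three small functions. This is a toy under stated assumptions: the "structure" is a flat noun-phrase tree, and the lexicons and the single transfer rule are hypothetical English-French examples, not taken from any particular system.

```python
# Toy transfer-based MT for English -> French noun phrases.
# Lexicons and the transfer rule are hypothetical illustrations.

SL_LEXICON = {"the": "DET", "red": "ADJ", "book": "NOUN"}          # SL lexicon
TRANSFER_LEXICON = {"the": "le", "red": "rouge", "book": "livre"}  # SL-TL lexicon


def analyze(text):
    """ST analysis: build an ST structure (a flat NP tree) from the words."""
    children = [(SL_LEXICON[w], w) for w in text.lower().split()]
    return ("NP", children)


def transfer(st_structure):
    """Structure transfer: map lexical items, then apply a structural rule."""
    label, children = st_structure
    mapped = [(pos, TRANSFER_LEXICON[w]) for pos, w in children]
    # Transfer rule: NP[DET ADJ NOUN] -> NP[DET NOUN ADJ]
    if [pos for pos, _ in mapped] == ["DET", "ADJ", "NOUN"]:
        mapped = [mapped[0], mapped[2], mapped[1]]
    return (label, mapped)


def generate(tt_structure):
    """TT generation: linearize the TT structure into a string."""
    _, children = tt_structure
    return " ".join(word for _, word in children)


print(generate(transfer(analyze("the red book"))))  # le livre rouge
```

Unlike the direct approach, the rule here is stated over a structure rather than over adjacent words, which is what lets transfer systems express longer-range rearrangements.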
Interlingua-based MT
[Diagram: Source Text (ST) → ST analysis → Interlingua representation (+SL-TL lexicon) → TT generation → Target Text (TT)]
ST analysis uses: SL grammar & lexicon
TT generation uses: TL grammar & lexicon
Knowledge-based MT
All world knowledge? A long-term research goal
Practical Systems: e.g. CMU’s KANT
  –narrow domain
  –domain model: defines all semantic classes and instances to represent all concepts in the domain
  –each concept definition includes:
      concept head (name of the concept)
      slots: allowable semantic roles
      fillers: allowable concept classes that the roles can contain
  –disambiguation by filler restriction
  –knowledge acquisition: automatic or semi-automatic
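Disambiguation by filler restriction can be sketched as follows. The concept names, slots, and filler classes are hypothetical and do not reproduce KANT's actual domain model or notation; the point is only that a word sense is rejected when a role is filled by a concept class its slot does not allow.

```python
# Hypothetical domain model; not KANT's actual representation.

# Instance -> concept class
CONCEPT_CLASS = {"technician": "HUMAN", "engine": "MACHINE"}

# Concept head -> {slot (semantic role): allowable filler classes}
DOMAIN_MODEL = {
    "DISMISS": {"agent": {"HUMAN"}, "theme": {"HUMAN"}},
    "IGNITE": {"agent": {"HUMAN"}, "theme": {"MACHINE"}},
}

# Ambiguous source word -> candidate concepts
SENSES = {"fire": ["DISMISS", "IGNITE"]}


def disambiguate(word, roles):
    """Return the candidate concepts whose slots accept every given filler."""
    matches = []
    for concept in SENSES[word]:
        slots = DOMAIN_MODEL[concept]
        if all(
            role in slots and CONCEPT_CLASS[filler] in slots[role]
            for role, filler in roles.items()
        ):
            matches.append(concept)
    return matches


print(disambiguate("fire", {"agent": "technician", "theme": "engine"}))
# ['IGNITE']
```

In a narrow domain the model can be small enough to enumerate by hand, which is part of why practical knowledge-based systems restrict their domain.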
Example-based MT
A companion module to improve MT quality
Typically includes the following (Nirenburg 1995):
  –sentence-aligned corpus
  –intra-language matching: find chunks in the source-language part of the corpus that are the best candidates for matching an input chunk
  –inter-language matching: find the target-language chunk corresponding to the chunk from the source-language part of the corpus
  –chunk combination
The PANGLOSS Mark III Machine Translation System. S. Nirenburg, Technical Report CMU-CMT (available online at
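The two matching steps can be sketched with a deliberately simplified setup in which each "chunk" is a whole sentence, so inter-language matching reduces to reading off the aligned target side. The corpus below is hypothetical, word-level similarity via Python's `difflib.SequenceMatcher` is an assumed stand-in for whatever matching criterion a real system uses, and chunk combination is omitted.

```python
from difflib import SequenceMatcher

# Hypothetical sentence-aligned corpus: (source, target) pairs.
CORPUS = [
    ("press the red button", "appuyez sur le bouton rouge"),
    ("open the door", "ouvrez la porte"),
    ("close the door", "fermez la porte"),
]


def similarity(a, b):
    """Word-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def best_example(chunk):
    """Intra-language matching: corpus pair whose source side is closest."""
    return max(CORPUS, key=lambda pair: similarity(chunk, pair[0]))


def translate_by_example(chunk):
    """Inter-language matching: target side aligned with the best match."""
    source, target = best_example(chunk)
    return target, similarity(chunk, source)


target, score = translate_by_example("press the button")
print(target, round(score, 3))
```

A score below 1.0 signals that the retrieved translation covers more or less than the input, which is where sub-sentence chunking and chunk combination would come in.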
Statistics-based MT (1)
Maximize Pr(S|T) = Pr(S) Pr(T|S) / Pr(T)
  Pr(S): source language model
  Pr(T|S): translation model
    –lexical translation, distortion, and fertility
Some comments: (Machine Translation 7:(4))
  –I joined the attack … without realizing that precisely what the research was doing was to question some of the fundamental assumptions underlying MT research since 1966 … With hindsight, I can see that what this research was doing was saying that in the 20 years since ALPAC, the second generation architecture had led to only slightly better results than the architecture it replaced … (Harold Somers)
  –My initial reaction was the same as Somers. … The integration of a CANDIDE-type engine into a traditional MT architecture should probably [be] at the deepest level the architecture allows (John White)
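Since T is the observed text and is fixed during decoding, Pr(T) does not affect which S wins, so maximizing Pr(S|T) amounts to maximizing Pr(S) Pr(T|S). A minimal sketch of that decision rule follows; the candidate set and all probabilities are hypothetical toy values, not estimates from any corpus, and a real decoder searches a vastly larger space.

```python
import math

# Hypothetical toy probabilities, not estimated from any corpus.
LANGUAGE_MODEL = {          # Pr(S)
    "the house": 0.6,
    "house the": 0.1,
    "the home": 0.3,
}
TRANSLATION_MODEL = {       # Pr(T | S), keyed by (T, S)
    ("la maison", "the house"): 0.5,
    ("la maison", "house the"): 0.5,
    ("la maison", "the home"): 0.4,
}


def decode(t):
    """Return the candidate S that maximizes Pr(S) * Pr(T|S) for observed T."""
    def score(s):
        p = LANGUAGE_MODEL[s] * TRANSLATION_MODEL.get((t, s), 0.0)
        # Work in log space, as is usual to avoid underflow on long inputs.
        return math.log(p) if p > 0 else float("-inf")
    return max(LANGUAGE_MODEL, key=score)


print(decode("la maison"))  # the house
```

Here the translation model scores "the house" and "house the" equally, and it is the language model that prefers the well-ordered candidate.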
Statistics-based MT (2)
Machine Translation 7:(4)
  –...not only does it need no linguistics or linguists, but no foreign speakers either.... about 43% of sentences correctly translated. That compares badly with SYSTRAN which is usually assigned figures of around 65% … even if it did equal SYSTRAN’s level of performance, it is not clear what inferences we should draw. … we must always remember that they need millions of words of parallel texts even to start … The problems noted then were of long-distance dependencies: … French and English … were a lucky choice … we have good historical reasons for believing that a purely statistical method cannot do high-quality MT (Yorick Wilks)
Word alignment
Evaluation
Traditional Evaluation Metrics (Church & Hovy)
  –System-based Metrics
      –easy to measure, but only for a particular system
      –e.g. 60 sub-grammars, 900 rewriting rules, …
  –Text-based Metrics
      sentence-based metrics
        –e.g. # of semantically or syntactically correct sentences
      compressibility metrics
      amount-of-post-editing metrics
  –Cost-based Metrics: cost & time (per N words)
  –Demos (must avoid being misleading)
Developer’s view or Customer’s view
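One possible reading of an amount-of-post-editing metric is the number of word-level edits needed to turn raw MT output into the post-edited text, normalized by the length of the post-edited text. The sketch below assumes that reading and uses plain Levenshtein distance over tokens; the slide does not specify a formula, and the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def post_edit_rate(mt_output, post_edited):
    """Edits per word of the post-edited text (0.0 means no editing needed)."""
    mt, ref = mt_output.split(), post_edited.split()
    return edit_distance(mt, ref) / max(len(ref), 1)


print(post_edit_rate("the house red", "the red house"))
```

Such a figure is text-based rather than cost-based: it counts edits, not the time or money the editing took.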
Some MT Problems
Morphological ambiguity
Lexical ambiguity and structural ambiguity
Lexical mismatch and structural mismatch
Idioms and collocations
Ill-formed input
World knowledge
CAT Tools
Pre-editing and post-editing environments with linguistic analyses
Translation Memory
  –As the translator translates the text, each sentence (translation unit) is also saved automatically to a sophisticated translation unit database memory. As he translates, any similar sentence already in the memory will appear on screen for editing. (Ian Gordon)
Alignment Tools
Terminology Management
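The translation memory behaviour described in the quotation (save each translation unit, then surface similar stored sentences for editing) can be sketched as below. The class name, the 0.7 similarity threshold, and the use of character-level `difflib.SequenceMatcher` ratios are assumptions made for illustration; commercial tools use their own matching schemes.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Stores translation units and retrieves similar ones for editing."""

    def __init__(self, threshold=0.7):
        self.units = []             # (source, target) translation units
        self.threshold = threshold  # minimum similarity (assumed value)

    def save(self, source, target):
        """Record a translation unit as the translator finishes it."""
        self.units.append((source, target))

    def lookup(self, sentence):
        """Return (score, source, target) matches above threshold, best first."""
        scored = [
            (SequenceMatcher(None, sentence.lower(), src.lower()).ratio(), src, tgt)
            for src, tgt in self.units
        ]
        return sorted((m for m in scored if m[0] >= self.threshold), reverse=True)


tm = TranslationMemory()
tm.save("Press the red button.", "Appuyez sur le bouton rouge.")
tm.save("Open the door.", "Ouvrez la porte.")
for score, src, tgt in tm.lookup("Press the green button."):
    print(round(score, 2), src, "->", tgt)
```

The translator then edits the retrieved target sentence instead of translating from scratch, and the edited result is saved as a new unit.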
Standards
Exchange Standard
  –(Multilingual) Text Formats
  –Lexicons
  –Knowledge Bases
  –Translation Memories
Evaluation Standard
Future Directions
Exploratory Research or Prototype Research?
Modular Design (cf. Somers’ Comments)
Better Linguistic Theories
Lexicon Construction
Hybrid MT (Mainline MT engine + Additional Modules)
Spoken Language (Dialog-based) MT
MT Evaluation
Computer-Assisted Translation / User-Friendly Environment
Sub-language MT Systems
Distributed MT / Networked MT
MT on Demand
References
  –Journal of Machine Translation (Kluwer)
  –Proceedings of TMI, MT Summit, AMTA
  –Proceedings of ACL, COLING, ROCLING
  –E-Print Archive
  –AAMT
  –EAMT
  –The Association for Computational Linguistics
  –The LINGUIST List
  –Translation Research Group
  –Localization Industry Standards Association (LISA)
References
  –USC
  –CMU/LTI
  –Verbmobil
  –C-STAR II
  –GETA
  –Machine Translation at PAHO (ACG/T)
  –METEO
  –WordNet Bibliography
References
  –Globalink, Inc.
  –SYSTRAN
  –Logos Corporation
  –TRADOS
  –A.I.SOFT
  –CSK Home Page
  –SHARP SOFT
  –OKI Software
  –KODENSHA
  –ASTRANSAC