A roadmap for MT: four "keys" to handle more languages, for all kinds of tasks, while making it possible to improve quality (on demand)
International Conference on Universal Knowledge and Language (ICUKL2002), Goa, November 2002
Christian Boitet
GETA, CLIPS, IMAG, 385 av. de la bibliothèque, BP 53 F Grenoble cedex 9, France
Ch. Boitet ICUKL2002, Goa, 25-29/11/2002 2/30
Outline
  Basic concepts
    What is MT?
    Goals: Quality / User
    Architectures: Vauquois' triangle
  State of the art
    MT of texts: examples, problems
    MT of spoken dialogs
  The future of MT
    Goals
    4 keys
What is M(a)T?
  At least 3 types of automation
    MT = Machine Translation
    MAT = Machine Assisted Translation
    MAHT = Machine Aided Human Translation
  A scientific technology
    Informatics (computer science)
    Linguistics
    Mathematics
Goals: Quality / User
Architectures: Vauquois' triangle
Architectures: Vauquois' triangle (larger)
Formal intermediate structures
How to produce an MT system
  Choose an architecture
  Program the "tools"
    Specialized languages for linguistic programming (SLLP)
    Development environment (MT shell)
  Build the "lingware"
    Lexical data / rules / weights
    Grammatical data / rules / weights
    Possible specialization to a typology ("sublanguage")
  How?
    Human work ± computer help / support
    Automatic learning (weights, likelihoods…)
State of affairs
  Only a small number of language pairs are covered by MT systems designed for information access
    Systran EC (2000): 19/110 language pairs, 8 OK for intended use
    See also examples by Ronaldo Martins
  Even fewer are capable of quality translation or speech translation
  Now a few examples…
Examples: MT for access, Web (1)
Examples: MT for access, Web (2)
  FE (French-English) is quite "easy" compared with EG (English-German) and especially FG (French-German)
Comparison: raw vs rough MT
Examples: MT for revisors…
…with BV-aero/FE (2)
MT of spoken dialogs
  Specialized systems are already usable
    e.g. ATR/Matsushita, IBM, CSTAR/Nespole!…
    Much "noise" and "ungrammaticalities"
    But specializing is very helpful!
  General systems are also possible
    e.g. NEC/Xroad, Linguatec/Talk&Translate
    Speech recognition is already good enough
    Rough may be good enough (e.g. for chatting)
  Interpretation is different from translation…
    …and participants are intelligent!
    Similarity with access-oriented MT
French-Korean through IF (1)
French-Korean through IF (2)
French-Korean through IF (3)
A road map… to which goals?
  MT of adequate quality
  Not only for access
  For all languages
Four keys
  2 on the technical side, 2 on the organizational side
  Compromise: a far wider coverage, a somewhat lower asymptotic quality
    Automatic learning techniques
    Using non-textual pivots (intermediate formal descriptors)
  Democratization, cooperation
    Cooperative development of open source linguistic resources on the Web
    Towards systems where quality can be improved "on demand" by users
Learning techniques
  Extend the use of hybrid techniques
    symbolic, numerical, or mixed
    ==> they have demonstrated their potential at the research level
      stochastic grammars
      weighted (or "neural") dictionaries
  Or build new tools, intrinsically numerical
    inspiration from speech recognition
  2 examples
    learning analyzers: text -> semantic tree (IBM)
    learning an implicit, very detailed dependency grammar (DG) from a tree bank (NAIST)
Using non-textual pivots
  Semantico-pragmatic (ontological) pivots
    task & domain oriented ==> limited applicability
  Abstract linguistic descriptors
    the most precise, but often too sophisticated
    depend on each language
  Anglo-semantic pivot: UNL
    "the HTML of linguistic content"
    in UNL, a hypergraph represents the abstract structure of a (supposedly) equivalent English utterance
    less precise but "robust"
    symbols constructed from English ==> usable by all developers
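A standard argument for pivot architectures, not spelled out on the slide, is combinatorial. The sketch below assumes one transfer system per directed language pair in the direct case, and one analyzer plus one generator per language in the pivot case. The function names are illustrative only; the language counts 11, 20 and 180 are the ones quoted elsewhere in this talk.

```python
def direct_systems(n: int) -> int:
    """Directed language pairs among n languages (assumed: one system per pair)."""
    return n * (n - 1)


def pivot_components(n: int) -> int:
    """Assumed: one analyzer into the pivot and one generator out of it per language."""
    return 2 * n


for n in (11, 20, 180):
    print(f"{n} languages: {direct_systems(n)} direct systems, "
          f"{pivot_components(n)} pivot components")
```

With 11 languages this gives the 110 directed pairs mentioned for Systran EC, against 22 pivot components under these assumptions.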
A simple UNL graph
  Sentence: Ronaldo has headed the ball into the left corner of the goal
  [Figure: UNL graph. Surviving arc labels: agt, obj, ins, plt, pos (twice), mod. Surviving node labels: Ronaldo(icl>proper noun), goal(icl>abstract thing), left(aoj<thing), goal(icl>concrete thing).]
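To make the idea of a UNL graph concrete, here is a minimal sketch of one possible in-memory representation: a list of (relation, source, target) arcs plus an entry node. It is a simplification, since it ignores UNL attributes and hypergraph scopes. The relation labels and the four universal words come from the slide; the nodes "score", "head" and "corner", and the way the arcs are attached, are assumptions made for illustration, since the figure itself is not available here.

```python
from dataclasses import dataclass, field


@dataclass
class UNLGraph:
    """Labeled directed graph: nodes are universal words, arcs carry UNL relations."""
    entry: str
    arcs: list = field(default_factory=list)  # (relation, source, target) triples

    def add(self, relation: str, source: str, target: str) -> None:
        self.arcs.append((relation, source, target))

    def targets(self, source: str, relation: str) -> list:
        """All nodes reached from `source` through an arc labeled `relation`."""
        return [t for (r, s, t) in self.arcs if s == source and r == relation]


# ASSUMPTION: "score", "head" and "corner" and the arc attachments are guesses.
g = UNLGraph(entry="score")
g.add("agt", "score", "Ronaldo(icl>proper noun)")
g.add("obj", "score", "goal(icl>abstract thing)")
g.add("ins", "score", "head")
g.add("plt", "score", "corner")
g.add("pos", "head", "Ronaldo(icl>proper noun)")
g.add("mod", "corner", "left(aoj<thing)")
g.add("pos", "corner", "goal(icl>concrete thing)")

print(g.targets("score", "agt"))
print(g.targets("corner", "pos"))
```

A generator for any target language would walk such a graph from the entry node; nothing in the structure depends on the source language.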
Cooperative development of open source linguistic resources on the Web
  Mutualization is necessary
    at least for lexical knowledge
    too costly even for the leaders
      size (#entries) has to grow for each language (300K, 3M?)
      #languages has to increase dramatically (11 -> 20 -> 180?)
  Integration of human- and machine-oriented knowledge is useful
    e.g. to produce mixed MT/MAHT systems
A contribution: the Papillon project
  Goal: produce many open source dictionaries from a central lexical data base
  Means:
    build rich (DiCo) monolingual dictionaries of lexies (senses)
    interlink lexies by interlingual links (axies)
    use XML & associated tools as basis to generate many formats for humans and for machines
    start from (free) digital resources
    induce "consumers" to become "producers" (contributors)
  Quality control:
    private accounts
    central validating/integrating group
Papillon database macrostructure
  [Figure: Papillon database macrostructure. Labels: Lexical Database, User, Dictionary, Resource, Human Contributors, Interaction with the Dictionaries, Extraction of Dictionaries, Integration of existing resources.]
Interlingual links based on translations = "AXIEs"
  Possibility to link 1 lexie with >1 acceptions
  References to other semantic systems: AXIE -(1:n)-> UW
  [Figure: PAPILLON diagram]
    French DiCo: vocable carte (n.f.)
      lexie carte.1: carte à jouer
      lexie carte.2: carte géographique
    English DiCo: vocable card (N)
      lexie card.1: playing card
      lexie card.2: money card
      vocable = lexie: map
    Japanese DiCo: 地図, カード
    a Thai DiCo
    Interlingual links:
      Acception 343, UNL: card(icl>play), card(icl>thing)…
      Acception 345, UNL: map(fld>geography)
      Acception 1002, UNL: card(fld>money)
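The lexie/axie organization can be sketched as a small lookup structure: each axie (interlingual acception) groups the lexies that share a sense, and translation candidates are found by going through shared axies. The acception numbers and lexie names below come from the slide. Which lexie attaches to which acception is inferred from the glosses, so treat the links (especially the Japanese ones) as assumptions; the language codes and the `translations` function are hypothetical, not part of Papillon.

```python
# Axie id -> set of (language, lexie) members. Links are ASSUMED from the glosses.
AXIES = {
    343: {("fra", "carte.1"), ("eng", "card.1"), ("jpn", "カード")},
    345: {("fra", "carte.2"), ("eng", "map"), ("jpn", "地図")},
    1002: {("eng", "card.2")},
}


def translations(lang: str, lexie: str, target_lang: str) -> list:
    """Lexies of target_lang that share at least one axie with (lang, lexie)."""
    found = set()
    for members in AXIES.values():
        if (lang, lexie) in members:  # a lexie may belong to more than one axie
            found |= {lx for (lg, lx) in members if lg == target_lang}
    return sorted(found)


print(translations("fra", "carte.2", "eng"))
print(translations("fra", "carte.1", "jpn"))
print(translations("eng", "card.2", "fra"))
```

Because lexies are linked only to axies and never directly to each other, adding a new language means adding its own links to the axies, not one bilingual dictionary per existing language.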
Construct systems where quality can be improved "on demand" by users
  a priori
    through interactive disambiguation in the source language
  or a posteriori
    by correcting the pivot representation (UNL or other) through any language (as in MultiMeteo)
  ==> In both cases, all versions (in all languages) are improved
  Possibility to merge
    MT
    multilingual generation
    computer-aided authoring
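Why a single correction improves every language version can be shown with a toy pivot pipeline. In this sketch the pivot is just a list of UNL-style symbols and each "generator" is a tiny lookup table; both are hypothetical stand-ins invented for illustration, not the behavior of any real UNL deconverter or of MultiMeteo.

```python
# Toy generation lexicons: pivot symbol -> word, per target language (made-up data).
GENERATORS = {
    "eng": {"card(icl>play)": "card", "map(fld>geography)": "map"},
    "jpn": {"card(icl>play)": "カード", "map(fld>geography)": "地図"},
}


def generate_all(pivot: list) -> dict:
    """Regenerate every target-language version from the same pivot."""
    return {
        lang: " ".join(lexicon[symbol] for symbol in pivot)
        for lang, lexicon in GENERATORS.items()
    }


# Suppose analysis of French "carte" wrongly chose the playing-card sense.
pivot = ["card(icl>play)"]
before = generate_all(pivot)

# One correction on the pivot (interactive disambiguation or post-editing)...
pivot[0] = "map(fld>geography)"
# ...and every target version is fixed at once.
after = generate_all(pivot)

print(before)
print(after)
```

The correction is made once, on the shared representation, which is why its cost does not grow with the number of target languages.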
Conclusion
  4 keys to open the door to MT of adequate quality for all languages
  On the technical side,
    dramatically increase the use of learning techniques
    use pivot architectures, the most universally usable pivot being UNL
  On the organizational side,
    cooperatively develop open source linguistic resources on the Web
    construct systems where quality can be improved "on demand" by users
  On the practical side, seek keys to unlock
    private investment, public funding, voluntary cooperation
    could this conference become a decisive turning point?