
Latest Developments in (S)MT Harold Somers University of Manchester MT Wars II: The Empire (Linguistics) strikes back

Overview  The story so far EBMT SMT Latest developments in RBMT  Is there convergence? Some attempts to classify MT (Carl and Wu’s MT model spaces)  Has the empire struck back?

The story so far: EBMT  Early history well known Nagao (1981/3) Early development as part of RBMT Relationship with Translation Memories Focus (cf. Somers 1998) on  Matching algorithms  Selection and storage of examples  Mainly sentence-based  TL generation (Recombination) not much addressed Somers, H. (1998) New paradigms in MT, 10th European Summer School in Logic, Language and Information, Workshop on MT, Saarbrücken; revised version in Machine Translation 14 (1999) and 2nd revised version in M. Carl & A. Way (2003) Recent Advances in EBMT (Kluwer).

EBMT in a nutshell (In case you’ve been on Tatooine for the last 15 years)  Database of paired examples  Translation involves Finding the best example(s) (matching) Identifying which bits do(n’t) match (alignment) Replacing the non-matching bits (if multiple examples, gluing them together) (recombination)  All of the above at run-time

EBMT in a nutshell (cont.)  Main difficulty is "boundary friction", in two senses:
The old man is dead : Le vieil homme est mort
The old woman is dead : * Le vieil femme est mort
Input: The operation was interrupted because the file was hidden.
a. The operation was interrupted because the Ctrl-c key was pressed. → L'opération a été interrompue car la touche Ctrl-c a été enfoncée.
b. The specified method failed because the file is hidden. → La méthode spécifiée a échoué car le fichier est masqué.

EBMT later developments  Example generalisation (templates)  Incorporation of linguistic resources and/or statistical measures  Structured representation of examples  Use of statistical techniques

Example generalisation (Furuse & Iida, Kaji et al., Matsumoto et al., Carl, Cicekli & Güvenir, Brown, McTait, Way et al.)  Similar examples can be combined to give a more general example  Can be seen as a way of generating transfer rules (and lexicons)  Process may be entirely automatic, based on string matching …  … or “seeded” using linguistic information (POS tags) or resources (bilingual dictionary)

Example generalisation (cont.)
The dog ate a rabbit → inu wa usagi o tabeta
dog → inu, rabbit → usagi, monkey → saru, man → hito
The … ate a peach → … wa momo o tabeta
The monkey ate a peach → saru wa momo o tabeta
The man ate a peach → hito wa momo o tabeta
The …x ate a …y → …x wa …y o tabeta
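As a rough illustration of the fully automatic, string-matching route to generalisation (a toy sketch of my own, not any of the systems cited above), two example pairs that differ in a single place can be collapsed into a template with a slot:

```python
def generalise(pair1, pair2):
    """Derive a template from two example pairs that share material.

    Each pair is (source, target), given as token lists. The common
    prefix/suffix on each side becomes the fixed part of the template;
    the differing middle becomes a slot <X>. A toy illustration only --
    real systems (e.g. Cicekli & Guvenir, McTait) use richer alignment
    and constrain the slots (POS tags, morphology, clustering).
    """
    def split(a, b):
        i = 0                               # longest common prefix
        while i < min(len(a), len(b)) and a[i] == b[i]:
            i += 1
        j = 0                               # longest common suffix
        while j < min(len(a), len(b)) - i and a[-1 - j] == b[-1 - j]:
            j += 1
        return a[:i], (a[i:len(a) - j], b[i:len(b) - j]), a[len(a) - j:]

    s_pre, (s1, s2), s_suf = split(pair1[0], pair2[0])
    t_pre, (t1, t2), t_suf = split(pair1[1], pair2[1])
    template = (s_pre + ["<X>"] + s_suf, t_pre + ["<X>"] + t_suf)
    fillers = {tuple(s1): tuple(t1), tuple(s2): tuple(t2)}
    return template, fillers

ex1 = ("the dog ate a rabbit".split(), "inu wa usagi o tabeta".split())
ex2 = ("the monkey ate a rabbit".split(), "saru wa usagi o tabeta".split())
print(generalise(ex1, ex2))
# (['the', '<X>', 'ate', 'a', 'rabbit'], ['<X>', 'wa', 'usagi', 'o', 'tabeta'])
# {('dog',): ('inu',), ('monkey',): ('saru',)}
```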

Example generalisation (cont.)  That’s too simple (e.g. because of boundary friction)  Need to introduce constraints on the slots, e.g. using POS tags and morphological information (which implies some other processing)  Can use clustering algorithms to infer substitution sets

Incorporation of linguistic resources  Actually, early EBMT used all sorts of linguistic resources  Briefly there was a move towards more “pure” approaches  Now we see much use of POS tags (sometimes only partial, e.g. marker words – Way et al.), morphological analysis (as just mentioned), bilingual lexicons  Target-language grammars for recombination/generation phase

Incorporation of statistical measures  Example database preprocessed to assign weights (probabilities) to fragments and their translations (Aramaki et al.) Good way of handling “ambiguities” due to alternative translations  Clustering words into equivalence classes for example generalization (Brown)  Using statistical tools to extract translation knowledge from parallel corpora (Yamamoto & Matsumoto)  Statistically induced grammars for translation or generation, as in...

Use of structured representations  Again, a feature of early EBMT, now reappearing  Translation grammars induced from the example set  Examples stored as tree structures (overwhelmingly: dependency structures)

Translation grammars  Carl: generates translation grammars from aligned linguistically annotated texts  Way: Data-Oriented Translation, based on Poutsma's DOP, using both PS and LFG models

Structured examples  Use of tree comparison algorithms to extract translation patterns from parsed corpora/tree banks (Watanabe et al.)  Translation pairings extracted from aligned parsed examples (Menezes & Richardson)  Tree-to-string approach used by Langlais & Gotti and Liu et al. (+ statistical generation model)

Typical use of structured examples  Rule-based analysis and generation + example-based transfer Input is parsed into representation using a traditional or statistics-based analyser TL representation constructed by combining translation mappings learned from the parallel corpus TL sentence generated using a hand-written or machine-learned generation grammar  Is this still EBMT? Note that the only example-based part is use of mappings which are learned, not computed at run-time

Pure EBMT (Lepage & Denoual)  In contrast (but now something of an oddity): pure analogy-based EBMT  Use of proportional analogies A:B::C:D  Terms in the analogies are translation pairs A → A’: B → B’:: C → C’: D → D’
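A small made-up illustration of the mechanism (not an example from Lepage & Denoual's own data): suppose the example base contains the pairs he walks → il marche, he walked → il marchait and he sings → il chante, and the input is he sang. The source-side analogy
he walks : he walked :: he sings : he sang
holds, so the translation is obtained by solving the corresponding target-side analogy
il marche : il marchait :: il chante : x
which gives x = il chantait.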

Pure EBMT  No explicit transfer  No extraction of symbolic knowledge No use of templates Analogies do not always represent any sort of linguistic reality  No training or preprocessing Solving the proportional analogies is done at run-time

The story so far (SMT)  Early history well known IBM group inspired by improved results in speech recognition when non-linguistic approach taken Availability of Canadian Hansards inspired purely statistical approach to MT (1988) Immediate partial success (60%) to the dismay of MT people Early observers (Wilks) predicted hybrid methods (“stone soup”) would evolve  Later developments Phrase-based SMT Syntax-based SMT

SMT in a nutshell (In case you’ve been on Kamino for the last 15 years)  From a parallel corpus two sets of statistical data are extracted Translation model: probabilities that a given word e in the SL gives rise to a word f in the TL (Target) language model: most probable word order for the words predicted by the translation model These two models are computed off-line  Given an input sentence, a “decoder” applies the two models, and juggles the probabilities to get the best score; various methods have been proposed
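In symbols (a standard noisy-channel sketch rather than the slide's own notation, writing s for the source sentence and t for a candidate target sentence), the decoder's "juggling of probabilities" is a search for

$$ \hat{t} \;=\; \arg\max_{t} P(t \mid s) \;=\; \arg\max_{t}\; \underbrace{P(s \mid t)}_{\text{translation model}} \;\cdot\; \underbrace{P(t)}_{\text{language model}} $$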

SMT in a nutshell (cont.)  The translation model has to take into account the fact that for a given e in there may be various different fs depending on context (grammatical variants as well as alternatives due to polysemy or homonymy) a given e may not necessarily correspond to a single f, or any f at all: “fertility” (e.g. may have → aurait; implemented → mis en application)

SMT in a nutshell (cont.)  The language model has to take into account the fact that The TL words predicted by the translation model will not occur in the same order as the SL words: “distortion” TL word choices can depend on neighbouring words (which may be easy to model) or, especially because of distortion, more distant words: “long-distance dependencies”, much harder to model

SMT in a nutshell (cont.)  Main difficulty: combination of fertility and distortion: Zeitmangel erschwert das Problem. Lack of time makes the problem more difficult. Eine Diskussion erübrigt sich demnach. Therefore there is no point in discussion. Das ist der Sache nicht angemessen. That is not appropriate for this matter. Den Vorschlag lehnt die Kommission ab. The Commission rejects the proposal.

SMT later developments  Phrase-based SMT  Extend models beyond individual words to word sequences (phrases) Direct phrase alignment Word alignment induced phrase model Alignment templates  Results better than word-based models, and show improvement proportional (log-linear) to corpus size  Phrases need not correspond to syntactic constituents, and limiting them to constituents hurts results

Direct phrase alignment (Wang & Waibel 1998, Och et al. 1999, Marcu & Wong 2002)  Enhance word translation model by adding joint probabilities, i.e. probabilities for phrases  Phrase probabilities compensate for missing lexical probabilities  Easy to integrate probabilities from different sources/methods, allows for mutual compensation

Word alignment induced model Koehn et al. 2003; example stolen from Knight & Koehn Maria did not slap the green witch Maria no daba una bofetada a la bruja verde Start with all phrase pairs justified by the word alignment

Word alignment induced model Koehn et al. 2003; example stolen from Knight & Koehn (Maria, Maria), (no, did not), (daba una bofetada, slap), (a la, the), (verde, green), (bruja, witch)

Word alignment induced model Koehn et al. 2003; example stolen from Knight & Koehn (Maria, Maria), (no, did not), (daba una bofetada, slap), (a la, the), (verde, green), (bruja, witch), (Maria no, Maria did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the), (bruja verde, green witch) etc.

Word alignment induced model Koehn et al. 2003; example stolen from Knight & Koehn (Maria, Maria), (no, did not), (slap, daba una bofetada), (a la, the), (bruja, witch), (verde, green), (Maria no, Maria did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the), (bruja verde, green witch), (Maria no daba una bofetada, Maria did not slap), (no daba una bofetada a la, did not slap the), (a la bruja verde, the green witch), (Maria no daba una bofetada a la, Maria did not slap the), (daba una bofetada a la bruja verde, slap the green witch), (no daba una bofetada a la bruja verde, did not slap the green witch), (Maria no daba una bofetada a la bruja verde, Maria did not slap the green witch)

Word alignment induced model  Given the phrase pairs collected, estimate the phrase translation probability distribution by relative frequency (without smoothing)

Alignment templates (Och et al. 1999; further developed by Marcu and Wong 2002, Koehn and Knight 2003, Koehn et al. 2003)  Problem of sparse data worse for phrases  So use word classes instead of words alignment templates instead of phrases more reliable statistics for translation table smaller translation table more complex decoding  Word classes are induced (by distributional statistics), so may not correspond to intuitive (linguistic) classes  Takes context into account

Problems with phrase-based models  Still do not handle very well... dependencies (especially long-distance) distortion discontinuities (e.g. bought = habe … gekauft)  More promising seems to be...

Syntax-based SMT  Better able to handle Constituents Function words Grammatical context (e.g. case marking)  Inversion Transduction Grammars  Hierarchical transduction model  Tree-to-string translation  Tree-to-tree translation

Inversion transduction grammars  Wu and colleagues (1997 onwards)  Grammar generates two trees in parallel and mappings between them  Rules can specify order changes  Restriction to binary rules limits complexity
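In Wu's notation the two binary rule types can be sketched as follows (standard ITG notation, not taken from these slides): a straight rule keeps the order of its daughters in both languages, an inverted rule reverses it on the target side, and these two orientations are the only order changes the grammar can specify.

$$ A \rightarrow [\,B \; C\,] \quad \text{(straight: } B\,C \text{ in both languages)} \qquad A \rightarrow \langle B \; C \rangle \quad \text{(inverted: } B\,C \text{ in the source, } C\,B \text{ in the target)} $$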

Inversion transduction grammars

The grammar is trained on a word-aligned bilingual corpus: Note that all the rules are learned automatically  Translation uses a decoder which effectively works like traditional RBMT: Parser uses source side of transduction rules to build a parse tree Transduction rules are applied to transform the tree The target text is generated by linearizing the tree

Almost all possible mappings can be handled; the missing ones (crossing constraints) are not found in Wu’s corpus, but examples can apparently be found

Hierarchical transduction model (Alshawi et al. 1998)  Based on finite-state transducers, also uses binary notation  Uses automatically induced dependency structure  Initial head-word pair is chosen  Sentence is then expanded by translating the dependent structures

Tree-to-string translation (Yamada & Knight 2001, Charniak 2003)  Uses (statistical) parser on input side only  Tree is then subject to reordering and insertion according to models learned from data  Lexical translation is then done, again according to probability models

[Figure: Yamada & Knight tree-to-string example: the parsed English input is reordered, particles (e.g. wa) are inserted, words are translated, and the tree is linearized to give "kare ha ongaku wo kiku no ga daisuki desu"]

Tree-to-tree translation (Gildea 2003)  Use parser on both sides to capture structural differences  Subtree cloning (Habash 2002, Čmejrek et al. 2003)  Full morphology/syntactic/semantic parsing  All based on stochastic grammars

Latest developments in RBMT  RBMT making a come-back (e.g. METIS)  Perhaps it was always there, just wasn’t represented in CL journals/conferences  There is some activity, but around the periphery Open-source systems development for low-density languages  Much use made of corpus-derived modules, e.g. tagging, chunking  SMT is now RBMT, only the rules are learned rather than written by linguists

Overview  The story so far EBMT SMT Latest developments in RBMT  Is there convergence? Some attempts to classify MT (Carl and Wu’s MT model spaces)  Has the empire struck back?

Classifications of MT  Empirical vs. Rationalist data- vs theory-driven use (or not) of symbolic representation From MLIM chapter 4:  high vs. low coverage  low vs. high quality/fluency  shallow vs. deep representation  Distinguish, in the above, design vs. consequence How true are they anyway?

EBMT~SMT: Is there convergence?  Lively debate on mtlist  Articles by Somers, Turcato & Popowich in Carl & Way (2003) Hutchins, Carl, Wu (2006) in special issue of Machine Translation  Slides marked ! need your input!

Essential features of EBMT Use of bilingual corpus data as the main (only?) source of knowledge (Somers)  Most early EBMT systems were hybrids We do not know a priori which parts of example are relevant (Turcato & Popowich)  Raw data is consulted at run-time: (little or) no preprocessing  Therefore template-based EBMT is already a hybrid (with RBMT) Act of matching the input against the examples, regardless of how they are stored (Hutchins)

Pros (and cons) of analogy model  Like CBR: Library of cases used during task performance Analogous examples broken down, adapted, recombined  In contrast with other machine learning methods (which use offline learning to compile an abstract performance model)  No loss of coverage due to incorrect generalization during training  Guaranteed correct when input is exactly like an example in the training set (not true of SMT)  But: Lack of generalization leads to potential runtime inefficiency (Wu, 2006)

EBMT~SMT: Common features  Easily agreed Use of bilingual corpus data as the main (only?) source of knowledge Translation relations are derived automatically from the data Underlying methods are independent of language-pair, and hence of language similarity  More contentious Bilingual corpus data should be real (a practical issue for SMT, but some EBMT systems use “hand-crafted” examples) System can be easily extended just by adding more data

EBMT~RBMT common features  Hybrid is easy to conceive Rule-based analysis/generation with example-based transfer Example-based processing only for awkward cases !

SMT~RBMT common features  Some versions of SMT exactly mirror classic RBMT parse-transfer-generate  Same things are hard Long-distance dependency Discontinuous constituents !

Wu’s 3D classification of all MT  Example-based vs. schema-based abstraction or generalization performed at run-time  Compositional vs. lexical Relates primarily to transfer (or equiv.)  Statistical vs. logical  Pictures also show historical development

Classic (direct and transfer) MT models  Early systems (Georgetown) lexical and compositional  Treatment of idioms, collocations, phrasal translations in classical 2G transfer systems  Modern RBMT systems starting to adopt statistical methods (according to Wu)  Where do commercial systems sit?

EBMT systems

SMT systems

Example-based SMT systems

Summary

Model space of corpus-based MT (Carl 2000)  Based on Dummett’s theory of meaning  Rich vs austere Complexity of representations  Molecular vs holistic Descriptions based on finite set of predefined features vs global distinctions  Fine-grained vs coarse-grained Based on smaller or larger units

Rich vs austere  Translation memories are most austere, depending only on graphemic similarity  TMs with annotated examples (e.g. Planas & Furuse) are richer  Early EBMT systems, and recent systems where examples are generalized, are rich  EBMT using light annotation (e.g. tags, markers) is moderately rich  Pure EBMT (Lepage & Denoual) is austere  Early SMT systems were austere, but move towards syntax makes them richer  Phrase-based SMT still austere

[Figure: the rich ↔ austere axis, placing annotated translation memories, classic EBMT (Sato, Nagao), template-based EBMT (McTait, Brown, Cicekli), phrase-based SMT, syntax-based SMT, marker-based EBMT (Way), EBMT with lightly annotated examples, translation memories, early SMT (Brown et al.), pure EBMT (Lepage) and METIS along it]

Molecular vs holistic  Early SMT purely holistic, as is pure EBMT  TMs molecular: distance measure based on fixed set of symbols  Translation templates are holistic, but molecular if they depend on some sort of analysis  Phrase-based and syntax-based SMT highly molecular

[Figure: the molecular ↔ holistic axis, placing annotated translation memories, classic EBMT (Sato, Nagao), template-based EBMT (Cicekli), template-based EBMT (McTait, Brown), phrase-based SMT, syntax-based SMT, marker-based EBMT (Way), EBMT with lightly annotated examples, translation memories, early SMT (Brown et al.), pure EBMT (Lepage), and METIS analysis and METIS generation (shown separately) along it]

Coarse- vs. fine-grained  Coarse-grained translates with bigger units  TM system works only on sentences: coarse-grained  Word-based systems are fine-grained: Early SMT  Phrase-based SMT slightly more coarse-grained  Template-based EBMT fine-grained !

[Figure: the coarse ↔ fine axis, placing translation memories, phrase-based SMT, marker-based EBMT (Way), template-based EBMT (McTait, Brown) and early SMT (Brown et al.) along it]

Overview  The story so far EBMT SMT Latest developments in RBMT  Is there convergence? Some attempts to classify MT (Carl and Wu’s MT model spaces)  Has the empire struck back?

Has the empire struck back?  Is linguistics back in MT? Was MT ever of interest to linguists?  Is SMT like RBMT? !

Vauquois triangle To what extent can a given system be described in terms of the classic view of MT (G2) ? !

Has the empire struck back?  Is linguistics back in MT? Was MT ever of interest to linguists?  Is SMT like RBMT? ! As predicted by Wilks (“Stone soup” talk, 1992) way forward is hybrid Negative experience (for me) of seeing SMT presenters rediscovering problems first described by Yngve, Vauquois without referencing the original papers!

IT’S LIFE, JIM, BUT NOT AS WE KNOW IT. LINGUISTICS

SMT EBMT RBMT ! Fill in the gaps [Figure: the combined model space, with the SMT, EBMT and RBMT regions to be filled in over the systems from the earlier figures: annotated translation memories, classic EBMT (Sato, Nagao), template-based EBMT (Cicekli; McTait, Brown), phrase-based SMT, syntax-based SMT, marker-based EBMT (Way), EBMT with lightly annotated examples, translation memories, early SMT (Brown et al.), pure EBMT (Lepage)]