Statistical Machine Translation Or, How to Put a Lot of Work Into Getting the Wrong Answer. Timothy White. Presentation for CSC 9010: Natural Language Processing, 2005.


Historic Importance
Almost immediately recognized as an important application of computers.
Warren Weaver (1949): "I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need to do is strip off the code in order to retrieve the information contained in the text."
In 1954 IBM demonstrated a word-for-word system.
Adapted from [Dorr04], slide 3

Modern Importance
Commercial: translation is one of the most-used Google features, and Babelfish sees extensive use. The EU spends over $1 billion per year on translations. The many businesses with international operations are also a huge potential market.
Academic: requires knowledge of many NLP areas: lexical semantics, parsing, morphological analysis, statistical modeling, etc. Good MT would greatly increase the ease of sharing knowledge and research among the world-wide scientific community.
Adapted from [Dorr04], slides 4-5

Goals of this Presentation
Present the basic concepts underlying modern Statistical Machine Translation.
Present the IBM Model 3 system as an example.
Describe methods of using Language and Translation Models to produce actual translations.
Set a foundation for self-exploration.

Statistical Machine Translation
Motivation: to produce a (translated) sentence β that is grammatically correct in language B and semantically identical to a source sentence α in language A.
The Noisy Channel: assume that sentence β was meant to be written, but was corrupted by 'noise' and came out as α in language A. Therefore, determine sentence β by considering 1) what sentences tend to be written and 2) how a sentence β in language B turns into a sentence α in language A.
Philosophy: a person thought sentence β in language B, but accidentally communicated it in language A; so we must figure out what β was intended by α. [Knig99]

Bayes' Theorem
In English: the probability of event β occurring, given that event α occurred, is equal to the probability of α occurring given β, times the probability of β, divided by the probability of α.
Logically: probability(β if α) = probability(α if β) * probability(β) / probability(α)
Formally: P(β|α) = P(α|β) P(β) / P(α)

Why It Is Relevant
We know that sentence α occurred. Let's generate every possible translation and call the collection a haystack; somewhere in the haystack there must be the needle that is the 'perfect' translation. But how could we find it? If we know the probability of a translation occurring (i.e. its grammatical correctness), we can eliminate 99% of the candidate sentences.
Perfect Translation: the translation that would be produced if a team of experts were to translate the sentence.

Applying Bayes' Theorem
The most likely translation β of the source sentence α is the one that maximizes P(β) * P(α|β). (P(α) is constant across all candidate translations, so it can be ignored when maximizing.)
The likelihood of a translation is therefore the product of:
1) the grammatical correctness of sentence β (i.e. the likelihood that someone would say it): P(β)
2) the semantic correctness of sentence β compared to α (i.e. the likelihood that a person who thought β in language B would say α in language A): P(α|β)
The best translation is the sentence that has the highest score. [Knig99]
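Written as a single formula, the argument above is:

```latex
\hat{\beta}
  = \arg\max_{\beta} P(\beta \mid \alpha)
  = \arg\max_{\beta} \frac{P(\alpha \mid \beta)\, P(\beta)}{P(\alpha)}
  = \arg\max_{\beta} P(\alpha \mid \beta)\, P(\beta)
```

since P(α) is the same for every candidate β.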

Plan of Attack
In order to translate a sentence:
1) create a word-for-word translation of the source sentence
2) add, replace, and reorder words to increase grammatical accuracy
3) repeat step 2 until a maximal translation has been found (until there are no more possible modifications that increase the overall score of the sentence)
* Not necessarily in that order *
For our purposes we will assume B is English and A is Spanish.

Parts of a Statistical Machine Translation System
Language Model: assigns a probability P(β) to any destination-language string, representing the grammatical correctness of that sentence. ~ called 'Fluency' in the textbook
Translation Model: assigns a probability P(α|β) to any pair of strings in languages A and B, representing the meaningful similarity of the two sentences. ~ called 'Faithfulness' in the textbook
Decoder: takes a sentence α and attempts to return the sentence β that maximizes P(β|α), using Bayes' Theorem together with the Language Model and Translation Model. Typically it creates a forest of guessed β's, then finds the best by evaluating them according to the Language and Translation Models. ~ called 'Search' in the textbook
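To make the division of labor concrete, here is a minimal Python sketch of the three components. The ToyLanguageModel, ToyTranslationModel, and the word lexicon are invented placeholders for illustration only; they are not the actual models described in these slides.

```python
import math

class ToyLanguageModel:
    """Assigns P(beta) to a candidate English string (stand-in for an n-gram model)."""
    def score(self, beta):
        # Toy heuristic: reward shorter, non-repetitive strings.
        words = beta.split()
        repeats = len(words) - len(set(words))
        return math.exp(-0.1 * len(words) - 1.0 * repeats)

class ToyTranslationModel:
    """Assigns P(alpha | beta) using a toy word-for-word dictionary."""
    def __init__(self, lexicon):
        self.lexicon = lexicon  # e.g. {("verde", "green"): 1.0, ...} (hypothetical)
    def score(self, alpha, beta):
        prob = 1.0
        for a in alpha.split():
            # Probability that *some* word of beta explains source word a.
            best = max((self.lexicon.get((a, b), 1e-4) for b in beta.split()),
                       default=1e-4)
            prob *= best
        return prob

class Decoder:
    """Picks the candidate beta that maximizes P(beta) * P(alpha | beta)."""
    def __init__(self, lm, tm):
        self.lm, self.tm = lm, tm
    def decode(self, alpha, candidates):
        return max(candidates,
                   key=lambda b: self.lm.score(b) * self.tm.score(alpha, b))
```

Given a list of candidate translations produced by any translation model, Decoder(lm, tm).decode(alpha, candidates) returns whichever candidate scores highest under the product of the two models.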

Language Model
"How likely is this sentence to be uttered by a native B speaker?" Provides a measure of the grammatical accuracy of the sentence.
How do you compare Language Models? Using a corpus of test data, compare P(model | test-data).
P(model | test-data) = P(model) * P(test-data | model) / P(test-data)
But P(model) and P(test-data) are the same for all models, so comparing P(model | test-data) amounts to comparing P(test-data | model). Therefore, the language models that assign the highest probability to the test data are the best.
A language model is simply an N-gram/Brill/CFG/etc. structure used to compute the probabilities of word combinations, instead of using those same probabilities to assign parts of speech.
Because the probabilities assigned to the data will be extremely small, models are usually compared using the quantity 2^(-log2(P(e))/N). This is called the perplexity; as P(e) decreases, perplexity increases, so a small perplexity is better. [Knig99]
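A minimal sketch of this comparison, assuming hypothetical per-sentence probabilities assigned by two language models to the same 50-word test set (all numbers are invented for illustration):

```python
import math

def perplexity(sentence_probs, total_words):
    """2^(-log2 P(test-data) / N): lower perplexity means a better model fit."""
    log_prob = sum(math.log2(p) for p in sentence_probs)
    return 2 ** (-log_prob / total_words)

# Hypothetical probabilities each model assigns to the three test sentences.
model_a = [1e-6, 5e-7, 2e-6]   # assigns higher probability to the test data
model_b = [1e-9, 1e-8, 5e-9]

print(perplexity(model_a, 50))  # smaller value: model A fits the test data better
print(perplexity(model_b, 50))
```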

Translation Model
Based on a 'philosophy' of translation:
1. Sentences are converted into predicate logic / logical assertions, and these assertions are then rendered in the other language. "Bob and Jane are friends" => Friends(Bob, Jane) => "Bob y Jane son amigos."
2. A sentence in language B gets syntactically parsed (i.e. POS-tagged) into a tree diagram, the tree is rearranged to match language A syntax, and finally all words are translated into language A. » The "Transfer Metaphor"
3. Words in language B are translated word-for-word into language A and then randomly scrambled. » The most basic approach; commonly used model(s): IBM Models 1-5.

Translation Model: IBM Model 3
Developed at IBM by Brown, Della Pietra, Della Pietra, and Mercer, described in "The mathematics of statistical machine translation: Parameter estimation" (1993).
Very easy to understand and implement. When used in conjunction with a good Language Model, it will weed out poor permutations.
The basic idea is to translate the sentence word-for-word, then return a permutation of the old sentence. We rely on the Language Model to determine how good this permutation is. We keep feeding back permutations until a good one has been found.
Accounts for words that don't have a 1:1 translation.

IBM Model 3: Fertility
Definition: the number of words in language A that are required to translate a given word in language B.
Example: "Mary did not slap the green witch." (English) "Mary no daba una bofetada a la bruja verde." (Spanish)
The word 'slap' in English translates to 'daba una bofetada' in Spanish. Thus, when translating English to Spanish, 'slap' has a fertility of 3.
Adapted from [Knig99]

IBM Model 3: Example the First
Translate "Mary did not slap the green witch" from English to Spanish.
Step 1: Recopy the sentence, copying every word the number of times dictated by its fertility. Fertilities: did = 0; Mary = not = green = witch = the = 1; slap = 3; plus 1 'spurious' word.
Mary not slap slap slap the the green witch
Step 2: Translate the words into language A.
Mary no daba una bofetada a la verde bruja
Step 3: Permute the words to find the proper structure.
Mary no daba una bofetada a la bruja verde
Adapted from [Knig99]

IBM Model 3 Structure
Translation: Τ(a|b) = probability of producing a from b. Τ('verde'|'green') = probability of translating 'green' into 'verde'. Τ = a two-dimensional table of floating point values.
Fertility: η(x|b) = probability that b will produce exactly x words when translated. η(3|'slap') = probability that 'slap' will be translated as exactly 3 words. η = a two-dimensional table of floating point values.
Distortion: δ(x|y,q,r) = probability that the word in location y will end up in location x when translated, given that the original sentence has q words and the translated sentence has r words. δ(7|6,7,9) = probability that the word in the 7th position of the translated sentence ('bruja') will have resulted from the word in the 6th position of the original sentence ('witch') when the original sentence is of length 7 and the translated sentence is of length 9: 'Mary did not slap the green *witch*' -> 'Mary no daba una bofetada a la *bruja* verde'. δ = a table of floating point values indexed by those four arguments.
Spurious Production: the chance that words may appear in the translation for which there was no directly corresponding word in the original sentence. Assign a probability ρ; every time a word is translated there is a ρ chance that a spurious word is produced. ρ = a single floating point number. [Knig99]
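As an illustration only, these parameter sets could be held in Python dictionaries keyed exactly as above; the values below simply echo the made-up numbers used in the worked example on the following slides.

```python
# Translation table T(a | b): P(target word a | source word b).
T = {
    ("Mary", "Mary"): 1.0,
    ("no", "not"): 0.7,        ("nada", "not"): 0.3,
    ("daba", "slap"): 1.0,     ("una", "slap"): 0.55,
    ("bofetada", "slap"): 1.0,
    ("la", "the"): 0.55,       ("el", "the"): 0.45,
    ("verde", "green"): 1.0,
    ("bruja", "witch"): 1.0,
}

# Fertility table eta(x | b): P(source word b produces exactly x target words).
eta = {
    ("Mary", 1): 1.0, ("did", 0): 1.0,
    ("not", 1): 0.75, ("not", 2): 0.25,
    ("slap", 3): 0.66, ("slap", 2): 0.34,
    ("the", 1): 1.0, ("green", 1): 1.0, ("witch", 1): 1.0,
}

# Distortion table delta(x | y, q, r): P(target position x | source position y,
# source length q, target length r); one illustrative entry from the slide.
delta = {(7, 6, 7, 9): 0.75}   # the slide's 'witch' -> 'bruja' example

# Spurious-insertion probability rho.
rho = 0.111
```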

IBM Model 3 Algorithm
Step 1: For each original word α_i, choose a fertility θ_i with probability η(θ_i | α_i); the sum of all these fertilities is M.
Step 2: Choose θ_0 'spurious' translated words to be generated from α_0 = NULL, using probability ρ and M.
Step 3: For each i = 1..n and each k = 1..θ_i, choose a translated word τ_ik with probability Τ(τ_ik | α_i).
Step 4: For each i = 1..n and each k = 1..θ_i, choose a target translated position π_ik with probability δ(π_ik | i, n, m).
Step 5: For each k = 1..θ_0, choose a position π_0k from the θ_0 - k + 1 remaining vacant positions in 1..M, for a total probability of 1/(θ_0!).
Step 6: Output the translated sentence with the words τ_ik in positions π_ik. [Knig99]
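Below is a compressed Python sketch of this generative story, reusing the hypothetical T, eta, and rho tables from the previous sketch. It simplifies the model: Steps 4 and 5 are collapsed into a random shuffle (leaving the Language Model to rank the resulting permutations, as the slides describe), and the spurious word is hard-coded to 'a' purely to mirror the example.

```python
import random

def sample_from(options, rng, default):
    """Sample a key from a {key: probability} table; fall back if it is empty."""
    if not options:
        return default
    keys, probs = zip(*options.items())
    return rng.choices(keys, weights=probs, k=1)[0]

def model3_generate(source_words, T, eta, rho, seed=0, spurious_word="a"):
    """Toy run of the Model 3 generative story (Steps 1-6 above)."""
    rng = random.Random(seed)
    # Step 1: sample a fertility for each source word and replicate it.
    fertile = []
    for w in source_words:
        fert = sample_from({x: p for (b, x), p in eta.items() if b == w}, rng, 1)
        fertile.extend([w] * fert)
    # Step 3: sample a translation for each copy (copy through if no entry).
    target = [sample_from({a: p for (a, b), p in T.items() if b == w}, rng, w)
              for w in fertile]
    # Step 2 (simplified): add spurious NULL-generated words.
    target.extend([spurious_word] * round(rho * len(target)))
    # Steps 4-6 (simplified): choose an ordering.
    rng.shuffle(target)
    return " ".join(target)

print(model3_generate("Mary did not slap the green witch".split(), T, eta, rho))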

IBM Model 3: Example the Second
Translating "Mary did not slap the green witch" from English to Spanish.
Step 1: For each original word α_i, choose a fertility θ_i with probability η(θ_i | α_i); the sum of all these fertilities is M.
α_0 = NULL, θ_0 = ?; α_1 = 'Mary', θ_1 = 1; α_2 = 'did', θ_2 = 0; α_3 = 'not', θ_3 = 1; α_4 = 'slap', θ_4 = 3; α_5 = 'the', θ_5 = 1; α_6 = 'green', θ_6 = 1; α_7 = 'witch', θ_7 = 1. M = 8
Step 2: Choose θ_0 'spurious' translated words to be generated from α_0 = NULL, using probability ρ and M. θ_0 = 1
Step 3: For each i = 1..n and each k = 1..θ_i, choose a translated word τ_ik with probability Τ(τ_ik | α_i).
(α_0 = NULL yields the spurious word 'a'; α_2 = 'did' has fertility 0 and yields nothing.) τ_11 = 'Mary'; τ_31 = 'no'; τ_41 = 'daba'; τ_42 = 'una'; τ_43 = 'bofetada'; τ_51 = 'la'; τ_61 = 'verde'; τ_71 = 'bruja'
Step 4: For each i = 1..n and each k = 1..θ_i, choose a target translated position π_ik with probability δ(π_ik | i, n, m).
π_11 = 0; π_31 = 1; π_41 = 2; π_42 = 3; π_43 = 4; π_51 = 6; π_61 = 8; π_71 = 7
Step 5: For each k = 1..θ_0, choose a position π_0k from the θ_0 - k + 1 remaining vacant positions in 1..M, for a total probability of 1/(θ_0!). π_01 = 5
Step 6: Output the translated sentence with the words τ_ik in positions π_ik. "Mary no daba una bofetada a la bruja verde"
Adapted from [Knig99]

Step 1: For each original word α_i, choose a fertility θ_i with probability η(θ_i | α_i); the sum of all these fertilities is M.
For every word in the original sentence, determine the most likely number of corresponding translated words.
α_1 = 'Mary', θ_1 = 1: η(1 | 'Mary') = 1.0
α_2 = 'did', θ_2 = 0: η(0 | 'did') = 1.0
α_3 = 'not', θ_3 = 1: η(1 | 'not') = .75, η(2 | 'not') = .25
α_4 = 'slap', θ_4 = 3: η(3 | 'slap') = .66, η(2 | 'slap') = .34
α_5 = 'the', θ_5 = 1: η(1 | 'the') = 1.0
α_6 = 'green', θ_6 = 1: η(1 | 'green') = 1.0
α_7 = 'witch', θ_7 = 1: η(1 | 'witch') = 1.0
M = 1 + 0 + 1 + 3 + 1 + 1 + 1 = 8

Step 2: Choose θ_0 'spurious' translated words to be generated from α_0 = NULL, using probability ρ and M.
Try to guess how many words will appear in the translation that are not directly related to any of the words from the original sentence.
α_0 = 'a', θ_0 = 1: ρ = .111, M = 9, .111 * 9 = 1
α_1 = 'Mary', θ_1 = 1; α_2 = 'did', θ_2 = 0; α_3 = 'not', θ_3 = 1; α_4 = 'slap', θ_4 = 3; α_5 = 'the', θ_5 = 1; α_6 = 'green', θ_6 = 1; α_7 = 'witch', θ_7 = 1; M = 8

Step 3: For each i = 1..n and each k = 1..θ_i, choose a translated word τ_ik with probability Τ(τ_ik | α_i).
Choose translations based on the most probable translation for a given word and model.
α_0 = NULL: Τ(null | null) = 1.0
τ_11 = 'Mary': α_1 = 'Mary', Τ('Mary' | 'Mary') = 1.0
α_2 = 'did': Τ(null | 'did') = 1.0 (fertility 0, so no translated word)
τ_31 = 'no': α_3 = 'not', Τ('no' | 'not') = .7, Τ('nada' | 'not') = .3
τ_41 = 'daba': α_4 = 'slap', Τ('daba' | 'slap') = 1.0*
τ_42 = 'una': α_4 = 'slap', Τ('una' | 'slap') = .55, Τ('un' | 'slap') = .45
τ_43 = 'bofetada': α_4 = 'slap', Τ('bofetada' | 'slap') = 1.0
τ_51 = 'la': α_5 = 'the', Τ('la' | 'the') = .55, Τ('el' | 'the') = .45
τ_61 = 'verde': α_6 = 'green', Τ('verde' | 'green') = 1.0
τ_71 = 'bruja': α_7 = 'witch', Τ('bruja' | 'witch') = 1.0
* The probability that the first translated word (τ_41) corresponding to 'slap' will be 'daba'.

Step 4: For each i = 1..n and each k = 1..θ_i, choose a target translated position π_ik with probability δ(π_ik | i, n, m).
π_11 = 0: δ(0|1,7,9) = .75
π_31 = 1: δ(1|3,7,9) = .75
π_41 = 2: δ(2|4,7,9) = .3
π_42 = 3: δ(3|4,7,9) = .3
π_43 = 4: δ(4|4,7,9) = .3
π_51 = 6: δ(6|5,7,9) = .75
π_61 = 8: δ(8|6,7,9) = .75
π_71 = 7: δ(7|7,7,9) = .75

Step 5: For each k = 1..θ_0, choose a position π_0k from the θ_0 - k + 1 remaining vacant positions in 1..M, for a total probability of 1/(θ_0!).
π_01 = 5: θ_0 = 1, k = 1, 1 vacant position remaining. Total probability = 1/(1!) = 1

Step 6: Output the translated sentence with the words τ_ik in positions π_ik. Display the result.
σ_1 = 'Mary', σ_2 = 'no', σ_3 = 'daba', σ_4 = 'una', σ_5 = 'bofetada', σ_6 = 'a', σ_7 = 'la', σ_8 = 'bruja', σ_9 = 'verde'
"Mary no daba una bofetada a la bruja verde"

Warning
All data in the previous example was completely made up.
Each function could be implemented in almost any manner. A simple approach would be to store ρ as a single floating point value and Τ, η, and δ as tables of floating point values indexed by their arguments.

Decoding
Takes a previously unseen sentence α and tries to find the translation β that maximizes P(β|α) (equivalently, maximizes P(β) * P(α|β)).
If translation is constrained such that the translation and the source sentence have the same word order, then decoding can be done in linear time.
If translation is constrained such that the syntax of the translated sentence can be obtained from rotations around binary tree nodes (simple tree re-ordering), then decoding requires high-polynomial time.
For most languages, which require what amounts to arbitrary word reordering, decoding is proven NP-complete.
One route to an optimal solution: map the translation of a sentence onto the Traveling Salesman Problem, then use available Integer Linear Programming software to find an optimal solution. [Germ01]

A Greedy Solution
Start decoding with an approximate solution, and incrementally improve it until no more improvements can be found.
Begin with a word-for-word most-likely translation. At each step in decoding, modify the translation sentence with the improvement that most increases the score P(β) * P(α|β) of the sentence.
Main improvement operations:
translateOneOrTwoWords(j1,e1,j2,e2): changes the translations of the word(s) located at j1 (and j2) into e1 and e2, which are inserted at the locations that most increase the score of the sentence.
translateAndInsert(j,e1,e2): changes the translation of the word at j into e1 and inserts e2 at the location that most increases the score of the sentence.
swapSegments(i1,i2,j1,j2): swaps the non-overlapping word segments [i1,i2] and [j1,j2].
Can find 'reasonable' translations in just several seconds per sentence. Overall, this greedy algorithm is O(n^6), where n is the number of words in the sentence.
Proposed by Germann et al. in "Fast Decoding and Optimal Decoding for Machine Translation" (2001).
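A hill-climbing skeleton of this idea, offered as a sketch rather than the authors' implementation: score() stands in for P(β) * P(α|β), the lexicon of alternative word translations is hypothetical, and only simplified versions of the re-translation and swap moves are shown (re-translation assumes the word-for-word alignment of the initial guess).

```python
def greedy_decode(source_words, initial_translation, score, lexicon):
    """Start from a word-for-word guess and repeatedly apply the single
    best-scoring local edit until no edit improves score(translation)."""
    current = list(initial_translation)
    improved = True
    while improved:
        improved = False
        best, best_score = current, score(current)
        # translateOneOrTwoWords-style move: re-translate one position.
        for j, src in enumerate(source_words):
            for alt in lexicon.get(src, []):
                if j < len(current) and current[j] != alt:
                    candidate = current[:j] + [alt] + current[j + 1:]
                    if score(candidate) > best_score:
                        best, best_score = candidate, score(candidate)
        # swapSegments-style move, restricted to adjacent single words.
        for i in range(len(current) - 1):
            candidate = current[:]
            candidate[i], candidate[i + 1] = candidate[i + 1], candidate[i]
            if score(candidate) > best_score:
                best, best_score = candidate, score(candidate)
        if best_score > score(current):
            current, improved = best, True
    return current
```

Because each iteration keeps only a strictly better sentence, the loop terminates once no local edit helps, which is exactly the stopping condition described above.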

Greedy Decoding: Example the Third
Translate "Bien entendu, il parle de une belle victoire" from French into English.
Create the initial word-for-word most-likely translation:
"bien entendu, il parle de une belle victoire" => "well heard, it talking a beautiful victory"
Modify using translateOneOrTwoWords(5,talks,7,great):
"bien entendu, il parle de une belle victoire" => "well heard, it talks a great victory"
Modify using translateOneOrTwoWords(2,understood,0,about):
"bien entendu, il parle de une belle victoire" => "well understood, it talks about a great victory"
Modify using translateOneOrTwoWords(4,he,null,null):
"bien entendu, il parle de une belle victoire" => "well understood, he talks about a great victory"
Modify using translateOneOrTwoWords(1,quite,2,naturally):
"bien entendu, il parle de une belle victoire" => "quite naturally, he talks about a great victory"
Final Result: "quite naturally, he talks about a great victory"
Adapted from [Germ01]

Improving Greedy Algorithms
The above algorithm can be simplified to obtain (almost) O(n) running time by:
1. Limiting the ability to swap sentence segments
2. Identifying, on the first sweep through, all independent improvements that increase the score of the sentence
3. Limiting improvements to these sections and considering the rest of the sentence 'optimized'
The speedup gained by adding these constraints greatly offsets the decrease in accuracy, allowing multiple searches using different starting permutations. [Germ03]

Incorporating Syntax into SMT
What if we use syntactic phrases as the foundation for our model, instead of words?
Extract cues about syntactic structure from the training data, in addition to word frequencies and alignments.
Can result in more "perfect translations", where the outputted sentence needs no human editing. [Char03]

Syntax-Based MT
Translation Model: given a parse tree in one language, produce a sentence in the other. 3 steps:
» Reorder the nodes of the parse tree
» Insert optional words at each node
» Translate each leaf (word) from one language into the other
Utilizes a Probabilistic Context-Free Grammar extracted from a training corpus (each operation is dependent on the corresponding probability).
Also accounts for phrasal translations, i.e. a phrase directly translatable without any operations ('not' -> 'ne…pas' going from English to French).
This is the "Transfer Metaphor". [Char03]
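A toy Python sketch of the three channel operations applied to a hand-built parse tree; the reorder, insert, and lexicon tables below are invented for illustration and are not the actual parameterization used in the cited work.

```python
def translate_tree(node, reorder_prob, insert_prob, lexicon):
    """Apply the three operations to a toy parse tree: (1) reorder the children,
    (2) optionally insert a function word, (3) translate each leaf word.
    Trees are (label, children) tuples; leaves are (label, word) tuples.
    Returns the word list and the product of the probabilities used."""
    label, payload = node
    if isinstance(payload, str):                       # leaf: translate the word
        word, prob = lexicon.get(payload, (payload, 1.0))
        return [word], prob
    children = payload
    # Step 1: pick the most probable ordering of this node's children.
    orders = reorder_prob.get(label, {tuple(range(len(children))): 1.0})
    order, p_order = max(orders.items(), key=lambda kv: kv[1])
    words, prob = [], p_order
    for idx in order:
        child_words, p_child = translate_tree(children[idx], reorder_prob,
                                              insert_prob, lexicon)
        words.extend(child_words)
        prob *= p_child
    # Step 2: optionally insert a word at this node.
    inserted, p_insert = insert_prob.get(label, (None, 1.0))
    if inserted:
        words.insert(1, inserted)  # toy placement; a full model also picks where
    return words, prob * p_insert

# Hypothetical toy tables for English -> French-like output.
tree = ("S", [("NP", "he"), ("VP", [("V", "talks"), ("NP", "victory")])])
lexicon = {"he": ("il", 1.0), "talks": ("parle", 0.9), "victory": ("victoire", 0.8)}
reorder = {"VP": {(0, 1): 0.7, (1, 0): 0.3}}
insert = {"VP": ("de", 0.5)}
words, prob = translate_tree(tree, reorder, insert, lexicon)
print(" ".join(words), prob)   # "il parle de victoire" 0.252
```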

Syntax-Based MT
Decoding: build a large forest of nearly all possible parse trees for a sentence. Remove all parse trees with a PCFG probability below a threshold. Use a lexicalized PCFG to evaluate the remaining trees (it is too expensive to apply to the full forest). Choose the best translation.
While the pruning therefore uses the non-lexical model, empirical evidence has shown that the threshold removes most irrelevant parses without affecting parse accuracy (because there are very few lexical combinations that can overcome such a disadvantage). [Char03]

Conclusion
Statistical Machine Translation relies on a Language Model and a Translation Model to maximize P(β|α) = P(α|β)P(β)/P(α).
The translations produced are good enough to get a general idea of what a text contains, but usually need human editing of the results.
Utilizing syntax can increase the number of sentences that do not require post-editing.
Much time and effort is put into producing inaccurate results.
If you followed this superbly interesting and supremely stimulating presentation, you should now be capable of developing (at least in theory) a fairly powerful Machine Translation implementation!

References
Brown, P., Della Pietra, S., Della Pietra, V., and Mercer, R. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics.
Charniak, E., Knight, K., and Yamada, K. 2003. Syntax-based Language Models for Statistical Machine Translation. Proceedings of MT Summit IX, New Orleans.
Dorr, B. and Monz, C. 2004. Statistical Machine Translation. Lecture notes for CMSC 723, University of Maryland.
Germann, U. 2003. Greedy Decoding for Statistical Machine Translation in Almost Linear Time. Proceedings of HLT-NAACL 2003, Edmonton, AB, Canada.
Germann, U., Jahr, M., Knight, K., Marcu, D., and Yamada, K. 2001. Fast Decoding and Optimal Decoding for Machine Translation. Proceedings of the Conference of the Association for Computational Linguistics (ACL-2001), Toulouse, France.
Jurafsky, D. and Martin, J. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. (Textbook)
Knight, K. 1999. A Statistical MT Tutorial Workbook. Developed for the JHU 1999 Summer MT Workshop.