Machine Translation A Presentation by: Julie Conlonova, Rob Chase, and Eric Pomerleau.

Overview Language Alignment System Datasets  Sentence-aligned sets for training (e.g., the Hansards Corpus, the European Parliament Proceedings Parallel Corpus)  A word-aligned set for testing and evaluation to measure accuracy and precision Decoding

Language Alignment Goal: Produce a word-aligned set from a sentence-aligned dataset First step on the road toward Statistical Machine Translation Example Problem:  The motion to adjourn the House is now deemed to have been adopted.  La motion portant que la Chambre s'ajourne maintenant est réputée adoptée.

IBM Models 1 and 2 -Kevin Knight, A Statistical MT Tutorial Workbook, 1999 Each capable of being used to produce a word-aligned dataset separately. EM Algorithm Model 1 produces T-values based on normalized fractional counting of corresponding words. Additionally, Model 2 uses A-values for “reverse distortion probabilities” – probabilities based on the positions of the words

Training Data European Parliament Proceedings Parallel Corpus 1996-2003 Aligned Languages:  English - French  English - Dutch  English - Italian  English - Finnish  English - Portuguese  English - Spanish  English - Greek

Training Data cont. Eliminated  Misaligned sentences  Sentences with 50 or more words  XML tags  Symbols and numerical characters other than commas and periods
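The filtering steps on this slide could be sketched roughly as follows. This is a minimal illustration, not the presenters' code: the names `clean_sentence` and `clean_pair` are hypothetical, whitespace tokenization is an assumption, and treating a pair with an empty side as "misaligned" is only a stand-in for whatever check was actually used.

```python
import re

MAX_WORDS = 50                           # slide: drop sentences with 50 or more words
TAG_RE = re.compile(r"<[^>]+>")          # XML tags such as <SPEAKER ID=1>
JUNK_RE = re.compile(r"[^\w\s,.]|\d|_")  # symbols and digits; commas and periods survive

def clean_sentence(text):
    """Strip XML tags, symbols and digits, then split on whitespace."""
    text = TAG_RE.sub(" ", text)
    text = JUNK_RE.sub(" ", text)
    return text.split()

def clean_pair(src, tgt):
    """Return (src_tokens, tgt_tokens), or None if the pair should be dropped."""
    s, t = clean_sentence(src), clean_sentence(tgt)
    if not s or not t:
        return None  # assumption: an empty side signals a misaligned pair
    if len(s) >= MAX_WORDS or len(t) >= MAX_WORDS:
        return None
    return s, t
```

Note that apostrophes count as symbols here, so "s'ajourne" becomes the two tokens "s" and "ajourne".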

Ideally… http://www.cs.berkeley.edu/~klein/cs294-5

Bypassing Interlingua: Models I-III Variables contributing to the probability of a sentence:  Correlation between words in the source/target languages  Fertility of a word  Correlation between order of words in source sentence and order of words in target

A Translation Matrix
        Rob   Cat   is    Dog
Rob     1     0     0     0
Gato    0     1     0     0
es      0     0     0.5   0
esta    0     0     0.5   0
Perro   0     0     0     1

Building the Translation Matrix: Starting from alignments Find the sentence alignment If a word in the source aligns with a word in the target, then increment the translation matrix. Normalize the translation matrix
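The three steps above (take the alignment, increment, normalize) might look like the sketch below. It assumes word-aligned input is already available as index pairs; the function name `build_translation_matrix` and the input layout are hypothetical.

```python
from collections import defaultdict

def build_translation_matrix(aligned_corpus):
    """aligned_corpus: iterable of (src_tokens, tgt_tokens, links), where
    links is a list of (src_index, tgt_index) pairs (assumed input format)."""
    counts = defaultdict(lambda: defaultdict(float))
    for src, tgt, links in aligned_corpus:
        for i, j in links:
            counts[src[i]][tgt[j]] += 1.0  # increment the matrix cell
    # Normalize each source word's counts so they sum to 1.
    table = {}
    for s_word, row in counts.items():
        total = sum(row.values())
        table[s_word] = {t_word: c / total for t_word, c in row.items()}
    return table

corpus = [
    (["Dog", "is"], ["Perro", "es"], [(0, 0), (1, 1)]),
    (["Cat", "is"], ["Gato", "esta"], [(0, 0), (1, 1)]),
]
matrix = build_translation_matrix(corpus)
```

With this toy corpus, "is" is linked once to "es" and once to "esta", so each receives probability 0.5 after normalization.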

Can’t find alignments Most sentences in the Hansards corpus are 60 words long. There are many that can be over 100. With 100 words on each side, each word can link to any of 100 positions: 100^100 possible alignments

Counting  Rob is a boy. / Rob es nino.  Rob is tall. / Rob es alto.  Eric is tall. / Eric es alto.  … Base counts on co-occurrence, weighting based on sentence length.
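One way to read "weighting based on sentence length" is that each source word spreads one unit of count evenly over the words of the paired sentence. The sketch below assumes that reading; the slide does not give the exact weighting, and `cooccurrence_counts` is a hypothetical name.

```python
from collections import defaultdict

def cooccurrence_counts(pairs):
    """pairs: list of (source_tokens, target_tokens).
    Each co-occurrence adds 1 / len(target sentence) (assumed weighting)."""
    counts = defaultdict(lambda: defaultdict(float))
    for src, tgt in pairs:
        weight = 1.0 / len(tgt)
        for e in src:
            for f in tgt:
                counts[e][f] += weight
    return counts

pairs = [
    ("Rob is a boy".split(), "Rob es nino".split()),
    ("Rob is tall".split(), "Rob es alto".split()),
    ("Eric is tall".split(), "Eric es alto".split()),
]
counts = cooccurrence_counts(pairs)
```

Words that never share a sentence pair, such as "tall" and "nino", get no entry at all.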

Iterative Convergence Use the Expectation-Maximization (EM) algorithm Creates translation matrix
Columns: Rob, is, tall, boy
Rob    .66   .33   .25
es     .30   .66   .25
alto   .2    .05   .50
nino   .2    .05   0.5
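A minimal sketch of the EM loop for IBM Model 1, assuming the standard fractional-counting formulation: the expectation step splits each foreign word's count across the English words in the paired sentence, and the maximization step renormalizes. It omits the NULL word for brevity, and `train_model1` is a hypothetical name rather than the presenters' implementation.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """pairs: list of (english_tokens, foreign_tokens).
    Returns t with t[f][e] approximating t(f | e)."""
    f_vocab = {f for _, fs in pairs for f in fs}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: defaultdict(lambda: uniform))  # uniform start
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        # E-step: collect fractional counts.
        for es, fs in pairs:
            for f in fs:
                norm = sum(t[f][e] for e in es)
                for e in es:
                    frac = t[f][e] / norm
                    count[f][e] += frac
                    total[e] += frac
        # M-step: renormalize so that, for each e, the values over f sum to 1.
        new_t = defaultdict(lambda: defaultdict(float))
        for f, row in count.items():
            for e, c in row.items():
                new_t[f][e] = c / total[e]
        t = new_t
    return t

pairs = [
    ("Rob is a boy".split(), "Rob es nino".split()),
    ("Rob is tall".split(), "Rob es alto".split()),
    ("Eric is tall".split(), "Eric es alto".split()),
]
t = train_model1(pairs)
```

On this three-sentence toy corpus the table drifts toward pairing "es" with "is" and "alto" with "tall", although with so little data the values remain far from 0 or 1.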

Distorting the Sentence Word order changes between languages How is a sentence with 2 words distorted? How is a sentence with 3 words distorted? How is a sentence with … To keep track of this information we use…

A tesseract! (A quadruply nested default dictionary) This could be a problem if there are more than 100 words in a sentence. 100x100x100x100 = too big for RAM and takes too much time
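A quadruply nested default dictionary can be built with `collections.defaultdict`, as in the sketch below. The index order `a[l][m][j][i]` (source length, target length, target position, source position) is an assumption, and `nested_dict` and `normalize` are hypothetical helper names. Because only combinations that are actually seen get stored, the structure stays sparse instead of allocating all 100x100x100x100 cells.

```python
from collections import defaultdict

def nested_dict(depth, leaf=float):
    """Build a defaultdict nested `depth` levels deep with `leaf` values."""
    if depth == 1:
        return defaultdict(leaf)
    return defaultdict(lambda: nested_dict(depth - 1, leaf))

def normalize(a):
    """Turn counts a[l][m][j][i] into probabilities over i for each (l, m, j)."""
    for by_m in a.values():
        for by_j in by_m.values():
            for by_i in by_j.values():
                total = sum(by_i.values())
                for i in by_i:
                    by_i[i] /= total

a = nested_dict(4)
a[3][3][0][0] += 3.0  # position 0 aligned to position 0 three times
a[3][3][0][1] += 1.0  # position 0 aligned to position 1 once
normalize(a)
```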

Broad Look at MT “The translation process can be described simply as: 1. Decoding the meaning of the source text, and 2. Re-encoding this meaning in the target language.” - “Translation Process”, Wikipedia, May 2006

Decoding How to go from the T-matrix and A-matrix to a word alignment? There are several approaches…

Viterbi If only doing alignment, much smaller memory and time requirements. Returns optimal path. T-Matrix probabilities function as the “emission” matrix A-Matrix probabilities concerned with the positioning of words
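When both sentences are given, the most probable alignment under Models 1 and 2 factorizes over target positions, so a sketch can pick the best source position for each target word independently. The flat tuple-keyed dictionaries and the name `best_alignment` are assumptions for illustration; a missing A-value falls back to a uniform 1/l, which reduces to Model 1.

```python
def best_alignment(src, tgt, t, a):
    """For each target position j, return the source position i that maximizes
    t(tgt[j] | src[i]) * a(i | j, l, m).
    t is keyed by (target_word, source_word); a by (i, j, l, m)."""
    l, m = len(src), len(tgt)
    alignment = []
    for j, f in enumerate(tgt):
        scores = [t.get((f, e), 0.0) * a.get((i, j, l, m), 1.0 / l)
                  for i, e in enumerate(src)]
        alignment.append(max(range(l), key=scores.__getitem__))
    return alignment

# Hypothetical T-values for a reordered noun phrase.
t = {("el", "the"): 0.9, ("coche", "car"): 0.8, ("rojo", "red"): 0.8}
links = best_alignment(["the", "red", "car"], ["el", "coche", "rojo"], t, {})
```

Here `links[j]` is the source index chosen for target word `j`, so the swapped adjective and noun show up as crossed links.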

Decoding as a Translator Without supplying a translated sentence to the program, it is capable of being a stand-alone translator instead of a word aligner. However, while the Viterbi algorithm runs quickly with pruning for decoding, for translating the run time skyrockets.

Greedy Hill Climbing Knight & Koehn, What’s New in Statistical Machine Translation, 2003 Best first search 2-step look ahead to avoid getting stuck in most probable local maxima

Beam Search Knight & Koehn, What’s New in Statistical Machine Translation, 2003 Optimization of Best First Search with heuristics and “beam” of choices Exponential tradeoff when increasing the “beam” width
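As a toy illustration of the beam idea only, the sketch below translates word by word in source order and keeps the `beam_width` best partial outputs at each step. It is not the decoder described by Knight & Koehn: there is no reordering or fertility, and the bigram language model `lm`, the table layout, and all probabilities are hypothetical.

```python
import math

def beam_search(src, t, lm, beam_width=3):
    """src: source tokens; t[s] maps a source word to {target_word: prob};
    lm(prev, word) returns a bigram probability (assumed interface)."""
    beam = [(0.0, ["<s>"])]  # (log-probability, partial output)
    for s in src:
        candidates = []
        for score, out in beam:
            for w, p in t.get(s, {}).items():
                q = lm(out[-1], w)
                if p > 0 and q > 0:
                    candidates.append((score + math.log(p) + math.log(q), out + [w]))
        # Prune: keep only the beam_width highest-scoring hypotheses.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if not beam:
            return [], float("-inf")  # no hypothesis survived
    best_score, best_out = beam[0]
    return best_out[1:], best_score

t = {"Rob": {"Rob": 1.0},
     "es": {"is": 0.6, "it": 0.4},
     "alto": {"tall": 0.5, "high": 0.5}}
bigrams = {("Rob", "is"): 0.5, ("is", "tall"): 0.4, ("is", "high"): 0.1}
lm = lambda prev, word: bigrams.get((prev, word), 0.05)
words, score = beam_search(["Rob", "es", "alto"], t, lm)
```

A wider beam keeps more hypotheses alive at each step, which costs more time but lowers the chance of pruning the eventual best output early.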

Other Decoding Methods Knight & Koehn, What’s New in Statistical Machine Translation, 2003 Finite State Transducer  Mapping between languages based on a finite automaton Parsing  String to Tree Model

Problem: One to Many Necessary to take all alignments over a certain probability in order to capture the “probability that e has fertility at least a given value” Al-Onaizan, Curin, Jahr, etc., Statistical Machine Translation, 1999
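Keeping every link above a probability cutoff, rather than a single best link per word, could be sketched like this. The threshold value, the tuple-keyed table, and the name `fertile_alignment` are assumptions, and the probabilities are made up for the example.

```python
def fertile_alignment(src, tgt, t, threshold=0.2):
    """Return {source_index: [target_indices]}, keeping every link whose
    t(target_word | source_word) is at least `threshold`."""
    links = {i: [] for i in range(len(src))}
    for j, f in enumerate(tgt):
        for i, e in enumerate(src):
            if t.get((f, e), 0.0) >= threshold:
                links[i].append(j)
    return links

# Hypothetical T-values: English "not" produces both "ne" and "pas".
t = {("ne", "not"): 0.4, ("pas", "not"): 0.45, ("parle", "speak"): 0.7,
     ("parle", "not"): 0.05}
links = fertile_alignment(["does", "not", "speak"], ["ne", "parle", "pas"], t)
```

In this example "not" ends up with two links (fertility 2) and "does" with none.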

Results Study done in 2003 on word alignment error rates in Hansards corpus:  Model 2 – 29.3% on 8K training sentence pairs 19.5% on 1.47M training sentence pairs  Optimized Model 6 – 20.3% on 8K training sentence pairs 8.7% on 1.47M training sentence pairs Och and Ney, A Systematic Comparison of Various Statistical Alignment Models, 2003
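The alignment error rate (AER) reported by Och and Ney compares predicted links A against gold "sure" links S and "possible" links P: AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|). A small sketch of that computation, with precision and recall alongside, is below; the function name and the set-of-index-pairs representation are assumptions, and the inputs are assumed non-empty.

```python
def alignment_scores(predicted, sure, possible):
    """Each argument is a collection of (source_index, target_index) links.
    Sure links are treated as possible links too.
    Returns (precision, recall, aer)."""
    a, s = set(predicted), set(sure)
    p = set(possible) | s
    precision = len(a & p) / len(a)
    recall = len(a & s) / len(s)
    aer = 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
    return precision, recall, aer

precision, recall, aer = alignment_scores(
    predicted={(0, 0), (1, 1), (2, 3)},
    sure={(0, 0), (1, 1), (2, 2)},
    possible={(2, 3)},
)
```

In this example every predicted link is at least possible, so precision is 1, but one sure link is missed, giving recall 2/3 and an AER of 1/6.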

Expected Accuracy 70% overall Language performance:  Dutch French Italian, Spanish, Portuguese  Greek  Finnish

Possible Future Work Given more time, we would have implemented IBM Model 3 Additionally uses n, d, and p parameters for weighted alignments:  N, fertility: the number of words produced by one word  D, distortion  P, a parameter for words that are not produced directly by any source word Invokes Model 2 for scoring

Another Possible Translation Scheme Example-Based Machine Translation  Translation-by-Analogy  Can sometimes achieve better than the “gist” translations from other models

Why Is Improving Machine Translation Necessary?

A Chinese to English Translation

The End Are there any questions/comments?
