
Computational Linguistics Seminar LING-696G Week 10

Today's Topics
- Rewrote code for IBM Models 1, 2 and 3a
- Chaucer Project

Resources

Code:

ibm1pre.py (IBM Model 1; accepts formatted training data only)
  Example usage: python3 ibm1pre.py 0.1 ibm2x2align.txt -o ibm2x2align_model1.txt

ibm2pre.py (IBM Model 2; accepts formatted training data + Model 1 data)
  Example usage: python3 ibm2pre.py 0.1 ibm2x2align.txt ibm2x2align_model1.txt -o ibm2x2align_model2.txt

ibm3apre.py (IBM Model 3a; accepts formatted training data + Model 2 data)
  Example usage: python3 ibm3apre.py 0.1 ibm2x2align.txt ibm2x2align_model2.txt

Resources

Data transformation:

Source file: ibm_fertility.py
Example usage: python ibm_fertility.py -n 1 -f 1 ibm2x1.txt ibm2x1nonealign.txt

Parameters:
- -n (--none): 0 (don't generate n0ne), 1 (generate n0ne)
- -f (--fertility): 0, 1, 2, 3, etc. (maximum fertility)
- filename (input raw training data), e.g. ibm2x1.txt
- filename (output formatted training data), e.g. ibm2x1nonealign.txt
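For the simplest setting (no n0ne, fertility fixed at 1), the expansion just pairs the English sentence with every ordering of the foreign words. A minimal sketch of that case (the function name is ours; ibm_fertility.py itself also handles n0ne and higher fertilities):

from itertools import permutations

def expand_pair_fertility1(e_words, f_words):
    """Yield the 1-to-1 pre-aligned pairs for one raw training pair,
    assuming no n0ne and fertility exactly 1 (so both sides must
    already have equal length): one pair per foreign-word order."""
    assert len(e_words) == len(f_words)
    for f_perm in permutations(f_words):
        yield list(e_words), list(f_perm)

# e.g. (['the', 'house'], ['das', 'Haus']) yields
#   (['the', 'house'], ['das', 'Haus']) and (['the', 'house'], ['Haus', 'das'])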

Resources

Training data:

Raw: ibm2x2.txt
  Formatted: ibm2x2align.txt (--none 0 --fertility 2)
  Formatted: ibm2x2nonealign.txt (--none 1 --fertility 2)
Raw: ibm2x2r.txt
  Formatted: ibm2x2ralign.txt (--none 0 --fertility 2)
  Formatted: ibm2x2rnonealign.txt (--none 1 --fertility 2)
Raw: ibm2x1.txt
  Formatted: ibm2x1align.txt (--none 0 --fertility 1)
  Formatted: ibm2x1nonealign.txt (--none 1 --fertility 1)

Probabilities

IBM Models 1, 2 and 3a form a cascading architecture:

1. ibm_fertility.py expands unaligned sentence pairs (e.g. the house / das Haus) into the alignment/fertility possibilities, producing aligned sentence pairs (e.g. the house n0ne / das das Haus).
2. ibm1pre.py (Model 1) computes t(e|f) from the aligned pairs.
3. ibm2pre.py (Model 2) takes t(e|f) and computes t(e|f) and a(i|j,l_e,l_f).
4. ibm3apre.py (Model 3a) takes t(e|f) and a(i|j,l_e,l_f) and computes t(e|f), a(i|j,l_e,l_f) and n(ɸ|f).
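Putting the stages together, the full cascade for the 2x2 data with n0ne is (commands assembled from the usage lines shown on these slides):

python ibm_fertility.py -n 1 -f 2 ibm2x2.txt ibm2x2nonealign.txt
python3 ibm1pre.py 0.1 ibm2x2nonealign.txt -o ibm2x2nonealign_model1.txt
python3 ibm2pre.py 0.1 ibm2x2nonealign.txt ibm2x2nonealign_model1.txt -o ibm2x2nonealign_model2.txt
python3 ibm3apre.py 0.1 ibm2x2nonealign.txt ibm2x2nonealign_model2.txt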

Training Data

Assumptions:
- no ordering implied
- every word has a translation

Example:
1. the house
2. das Haus

Let t(e|f) be the probability that foreign word f translates into English word e, and let E = the set of all English words in the corpus. Initially (uniform distribution), t(e_i|f) = 1/|E|, with the constraint ∑_i t(e_i|f) = 1 for all f.

We will estimate:
1. t(the|das)
2. t(house|das)
3. t(the|Haus)
4. t(house|Haus)

Here |E| = 2, so each of these starts at 0.50; after training we might end up with, e.g., t(the|das) = 0.99, t(house|das) = 0.01, t(the|Haus) = 0.05, t(house|Haus) = 0.95.

Assumption: not all words are translated.
Example:
1. n0ne small
2. ja klein

Add t(n0ne|f_j) for all f_j; initially t(e_i|f) = 1/(|E|+1).
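As a concrete illustration of this initialization, a small sketch (the function name is ours) that builds the uniform t(e|f) table from a list of token-list pairs:

def init_uniform_t(pairs):
    """Build the initial uniform table t(e|f) = 1/|E| from training
    pairs of (english_tokens, foreign_tokens). If n0ne is among the
    English tokens, |E| already includes it, which is exactly the
    1/(|E|+1) case relative to the original vocabulary."""
    e_vocab = {e for e_sent, _ in pairs for e in e_sent}
    f_vocab = {f for _, f_sent in pairs for f in f_sent}
    u = 1.0 / len(e_vocab)
    return {(e, f): u for f in f_vocab for e in e_vocab}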

Expectation/Maximization (EM)

Cycle:
1. Start from an initial t(e|f).
2. Expectation: for each pair e,f in the training data, accumulate a weighted count c(e|f) += t(e|f) * occ(e,f).
3. Maximization: set the updated t(e|f) from the counts, rescaled with respect to f.
4. Repeat.

where:
- e = some English word, f = some foreign word
- t(e|f) = probability that f translates as e
- occ(e,f) = number of occurrences of the training pair e and f
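To make the cycle concrete, here is a minimal sketch of one EM iteration in the textbook IBM Model 1 formulation (cf. Koehn); the course's ibm1pre.py works on pre-aligned pairs and may count slightly differently:

from collections import defaultdict

def em_iteration(pairs, t):
    """One EM step of IBM Model 1.
    pairs: list of (e_sentence, f_sentence) token lists.
    t: dict mapping (e, f) -> current translation probability."""
    count = defaultdict(float)   # expected counts c(e|f)
    total = defaultdict(float)   # normalizers c(f)
    for e_sent, f_sent in pairs:
        for e in e_sent:
            # how strongly e is explained by this f_sent overall
            z = sum(t[(e, f)] for f in f_sent)
            for f in f_sent:
                w = t[(e, f)] / z          # expected count of e aligning to f
                count[(e, f)] += w
                total[f] += w
    # M-step: rescale with respect to each f
    return {(e, f): c / total[f] for (e, f), c in count.items()}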

Training

Recall IBM Model 1:
- no assumption about alignment, i.e. all alignments are possible

1. the house
2. das Haus

See training datum #2: house / Haus. Compute a weighted sum of the number of times we see the datum.

Initially, set t(the|Haus) = 0.5 and t(house|Haus) = 0.5 (uniform distribution).

If this datum contributes weight k = t(house|Haus), then c(house|Haus) += k and c(Haus) += k.

Normalize: updated t(house|Haus) = c(house|Haus)/c(Haus).

This is the Model 1 we've played with so far.

Training

IBM Model 1:
- no assumptions about alignment
- suppose it's also possible that some f isn't translated at all:
  1. the house n0ne
  2. das Haus

Let's line the words up (e and f in 1-to-1 correspondence, ⇕) and rewrite this as:
the house n0ne
das das Haus

Then the pair "the house / das Haus" is equivalent to the 4 pairs below, produced by ibm_fertility.py:
1. the house / das Haus
2. the house / Haus das
3. the house n0ne / das das Haus
4. n0ne the house / das Haus Haus

2x2 training data with n0ne (ibm2x2nonealign.txt)

Expansion of: the house / das Haus; the book / das Buch; a book / ein Buch (vocabulary: the a book house / das ein Buch Haus), with fertility possibilities 0, 1, 2:

1. the house / das Haus
2. the house / Haus das
3. the house n0ne / das das Haus
4. n0ne the house / das Haus Haus
5. the book / das Buch
6. the book / Buch das
7. the book n0ne / das das Buch
8. n0ne the book / das Buch Buch
9. a book / ein Buch
10. a book / Buch ein
11. a book n0ne / ein ein Buch
12. n0ne a book / ein Buch Buch

3 pairs yield 12 pre-aligned pairs.

Assume training sentence pairs are now pre-formatted in 1-to-1 correspondence, i.e. pre-aligned. Example (⇕):
the house n0ne
das das Haus

Note: for simplicity, alignment data not shown here.

2x2 training data with n0ne: IBM Model 1 results

Command: python3 ibm1pre.py 0.1 ibm2x2nonealign.txt

Iteration threshold: 0.1, Iteration 2:
t(none|das) = 0.18   t(the|das) = 0.51   t(book|das) = 0.16   t(house|das) = 0.16
t(none|ein) = 0.15   t(a|ein) = 0.48     t(book|ein) = 0.37
t(none|Buch) = 0.18  t(the|Buch) = 0.16  t(a|Buch) = 0.16     t(book|Buch) = 0.51
t(none|Haus) = 0.15  t(the|Haus) = 0.37  t(house|Haus) = 0.48

Iteration threshold: 0.01, Iteration 11:
t(none|das) = 0.27   t(the|das) = 0.71   t(book|das) = 0.01   t(house|das) = 0.01
t(none|ein) = 0.05   t(a|ein) = 0.74     t(book|ein) = 0.21
t(none|Buch) = 0.27  t(the|Buch) = 0.01  t(a|Buch) = 0.01     t(book|Buch) = 0.71
t(none|Haus) = 0.05  t(the|Haus) = 0.21  t(house|Haus) = 0.74

Iteration threshold: ?, Iteration 26:
t(none|das) = 0.33   t(the|das) = 0.67
t(a|ein) = 0.76      t(book|ein) = 0.24
t(none|Buch) = 0.33  t(book|Buch) = 0.67
t(the|Haus) = 0.24   t(house|Haus) = 0.76

Iteration threshold: 1e-08, Iteration 68:
t(none|das) = 0.33   t(the|das) = 0.67
t(a|ein) = 0.76      t(book|ein) = 0.24
t(none|Buch) = 0.33  t(book|Buch) = 0.67
t(the|Haus) = 0.24   t(house|Haus) = 0.76

Best that can be done!

2x2 training data (ibm2x2align.txt)

Same dataset but no n0ne (vocabulary: the a book house / das ein Buch Haus):

1. the house / das Haus
2. the house / Haus das
3. the book / das Buch
4. the book / Buch das
5. a book / ein Buch
6. a book / Buch ein

Command: python3 ibm1pre.py 0.1 ibm2x2align.txt

Iteration threshold: 0.1, Iteration 4:
t(the|das) = 0.83   t(book|das) = 0.08   t(house|das) = 0.09
t(a|ein) = 0.72     t(book|ein) = 0.28
t(the|Buch) = 0.08  t(a|Buch) = 0.09     t(book|Buch) = 0.83
t(the|Haus) = 0.28  t(house|Haus) = 0.72

2x2 training data: IBM Model 1 results by threshold

Iteration threshold: 0.1, Iteration 4:
t(the|das) = 0.83   t(book|das) = 0.08   t(house|das) = 0.09
t(a|ein) = 0.72     t(book|ein) = 0.28
t(the|Buch) = 0.08  t(a|Buch) = 0.09     t(book|Buch) = 0.83
t(the|Haus) = 0.28  t(house|Haus) = 0.72

Iteration threshold: 0.01, Iteration 12:
t(the|das) = 1.00   t(a|ein) = 0.94   t(book|ein) = 0.06
t(book|Buch) = 1.00   t(the|Haus) = 0.06   t(house|Haus) = 0.94

Iteration threshold: ?, Iteration 28:
t(the|das) = 1.00   t(a|ein) = 0.98   t(book|ein) = 0.02
t(book|Buch) = 1.00   t(the|Haus) = 0.02   t(house|Haus) = 0.98

Iteration threshold: ?, Iteration 76:
t(the|das) = 1.00   t(a|ein) = 0.99   t(book|ein) = 0.01
t(book|Buch) = 1.00   t(the|Haus) = 0.01   t(house|Haus) = 0.99

Iteration threshold: 1e-05, Iteration 229:
t(the|das) = 1.00   t(a|ein) = 1.00   t(book|Buch) = 1.00   t(house|Haus) = 1.00

Converges nicely!

2x2 training data: Model 1 summary

Training data: the house / das Haus; the book / das Buch; a book / ein Buch.

No n0ne (Iteration threshold: 1e-05, Iteration 229):
t(the|das) = 1.00   t(a|ein) = 1.00   t(book|Buch) = 1.00   t(house|Haus) = 1.00

With n0ne (Iteration threshold: ?, Iteration 26):
t(none|das) = 0.33   t(the|das) = 0.67
t(a|ein) = 0.76      t(book|ein) = 0.24
t(none|Buch) = 0.33  t(book|Buch) = 0.67
t(the|Haus) = 0.24   t(house|Haus) = 0.76

2x1 training data with n0ne (ibm2x1nonealign.txt)

Suppose we provide training data about a flavoring particle "ja":
1. small / ja klein
2. small / klein
3. big / ja groß
4. big / groß

This expands into the pre-aligned pairs:
1. small n0ne / ja klein
2. small n0ne / klein ja
3. small / klein
4. big n0ne / ja groß
5. big n0ne / groß ja
6. big / groß

4 pairs yield 6 pre-aligned pairs.

2x1 training data with n0ne: IBM Model 1 results

Command: python3 ibm1pre.py 0.1 ibm2x1nonealign.txt

Iteration threshold: 0.1, Iteration 4:
t(none|ja) = 0.92    t(small|ja) = 0.04   t(big|ja) = 0.04
t(none|klein) = 0.05   t(small|klein) = 0.95
t(none|groß) = 0.05    t(big|groß) = 0.95

Iteration threshold: 0.01, Iteration 7:
t(none|ja) = 0.99
t(none|klein) = 0.01   t(small|klein) = 0.99
t(none|groß) = 0.01    t(big|groß) = 0.99

Iteration threshold: ?, Iteration 11:
t(none|ja) = 1.00   t(small|klein) = 1.00   t(big|groß) = 1.00

Converges nicely!

2x1 training data (ibm2x1align.txt)

Same dataset but no n0ne:
1. small / ja klein
2. small / klein
3. big / ja groß
4. big / groß

Command: python3 ibm1pre.py 0.1 ibm2x1align.txt

Iteration threshold: 0.1, Iteration 2:
t(small|ja) = 0.50   t(big|ja) = 0.50
t(small|klein) = 1.00   t(big|groß) = 1.00

Best we can do given the limitation of no n0ne!

Summary: Model 1

2x1, no n0ne (Iteration threshold: 0.1, Iteration 2):
t(small|ja) = 0.50   t(big|ja) = 0.50   t(small|klein) = 1.00   t(big|groß) = 1.00

2x1, with n0ne (Iteration threshold: ?, Iteration 11): Best!
t(none|ja) = 1.00   t(small|klein) = 1.00   t(big|groß) = 1.00

2x2, no n0ne (Iteration threshold: 1e-05, Iteration 229):
t(the|das) = 1.00   t(a|ein) = 1.00   t(book|Buch) = 1.00   t(house|Haus) = 1.00

2x2, with n0ne (Iteration threshold: ?, Iteration 26):
t(none|das) = 0.33   t(the|das) = 0.67   t(a|ein) = 0.76   t(book|ein) = 0.24
t(none|Buch) = 0.33  t(book|Buch) = 0.67  t(the|Haus) = 0.24  t(house|Haus) = 0.76

Training

Recall IBM Model 2:
- t(e|f) from Model 1
- alignment probability distribution a(i|j,l_e,l_f)
  - l_e, l_f = lengths of the English and foreign sentences (resp.)
  - i, j = index into the foreign and English sentences (resp.)

Assume training pair: the house / das Haus. It is equivalent to the pre-aligned pairs (e and f in 1-to-1 correspondence, ⇕):
1. the house / das Haus
2. the house / Haus das
3. the house n0ne / das das Haus
4. the house n0ne / Haus Haus das

Each pre-aligned pair carries an alignment line of the form "i j i j ...", one i j pair per aligned token (j = 0 if n0ne).

Training

Model 2 training datum (pre-aligned, ⇕):
the house n0ne
das das Haus

See training datum #2: house / das, i.e. English position j = 2, foreign position i = 1.

Initially, set a(1|2,2,2) = 0.50 and a(2|2,2,2) = 0.50 (uniform distribution, and likewise for j = 1).

Compute a weighted sum of the number of times we see a datum. With weight k = t(house|das) * a(1|2,2,2):
c(house|das) += k   c(das) += k
c(1|2,2,2) += k     c(2,2,2) += k

Normalize:
new t(house|das) = c(house|das)/c(das)
new a(1|2,2,2) = c(1|2,2,2)/c(2,2,2)
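For reference, a sketch of one EM step over both tables in the textbook IBM Model 2 formulation (cf. Koehn); ibm2pre.py works on pre-aligned pairs, so its counts are a simplified special case of this:

from collections import defaultdict

def model2_em_step(pairs, t, a):
    """One EM step updating t(e|f) and a(i|j,le,lf).
    t is keyed by (e, f); a is keyed by (i, j, le, lf)."""
    c_t, tot_t = defaultdict(float), defaultdict(float)
    c_a, tot_a = defaultdict(float), defaultdict(float)
    for e_sent, f_sent in pairs:
        le, lf = len(e_sent), len(f_sent)
        for j, e in enumerate(e_sent, start=1):
            z = sum(t[(e, f_sent[i - 1])] * a[(i, j, le, lf)]
                    for i in range(1, lf + 1))
            for i in range(1, lf + 1):
                f = f_sent[i - 1]
                w = t[(e, f)] * a[(i, j, le, lf)] / z   # posterior weight
                c_t[(e, f)] += w
                tot_t[f] += w
                c_a[(i, j, le, lf)] += w
                tot_a[(j, le, lf)] += w
    new_t = {(e, f): c / tot_t[f] for (e, f), c in c_t.items()}
    new_a = {(i, j, le, lf): c / tot_a[(j, le, lf)]
             for (i, j, le, lf), c in c_a.items()}
    return new_t, new_a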

2x2 training data with n0ne (ibm2x2nonealign.txt)

Expansion of: the house / das Haus; the book / das Buch; a book / ein Buch.
Assume fertility 0, 1, 2. Assume n0ne. (Alignment lines not shown.)

1. the house / das Haus
2. the house / Haus das
3. the house n0ne / das das Haus
4. the house n0ne / Haus Haus das
5. the book / das Buch
6. the book / Buch das
7. the book n0ne / das das Buch
8. the book n0ne / Buch Buch das
9. a book / ein Buch
10. a book / Buch ein
11. a book n0ne / ein ein Buch
12. a book n0ne / Buch Buch ein

3 pairs yield 12 pre-aligned pairs.

Probabilities

IBM Models 1 and 2: recall the cascading architecture. Unaligned sentence pairs (e.g. the house / das Haus) are expanded into aligned sentence pairs (e.g. the house n0ne / das das Haus); Model 1 computes t(e|f); Model 2 takes t(e|f) and computes t(e|f) and a(i|j,l_e,l_f).

2x2 training data with n0ne

Stage 1: python3 ibm1pre.py 0.1 ibm2x2nonealign.txt -o ibm2x2nonealign_model1.txt

Iteration threshold: 0.1
Training data: ibm2x2nonealign.txt
t(e|f) output: ibm2x2nonealign_model1.txt
Number of pairs read: 12
Iteration 2
t(none|das) = 0.18   t(the|das) = 0.51   t(book|das) = 0.16   t(house|das) = 0.16
t(none|ein) = 0.15   t(a|ein) = 0.48     t(book|ein) = 0.37
t(none|Buch) = 0.18  t(the|Buch) = 0.16  t(a|Buch) = 0.16     t(book|Buch) = 0.51
t(none|Haus) = 0.15  t(the|Haus) = 0.37  t(house|Haus) = 0.48

Stage 2: python3 ibm2pre.py 0.01 ibm2x2nonealign.txt ibm2x2nonealign_model1.txt

Read training data from file: ibm2x2nonealign.txt, pairs: 12
Read IBM model 1 from file: ibm2x2nonealign_model1.txt
Iteration 5
t(the|das) = 1.00   t(a|ein) = 1.00   t(book|Buch) = 1.00   t(house|Haus) = 1.00
a(1|0,2,2) = 0.50   a(2|0,2,2) = 0.50
a(1|1,2,2) = 1.00   a(2|2,2,2) = 1.00

i.e. foreign -> English: 1 -> 1, 2 -> 2
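The "foreign -> English" readout above is just the argmax of the learned alignment table. A small sketch, assuming the a table is a dict keyed by (i, j, le, lf) as in the Model 2 sketch earlier:

def best_alignment(a, le, lf):
    """For each English position j, pick the foreign position i that
    maximizes a(i|j,le,lf); returns a dict {j: i}."""
    return {j: max(range(1, lf + 1),
                   key=lambda i: a.get((i, j, le, lf), 0.0))
            for j in range(1, le + 1)}

# e.g. with a(1|1,2,2) = 1.00 and a(2|2,2,2) = 1.00,
# best_alignment(a, 2, 2) == {1: 1, 2: 2}: English 1 maps to foreign 1
# and English 2 to foreign 2, i.e. foreign -> English 1 -> 1, 2 -> 2.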

2x2 training data (ibm2x2align.txt)

Same dataset: the house / das Haus; the book / das Buch; a book / ein Buch.
Assume fertility 1. Assume no n0ne.

1. the house / das Haus
2. the house / Haus das
3. the book / das Buch
4. the book / Buch das
5. a book / ein Buch
6. a book / Buch ein

3 pairs yield 6 pre-aligned pairs.

2x2 training data

Stage 1: python3 ibm1pre.py 0.1 ibm2x2align.txt -o ibm2x2align_model1.txt

Iteration threshold: 0.1
Training data: ibm2x2align.txt
t(e|f) output: ibm2x2align_model1.txt
Number of pairs read: 6
Iteration 4
t(the|das) = 0.83   t(book|das) = 0.08   t(house|das) = 0.09
t(a|ein) = 0.72     t(book|ein) = 0.28
t(the|Buch) = 0.08  t(a|Buch) = 0.09     t(book|Buch) = 0.83
t(the|Haus) = 0.28  t(house|Haus) = 0.72

Stage 2: python3 ibm2pre.py 0.1 ibm2x2align.txt ibm2x2align_model1.txt

Iteration threshold: 0.1
Read training data from file: ibm2x2align.txt, pairs: 6
Read IBM model 1 from file: ibm2x2align_model1.txt
Iteration 1
t(the|das) = 1.00   t(a|ein) = 1.00   t(book|Buch) = 1.00   t(house|Haus) = 1.00
a(1|1,2,2) = 1.00   a(2|2,2,2) = 1.00

i.e. foreign -> English: 1 -> 1, 2 -> 2

2x2 reverse training data (ibm2x2ralign.txt)

Word order reversed: the house / Haus das; the book / Buch das; a book / Buch ein.
Assume fertility 1. Assume no n0ne.

1. the house / Haus das
2. the house / das Haus
3. the book / Buch das
4. the book / das Buch
5. a book / Buch ein
6. a book / ein Buch

3 pairs yield 6 pre-aligned pairs.

2x2 reverse training data

Stage 1: python3 ibm1pre.py 0.1 ibm2x2ralign.txt -o ibm2x2ralign_model1.txt

Iteration threshold: 0.1
Training data: ibm2x2ralign.txt
t(e|f) output: ibm2x2ralign_model1.txt
Number of pairs read: 6
Iteration 4
t(the|das) = 0.83   t(book|das) = 0.08   t(house|das) = 0.09
t(a|ein) = 0.72     t(book|ein) = 0.28
t(the|Buch) = 0.08  t(a|Buch) = 0.09     t(book|Buch) = 0.83
t(the|Haus) = 0.28  t(house|Haus) = 0.72

Stage 2: python3 ibm2pre.py 0.01 ibm2x2ralign.txt ibm2x2ralign_model1.txt

Iteration threshold: 0.01
Read training data from file: ibm2x2ralign.txt, pairs: 6
Read IBM model 1 from file: ibm2x2ralign_model1.txt
Iteration 4
t(the|das) = 1.00   t(a|ein) = 1.00   t(book|Buch) = 1.00   t(house|Haus) = 1.00
a(2|1,2,2) = 1.00   a(1|2,2,2) = 1.00

i.e. foreign -> English: 2 -> 1, 1 -> 2

Summary: 2x2 training data, Model 2

Forward, no n0ne (Iteration threshold: 0.1, Iteration 1):
t(the|das) = 1.00   t(a|ein) = 1.00   t(book|Buch) = 1.00   t(house|Haus) = 1.00
a(1|1,2,2) = 1.00   a(2|2,2,2) = 1.00

Forward, with n0ne (Iteration threshold: 0.01, Iteration 5):
t(the|das) = 1.00   t(a|ein) = 1.00   t(book|Buch) = 1.00   t(house|Haus) = 1.00
a(1|0,2,2) = 0.50   a(2|0,2,2) = 0.50   a(1|1,2,2) = 1.00   a(2|2,2,2) = 1.00

Reverse, no n0ne (Iteration threshold: 0.01, Iteration 4):
t(the|das) = 1.00   t(a|ein) = 1.00   t(book|Buch) = 1.00   t(house|Haus) = 1.00
a(2|1,2,2) = 1.00   a(1|2,2,2) = 1.00

Reverse, with n0ne (Iteration threshold: 0.01, Iteration 5):
t(the|das) = 1.00   t(a|ein) = 1.00   t(book|Buch) = 1.00   t(house|Haus) = 1.00
a(1|0,2,2) = 0.50   a(2|0,2,2) = 0.50   a(2|1,2,2) = 1.00   a(1|2,2,2) = 1.00

2x1 training data with n0ne (ibm2x1nonealign.txt)

From: small / ja klein; small / klein; big / ja groß; big / groß.
Assume n0ne. Assume fertility 0 and 1.

1. small n0ne / klein ja
2. small n0ne / ja klein
3. small / klein
4. big n0ne / groß ja
5. big n0ne / ja groß
6. big / groß

4 pairs yield 6 pre-aligned pairs.

2x1 training data with n0ne

Stage 1: python3 ibm1pre.py 0.1 ibm2x1nonealign.txt -o ibm2x1nonealign_model1.txt

Iteration threshold: 0.1
Training data: ibm2x1nonealign.txt
t(e|f) output: ibm2x1nonealign_model1.txt
Number of pairs read: 6
Iteration 4
t(none|ja) = 0.92    t(small|ja) = 0.04   t(big|ja) = 0.04
t(none|klein) = 0.05   t(small|klein) = 0.95
t(none|groß) = 0.05    t(big|groß) = 0.95

Stage 2: python3 ibm2pre.py 0.01 ibm2x1nonealign.txt ibm2x1nonealign_model1.txt

Iteration threshold: 0.01
Read training data from file: ibm2x1nonealign.txt, pairs: 6
Read IBM model 1 from file: ibm2x1nonealign_model1.txt
Iteration 3
t(none|ja) = 1.00   t(small|klein) = 1.00   t(big|groß) = 1.00
a(1|0,1,2) = 1.00   a(2|1,1,2) = 1.00   a(1|1,1,1) = 1.00

i.e. foreign -> English: 2 -> 1, 1 -> 0 (when l_f = 2), and 1 -> 1 (when l_f = 1)

2x1 training data (ibm2x1align.txt)

From: small / ja klein; small / klein; big / ja groß; big / groß.
Assume fertility 1. Assume no n0ne.

1. small / ja klein
2. small / klein
3. big / ja groß
4. big / groß

4 pairs yield 4 pre-aligned pairs.

2x1 training data

Stage 1: python3 ibm1pre.py 0.1 ibm2x1align.txt -o ibm2x1align_model1.txt

Iteration threshold: 0.1
Training data: ibm2x1align.txt
t(e|f) output: ibm2x1align_model1.txt
Number of pairs read: 6
Iteration 2
t(small|ja) = 0.50   t(big|ja) = 0.50
t(small|klein) = 1.00   t(big|groß) = 1.00

Stage 2: python3 ibm2pre.py 0.01 ibm2x1align.txt ibm2x1align_model1.txt

Iteration threshold: 0.01
Read training data from file: ibm2x1align.txt, pairs: 6
Read IBM model 1 from file: ibm2x1align_model1.txt
Iteration 1
t(small|ja) = 0.50   t(big|ja) = 0.50
t(small|klein) = 1.00   t(big|groß) = 1.00
a(1|1,1,2) = 0.33   a(2|1,1,2) = 0.67   a(1|1,1,1) = 1.00

i.e. foreign -> English: 1 -> 1 (33%) and 2 -> 1 (67%) when l_e = 1 and l_f = 2; 1 -> 1 when l_e = 1 and l_f = 1

Summary: 2x1 training data, Model 2

No n0ne (Iteration threshold: 0.01, Iteration 1):
t(small|ja) = 0.50   t(big|ja) = 0.50   t(small|klein) = 1.00   t(big|groß) = 1.00
a(1|1,1,2) = 0.33   a(2|1,1,2) = 0.67   a(1|1,1,1) = 1.00

With n0ne (Iteration threshold: 0.1, Iteration 3):
t(none|ja) = 1.00   t(small|klein) = 1.00   t(big|groß) = 1.00
a(1|0,1,2) = 1.00   a(2|1,1,2) = 1.00   a(1|1,1,1) = 1.00

Training IBM Model "3a": – t(e|f) from Model 1 – a(i|j,l e,l f ) from Model 2 – fertility probability distribution: n(ɸ|f) ɸ = fertility (0,1,2..) f = foreign word – no sampling: "hill climbing" Assume training pair with alignment: 1.the house n0ne ⇕ ⇕ ⇕ 1.das das Haus (e and f in ( ⇕ ) 1-to-1 correspondence) Infer fertility mapping: 1 -> 2 2 -> 0 t( n0ne |Haus)*a(2|0,2,2) *n(0|Haus) = k c( n0ne |Ha us) +k c(Haus) +k c(2|0,2,2) +k c(0,2,2) +k c(0|Haus)c(Haus) +k nn a at t

Training

Model 3 training datum (pre-aligned, ⇕):
the house n0ne
das das Haus

Initially, set n(0|Haus) = 0.33, n(1|Haus) = 0.33, n(2|Haus) = 0.33 (uniform distribution).

With weight k = t(n0ne|Haus) * a(2|0,2,2) * n(0|Haus), accumulate the counts as above, then normalize:
new t(n0ne|Haus) = c_t(n0ne|Haus)/c_t(Haus)
new a(2|0,2,2) = c_a(2|0,2,2)/c_a(0,2,2)
new n(0|Haus) = c_n(0|Haus)/c_n(Haus)
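Since the pairs are pre-aligned, each foreign word's fertility can be read straight off the 1-to-1 correspondence. A sketch of the count-and-normalize step with uniform pair weights (the real update weights each pair by k = t * a * n as above):

from collections import defaultdict

def fertility_table(aligned_pairs, weight=1.0):
    """Estimate n(phi|f) from pre-aligned (e_tokens, f_tokens) pairs.
    A foreign token's fertility is the number of non-n0ne English
    tokens it is paired with; n0ne marks a fertility-0 use."""
    c_n, tot_n = defaultdict(float), defaultdict(float)
    for e_sent, f_sent in aligned_pairs:
        phi = defaultdict(int)
        for e, f in zip(e_sent, f_sent):
            phi[f] += 0 if e == 'n0ne' else 1   # += 0 still registers f
        for f in set(f_sent):
            c_n[(phi[f], f)] += weight
            tot_n[f] += weight
    return {(p, f): c / tot_n[f] for (p, f), c in c_n.items()}

# e.g. the pair ('the house n0ne'.split(), 'das das Haus'.split())
# contributes counts for n(2|das) and n(0|Haus)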

Probabilities

IBM Models 1, 2 and 3a: recall the cascading architecture. Aligned sentence pairs feed Model 1, which computes t(e|f); Model 2 adds a(i|j,l_e,l_f); Model 3a adds n(ɸ|f).

2x1 training data with n0ne (ibm2x1nonealign.txt)

From the original dataset: small / ja klein; small / klein; big / ja groß; big / groß.
Assume fertility 0 and 1. Assume n0ne.

Initially:
n(0|ja) = 0.50     n(1|ja) = 0.50
n(0|klein) = 0.50  n(1|klein) = 0.50
n(0|groß) = 0.50   n(1|groß) = 0.50

2x1 training data with n0ne

Stage 2: python3 ibm2pre.py 0.1 ibm2x1nonealign.txt ibm2x1nonealign_model1.txt -o ibm2x1nonealign_model2.txt

Iteration threshold: 0.1
Read training data from file: ibm2x1nonealign.txt, pairs: 6
Read IBM model 1 from file: ibm2x1nonealign_model1.txt
Iteration 1
t(none|ja) = 0.96    t(small|ja) = 0.02   t(big|ja) = 0.02
t(none|klein) = 0.02   t(small|klein) = 0.98
t(none|groß) = 0.02    t(big|groß) = 0.98
a(1|0,1,2) = 0.95   a(2|0,1,2) = 0.05
a(1|1,1,2) = 0.04   a(2|1,1,2) = 0.96
a(1|1,1,1) = 1.00

Stage 3: python3 ibm3apre.py 0.1 ibm2x1nonealign.txt ibm2x1nonealign_model2.txt

Read training data from ibm2x1nonealign.txt, pairs: 6, none: True, max fertility: 1
Read IBM model 2 from ibm2x1nonealign_model2.txt
Iteration 1
t(none|ja) = 1.00   t(small|klein) = 1.00   t(big|groß) = 1.00
a(1|0,1,2) = 1.00   a(2|1,1,2) = 1.00   a(1|1,1,1) = 1.00
n(0|ja) = 1.00   n(1|klein) = 1.00   n(1|groß) = 1.00

Fertility converges nicely!

2x1 training data (ibm2x1align.txt)

From the original dataset: small / ja klein; small / klein; big / ja groß; big / groß.
Assume fertility 1. Assume no n0ne.

1. small / ja klein
2. small / klein ja
3. small / klein
4. big / ja groß
5. big / groß ja
6. big / groß

2x1 training data

Stage 2: python3 ibm2pre.py 0.1 ibm2x1align.txt ibm2x1align_model1.txt -o ibm2x1align_model2.txt

Iteration threshold: 0.1
Read training data from file: ibm2x1align.txt, pairs: 6
Read IBM model 1 from file: ibm2x1align_model1.txt
Iteration 1
t(small|ja) = 0.50   t(big|ja) = 0.50
t(small|klein) = 1.00   t(big|groß) = 1.00
a(1|1,1,2) = 0.33   a(2|1,1,2) = 0.67   a(1|1,1,1) = 1.00

Stage 3: python3 ibm3apre.py 0.1 ibm2x1align.txt ibm2x1align_model2.txt

Read training data from ibm2x1align.txt, pairs: 6, none: False, max fertility: 1
Read IBM model 2 from ibm2x1align_model2.txt
Iteration 1
t(small|ja) = 0.50   t(big|ja) = 0.50
t(small|klein) = 1.00   t(big|groß) = 1.00
a(1|1,1,2) = 0.20   a(2|1,1,2) = 0.80   a(1|1,1,1) = 1.00
n(1|ja) = 1.00   n(1|klein) = 1.00   n(1|groß) = 1.00

Only fertility 1 is possible given that n0ne is not available.

2x2 training data (ibm2x2.txt)

Data: the house / das Haus; the book / das Buch; a book / ein Buch.

Files:
- no n0ne, max fertility 2: ibm2x2align.txt
  Model 1: ibm2x2align_model1.txt
  Model 2: ibm2x2align_model2.txt
- with n0ne, max fertility 2: ibm2x2nonealign.txt
  Model 1: ibm2x2nonealign_model1.txt
  Model 2: ibm2x2nonealign_model2.txt

2x2 training data

Stage 1: python3 ibm1pre.py 0.1 ibm2x2align.txt -o ibm2x2align_model1.txt

Iteration threshold: 0.1
Training data: ibm2x2align.txt
t(e|f) output: ibm2x2align_model1.txt
Number of pairs read: 6
Iteration 4
t(the|das) = 0.83   t(book|das) = 0.08   t(house|das) = 0.09
t(a|ein) = 0.72     t(book|ein) = 0.28
t(the|Buch) = 0.08  t(a|Buch) = 0.09     t(book|Buch) = 0.83
t(the|Haus) = 0.28  t(house|Haus) = 0.72

Stage 2: python3 ibm2pre.py 0.1 ibm2x2align.txt ibm2x2align_model1.txt

Iteration threshold: 0.1
Read training data from file: ibm2x2align.txt, pairs: 6
Read IBM model 1 from file: ibm2x2align_model1.txt
Iteration 1
t(the|das) = 1.00   t(a|ein) = 1.00   t(book|Buch) = 1.00   t(house|Haus) = 1.00
a(1|1,2,2) = 1.00   a(2|2,2,2) = 1.00

Stage 3: python3 ibm3apre.py 0.1 ibm2x2align.txt ibm2x2align_model2.txt

Read training data from ibm2x2align.txt, pairs: 6, none: False, max fertility: 1
Read IBM model 2 from ibm2x2align_model2.txt
Iteration 1
t(the|das) = 1.00   t(a|ein) = 1.00   t(book|Buch) = 1.00   t(house|Haus) = 1.00
a(1|1,2,2) = 1.00   a(2|2,2,2) = 1.00
n(1|das) = 1.00   n(1|ein) = 1.00   n(1|Buch) = 1.00   n(1|Haus) = 1.00

2x2 training data with n0ne

Stage 2: python3 ibm2pre.py 0.1 ibm2x2nonealign.txt ibm2x2nonealign_model1.txt

Iteration threshold: 0.1
Read training data from file: ibm2x2nonealign.txt, pairs: 12
Read IBM model 1 from file: ibm2x2nonealign_model1.txt
Iteration 4
t(the|das) = 1.00   t(a|ein) = 0.99   t(book|Buch) = 1.00   t(house|Haus) = 0.99
a(1|0,2,2) = 0.50   a(2|0,2,2) = 0.50
a(1|1,2,2) = 1.00   a(2|2,2,2) = 1.00

Stage 3: python3 ibm3apre.py 0.1 ibm2x2nonealign.txt ibm2x2nonealign_model2.txt

Initially, fertility:
n(0|das) = 0.33   n(1|das) = 0.33   n(2|das) = 0.33
n(0|ein) = 0.33   n(1|ein) = 0.33   n(2|ein) = 0.33
n(0|Buch) = 0.33  n(1|Buch) = 0.33  n(2|Buch) = 0.33
n(0|Haus) = 0.33  n(1|Haus) = 0.33  n(2|Haus) = 0.33

Read training data from ibm2x2nonealign.txt, pairs: 12, none: True, max fertility: 2
Read IBM model 2 from ibm2x2nonealign_model2.txt
Iteration 1
t(the|das) = 1.00   t(a|ein) = 1.00   t(book|Buch) = 1.00   t(house|Haus) = 1.00
a(1|0,2,2) = 0.50   a(2|0,2,2) = 0.50
a(1|1,2,2) = 1.00   a(2|2,2,2) = 1.00
n(1|das) = 0.50   n(2|das) = 0.50
n(1|ein) = 0.50   n(2|ein) = 0.50
n(1|Buch) = 0.50  n(2|Buch) = 0.50
n(1|Haus) = 0.50  n(2|Haus) = 0.50

Fertility doesn't converge!