Presentation transcript:

12/07/1999 JHU CS 600.465/Jan Hajic 1 *Introduction to Natural Language Processing (600.465) Statistical Machine Translation Dr. Jan Hajič, CS Dept., Johns Hopkins Univ.

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 2 The Main Idea Treat translation as a noisy channel problem: Input (Source) → "Noisy" Output (Target). The channel (adds "noise") turns E: English words... into F: Les mots Anglais... The Model: P(E|F) = P(F|E) P(E) / P(F). Interested in rediscovering E given F; after the usual simplification (P(F) fixed): argmax_E P(E|F) = argmax_E P(F|E) P(E)!
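As a toy illustration of this argmax (not from the slides; the candidate sentences and probabilities below are invented), a decoder just scores each candidate English sentence by P(F|E)·P(E) and drops the constant P(F):

```python
# Noisy-channel scoring sketch: choose argmax_E P(F|E) * P(E).
# All candidates and probabilities are invented for illustration only.
candidates = {
    "the program has been implemented": {"p_e": 1e-6, "p_f_given_e": 1e-3},
    "the program was implemented":      {"p_e": 2e-6, "p_f_given_e": 4e-4},
    "program implemented the":          {"p_e": 1e-9, "p_f_given_e": 2e-3},
}

best_e = max(candidates,
             key=lambda e: candidates[e]["p_f_given_e"] * candidates[e]["p_e"])
print(best_e)   # P(F) is the same for every candidate, so it can be ignored
```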

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 3 The Necessities Language Model (LM): P(E) Translation Model (TM): target given source, P(F|E) Search procedure –Given F, find the best E using the LM and TM distributions. Usual problem: sparse data –We cannot create a "sentence dictionary" E ↔ F –Typically, we do not see a sentence even twice!

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 4 The Language Model Any LM will do: –3-gram LM –3-gram class-based LM –decision tree LM with hierarchical classes Does not necessarily operate on word forms: –cf. later the "analysis" and "generation" procedures –for simplicity, imagine for now that it does operate on word forms

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 5 The Translation Models Do not care about correct strings of English words (that's the task of the LM) Therefore, we can make more independence assumptions: –for a start, use the "tagging" approach: 1 English word ("tag") ~ 1 French word ("word") –not realistic: the number of words is rarely even the same in both sentences (let alone a 1:1 correspondence!) → use "Alignment".

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 6 The Alignment e0: And the program has been implemented; f0: Le programme a été mis en application. Linear notation: f0(1) Le(2) programme(3) a(4) été(5) mis(6) en(6) application(6); e0 And(0) the(1) program(2) has(3) been(4) implemented(5,6,7)

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 7 Alignment Mapping In general: –|F| = m, |E| = l (sentence lengths): l·m possible connections (each French word to any English word), 2^(lm) different alignments for any pair (E,F) (any subset of connections) In practice: –From English to French: each English word has 1-n connections (n = empirical maximum fertility?); each French word has exactly 1 connection –therefore, "only" (l+1)^m alignments (<< 2^(lm)) a_j = i (the link from the j-th French word goes to the i-th English word)
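To make the counting concrete, here is a small sketch (my own illustration, reusing the alignment from the previous slide) that stores the links as the vector a_j and evaluates both counts:

```python
# English sentence with the NULL word e_0, and the French sentence (previous slide).
e = ["NULL", "And", "the", "program", "has", "been", "implemented"]   # l = 6
f = ["Le", "programme", "a", "été", "mis", "en", "application"]       # m = 7

# a[j-1] = i means the j-th French word is linked to the i-th English word.
a = [2, 3, 4, 5, 6, 6, 6]

l, m = len(e) - 1, len(f)
print((l + 1) ** m)   # 823543 alignments when each French word has exactly one link
print(2 ** (l * m))   # ~4.4e12 alignments if any subset of the l*m connections were allowed
```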

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 8 Elements of Translation Model(s) Basic distribution: P(F,A,E) - the joint distribution of the English sentence, the Alignment, and the French sentence (of length m) Interested also in marginal distributions: P(F,E) = Σ_A P(F,A,E); P(F|E) = P(F,E) / P(E) = Σ_A P(F,A,E) / Σ_{A,F} P(F,A,E) = Σ_A P(F,A|E) Useful decomposition [one of the possible decompositions]: P(F,A|E) = P(m|E) ∏_{j=1..m} P(a_j | a_1^(j-1), f_1^(j-1), m, E) P(f_j | a_1^j, f_1^(j-1), m, E)

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 9 Decomposition Decomposition formula again: P(F,A|E) = P(m|E) ∏_{j=1..m} P(a_j | a_1^(j-1), f_1^(j-1), m, E) P(f_j | a_1^j, f_1^(j-1), m, E) m - length of the French sentence a_j - the alignment (single connection) going from the j-th French word f_j - the j-th French word from F a_1^(j-1) - sequence of alignments a_i up to the word preceding f_j a_1^j - sequence of alignments a_i up to and including the word f_j f_1^(j-1) - sequence of French words up to the word preceding f_j

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 10 Decomposition and the Generative Model...and again: P(F,A|E) = P(m|E) ∏_{j=1..m} P(a_j | a_1^(j-1), f_1^(j-1), m, E) P(f_j | a_1^j, f_1^(j-1), m, E) Generate: –first, the length of the French sentence given the English words E; –then, the link from the first position in F (not knowing the actual word yet) → now we know the English word –then, given the link (and thus the English word), generate the French word at the current position –then, move to the next position in F, until all m positions are filled.
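A toy sampler following this generative story; the three distributions below are invented stand-ins (not the slides' parameters), shown only to make the order of the steps explicit:

```python
import random

random.seed(0)
e = ["NULL", "the", "program"]            # English sentence with e_0 = NULL (l = 2)

def sample_length(e):                     # stand-in for P(m | E)
    return random.choice([2, 3])

def sample_link(j, links_so_far, e, m):   # stand-in for P(a_j | a_1^(j-1), ..., m, E)
    return random.randrange(len(e))       # here simply uniform over 0..l

def sample_word(i, e):                    # stand-in for P(f_j | a_1^j, ..., m, E)
    table = {"NULL": ["de"], "the": ["le", "la"], "program": ["programme"]}
    return random.choice(table[e[i]])

m = sample_length(e)                      # 1) length of the French sentence
a, f = [], []
for j in range(1, m + 1):
    a_j = sample_link(j, a, e, m)         # 2) link for position j -> English word known
    f_j = sample_word(a_j, e)             # 3) French word generated from that English word
    a.append(a_j)
    f.append(f_j)
print(f, a)
```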

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 11 Approximations Still too many parameters –similar situation as in an n-gram model with "unlimited" n –impossible to estimate reliably. Use 5 models, from the simplest to the most complex (i.e. from heavy independence assumptions to light) Parameter estimation: estimate the parameters of Model 1; use them as an initial estimate for estimating Model 2 parameters; etc.

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 12 Model 1 Approximations: –French length P(m|E) is constant (a small ε) –Alignment link distribution P(a_j | a_1^(j-1), f_1^(j-1), m, E) depends on the English length l only (= 1/(l+1)) –French word distribution depends only on the English and French words connected by the link a_j. → Model 1 distribution: P(F,A|E) = ε / (l+1)^m ∏_{j=1..m} p(f_j | e_{a_j})
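Written out as code, the Model 1 formula is a one-liner; this sketch assumes a translation table t[(f, e)] ~ p(f|e) and a small ε supplied by the caller (the values below are invented):

```python
from math import prod

def model1_prob(f, e, a, t, epsilon=0.1):
    """P(F,A|E) = epsilon / (l+1)^m * prod_j p(f_j | e_{a_j}) for IBM Model 1.
    e[0] is the NULL word; a[j-1] is the English position linked to f_j."""
    l, m = len(e) - 1, len(f)
    return epsilon / (l + 1) ** m * prod(t[(fj, e[aj])] for fj, aj in zip(f, a))

# Tiny invented table, just to make the call runnable.
t = {("le", "the"): 0.7, ("programme", "program"): 0.8}
print(model1_prob(["le", "programme"], ["NULL", "the", "program"], [1, 2], t))
```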

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 13 Models 2-5 Model 2 –adds more detail into P(a_j |...): more "vertical" links preferred Model 3 –adds "fertility" (the number of links for a given English word is explicitly modeled: P(n|e_i)) –"distortion" replaces the alignment probabilities from Model 2 Model 4 –the notion of "distortion" extended to chunks of words Model 5 –is Model 4, but not deficient (does not waste probability mass on non-strings)

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 14 The Search Procedure "Decoder": –given the "output" (French), discover the "input" (English) The translation model goes in the opposite direction: p(f|e) =... Naive methods do not work. Possible solution (roughly): –generate English words one-by-one, keeping only an n-best (variable n) list; also, account for the different lengths of the English sentence candidates!

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 15 Analysis - Translation - Generation (ATG) Word forms: too sparse Use four basic analysis/generation steps: –tagging –lemmatization –word-sense disambiguation –noun-phrase "chunks" (non-compositional translations) Translation proper: –use chunks as "words"

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 16 Training vs. Test with ATG Training: –analyze both languages using all four analysis steps –train TM(s) on the result (i.e. on chunks, tags, etc.) –train LM on analyzed source (English) Runtime/Test: –analyze given language sentence (French) using identical tools as in training –translate using the trained Translation/Language model(s) –generate source (English), reversing the analysis process

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 17 Analysis: Tagging and Morphology Replace word forms by morphologically processed text: –lemmas –tags original approach: mix them into the text, call them "words" e.g. She bought two books. → she buy VBP two book NNS. Tagging: yes –but in reversed order: tag first, then lemmatize [NB: does not work for inflective languages] technically easy Hand-written deterministic rules for tag+form → lemma
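A minimal sketch of the "mix lemmas and tags into the text" step, using the slide's own example sentence; the (form, tag, lemma) triples and the set of tags kept as extra "words" are assumptions for illustration:

```python
# (form, tag, lemma) triples, as a tagger plus hand-written lemmatization rules might produce.
analyzed = [("She", "PRP", "she"), ("bought", "VBD", "buy"), ("two", "CD", "two"),
            ("books", "NNS", "book"), (".", ".", ".")]   # the slide shows VBP for "bought"

KEEP_TAGS = {"VBD", "VBP", "NNS"}   # only some tags are exposed as extra "words"

mixed = []
for form, tag, lemma in analyzed:
    mixed.append(lemma)
    if tag in KEEP_TAGS:
        mixed.append(tag)
print(" ".join(mixed))   # -> "she buy VBD two book NNS ."
```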

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 18 Word Sense Disambiguation, Word Chunking Sets of senses for each E, F word: –e.g. book-1, book-2,..., book-n –prepositions (de-1, de-2, de-3,...), many others Senses derived automatically using the TM –translation probabilities measured on senses: p(de-3|from-5) Result: –statistical model for assigning senses monolingually based on context (also MaxEnt model used here for each word) Chunks: group words for non-compositional translation

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 19 Generation Inverse of analysis Much simpler: –Chunks → words (lemmas) with senses (trivial) –Words (lemmas) with senses → words (lemmas) (trivial) –Words (lemmas) + tags → word forms Additional step: –Source-language ambiguity: electric vs. electrical, hath vs. has, you vs. thou: treated as a single unit in translation proper, but must be disambiguated at the end of the generation phase, using an additional pure LM on word forms.

12/07/1999 JHU CS 600.465/Jan Hajic 20 *Introduction to Natural Language Processing (600.465) Statistical Translation: Alignment and Parameter Estimation Dr. Jan Hajič, CS Dept., Johns Hopkins Univ.

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 21 Alignment Available corpus assumed: –parallel text (translation E ↔ F) No alignment present (day marks only)! Sentence alignment –sentence detection –sentence alignment Word alignment –tokenization –word alignment (with restrictions)

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 22 Sentence Boundary Detection Rules, lists: –Sentence breaks: paragraphs (if marked); certain characters: ?, !, ; (...almost sure) The Problem: the period. –could be the end of a sentence (... left yesterday. He was heading to...) –decimal point: 3.6 (three-point-six) –thousand-segment separator: 3.200 (three-thousand-two-hundred) –abbreviations never at the end of a sentence: cf., e.g., Calif., Mt., Mr. –ellipsis:... –other languages: ordinal number indication (2nd ~ 2.) –initials: A. B. Smith Statistical methods: e.g., Maximum Entropy
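A tiny rule-based sketch of the period problem (the regular expressions and the abbreviation list are illustrative assumptions; a real detector, rule-based or Maximum Entropy, needs much more):

```python
import re

ABBREV = {"cf.", "Calif.", "Mt.", "Mr."}   # never sentence-final (per the slide's list)

def candidate_breaks(text):
    """Yield offsets just after likely sentence ends: ?, !, ; are (almost) sure;
    a period counts only if it is not a number separator or an abbreviation.
    (Simplistic: multi-period abbreviations like "e.g." would need extra handling.)"""
    for m in re.finditer(r"[?!;.]", text):
        i = m.start()
        if text[i] != ".":
            yield i + 1
            continue
        tokens = text[:i].split()
        last = (tokens[-1] + ".") if tokens else "."
        if last in ABBREV:
            continue                                    # e.g. "Mr." before "Smith"
        if re.search(r"\d\.\d", text[max(0, i - 1):i + 2]):
            continue                                    # 3.6 or 3.200-style numbers
        yield i + 1

print(list(candidate_breaks("He left yesterday. Mr. Smith paid 3.6 dollars!")))  # [18, 46]
```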

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 23 Sentence Alignment The Problem: only sentence boundaries are detected so far [figure: the E and F sentence sequences]. Desired output: a segmentation with an equal number of segments in E and F, spanning the whole text continuously, with the original sentence boundaries kept [figure: the aligned segments]. Alignments obtained in the example: 2-1, 1-1, 1-1, 2-2, 2-1, 0-1. New segments are called "sentences" from now on.

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 24 Alignment Methods Several methods (probabilistic and not) –character-length based –word-length based –"cognates" (word identity used): using an existing dictionary (F: prendre ~ E: make, take); using word "distance" (similarity): names, numbers, borrowed words, Latin-origin words,... Best performing: –statistical, word- or character-length based (perhaps with some word cues)

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 25 Length-based Alignment First, define the problem probabilistically: argmax_A P(A|E,F) = argmax_A P(A,E,F) (E,F fixed) Define a "bead": a group of consecutive E and F sentences aligned together [figure: a 2:2 bead in the E and F sequences]. Approximate: P(A,E,F) ≈ ∏_{i=1..n} P(B_i), where B_i is a bead; P(B_i) does not depend on the rest of E,F.

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 26 The Alignment Task Given the model definition, P(A,E,F) ≈ ∏_{i=1..n} P(B_i), find the partitioning of (E,F) into n beads B_{i=1..n} that maximizes P(A,E,F) over the training data. Define B_i = p:q_i, where p:q ∈ {0:1, 1:0, 1:1, 1:2, 2:1, 2:2} –describes the type of alignment Want to use some sort of dynamic programming: Define Pref(i,j)... the probability of the best alignment from the start of the (E,F) data (1,1) up to (i,j)

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 27 Recursive Definition Initialize: Pref(0,0) = 1. Pref(i,j) = max( Pref(i,j-1) P(0:1_k), Pref(i-1,j) P(1:0_k), Pref(i-1,j-1) P(1:1_k), Pref(i-1,j-2) P(1:2_k), Pref(i-2,j-1) P(2:1_k), Pref(i-2,j-2) P(2:2_k) ) This is enough for a Viterbi-like search. [figure: the (i,j) grid with the six incoming bead transitions listed above]
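A compact sketch of this dynamic program over sentence lengths (my own transcription; bead_prob(p, q, le, lf) stands for P(p:q_k) and can be the length-based probability defined on the next slide):

```python
def align_sentences(E_lens, F_lens, bead_prob):
    """Viterbi-like search over bead sequences. E_lens / F_lens are sentence lengths
    (e.g. in characters); returns the best bead-type sequence and its probability."""
    I, J = len(E_lens), len(F_lens)
    pref = [[0.0] * (J + 1) for _ in range(I + 1)]
    back = [[None] * (J + 1) for _ in range(I + 1)]
    pref[0][0] = 1.0                                     # Pref(0,0) = 1: empty prefix
    BEADS = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]
    for i in range(I + 1):
        for j in range(J + 1):
            for p, q in BEADS:
                if i - p < 0 or j - q < 0:
                    continue
                cand = pref[i - p][j - q] * bead_prob(
                    p, q, sum(E_lens[i - p:i]), sum(F_lens[j - q:j]))
                if cand > pref[i][j]:
                    pref[i][j], back[i][j] = cand, (p, q)
    path, i, j = [], I, J                                # follow back-pointers from (I, J)
    while (i, j) != (0, 0):
        p, q = back[i][j]
        path.append(f"{p}-{q}")
        i, j = i - p, j - q
    return list(reversed(path)), pref[I][J]
```

With a positive bead_prob plugged in, the returned path has exactly the 2-1, 1-1, ... shape shown two slides earlier.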

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 28 Probability of a Bead It remains to define P(p:q_k) (the red part): –k refers to the "next" bead, with segments of p and q sentences, of lengths l_{k,e} and l_{k,f}. Use a normal distribution for the length variation: P(p:q_k) = P(δ(l_{k,e}, l_{k,f}, μ, σ²), p:q) ≈ P(δ(l_{k,e}, l_{k,f}, μ, σ²)) P(p:q), where δ(l_{k,e}, l_{k,f}, μ, σ²) = (l_{k,f} - μ l_{k,e}) / sqrt(l_{k,e} σ²). Estimate P(p:q) from a small amount of data, or even guess and re-estimate after aligning some data. Words etc. might be used as better clues in the P(p:q_k) definition.
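A sketch of such a bead probability (length-based, with a two-sided normal tail for δ); the prior P(p:q) values and the μ, σ² constants are placeholders to be guessed and re-estimated, as the slide suggests:

```python
from math import erf, sqrt

PRIOR = {(1, 1): 0.89, (1, 0): 0.005, (0, 1): 0.005,     # placeholder P(p:q) values,
         (2, 1): 0.045, (1, 2): 0.045, (2, 2): 0.01}     # to be re-estimated on data
MU, SIGMA2 = 1.0, 6.8                                    # assumed length-ratio mean and variance

def bead_prob(p, q, le, lf):
    """P(p:q_k) ~ P(delta) * P(p:q), with delta = (lf - MU*le) / sqrt(le * SIGMA2)."""
    if le == 0:                                          # 0:1 beads: fall back on the prior
        return PRIOR.get((p, q), 1e-6)
    delta = (lf - MU * le) / sqrt(le * SIGMA2)
    p_delta = 2 * (1 - 0.5 * (1 + erf(abs(delta) / sqrt(2))))   # 2 * (1 - Phi(|delta|))
    return max(p_delta, 1e-12) * PRIOR.get((p, q), 1e-6)

print(bead_prob(1, 1, le=100, lf=108))   # a well-matched 1:1 bead scores relatively high
```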

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 29 Saving time For long texts (> 10^4 sentences), even Viterbi (in the version needed) is not effective (O(S^2) time) Go paragraph by paragraph if they are aligned 1:1 What if not? Apply the same method first to paragraphs! –identify paragraphs roughly in both languages –run the algorithm to get aligned paragraph-like segments –then, run on sentences within paragraphs. Performs well if there are not many consecutive 1:0 or 0:1 beads.

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 30 Word alignment Length alone does not help anymore. –mainly because words can be swapped, and mutual translations often have vastly different lengths....but at least we have "sentences" (sentence-like segments) aligned; that will be exploited heavily. Idea: –Assume some (simple) translation model (such as Model 1). –Find its parameters by considering virtually all alignments. –After we have the parameters, find the best alignment given those parameters.

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 31 Word Alignment Algorithm Start with a sentence-aligned corpus. Let (E,F) be a pair of sentences (actually, a bead). Initialize p(f|e) randomly (e.g., uniformly), for f ∈ F, e ∈ E. Compute expected counts over the corpus: c(f,e) = Σ_{(E,F); e∈E, f∈F} p(f|e), i.e. for every aligned pair (E,F), check whether e is in E and f is in F; if yes, add p(f|e). Reestimate: p(f|e) = c(f,e) / c(e) [where c(e) = Σ_f c(f,e)]. Iterate until the change in p(f|e) is small.
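A minimal EM sketch for these counts. One refinement over the slide's shorthand: the standard Model 1 E-step divides p(f|e) by its sum over all English words of the sentence pair, so each French token distributes one unit of count; the tiny corpus below is invented:

```python
from collections import defaultdict

def train_model1(bitext, iterations=10):
    """bitext: list of (E, F) word-list pairs; E should include the 'NULL' word.
    Returns t[(f, e)] ~ p(f|e) estimated by EM."""
    f_vocab = {f for _, F in bitext for f in F}
    t = defaultdict(lambda: 1.0 / len(f_vocab))          # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)                       # expected c(f, e)
        total = defaultdict(float)                       # expected c(e)
        for E, F in bitext:
            for f in F:
                z = sum(t[(f, e)] for e in E)            # normalizer over the English words
                for e in E:
                    c = t[(f, e)] / z                    # fractional count of the link f-e
                    count[(f, e)] += c
                    total[e] += c
        for f, e in count:                               # reestimate p(f|e) = c(f,e) / c(e)
            t[(f, e)] = count[(f, e)] / total[e]
    return t

bitext = [(["NULL", "the", "house"], ["la", "maison"]),
          (["NULL", "the", "book"], ["le", "livre"]),
          (["NULL", "a", "book"], ["un", "livre"])]
t = train_model1(bitext)
print(round(t[("livre", "book")], 3))   # rises well above its uniform start of 0.2
```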

12/07/1999 JHU CS / Intro to NLP/Jan Hajic 32 Best Alignment Select, for each (E,F), A = argmax_A P(A|F,E) = argmax_A P(F,A|E)/P(F|E) = argmax_A P(F,A|E) = argmax_A (ε / (l+1)^m ∏_{j=1..m} p(f_j | e_{a_j})) = argmax_A ∏_{j=1..m} p(f_j | e_{a_j}) (IBM Model 1) Again, use a dynamic-programming, Viterbi-like algorithm. Recompute p(f|e) based on the best alignment (only if you are inclined to do so; the "original" summed-over-all distribution might perform better). Note: we have also obtained all the Model 1 parameters.
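For Model 1 specifically the product factorizes over positions, so the Viterbi-like search collapses to a per-word argmax. A sketch (the toy table t mimics what the EM above might produce):

```python
def best_alignment(E, F, t, floor=1e-9):
    """For each French word f_j choose a_j = argmax_i p(f_j | e_i); under Model 1 this
    maximizes prod_j p(f_j | e_{a_j}), i.e. P(F,A|E) up to the constant epsilon/(l+1)^m."""
    return [max(range(len(E)), key=lambda i: t.get((f_j, E[i]), floor)) for f_j in F]

# Toy translation table (e.g. the output of the EM sketch above).
t = {("le", "the"): 0.6, ("livre", "book"): 0.7, ("le", "book"): 0.1, ("livre", "the"): 0.1}
print(best_alignment(["NULL", "the", "book"], ["le", "livre"], t))   # -> [1, 2]
```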