Introduction to Natural Language Processing (600.465)
Statistical Translation: Alignment and Parameter Estimation

Dr. Jan Hajič, CS Dept., Johns Hopkins University. 12/08/1999.

Alignment

Available corpus assumed: parallel text (translation E ↔ F). No alignment present (day marks only)!
Sentence alignment:
– sentence detection
– sentence alignment
Word alignment:
– tokenization
– word alignment (with restrictions)

Sentence Boundary Detection

Rules, lists:
– Sentence breaks: paragraphs (if marked); certain characters: ?, !, ; (...almost sure)
– The Problem: the period "."
  – could be the end of a sentence (... left yesterday. He was heading to ...)
  – decimal point: 3.6 (three-point-six)
  – thousands separator: 3.200 (three-thousand-two-hundred)
  – abbreviations that never end a sentence: cf., e.g., Calif., Mt., Mr.
  – ellipsis: ...
  – other languages: ordinal number indication (2nd ~ 2.)
  – initials: A. B. Smith
Statistical methods: e.g., Maximum Entropy
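As a concrete illustration of the rule-based side of this slide, here is a minimal Python sketch (not from the lecture); the abbreviation list, the regular expressions, and the uppercase-next-word heuristic are illustrative assumptions.

```python
import re

# Illustrative abbreviation list -- in practice language-specific and much
# longer (the slide mentions cf., e.g., Calif., Mt., Mr.).
ABBREVIATIONS = {"cf.", "e.g.", "calif.", "mt.", "mr."}

def split_sentences(text):
    """Rule-based sentence boundary detection sketch.

    Breaks on ?, ! and ; (near-certain boundaries) and on periods, except
    when the period belongs to a known abbreviation, an initial (a single
    capital letter), or a number (decimal point / thousands separator).
    """
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok.endswith(("?", "!", ";")):
            sentences.append(" ".join(current))
            current = []
        elif tok.endswith("."):
            is_abbrev = tok.lower() in ABBREVIATIONS
            is_initial = re.fullmatch(r"[A-Z]\.", tok) is not None
            is_number = re.fullmatch(r"\d+([.,]\d+)*\.", tok) is not None
            next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
            starts_upper = next_tok[:1].isupper()
            if not (is_abbrev or is_initial or is_number) and (starts_upper or not next_tok):
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("He left yesterday. He was heading to Mt. Rainier, e.g. by car. Really?"))
```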

Sentence Alignment

The Problem: sentences detected only. (Figure: example E and F sentence sequences, before and after alignment.)
Desired output: a segmentation with an equal number of segments, spanning the whole text continuously, with the original sentence boundaries kept.
Alignments obtained: 2-1, 1-1, 1-1, 2-2, 2-1, 0-1
The new segments are called "sentences" from now on.

Alignment Methods

Several methods (probabilistic and not):
– character-length based
– word-length based
– "cognates" (word identity used):
  – using an existing dictionary (F: prendre ~ E: make, take)
  – using word "distance" (similarity): names, numbers, borrowed words, Latin-origin words, ...
Best performing:
– statistical, word- or character-length based (perhaps with some word information)

Length-based Alignment

First, define the problem probabilistically:
argmax_A P(A|E,F) = argmax_A P(A,E,F)   (E, F fixed)
Define a "bead": a group of consecutive sentences in E aligned to a group of consecutive sentences in F. (Figure: example E and F sentence sequences with a 2:2 bead highlighted.)
Approximate: P(A,E,F) ≈ ∏_{i=1..n} P(B_i), where B_i is a bead; P(B_i) does not depend on the rest of E, F.

The Alignment Task

Given the model definition P(A,E,F) ≈ ∏_{i=1..n} P(B_i), find the partitioning of (E,F) into n beads B_{i=1..n} that maximizes P(A,E,F) over the training data.
Define B_i = p:q_i, where p:q ∈ {0:1, 1:0, 1:1, 1:2, 2:1, 2:2}
– p:q describes the type of alignment (p E-sentences aligned to q F-sentences).
Want to use some sort of dynamic programming:
Define Pref(i,j) ... the probability of the best alignment from the start of the (E,F) data (1,1) up to (i,j).

Recursive Definition

Initialize: Pref(0,0) = 1 (probability of the empty prefix; equivalently 0 in log space).
Pref(i,j) = max( Pref(i,j-1) · P(0:1_k),
                 Pref(i-1,j) · P(1:0_k),
                 Pref(i-1,j-1) · P(1:1_k),
                 Pref(i-1,j-2) · P(1:2_k),
                 Pref(i-2,j-1) · P(2:1_k),
                 Pref(i-2,j-2) · P(2:2_k) )
This is enough for a Viterbi-like search.
(Figure: the DP lattice over position i in E and position j in F, with one incoming arc per bead type, labeled by the six terms above.)
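A minimal Python sketch of this Pref(i,j) recursion (not from the lecture). It works in log space, so the empty prefix starts at 0.0 = log 1; bead_prob is a hypothetical helper returning P(p:q_k), and one possible definition is sketched after the next slide.

```python
import math

BEAD_TYPES = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]

def align_lengths(e_lens, f_lens, bead_prob):
    """Viterbi-like DP over the Pref(i, j) table from the slide.

    e_lens, f_lens : lists of sentence lengths (e.g., in characters).
    bead_prob(p, q, le, lf) : P(p:q_k) for a bead covering p E-sentences of
        total length le and q F-sentences of total length lf (hypothetical
        interface; a possible definition follows the next slide).
    Returns the best bead sequence as a list of (p, q) pairs.
    """
    n, m = len(e_lens), len(f_lens)
    NEG_INF = float("-inf")
    pref = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    pref[0][0] = 0.0  # log 1: empty prefix
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            for p, q in BEAD_TYPES:
                if i - p < 0 or j - q < 0:
                    continue
                le = sum(e_lens[i - p:i])
                lf = sum(f_lens[j - q:j])
                score = pref[i - p][j - q] + math.log(bead_prob(p, q, le, lf))
                if score > pref[i][j]:
                    pref[i][j] = score
                    back[i][j] = (p, q)
    # Trace back the best bead sequence from (n, m) to (0, 0).
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        p, q = back[i][j]
        beads.append((p, q))
        i, j = i - p, j - q
    return list(reversed(beads))
```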

Probability of a Bead

It remains to define P(p:q_k) (the bead-probability factor in the recursion above):
– k refers to the "next" bead, with segments of p and q sentences, of lengths l_{k,e} and l_{k,f}.
Use a normal distribution for the length variation:
P(p:q_k) = P(δ(l_{k,e}, l_{k,f}, μ, σ²), p:q) ≈ P(δ(l_{k,e}, l_{k,f}, μ, σ²)) · P(p:q)
where δ(l_{k,e}, l_{k,f}, μ, σ²) = (l_{k,f} − μ·l_{k,e}) / √(l_{k,e}·σ²)
Estimate P(p:q) from a small amount of data, or even guess and re-estimate after aligning some data.
Words etc. might be used as better clues in the definition of P(p:q_k).
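A possible definition of that bead probability, again only a sketch: the normal-distribution length model from this slide combined with a P(p:q) prior. The numeric values of μ, σ² and the prior are illustrative guesses in the spirit of Gale & Church, not values from the lecture.

```python
import math

# Illustrative parameters -- assumptions, not values from the lecture.
MU = 1.0       # expected F-length per unit of E-length
SIGMA2 = 6.8   # variance of the length ratio (rough Gale & Church-style guess)
PRIOR = {      # P(p:q), to be re-estimated after aligning some data
    (1, 1): 0.89, (1, 0): 0.01, (0, 1): 0.01,
    (2, 1): 0.045, (1, 2): 0.045, (2, 2): 0.01,
}

def bead_prob(p, q, le, lf):
    """P(p:q_k) ~= P(delta(le, lf, mu, sigma^2)) * P(p:q), as on the slide."""
    prior = PRIOR[(p, q)]
    if le == 0 or lf == 0:
        # 0:1 / 1:0 beads carry no length evidence; fall back on the prior
        # times a small constant so they stay dispreferred (an assumption).
        return prior * 1e-4
    delta = (lf - MU * le) / math.sqrt(le * SIGMA2)
    # Two-tailed probability of a deviation at least this large under the
    # standard normal: 2 * (1 - Phi(|delta|)) = erfc(|delta| / sqrt(2)).
    p_delta = math.erfc(abs(delta) / math.sqrt(2.0))
    return max(p_delta, 1e-12) * prior
```

With bead_prob defined, the align_lengths sketch above can be run over, for example, the character lengths of the detected sentences.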

Saving Time

For long texts (> 10⁴ sentences), even Viterbi (in the version needed here) is not effective (O(S²) time, S = number of sentences).
Go paragraph by paragraph if they are aligned 1:1.
What if not? Apply the same method to paragraphs first (see the sketch below):
– identify paragraphs roughly in both languages
– run the algorithm to get aligned paragraph-like segments
– then run it on the sentences within those paragraphs
This performs well if there are not many consecutive 1:0 or 0:1 beads.
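The paragraph-first strategy amounts to two passes over the same aligner. A hedged sketch, reusing the hypothetical align_lengths and bead_prob helpers from the earlier sketches:

```python
def align_hierarchically(e_paragraphs, f_paragraphs):
    """Two-pass alignment sketch: paragraphs first, then sentences.

    e_paragraphs / f_paragraphs : lists of paragraphs, each given as a list
    of sentence character lengths.
    """
    # Pass 1: align paragraph-like segments by their total lengths.
    e_par_lens = [sum(par) for par in e_paragraphs]
    f_par_lens = [sum(par) for par in f_paragraphs]
    par_beads = align_lengths(e_par_lens, f_par_lens, bead_prob)

    # Pass 2: run the same algorithm on the sentences inside each aligned group.
    result, i, j = [], 0, 0
    for p, q in par_beads:
        e_sents = [l for par in e_paragraphs[i:i + p] for l in par]
        f_sents = [l for par in f_paragraphs[j:j + q] for l in par]
        if e_sents and f_sents:
            result.append(align_lengths(e_sents, f_sents, bead_prob))
        i, j = i + p, j + q
    return result
```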

Word Alignment

Length alone does not help anymore:
– mainly because words can be swapped, and mutual translations often have vastly different lengths.
... but at least we have "sentences" (sentence-like segments) aligned; that will be exploited heavily.
Idea:
– Assume some (simple) translation model (such as Model 1).
– Find its parameters by considering virtually all alignments.
– Once we have the parameters, find the best alignment given those parameters.

Word Alignment Algorithm

Start with a sentence-aligned corpus. Let (E,F) be a pair of sentences (actually, a bead).
Initialize p(f|e) randomly (e.g., uniformly), for f ∈ F, e ∈ E.
Compute expected counts over the corpus:
c(f,e) = Σ_{(E,F): e∈E, f∈F} p(f|e)
(for every aligned pair (E,F), check whether e is in E and f is in F; if yes, add p(f|e))
Re-estimate: p(f|e) = c(f,e) / c(e), where c(e) = Σ_f c(f,e).
Iterate until the change in p(f|e) is small.
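A sketch of this EM loop in Python. One deliberate difference from the slide: the slide simply adds p(f|e) to the count, while the sketch below uses the standard IBM Model 1 E-step, which also divides each target word's contribution by Σ_{e'∈E} p(f|e'); that normalization, and the omission of the NULL source word, are added details, not the slide's text.

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    """EM training sketch for word translation probabilities p(f|e).

    corpus : list of sentence pairs (E, F), each a list of word tokens.
    Returns a dict-like table mapping (f, e) -> p(f|e).
    """
    # Initialize p(f|e) uniformly over the target vocabulary.
    f_vocab = {f for _, F in corpus for f in F}
    uniform = 1.0 / len(f_vocab)
    p = defaultdict(lambda: uniform)

    for _ in range(iterations):
        c_fe = defaultdict(float)   # expected counts c(f,e)
        c_e = defaultdict(float)    # c(e) = sum_f c(f,e)
        for E, F in corpus:
            for f in F:
                # Standard Model 1 E-step: normalize over the source words of
                # this sentence (the slide's simplified version omits this).
                total = sum(p[(f, e)] for e in E)
                for e in E:
                    frac = p[(f, e)] / total
                    c_fe[(f, e)] += frac
                    c_e[e] += frac
        # M-step: p(f|e) = c(f,e) / c(e)
        p = defaultdict(lambda: uniform,
                        {(f, e): c_fe[(f, e)] / c_e[e] for (f, e) in c_fe})
    return p
```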

Best Alignment

Select, for each (E,F):
A = argmax_A P(A|F,E) = argmax_A P(F,A|E)/P(F) = argmax_A P(F,A|E)
  = argmax_A ( ε / (l+1)^m · ∏_{j=1..m} p(f_j | e_{a_j}) )
  = argmax_A ∏_{j=1..m} p(f_j | e_{a_j})
Again, use dynamic programming (a Viterbi-like algorithm).
Recompute p(f|e) based on the best alignment only if you are inclined to do so; the "original" summed-over-all-alignments distribution might perform better.
Note: we have also obtained all of the Model 1 parameters.
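Because the Model 1 product factorizes over the target positions j (there is no dependence between the a_j), the argmax can be taken word by word, so the Viterbi-like search reduces to a per-word maximum here. A minimal decoding sketch, assuming the p table produced by the training sketch above:

```python
def best_alignment(E, F, p):
    """Best Model 1 alignment for a sentence pair (E, F).

    Since P(F, A | E) = eps / (l+1)^m * prod_j p(f_j | e_{a_j}) factorizes
    over target positions j, each a_j is chosen independently:
    a_j = argmax over source positions i of p(f_j | e_i).
    Returns a list of (target_position, source_position) pairs.
    """
    alignment = []
    for j, f in enumerate(F):
        best_i = max(range(len(E)), key=lambda i: p[(f, E[i])])
        alignment.append((j, best_i))
    return alignment

# Toy usage with hypothetical data:
# corpus = [(["the", "house"], ["la", "maison"]), (["the", "book"], ["le", "livre"])]
# p = train_model1(corpus)
# print(best_alignment(["the", "house"], ["la", "maison"], p))
```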