1 Statistical NLP Spring 2011
Lecture 9: Word Alignment II
Dan Klein – UC Berkeley

2 Learning with EM
Hard EM: alternate between two steps. Example: K-Means
E-step: find the best "completions" Y for fixed parameters θ
M-step: find the best parameters θ for fixed completions Y

3 K-Means
An iterative clustering algorithm:
Pick K random points as cluster centers (means)
Alternate:
  Assign each data instance to its closest mean
  Move each mean to the average of its assigned points
Stop when no point's assignment changes (a minimal sketch follows)
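A minimal sketch of the loop above in Python, assuming points are stored as fixed-dimension tuples; the function and variable names are illustrative, not from the slides.

```python
import random

def kmeans(points, k, max_iters=100):
    """K-Means: alternate assignments and mean updates until assignments stop changing."""
    # Pick K random points as the initial cluster centers (means).
    means = random.sample(points, k)
    assignments = [None] * len(points)
    for _ in range(max_iters):
        # Assign each data instance to its closest mean (squared Euclidean distance).
        new_assignments = [
            min(range(k),
                key=lambda c: sum((p - m) ** 2 for p, m in zip(pt, means[c])))
            for pt in points
        ]
        # Stop when no point's assignment changes.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Move each mean to the average of its assigned points.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignments) if a == c]
            if members:
                means[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return means, assignments
```

As the "getting stuck" slide below illustrates, the result depends on the random initialization: the procedure only converges to a local optimum.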

4 K-Means Example

5 K-Means Getting Stuck A local optimum:

6 Naïve-Bayes Models
Model: pick a topic y, then generate a document using a language model for that topic. Naïve-Bayes assumption: all words are independent given the topic.
[Figure: graphical model with topic node y generating word nodes x1 … xn]
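The graphical model on the slide corresponds to the standard Naïve-Bayes factorization; written out (a standard identity, not transcribed from the slide):

```latex
P(y, x_1, \ldots, x_n) \;=\; P(y)\,\prod_{i=1}^{n} P(x_i \mid y)
```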

7 Hard EM for Naïve-Bayes
Procedure:
(1) Calculate the best completions
(2) Compute the relevant counts from the completed data
(3) Compute new parameters from these counts (divide)
(4) Repeat until convergence
Can also do this when some docs are labeled (a minimal sketch follows)
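A minimal sketch of the four steps in Python, assuming documents are lists of word tokens and parameters are log-probability dictionaries; names such as `log_prior` and `log_lik` and the smoothing constant are illustrative assumptions, not from the slides.

```python
import math
from collections import defaultdict

def hard_em_step(docs, topics, log_prior, log_lik, alpha=1e-3):
    # (1) Best completions: give each document its single highest-scoring topic.
    completions = [
        max(topics, key=lambda y: log_prior[y] + sum(log_lik[y].get(w, -20.0) for w in doc))
        for doc in docs
    ]
    # (2) Relevant counts from the completed data.
    topic_counts = defaultdict(float)
    word_counts = defaultdict(lambda: defaultdict(float))
    for doc, y in zip(docs, completions):
        topic_counts[y] += 1.0
        for w in doc:
            word_counts[y][w] += 1.0
    # (3) New parameters from these counts (divide), with a little smoothing.
    vocab = {w for doc in docs for w in doc}
    new_prior = {y: math.log((topic_counts[y] + alpha) / (len(docs) + alpha * len(topics)))
                 for y in topics}
    new_lik = {}
    for y in topics:
        total = sum(word_counts[y].values())
        new_lik[y] = {w: math.log((word_counts[y][w] + alpha) / (total + alpha * len(vocab)))
                      for w in vocab}
    # (4) Repeat until the completions stop changing.
    return new_prior, new_lik, completions
```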

8 Hard EM: More Formally
Hard EM alternates between improving the completions and improving the parameters.
Each step either does nothing or increases the objective.
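The slide's update equations did not survive transcription; hard EM can be read as coordinate ascent on the complete-data log-likelihood, roughly:

```latex
y \;\leftarrow\; \arg\max_{y}\, P(y \mid x, \theta)
\qquad\qquad
\theta \;\leftarrow\; \arg\max_{\theta}\, \log P(x, y \mid \theta)
```

Each update can only leave max_y log P(x, y | θ) unchanged or increase it, which is why the procedure converges (to a local optimum).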

9 Soft EM for Naïve-Bayes
Procedure:
(1) Calculate posteriors (soft completions)
(2) Compute expected counts under those posteriors
(3) Compute new parameters from these counts (divide)
(4) Repeat until convergence
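The slide's formulas were rendered graphically; the standard soft E-step posterior and expected counts for Naïve-Bayes are roughly:

```latex
P(y \mid d) \;\propto\; P(y) \prod_{w \in d} P(w \mid y)
\qquad\qquad
\mathbb{E}[\mathrm{count}(y, w)] \;=\; \sum_{d} P(y \mid d)\, \mathrm{count}_d(w)
```

The M-step then divides these expected counts exactly as in the hard case.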

10 EM in General
We'll use EM over and over again to fill in missing data
Convenience scenario: we want P(x); including y just makes the model simpler (e.g. mixing weights for language models)
Induction scenario: we actually want to know y (e.g. clustering)
NLP differs from much of statistics / machine learning in that we often want to interpret or use the induced variables (which is tricky at best)
General approach: alternately update y and θ
  E-step: compute posteriors P(y|x,θ)
    This means scoring all completions with the current parameters
    Usually we do this implicitly with dynamic programming
  M-step: fit θ to these completions
    This is usually the easy part: treat the completions as (fractional) complete data
Initialization: start with some noisy labelings; the noise adjusts into patterns based on the data and the model
We'll see lots of examples in this course
EM is only locally optimal (why?)

11 KL Divergence

12 General Setup KL divergence to true posterior
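The content of this and the surrounding slides was graphical; the standard decomposition behind "KL divergence to the true posterior" is, for any distribution q over completions:

```latex
\log P(x \mid \theta)
  \;=\; \underbrace{\sum_{y} q(y)\,\log \frac{P(x, y \mid \theta)}{q(y)}}_{\text{lower bound } \mathcal{L}(q,\theta)}
  \;+\; \mathrm{KL}\!\left(q(y) \,\big\|\, P(y \mid x, \theta)\right)
```

The E-step drives the KL term to zero (or as small as the approximating family allows); the M-step raises the lower bound in θ.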

13 Approximations

14 General Solution

15 Example: Two-Mixture

16 Example Posteriors

17 Approximate Posteriors

18 Approximate Posteriors

19 IBM Models 1/2
[Figure: word-by-word alignment of E: "Thank you , I shall do so gladly ." with F: "Gracias , lo haré de muy buen grado ." ; "de muy buen grado" is emitted independently]
Model Parameters
Transitions: P( A2 = 3 )
Emissions: P( F1 = Gracias | EA1 = Thank )
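The model definitions on this slide were rendered graphically; the standard forms of Models 1 and 2, with English length l (plus NULL) and French length m, are roughly:

```latex
\text{Model 1:}\quad P(f, a \mid e) \;=\; \prod_{j=1}^{m} \frac{1}{l+1}\; t(f_j \mid e_{a_j})
\qquad
\text{Model 2:}\quad P(f, a \mid e) \;=\; \prod_{j=1}^{m} q(a_j \mid j, l, m)\; t(f_j \mid e_{a_j})
```

Model 1's alignment distribution is uniform; Model 2 replaces it with a learned distortion distribution q.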

20 Problems with Model 1
There's a reason they designed Models 2-5!
Problems: alignments jump around, and everything tends to align to rare words
Experimental setup:
  Training data: 1.1M sentences of French-English text, Canadian Hansards
  Evaluation metric: Alignment Error Rate (AER)
  Evaluation data: 447 hand-aligned sentences

21 Monotonic Translation
E: Japan shaken by two new quakes
F: Le Japon secoué par deux nouveaux séismes

22 Local Order Change
E: Japan is at the junction of four tectonic plates
F: Le Japon est au confluent de quatre plaques tectoniques

23 IBM Model 2
Alignments tend toward the diagonal (broadly, at least)
Other schemes for biasing alignments towards the diagonal:
  Relative vs. absolute alignment
  Asymmetric distances
  Learning a full multinomial over distances

24 EM for Models 1/2
Parameters: translation probabilities (Models 1 and 2), distortion parameters (Model 2 only)
Start with uniform parameters
For each sentence:
  For each French position j:
    Calculate the posterior over English positions (or just use the best single alignment)
    Increment the count of word fj with word ei by these amounts
  Also re-estimate distortion probabilities for Model 2
Iterate until convergence (a minimal Model 1 sketch follows)
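A minimal one-iteration sketch of Model 1 EM in Python (no distortion parameters), assuming the bitext is a list of (french_words, english_words) pairs with a NULL token already prepended to each English sentence, and `t` is the current translation table t(f|e); all names are illustrative.

```python
from collections import defaultdict

def model1_em_iteration(bitext, t):
    """One EM iteration for IBM Model 1: collect expected counts, then normalize."""
    count_fe = defaultdict(float)   # expected count of (french word, english word)
    count_e = defaultdict(float)    # expected count of english word
    for french, english in bitext:
        for f in french:
            # Posterior over English positions for this French word.
            # With Model 1's uniform alignment prior, it is proportional to t(f|e).
            total = sum(t.get((f, e), 1e-9) for e in english)
            for e in english:
                p = t.get((f, e), 1e-9) / total
                count_fe[(f, e)] += p
                count_e[e] += p
    # M-step: divide to get the new translation table.
    return {(f, e): c / count_e[e] for (f, e), c in count_fe.items()}

# Start with a uniform table over co-occurring word pairs,
# then call model1_em_iteration repeatedly until the table stops changing much.
```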

25 Example

26 Phrase Movement
E: On Thursday Nov. 4, earthquakes rocked Japan once again
F: Des tremblements de terre ont à nouveau touché le Japon jeudi 4 novembre.

27 The HMM Model
[Figure: the same alignment example as before, E: "Thank you , I shall do so gladly ." with F: "Gracias , lo haré de muy buen grado ." ; "de muy buen grado" is emitted independently]
Model Parameters
Transitions: P( A2 = 3 | A1 = 1 )
Emissions: P( F1 = Gracias | EA1 = Thank )

28 The HMM Model
Model 2 preferred global monotonicity
We want local monotonicity: most jumps are small
HMM model (Vogel 96); see the factorization below
Re-estimate using the forward-backward algorithm
Handling nulls requires some care
What are we still missing?
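The model itself was shown graphically; the standard HMM alignment model (Vogel 96) factors roughly as:

```latex
P(f, a \mid e) \;=\; \prod_{j=1}^{m} p(a_j \mid a_{j-1})\; t(f_j \mid e_{a_j}),
\qquad
p(a_j \mid a_{j-1}) \;\propto\; c(a_j - a_{j-1})
```

where c is a distribution over jump sizes; conditioning only on the jump width is what rewards local monotonicity.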

29 HMM Examples

30 AER for HMMs
Model          AER
Model 1 INT    19.5
HMM EF         11.4
HMM FE         10.8
HMM AND         7.1
HMM INT         4.7
GIZA M4 AND     6.9

31 IBM Models 3/4/5
             Mary did not slap the green witch
n(3|slap):   Mary not slap slap slap the green witch
P(NULL):     Mary not slap slap slap NULL the green witch
t(la|the):   Mary no daba una botefada a la verde bruja
d(j|i):      Mary no daba una botefada a la bruja verde
[from Al-Onaizan and Knight, 1998]

32 Examples: Translation and Fertility

33 Example: Idioms
E: he is nodding
F: il hoche la tête

34 Example: Morphology

35 Some Results [Och and Ney 03]

36 Decoding
In these word-to-word models:
  Finding the best alignments is easy
  Finding the best translations is hard (why?)

37 Bag “Generation” (Decoding)

38 Bag Generation as a TSP
Imagine bag generation with a bigram LM:
  Words are nodes
  Edge weights are P(w|w')
  Valid sentences are Hamiltonian paths
Not the best news for word-based MT!
[Figure: example bag of words: "is", "it", ".", "not", "clear"]
(a brute-force sketch follows)
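A brute-force sketch of the idea in Python, assuming the bigram LM is a dict mapping (w, w') to log P(w'|w); it enumerates every ordering (every Hamiltonian path), which is exactly why the problem scales like a TSP. Names are illustrative.

```python
import itertools
import math

def best_ordering(bag, bigram_logprob, start="<s>", end="</s>"):
    """Return the highest-scoring ordering of a bag of words under a bigram LM."""
    best, best_score = None, -math.inf
    for perm in itertools.permutations(bag):
        path = (start,) + perm + (end,)
        # Sum log P(w'|w) along the path; unseen bigrams get a harsh floor.
        score = sum(bigram_logprob.get((w, w2), -20.0)
                    for w, w2 in zip(path, path[1:]))
        if score > best_score:
            best, best_score = perm, score
    return best

# Example: best_ordering(["is", "it", ".", "not", "clear"], lm)
# Feasible only for tiny bags: n words means n! candidate paths.
```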

39 IBM Decoding as a TSP

40 Phrase Weights

41

42 Phrase Scoring
Learning phrase weights has been tried, several times:
  [Marcu and Wong, 02]
  [DeNero et al, 06]
  … and others
Seems not to work well, for a variety of partially understood reasons
Main issue: big chunks get all the weight; obvious priors don't help
Though, see [DeNero et al 08]
[Figure: phrase alignment example: "les chats aiment le poisson frais ." / "cats like fresh fish ."]
(a relative-frequency sketch follows)
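The common alternative to learning phrase weights is relative-frequency scoring of heuristically extracted phrase pairs; a minimal sketch, assuming `extracted` is a list of (foreign_phrase, english_phrase) tuples harvested from word-aligned data (names illustrative):

```python
from collections import defaultdict

def relative_frequency_scores(extracted):
    """Score phrase pairs by phi(f|e) = count(f, e) / count(e): no learning, just division."""
    pair_counts = defaultdict(float)
    english_counts = defaultdict(float)
    for f_phrase, e_phrase in extracted:
        pair_counts[(f_phrase, e_phrase)] += 1.0
        english_counts[e_phrase] += 1.0
    return {(f, e): c / english_counts[e] for (f, e), c in pair_counts.items()}
```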

43 Phrase Size
Phrases do help
But they don't need to be long
Why should this be?

44 Lexical Weighting

45 Many-to-Many Alignments

46 Crash Course in EM

47


Download ppt "Statistical NLP Spring 2011"

Similar presentations


Ads by Google