Lecture 1, 7/21/2005 Natural Language Processing 1 CS60057 Speech & Natural Language Processing Autumn 2007 Lecture 14b 24 August 2007.

1 CS60057 Speech & Natural Language Processing, Autumn 2007, Lecture 14b, 24 August 2007

2 LING 180 / SYMBSYS 138: Intro to Computer Speech and Language Processing. Lecture 13: Machine Translation (II), November 9, 2006. Dan Jurafsky. Thanks to Kevin Knight for much of this material; many slides also came from Bonnie Dorr and Christof Monz!

3 Outline for MT Week
• Intro and a little history
• Language Similarities and Divergences
• Four main MT Approaches: Transfer, Interlingua, Direct, Statistical
• Evaluation

4 Centauri/Arcturan [Knight, 1997] Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

5 Centauri/Arcturan [Knight, 1997]
1a. ok-voon ororok sprok. 1b. at-voon bichat dat.
2a. ok-drubel ok-voon anok plok sprok. 2b. at-drubel at-voon pippat rrat dat.
3a. erok sprok izok hihok ghirok. 3b. totat dat arrat vat hilat.
4a. ok-voon anok drok brok jok. 4b. at-voon krat pippat sat lat.
5a. wiwok farok izok stok. 5b. totat jjat quat cat.
6a. lalok sprok izok jok stok. 6b. wat dat krat quat cat.
7a. lalok farok ororok lalok sprok izok enemok. 7b. wat jjat bichat wat dat vat eneat.
8a. lalok brok anok plok nok. 8b. iat lat pippat rrat nnat.
9a. wiwok nok izok kantok ok-yurp. 9b. totat nnat quat oloat at-yurp.
10a. lalok mok nok yorok ghirok clok. 10b. wat nnat gat mat bat hilat.
11a. lalok nok crrrok hihok yorok zanzanok. 11b. wat nnat arrat mat zanzanat.
12a. lalok rarok nok izok hihok mok. 12b. wat nnat forat arrat vat gat.
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
Slide from Kevin Knight

14 Centauri/Arcturan [Knight, 1997] (same sentence pairs as slide 5). Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp. Slide annotation: "process of elimination".

15 Centauri/Arcturan [Knight, 1997] (same sentence pairs as slide 5). Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp. Slide annotation: "cognate?". Slide from Kevin Knight

16 Centauri/Arcturan [Knight, 1997] (same sentence pairs as slide 5). Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }. Slide annotation: "zero fertility".

17 It’s Really Spanish/English
Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa
1a. Garcia and associates. 1b. Garcia y asociados.
2a. Carlos Garcia has three associates. 2b. Carlos Garcia tiene tres asociados.
3a. his associates are not strong. 3b. sus asociados no son fuertes.
4a. Garcia has a company also. 4b. Garcia tambien tiene una empresa.
5a. its clients are angry. 5b. sus clientes estan enfadados.
6a. the associates are also angry. 6b. los asociados tambien estan enfadados.
7a. the clients and the associates are enemies. 7b. los clientes y los asociados son enemigos.
8a. the company has three groups. 8b. la empresa tiene tres grupos.
9a. its groups are in Europe. 9b. sus grupos estan en Europa.
10a. the modern groups sell strong pharmaceuticals. 10b. los grupos modernos venden medicinas fuertes.
11a. the groups do not sell zenzanine. 11b. los grupos no venden zanzanina.
12a. the small groups are not modern. 12b. los grupos pequenos no son modernos.
Slide from Kevin Knight

18 Statistical MT Systems
[Diagram: Spanish/English bilingual text and English text each feed a statistical analysis step; the Spanish input is turned into "broken English" and then into English.]
Example: Que hambre tengo yo → What hunger have I / Hungry I am so / I am so hungry / Have I that hunger … → I am so hungry
Slide from Kevin Knight

19 Statistical MT Systems
[Diagram: as on the previous slide, now with the components named.]
• Translation Model P(s|e), from statistical analysis of the Spanish/English bilingual text: Spanish → broken English
• Language Model P(e), from statistical analysis of the English text: broken English → English
• Decoding algorithm: $\operatorname{argmax}_e P(e) \cdot P(s \mid e)$
Example: Que hambre tengo yo → I am so hungry
Slide from Kevin Knight

20 Bayes Rule
[Diagram: Translation Model P(s|e), Language Model P(e), decoding algorithm $\operatorname{argmax}_e P(e) \cdot P(s \mid e)$; example: Que hambre tengo yo → I am so hungry.]
Given a source sentence s, the decoder should consider many possible translations and return the target string e that maximizes $P(e \mid s)$. By Bayes Rule, we can also write this as $P(e) \times P(s \mid e) / P(s)$ and maximize that instead. $P(s)$ never changes while we compare different e’s, so we can equivalently maximize $P(e) \times P(s \mid e)$.
Slide from Kevin Knight
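
The argmax above can be sketched in a few lines. This is only an illustration: the candidate list comes from the earlier slide, but every probability below is a made-up number, and `decode` is a hypothetical helper rather than anything from the lecture.

```python
import math

# Toy, made-up probabilities for illustration only (not from the lecture).
# P(e): how fluent is the English string?
language_model = {
    "What hunger have I": 1e-9,
    "Hungry I am so": 1e-8,
    "I am so hungry": 1e-5,
    "Have I that hunger": 1e-9,
}
# P(s | e) for the fixed source sentence s = "Que hambre tengo yo".
translation_model = {
    "What hunger have I": 2e-3,
    "Hungry I am so": 1e-3,
    "I am so hungry": 1e-3,
    "Have I that hunger": 2e-3,
}


def decode(candidates):
    """Return the candidate e that maximizes log P(e) + log P(s | e)."""
    return max(
        candidates,
        key=lambda e: math.log(language_model[e]) + math.log(translation_model[e]),
    )


best = decode(list(language_model))
print(best)
```

P(s) is never computed, since it is the same for every candidate. Working in log space avoids underflow once the products involve many small probabilities.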

21 Three Problems for Statistical MT
• Language model: given an English string e, assigns P(e) by formula
  - good English string -> high P(e)
  - random word sequence -> low P(e)
• Translation model: given a pair of strings, assigns P(f | e) by formula
  - look like translations -> high P(f | e)
  - don’t look like translations -> low P(f | e)
• Decoding algorithm: given a language model, a translation model, and a new sentence f, find the translation e maximizing P(e) * P(f | e)
Slide from Kevin Knight

22 The Classic Language Model: Word N-Grams
Goal of the language model: choose among
• He is on the soccer field / He is in the soccer field
• Is table the on cup the / The cup is on the table
• Rice shrine / American shrine / Rice company / American company
Slide from Kevin Knight
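
A minimal sketch of how a word-bigram model separates a fluent string from a scrambled one. The three-sentence training corpus is invented for the example, and add-one smoothing is just one simple choice; real systems train on far more text with better smoothing.

```python
import math
from collections import Counter

# Tiny made-up training corpus; a real model would use millions of sentences.
corpus = [
    "the cup is on the table",
    "the book is on the table",
    "he is on the soccer field",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(words[:-1])                 # counts of each history word
    bigrams.update(zip(words[:-1], words[1:]))  # counts of adjacent word pairs

# Words that can be predicted (everything except the start symbol).
vocab = {w for s in corpus for w in s.split()} | {"</s>"}


def log_prob(sentence):
    """Add-one smoothed bigram log-probability of a whitespace-tokenized sentence."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(words[:-1], words[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + len(vocab)))
    return total


good = log_prob("The cup is on the table")
bad = log_prob("Is table the on cup the")
print(good > bad)
```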

23 Intuition of phrase-based translation (Koehn et al. 2003)
• Generative story has three steps
1) Group words into phrases
2) Translate each phrase
3) Move the phrases around

24 Generative story again
1) Group English source words into phrases $e_1, e_2, \ldots, e_n$
2) Translate each English phrase $e_i$ into a Spanish phrase $f_j$. The probability of doing this is $\phi(f_j \mid e_i)$
3) Then (optionally) reorder each Spanish phrase
• We do this with a distortion probability
• A measure of distance between positions of a corresponding phrase in the two languages
• “What is the probability that a phrase in position X in the English sentence moves to position Y in the Spanish sentence?”

25 Distortion probability
• The distortion probability is parameterized by $a_i - b_{i-1}$
  - where $a_i$ is the start position of the foreign (Spanish) phrase generated by the $i$th English phrase $e_i$,
  - and $b_{i-1}$ is the end position of the foreign (Spanish) phrase generated by the $(i-1)$th English phrase $e_{i-1}$.
• We’ll call the distortion probability $d(a_i - b_{i-1})$.
• And we’ll have a really stupid model: $d(a_i - b_{i-1}) = \alpha^{|a_i - b_{i-1}|}$, where $\alpha$ is some small constant.
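
The distortion model above is a one-liner. The sketch follows the formula as it appears in this transcript; the function name and the default `alpha = 0.5` are assumptions for illustration. Published formulations (e.g. Koehn et al. 2003) often subtract 1 inside the absolute value so that strictly monotone order costs nothing.

```python
def distortion(a_i, b_prev, alpha=0.5):
    """d(a_i - b_{i-1}) = alpha ** |a_i - b_{i-1}|, as transcribed from the slide.

    a_i:    start position of the Spanish phrase generated by English phrase i
    b_prev: end position of the Spanish phrase generated by English phrase i-1
    alpha:  small constant in (0, 1); 0.5 is an arbitrary illustrative value
    """
    return alpha ** abs(a_i - b_prev)


# A phrase that starts right after the previous one ends is penalized least;
# the penalty grows geometrically with the jump distance in either direction.
print(distortion(4, 3), distortion(7, 3), distortion(1, 5))
```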

26 Final translation model for phrase-based MT
• Let’s look at a simple example with no distortion
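
The formula on this slide did not survive in the transcript. Given the two components defined on the preceding slides (the phrase translation probability $\phi$ and the distortion probability $d$), the model is presumably their product over the $I$ phrase pairs; the indexing below is an assumption based on those definitions.

```latex
P(F \mid E) \;=\; \prod_{i=1}^{I} \phi(f_i \mid e_i)\; d(a_i - b_{i-1})
```

Here $f_i$ denotes the Spanish phrase produced from the English phrase $e_i$, and $a_i$, $b_{i-1}$ are the start and end positions defined on the distortion slide.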

27 Phrase-based MT
• Language model P(E)
• Translation model P(F|E)
  - Model
  - How to train the model
• Decoder: finding the sentence E that is most probable

28 Training P(F|E)
• What we mainly need to train is $\phi(f_j \mid e_i)$
• Suppose we had a large bilingual training corpus
  - A bitext
  - In which each English sentence is paired with a Spanish sentence
• And suppose we knew exactly which phrase in Spanish was the translation of which phrase in the English
• We call this a phrase alignment
• If we had this, we could just count-and-divide:
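
Count-and-divide is a relative-frequency estimate: $\phi(f \mid e) = \mathrm{count}(f, e) / \sum_{f'} \mathrm{count}(f', e)$. A minimal sketch, assuming a hypothetical list of already phrase-aligned pairs (the data is invented for the example):

```python
from collections import Counter

# Hypothetical phrase-aligned pairs (english_phrase, spanish_phrase).
aligned_phrases = [
    ("green witch", "bruja verde"),
    ("green witch", "bruja verde"),
    ("green witch", "verde bruja"),
    ("the", "la"),
    ("the", "el"),
]

pair_counts = Counter(aligned_phrases)                  # count(f, e)
english_counts = Counter(e for e, _ in aligned_phrases)  # sum over f' of count(f', e)


def phi(f, e):
    """Relative-frequency estimate of phi(f | e); 0.0 for an unseen English phrase."""
    if english_counts[e] == 0:
        return 0.0
    return pair_counts[(e, f)] / english_counts[e]


print(phi("bruja verde", "green witch"), phi("la", "the"))
```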

29 But we don’t have phrase alignments
• What we have instead are word alignments:

30 Getting phrase alignments
• To get phrase alignments:
1) We first get word alignments
2) Then we “symmetrize” the word alignments into phrase alignments

31 How to get Word Alignments
• Word alignment: a mapping between the source words and the target words in a set of parallel sentences.
• Restriction: each foreign word comes from exactly 1 English word
• Advantage: represent an alignment by the index of the English word that the French word comes from
• Alignment above is thus 2,3,4,5,6,6,6

32 One addition: spurious words
• A word in the foreign sentence that doesn’t align with any word in the English sentence is called a spurious word.
• We model these by pretending they are generated by an English word $e_0$:
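
Because each foreign word comes from exactly one English word (or from $e_0$), an alignment fits in a plain list of indices. The sketch below uses a hypothetical sentence pair chosen for illustration, not the figure from the slide, and `aligned_pairs` is an invented helper name.

```python
# Index 0 is the special word e_0 (written NULL) that generates spurious words.
english = ["NULL", "Mary", "did", "not", "slap", "the", "green", "witch"]
spanish = ["Maria", "no", "dió", "una", "bofetada", "a", "la", "bruja", "verde"]

# alignment[j] = index of the English word that generated Spanish word j.
# "a" has no English counterpart here, so it points at e_0.
alignment = [1, 3, 4, 4, 4, 0, 5, 7, 6]


def aligned_pairs(english_words, foreign_words, align):
    """Expand the index vector into (foreign word, English word) pairs."""
    return [(f, english_words[i]) for f, i in zip(foreign_words, align)]


pairs = aligned_pairs(english, spanish, alignment)
for f, e in pairs:
    print(f"{f} <- {e}")
```

Note that "did" generates nothing and "slap" generates three words; the representation allows both, but it cannot express one foreign word coming from two English words.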

33 More sophisticated models of alignment

34 Computing word alignments: IBM Model 1
• For phrase-based machine translation
  - We want a word alignment
  - To extract a set of phrases
  - A word alignment algorithm gives us P(F,E)
  - We want this to train our phrase probabilities $\phi(f_j \mid e_i)$ as part of P(F|E)
• But a word-alignment algorithm can also be part of a mini-translation model itself.

35 IBM Model 1

36 IBM Model 1

37 How does the generative story assign P(F|E) for a Spanish sentence F?
• Terminology:
• Suppose we had done steps 1 and 2, i.e. we already knew the Spanish length J and the alignment A (and the English source E):

38 Let’s formalize steps 1 and 2
• We want P(A|E) of an alignment A (of length J) given an English sentence E
• IBM Model 1 makes the (very) simplifying assumption that each alignment is equally likely.
• How many possible alignments are there between an English sentence of length I and a Spanish sentence of length J?
• Hint: each Spanish word must come from one of the English source words (or the NULL word)
• $(I+1)^J$
• Let’s assume the probability of choosing length J is a small constant $\epsilon$

39 Model 1 continued
• Prob of choosing a length and then one of the possible alignments:
• Combining with step 3:
• The total probability of a given foreign sentence F:
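
The three formulas for this slide are missing from the transcript. Under the assumptions stated on the previous slide (length probability $\epsilon$, all $(I+1)^J$ alignments equally likely), the standard IBM Model 1 expressions from Brown et al. (1993) are as follows, writing $t(f_j \mid e_{a_j})$ for the word translation probability and $a_j$ for the index of the English word aligned to the $j$th Spanish word:

```latex
P(A \mid E) = \frac{\epsilon}{(I+1)^{J}}

P(F, A \mid E) = \frac{\epsilon}{(I+1)^{J}} \prod_{j=1}^{J} t(f_j \mid e_{a_j})

P(F \mid E) = \sum_{A} P(F, A \mid E)
            = \frac{\epsilon}{(I+1)^{J}} \sum_{A} \prod_{j=1}^{J} t(f_j \mid e_{a_j})
```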

40 Decoding
• How do we find the best A?
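
Under Model 1 the alignment prior is uniform, so the best alignment can be found one Spanish word at a time: each word simply picks the English word (or NULL) with the highest $t(f \mid e)$. A sketch with made-up translation probabilities; the function names and the table `t` are assumptions for illustration.

```python
def best_alignment(english, foreign, t):
    """Most probable Model 1 alignment.

    english[0] must be the NULL word. t maps (f, e) -> t(f | e);
    missing entries are treated as probability 0.
    """
    return [
        max(range(len(english)), key=lambda i: t.get((f, english[i]), 0.0))
        for f in foreign
    ]


def joint_prob(english, foreign, align, t, epsilon=1.0):
    """P(F, A | E) = epsilon / (I+1)^J * prod_j t(f_j | e_{a_j})."""
    prob = epsilon / (len(english) ** len(foreign))  # len(english) == I + 1
    for f, i in zip(foreign, align):
        prob *= t.get((f, english[i]), 0.0)
    return prob


# Made-up probabilities for a two-word example.
t = {
    ("la", "the"): 0.9, ("la", "house"): 0.05, ("la", "NULL"): 0.01,
    ("maison", "house"): 0.8, ("maison", "the"): 0.1,
}
english = ["NULL", "the", "house"]
foreign = ["la", "maison"]

a = best_alignment(english, foreign, t)
print(a, joint_prob(english, foreign, a, t))
```

The word-by-word argmax is valid only because Model 1 has no distortion or fertility terms; the richer IBM models need a real search.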

41 Training alignment probabilities
• Step 1: get a parallel corpus
  - Hansards: Canadian parliamentary proceedings, in French and English
  - Hong Kong Hansards: English and Chinese
• Step 2: sentence alignment
• Step 3: use EM (Expectation Maximization) to train word alignments

42 Step 1: Parallel corpora
English: Diverging opinions about planned tax reform
German: Unterschiedliche Meinungen zur geplanten Steuerreform
English: The discussion around the envisaged major tax reform continues.
German: Die Diskussion um die vorgesehene grosse Steuerreform dauert an.
English: The FDP economics expert, Graf Lambsdorff, today came out in favor of advancing the enactment of significant parts of the overhaul, currently planned for 1999.
German: Der FDP - Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus, wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen.
• Example from DE-News (8/1/1996)
Slide from Christof Monz

43 Step 2: Sentence Alignment
English: The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await.
Spanish: El viejo está feliz porque ha pescado muchas veces. Su mujer habla con él. Los tiburones esperan.
Intuition:
- use length in words or chars
- together with dynamic programming
- or use a simpler MT model
Slide from Kevin Knight

44 Sentence Alignment
English: 1. The old man is happy. 2. He has fished many times. 3. His wife talks to him. 4. The fish are jumping. 5. The sharks await.
Spanish: El viejo está feliz porque ha pescado muchas veces. Su mujer habla con él. Los tiburones esperan.
Slide from Kevin Knight

46 Sentence Alignment
English: 1. The old man is happy. He has fished many times. 2. His wife talks to him. 3. The sharks await.
Spanish: El viejo está feliz porque ha pescado muchas veces. Su mujer habla con él. Los tiburones esperan.
Note that unaligned sentences are thrown out, and sentences are merged in n-to-m alignments (n, m > 0).
Slide from Kevin Knight
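
The "length plus dynamic programming" intuition can be sketched as below. This is a deliberately simplified stand-in, not the aligner used for the slides: the cost is just the absolute difference in character length plus fixed penalties, and the penalty values and the function name `align_sentences` are assumptions. Real length-based aligners use a probabilistic cost instead.

```python
def align_sentences(src, tgt, skip_penalty=12, merge_penalty=5):
    """Length-based sentence alignment by dynamic programming.

    Returns a list of beads (src_indices, tgt_indices). An empty index list
    on one side means the sentence on the other side was left unaligned.
    """
    inf = float("inf")
    n, m = len(src), len(tgt)
    # Allowed bead shapes: (source sentences used, target sentences used, penalty).
    shapes = [(1, 1, 0), (1, 0, skip_penalty), (0, 1, skip_penalty),
              (2, 1, merge_penalty), (1, 2, merge_penalty)]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in shapes:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = penalty
                if di and dj:
                    # Compare total character lengths of the two sides of the bead.
                    s_len = sum(len(s) for s in src[i:ni])
                    t_len = sum(len(t) for t in tgt[j:nj])
                    step += abs(s_len - t_len)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Follow back-pointers from the end to recover the beads in order.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]


english = ["The old man is happy.", "He has fished many times.",
           "His wife talks to him.", "The fish are jumping.", "The sharks await."]
spanish = ["El viejo está feliz porque ha pescado muchas veces.",
           "Su mujer habla con él.", "Los tiburones esperan."]
for src_ids, tgt_ids in align_sentences(english, spanish):
    print(src_ids, tgt_ids)
```

Which sentence ends up unaligned depends heavily on the penalty values, which is one reason practical systems estimate these costs from data.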

47 Step 3: word alignments
• It turns out we can bootstrap alignments
• From a sentence-aligned bilingual corpus
• The method we use is the Expectation-Maximization or EM algorithm

48 EM for training alignment probs
French: … la maison … la maison bleue … la fleur …
English: … the house … the blue house … the flower …
All word alignments equally likely. All P(french-word | english-word) equally likely.
Slide from Kevin Knight

49 EM for training alignment probs
French: … la maison … la maison bleue … la fleur …
English: … the house … the blue house … the flower …
“la” and “the” observed to co-occur frequently, so P(la | the) is increased.
Slide from Kevin Knight

50 EM for training alignment probs
French: … la maison … la maison bleue … la fleur …
English: … the house … the blue house … the flower …
“house” co-occurs with both “la” and “maison”, but P(maison | house) can be raised without limit, to 1.0, while P(la | house) is limited because of “the” (pigeonhole principle).
Slide from Kevin Knight

51 EM for training alignment probs
French: … la maison … la maison bleue … la fleur …
English: … the house … the blue house … the flower …
Settling down after another iteration.
Slide from Kevin Knight

52 EM for training alignment probs
French: … la maison … la maison bleue … la fleur …
English: … the house … the blue house … the flower …
Inherent hidden structure revealed by EM training!
For details, see:
• Section 24.6.1 in the chapter
• “A Statistical MT Tutorial Workbook” (Knight, 1999)
• “The Mathematics of Statistical Machine Translation” (Brown et al, 1993)
• Software: GIZA++
Slide from Kevin Knight
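
The progression on the last few slides can be reproduced with a short EM loop for Model 1 word translation probabilities. This is a minimal sketch on the three fragments shown above; it omits the NULL word, and the iteration count and helper name `best_french` are arbitrary choices for illustration.

```python
from collections import defaultdict

corpus = [
    ("la maison".split(), "the house".split()),
    ("la maison bleue".split(), "the blue house".split()),
    ("la fleur".split(), "the flower".split()),
]

french_vocab = {f for fs, _ in corpus for f in fs}
english_vocab = {e for _, es in corpus for e in es}

# Start with all P(french-word | english-word) equally likely.
t = {(f, e): 1.0 / len(french_vocab) for f in french_vocab for e in english_vocab}

for _ in range(20):
    count = defaultdict(float)  # expected number of times e generated f
    total = defaultdict(float)  # expected number of words generated by e
    # E-step: share each French word among the English words in its sentence,
    # in proportion to the current t(f | e).
    for fs, es in corpus:
        for f in fs:
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                frac = t[(f, e)] / norm
                count[(f, e)] += frac
                total[e] += frac
    # M-step: re-estimate t(f | e) from the expected counts.
    for f, e in t:
        t[(f, e)] = count[(f, e)] / total[e]


def best_french(e):
    """French word with the highest t(f | e) for the English word e."""
    return max(french_vocab, key=lambda f: t[(f, e)])


for e in sorted(english_vocab):
    print(e, "->", best_french(e), round(t[(best_french(e), e)], 3))
```

No alignments are ever given to the program; the pairing emerges only from which words co-occur across sentences.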

53 Statistical Machine Translation
French: … la maison … la maison bleue … la fleur …
English: … the house … the blue house … the flower …
P(juste | fair) = 0.411, P(juste | correct) = 0.027, P(juste | right) = 0.020, …
New French sentence → possible English translations, to be rescored by language model.
Slide from Kevin Knight

54 A more complex model: IBM Model 3 (Brown et al., 1993)
Mary did not slap the green witch
  ↓ n(3|slap)
Mary not slap slap slap the green witch
  ↓ P-Null
Mary not slap slap slap NULL the green witch
  ↓ t(la|the)
Maria no dió una bofetada a la verde bruja
  ↓ d(j|i)
Maria no dió una bofetada a la bruja verde
Generative approach: probabilities can be learned from raw bilingual text.

55 How do we evaluate MT? Human tests for fluency
• Rating tests: give the raters a scale (1 to 5) and ask them to rate
  - Or distinct scales for clarity, naturalness, style
  - Or check for specific problems: cohesion (lexical chains, anaphora, ellipsis; hand-checking for cohesion) and well-formedness (5-point scale of syntactic correctness)
• Comprehensibility tests
  - Noise test
  - Multiple choice questionnaire
• Readability tests
  - Cloze

56 How do we evaluate MT? Human tests for fidelity
• Adequacy
  - Does it convey the information in the original?
  - Ask raters to rate on a scale
  - Bilingual raters: give them source and target sentence, ask how much information is preserved
  - Monolingual raters: give them target + a good human translation
• Informativeness
  - Task based: is there enough info to do some task?
  - Give raters multiple-choice questions about content

57 Evaluating MT: Problems
• Asking humans to judge sentences on a 5-point scale for 10 factors takes time and $$$ (weeks or months!)
• We can’t build language engineering systems if we can only evaluate them once every quarter!
• We need a metric that we can run every time we change our algorithm.
• It would be OK if it wasn’t perfect, but just tended to correlate with the expensive human metrics, which we could still run quarterly.
Slide from Bonnie Dorr

58 Automatic evaluation
• Miller and Beebe-Center (1958)
• Assume we have one or more human translations of the source passage
• Compare the automatic translation to these human translations
  - Bleu
  - NIST
  - Meteor
  - Precision/Recall

59 BiLingual Evaluation Understudy (BLEU; Papineni, 2001)
• Automatic technique, but it requires the pre-existence of human (reference) translations
• Approach:
  - Produce corpus of high-quality human translations
  - Judge “closeness” numerically (word-error rate)
  - Compare n-gram matches between candidate translation and 1 or more reference translations
http://www.research.ibm.com/people/k/kishore/RC22176.pdf
Slide from Bonnie Dorr

60 BLEU Evaluation Metric (Papineni et al, ACL-2002)
Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack, [?] highly alerts after the maintenance.
• N-gram precision (score is between 0 & 1)
  - What percentage of machine n-grams can be found in the reference translation?
  - An n-gram is a sequence of n words
  - Not allowed to use same portion of reference translation twice (can’t cheat by typing out “the the the the the”)
• Brevity penalty
  - Can’t just type out single word “the” (precision 1.0!)
• Amazingly hard to “game” the system (i.e., find a way to change machine output so that BLEU goes up, but quality doesn’t)
Slide from Bonnie Dorr

61 BLEU Evaluation Metric (Papineni et al, ACL-2002)
(Reference and machine translation as on slide 60.)
BLEU4 formula (counts n-grams up to length 4):
$$\exp\Big(1.0 \log p_1 + 0.5 \log p_2 + 0.25 \log p_3 + 0.125 \log p_4 - \max\big(\tfrac{\text{words-in-reference}}{\text{words-in-machine}} - 1,\ 0\big)\Big)$$
where $p_1$ = 1-gram precision, $p_2$ = 2-gram precision, $p_3$ = 3-gram precision, $p_4$ = 4-gram precision.
Slide from Bonnie Dorr
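
The formula above, together with the "no reusing the reference" rule from the previous slide, can be sketched as follows. The function names are invented for the example, whitespace tokenization is assumed, and the weights are the ones printed on the slide; the BLEU defined in Papineni et al. weights the four log precisions uniformly.

```python
import math
from collections import Counter


def ngrams(words, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most as
    many times as it occurs in the single reference where it is most frequent."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, c in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped = sum(min(c, max_ref[gram]) for gram, c in cand.items())
    return clipped / sum(cand.values())


def slide_bleu4(candidate, references, weights=(1.0, 0.5, 0.25, 0.125)):
    """BLEU4 as written on the slide (weights are the slide's, not uniform)."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, 5)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; treat as the lowest possible score
    # Use the reference whose length is closest to the candidate's.
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    brevity = max(ref_len / len(candidate) - 1, 0)
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)) - brevity)


reference = "the gunman was shot to death by the police .".split()
print(slide_bleu4(reference, [reference]))
print(modified_precision("the the the the the".split(), [reference], 1))
print(slide_bleu4("the gunman was shot dead by the police .".split(), [reference]))
```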

62 Multiple Reference Translations
Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places.
Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden, which threatens to launch a biochemical attack on such public places as airport. Guam authority has been on alert.
Reference translation 4: US Guam International Airport and its office received an email from Mr. Bin Laden and other rich businessman from Saudi Arabia. They said there would be biochemistry air raid to Guam Airport and other public places. Guam needs to be in high precaution about this matter.
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack, [?] highly alerts after the maintenance.
Slide from Bonnie Dorr

63 Lecture 1, 7/21/2005Natural Language Processing63 BLEU in Action
枪手被警方击毙。 (Foreign Original)
the gunman was shot to death by the police. (Reference Translation)
#1 the gunman was police kill.
#2 wounded police jaya of
#3 the gunman was shot dead by the police.
#4 the gunman arrested by police kill.
#5 the gunmen were killed.
#6 the gunman was shot to death by the police.
#7 gunmen were killed by police
#8 al by the police.
#9 the ringer is killed by the police.
#10 police killed the gunman.
Slide from Bonnie Dorr

64 Lecture 1, 7/21/2005Natural Language Processing64 BLEU in Action
枪手被警方击毙。 (Foreign Original)
the gunman was shot to death by the police. (Reference Translation)
#1 the gunman was police kill.
#2 wounded police jaya of
#3 the gunman was shot dead by the police.
#4 the gunman arrested by police kill.
#5 the gunmen were killed.
#6 the gunman was shot to death by the police.
#7 gunmen were killed by police
#8 al by the police.
#9 the ringer is killed by the police.
#10 police killed the gunman.
green = 4-gram match (good!) red = word not matched (bad!)
Slide from Bonnie Dorr

65 Lecture 1, 7/21/2005Natural Language Processing65 Bleu Comparison Chinese-English Translation Example: Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party. Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct. Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. Slide from Bonnie Dorr

66 Lecture 1, 7/21/2005Natural Language Processing66 How Do We Compute Bleu Scores?  Intuition: “What percentage of words in candidate occurred in some human translation?”  Proposal: count up the number of candidate translation words (unigrams) that occur in any reference translation, and divide by the total number of words in the candidate translation.  But we can’t just count the total number of overlapping n-grams! Candidate: the the the the the the. Reference 1: The cat is on the mat.  Solution: A reference word should be considered exhausted after a matching candidate word is identified. Slide from Bonnie Dorr

67 Lecture 1, 7/21/2005Natural Language Processing67 “Modified n-gram precision”  For each word compute: (1) the maximum number of times it occurs in any single reference translation (2) the number of times it occurs in the candidate translation  Instead of using count (2), use the minimum of (2) and (1), i.e. clip the count at the maximum for any single reference translation.  Now use that modified count.  And divide by the number of candidate words. Slide from Bonnie Dorr
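The clipping procedure above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the helper names (`tokenize`, `ngrams`, `modified_precision`) are hypothetical, and tokenization is simplified to lowercasing and dropping punctuation, which is enough to make "Party" match "party" in the lecture's example. The function returns the unreduced (numerator, denominator) pair so the results can be compared directly with the fractions on the slides.

```python
import re
from collections import Counter

REFS = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
CAND1 = "It is a guide to action which ensures that the military always obeys the commands of the party."
CAND2 = "It is to insure the troops forever hearing the activity guidebook that party direct."


def tokenize(text):
    # Simplifying assumption: lowercase and keep only alphanumeric runs.
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n=1):
    """Return (clipped matches, total candidate n-grams)."""
    cand_counts = Counter(ngrams(tokenize(candidate), n))
    # Count (1): the max number of times each n-gram occurs in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(tokenize(ref), n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip count (2), the candidate count, at count (1).
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())


if __name__ == "__main__":
    print(modified_precision(CAND1, REFS, n=1))  # (17, 18)
    print(modified_precision(CAND2, REFS, n=1))  # (8, 14)
    print(modified_precision(CAND1, REFS, n=2))  # (10, 17)
    print(modified_precision(CAND2, REFS, n=2))  # (1, 13)
```

The same function handles any n-gram order, since a unigram is just an n-gram with n = 1; only the denominator changes, because a candidate of k words has k - n + 1 n-grams.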

68 Lecture 1, 7/21/2005Natural Language Processing68 Modified Unigram Precision: Candidate #1 Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. It(1) is(1) a(1) guide(1) to(1) action(1) which(1) ensures(1) that(2) the(4) military(1) always(1) obeys(0) the commands(1) of(1) the party(1) What’s the answer? 17/18 Slide from Bonnie Dorr

69 Lecture 1, 7/21/2005Natural Language Processing69 Modified Unigram Precision: Candidate #2 Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. It(1) is(1) to(1) insure(0) the(4) troops(0) forever(1) hearing(0) the activity(0) guidebook(0) that(2) party(1) direct(0) What’s the answer? 8/14 Slide from Bonnie Dorr

70 Lecture 1, 7/21/2005Natural Language Processing70 Modified Bigram Precision: Candidate #1 Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. It is(1) is a(1) a guide(1) guide to(1) to action(1) action which(0) which ensures(0) ensures that(1) that the(1) the military(1) military always(0) always obeys(0) obeys the(0) the commands(0) commands of(0) of the(1) the party(1) What’s the answer? 10/17 Slide from Bonnie Dorr

71 Lecture 1, 7/21/2005Natural Language Processing71 Modified Bigram Precision: Candidate #2 Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. It is(1) is to(0) to insure(0) insure the(0) the troops(0) troops forever(0) forever hearing(0) hearing the(0) the activity(0) activity guidebook(0) guidebook that(0) that party(0) party direct(0) What’s the answer? 1/13 Slide from Bonnie Dorr

72 Lecture 1, 7/21/2005Natural Language Processing72 Catching Cheaters Reference 1: The cat is on the mat Reference 2: There is a cat on the mat the(2) the the the(0) the(0) the(0) the(0) What’s the unigram answer? 2/7 What’s the bigram answer? 0/6 (a seven-word candidate contains only six bigrams) Slide from Bonnie Dorr

73 Lecture 1, 7/21/2005Natural Language Processing73 Bleu distinguishes human from machine translations Slide from Bonnie Dorr

74 Lecture 1, 7/21/2005Natural Language Processing74 Bleu problems with sentence length  Candidate: of the Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party.  Problem: modified unigram precision is 2/2, bigram 1/1!  Solution: brevity penalty; prefers candidate translations which are the same length as one of the references. Slide from Bonnie Dorr
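To see how a brevity penalty defuses the "of the" candidate, here is a small Python sketch. It reuses the length term from the BLEU4 formula given earlier in the lecture, exp(-max(words-in-reference / words-in-machine - 1, 0)). Two points are assumptions rather than something the slides state: with several references, the reference whose length is closest to the candidate's is used (ties go to the shorter one), and lengths are counted by whitespace splitting. The function names are hypothetical.

```python
import math

REFS = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]


def closest_ref_length(cand_len, ref_lens):
    # Assumption: pick the reference length nearest the candidate length,
    # breaking ties in favour of the shorter reference.
    return min(ref_lens, key=lambda r: (abs(r - cand_len), r))


def brevity_penalty(cand_len, ref_lens):
    """Multiplicative penalty in (0, 1]; equals 1.0 when the candidate is long enough."""
    r = closest_ref_length(cand_len, ref_lens)
    return math.exp(-max(r / cand_len - 1.0, 0.0))


if __name__ == "__main__":
    ref_lens = [len(ref.split()) for ref in REFS]  # 16, 18 and 16 words
    cand_len = len("of the".split())               # 2 words
    # Perfect precisions (2/2 and 1/1) are scaled by exp(-7), i.e. almost to zero.
    print(brevity_penalty(cand_len, ref_lens))
    # A candidate as long as one of the references is not penalized.
    print(brevity_penalty(18, ref_lens))
```

Because the penalty multiplies the precision-based score, a two-word candidate with perfect precision still ends up with a score below 0.001 against these references.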

75 Lecture 1, 7/21/2005Natural Language Processing75 BLEU Tends to Predict Human Judgments slide from G. Doddington (NIST) (variant of BLEU)

76 Lecture 1, 7/21/2005Natural Language Processing76 Summary  Intro and a little history  Language Similarities and Divergences  Four main MT Approaches Transfer Interlingua Direct Statistical  Evaluation

77 Lecture 1, 7/21/2005Natural Language Processing77 Classes  LINGUIST 139M/239M. Human and Machine Translation. (Martin Kay)  CS 224N. Natural Language Processing (Chris Manning)

