Published by Jordan Earles. Modified over 10 years ago.
1
Machine Translation Domain Adaptation Day 19
2
PROJECT #2
3
MEMM tools
– The online description of project #2 has been updated with more information
4
Quick walk through
training.txt
    I/PRP left/VBD ./.
    John/NNP arrived/VBD ./.
5
Quick walk through
You write code to convert training.txt to features:
    featurize.pl training.txt training.feats
training.feats
    PRP w0=I:1 w-1= :1
    VBD w0=left:1 w-1=I:1
    . w0=.:1 w-1=left:1
    NNP w0=John:1 w-1= :1
    VBD w0=arrived:1 w-1=John:1
    . w0=.:1 w-1=arrived:1
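The featurization step could look roughly like the following. This is a minimal Python sketch, not the course's featurize.pl: the function names are made up for illustration, and the sentence-start placeholder `<s>` is an assumption, since the symbol used on the slide did not survive extraction.

```python
BOS = "<s>"  # assumed sentence-start placeholder; the slide's own symbol is not recoverable

def featurize_sentence(tagged_sentence):
    """Turn 'I/PRP left/VBD ./.' into one 'TAG w0=...:1 w-1=...:1' line per token."""
    lines = []
    prev_word = BOS
    for token in tagged_sentence.split():
        # Split on the last slash so that tokens such as './.' still parse.
        word, tag = token.rsplit("/", 1)
        lines.append(f"{tag} w0={word}:1 w-1={prev_word}:1")
        prev_word = word
    return lines

def featurize(sentences):
    """Featurize several sentences; the previous-word context restarts at each sentence."""
    out = []
    for sentence in sentences:
        out.extend(featurize_sentence(sentence))
    return out

if __name__ == "__main__":
    for line in featurize(["I/PRP left/VBD ./.", "John/NNP arrived/VBD ./."]):
        print(line)
```

Only the current word (w0) and the previous word (w-1) are emitted here, matching the two feature templates shown on the slide; richer templates would be added in the same loop.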
6
Quick walk through
Run memm_train on training.feats to train the model trigram.model:
    memm_train --input training.feats --classifier trigram.model --markovOrder 2
7
Quick walk through
Get some unseen test data…
test.txt
    he/PRP arrived/VBD ./.
    John/NNP left/VBD ./.
8
Quick walk through
Use the same featurization code on the test data:
    featurize.pl test.txt test.feats
test.feats
    PRP w0=he:1 w-1= :1
    VBD w0=arrived:1 w-1=he:1
    . w0=.:1 w-1=arrived:1
    NNP w0=John:1 w-1= :1
    VBD w0=left:1 w-1=John:1
    . w0=.:1 w-1=left:1
9
Quick walk through
memm_test predicts tags (memm_test ignores the first column, so it can include the true tags):
    memm_test --input test.feats --classifier trigram.model --markovOrder 2 --output test.tags
test.tags
    PRP VBD .
    NNP VBD .
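Because the first column of test.feats can carry the true tags, the predictions can be scored against it. The sketch below is one way to do that in Python. It assumes the predicted tags have already been read into a flat list with one tag per token, since the exact layout of test.tags is not clear from the slide; the function names and the `<s>` placeholder are hypothetical.

```python
def gold_tags(feats_lines):
    """Read the first whitespace-separated column (the true tag) of each feature line."""
    return [line.split()[0] for line in feats_lines if line.strip()]

def accuracy(gold, predicted):
    """Fraction of positions where the predicted tag equals the true tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted tag sequences differ in length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

if __name__ == "__main__":
    feats = [
        "PRP w0=he:1 w-1=<s>:1",   # '<s>' is an assumed start-of-sentence placeholder
        "VBD w0=arrived:1 w-1=he:1",
        ". w0=.:1 w-1=arrived:1",
    ]
    predicted = ["PRP", "VBD", "."]
    print(accuracy(gold_tags(feats), predicted))
```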
10
MEMM features
You provide these features (training.feats)…
    PRP w0=I:1 w-1= :1
    VBD w0=left:1 w-1=I:1
    . w0=.:1 w-1=left:1
    NNP w0=John:1 w-1= :1
    VBD w0=arrived:1 w-1=John:1
    . w0=.:1 w-1=arrived:1
…and add the argument “--markovOrder 2”
The MEMM adds in features about tag context at training and test time
Actual features used by MEMM
    PRP w0=I:1 w-1= :1 t[-1]= :1 t[-1]=,t[-2]= :1
    VBD w0=left:1 w-1=I:1 t[-1]=PRP:1 t[-1]=PRP,t[-2]= :1
    . w0=.:1 w-1=left:1 t[-1]=VBD:1 t[-1]=VBD,t[-2]=PRP:1
    t[-1]=.:1 t[-1]=.,t[-2]=VBD:1
    NNP w0=John:1 w-1= :1 t[-1]= :1 t[-1]=,t[-2]= :1
    VBD w0=arrived:1 w-1=John:1 t[-1]=NNP:1 t[-1]=NNP,t[-2]= :1
    . w0=.:1 w-1=arrived:1 t[-1]=VBD:1 t[-1]=VBD,t[-2]=NNP:1
    t[-1]=.:1 t[-1]=.,t[-2]=VBD:1
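To make the tag-context step concrete, here is a small Python sketch of how previous-tag features might be appended for Markov order 2. It is an illustration only, not the actual memm_train code: the function name and the `<s>` padding symbol are assumptions, and it covers the training-time case where the true tags in the first column are available. At test time the same templates would be filled from predicted tags.

```python
BOS_TAG = "<s>"  # assumed placeholder for "no previous tag"

def add_tag_context(feature_lines, order=2):
    """Append t[-1], (t[-1], t[-2]), ... features to the lines of one sentence."""
    out = []
    history = []  # true tags seen so far in this sentence, oldest first
    for line in feature_lines:
        tag, *feats = line.split()
        # Most recent tag first, padded with the start symbol up to `order` entries.
        context = (history[::-1] + [BOS_TAG] * order)[:order]
        for k in range(1, order + 1):
            joined = ",".join(f"t[-{i + 1}]={context[i]}" for i in range(k))
            feats.append(f"{joined}:1")
        out.append(" ".join([tag] + feats))
        history.append(tag)
    return out

if __name__ == "__main__":
    for line in add_tag_context(["PRP w0=I:1", "VBD w0=left:1", ". w0=.:1"]):
        print(line)
```

With order 2 each token gets one feature for the previous tag and one for the previous two tags together, which mirrors the t[-1] and t[-1],t[-2] templates listed on the slide.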
11
MACHINE TRANSLATION
12
Acknowledgments
Many thanks to (for helpful content and input on content):
– Chris Callison-Burch, Matt Post, & Adam Lopez (JHU)
– Philipp Koehn & Barry Haddow (U Edinburgh)
– Kevin Knight (ISI)
15
Translation: global problem and interesting research problem
– Non-English Internet content and user communities are increasing explosively
– Human translation costs are excessive: major languages range from 10-50 cents per word
– Result: the vast majority of published material remains untranslated!
16
Prevalence of MT on the Web
From Rarrick et al., 2010
18
The Goal: (sentence) translation
Translate source sentences into target sentences
– For now, ignore discourse structure, co-reference, and phenomena across sentence boundaries
    滴水之恩當 以涌泉相報
    A drop of water shall be returned with a burst of spring.
19
Types of MT systems
Source of information
– Rule based: People write rules to specify translations of words, phrases
– Data-driven: Use learning techniques to derive translation “rules” from data sources (e.g., parallel corpora)
Level of representation (modified Vauquois pyramid, top to bottom)
– Interlingua
– Semantic forms
– Syntax trees
– Phrases
– Words
20
Advantages of data-driven translation
We can model the genres of documents that we would like to model
– Learn contextually appropriate translations for technical data, chat data, etc.
Very flexible system
– Given a corpus $C = (\{x_1, y_1\}, \{x_2, y_2\}, \ldots)$ of sentence pairs
– $\mathrm{Translate}(C, x) = y$ is a function of the training data and the input sentence
– To build a new system (or optimize our old one) we just change the data
– But… we need oodles of data to get “good” models
21
Statistical MT
Learn word and phrase alignments from “parallel” data
22
Statistical MT
Learn word and phrase alignments from “parallel” data
– Parallel data?
– Parallel documents?
26
Statistical MT
Learn word and phrase alignments from “parallel” data
– Start with parallel documents
  – Need parallel sentences
  – Sentence break and sentence align
– Word align and produce word and phrase translation tables (our translation models)
29
Some Hmong
    a house                 ib lub tsev
    a new house             ib lub tsev tshiab
    my new house            kuv lub tsev tshiab
    eight new houses        yim lub tsev tshiab
    my eight new houses     kuv yim lub tsev tshiab
30
Some More Hmong
Same pairs as the previous slide, plus:
    the house               lub tsev
31
Even More Hmong
    kuv pluag heev          I'm very poor
    ib pluag mov            a meal
    ib taig mov             a bowl of rice
    ib taig zaub            a bowl of vegetables
33
Statistical MT
Use monolingual data to
– Build language models
  – Inform ordering
  – Choose best translation from n-best list
34
Statistical MT Recipe
Start with
– Parallel sentences: align words & phrases, & generate counts
Build these components
– Translation model: probs associated with aligned words & phrases, P(E|F)
35
Statistical MT Recipe
Start with
– Monolingual data
Build these components
– Language model: P(E)
36
Statistical MT Recipe
Start with
– Decoding algorithm
Build these components
– Decoder: maximizes P(F|E)*P(E)
37
Statistical Machine Translation
Given foreign f, find the best English translation e*:
$$e^* = \arg\max_e P(e \mid f)$$
Use Bayes’ rule to get the “noisy channel” model:
$$P(e \mid f) = \frac{P(f \mid e) \cdot P(e)}{P(f)}$$
P(f) does not depend on e, so it drops out of the argmax:
$$\arg\max_e P(e \mid f) = \arg\max_e P(f \mid e) \cdot P(e)$$
– P(f | e) is the channel or translation model
– P(e) is the language model
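The decision rule can be illustrated with a toy Python sketch that picks the best candidate from an explicit list. Everything in it is hypothetical: the function name, the candidate list, and the probabilities are invented for illustration, and a real decoder searches a much larger space instead of enumerating candidates.

```python
import math

def best_translation(f, candidates, tm_logprob, lm_logprob):
    """Return argmax over e of log P(f|e) + log P(e) for an explicit candidate list."""
    return max(candidates, key=lambda e: tm_logprob(f, e) + lm_logprob(e))

if __name__ == "__main__":
    # Invented toy probabilities: both candidates explain the source equally well,
    # so the language model decides between them.
    tm = {("sus clientes", "its clients"): 0.6, ("sus clientes", "clients its"): 0.6}
    lm = {"its clients": 0.05, "clients its": 0.0001}
    pick = best_translation(
        "sus clientes",
        ["its clients", "clients its"],
        lambda f, e: math.log(tm[(f, e)]),
        lambda e: math.log(lm[e]),
    )
    print(pick)  # its clients
```

Working in log space turns the product P(f | e) ∙ P(e) into a sum, which avoids underflow when many small probabilities are multiplied.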
38
Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan:
    farok crrrok hihok yorok clok kantok ok-yurp
Slides 38-74 adapted from Kevin Knight and CCB’s JHU crew
39
Centauri/Arcturan [Knight, 1997]
    1a. ok-voon ororok sprok.
    1b. at-voon bichat dat.
    2a. ok-drubel ok-voon anok plok sprok.
    2b. at-drubel at-voon pippat rrat dat.
    3a. erok sprok izok hihok ghirok.
    3b. totat dat arrat vat hilat.
    4a. ok-voon anok drok brok jok.
    4b. at-voon krat pippat sat lat.
    5a. wiwok farok izok stok.
    5b. totat jjat quat cat.
    6a. lalok sprok izok jok stok.
    6b. wat dat krat quat cat.
    7a. lalok farok ororok lalok sprok izok enemok.
    7b. wat jjat bichat wat dat vat eneat.
    8a. lalok brok anok plok nok.
    8b. iat lat pippat rrat nnat.
    9a. wiwok nok izok kantok ok-yurp.
    9b. totat nnat quat oloat at-yurp.
    10a. lalok mok nok yorok ghirok clok.
    10b. wat nnat gat mat bat hilat.
    11a. lalok nok crrrok hihok yorok zanzanok.
    11b. wat nnat arrat mat zanzanat.
    12a. lalok rarok nok izok hihok mok.
    12b. wat nnat forat arrat vat gat.
Your assignment, translate this to Arcturan:
    farok crrrok hihok yorok clok kantok ok-yurp
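One crude way to start on the assignment is to intersect the Arcturan sides of every pair that contains a given Centauri word. The Python sketch below does this for five of the twelve pairs (3, 5, 7, 11 and 12, with final periods dropped). The function name is hypothetical, and the heuristic is only a first guess: it returns several candidates or none when a word has more than one translation or none at all.

```python
# Pairs 3, 5, 7, 11 and 12 from the slide, without the final periods.
PAIRS = [
    ("erok sprok izok hihok ghirok", "totat dat arrat vat hilat"),
    ("wiwok farok izok stok", "totat jjat quat cat"),
    ("lalok farok ororok lalok sprok izok enemok", "wat jjat bichat wat dat vat eneat"),
    ("lalok nok crrrok hihok yorok zanzanok", "wat nnat arrat mat zanzanat"),
    ("lalok rarok nok izok hihok mok", "wat nnat forat arrat vat gat"),
]

def candidate_translations(word, pairs):
    """Target words that occur in every pair whose source side contains `word`."""
    candidates = None
    for source, target in pairs:
        if word in source.split():
            target_words = set(target.split())
            candidates = target_words if candidates is None else candidates & target_words
    return candidates or set()

if __name__ == "__main__":
    print(candidate_translations("farok", PAIRS))  # {'jjat'}
    print(candidate_translations("hihok", PAIRS))  # {'arrat'}
```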
48
Centauri/Arcturan [Knight, 1997]
(same sentence pairs and assignment)
– process of elimination
49
Centauri/Arcturan [Knight, 1997]
(same sentence pairs and assignment)
– cognate?
50
Centauri/Arcturan [Knight, 1997]
(same sentence pairs and assignment)
– zero fertility
51
It’s Really Spanish/English
    Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa
    1a. Garcia and associates.
    1b. Garcia y asociados.
    2a. Carlos Garcia has three associates.
    2b. Carlos Garcia tiene tres asociados.
    3a. his associates are not strong.
    3b. sus asociados no son fuertes.
    4a. Garcia has a company also.
    4b. Garcia tambien tiene una empresa.
    5a. its clients are angry.
    5b. sus clientes estan enfadados.
    6a. the associates are also angry.
    6b. los asociados tambien estan enfadados.
    7a. the clients and the associates are enemies.
    7b. los clientes y los asociados son enemigos.
    8a. the company has three groups.
    8b. la empresa tiene tres grupos.
    9a. its groups are in Europe.
    9b. sus grupos estan en Europa.
    10a. the modern groups sell strong pharmaceuticals.
    10b. los grupos modernos venden medicinas fuertes.
    11a. the groups do not sell zenzanine.
    11b. los grupos no venden zanzanina.
    12a. the small groups are not modern.
    12b. los grupos pequenos no son modernos.
53
Centauri/Arcturan [Knight, 1997]
Your assignment, put these words in order:
    { jjat, arrat, mat, bat, oloat, at-yurp }
(same sentence pairs)
– zero fertility
54
Reorder
57
Reorder
5040 possible orderings!!
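The size of the reordering problem, and one naive way to attack it, can be sketched in Python. The scoring rule here (count how many adjacent word pairs of an ordering were seen in target-side text) is an assumption chosen for illustration, not a method stated on the slides, and the function names are hypothetical. Seven distinct words give 7! = 5040 orderings, so brute force only works for very short sentences.

```python
import math
from collections import Counter
from itertools import permutations

def bigram_counts(sentences):
    """Count adjacent word pairs in a list of whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts

def best_order(words, counts):
    """Brute force: score every permutation by how many of its bigrams were seen."""
    def score(order):
        return sum(counts[pair] for pair in zip(order, order[1:]))
    return max(permutations(words), key=score)

if __name__ == "__main__":
    print(math.factorial(7))  # 5040 orderings of seven distinct words
    counts = bigram_counts(["a b c"])
    print(best_order(["c", "a", "b"], counts))  # ('a', 'b', 'c')
```

In practice the bigram counts would come from the Arcturan side of the sentence pairs, and ties between equally scored orderings are broken arbitrarily in this sketch.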
75
Language Model
Use a standard n-gram language model for P(E), trained on a large monolingual corpus
– 4- or 5-gram is typical
– Often uses target side of parallel data + monolingual data
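As a rough illustration of what an n-gram model for P(E) computes, here is a minimal bigram model in Python. It is a sketch under stated assumptions: a bigram is used instead of the 4- or 5-gram mentioned above to keep it short, add-one smoothing is a simplification chosen here, and the class name and the `<s>` / `</s>` boundary symbols are hypothetical.

```python
import math
from collections import Counter

class BigramLM:
    """Add-one smoothed bigram model over whitespace-tokenized sentences."""

    def __init__(self, sentences):
        self.history = Counter()   # how often each word occurs as a left context
        self.bigrams = Counter()
        self.vocab = set()         # words that can be predicted, including </s>
        for sentence in sentences:
            words = ["<s>"] + sentence.split() + ["</s>"]
            self.history.update(words[:-1])
            self.bigrams.update(zip(words, words[1:]))
            self.vocab.update(words[1:])

    def logprob(self, sentence):
        """Log probability of a sentence, including the end-of-sentence event."""
        words = ["<s>"] + sentence.split() + ["</s>"]
        total = 0.0
        for prev, cur in zip(words, words[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.history[prev] + len(self.vocab)
            total += math.log(numerator / denominator)
        return total

if __name__ == "__main__":
    lm = BigramLM(["the house", "the new house"])
    print(lm.logprob("the house") > lm.logprob("house the"))  # True
```

A model like this prefers word orders it has seen, which is how the language model informs ordering and helps choose among candidate translations.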
76
Translation Model
“Phrase table”
– N-gram pairs and probabilities
77
Statistical Machine Translation
78
EVALUATING MT
79
MT Evaluation
Source: ズキズキ 痛み ます 。
16 human translations:
– I have a throbbing pain.
– I am experiencing a throbbing pain.
– I am suffering from a throbbing pain.
– I am feeling a throbbing pain.
– It is a throbbing pain.
– It's throbbing and it really hurts.
– It's painful and it's throbbing.
– It's throbbing with pain.
– It's in throbbing pain.
– It hurts so much it's throbbing.
– I've got a throbbing pain.
– I can feel a throbbing pain.
– I am suffering from a throbbing pain.
– I am experiencing a throbbing pain.
– I have a painful throbbing.
– I feel a painful throbbing.
Data from International Workshop on Spoken Language Translation
80
MT Evaluation
No “right answer”! What can we test instead?
– Human adequacy / fluency ratings
– Human efficacy in an application (e.g. question answering from translated foreign documents vs. native documents)
– Very accurate, but slow & expensive
Agreement with reference translations
– BLEU (BiLingual Evaluation Understudy: IBM)
– Fast system development
81
BLEU (Papineni, ACL 2002)
MT output:
    1: It is a guide to action which ensures that the military always obeys the commands of the party.
    2: It is to insure the troops forever hearing the activity guidebook that party direct.
Human (reference) translations:
    1: It is a guide to action that ensures that the military will forever heed Party commands.
    2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
    3: It is the practical guide for the army always to heed the directions of the party.
84
BLEU: observations
Comparing MT outputs 1 and 2 against the reference translations:
– Word overlap is indicative
– n-gram (word sequence) overlap is even more distinctive
– Drawing from multiple reference translations helps
85
BLEU metric
Compute n-gram precisions:
$$P_n = \frac{c(\text{matched } n\text{-grams})}{c(n\text{-grams in candidate})}$$
Compute a brevity penalty (prevents candidates from deleting difficult words):
$$BP = \exp\left(\min\left(1 - \frac{r}{c},\ 0\right)\right), \quad r = \text{reference length},\ c = \text{candidate length}$$
Combine using the geometric mean:
$$\mathrm{BLEU} = BP \cdot \left(\prod_{i=1}^{n} P_i\right)^{1/n}$$
Produces a score on a 0-1 scale
– often expressed as a “percentage” (e.g., × 100)
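The formulas above can be turned into a short Python sketch. It makes some assumptions beyond the slide: matched counts are clipped by the maximum count in any single reference (the modified precision of Papineni et al.), r is taken as the reference length closest to the candidate length, tokenization is plain whitespace splitting, and the score is computed for one sentence where BLEU is normally computed over a whole test corpus. The function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU: brevity penalty times geometric mean of n-gram precisions."""
    cand = candidate.split()
    refs = [reference.split() for reference in references]
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each candidate n-gram count by its largest count in any one reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        precisions.append(matched / total if total else 0.0)
    if min(precisions) == 0.0:
        return 0.0  # the geometric mean is zero if any precision is zero
    c = len(cand)
    r = min((abs(len(ref) - c), len(ref)) for ref in refs)[1]
    bp = math.exp(min(1 - r / c, 0))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

if __name__ == "__main__":
    print(bleu("the cat sat on the mat", ["the cat sat on the mat"]))  # 1.0
```

Because any zero precision zeroes the whole score, sentence-level use of this sketch is harsh on short or poor candidates; corpus-level counts or smoothing are the usual remedies.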
86
BLEU results circa 2002 [from Papineni et al., ACL 2002] [from G. Doddington, NIST]
– Distinguishes humans from machines…
– …correlates well with human judgments
However, nowadays we’re starting to see problems:
– Some systems score better than human translations
– In competitions, some “gaming of BLEU”
– Rule-based systems are at a disadvantage after tuning
87
Next Time
– MT & Word Alignment
– Application of EM