Machine Translation Domain Adaptation Day 19

PROJECT #2

MEMM tools
The online description of Project #2 has been updated with more information.

Quick walk through

training.txt
  I/PRP left/VBD ./.
  John/NNP arrived/VBD ./.

You write code to convert this to features: “featurize.pl training.txt training.feats”

training.feats
  PRP w0=I:1 w-1= :1
  VBD w0=left:1 w-1=I:1
  . w0=.:1 w-1=left:1
  NNP w0=John:1 w-1= :1
  VBD w0=arrived:1 w-1=John:1
  . w0=.:1 w-1=arrived:1

Run memm_train to train this model (trigram.model): “memm_train --input training.feats --classifier trigram.model --markovOrder 2”

Get some unseen test data…

test.txt
  he/PRP arrived/VBD ./.
  John/NNP left/VBD ./.

Use the same featurization code on test data: “featurize.pl test.txt test.feats”

test.feats
  PRP w0=he:1 w-1= :1
  VBD w0=arrived:1 w-1=he:1
  . w0=.:1 w-1=arrived:1
  NNP w0=John:1 w-1= :1
  VBD w0=left:1 w-1=John:1
  . w0=.:1 w-1=left:1

memm_test predicts tags (memm_test ignores the first column, so it can include the true tags): “memm_test --input test.feats --classifier trigram.model --markovOrder 2 --output test.tags”

test.tags
  PRP VBD .
  NNP VBD .
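The featurization step can be sketched in Python as a stand-in for the featurize.pl script the slides ask you to write. This is only an illustration under assumptions: the input has one sentence per line with word/TAG tokens, feature lines are written one token per line, and the sentence-start value of w-1 (blank in the transcript) is represented by a hypothetical placeholder `<s>`. The function names are made up for this sketch.

```python
BOS = "<s>"  # assumed start-of-sentence placeholder; the transcript shows this value as blank


def featurize_sentence(tagged_tokens):
    """Turn ['I/PRP', 'left/VBD', './.'] into one feature line per token."""
    lines = []
    prev_word = BOS
    for token in tagged_tokens:
        # Split on the last slash so that './.' yields word '.' and tag '.'.
        word, tag = token.rsplit("/", 1)
        # First column is the label; the rest are name:value features.
        lines.append(f"{tag} w0={word}:1 w-1={prev_word}:1")
        prev_word = word
    return lines


def featurize(text):
    """Featurize a whole file's contents, one sentence per input line."""
    feature_lines = []
    for sentence in text.splitlines():
        tokens = sentence.split()
        if tokens:  # skip empty lines
            feature_lines.extend(featurize_sentence(tokens))
    return feature_lines


training_txt = "I/PRP left/VBD ./.\nJohn/NNP arrived/VBD ./."
for line in featurize(training_txt):
    print(line)
```

The same function would be applied unchanged to test.txt, which is the point of the walk-through: training and test data must pass through identical featurization.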

MEMM features

training.txt
  I/PRP left/VBD ./.
  John/NNP arrived/VBD ./.

training.feats (you provide these features… and add the argument “--markovOrder 2”)
  PRP w0=I:1 w-1= :1
  VBD w0=left:1 w-1=I:1
  . w0=.:1 w-1=left:1
  NNP w0=John:1 w-1= :1
  VBD w0=arrived:1 w-1=John:1
  . w0=.:1 w-1=arrived:1

Actual features used by MEMM
  PRP w0=I:1 w-1= :1 t[-1]= :1 t[-1]=,t[-2]= :1
  VBD w0=left:1 w-1=I:1 t[-1]=PRP:1 t[-1]=PRP,t[-2]= :1
  . w0=.:1 w-1=left:1 t[-1]=VBD:1 t[-1]=VBD,t[-2]=PRP:1
  t[-1]=.:1 t[-1]=.,t[-2]=VBD:1
  NNP w0=John:1 w-1= :1 t[-1]= :1 t[-1]=,t[-2]= :1
  VBD w0=arrived:1 w-1=John:1 t[-1]=NNP:1 t[-1]=NNP,t[-2]= :1
  . w0=.:1 w-1=arrived:1 t[-1]=VBD:1 t[-1]=VBD,t[-2]=NNP:1
  t[-1]=.:1 t[-1]=.,t[-2]=VBD:1

The MEMM adds in features about tag context at training and test time.
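To make the tag-context features concrete, here is a rough Python sketch of the augmentation the slide describes. It is not the memm_train implementation, only an illustration that assumes the feature lines of a single sentence are passed in, that the first column holds the tag to condition on, and that missing history is padded with a hypothetical `<s>` symbol (blank in the transcript). It does not emit any end-of-sentence line.

```python
BOS = "<s>"  # assumed padding symbol for missing tag history


def add_tag_context(feature_lines, markov_order=2):
    """Append t[-1] and (for order 2) the t[-1],t[-2] pair to each line of one sentence."""
    augmented = []
    history = [BOS, BOS]  # tags of the two previous tokens, oldest first
    for line in feature_lines:
        tag = line.split()[0]  # first column is the label
        extra = []
        if markov_order >= 1:
            extra.append(f"t[-1]={history[-1]}:1")
        if markov_order >= 2:
            extra.append(f"t[-1]={history[-1]},t[-2]={history[-2]}:1")
        augmented.append(" ".join([line] + extra))
        history.append(tag)
    return augmented


sentence = [
    "PRP w0=I:1 w-1=<s>:1",
    "VBD w0=left:1 w-1=I:1",
    ". w0=.:1 w-1=left:1",
]
for line in add_tag_context(sentence):
    print(line)
```

At training time the history comes from the true tags, as here. At test time a tagger has to fill the history with its own predictions, which is presumably why memm_test can ignore the first column.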

MACHINE TRANSLATION

Acknowledgments
Many thanks to (for helpful content and input on content):
– Chris Callison-Burch, Matt Post, & Adam Lopez (JHU)
– Philipp Koehn & Barry Haddow (U Edinburgh)
– Kevin Knight (ISI)

Translation: global problem and interesting research problem
– Non-English Internet content and user communities are increasing explosively
– Human translation costs are excessive: major languages range from cents per word
– Result: the vast majority of published material remains untranslated!

Prevalence of MT on the Web From Rarrick et al,

The Goal: (sentence) translation
Translate source sentences into target sentences
– For now, ignore discourse structure, co-reference, and phenomena across sentence boundaries

滴水之恩當以涌泉相報
A drop of water shall be returned with a burst of spring.

Types of MT systems
Source of information
– Rule based: People write rules to specify translations of words, phrases
– Data-driven: Use learning techniques to derive translation “rules” from data sources (e.g., parallel corpora)
Level of representation (modified Vauquois pyramid, top to bottom)
– Interlingua
– Semantic forms
– Syntax trees
– Phrases
– Words

Advantages of data-driven translation
We can model the genres of documents that we would like to model
– Learn contextually appropriate translations for technical data, chat data, etc.
Very flexible system
– Given corpus C = ({x_1, y_1}, {x_2, y_2}, …) of sentence pairs
– Translate(C, x) = y is a function of the training data and the input sentence
– To build a new system (or optimize our old one) we just change the data
– But… we need oodles of data to get “good” models

Statistical MT
Learn word and phrase alignments from “parallel” data
– Parallel data? Parallel documents?
– Start with parallel documents
   – Need parallel sentences
   – Sentence break and sentence align
– Word align and produce word and phrase translation tables (our translation models)

Some Hmong
  a house               ib lub tsev
  a new house           ib lub tsev tshiab
  my new house          kuv lub tsev tshiab
  eight new houses      yim lub tsev tshiab
  my eight new houses   kuv yim lub tsev tshiab

Some More Hmong (the pairs above, plus)
  the house             lub tsev

Even More Hmong
  kuv pluag heev        I'm very poor
  ib pluag mov          a meal
  ib taig mov           a bowl of rice
  ib taig zaub          a bowl of vegetables

Statistical MT
Learn word and phrase alignments from “parallel” data
– Start with parallel documents
   – Need parallel sentences
   – Sentence break and sentence align
– Word align and produce word and phrase translation tables (our translation models)
Use monolingual data to
– Build language models
   – Inform ordering
   – Choose best translation from n-best list

Statistical MT Recipe
Start With
– Parallel sentences: align words & phrases, & generate counts
– Monolingual data
– Decoding algorithm
Build These Components
– Translation Model: probs associated with aligned words & phrases, P(F|E)
– Language Model: P(E)
– Decoder: maximizes P(F|E)*P(E)

Statistical Machine Translation
Given foreign f, find best English translation e*:
$$e^* = \arg\max_e P(e \mid f)$$
Use Bayes’ rule to get “noisy channel” model:
$$P(e \mid f) = \frac{P(f \mid e) \cdot P(e)}{P(f)}$$
$$\arg\max_e P(e \mid f) = \arg\max_e P(f \mid e) \cdot P(e)$$
(P(f) is the same for every candidate e, so it drops out of the argmax.)
– P(f | e) is the channel or translation model
– P(e) is the language model
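A toy Python sketch of the noisy-channel decision rule may help here. All probabilities below are invented for illustration (they are not estimated from any data), the candidate list is hypothetical, and a real decoder searches a far larger space than three strings. The foreign input is the Hmong phrase "lub tsev" from the earlier slide.

```python
import math

FOREIGN = "lub tsev"

# Hypothetical P(f | e) for f = "lub tsev": the channel / translation model.
translation_model = {"the house": 0.20, "house the": 0.20, "a home": 0.05}

# Hypothetical P(e): the language model prefers fluent English word order.
language_model = {"the house": 0.010, "house the": 0.0001, "a home": 0.008}


def best_translation(candidates, tm, lm):
    """Return argmax_e P(f | e) * P(e), computed in log space."""
    def log_score(e):
        return math.log(tm[e]) + math.log(lm[e])
    return max(candidates, key=log_score)


e_star = best_translation(list(translation_model), translation_model, language_model)
print(FOREIGN, "=>", e_star)
```

In this toy setup the translation model cannot tell "the house" from "house the", so the language model settles the word order. That division of labor is the motivation for the factorization.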

Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
Slides adapted from Kevin Knight and CCB’s JHU crew

Centauri/Arcturan [Knight, 1997]
Sentence pairs (a: Centauri, b: Arcturan)

1a. ok-voon ororok sprok.
1b. at-voon bichat dat.

2a. ok-drubel ok-voon anok plok sprok.
2b. at-drubel at-voon pippat rrat dat.

3a. erok sprok izok hihok ghirok.
3b. totat dat arrat vat hilat.

4a. ok-voon anok drok brok jok.
4b. at-voon krat pippat sat lat.

5a. wiwok farok izok stok.
5b. totat jjat quat cat.

6a. lalok sprok izok jok stok.
6b. wat dat krat quat cat.

7a. lalok farok ororok lalok sprok izok enemok.
7b. wat jjat bichat wat dat vat eneat.

8a. lalok brok anok plok nok.
8b. iat lat pippat rrat nnat.

9a. wiwok nok izok kantok ok-yurp.
9b. totat nnat quat oloat at-yurp.

10a. lalok mok nok yorok ghirok clok.
10b. wat nnat gat mat bat hilat.

11a. lalok nok crrrok hihok yorok zanzanok.
11b. wat nnat arrat mat zanzanat.

12a. lalok rarok nok izok hihok mok.
12b. wat nnat forat arrat vat gat.

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Centauri/Arcturan [Knight, 1997]
Working through the same corpus and assignment, the slides annotate the word correspondences with:
– ???
– process of elimination
– cognate?
– zero fertility

It’s Really Spanish/English
Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa

1a. Garcia and associates.
1b. Garcia y asociados.

2a. Carlos Garcia has three associates.
2b. Carlos Garcia tiene tres asociados.

3a. his associates are not strong.
3b. sus asociados no son fuertes.

4a. Garcia has a company also.
4b. Garcia tambien tiene una empresa.

5a. its clients are angry.
5b. sus clientes estan enfadados.

6a. the associates are also angry.
6b. los asociados tambien estan enfadados.

7a. the clients and the associates are enemies.
7b. los clientes y los asociados son enemigos.

8a. the company has three groups.
8b. la empresa tiene tres grupos.

9a. its groups are in Europe.
9b. sus grupos estan en Europa.

10a. the modern groups sell strong pharmaceuticals.
10b. los grupos modernos venden medicinas fuertes.

11a. the groups do not sell zenzanine.
11b. los grupos no venden zanzanina.

12a. the small groups are not modern.
12b. los grupos pequenos no son modernos.

Centauri/Arcturan [Knight, 1997]
Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }
(Same twelve sentence pairs as before; annotation: zero fertility)

Reorder
5040 Possible Orderings!!
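As a rough illustration of why reordering needs a language model, the Python sketch below enumerates every ordering of the six Arcturan words from the assignment and counts how many adjacent word pairs in each ordering were actually seen in the Arcturan side of the corpus. This scoring rule is a simplification chosen for the example, not the method from the slides. Note that 5040 = 7!, so the slide's count corresponds to seven items, while six words give 6! = 720 orderings.

```python
from itertools import permutations
from math import factorial

# Arcturan ("b") sides of the twelve sentence pairs, final periods dropped.
ARCTURAN = """
at-voon bichat dat
at-drubel at-voon pippat rrat dat
totat dat arrat vat hilat
at-voon krat pippat sat lat
totat jjat quat cat
wat dat krat quat cat
wat jjat bichat wat dat vat eneat
iat lat pippat rrat nnat
totat nnat quat oloat at-yurp
wat nnat gat mat bat hilat
wat nnat arrat mat zanzanat
wat nnat forat arrat vat gat
""".split("\n")

# Collect every adjacent word pair observed in the monolingual data.
seen_bigrams = set()
for sentence in ARCTURAN:
    tokens = sentence.split()
    seen_bigrams.update(zip(tokens, tokens[1:]))


def attested(order):
    """Number of adjacent pairs in this ordering that occur in the corpus."""
    return sum(pair in seen_bigrams for pair in zip(order, order[1:]))


words = ["jjat", "arrat", "mat", "bat", "oloat", "at-yurp"]
orderings = list(permutations(words))
best_score = max(attested(o) for o in orderings)
best_orderings = [o for o in orderings if attested(o) == best_score]

print(len(orderings), "orderings of six words;", factorial(7), "for seven")
print("best score:", best_score, "shared by", len(best_orderings), "orderings")
```

With a corpus this small several orderings tie for the best score, which hints at why real systems use smoothed n-gram probabilities trained on much more data instead of raw attested counts.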

Language Model
Use a standard n-gram language model for P(E).
Trained on large monolingual corpus
– 4- or 5-gram is typical
– Often uses target side of parallel data + monolingual data
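A minimal n-gram language model can be sketched in a few lines of Python. For brevity this uses bigrams with add-one smoothing, which is an assumption made for the example (the slide notes that 4- or 5-grams are typical, and production systems use better smoothing). The three training sentences are taken from the English side of the Spanish/English corpus shown earlier; the function names are hypothetical.

```python
from collections import Counter


def train_bigram_lm(sentences):
    """Count contexts and bigrams, with <s> and </s> as boundary markers."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])          # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
        vocab.update(tokens[1:])              # everything that can be predicted
    return contexts, bigrams, vocab


def sentence_prob(sentence, contexts, bigrams, vocab):
    """P(E) as a product of add-one-smoothed bigram probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
    return prob


corpus = [
    "the clients are angry",
    "the associates are also angry",
    "the groups are not modern",
]
model = train_bigram_lm(corpus)
fluent = sentence_prob("the clients are angry", *model)
scrambled = sentence_prob("angry are clients the", *model)
print(fluent > scrambled)
```

The comparison at the end is the property the decoder relies on: the same bag of words receives a higher P(E) in a fluent order than in a scrambled one.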

Translation Model
“Phrase table”
– N-gram pairs and probabilities

Statistical Machine Translation

EVALUATING MT

MT Evaluation
Source: ズキズキ 痛み ます 。
16 human translations:
– I have a throbbing pain.
– I am experiencing a throbbing pain.
– I am suffering from a throbbing pain.
– I am feeling a throbbing pain.
– It is a throbbing pain.
– It's throbbing and it really hurts.
– It's painful and it's throbbing.
– It's throbbing with pain.
– It's in throbbing pain.
– It hurts so much it's throbbing.
– I've got a throbbing pain.
– I can feel a throbbing pain.
– I am suffering from a throbbing pain.
– I am experiencing a throbbing pain.
– I have a painful throbbing.
– I feel a painful throbbing.
Data from International Workshop on Spoken Language Translation

MT Evaluation
No “right answer”! What can we test instead?
Human judgments
– Human adequacy / fluency ratings
– Human efficacy in an application (e.g. question answering from translated foreign documents vs. native documents)
– Very accurate, but slow & expensive
Agreement with reference translations
– BLEU (BiLingual Evaluation Understudy: IBM)
– Fast system development

BLEU (Papineni, ACL 2002)
MT output:
1: It is a guide to action which ensures that the military always obeys the commands of the party.
2: It is to insure the troops forever hearing the activity guidebook that party direct.
Human (reference) translations:
1: It is a guide to action that ensures that the military will forever heed Party commands.
2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
3: It is the practical guide for the army always to heed the directions of the party.

BLEU: observations
1: It is a guide to action which ensures that the military always obeys the commands of the party.
2: It is to insure the troops forever hearing the activity guidebook that party direct.
Observations
– Word overlap is indicative
– n-gram (word sequence) overlap is even more distinct
– Drawing from multiple reference translations helps

BLEU metric
Compute n-gram precisions:
$$P_n = \frac{c(\text{matched } n\text{-grams})}{c(n\text{-grams in candidate})}$$
Compute a brevity penalty (prevents candidates from deleting difficult words):
$$BP = \exp\left(\min\left(1 - \frac{r}{c},\ 0\right)\right)$$
where r = reference length, c = candidate length
Combine using geometric mean:
$$\mathrm{BLEU} = BP \cdot \left(\prod_{i=1}^{n} P_i\right)^{1/n}$$
Produces score on a 0-1 scale
– often expressed as a “percentage” (e.g., * 100)
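The formulas above can be turned into a short sentence-level Python sketch. Several choices here are assumptions not stated on the slide: tokens are lowercased and split on whitespace (so punctuation stays attached to words), matched counts are clipped to the maximum count in any single reference, r is the reference length closest to the candidate length, and the score is 0 whenever some n-gram order has no match. Real BLEU is normally computed over a whole test set, not a single sentence.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    cand = candidate.lower().split()
    refs = [ref.lower().split() for ref in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        total = sum(cand_counts.values())
        # For each n-gram, the most times it appears in any one reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # geometric mean is zero if any precision is zero
        log_precisions.append(math.log(matched / total))
    c = len(cand)
    r = min((abs(len(ref) - c), len(ref)) for ref in refs)[1]  # closest reference length
    brevity_penalty = math.exp(min(1 - r / c, 0))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


references = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
output_1 = "It is a guide to action which ensures that the military always obeys the commands of the party."
output_2 = "It is to insure the troops forever hearing the activity guidebook that party direct."

print(bleu(output_1, references) > bleu(output_2, references))
```

On the slide's own example, the first MT output shares long word sequences with the references and gets a positive score, while the second shares no trigram with any reference and so scores 0 under this unsmoothed variant.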

BLEU results circa 2002
[from Papineni et al., ACL 2002] [from G. Doddington, NIST]
Distinguishes humans from machines… …correlates well with human judgments
However nowadays we’re starting to see problems:
- Some systems score better than human translations
- In competitions, some “gaming of BLEU”
- Rule based systems are at a disadvantage after tuning

Next Time
MT & Word Alignment
Application of EM