CSCI 5832 Natural Language Processing

Slides:



Advertisements
Similar presentations
Statistical Machine Translation
Advertisements

Problems for Statistical MT Preprocessing Language modeling Translation modeling Decoding Parameter optimization Evaluation.
Introduction to Statistical Machine Translation Philipp Koehn Kevin Knight USC/Information Sciences Institute USC/Computer Science Department CSAIL Massachusetts.
Computing & Information Sciences Kansas State University Wednesday, 29 Nov 2006CIS 490 / 730: Artificial Intelligence Lecture 39 of 42 Wednesday, 29 November.
Computing & Information Sciences Kansas State University Lecture 38 of 42 CIS 530 / 730 Artificial Intelligence Lecture 38 of 42 Natural Language Processing,
Machine Translation Domain Adaptation Day PROJECT #2 2.
+. + Natural Language Processing CS311, Spring 2013 David Kauchak.
CSCI 5582 Fall 2006 CSCI 5582 Artificial Intelligence Lecture 24 Jim Martin.
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture 14b 24 August 2007.
Machine Translation: Introduction Slides from: Dan Jurafsky.
Statistical Machine Translation Kevin Knight USC/Information Sciences Institute USC/Computer Science Department.
Introduction to Statistical Machine Translation Philipp Koehn USC/Information Sciences Institute USC/Computer Science Department School of Informatics.
CS 188: Artificial Intelligence Spring 2007 Lecture 25: Machine Translation 4/24/2007 Srini Narayanan – ICSI and UC Berkeley.
Statistical Machine Translation Part II: Word Alignments and EM Alexander Fraser Institute for Natural Language Processing University of Stuttgart
Statistical Machine Translation Part II: Word Alignments and EM Alexander Fraser ICL, U. Heidelberg CIS, LMU München Statistical Machine Translation.
Statistical Machine Translation Part II – Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart
Machine Translation- 4 Autumn 2008 Lecture Sep 2008.
Machine Translation Introduction to Statistical MT.
MT Evaluation CA446 Week 9 Andy Way School of Computing Dublin City University, Dublin 9, Ireland
Word Alignment Philipp Koehn USC/Information Sciences Institute USC/Computer Science Department School of Informatics University of Edinburgh Some slides.
A Phrase-Based, Joint Probability Model for Statistical Machine Translation Daniel Marcu, William Wong(2002) Presented by Ping Yu 01/17/2006.
CSCI 5582 Artificial Intelligence
Statistical Phrase-Based Translation Authors: Koehn, Och, Marcu Presented by Albert Bertram Titles, charts, graphs, figures and tables were extracted from.
CSCI 5582 Fall 2006 CSCI 5582 Artificial Intelligence Lecture 26 Jim Martin.
Introduction to Statistical Machine Translation Philipp Koehn Kevin Knight USC/Information Sciences Institute USC/Computer Science Department CSAIL Massachusetts.
1 Statistical NLP: Lecture 13 Statistical Alignment and Machine Translation.
Jan 2005Statistical MT1 CSA4050: Advanced Techniques in NLP Machine Translation III Statistical MT.
Automatic Evaluation Philipp Koehn Computer Science and Artificial Intelligence Lab Massachusetts Institute of Technology.
Natural Language Processing Expectation Maximization.
Translation Model Parameters (adapted from notes from Philipp Koehn & Mary Hearne) 24 th March 2011 Dr. Declan Groves, CNGL, DCU
Machine Translation- 5 Autumn 2008 Lecture Sep 2008.
Computing & Information Sciences Kansas State University Friday, 01 Dec 2006CIS 490 / 730: Artificial Intelligence Lecture 40 of 42 Friday, 01 December.
An Introduction to SMT Andy Way, DCU. Statistical Machine Translation (SMT) Translation Model Language Model Bilingual and Monolingual Data* Decoder:
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture 16 5 September 2007.
SMT – Final thoughts Philipp Koehn USC/Information Sciences Institute USC/Computer Science Department School of Informatics University of Edinburgh Some.
Computing & Information Sciences Kansas State University Rabat, MoroccoHuman Language Technologies Workshop Human Language Technologies (HLT) Workshop.
Machine Translation Course 5 Diana Trandab ă ț Academic year:
11 Chapter 14 Part 1 Statistical Parsing Based on slides by Ray Mooney.
Machine Translation- 2 Autumn 2008 Lecture 17 4 Sep 2008.
Statistical Machine Translation Part III – Phrase-based SMT / Decoding Alexander Fraser Institute for Natural Language Processing Universität Stuttgart.
NLP. Machine Translation Source-channel model of communication Parametric probabilistic models of language and translation.
Statistical Machine Translation Part II: Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart
Natural Language Processing Lecture 23—12/1/2015 Jim Martin.
Machine Translation Diana Trandab ă ţ Academic Year
Spring 2010 Lecture 2 Kristina Toutanova MSR & UW With slides borrowed from Philipp Koehn, Kevin Knight, Chris Quirk LING 575: Seminar on statistical machine.
Ling 575: Machine Translation Yuval Marton Winter 2016 January 19: Spill-over from last class, some prob+stats, word alignment, phrase-based and hierarchical.
Computing & Information Sciences Kansas State University Friday, 05 Dec 2008CIS 530 / 730: Artificial Intelligence Lecture 39 of 42 Friday, 05 December.
Machine Translation, Statistical Approach Heshaam Faili Natural Language and Text Processing Laboratory School of Electrical and Computer Engineering,
Learning in Search Slide 1 Hal Daumé III Learning in Search Hal Daumé III Information Sciences Institute University of Southern California.
Statistical Machine Translation Part II: Word Alignments and EM
CSC 594 Topics in AI – Natural Language Processing
CSE 517 Natural Language Processing Winter 2015
Alexander Fraser CIS, LMU München Machine Translation
Statistical NLP: Lecture 13
Statistical Machine Translation
Machine Translation: Introduction
Statistical Machine Translation Part III – Phrase-based SMT / Decoding
CSCI 5832 Natural Language Processing
CSCI 5832 Natural Language Processing
CSCI 5832 Natural Language Processing
LING 180 SYMBSYS 138 Intro to Computer Speech and Language Processing
Machine Translation(MT)
Word Alignment David Kauchak CS159 – Fall 2019 Philipp Koehn
Introduction to Statistical Machine Translation
Lecture 12: Machine Translation (II) November 4, 2004 Dan Jurafsky
CSCI 5832 Natural Language Processing
Machine Translation: Word alignment models
SMT – Final thoughts David Kauchak CS159 – Spring 2019
Presented By: Sparsh Gupta Anmol Popli Hammad Abdullah Ayyubi
CSCI 5582 Artificial Intelligence
Presentation transcript:

CSCI 5832 Natural Language Processing Jim Martin Lecture 23 11/7/2018

Change in plans Going straight to Chapter 25 (translation) I’ll come back to 23 (q/a, summarization) 11/7/2018

Slides mostly stolen from Kevin Knight (USC/ISI) Machine Translation Slides mostly stolen from Kevin Knight (USC/ISI) 11/7/2018

Today 4/22 Machine translation framework State of the art results Evaluation methods Word-based models 11/7/2018

Progress 2002 2003 insistent Wednesday may recurred her trips to Libya tomorrow for flying Cairo 6-4 ( AFP ) - An official announced today in the Egyptian lines company for flying Tuesday is a company "insistent for flying" may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment. Egyptair Has Tomorrow to Resume Its Flights to Libya Cairo 4-6 (AFP) - Said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya. 11/7/2018

Commercial Applications news broadcast foreign language speech recognition English translation searchable archive 11/7/2018

Commercial Applications 11/7/2018

Statistical Machine Translation Hmm, every time he sees “banco”, he either types “bank” or “bench” … but if he sees “banco de…”, he always types “bank”, never “bench”… Man, this is so boring. Translated documents 11/7/2018

Things are Consistently Improving Annual evaluation of Arabic-to-English MT systems Translation quality 70 60 50 40 30 20 Exceeded commercial-grade translation here. 10 2002 2003 2004 2005 2006 11/7/2018

Progress Driven by Empirical Measures of Success Translation quality 35 30 25 20 USC/ISI Syntax-Based MT System. Chinese/English NIST 2002 Test Set 15 Mar 1 Apr 1 May 1 2005 11/7/2018

Current Approaches Same old noisy channel model… If we’re translating French to English the French we’re seeing is just a weird garbled version of English There must have been some process that generated the French from the original English The key is to decode the garbles back into the original English by… Argmax P(E | F) by Bayes A very old idea 11/7/2018

Warren Weaver (1947) When I look at an article in Russian, I say to myself: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode. 11/7/2018

Centauri/Arcturan [Knight, 1997] Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . 11/7/2018

Centauri/Arcturan [Knight, 1997] Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . 11/7/2018

Centauri/Arcturan [Knight, 1997] Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . 11/7/2018

Centauri/Arcturan [Knight, 1997] Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . ??? 11/7/2018

Centauri/Arcturan [Knight, 1997] Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . 11/7/2018

Centauri/Arcturan [Knight, 1997] Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . 11/7/2018

Centauri/Arcturan [Knight, 1997] Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . 11/7/2018

Centauri/Arcturan [Knight, 1997] Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . ??? 11/7/2018

Centauri/Arcturan [Knight, 1997] Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . 11/7/2018

Centauri/Arcturan [Knight, 1997] Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . process of elimination 11/7/2018

Centauri/Arcturan [Knight, 1997] Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . cognate? 11/7/2018

Centauri/Arcturan [Knight, 1997] Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp } 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . zero fertility 11/7/2018

Translate: Clients do not sell pharmaceuticals in Europe. Spanish/English text Translate: Clients do not sell pharmaceuticals in Europe. 1a. Garcia and associates . 1b. Garcia y asociados . 7a. the clients and the associates are enemies . 7b. los clients y los asociados son enemigos . 2a. Carlos Garcia has three associates . 2b. Carlos Garcia tiene tres asociados . 8a. the company has three groups . 8b. la empresa tiene tres grupos . 3a. his associates are not strong . 3b. sus asociados no son fuertes . 9a. its groups are in Europe . 9b. sus grupos estan en Europa . 4a. Garcia has a company also . 4b. Garcia tambien tiene una empresa . 10a. the modern groups sell strong pharmaceuticals . 10b. los grupos modernos venden medicinas fuertes . 5a. its clients are angry . 5b. sus clientes estan enfadados . 11a. the groups do not sell zenzanine . 11b. los grupos no venden zanzanina . 6a. the associates are also angry . 6b. los asociados tambien estan enfadados . 12a. the small groups are not modern . 12b. los grupos pequenos no son modernos .   11/7/2018

Bilingual Training Data Millions of words (English side) + 1m-20m words for many language pairs (Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn). 11/7/2018

Sample Learning Curves Swedish/English French/English German/English Finnish/English BLEU score # of sentence pairs used in training Experiments by Philipp Koehn 11/7/2018

MT Evaluation Traditionally difficult because there is no single “right answer”. 20 human translators will translate the same sentence 20 different ways. 11/7/2018

Evaluation Metric (BLEU) Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport . Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance. N-gram precision (score is between 0 & 1) What percentage of machine n-grams can be found in the reference translation? An n-gram is an sequence of n words Not allowed to use same portion of reference translation twice (can’t cheat by typing out “the the the the the”) Brevity penalty Can’t just type out single word “the” (precision 1.0!) Amazingly hard to “game” the system (i.e., find a way to change machine output so that BLEU goes up, but quality doesn’t) Contra doesn’t hold. Can find perfectly good improvements that hurt, or don’t help, BLEU 11/7/2018

Multiple Reference Translations Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport . Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden , which threatens to launch a biochemical attack on such public places as airport . Guam authority has been on alert . Reference translation 4: US Guam International Airport and its office received an email from Mr. Bin Laden and other rich businessman from Saudi Arabia . They said there would be biochemistry air raid to Guam Airport and other public places . Guam needs to be in high precaution about this matter . Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places . Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance. 11/7/2018

BLEU in Action 枪手被警方击毙。 (Foreign Original) the gunman was shot to death by the police . (Reference Translation) System Outputs the gunman was police kill . #1 wounded police jaya of #2 the gunman was shot dead by the police . #3 the gunman arrested by police kill . #4 the gunmen were killed . #5 the gunman was shot to death by the police . #6 gunmen were killed by police #7 al by the police . #8 the ringer is killed by the police . #9 police killed the gunman . #10 11/7/2018

BLEU in Action 枪手被警方击毙。 (Foreign Original) the gunman was shot to death by the police . (Reference Translation) System Outputs the gunman was police kill . #1 wounded police jaya of #2 the gunman was shot dead by the police . #3 the gunman arrested by police kill . #4 the gunmen were killed . #5 the gunman was shot to death by the police . #6 gunmen were killed by police #7 al by the police . #8 the ringer is killed by the police . #9 police killed the gunman . #10 green = 4-gram match (good!) red = word not matched (bad!) 11/7/2018

NIST 2006 Results 11/7/2018

Statistical MT Systems Spanish Garbled English Spanish/English Bilingual Text Text Statistical Analysis What hunger have I, Hungry I am so, I am so hungry, Have I that hunger … Que hambre tengo yo I am so hungry 11/7/2018

Statistical MT Systems Spanish/English Bilingual Text English Text Statistical Analysis Statistical Analysis Garbled English Spanish English Translation Model P(s|e) Language Model P(e) Que hambre tengo yo I am so hungry Decoding algorithm argmax P(e) * P(s|e) e 11/7/2018

Bayes Rule/Noisy Channel Garbled English Spanish English Translation Model P(s|e) Language Model P(e) Que hambre tengo yo I am so hungry Decoding algorithm argmax P(e) * P(s|e) e Given a source sentence s, the decoder should consider many possible translations … and return the target string e that maximizes P(e | s) By Bayes Rule, we can also write this as: P(e) x P(s | e) / P(s) and maximize that instead. P(s) never changes while we compare different e’s, so we can equivalently maximize this: P(e) x P(s | e) 11/7/2018

Three Sub-Problems of Statistical MT Language model Given an English string e, assigns P(e) by formula good English string -> high P(e) random word sequence -> low P(e) Translation model Given a pair of strings <f,e>, assigns P(f | e) by formula <f,e> look like translations -> high P(f | e) <f,e> don’t look like translations -> low P(f | e) Decoding algorithm Given a language model, a translation model, and a new sentence f … find translation e maximizing P(e) * P(f | e) 11/7/2018

Translation Model Mary did not slap the green witch Generative story: Mary did not slap the green witch Source-language morphological analysis Source parse tree Semantic representation Generate target structure Maria no dió una botefada a la bruja verde 11/7/2018

Translation Model? Mary did not slap the green witch Generative story: Mary did not slap the green witch Source-language morphological analysis Source parse tree Semantic representation Generate target structure Way too hard. Maria no dió una botefada a la bruja verde 11/7/2018

The Classic Translation Model Word Substitution/Permutation [IBM Model 3, Brown et al., 1993] Generative story: Mary did not slap the green witch n(3|slap) Mary not slap slap slap the green witch p-Null Mary not slap slap slap NULL the green witch t(la|the) Maria no dió una botefada a la verde bruja d(j|i) Maria no dió una botefada a la bruja verde 11/7/2018

Parts List We need probabilities for n (x|y) The probability that word y will yield x outputs in the translation… (fertility) p The probability of a null insertion t The actual word translation probability table d(j|i) the probability that a word at position i will make an appearance at position j in the translation 11/7/2018

Parts List Every one of these can be learned from a sentence aligned corpus… Ie. A corpus where sentences are paired but nothing else is specified And the EM algorithm 11/7/2018

Word Alignment … la maison … la maison bleue … la fleur … … the house … the blue house … the flower … Assume that All word alignments equally likely. That is, that all P(french-word | english-word) are equal Recall that we want P(f|e) 11/7/2018

Word Alignment … la maison … la maison bleue … la fleur … … the house … the blue house … the flower … “la” and “the” observed to co-occur frequently, so P(la | the) is increased. 11/7/2018

Word Alignment … la maison … la maison bleue … la fleur … … the house … the blue house … the flower … 11/7/2018

settling down after another iteration Word Alignment … la maison … la maison bleue … la fleur … … the house … the blue house … the flower … settling down after another iteration 11/7/2018

Word Alignment … la maison … la maison bleue … la fleur … … the house … the blue house … the flower … Inherent hidden structure revealed by EM training 11/7/2018

Parts List Given a sentence alignment we can induce a word alignment Given that word alignment we can get the p, t, d and n parameters we need for the model. Ie. We can argmax P(e|f) by maxing over P(f|e)*P(e)… and we can do that by iterating over some large space of f possibilities. 11/7/2018

Decoding Remember Viterbi? Just a fancier Viterbi Given foreign sentence f, find English sentence e that maximizes P(e) x P(f | e) 11/7/2018

Decoding Que hambre tengo yo what hunger have I that hungry am me so make where 11/7/2018

Decoding Que hambre tengo yo what hunger have I that hungry am me so make where 11/7/2018

Decoding Que hambre tengo yo what hunger have I that hungry am me so make where 11/7/2018

Decoding Que hambre tengo yo what hunger have I that hungry am me so make where 11/7/2018

Decoder: Actually Translates New Sentences 1st target word 2nd target word 3rd target word 4th target word start end all source words covered Each partial translation hypothesis contains: - Last English word chosen + source words covered by it - Next-to-last English word chosen - Entire coverage vector (so far) of source sentence - Language model and translation model scores (so far) 11/7/2018

Dynamic Programming Beam Search 1st target word 2nd target word 3rd target word 4th target word best predecessor link start end all source words covered Each partial translation hypothesis contains: - Last English word chosen + source words covered by it - Next-to-last English word chosen - Entire coverage vector (so far) of source sentence - Language model and translation model scores (so far) [Jelinek, 1969; Brown et al, 1996 US Patent; (Och, Ueffing, and Ney, 2001] 11/7/2018

Flaws of Word-Based MT Multiple English words for one foreign word IBM models can do one-to-many (fertility) but not many-to-one Phrasal Translation “real estate”, “note that”, “interest in” Syntactic Transformations Verb at the beginning in Arabic Translation model penalizes any proposed re-ordering Language model not strong enough to force the verb to move to the right place 11/7/2018

Next Time Read Chapter 25 Guest presentation Tuesday Tom Chikoore, Founder/CTO filtrbox Tuesday EM alignment example Phrase-based translation 11/7/2018