CS 188: Artificial Intelligence Spring 2007 Lecture 25: Machine Translation 4/24/2007 Srini Narayanan – ICSI and UC Berkeley.



Announcements  Assignment 7 is up.  Grid-world and robot crawler.  Due 5/3.  Extra Office Hours first two weeks of May  This week as usual Thursday 11-1 PM  5/2 extra (Tuesday 11-1 PM)  5/3 usual 11-1 PM  Next assignment (not graded) will be a final exam review.

Reinforcement Learning  What you should know  MDPs  Basics, discounted reward  Policy Evaluation  Bellman’s equation  Value iteration  Policy iteration  Reinforcement Learning  Adaptive Dynamic Programming  TD learning (Model-free)  Q Learning

Where we are  Past:  Basic Techniques of AI  Search, Representation, Uncertainty and Inference, Learning  Next  Applications  MT, NLU (this week)  Neural Computation, Perception (next week).  Today: Machine Translation (MT)  (Semi) Automatically translating text/speech from one language to another.

Translation is hard
- In a Bucharest hotel lobby: The lift is being fixed for the next day. During that time we regret that you will be unbearable.
- In a Paris hotel elevator: Please leave your values at the front desk.
- In a hotel in Athens: Visitors are expected to complain at the office between the hours of 9 and 11 a.m. daily.
- In a Japanese hotel: You are invited to take advantage of the chambermaid.
- In the lobby of a Moscow hotel across from a Russian Orthodox monastery: You are welcome to visit the cemetery where famous Russian and Soviet composers, artists, and writers are buried daily except Thursday.

MT History  1946 (Pre-AI) Booth and Weaver discuss MT at Rockefeller foundation in New York;  idea of dictionary-based direct translation  1949 Weaver memorandum popularized idea  1952 all 18 MT researchers in world meet at MIT  1954 IBM/Georgetown Demo Russian-English MT  lots of labs take up MT

Early translation problems  English to Russian to English  The spirit is willing but the flesh is weak.  The vodka is good but the meat is rotten.

History of MT: Pessimism
- 1959/1960: Bar-Hillel, “Report on the state of MT in US and GB”
  - Argued that fully automatic high-quality translation (FAHQT) is too hard (semantic ambiguity, etc.)
  - Should work on semi-automatic instead of automatic translation
- His argument: “Little John was looking for his toy box. Finally, he found it. The box was in the pen. John was very happy.”
  - Only human knowledge lets us know that ‘playpens’ are bigger than boxes, but ‘writing pens’ are smaller
- His claim: we would have to encode all of human knowledge

History of MT  Systran (Babelfish) been used for 30 years  1970’s:  European focus in MT; mainly ignored in US  1980’s  ideas of using AI techniques in MT (KBMT, CMU)  1990’s  Commercial MT systems  Statistical MT (SMT), Speech-to-speech translation  2000’s  SMT matures to be an exciting AI technology  Well funded, high-payoff, can make a real difference.

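Levels of Transfer (Vauquois triangle)
- Analysis side (source, bottom to top): Source Text → [Morphological Analysis] → Word Structure → [Syntactic Analysis] → Syntactic Structure → [Semantic Analysis] → Semantic Structure → [Semantic Composition] → Interlingua
- Generation side (target, top to bottom): Interlingua → [Semantic Decomposition] → Semantic Structure → [Semantic Generation] → Syntactic Structure → [Syntactic Generation] → Word Structure → [Morphological Generation] → Target Text
- Transfer between the two sides, from shallowest to deepest: Direct, Syntactic Transfer, Semantic Transfer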

General Approaches  Rule-based approaches  Expert system style rewrite systems  Interlingua methods (analyze and generate)  Lexicons come from humans or dictionaries  Can be very fast, and can accumulate a lot of knowledge over time (e.g. Systran)  Statistical approaches  Noisy channel systems  Lower-level transfer  Lexicons discovered using parallel corpora  Require little human declaration of knowledge

What makes a good translation  Translators often talk about two factors we want to maximize:  Faithfulness or fidelity  How close is the meaning of the translation to the meaning of the original  (Even better: does the translation cause the reader to draw the same inferences as the original would have)  Fluency or naturalness  How natural the translation is, just considering its fluency in the target language

The Coding View  “One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’ ”  Warren Weaver (1955:18, quoting a letter he wrote in 1947)

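MT System Components (noisy channel)
- Source: the Language Model P(e) generates an English sentence e
- Channel: the Translation Model P(f|e) turns e into the observed French sentence f
- Decoder: given the observed f, finds the best e:
  $\hat{e} = \arg\max_e P(e \mid f) = \arg\max_e P(f \mid e)\,P(e)$
- Finds an English translation which is both fluent and semantically faithful to the French source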
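
The argmax on this slide can be sketched as a rescoring step over a fixed list of candidate translations. This is only an illustration: the candidate sentences and every probability below are invented toy values, and `lm_prob`, `tm_prob`, and `decode` are hypothetical names, not part of any system mentioned in the lecture.

```python
import math

# Hypothetical toy scores, invented for illustration only.
# P(e): how fluent each English candidate is according to a language model.
lm_prob = {
    "the house is small": 0.02,
    "small the is house": 0.00001,
    "the home is little": 0.005,
}
# P(f | e): how well each candidate explains f = "la maison est petite".
tm_prob = {
    "the house is small": 0.1,
    "small the is house": 0.1,
    "the home is little": 0.05,
}

def decode(candidates):
    """Return the candidate e that maximizes log P(f|e) + log P(e)."""
    return max(candidates,
               key=lambda e: math.log(tm_prob[e]) + math.log(lm_prob[e]))

best = decode(list(lm_prob))
print(best)
```

Note that the scrambled candidate has the same translation-model score as the fluent one; only the language model term separates them, which is the division of labor the slide describes.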

The Classic Language Model: Word N-Grams
Generative approach:
  w1 = START
  repeat until END is generated:
    produce word w2 according to a big table P(w2 | w1)
    w1 := w2
P(I saw water on the table) = P(I | START) * P(saw | I) * P(water | saw) * P(on | water) * P(the | on) * P(table | the) * P(END | table)
Probabilities can be learned from online English text.
(Diagram: a chain START → w1 → w2 → … → w n-1 → END)
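
A minimal sketch of the bigram table and the sentence probability above, assuming plain relative-frequency estimates with no smoothing. The two-sentence corpus and the function names `train_bigram` and `sentence_prob` are made up for illustration; a real model would be trained on far more text and would smooth unseen bigrams.

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate P(w2 | w1) by relative frequency from tokenized sentences."""
    pair_counts, context_counts = Counter(), Counter()
    for words in sentences:
        tokens = ["START"] + words + ["END"]
        for w1, w2 in zip(tokens, tokens[1:]):
            pair_counts[(w1, w2)] += 1
            context_counts[w1] += 1
    return {pair: c / context_counts[pair[0]] for pair, c in pair_counts.items()}

def sentence_prob(model, words):
    """Multiply bigram probabilities from START through END."""
    tokens = ["START"] + words + ["END"]
    p = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        p *= model.get((w1, w2), 0.0)  # unseen bigram: probability 0 (no smoothing)
    return p

# Hypothetical toy corpus.
corpus = [
    "I saw water on the table".split(),
    "I saw the table".split(),
]
model = train_bigram(corpus)
p = sentence_prob(model, "I saw water on the table".split())
print(p)
```

In this toy corpus the only uncertain step is what follows "saw" ("water" or "the"), so the scrambled word order "table the saw I" gets probability zero while the original sentence does not.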

Parallel Corpora  Parallel corpora (or bitexts)  Collection of source- target translation pairs  Main resource for learning a translation model  Either naturally occurring (e.g. parliamentary proceedings, news translation services) or commissioned

Building a Translation Model
- Steps in building a simple statistical translation model
  - Match up words in training sentence pairs (word alignment)
  - Learn a lexicon from these alignments
  - Learn larger phrases
- Example sentence pair:
  What is the anticipated cost of collecting fees under the new proposal ?
  En vertu de les nouvelles propositions, quel est le coût prévu de perception de les droits ?
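
The "learn a lexicon from these alignments" step can be sketched as relative-frequency counting over aligned word pairs. The three short sentence pairs and their alignment links below are hand-made assumptions for illustration (they are not taken from the slide's corpus), and `learn_lexicon` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Hypothetical hand-made alignments: (English words, French words, links),
# where a link (i, j) says English word i is aligned to French word j.
aligned_pairs = [
    ("the new proposal".split(), "les nouvelles propositions".split(),
     [(0, 0), (1, 1), (2, 2)]),
    ("the new cost".split(), "le nouveau coût".split(),
     [(0, 0), (1, 1), (2, 2)]),
    ("the anticipated cost".split(), "le coût prévu".split(),
     [(0, 0), (1, 2), (2, 1)]),
]

def learn_lexicon(pairs):
    """Estimate P(f | e) as count(e aligned to f) / count(e aligned to anything)."""
    counts = defaultdict(Counter)
    for english, french, links in pairs:
        for i, j in links:
            counts[english[i]][french[j]] += 1
    return {e: {f: c / sum(fs.values()) for f, c in fs.items()}
            for e, fs in counts.items()}

lexicon = learn_lexicon(aligned_pairs)
print(lexicon["the"])
print(lexicon["cost"])
```

The third pair has crossing links, which is why alignments are stored as index pairs rather than assumed to be left to right.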

Centauri/Arcturan [Knight, 1997]
1a. ok-voon ororok sprok.
1b. at-voon bichat dat.
2a. ok-drubel ok-voon anok plok sprok.
2b. at-drubel at-voon pippat rrat dat.
3a. erok sprok izok hihok ghirok.
3b. totat dat arrat vat hilat.
4a. ok-voon anok drok brok jok.
4b. at-voon krat pippat sat lat.
5a. wiwok farok izok stok.
5b. totat jjat quat cat.
6a. lalok sprok izok jok stok.
6b. wat dat krat quat cat.
7a. lalok farok ororok lalok sprok izok enemok.
7b. wat jjat bichat wat dat vat eneat.
8a. lalok brok anok plok nok.
8b. iat lat pippat rrat nnat.
9a. wiwok nok izok kantok ok-yurp.
9b. totat nnat quat oloat at-yurp.
10a. lalok mok nok yorok ghirok clok.
10b. wat nnat gat mat bat hilat.
11a. lalok nok crrrok hihok yorok zanzanok.
11b. wat nnat arrat mat zanzanat.
12a. lalok rarok nok izok hihok mok.
12b. wat nnat forat arrat vat gat.
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp


Centauri/Arcturan [Knight, 1997] (same corpus and assignment as above). Slide annotation: process of elimination

Centauri/Arcturan [Knight, 1997] (same corpus and assignment as above). Slide annotation: cognate?

Centauri/Arcturan [Knight, 1997] (same corpus as above). Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }. Slide annotation: zero fertility

It’s Really Spanish/English
Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa
1a. Garcia and associates.
1b. Garcia y asociados.
2a. Carlos Garcia has three associates.
2b. Carlos Garcia tiene tres asociados.
3a. his associates are not strong.
3b. sus asociados no son fuertes.
4a. Garcia has a company also.
4b. Garcia tambien tiene una empresa.
5a. its clients are angry.
5b. sus clientes estan enfadados.
6a. the associates are also angry.
6b. los asociados tambien estan enfadados.
7a. the clients and the associates are enemies.
7b. los clientes y los asociados son enemigos.
8a. the company has three groups.
8b. la empresa tiene tres grupos.
9a. its groups are in Europe.
9b. sus grupos estan en Europa.
10a. the modern groups sell strong pharmaceuticals.
10b. los grupos modernos venden medicinas fuertes.
11a. the groups do not sell zanzanine.
11b. los grupos no venden zanzanina.
12a. the small groups are not modern.
12b. los grupos pequenos no son modernos.

Statistical Machine Translation … la maison … la maison bleue … la fleur … … the house … the blue house … the flower … All word alignments equally likely All P(french-word | english-word) equally likely

Statistical Machine Translation … la maison … la maison bleue … la fleur … … the house … the blue house … the flower … “la” and “the” observed to co-occur frequently, so P(la | the) is increased.

Statistical Machine Translation … la maison … la maison bleue … la fleur … … the house … the blue house … the flower … “house” co-occurs with both “la” and “maison”, but P(maison | house) can be raised without limit, to 1.0, while P(la | house) is limited because of “the” (pigeonhole principle)

Statistical Machine Translation … la maison … la maison bleue … la fleur … … the house … the blue house … the flower … settling down after another iteration

Statistical Machine Translation … la maison … la maison bleue … la fleur … … the house … the blue house … the flower … Inherent hidden structure revealed by EM training! For details, see: “A Statistical MT Tutorial Workbook” (Knight, 1999); “The Mathematics of Statistical Machine Translation” (Brown et al., 1993). Software: GIZA++
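
The EM walk-through on the last few slides can be sketched on the same three sentence pairs. This is a simplified IBM Model 1 style trainer written for illustration, assuming no NULL word and a uniform start over co-occurring word pairs; it is not the GIZA++ implementation, and `train_model1` is a hypothetical name.

```python
from collections import defaultdict

# The three sentence pairs from the slides: (English words, French words).
corpus = [
    ("the house".split(), "la maison".split()),
    ("the blue house".split(), "la maison bleue".split()),
    ("the flower".split(), "la fleur".split()),
]

def train_model1(corpus, iterations=20):
    """Return t[(f, e)] = P(f | e), estimated with EM (no NULL word)."""
    french_vocab = {f for _, french in corpus for f in french}
    # Start: all P(french-word | english-word) equally likely.
    t = {(f, e): 1.0 / len(french_vocab)
         for english, french in corpus for f in french for e in english}
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of times e generates f
        total = defaultdict(float)  # expected number of words generated by e
        for english, french in corpus:
            for f in french:
                # E-step: share f among the English words of this sentence
                # in proportion to the current P(f | e).
                norm = sum(t[(f, e)] for e in english)
                for e in english:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize expected counts into probabilities.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t

t = train_model1(corpus)
for f in ("la", "maison", "bleue"):
    print(f, "| house:", round(t[(f, "house")], 3))
```

Because "la" is increasingly explained by "the", the probability mass that "house" initially spent on "la" shifts toward "maison" from one iteration to the next, which is the effect the slides describe.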

Decoding  Now we have a phrase table:  A huge list of translation phrases (e.g. 1M phrases)  Each phrase has a probability P(f|e)  When we see a new input sentence:  Grow a translation left to right  Extend translation using known phrases  Also multiply by language model score

The Pharaoh Decoder  Probabilities at each step include LM and TM

Recent Progress in Statistical MT (slide from C. Wayne, DARPA)
insistent Wednesday may recurred her trips to Libya tomorrow for flying Cairo 6-4 ( AFP ) - an official announced today in the Egyptian lines company for flying Tuesday is a company " insistent for flying " may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment. And said the official " the institution sent a speech to Ministry of Foreign Affairs of lifting on Libya air, a situation her receiving replying are so a trip will pull to Libya a morning Wednesday ".
Egyptair Has Tomorrow to Resume Its Flights to Libya Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya. " The official said that the company had sent a letter to the Ministry of Foreign Affairs, information on the lifting of the air embargo on Libya, where it had received a response, the first take off a trip to Libya on Wednesday morning "

Statistical Machine Translation … la maison … la maison bleue … la fleur … … the house … the blue house … the flower … P(juste | fair) = P(juste | correct) = P(juste | right) = … new French sentence Possible English translations, to be rescored by language model

What is MT not (yet) good for?  Really hard stuff  Literature  Natural spoken speech (meetings, court reporting)  Really important stuff  Medical translation in hospitals, 911

What is MT good for?  Tasks for which a rough translation is fine  Web pages,  Multilingual Speech-based queries  Tasks for which MT can be post-edited  MT as first pass  “Computer-aided human translation”  Tasks in sublanguage domains where high-quality MT is possible

The next five years  Bootstrapping Resources  Trying to design better learning methods to work from scarce data (see Knight 2003, Plauche et al 2007)  Germann and the ISI experiment in Tamil  MT in a month  100K tokens achieved tolerable performance in 2002  Including Syntactic/Semantic Information in SMT  Markup on the Web  Multi-lingual Lexical resources  WordNet PropBank FrameNet  Combining MT methods

Pos | Language | Family | Script(s) Used | Speakers (millions) | Where Spoken (Major)
1 | Mandarin | Sino-Tibetan | Chinese Characters | 1051 | China, Malaysia, Taiwan
2 | English | Indo-European | Latin | 510 | USA, UK, Australia, Canada, New Zealand
3 | Hindi | Indo-European | Devanagari | 490 | North and Central India
4 | Spanish | Indo-European | Latin | 425 | The Americas, Spain
5 | Arabic | Afro-Asiatic | Arabic | 255 | Middle East, Arabia, North Africa
6 | Russian | Indo-European | Cyrillic | 254 | Russia, Central Asia
7 | Portuguese | Indo-European | Latin | 218 | Brazil, Portugal, Southern Africa
8 | Bengali | Indo-European | Bengali | 215 | Bangladesh, Eastern India
9 | Indonesian | Malayo-Polynesian | Latin | 175 | Indonesia, Malaysia, Singapore
10 | French | Indo-European | Latin | 130 | France, Canada, West Africa, Central Africa
11 | Japanese | Altaic | Chinese Characters and 2 Japanese Alphabets | 127 | Japan
12 | German | Indo-European | Latin | 123 | Germany, Austria, Central Europe
13 | Farsi (Persian) | Indo-European | Nastaliq | 110 | Iran, Afghanistan, Central Asia
14 | Urdu | Indo-European | Nastaliq | 104 | Pakistan, India
15 | Punjabi | Indo-European | Gurumukhi | 103 | Pakistan, India
16 | Vietnamese | Austroasiatic | Based on Latin | 86 | Vietnam, China
17 | Tamil | Dravidian | Tamil | 78 | Southern India, Sri Lanka, Malaysia
18 | Wu | Sino-Tibetan | Chinese Characters | 77 | China
19 | Javanese | Malayo-Polynesian | Javanese | 76 | Indonesia
20 | Turkish | Altaic | Latin | 75 | Turkey, Central Asia
21 | Telugu | Dravidian | Telugu | 74 | Southern India
22 | Korean | Altaic | Hangul | 72 | Korean Peninsula
23 | Marathi | Indo-European | Devanagari | 71 | Western India
24 | Italian | Indo-European | Latin | 61 | Italy, Central Europe
25 | Thai | Sino-Tibetan | Thai | 60 | Thailand, Laos
26 | Cantonese | Sino-Tibetan | Chinese Characters | 55 | Southern China
27 | Gujarati | Indo-European | Gujarati | 47 | Western India, Kenya
28 | Polish | Indo-European | Latin | 46 | Poland, Central Europe
29 | Kannada | Dravidian | Kannada | 44 | Southern India
30 | Burmese | Sino-Tibetan | Burmese | 42 | Myanmar

Top Ten Internet Languages

MT in Developing Countries Traditional Rec Community Rec

Related Berkeley work at TIER  Kiosks / Livelihood  Cellphones for pricing in rural Rwandan coffee markets  Computers and livelihood development in urban slums in Brazil  E-literacy / Entrepreneurship in rural Kerala  Education  Studies of social impacts of Computer Aided Learning in rural areas  Observations of shared computer usage among children in resource strapped areas  Telemedicine  Long-distance diagnosis using b  Teaching  ‘Technology and Development’ graduate class design (see reader/syllabus)  Conference  First peer-reviewed IEEE/ACM conference in series
