Statistical Machine Translation Alona Fyshe Based on slides from Colin Cherry and Dekang Lin.

Basic statistics 0 <= P(x) <= 1. P(A): the probability that A happens. P(A,B): the probability that A and B both happen. P(A|B): the probability that A happens given that we know B happened.

Basic statistics Conditional probability: P(A|B) = P(A,B) / P(B)

Basic Statistics Use the definition of conditional probability to derive the chain rule: P(A,B,C) = P(A) P(B|A) P(C|A,B)

Basic Statistics Bayes rule: P(A|B) = P(B|A) P(A) / P(B)

Basic Statistics Just remember: the definition of conditional probability, Bayes rule, and the chain rule.

Goal Translate. I'll use French (F) into English (E) as the running example.

Oh, Canada I'm Canadian. Mandatory French class in school until grade 6. I speak Cereal Box French: Gratuit (free), Gagner (win), Chocolat (chocolate), Glaçage (icing), Sans gras (fat free), Sans cholestérol (cholesterol free), Élevé dans la fibre (high in fibre).

Oh, Canada

Machine Translation Translation is easy for (bilingual) people. Process: read the text in French, understand it, write it down in English.

Machine Translation Understanding language and writing well-formed text are hard tasks for computers. The human process is invisible, intangible.

One approach: Babelfish A rule-based approach to machine translation. A 30-year-old feat in software engineering. Programming the knowledge in by hand is difficult and expensive.

Alternate Approach: Statistics We are trying to model P(E|F). I give you a French sentence, you give me back English. How are we going to model this? We could use Bayes rule: P(E|F) = P(F|E) P(E) / P(F)

Alternate Approach: Statistics Since P(F) is constant for a given input, argmax_E P(E|F) = argmax_E P(F|E) P(E).

Why Bayes rule at all? Why not model P(E|F) directly? The P(F|E)P(E) decomposition allows us to be sloppy: P(E) worries about good English, P(F|E) worries about French that matches the English, and the two can be trained independently.

Crime Scene Analogy F is a crime scene. E is a person who may have committed the crime. P(E|F): look at the scene; who did it? P(E): who had a motive? (Profiler) P(F|E): could they have done it? (CSI: transportation, access to weapons, alibi) Some people might have great motives, but no means; you need both!

On voit Jon à la télévision ("One sees Jon on television"). Candidate translations, judged on good English? P(E) and good match to French? P(F|E):
Jon appeared in TV.
It back twelve saw.
In Jon appeared TV.
Jon is happy today.
Jon appeared on TV.
TV appeared on Jon.
Jon was not happy.
Table borrowed from Jason Eisner

I speak English good. How are we going to model good English? How do we know these sentences are not good English? Jon appeared in TV. It back twelve saw. In Jon appeared TV. TV appeared on Jon. Je ne parle pas l'anglais.

I speak English good. Je ne parle pas l'anglais. ("I don't speak English.") These aren't English words. It back twelve saw. These are English words, but it's gibberish. Jon appeared in TV. "appeared in TV" isn't proper English.

I speak English good. Let's say we have a huge collection of documents written in English. Like, say, the Internet. It would be a pretty comprehensive list of English words, save for named entities (people, places, things). Might include some non-English words. Speling mitsakes! lol! It could also tell us whether a phrase is good English.

Google, is this good English? Jon appeared in TV.
"Jon appeared": 1,800,000 Google results
"appeared in TV": 45,000 Google results
"appeared on TV": 210,000 Google results
It back twelve saw.
"twelve saw": 1,100 Google results
"It back twelve": 586 Google results
"back twelve saw": 0 Google results
Imperfect counting… why?

Google, is this good English? Language is often modeled this way: collect statistics about the frequency of words and phrases. N-gram statistics: 1-gram = unigram, 2-gram = bigram, 3-gram = trigram, 4-gram = four-gram, 5-gram = five-gram.

Google, is this good English? Seriously, you want to query Google for every phrase in the translation? Google created and released a 5-gram data set. Now you can query Google locally (kind of).
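
Here is a minimal sketch of what a local phrase-count lookup could look like. The file name and the tab-separated format are assumptions for illustration; the real released data set is split across many compressed files.

```python
# Sketch: judge phrases by n-gram counts loaded from a local file.
# Assumes a hypothetical "ngram<TAB>count" text file, one entry per line.

def load_counts(path):
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            ngram, count = line.rstrip("\n").split("\t")
            counts[ngram] = int(count)
    return counts

counts = load_counts("5gram_counts.tsv")  # hypothetical file
for phrase in ["appeared on TV", "appeared in TV", "back twelve saw"]:
    print(phrase, "->", counts.get(phrase, 0))
# Higher counts suggest more conventional English; a zero count is
# suspicious but not proof of bad English (counting is imperfect).
```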

Language Modeling What's P(e)? P(English sentence) = P(e_1, e_2, e_3, …, e_n). Using the chain rule: P(e) = P(e_1) P(e_2|e_1) P(e_3|e_1,e_2) … P(e_n|e_1,…,e_{n-1})

Language Modeling Markov assumption: the choice of word e_i depends only on the n words before e_i. Apply the definition of conditional probability.

Language Modeling P(e_i | e_1, …, e_{i-1}) ≈ P(e_i | e_{i-n}, …, e_{i-1})

Approximate the probability using counts: P(e_i | e_{i-n}, …, e_{i-1}) ≈ count(e_{i-n} … e_i) / count(e_{i-n} … e_{i-1}). Use the n-gram corpus!

Language Modeling Use the n-gram corpus! Not surprisingly, given that you "love to eat", "chocolate" is a probable next word (0.177).
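
A minimal sketch of this count-based estimate on a toy corpus (the corpus and the resulting numbers are invented for illustration and will not match the slide's 0.177):

```python
from collections import Counter

# Toy corpus; in practice you would count over a very large text collection.
corpus = "you love to eat chocolate . you love to eat cake .".split()

n = 3  # a trigram model
ngrams = Counter(tuple(corpus[i:i+n]) for i in range(len(corpus) - n + 1))
contexts = Counter(tuple(corpus[i:i+n-1]) for i in range(len(corpus) - n + 2))

def p(word, context):
    # P(word | context) = count(context + word) / count(context)
    return ngrams[context + (word,)] / contexts[context]

print(p("eat", ("love", "to")))       # 1.0 in this toy corpus
print(p("chocolate", ("to", "eat")))  # 0.5
```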

Language Modeling But what if a count is zero? Then P(e) = 0. This happens even if the sentence is grammatically correct: "Al Gore's pink Hummer was stolen."

Language Modeling Smoothing: many techniques. Add-one smoothing: add one to every count; no more zeros, no problems. Backoff: if P(e_1, e_2, e_3, e_4, e_5) = 0, use something related to P(e_1, e_2, e_3, e_4). A sketch of both ideas follows.
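
A minimal sketch of both techniques, reusing the toy trigram counts from above (the backoff variant shown is a "stupid backoff"-style score with an assumed discount of 0.4, not a true probability):

```python
from collections import Counter

corpus = "you love to eat chocolate . you love to eat cake .".split()
V = len(set(corpus))  # vocabulary size

# n-gram counts for every order up to trigrams.
counts = {n: Counter(tuple(corpus[i:i+n]) for i in range(len(corpus) - n + 1))
          for n in range(1, 4)}

def p_add_one(word, context):
    # Add-one smoothing: add 1 to every count so no probability is zero.
    num = counts[len(context) + 1][context + (word,)] + 1
    den = counts[len(context)][context] + V
    return num / den

def p_backoff(word, context, alpha=0.4):
    # If the full n-gram is unseen, back off to a shorter context,
    # discounted by alpha.
    if not context:
        return counts[1][(word,)] / len(corpus)  # unigram probability
    seen = counts[len(context) + 1][context + (word,)]
    if seen > 0:
        return seen / counts[len(context)][context]
    return alpha * p_backoff(word, context[1:], alpha)

print(p_add_one("chocolate", ("to", "eat")))  # (1+1)/(2+7) ≈ 0.22
print(p_backoff("cake", ("you", "eat")))      # backs off to P(cake|eat)
```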

Language Modeling Wait… is this how people generate English sentences? Do you choose your fifth word based on the four words before it? Admittedly, this is an approximation to a process which is both intangible and hard for humans themselves to explain. If you disagree, and care to defend yourself, consider a PhD in NLP.

Back to Translation Anyway, where were we? Oh right… So, we've got P(e); let's talk P(f|e).

Where will we get P(F|E)? Cereal boxes in English. Same cereal boxes, in French. Machine Learning Magic. P(F|E) model.

Where will we get P(F|E)? Books in English Same books, in French Machine Learning Magic P(F|E) model We call collections stored in two languages parallel corpora or parallel texts Want to update your system? Just add more text!

Translated Corpora The Canadian Parliamentary Debates: available in both French and English. UN documents: available in Arabic, Chinese, English, French, Russian and Spanish.

Problem: How are we going to generalize from examples of translations? I'll spend the rest of this lecture telling you: what makes a useful P(F|E), and how to obtain the statistics needed for P(F|E) from parallel texts.

Strategy: Generative Story When modeling P(X|Y): assume you start with Y; decompose the creation of X from Y into some number of operations; track statistics of the individual operations. For a new example (X,Y), P(X|Y) can be calculated based on the probability of the operations needed to get X from Y.

What if…? The quick fox jumps over the lazy dog Le renard rapide saut par - dessus le chien parasseux

New Information Call this new info a word alignment (A). With A, we can make a good story. The quick fox jumps over the lazy dog Le renard rapide saut par - dessus le chien parasseux

P(F,A|E) Story null The quick fox jumps over the lazy dog

P(F,A|E) Story null The quick fox jumps over the lazy dog f_1 f_2 f_3 … f_10 Simplifying assumption: choose the length of the French sentence f. All lengths have equal probability.

P(F,A|E) Story null The quick fox jumps over the lazy dog f_1 f_2 f_3 … f_10 There are (l+1)^m = (8+1)^10 = 3,486,784,401 possible alignments.

P(F,A|E) Story null The quick fox jumps over the lazy dog Le renard rapide saut par - dessus le chien parasseux

Getting P_t(f|e) We need numbers for P_t(f|e). Example: P_t(le|the). Count lines in a large collection of aligned text. (The slide shows six copies of the aligned pair "null The quick fox jumps over the lazy dog / Le renard rapide saut par - dessus le chien parasseux", each with its alignment lines drawn in.)
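
A minimal sketch of this counting, assuming we already have aligned data. The data structure (token lists plus (French position, English position) links) is an assumption for illustration:

```python
from collections import Counter

# Each item: (French tokens, English tokens, alignment links).
aligned = [
    (["Le", "renard", "rapide"], ["The", "quick", "fox"],
     [(0, 0), (1, 2), (2, 1)]),
    # ... many more aligned sentence pairs ...
]

link_counts = Counter()  # how often f was linked to e
e_counts = Counter()     # how often e was linked to anything

for f_sent, e_sent, links in aligned:
    for f_pos, e_pos in links:
        link_counts[(f_sent[f_pos], e_sent[e_pos])] += 1
        e_counts[e_sent[e_pos]] += 1

def p_t(f, e):
    # P_t(f|e) = count(f linked to e) / count(e linked to anything)
    return link_counts[(f, e)] / e_counts[e]

print(p_t("Le", "The"))  # 1.0 with only this toy pair
```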

Where do we get the lines? That sure looked like a lot of monkeys… Remember: sometimes the information hidden in the text just jumps out at you. We'll get alignments out of unaligned text by treating the alignment as a hidden variable. We infer an A that maximizes the probability of our corpus. A generalization of ideas in HMM training, called EM.

Where's heaven in Vietnamese? Example borrowed from Jason Eisner

English: In the beginning God created the heavens and the earth.
Vietnamese: Ban dâu Dúc Chúa Tròi dung nên tròi dât.
English: God called the expanse heaven.
Vietnamese: Dúc Chúa Tròi dat tên khoang không la tròi.
English: … you are this day like the stars of heaven in number.
Vietnamese: … các nguoi dông nhu sao trên tròi.
Where's heaven in Vietnamese? Notice the word that shows up in every Vietnamese sentence: tròi. Example borrowed from Jason Eisner

EM: Expectation Maximization Assume a probability distribution (weights) over hidden events. Take counts of events based on this distribution. Use the counts to estimate new parameters. Use the parameters to re-weight the examples. Rinse and repeat.

Alignment Hypotheses (The slide shows eight alignment diagrams for the pair "null I like milk / Je aime le lait", one per alignment hypothesis.)

Weighted Alignments What we'll do is: consider every possible alignment; give each alignment a weight indicating how good it is; count weighted alignments as normal.

Good grief! We forgot about P(F|E)! No worries, a little more stats gets us what we need: P(F|E) = Σ_A P(F,A|E).

Big Example: Corpus
Sentence pair 1: fast car / voiture rapide
Sentence pair 2: fast / rapide

Possible Alignments
1a: voiture aligns to fast, rapide aligns to car
1b: voiture aligns to car, rapide aligns to fast
2: rapide aligns to fast

Parameters (initialized uniformly)
P(voiture|fast) = 1/2, P(rapide|fast) = 1/2, P(voiture|car) = 1/2, P(rapide|car) = 1/2

Weight Calculations
1a: P(A,F|E) = 1/2 * 1/2 = 1/4; P(A|F,E) = (1/4) / (2/4) = 1/2
1b: P(A,F|E) = 1/2 * 1/2 = 1/4; P(A|F,E) = (1/4) / (2/4) = 1/2
2: P(A,F|E) = 1/2; P(A|F,E) = (1/2) / (1/2) = 1

Count Lines
Alignment weights: 1a = 1/2, 1b = 1/2, 2 = 1

Count Lines
Alignment weights: 1a = 1/2, 1b = 1/2, 2 = 1
#(voiture,fast) = 1/2, #(rapide,fast) = 1/2 + 1 = 3/2, #(voiture,car) = 1/2, #(rapide,car) = 1/2

Count Lines
#(voiture,fast) = 1/2, #(rapide,fast) = 1/2 + 1 = 3/2, #(voiture,car) = 1/2, #(rapide,car) = 1/2
Normalize: P(voiture|fast) = 1/4, P(rapide|fast) = 3/4, P(voiture|car) = 1/2, P(rapide|car) = 1/2

Parameters (after iteration 1)
P(voiture|fast) = 1/4, P(rapide|fast) = 3/4, P(voiture|car) = 1/2, P(rapide|car) = 1/2

Weight Calculations
1a: P(A,F|E) = 1/4 * 1/2 = 1/8; P(A|F,E) = (1/8) / (4/8) = 1/4
1b: P(A,F|E) = 1/2 * 3/4 = 3/8; P(A|F,E) = (3/8) / (4/8) = 3/4
2: P(A,F|E) = 3/4; P(A|F,E) = (3/4) / (3/4) = 1

Count Lines
Alignment weights: 1a = 1/4, 1b = 3/4, 2 = 1

Count Lines
Alignment weights: 1a = 1/4, 1b = 3/4, 2 = 1
#(voiture,fast) = 1/4, #(rapide,fast) = 3/4 + 1 = 7/4, #(voiture,car) = 3/4, #(rapide,car) = 1/4

Count Lines
#(voiture,fast) = 1/4, #(rapide,fast) = 3/4 + 1 = 7/4, #(voiture,car) = 3/4, #(rapide,car) = 1/4
Normalize: P(voiture|fast) = 1/8, P(rapide|fast) = 7/8, P(voiture|car) = 3/4, P(rapide|car) = 1/4

After many iterations:
Alignment weights: 1a ≈ 0, 1b ≈ 1, 2 = 1
P(voiture|fast) ≈ 0, P(rapide|fast) ≈ 1, P(voiture|car) ≈ 1, P(rapide|car) ≈ 0
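
A compact sketch of the whole EM loop above. It enumerates the same one-to-one alignments the slides do (1a, 1b, and 2), so after one iteration it reproduces 1/4, 3/4, 1/2, 1/2; real Model 1 training sums over all alignments without enumerating them (see the appendix slides):

```python
from collections import Counter, defaultdict
from itertools import permutations

# Toy corpus from the slides: (English, French) sentence pairs.
corpus = [(["fast", "car"], ["voiture", "rapide"]),
          (["fast"], ["rapide"])]

t = defaultdict(lambda: 0.5)  # P_t(f|e), initialized uniformly

for iteration in range(25):
    counts = Counter()  # expected count of (f, e) links
    totals = Counter()  # expected count of e being linked to anything
    for e_sent, f_sent in corpus:
        # One-to-one alignments, as enumerated on the slides (works here
        # because no French sentence is longer than its English partner).
        alignments = list(permutations(e_sent, len(f_sent)))
        weights = []
        for a in alignments:
            w = 1.0
            for f, e in zip(f_sent, a):
                w *= t[(f, e)]  # P(A,F|E): product of link probabilities
            weights.append(w)
        z = sum(weights)
        for a, w in zip(alignments, weights):
            for f, e in zip(f_sent, a):
                counts[(f, e)] += w / z  # fractional counts, weighted by P(A|F,E)
                totals[e] += w / z
    # M-step: renormalize counts into new parameters.
    t = defaultdict(float, {(f, e): c / totals[e] for (f, e), c in counts.items()})

for (f, e), p in sorted(t.items()):
    print(f"P({f}|{e}) = {p:.3f}")
# Converges to P(rapide|fast) ≈ 1 and P(voiture|car) ≈ 1, as on the slide.
```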

Seems too easy? What if you have no 1-word sentence? Words in shorter sentences will get more weight, since there are fewer possible alignments. Weight is additive throughout the corpus: if a word e shows up frequently with some other word f, P(f|e) will go up. Like our heaven example.

The Final Product Now we have a model for P(F|E). Test it by aligning a corpus! I.e., find argmax_A P(A|F,E). Use it for translation: combine it with our n-gram model for P(E), and search the space of English sentences for one that maximizes P(E)P(F|E) for a given F.
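
A minimal sketch of the alignment step, using a hypothetical trained table (the values are the converged ones from the toy EM run). Because each French word's link is chosen independently in this model, the best alignment decomposes word by word:

```python
# Hypothetical converged P_t(f|e) table for the toy corpus.
t = {("voiture", "fast"): 0.0, ("rapide", "fast"): 1.0,
     ("voiture", "car"): 1.0, ("rapide", "car"): 0.0}

def align(f_sent, e_sent, table, null="<null>"):
    # Each French word links to the English word (or null) that
    # maximizes P_t(f|e); this yields argmax_A P(A|F,E) for this model.
    candidates = [null] + e_sent
    return [max(candidates, key=lambda e: table.get((f, e), 0.0)) for f in f_sent]

print(align(["voiture", "rapide"], ["fast", "car"], t))
# -> ['car', 'fast']
```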

Model could be a lot better: word positions; multiple f's generated by the same e; it could take into account who generated your neighbors; it could use syntax and parsing; it could align phrases.

Sure, but is it any better? We've got some good ideas for improving translation. How can we quantify the change in translation quality?

Sure, but is it any better? How to (automatically) measure translation? Original French: Dès qu'il fut dehors, Pierre se dirigea vers la rue de Paris, la principale rue du Havre, éclairée, animée, bruyante. Human translation to English: As soon as he got out, Pierre made his way to the Rue de Paris, the high-street of Havre, brightly lighted up, lively and noisy. Two machine translations back to French: Dès qu'il est sorti, Pierre a fait sa manière à la rue De Paris, la haut-rue de Le Havre, brillamment allumée, animée et bruyante. Dès qu'il en est sorti, Pierre s'est rendu à la Rue de Paris, de la grande rue du Havre, brillamment éclairés, animés et bruyants. Example from

Bleu Score BLEU: Bilingual Evaluation Understudy. A metric for comparing translations. Considers the n-grams in common with the target translation and the length of the target translation. A score of 1 is identical; a score of 0 shares no words in common. Even human translations don't score 1.
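
A simplified, sentence-level sketch of the idea (real BLEU is computed over a whole test set, supports multiple references, and defines smoothing differently; this toy version only captures the flavor):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    c, r = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(c[i:i+n]) for i in range(len(c) - n + 1))
        r_ngrams = Counter(tuple(r[i:i+n]) for i in range(len(r) - n + 1))
        # Clipped n-gram matches: a candidate n-gram only counts as many
        # times as it appears in the reference.
        overlap = sum(min(cnt, r_ngrams[g]) for g, cnt in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # avoid log(0)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / max(len(c), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("Jon appeared on TV", "Jon appeared on TV"))  # 1.0
print(bleu("TV appeared on Jon", "Jon appeared on TV"))  # much lower
```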

Google Translate 25 language pairs. In the news (digg.com). In competition.

Questions?

References (Inspiration, Sources of borrowed material)
Colin Cherry, MT for NLP.
Knight, K., Automating Knowledge Acquisition for Machine Translation, AI Magazine 18(4), 1997.
Knight, K., A Statistical Machine Translation Tutorial Workbook, 1999.
Eisner, J., JHU NLP Course Notes: Machine Translation, 2001.
Olga Kubassova, Probability for NLP.

Enumerating all alignments There are (l+1)^m possible alignments!

Gah! (Diagram: the sum over alignments, drawn as links between Null (0), Fast (1), car (2) and Voiture (1), rapide (2).)

Let's move these over here… (The same diagram, with the terms regrouped by French word.)

And now we can do this… (The regrouped diagram: a product over French words of sums over English words.)

So, it turns out: P(F|E) = Σ_A Π_j P_t(f_j | e_a(j)) = Π_j Σ_i P_t(f_j | e_i). Requires only (l+1)·m operations. Can be used whenever your alignment choice for one word does not affect the probability of the rest of the alignment.
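
A minimal sketch of the trick, with a hypothetical P_t table (the uniform alignment prior is dropped, since it is constant): instead of summing over all (l+1)^m alignments, swap the sum and the product.

```python
from itertools import product
from math import prod

# Hypothetical P_t(f|e) table for "fast car" / "voiture rapide" with a null word.
t = {("voiture", "<null>"): 0.1, ("voiture", "fast"): 0.2, ("voiture", "car"): 0.7,
     ("rapide", "<null>"): 0.1, ("rapide", "fast"): 0.8, ("rapide", "car"): 0.1}

e_sent = ["<null>", "fast", "car"]
f_sent = ["voiture", "rapide"]

# Naive: enumerate all (l+1)^m alignments and sum their probabilities.
naive = sum(prod(t[(f, e)] for f, e in zip(f_sent, a))
            for a in product(e_sent, repeat=len(f_sent)))

# Factored: a product over French words of a sum over English words,
# (l+1)*m operations instead of (l+1)^m.
factored = prod(sum(t[(f, e)] for e in e_sent) for f in f_sent)

print(naive, factored)  # identical up to floating point
```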