1
Machine Translation II.2, February 2006
Postgraduate Diploma in Translation
Example Based Machine Translation
Statistical Machine Translation
2
Three ways to lighten the load
- Restrict coverage to specialised domains
- Exploit existing sources of knowledge (convert machine-readable dictionaries)
- Try to manage without explicit representations: Example Based MT (EBMT) and Statistical MT (SMT)
3
Today's Lecture
- Example Based MT
- Statistical MT
4
Part I: Example Based Machine Translation
5
EBMT
The basic idea is that instead of being based on rules and abstract representations, translation should be based on a database of examples. Each example is a pairing of a source fragment with a target fragment. The original intuition came from Nagao, a well-known pioneer in the field of English-Japanese translation.
6
EBMT (Nagao 1984)
Man does translation by properly decomposing an input sentence into certain fragmental phrases, then by translating these phrases into other language phrases, and finally by properly composing these fragmental translations into one long sentence.
7
Three Step Process
- Match: identify relevant source language examples in the database.
- Align: find the corresponding fragments in the target language.
- Recombine: put the target language fragments together to form sentences.
8
EBMT
Example Based Machine Translation as used in the Pangloss system at Carnegie Mellon University. Based on notes by Dave Inman.
9
EBMT: Corpus & Index
Corpus:
S1: The cat eats a fish. / Le chat mange un poisson.
S2: A dog eats a cat. / Un chien mange un chat.
…
S99,999,999: …
Index:
the: S1
cat: S1, S2
eats: S1, S2
fish: S1
dog: S2
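As a rough illustration (hypothetical Python, not the Pangloss code), such an inverted index can be built by mapping every source-language word to the set of sentence IDs in which it appears; the toy corpus below is just the two sentence pairs from the slide.

```python
from collections import defaultdict

# Toy sentence-aligned corpus: (source, target) pairs, as on the slide.
corpus = [
    ("the cat eats a fish", "le chat mange un poisson"),  # S1
    ("a dog eats a cat",    "un chien mange un chat"),    # S2
]

def build_index(corpus):
    """Map each source-language word to the IDs of the sentences containing it."""
    index = defaultdict(set)
    for sent_id, (source, _target) in enumerate(corpus, start=1):
        for word in source.split():
            index[word].add(sent_id)
    return index

index = build_index(corpus)
print(index["cat"])   # {1, 2}
print(index["fish"])  # {1}
```

Chunk lookup then reduces to intersecting these sets, which is what keeps matching fast.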
10
EBMT: find chunks
A source language sentence is input: "The cat eats a dog."
Chunks of this sentence are matched against the corpus:
The cat: S1
The cat eats: S1
The cat eats a: S1
a dog: S2
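A sketch of chunk finding, under the simplifying assumption that a chunk matches a corpus sentence whenever all of its words occur in that sentence (the real matcher is more selective); it reuses corpus and build_index from the previous sketch.

```python
def find_chunks(sentence, index, min_len=2):
    """Return (chunk, sentence IDs) for every contiguous chunk of at least
    min_len words whose words all occur in some common corpus sentence."""
    words = sentence.split()
    matches = []
    for i in range(len(words)):
        for j in range(i + min_len, len(words) + 1):
            chunk = words[i:j]
            # Sentences containing every word of the chunk.
            common = set.intersection(*(index.get(w, set()) for w in chunk))
            if common:
                matches.append((" ".join(chunk), sorted(common)))
    return matches

for chunk, sents in find_chunks("the cat eats a dog", index):
    print(chunk, "->", sents)
# e.g. "the cat" -> [1], "the cat eats a" -> [1], "a dog" -> [2]
```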
11
Match and Align Chunks
For each chunk, retrieve the target sentence:
the cat eats: S1 – The cat eats a fish. / Le chat mange un poisson.
a dog: S2 – A dog eats a cat. / Un chien mange un chat.
The chunks are aligned with the target sentences:
The cat eats / Le chat mange un poisson
Alignment is difficult.
12
Recombination
Chunks are scored to find good matches:
The cat eats / Le chat mange – score 78%
The cat eats / Le chat dorme – score 43%
a dog / un chien – score 67%
a dog / le chien – score 56%
a dog / un arbre – score 22%
The best translated chunks are put together to make the final translation:
The cat eats / Le chat mange + a dog / un chien
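A toy sketch of recombination, assuming we simply keep the highest-scoring translation of each chunk and concatenate the chunks in source order; the candidate scores are the illustrative percentages from the slide, not real system output.

```python
# Candidate translations per source chunk, with the slide's illustrative scores.
candidates = {
    "the cat eats": [("le chat mange", 0.78), ("le chat dorme", 0.43)],
    "a dog":        [("un chien", 0.67), ("le chien", 0.56), ("un arbre", 0.22)],
}

def recombine(chunks, candidates):
    """Pick the best-scoring translation of each chunk, in source order."""
    best = []
    for chunk in chunks:
        translation, _ = max(candidates[chunk], key=lambda pair: pair[1])
        best.append(translation)
    return " ".join(best)

print(recombine(["the cat eats", "a dog"], candidates))
# -> "le chat mange un chien"
```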
13
What Data Are Needed?
1. A bilingual dictionary … but we can induce this from the corpus.
2. A target language root/synonym list … so we can see the similarity between words and their inflected forms (e.g. verbs).
3. Classes of words that are easily translated … such as numbers, towns, weekdays.
4. A large corpus of parallel sentences … if possible in the same domain as the texts to be translated.
14
How to create a bilingual lexicon
- Take each sentence pair in the corpus.
- For each word in the source sentence, add each word in the target sentence as a candidate translation and increment its frequency count.
- Repeat for as many sentence pairs as possible.
- Use a threshold to keep the plausible alternative translations.
15
How to create a lexicon
The cat eats a fish. / Le chat mange un poisson.
the:  le,1  chat,1  mange,1  un,1  poisson,1
cat:  le,1  chat,1  mange,1  un,1  poisson,1
eats: le,1  chat,1  mange,1  un,1  poisson,1
a:    le,1  chat,1  mange,1  un,1  poisson,1
fish: le,1  chat,1  mange,1  un,1  poisson,1
16
After many sentences …
the:
le, 956
la, 925
un, 235
------ threshold ------
chat, 47
mange, 33
poisson, 28
…
arbre, 18
17
After many sentences …
cat:
chat, 963
------ threshold ------
le, 604
la, 485
un, 305
mange, 33
poisson, 28
…
arbre, 47
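A minimal sketch of this co-occurrence counting, assuming whitespace-tokenised, lower-cased sentence pairs and a simple absolute-frequency threshold (a real system might use a ratio or an association score instead):

```python
from collections import defaultdict, Counter

def induce_lexicon(pairs, threshold):
    """For every source word, count how often each target word co-occurs with it
    in aligned sentence pairs, then keep only counts at or above the threshold."""
    cooc = defaultdict(Counter)
    for source, target in pairs:
        for s_word in source.split():
            for t_word in target.split():
                cooc[s_word][t_word] += 1
    return {s: {t: c for t, c in counts.items() if c >= threshold}
            for s, counts in cooc.items()}

# Tiny illustrative pairs; a real lexicon needs many thousands of sentences.
pairs = [("the cat eats a fish", "le chat mange un poisson"),
         ("a dog eats a cat", "un chien mange un chat")]
print(induce_lexicon(pairs, threshold=2)["cat"])  # {'chat': 2, 'mange': 2, 'un': 3}
```

With enough sentences, the counts for the genuine translation (chat for cat) pull away from the incidental co-occurrences, which is what the threshold exploits.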
18
Indexing the Corpus
- For speed, the corpus is indexed on the source language sentences.
- Each word in each source language sentence is stored with information about the corresponding target sentence.
- Words can be added to the corpus and the index easily updated.
- Tokens are used for common classes of words (e.g. numbers); this makes matching more effective.
19
Finding Chunks to Translate
- Look up each word of the source sentence in the index.
- Look for chunks of the source sentence (at least two adjacent words) which match the corpus.
- Select the last few matches against the corpus (translation memory); Pangloss uses the last 5 matches for any chunk.
20
Matching a chunk against the target
- For each source chunk found previously, retrieve the target sentences from the corpus (using the index).
- Try to find the translation of the source chunk within these sentences. This is the hard bit!
- Look for the minimum and maximum segments of the target sentences which could correspond to the source chunk.
- Score each of these segments.
21
Scoring a segment …
- Unmatched words: higher priority is given to sentences containing all the words in the input chunk.
- Noise: higher priority is given to corpus sentences which have fewer extra words.
- Order: higher priority is given to sentences containing the input words in an order closer to their order in the input chunk.
- Morphology: higher priority is given to sentences in which words match exactly rather than as morphological variants.
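A hypothetical scoring function combining three of these heuristics as simple penalties (morphology is omitted since it would need a stemmer); the weights are arbitrary assumptions for illustration, not the Pangloss scoring scheme.

```python
def score_segment(chunk_words, sentence_words):
    """Toy score for how well a corpus sentence supports an input chunk:
    penalise missing chunk words, extra 'noise' words, and out-of-order matches."""
    matched = [w for w in chunk_words if w in sentence_words]
    unmatched_penalty = len(chunk_words) - len(matched)              # missing words
    noise_penalty = max(0, len(sentence_words) - len(chunk_words))   # extra words
    positions = [sentence_words.index(w) for w in matched]
    order_penalty = sum(1 for a, b in zip(positions, positions[1:]) if a > b)
    return -(3 * unmatched_penalty + 1 * noise_penalty + 2 * order_penalty)

chunk = "the cat eats".split()
print(score_segment(chunk, "the cat eats a fish".split()))  # fewer penalties
print(score_segment(chunk, "a dog eats a cat".split()))     # more penalties
```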
22
Whole Sentence Match
- If we are lucky, the whole sentence will be found in the corpus!
- In that case the target sentence is used directly, without the alignment step.
- Useful if a translation memory is available (recently translated sentences are added to the corpus).
23
Quality of Translation
- Pangloss was tested on source sentences from a different domain than the examples in the corpus.
- Pangloss "covered" about 70% of the input sentences. This means a match was found against the corpus … but not necessarily a good match.
- Others report that around 60% of the translation can be understood by a native speaker; Systran manages about 70%.
24
Speed of Translation
- Translations are much faster than for Systran; simple sentences are translated in seconds.
- The corpus (translation memory) can be added to at about 6 MBytes per minute (Sun SPARCstation).
- A 270 MByte corpus takes 45 minutes to index.
25
Positive Points
- Fast
- Easy to add a new language pair
- No need to analyse languages (much)
- Can induce a dictionary from the corpus
- Allows easy implementation of translation memory
26
Negative Points
- Quality is second best at present
- Depends on a large corpus of parallel, well translated sentences
- 30% of the source has no coverage (no translation)
- Matching of words is brittle – we can see a match where Pangloss cannot
- The domain of the corpus should match the domain to be translated, so that chunks can be matched
27
Conclusions
- An alternative to Systran: faster, but lower quality
- Quick to develop for a new language pair – if a corpus exists!
- Needs no linguistics
- Might improve as bigger corpora become available?
28
Part II: Statistical Translation
29
Statistical Translation
- Robust
- Domain independent
- Extensible
- Does not require language specialists
- Uses the noisy channel model of translation
30
Noisy Channel Model: Sentence Translation (Brown et al. 1990)
[diagram: a source sentence passes through a noisy channel to become the target sentence; decoding recovers a sentence in the source language]
31
Basic Principle
John loves Mary (S)
Jean aime Marie (T)
Given T, we have to find S such that Ptrans × Ps is greater than for any other S', where
Ptrans = probability that T is a translation of S
Ps = probability of S
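Equivalently, by Bayes' rule (P(T) is constant for the given input T), the decoder looks for

```latex
\hat{S} \;=\; \arg\max_{S} \; P(S \mid T)
        \;=\; \arg\max_{S} \; \underbrace{P(S)}_{P_s} \, \underbrace{P(T \mid S)}_{P_{\mathrm{trans}}}
```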
32
A Statistical MT System
[diagram: a language model assigns Ps to a source sentence S; a translation model assigns Ptrans to the pair S → T; the decoder takes T and searches for the S that maximises Ptrans × Ps]
33
The Three Components of a Statistical MT Model
1. A method for computing language model (Ps) probabilities
2. A method for computing translation (Ptrans) probabilities
3. A method for searching amongst source sentences for the one that maximises Ptrans × Ps
34
Simplest Language Model
The probability Ps of any sentence is the product of the probabilities of the words in it. For example:
P(John loves Mary) = P(John) × P(loves) × P(Mary)
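A minimal sketch of such a unigram language model, estimated by relative frequency from a tiny made-up sample of source-language text (a real model needs far more data, and smoothing for unseen words):

```python
from collections import Counter

# Tiny illustrative source-language sample.
english_text = ["john loves mary", "mary loves john", "john sleeps"]

counts = Counter(w for sent in english_text for w in sent.split())
total = sum(counts.values())

def p_s(sentence):
    """Unigram probability: product of the relative frequencies of the words."""
    p = 1.0
    for w in sentence.split():
        p *= counts[w] / total
    return p

print(p_s("john loves mary"))  # (3/8) * (2/8) * (2/8)
```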
35
Simplest Translation Model (1)
Assumption: the target sentence is generated from the source sentence word by word.
S: John loves Mary
T: Jean aime Marie
36
Simplest Translation Model (2)
Ptrans is just the product of the translation probabilities of each of the words:
Ptrans = P(Jean | John) × P(aime | loves) × P(Marie | Mary)
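A toy sketch of this word-by-word model, assuming equal-length sentences aligned position by position and a small made-up table of word translation probabilities:

```python
# Made-up word translation probabilities P(target word | source word).
t_prob = {
    ("john", "jean"):  0.9,
    ("loves", "aime"): 0.8,
    ("mary", "marie"): 0.9,
}

def p_trans(source, target):
    """Word-by-word translation probability for equal-length, aligned sentences."""
    p = 1.0
    for s_word, t_word in zip(source.split(), target.split()):
        p *= t_prob.get((s_word, t_word), 1e-6)  # tiny floor for unseen pairs
    return p

print(p_trans("john loves mary", "jean aime marie"))  # 0.9 * 0.8 * 0.9
```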
37
More Realistic Example
The proposal will not now be implemented.
Les propositions ne seront pas mises en application maintenant.
38
More Realistic Translation Models
Better translation models include other features such as:
- Fertility: the number of words in the target that are paired with each source word (0 – N).
- Distortion: the difference in sentence position between the source word and the target word.
39
Searching
Maintain a list of hypotheses. Initial hypothesis: (Jean aime Marie | *)
Search proceeds iteratively. At each iteration we extend the most promising hypotheses with additional words:
Jean aime Marie | John(1) *
Jean aime Marie | * loves(2) *
Jean aime Marie | * Mary(3) *
Jean aime Marie | Jean(1) *
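A very rough sketch of this kind of hypothesis expansion as a beam search. It assumes the p_s and p_trans from the earlier sketches are in scope, builds the English hypothesis left to right, and ties each French word to exactly one English word, so it ignores fertility, distortion and reordering; the real search over partial alignments is considerably more elaborate.

```python
def beam_search(target, candidates, p_s, p_trans, beam_width=3):
    """Grow source-language hypotheses word by word, keeping only the best few."""
    t_words = target.split()
    hypotheses = [""]                                  # start from the empty hypothesis
    for position, t_word in enumerate(t_words):
        expanded = []
        for hyp in hypotheses:
            for e_word in candidates[t_word]:          # possible translations of t_word
                expanded.append((hyp + " " + e_word).strip())
        partial_target = " ".join(t_words[: position + 1])
        # Rank partial hypotheses by the noisy channel score Ps * Ptrans.
        expanded.sort(key=lambda h: p_s(h) * p_trans(h, partial_target), reverse=True)
        hypotheses = expanded[:beam_width]
    return hypotheses[0]

candidates = {"jean": ["john"], "aime": ["loves", "likes"], "marie": ["mary"]}
print(beam_search("jean aime marie", candidates, p_s, p_trans))  # -> "john loves mary"
```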
40
Building Models
- In general: large quantities of data.
- For the language model, we need only source language text.
- For the translation model, we need pairs of sentences that are translations of each other.
- Use the EM algorithm (Baum 1972) to optimise the model parameters.
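A compact sketch of EM training for the word translation probabilities, in the style of IBM Model 1 (a simplification of the Brown et al. models, with no fertility, distortion or NULL word): start from uniform probabilities, collect expected alignment counts, and re-estimate until the values settle.

```python
from collections import defaultdict
from itertools import product

def train_model1(pairs, iterations=10):
    """EM for word translation probabilities t(f | e), IBM Model 1 style."""
    e_vocab = {e for es, _ in pairs for e in es.split()}
    f_vocab = {f for _, fs in pairs for f in fs.split()}
    t = {(f, e): 1.0 / len(f_vocab) for f, e in product(f_vocab, e_vocab)}  # uniform start

    for _ in range(iterations):
        count = defaultdict(float)      # expected counts of (f, e) co-translations
        total_e = defaultdict(float)
        for es, fs in pairs:
            e_words, f_words = es.split(), fs.split()
            for f in f_words:           # E-step: fractional alignment counts
                norm = sum(t[(f, e)] for e in e_words)
                for e in e_words:
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total_e[e] += c
        for f, e in t:                  # M-step: re-estimate the probabilities
            if total_e[e] > 0:
                t[(f, e)] = count[(f, e)] / total_e[e]
    return t

pairs = [("the cat eats a fish", "le chat mange un poisson"),
         ("a dog eats a cat", "un chien mange un chat")]
t = train_model1(pairs)
print(round(t[("chat", "cat")], 3))  # rises above its uniform start of 1/8
```

The same idea, with fertility and distortion parameters added, is what the experiments below initialise uniformly and then train on the Hansard data.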
41
Experiment 1 (Brown et al. 1990)
- Hansard corpus: 40,000 pairs of sentences, approx. 800,000 words in each language.
- Considered the 9,000 most common words in each language.
- Assumptions (initial parameter values):
  - each of the 9,000 target words equally likely as a translation of each of the source words;
  - each of the fertilities from 0 to 25 equally likely for each of the 9,000 source words;
  - each target position equally likely given each source position and target length.
42
English: the

French     Probability
le         .610
la         .178
l'         .083
les        .023
ce         .013
il         .012
de         .009
à          .007
que        .007

Fertility  Probability
1          .871
0          .124
2          .004
43
English: not

French       Probability
pas          .469
ne           .460
non          .024
pas du tout  .003
faux         .003
plus         .002
ce           .002
que          .002
jamais       .002

Fertility    Probability
2            .758
0            .133
1            .106
44
English: hear

French     Probability
bravo      .992
entendre   .005
entendu    .002
entends    .001

Fertility  Probability
0          .584
1          .416
45
Experiment 2
- Translation was performed using the 1,000 most frequent words in the English corpus.
- The 1,700 most frequently used French words in translations of sentences completely covered by the 1,000-word English vocabulary.
- 117,000 pairs of sentences completely covered by both vocabularies.
- Parameters of the English language model estimated from 570,000 sentences in the English part.
46
Experiment 2 (contd.)
- 73 French sentences from elsewhere in the corpus were tested.
- Results were classified as:
  - Exact – same as the actual translation
  - Alternate – same meaning
  - Different – a legitimate translation but with a different meaning
  - Wrong – could not be interpreted as a translation
  - Ungrammatical – grammatically deficient
- Corrections to the last three categories were made and keystrokes were counted.
47
Results

Category       # sentences  Percent
Exact          4            5
Alternate      18           25
Different      13           18
Wrong          11           15
Ungrammatical  27           37
Total          73
48
Results – Discussion
- According to Brown et al., the system performed successfully 48% of the time (the first three categories: 4 + 18 + 13 = 35 of the 73 sentences).
- 776 keystrokes were needed to repair the output, against 1,916 keystrokes to generate all 73 translations from scratch.
- According to the authors, the system therefore reduces the work by about 60% (repair took roughly 40% of the keystrokes needed to translate from scratch).
49
Bibliography
Statistical MT: Brown et al., "A Statistical Approach to Machine Translation", Computational Linguistics 16(2), 1990, pp. 79–85 (search the ACL Anthology).