Example-based Machine Translation: the other corpus-based approach to MT

Presentation transcript:

1/23 Example-based Machine Translation: the other corpus-based approach to MT

2/23 Example-based Machine Translation
- The other corpus-based approach to MT
- Historically predates SMT (just about)
- At first seen as a rival approach
- Now almost marginalised … despite (because of?) some convergence
In this talk I will:
- Explain basic ideas and problems
- Point to differences and similarities between EBMT and SMT

3/23 Example-based MT
- Long-established approach to empirical MT
- First developed in contrast with rule-based MT
- Idea of translation by analogy (Nagao 1984): translate by adapting previously seen examples rather than by linguistic rule
- "Existing translations contain more solutions to more translation problems than any other available resource." (P. Isabelle et al., TMI, Kyoto, 1993)
- In computational terms, belongs in the family of case-based reasoning approaches

4/23 EBMT basic idea
- database of translation pairs
- match input against the example database (like Translation Memory)
- identify corresponding translation fragments (align)
- recombine fragments into target text
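
To make the four steps concrete, here is a deliberately naive sketch of the loop in Python. Everything in it (the toy example base, the word-overlap matcher, the faked align and recombine steps) is illustrative scaffolding, not a description of any particular EBMT system.

```python
# A deliberately naive sketch of the EBMT loop (all helpers hypothetical).

EXAMPLES = [
    ("he buys a notebook", "kare wa nōto o kau"),
    ("i read a book on international politics",
     "watashi wa kokusai seiji nitsuite kakareta hon o yomu"),
]

def match(source, examples):
    """Steps 1-2: retrieve the examples closest to the input (TM-style),
    here ranked by crude word overlap."""
    src_words = set(source.split())
    return sorted(examples,
                  key=lambda pair: len(src_words & set(pair[0].split())),
                  reverse=True)

def align(source, matched):
    """Step 3: identify reusable target fragments. Faked here: a real
    system would align source and target words and cut out substrings."""
    return [tgt for _, tgt in matched[:2]]

def recombine(fragments):
    """Step 4: stitch fragments into one target string (naively)."""
    return " | ".join(fragments)

query = "he buys a book on international politics"
print(recombine(align(query, match(query, EXAMPLES))))
```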

5/23 Example (Sato & Nagao 1990)
Input: He buys a book on international politics
Matches:
- He buys a notebook. / Kare wa nōto o kau.
- I read a book on international politics. / Watashi wa kokusai seiji nitsuite kakareta hon o yomu.
Result: Kare wa kokusai seiji nitsuite kakareta hon o kau.

6/23 A bit less hand-waving
The simple example hides some problems, but first notice some differences from SMT:
- If the input already appeared in the bitext, the system is guaranteed to produce an exact (correct) translation (assuming no contradictory examples)
- If the input is only slightly different from an example, there is a pretty good chance that the translation will be OK
- These are both properties of Translation Memories
- In its purest form, EBMT does no preprocessing of the corpus: everything is done at run time

7/23 Matching the input
- In principle, the simplest part of the process: Levenshtein distance for simple string match
- Can be enhanced by annotating the examples with linguistic knowledge (POS tags, semantic info, structural representations) to improve accuracy and flexibility
- Some approaches suggest generalizing example pairs – you end up with something which looks like RBMT transfer rules
- Example generalization is done off-line, either using "rules" that express linguistic knowledge, or more automatically by merging similar examples
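
The slide names Levenshtein distance as the baseline matcher. A standard dynamic-programming implementation over word tokens, with a toy example base, might look like this (the helper names are mine):

```python
# Word-level Levenshtein distance (standard DP), used to rank examples.

def levenshtein(a, b):
    """Edit distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def best_match(sentence, examples):
    """Return the example pair whose source side is closest to the input."""
    tokens = sentence.lower().split()
    return min(examples,
               key=lambda pair: levenshtein(tokens, pair[0].split()))

examples = [("he buys a notebook", "kare wa nōto o kau"),
            ("i read a book on international politics",
             "watashi wa kokusai seiji nitsuite kakareta hon o yomu")]
print(best_match("He buys a book on international politics", examples))
```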

8/23 Generalization using knowledge
Example pair:
- John Miller flew to Frankfurt on December 3rd.
- John Miller ist am 3. Dezember nach Frankfurt geflogen.
Generalized into a template with linguistic slots (slot labels reconstructed; the original slide shows the abstraction in steps):
- <PERSON> flew to <CITY> on <DATE>.
- <PERSON> ist am <DATE> nach <CITY> geflogen.
New input that matches the template:
- Dr Howard Johnson flew to Ithaca on 7 April 1997.

9/23 Generalization by analogy – an exercise
- The monkey ate a peach. → saru wa momo o tabeta.
- The man ate a peach. → hito wa momo o tabeta.
- monkey → saru, man → hito, so: The … ate a peach. → … wa momo o tabeta.
- The dog ate a rabbit. → inu wa usagi o tabeta.
- dog → inu, rabbit → usagi, so: The … ate a …. → … wa … o tabeta.
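
The exercise above can be mechanised for the simplest case: two example pairs of equal length that differ in exactly one word on each side. The sketch below induces the word equations and the shared template; anything beyond that single-slot case is out of scope here.

```python
# Toy analogy-based generalization: diff two example pairs, record the
# word equations, and abstract the differing position into a slot.

def diff_slots(s1, s2):
    """Positions where two equal-length token lists disagree."""
    return [i for i, (a, b) in enumerate(zip(s1, s2)) if a != b]

def generalize(pair1, pair2):
    (e1, j1), (e2, j2) = pair1, pair2
    e1, j1, e2, j2 = e1.split(), j1.split(), e2.split(), j2.split()
    if len(e1) != len(e2) or len(j1) != len(j2):
        return None                      # only handles the simple case
    ediff, jdiff = diff_slots(e1, e2), diff_slots(j1, j2)
    if len(ediff) != 1 or len(jdiff) != 1:
        return None                      # more than one slot: give up
    lexicon = {e1[ediff[0]]: j1[jdiff[0]], e2[ediff[0]]: j2[jdiff[0]]}
    e1[ediff[0]], j1[jdiff[0]] = "…", "…"
    return " ".join(e1), " ".join(j1), lexicon

print(generalize(("the monkey ate a peach", "saru wa momo o tabeta"),
                 ("the man ate a peach", "hito wa momo o tabeta")))
# → ('the … ate a peach', '… wa momo o tabeta',
#    {'monkey': 'saru', 'man': 'hito'})
```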

10/23 Alignment
Taking the input and the closely-matching example and deciding which fragments of the translation can be reused or need to be changed.
Input: The operation was interrupted because the Listening key was pressed.
Matches: The operation failed because the print key was pressed. / L'opération a échoué car la touche d'impression a été enfoncée.

11/23 Alignment – how is this done?
- Dictionary look-up
- Comparison of multiple examples

12/23 Alignment – comparison of multiple examples
Comparison of multiple examples to distinguish alternatives, using semantic similarity (Nagao 1984)
Input: He eats potatoes.
Matches:
- A man eats vegetables. / Hito wa yasai o taberu.
- Acid eats metal. / San wa kinzoku o okasu.
Result: Kare wa jagaimo o taberu. ☺

13/23 Alignment – comparison of multiple examples (continued)
Input: He eats potatoes.
Matches:
- A man eats vegetables. / Hito wa yasai o taberu.
- Acid eats metal. / San wa kinzoku o okasu.
- Sulphuric acid eats iron. / Ryūsan wa tetsu o okasu.
Result: Kare wa jagaimo o taberu. ☺ (not okasu)

14/23 Alignment – how is this done?
- Dictionary look-up
- Comparison of multiple examples
- Precomputed as in SMT: using a word-alignment model

15/23 Phrase alignment
Granularity of fragments is a problem:
- Too small = too general when it comes to recombination (you wouldn't dream of translating by looking up each individual word in a dictionary and pasting it into position)
- Too big = sparse, and difficult to recombine
Working at an intermediate level seems attractive: phrase-based chunking, also found in SMT. One fairly successful approach (at DCU) has been …

16/23 Marker-based chunking
- Most languages have a set of "marker words" (Green 1979) – roughly speaking, closed-class words
- Marker words can be used to distinguish chunks: start a new phrase every time you come across a marker word, except that each phrase must contain at least one non-marker word
Example (chunk boundaries at marker words):
these limits are designed | to provide reasonable protection | against harmful interference | when the equipment is operated | in a residential environment.
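
A minimal version of this chunker, assuming a toy marker list (Green's actual closed-class inventory is much larger), could be:

```python
# Marker-based chunking: start a new chunk at each marker word, but never
# close a chunk that does not yet contain a non-marker word.
# The marker set below is a tiny illustrative subset, not Green's full list.

MARKERS = {"the", "a", "an", "these", "those", "to", "of", "on", "in",
           "against", "when", "for", "or", "and", "with", "by"}

def marker_chunks(sentence):
    chunks, current, has_content = [], [], False
    for word in sentence.lower().split():
        if word in MARKERS and has_content:
            chunks.append(" ".join(current))   # close the previous chunk
            current, has_content = [], False
        current.append(word)
        has_content = has_content or word not in MARKERS
    if current:
        chunks.append(" ".join(current))
    return chunks

print(marker_chunks("these limits are designed to provide reasonable "
                    "protection against harmful interference when the "
                    "equipment is operated in a residential environment"))
# → ['these limits are designed', 'to provide reasonable protection',
#    'against harmful interference', 'when the equipment is operated',
#    'in a residential environment']
```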

17/23 Chunk alignment
- Align by finding similar pairs of chunks in other examples
- No need for chunks to align 1:1, nor follow the same sequence
- Markers can help, but don't have to
English: 1 these limits are designed | 2 to provide reasonable protection | 3 against harmful interference | 4 when the equipment is operated | 5 in a residential environment.
French: 1 ces limites sont destinées | 2 à assurer | 3 une protection raisonnable | 4 contre les interférences | 5 lorsque le matériel est utilisé | 6 dans un environnement résidentiel.
English: 1 consult | 2 the dealer | 3 or an experienced radio/TV technician | 4 for help.
French: 1 en cas | 2 de besoin, | 3 s'adresser | 4 à un technicien radio | 5 ou TV qualifié.
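
One simple way to realise this pairing, sketched under the assumption that a small bilingual word list is available, is to score chunk pairs by dictionary overlap and greedily keep the best non-conflicting pairs; nothing here forces a 1:1, same-order alignment:

```python
# Score every chunk pair by bilingual-dictionary overlap and greedily
# keep the best non-conflicting pairs. The six-entry lexicon is made up.

LEXICON = {"limits": "limites", "protection": "protection",
           "interference": "interférences", "equipment": "matériel",
           "residential": "résidentiel", "environment": "environnement"}

def words(chunk):
    return [w.strip(".,").lower() for w in chunk.split()]

def overlap(en_chunk, fr_chunk):
    """How many English words translate to a word in the French chunk."""
    fr = set(words(fr_chunk))
    return sum(LEXICON.get(w) in fr for w in words(en_chunk))

def align_chunks(en_chunks, fr_chunks):
    pairs = sorted(((overlap(e, f), e, f)
                    for e in en_chunks for f in fr_chunks), reverse=True)
    used_e, used_f, links = set(), set(), []
    for score, e, f in pairs:
        if score > 0 and e not in used_e and f not in used_f:
            links.append((e, f))
            used_e.add(e)
            used_f.add(f)
    return links  # unmatched chunks (e.g. "à assurer") stay unlinked

en = ["these limits are designed", "to provide reasonable protection",
      "against harmful interference", "when the equipment is operated",
      "in a residential environment."]
fr = ["ces limites sont destinées", "à assurer", "une protection raisonnable",
      "contre les interférences", "lorsque le matériel est utilisé",
      "dans un environnement résidentiel."]
print(align_chunks(en, fr))
```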

18/23 Recombination
Having identified target-language fragments, how do we put them together? Depends how examples are stored:
- Templates with labelled slots, e.g. "… flew to … on …" / "… ist am … nach … geflogen."
- Tree structures, e.g. Kanojo wa kami ga nagai. (SHE (topic) HAIR (subj) IS-LONG) "She has long hair."; Kare wa me ga aoi. "He has blue eyes." [the original slide pairs each sentence with its dependency tree]
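
For the labelled-slot case, recombination reduces to filling a target template with translated fragments. A sketch using the flew-to example above (the slot names person/city/date are my own labels, not from the slide):

```python
# Recombination over templates with labelled slots.

EN_TEMPLATE = "{person} flew to {city} on {date}."
DE_TEMPLATE = "{person} ist am {date} nach {city} geflogen."

def recombine(template, fragments):
    """Fill a labelled-slot template with translated fragments."""
    return template.format(**fragments)

fragments = {"person": "Dr Howard Johnson",
             "city": "Ithaca",
             "date": "7. April 1997"}
print(recombine(DE_TEMPLATE, fragments))
# → Dr Howard Johnson ist am 7. April 1997 nach Ithaca geflogen.
```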

19/23 Recombination – a problem
Consider again: He buys a book on politics
Matches:
- He buys a notebook. / Kare wa nōto o kau.
- I read a book on politics. / Watashi wa seiji nitsuite kakareta hon o yomu.
- He buys a pen. / Kare wa pen o kau.
- She wrote a book on politics. / Kanojo wa seiji nitsuite kakareta hon o kaita.
Result: the fragments Kare wa … o kau and … wa seiji nitsuite kakareta hon o … overlap, so there is more than one way to splice them into Kare wa seiji nitsuite kakareta hon o kau.

20/23 Recombination – another problem: boundary friction
Input: The handsome boy entered the room
Matches:
- The handsome boy ate his breakfast. / Der schöne Junge aß sein Frühstück.
- I saw the handsome boy. / Ich sah den schönen Jungen.
Reusing den schönen Jungen from the second match as the subject of the new sentence gives the wrong case: the fragment fits badly at its boundary.
Solutions?
- Labelled fragments (remember where you got the fragment from – use its context)
- Target-language grammar
- Target language model (as in SMT)
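
The third listed solution, a target language model, can be sketched with a smoothed bigram model trained on the target sides of the matches: the candidate that starts with nominative der outscores the one that recycles the accusative den schönen Jungen. All counts and the smoothing constant here are illustrative.

```python
# Score candidate recombinations with a smoothed bigram LM trained on the
# target sides of the matches; counts and smoothing are purely illustrative.

import math
from collections import Counter

def train_bigrams(corpus):
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.lower().split()
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def lm_score(sentence, unigrams, bigrams, alpha=0.1):
    """Add-alpha smoothed bigram log-probability (higher = more fluent)."""
    toks = ["<s>"] + sentence.lower().split()
    vocab = len(unigrams) + 1
    return sum(math.log((bigrams[bg] + alpha) /
                        (unigrams[bg[0]] + alpha * vocab))
               for bg in zip(toks, toks[1:]))

corpus = ["der schöne Junge aß sein Frühstück",
          "ich sah den schönen Jungen"]
uni, bi = train_bigrams(corpus)
candidates = ["der schöne Junge betrat das Zimmer",    # correct case
              "den schönen Jungen betrat das Zimmer"]  # boundary friction
print(max(candidates, key=lambda c: lm_score(c, uni, bi)))
```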

21/23 EBMT and SMT hybrids
- Recombination is like decoding: the matching/alignment phases have produced a bag of fragments that now need to be recombined to form a grammatical target sentence
- Essentially the same task as is found in SMT decoding – it doesn't matter what the source of the fragments is
- Similarly, one could imagine an SMT translation model taking ideas from EBMT matching/alignment

22/23 So are EBMT and SMT the same?
Some things in common which distinguish them from rule-based MT:
- Use of a bitext as the fundamental data source
- Empirical rather than rational: the principle is machine learning rather than a human (linguist) writing rules
- From which it follows (in principle) that systems can be improved mainly by getting more data
- And it is hoped that new language pairs can be developed "just" by finding suitable parallel corpus data

23/23 So are EBMT and SMT the same?
SMT:
- essentially uses statistical data (parameters, probabilities) derived from the bitext
- preprocessing the data is essential
- even if the input is in the training data, you are not guaranteed to get the same translation
EBMT:
- uses the bitext itself as its primary data source
- preprocessing the data is optional
- if the input is in the example set, you are guaranteed to get the same translation
It may be merely dogmatic to insist, but there are some definitional differences.