Improving a Statistical MT System with Automatically Learned Rewrite Rules Fei Xia and Michael McCord IBM T. J. Watson Research Center Yorktown Heights, New York

Previous attempt at using syntax
2003 NSF/JHU MT Summer Workshop (Och et al., 2004):
– Method: run SMT to generate the top-N translations, then use syntactic information to rerank the candidates. => Parsing MT output is problematic.
– Result: no gain from syntax.

Outline
– Current phrase-based SMT systems (a.k.a. clump-based systems)
– Overview of the new approach
– Learning and applying rewrite rules
– Experimental results
– Conclusion and future work

Clump-based SMT
– The unit of translation is a clump, rather than a word. A clump is simply a word n-gram.
– Ex: P(est le premier | is the first)
– Baseline system: (Tillmann & Xia, 2003)
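To make the clump idea concrete, here is a minimal sketch (not the baseline system's actual data structures) of a clump library as a table from source n-grams to scored target n-grams; all entries, probabilities, and names below are illustrative assumptions.

```python
# Minimal sketch of a clump (phrase) library: source word n-grams mapped to
# candidate target n-grams with conditional scores P(target | source).
# Entries and probabilities are made up for illustration.
clump_library = {
    ("is", "the", "first"): [(("est", "le", "premier"), 0.7),
                             (("est", "la", "première"), 0.2)],
    ("france",): [(("france",), 0.9), (("la", "france"), 0.1)],
}

def lookup(source_ngram):
    """Return candidate target clumps for a source n-gram, best first."""
    return sorted(clump_library.get(tuple(source_ngram), []),
                  key=lambda pair: pair[1], reverse=True)

print(lookup(["is", "the", "first"]))  # [(('est', 'le', 'premier'), 0.7), ...]
```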

Clump-based system: training stage
[Diagram: parallel data (source and target sentences) → preprocessors → clump extractor → clump library]

Clump-based system: translation stage
[Diagram: source sentence → preprocessor → decoder (using the clump library and a language model) → target translation]

Baseline system: clump extraction
English: France is the first western country …
French: La France est le premier pays occidental …
Extracted clumps:
France => France
France is => France est
France is the => France est le

Baseline system: decoding
Input: He is the first international student
Available clumps: he => il; he is => il est; he is the => il est le; first => premier; is the first => est le premier; international => international; student => étudiant

Monotonic vs. non-monotonic decoding
Source: He is the first international student, segmented as S1 (He is the), S2 (first), S3 (international), S4 (student)
Monotonic decoding (S1, S2, S3, S4): il est le premier international étudiant
Non-monotonic decoding, e.g. (S1, S2, S4, S3): il est le premier étudiant international; (S2, S1, S3, S4): premier il est le international étudiant …

Challenges for current clump-based systems
(1) Non-monotonic decoding is expensive (n! orderings), and it can hurt performance.
Source: He is the first international student (segments S1 S2 S3 S4)
(S2, S1, S3, S4): premier il est le international étudiant
(S2, S3, S1, S4): premier international il est le étudiant
(S2, S3, S4, S1): premier international étudiant il est le
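As a small illustration of the n! blow-up (not taken from the paper), the sketch below enumerates segment orderings with itertools; the segmentation S1–S4 follows the slide above.

```python
import itertools
import math

# With n source segments, unrestricted (non-monotonic) reordering must
# consider n! segment orders; monotonic decoding considers exactly one.
segments = ["He is the", "first", "international", "student"]  # S1..S4
orders = list(itertools.permutations(range(len(segments))))
print(len(orders), math.factorial(len(segments)))              # 24 24

# The ordering (S2, S1, S3, S4) shown on the slide:
print(" | ".join(segments[i] for i in (1, 0, 2, 3)))
```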

Challenges for current clump-based systems (ctd)
(2) No phrase-level generalizations are learned and used.
France is the first western country.
He is the first international student.
Rewrite rules are useful:
– word-level rule: Adj N => N Adj
– phrase-level rule: Subj V Obj => V Subj Obj

New approach
Source: He is the first international student
Apply rewrite rules (Adj N => N Adj): He is the first student international
Monotonic decoding: il est le premier étudiant international
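A toy sketch of the two steps (reorder the source, then decode monotonically) follows; it is not the paper's implementation. The word-level Adj/N swap, the POS tags, and the clump entries are all made-up illustrations, and the real system reorders parse trees rather than flat tag sequences.

```python
# Toy sketch of the new approach: (1) reorder the source with a rewrite rule
# (here a word-level Adj N => N Adj swap over a POS-tagged sentence),
# (2) decode monotonically by greedy longest-match clump lookup.
tagged = [("He", "Pron"), ("is", "V"), ("the", "Det"), ("first", "Adj"),
          ("international", "Adj"), ("student", "N")]

def reorder_adj_n(tagged_sent):
    """Swap an Adj that immediately precedes an N (word-level Adj N => N Adj)."""
    out = list(tagged_sent)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "Adj" and out[i + 1][1] == "N":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out

clumps = {("he", "is", "the"): "il est le", ("first",): "premier",
          ("student",): "étudiant", ("international",): "international"}

def monotonic_decode(words):
    """Greedy left-to-right longest-match translation, no reordering."""
    out, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):
            key = tuple(w.lower() for w in words[i:j])
            if key in clumps:
                out.append(clumps[key])
                i = j
                break
        else:
            out.append(words[i])  # pass unknown words through
            i += 1
    return " ".join(out)

reordered = [w for w, _ in reorder_adj_n(tagged)]
print(reordered)                    # ['He', 'is', 'the', 'first', 'student', 'international']
print(monotonic_decode(reordered))  # il est le premier étudiant international
```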

Rewrite rules
[Diagram: source tree S → NP V NP reordered into target tree S → V NP NP]
NP0 V NP1 => V NP0 NP1

Defaults and exceptions
Default: Adj N => N Adj            Exception: Adj (first) N => Adj N   (first => premier stays before the noun)
Default: NP0 V NP1 => NP0 V NP1    Exception: NP0 V NP1 (iobj, pron) => NP0 NP1 V   (pronominal indirect objects move before the verb)
=> Learn both defaults and exceptions

New approach: training stage 1
[Diagram: parallel data → source and target sentences parsed → phrase aligner → rewrite rule extractor → rewrite rules]

New approach: training stage 2
[Diagram: parallel data → source sentence preprocessed and parsed → rewrite rule applier (using the learned rewrite rules) → source sentence in target word order; together with the preprocessed target sentence → clump extractor → clump library]

New approach: translation stage
[Diagram: source sentence → preprocessor and parser → rewrite rule applier → source sentence in target word order → decoder (using the clump library and a language model) → target translation]

Tasks
– Learn rewrite rules automatically from data.
– Apply rewrite rules to source parse trees.

Learning rewrite rules
– Parse source and target sentences
– Align linguistic phrases
– Extract rewrite rules
– Organize rewrite rules into a hierarchy

Parsing
France is the first western country
Parsed with Slot Grammar (McCord 1980, 1993, …)

Parse trees in Penn Treebank style
(S (NP-SBJ (N France)) (V is) (NP-PRD (Det the) (Adj first) (Adj western) (N country)))

Aligning phrases
English: (S (NP-SBJ (N France)) (V is) (NP-PRD (Det the) (Adj first) (Adj western) (N country)))
French:  (S (NP-SBJ (Det la) (N France)) (V est) (NP-PRD (Det le) (Adj premier) (N pays) (Adj occidental)))
Corresponding phrases in the two trees are aligned.

Extracting rewrite rules
From the aligned tree pair above:
NP0 (France) V (is) NP1 (country) => NP0 V NP1
N (France) => Det (la) N
Det (the) Adj1 (first) Adj2 (western) N (country) => Det Adj1 N Adj2
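The sketch below illustrates, under assumptions of my own (each phrase node given as its ordered child labels plus a child-level alignment map, which is not necessarily the paper's exact representation), how a reordering rule can be read off one pair of aligned phrase nodes.

```python
# Sketch of rule extraction from one pair of aligned phrase nodes.
# `alignment` maps each source child index to its aligned target child index
# (or None if unaligned); target children with no aligned source child
# (e.g. an inserted determiner) are kept in place on the right-hand side.
def extract_rule(src_children, tgt_children, alignment):
    lhs = list(src_children)
    tgt_to_src = {t: s for s, t in alignment.items() if t is not None}
    rhs = []
    for t_idx, t_label in enumerate(tgt_children):
        if t_idx in tgt_to_src:
            rhs.append(src_children[tgt_to_src[t_idx]])  # source child, in target order
        else:
            rhs.append(t_label)                          # target-only child, kept as-is
    return lhs, rhs

# "the first western country" => "le premier pays occidental"
src = ["Det(the)", "Adj1(first)", "Adj2(western)", "N(country)"]
tgt = ["Det(le)", "Adj(premier)", "N(pays)", "Adj(occidental)"]
align = {0: 0, 1: 1, 2: 3, 3: 2}
lhs, rhs = extract_rule(src, tgt, align)
print(" ".join(lhs), "=>", " ".join(rhs))
# Det(the) Adj1(first) Adj2(western) N(country) => Det(the) Adj1(first) N(country) Adj2(western)
```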

Creating more generalized rules
From: Det (the) Adj1 (first) Adj2 (western) N (country) => Det Adj1 N Adj2
– Adj (first) N (country) => Adj N
    * Adj N => Adj N
    * Adj (first) N => Adj N
– Adj (western) N (country) => N Adj
    * Adj N => N Adj
    * Adj (western) N => N Adj

Merging counts and normalizing
ADJ N => N ADJ
ADJ N => ADJ N
ADJ (first) N => ADJ N
ADJ (first) N => N ADJ
ADJ (first) N (country) => ADJ N
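A minimal sketch of the counting-and-normalization step follows, with made-up counts; for each left-hand side the right-hand-side probabilities are renormalized to sum to 1.

```python
from collections import Counter, defaultdict

# Sketch: merge rule counts and normalize per left-hand side so that,
# for each LHS, the RHS probabilities sum to 1. Counts are illustrative.
rule_counts = Counter({
    ("Adj N", "N Adj"): 640,
    ("Adj N", "Adj N"): 330,
    ("Adj (first) N", "Adj N"): 99,
    ("Adj (first) N", "N Adj"): 1,
})

def normalize(counts):
    totals = defaultdict(int)
    for (lhs, _), c in counts.items():
        totals[lhs] += c
    return {(lhs, rhs): c / totals[lhs] for (lhs, rhs), c in counts.items()}

probs = normalize(rule_counts)
print(round(probs[("Adj N", "N Adj")], 2))           # 0.66
print(round(probs[("Adj (first) N", "Adj N")], 2))   # 0.99
```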

Organizing rewrite rules
Hierarchy of left-hand-side patterns, each with its rule probabilities:
N:                        => N 0.9,         => Det N 0.1
N PP:                     => N PP 0.91,     => Det N PP 0.05
Adj N:                    => N Adj 0.64,    => Adj N 0.33
Adj N PP:                 => N Adj PP 0.61, => Adj N PP 0.30
Adj (first) N:            => Adj N 0.99,    => N Adj 0.01
Adj (first) N (country):  => Adj N 1.0

Organizing rewrite rules (ctd)
[Diagram: parent node Adj N (=> N Adj 0.64, => Adj N 0.27, …) with child node Adj (first) N (=> Adj N 0.99, …), illustrating a child whose rule distribution differs from its parent's]

Applying rewrite rules
Input tree:  (S (NP-SBJ (N He)) (V is) (NP-PRD (Det the) (Adj first) (Adj international) (N student)))
Candidate rules: Adj N => N Adj;  Adj1 Adj2 N => N Adj2 Adj1;  Adj1 (first) Adj2 N => Adj1 N Adj2
Output tree: (S (NP-SBJ (N He)) (V is) (NP-PRD (Det the) (Adj first) (N student) (Adj international)))
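Below is a hedged sketch of applying one rule to a tree node by permuting its children. It matches duplicate labels left to right, whereas the real rules use indexed slots (Adj1, Adj2) and lexical anchors such as Adj (first) and pick the most specific matching rule; the class and function names here are illustrative.

```python
# Sketch of applying one rewrite rule to a parse-tree node: if the node's
# child label sequence matches the rule's left-hand side, the children are
# permuted into the right-hand-side order.
class Node:
    def __init__(self, label, children=None, word=None):
        self.label, self.children, self.word = label, children or [], word

    def words(self):
        return [self.word] if self.word else [w for c in self.children for w in c.words()]

def apply_rule(node, lhs, rhs):
    """lhs/rhs are child-label sequences; rhs is a permutation of lhs."""
    labels = [c.label for c in node.children]
    if labels != list(lhs):
        return False
    slots_used, new_children = set(), []
    for label in rhs:  # map each rhs label to the next unused matching lhs slot
        i = next(k for k in range(len(lhs)) if lhs[k] == label and k not in slots_used)
        slots_used.add(i)
        new_children.append(node.children[i])
    node.children = new_children
    return True

np = Node("NP-PRD", [Node("Det", word="the"), Node("Adj", word="first"),
                     Node("Adj", word="international"), Node("N", word="student")])
apply_rule(np, ["Det", "Adj", "Adj", "N"], ["Det", "Adj", "N", "Adj"])
print(" ".join(np.words()))   # the first student international
```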

Decoding
Source: He is the first international student
Apply rewrite rules (Adj (first) Adj N => Adj N Adj): He is the first student international
Monotonic decoding: il est le premier étudiant international

Experimental results
– Training data: 90M-word Eng-Fr Canadian Hansard
– Test data: 500 sentences in the news domain
– Metric: Bleu score (Papineni et al., 2002), 1 reference translation
– Parser: English and French slot grammars
– Baseline system: (Tillmann and Xia, 2003)

Extracted rewrite rules
– Extracted rules: 15.0M
– After removing singletons: 2.9M
– After filtering with the hierarchy: 56K
   -- 1K unlexicalized rules
   -- 55K lexicalized rules, represented as 760 compact rule schemes.
      Ex: Adj (w) N => Adj N, w: new, first, prime, many, other, …

Most commonly used rules
Average number of rule applications per sentence: 1.4
– Adj N => N Adj: 0.32
– Adj (w) N => Adj N: 0.15
– NP1 ’s NP2 => NP2 de NP1: 0.05
– NP Adv V => NP V Adv: 0.03
– NP1 V NP2 (pron) => NP1 NP2 V: 0.03

Monotonic vs. non-monotonic decoding
[Table: Bleu scores for the baseline vs. the system with rewrite rules, under non-monotonic and monotonic decoding; the scores themselves are not preserved in this transcript]

Monotonic decoding results
[Plot: Bleu score (1-ref) vs. maximum source clump size, comparing the baseline with the system using rewrite rules]

Conclusion
Use automatically learned rewrite rules to reorder source sentences:
– Rewrite rules allow generalizations
– Monotonic decoding speeds up translation
Gain: 10% improvement in Bleu (0.196 => 0.215)

Future work
– Try other language pairs (Ar-Eng, Ch-Eng).
– Inject the rewriting lattice into the statistical models.
– Use rewrite rules directly in the decoder.

Backup slides

An example of filtering rules
[Hierarchy diagram with node counts, top rules, and gains w.r.t. the parent node: N* (10^9): => N 0.9; Adj N* (10^6): => N* Adj 0.7, => Adj N* 0.3; Adj (prime) N* (10^3): => Adj N 1.0; Det Adj (prime) N* (10^2): => Det Adj N 0.85; gains: 0, 4*10^5, 10^3, 0]

An example
Eng: the prime minister ’s press office issued the following press release
Word-by-word Fr: le premier ministre de presse service diffusé le suivant communiqué
Correct Fr: le service de presse du premier ministre a diffusé le communiqué suivant

Main issues in MT
– Word choice: office => bureau, cabinet, …, service; release => libération, sortie, disque, …, communiqué
– Inserting glue words: e.g., the preposition “de”, the auxiliary verb “a”
– Ordering target words: service de presse
– Morphing target words: subject-verb agreement, contraction (de + le => du), etc.

Two approaches to MT
– Syntax-based MT
– Statistical MT (SMT)

Syntax-based MT
Major steps:
– Parse the source sentence
– Translate source words into target words
– Reshape the source parse tree with rewrite rules
– Read the target sentence off the tree

Translation lexicon: prime => premier, ’s => de, office => service (if modified by “press”)
Rewrite rules: NP1 ’s NP2 => NP2 de NP1;  N1 N2 => N2 de N1
[Diagram: the tree for “the prime minister ’s press office” is reshaped step by step into “press office de the prime minister”, then into “service de presse de le premier ministre”]
Translation: service de presse de le premier ministre (de + le => du premier ministre)

Syntax-based approach
It requires:
– a parser for the source language
– a translation lexicon
– a set of rewrite rules
Normally, these components are created by hand.

Statistical Machine Translation
– Learns from a parallel corpus
– Easier to create translation systems for new language pairs
– “Phrase-based” models outperform word-based models.
[Diagram: a parallel corpus used to train E->F and F->E translators]

Advantages of phrase pairs
– Translating source words with extended context: press office => service de presse
– Glue word insertion: e.g., “de”
– Ordering of target words
– Morphing target words: the prime minister ’s => du premier ministre (de + le => du)

NSF workshop experiments
– Data: 150M-word Chinese-English parallel corpora. Top 1000 candidates, 4 references.
– Baseline (SMT): … in Bleu. Oracle result: … in Bleu. Each method: ranged from … to …. Adding all good methods: ….
– Typical improvements: no syntax > shallow ~ tricky > deep syntax
– Sadly, no gain from syntax

Syntax-based rewrite rules
Form: (X0 => X1 … Xn) => (Y0 => Y1 … Yn)
Each Xi, Yi specifies a head word, a thematic role (e.g., subj, obj), a syntax label (POS tag of the head word), etc.
Example: (VP => V + NP (iobj, w_it)) => (VP => NP (iobj, w_le) + V), i.e. V NP (iobj, it) => NP (iobj, le) V
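One possible (purely illustrative) encoding of such rule elements as a small data structure is sketched below; the classes and field names are my own assumptions, not the paper's notation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Sketch of one way to encode a syntax-based rewrite-rule slot: each slot
# may constrain the syntax label, the thematic role, and the head word.
@dataclass
class Slot:
    label: str                      # e.g. "V", "NP"
    role: Optional[str] = None      # e.g. "subj", "iobj"
    head: Optional[str] = None      # e.g. "it", "le"

@dataclass
class RewriteRule:
    lhs: List[Slot] = field(default_factory=list)
    rhs: List[Slot] = field(default_factory=list)

# V NP(iobj, it) => NP(iobj, le) V
rule = RewriteRule(
    lhs=[Slot("V"), Slot("NP", role="iobj", head="it")],
    rhs=[Slot("NP", role="iobj", head="le"), Slot("V")],
)
print(rule.lhs[1].role, "=>", rule.rhs[0].head)   # iobj => le
```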

Parsing: ESG parser
– Slot Grammar is a lexicalized, dependency-oriented system (McCord 1980)
– Languages covered: English, German, French, Spanish, Italian, and Portuguese

Training (1): parse and learn rewrite rules
English tree: (NP (NP (Det the) (Adj prime) (N minister)) ’s (NP (N press) (N office)))
French tree:  (NP (NP (Det le) (N service) (P de) (N presse)) (P de) (NP (Det le) (Adj premier) (N ministre)))
Rules learned: NP1 ’s NP2 => NP2 de NP1;  N1 N2 => le N2 de N1;  Det Adj N => Det Adj N

Training (2): put English sentences into French order
Apply the learned rules (NP1 ’s NP2 => NP2 de NP1; N1 N2 => le N2 de N1; Det Adj N => Det Adj N) to the English tree:
the prime minister ’s press office => le office de press de the prime minister

Training (3): learn phrases from training data
Eng: le office de press de the prime minister
Fr: le service de presse du premier ministre
Phrase pairs learned:
– le office de press => le service de presse
– press de the prime minister => presse du premier ministre
– le => le
– de => de, de the => du

Translating (1): put English sentences into French order
Input: the government ’s economic policy
English tree: (NP (NP (Det the) (N government)) ’s (NP (Adj economic) (N policy)))
Apply NP1 ’s NP2 => NP2 de NP1 and Adj N => N Adj:
=> policy economic de the government

Translating (2): translate with SMT decoder
Eng: policy economic de the government
Phrase pairs learned at training time:
– policy => politique
– economic => économique
– de the government => du gouvernement
SMT output (translating in linear order): politique économique du gouvernement

Test the idea
– Training data: 90M-word English-French Candide data
– Test data: 500 sentences, 1 reference translation
– Parser: English and French slot grammars (ESG and FSG)
– Rewrite rules: 10 hand-written rewrite rules, e.g. Adj N => N Adj

Experimental results
Improvement so far: from … to … (+9%)
NSF workshop: no gain from syntax
[Table: Bleu scores with/without source reordering crossed with with/without target reordering; the scores are not preserved in this transcript]

Learning rewrite rules from data
There are many rules and many exceptions:
– ADJ N => N ADJ 0.47
– ADJ N => ADJ N 0.27
  Ex: small, recent, past, former, next, last, good, previous, serious, certain, large, great, various, …

Algorithm
– Parse source and target sentences
– Align linguistic phrases
– Extract rewrite rules

Filtering rewrite rules
Why? Too many rules; most are “redundant”.
How? Put rules into a hierarchy and calculate gains w.r.t. parents.
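The transcript does not give the exact gain formula, so the sketch below assumes one plausible definition (node count times the improvement of the node's own best rule over the rule inherited from its parent); treat it as an illustration of the hierarchy-based filtering idea rather than the paper's algorithm.

```python
# Hedged sketch of hierarchy-based filtering: a node is worth keeping if
# using its own best rule beats inheriting its parent's best rule, weighted
# by how often the node occurs. The gain formula here is an assumption.
def best_rule(dist):
    return max(dist.items(), key=lambda kv: kv[1])

def gain(node_count, node_dist, parent_dist):
    _, p_own = best_rule(node_dist)
    parent_best, _ = best_rule(parent_dist)
    p_inherited = node_dist.get(parent_best, 0.0)
    return node_count * (p_own - p_inherited)

# Example loosely following the backup slide: node "Adj N*" vs its parent "N*".
parent = {"N": 0.9, "Det N": 0.1}              # node N*, count ~10^9
child = {"N Adj": 0.7, "Adj N": 0.3}           # node Adj N*, count ~10^6
print(gain(10**6, child, parent))              # 700000.0 -> keep this node
grandchild = {"Adj N": 1.0}                    # node Adj (prime) N*, count ~10^3
print(gain(10**3, grandchild, child))          # 1000.0 -> keep (illustrative)
```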

Translation results
– Baseline (no rewrite rules): 0.196
– With 10 hand-written rules: …
– With 1K unlexicalized rules: …
– With 1K unlexicalized rules and 760 “meta” rules: 0.215

Details of filtering algorithm
– Remove redundant unlexicalized rules
– Remove redundant lexicalized rules w.r.t. the corresponding unlexicalized rules
– Put …