Towards Syntactically Constrained Statistical Word Alignment Greg Hanneman 11-734: Advanced Machine Translation Seminar April 30, 2008.

Presentation transcript:

Towards Syntactically Constrained Statistical Word Alignment Greg Hanneman 11-734: Advanced Machine Translation Seminar April 30, 2008

Outline
The word alignment problem
Base approaches
Syntax-based approaches
– Distortion models
– Tree-to-string models
– Tree-to-tree models
Discussion

Word Alignment
Parallel sentence pair: F and E
Most general formulation: map a subset of the words of F to a subset of the words of E

Word Alignment
Very large alignment spaces!
– An n-word parallel sentence pair has n^2 possible links, and therefore 2^(n^2) possible alignments
– Restricting to one-to-one alignments still leaves n! possible alignments
Alignment models try to restrict this space, or learn a probability distribution over it, to get the "best" alignment of a sentence
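
A quick sanity check of these counts in Python (the n! figure treats the one-to-one case as a permutation of two equal-length sentences):

    from math import factorial

    for n in (2, 3, 4, 5):
        links = n * n              # every (e_i, f_j) pair is a potential link
        alignments = 2 ** links    # each link independently on or off
        one_to_one = factorial(n)  # one-to-one alignments with reordering
        print(n, links, alignments, one_to_one)
    # n = 5 already gives 2^25 = 33,554,432 unrestricted alignments, but only 120 permutations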

Outline
The word alignment problem
Base approaches
Syntax-based approaches
– Distortion models
– Tree-to-string models
– Tree-to-tree models
Discussion

A Generative Story [Brown et al. 1990]
[Figure: the English sentence "The proposal will not be implemented" passes through three channel steps: fertility, lexical generation into French words ("Les propositions ne seront pas application mises en"), and distortion, which reorders them into the final French sentence.]

The Framework
F: words f_1 … f_j … f_n
E: words e_1 … e_i … e_m
Compute P(F, A | E) for a hidden alignment variable A: a_1 … a_j … a_n
– The major step: decomposition, model parameters, EM algorithm, etc.
a_j = i: word f_j is aligned to word e_i

The IBM Models [Brown et al. 1993; Och and Ney 2003]
Model 1: "bag of words"; word order doesn't affect alignment
Model 2: the position of the words being aligned does matter
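
For concreteness, Model 1's decomposition of P(F, A | E) from Brown et al. [1993], in the notation of the previous slide (a_j = 0 denotes alignment to an empty NULL word; ε is a normalizing constant):

    P(F, A | E) = \frac{\epsilon}{(m+1)^n} \prod_{j=1}^{n} t(f_j | e_{a_j})

The only learned parameters are the lexical translation probabilities t(f | e); every alignment configuration of a given length is a priori equally likely, which is exactly the "bag of words" behavior.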

The IBM Models [Brown et al. 1993; Och and Ney 2003]
Later models use more structural or linguistic information, but only implicitly, and never real syntax
– Fertility: P(φ | e_i), the probability of e_i producing φ words in F
– Distortion: P(τ, π | E) for a set of F words τ in a permutation π
– Previous alignments: probabilities for the positions in F of the different words produced by a fertile e_i

The HMM Model [Vogel et al. 1996; Och and Ney 2003]
Linguistic intuition: words, and their alignments, tend to clump together in clusters
a_j depends on the absolute size of the "jump" between it and a_{j-1}
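
In Vogel et al.'s parametrization, the transition probability depends only on the width of the jump, estimated from jump counts c(·):

    p(a_j | a_{j-1}, m) = \frac{c(a_j - a_{j-1})}{\sum_{i'=1}^{m} c(i' - a_{j-1})}

Small jumps dominate the counts in real data, so alignments that stay close to their neighbors are rewarded.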

Discriminative Training
Consider all possible alignments, score them, and pick the best one under some set of constraints
Can incorporate arbitrary features; generative models are more fixed
Generative models' EM requires lots of unlabeled training data; discriminative training requires some labeled data

Discriminative Alignment [Taskar et al. 2005]
Features:
– Co-occurrence
– Position difference
– Co-occurrence of following words
– Word-frequency rank
– Model 4 prediction
– …
[Figure: matching between "The proposal will not be implemented" and "Les propositions ne seront pas mises en application".]
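
Taskar et al. cast alignment as a maximum-weight bipartite matching over learned link scores. A minimal sketch of the inference step in Python, assuming the weights have already been trained (the score matrix and the zero threshold are illustrative, not the paper's values):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # score[i, j]: learned score for linking English word i to French word j,
    # e.g. a dot product of the feature vector above with learned weights
    score = np.array([[ 2.1, -0.3,  0.0],
                      [-0.5,  1.7,  0.4],
                      [ 0.1,  0.2,  1.2]])

    # linear_sum_assignment minimizes total cost, so negate to maximize score
    rows, cols = linear_sum_assignment(-score)

    # keep only positive-scoring links; unmatched words effectively align to NULL
    alignment = [(i, j) for i, j in zip(rows, cols) if score[i, j] > 0]
    print(alignment)   # [(0, 0), (1, 1), (2, 2)]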

Outline
The word alignment problem
Base approaches
Syntax-based approaches
– Distortion models
– Tree-to-string models
– Tree-to-tree models
Discussion

Syntax-Based Approaches
Constrain the alignment space by looking beyond the flat text stream: take higher-level sentence structure into account
Representations:
– Constituency structure
– Inversion Transduction Grammar
– Dependency structure

An MT Motivation

Syntax-Based Distortion [DeNero and Klein 2007]
Syntax-based MT should start from syntax-aware word alignments
HMM model + target-language parse trees: prefer alignments that respect the tree
Handled in the distortion model: jumps should reflect the tree structure

Syntax-Based Distortion [DeNero and Klein 2007]
HMM distortion: size of the jump between a_{j-1} and a_j
Syntactic distortion: tree path between a_{j-1} and a_j
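
The tree path itself is easy to compute from parent pointers in the target parse: climb from both positions to their lowest common ancestor. A rough sketch of just the path computation (DeNero and Klein's model conditions distortion probabilities on properties of this path):

    def tree_path(parent, u, v):
        """Nodes on the path from leaf u to leaf v, given parent pointers
        (parent[root] is None)."""
        chain = []
        node = u
        while node is not None:        # walk from u all the way to the root
            chain.append(node)
            node = parent[node]
        seen = set(chain)
        climb = []
        while v not in seen:           # climb from v until we hit u's chain
            climb.append(v)
            v = parent[v]
        lca = v
        return chain[:chain.index(lca) + 1] + climb[::-1]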

Syntax-Based Distortion [DeNero and Klein 2007]
Training: 100,000 parallel French–English and Chinese–English sentences, with English parse trees
Both E → F and F → E models; combined with different unions and intersections, plus thresholds
Test: hand-aligned Hansards and NIST MT 2002 data

Syntax-Based Distortion [DeNero and Klein 2007]
The HMMs are roughly equal to each other, and better than GIZA++
Best settings: soft union for French; hard union for Chinese; competitive thresholding

Tree-to-String Models

New generative story
Word-level fertility and distortion replaced with node insertion and sibling reordering
Lexical translation still the same
Word alignment produced as a side effect of the lexical translations
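
A toy, deterministic sketch of the three channel operations, with made-up reorder, insertion, and translation tables (the real Yamada–Knight model attaches a learned probability to every choice and sums over all of them during EM):

    def transduce(tree, reorder, insert, translate):
        """Apply sibling reordering, node insertion, and leaf translation."""
        label, children = tree
        if isinstance(children, str):                # leaf: translate the word
            out = [translate.get(children, children)]
        else:
            labels = tuple(child[0] for child in children)
            order = reorder.get((label, labels), range(len(children)))
            out = []
            for k in order:                          # visit children in new order
                out.extend(transduce(children[k], reorder, insert, translate))
        out.extend(insert.get(label, []))            # right-insertion of function words
        return out

    tree = ("S", [("NP", "he"), ("VP", [("VB", "adores"), ("NP", "music")])])
    reorder = {("VP", ("VB", "NP")): (1, 0)}         # hypothetical verb-final reordering
    insert = {"NP": ["ga"]}                          # hypothetical particle insertion
    translate = {"he": "kare", "adores": "daisuki", "music": "ongaku"}
    print(" ".join(transduce(tree, reorder, insert, translate)))
    # -> kare ga ongaku ga daisuki

The word alignment falls out of which target word each source leaf produced, exactly as the slide says.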

Tree-to-String Alignment [Yamada and Knight 2001]
Discussed in other sessions this semester
Training: 2,121 short Japanese–English sentences, with modified Collins parser output for English
Test: the first 50 sentences of the training corpus
Beat IBM Model 5 on human judgments; perplexity between Model 1 and Model 5

Subtree Cloning [Gildea 2003]
The original tree-to-string model is too strict
– Syntactic divergences, reordering
Soft constraint: allow alignments that violate the tree structure, but at a cost
– Tweak the tree side of the alignment to contain the things needed by the string side
– Example: SVO to OSV

Subtree Cloning [Gildea 2003]
[Figure, three slides: the English parse tree for "I do entirely understand your language"; the same tree with a clone of the NP "I" inserted at a new position; and the cloned tree aligned to the source-language sentence.]

Subtree Cloning [Gildea 2003]
For a node n_p:
– Probability of cloning something as a new child of n_p: a single EM-learned constant, shared across all n_p
– Probability of making that clone a copy of node n_c: uniform over all n_c
Surprising that this works…
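
Spelled out (notation mine, with K the number of candidate nodes in the tree), the two bullets amount to:

    P(\text{clone } n_c \text{ under } n_p) = P_{ins} \cdot P(n_c \mid \text{clone}) = P_{ins} \cdot \frac{1}{K}

so only the single scalar P_ins has to be estimated by EM.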

Subtree Cloning [Gildea 2003]
Compared with IBM Models 1–3 and the basic tree-to-string and tree-to-tree models
Training: 4,982 Korean–English sentence pairs, with manual Korean parse trees
Test: 101 hand-aligned held-out sentences

Subtree Cloning [Gildea 2003]
Cloning helps: as good as or better than IBM
The tree-to-tree model runs faster

Tree-to-Tree Models
Alignment must conform to the tree structure on both sides, so the space is more constrained
Requires more transformation operations to handle divergent structures [Gildea 2003]
Or we could be more permissive…

Inversion Transduction Grammar [Wu 1997]
For bilingual parsing; a one-to-one word alignment falls out as a side effect
Parallel binary-branching trees with reordering

ITG Operations
A → [A A]
– Produce "A_1 A_2" in both the source and target streams
A → ⟨A A⟩
– Produce "A_1 A_2" in the source stream, "A_2 A_1" in the target stream
A → e / f
– Produce "e" in the source stream, "f" in the target stream

ITG Operations
A "canonical form" ITG produces only one derivation for a given alignment:
– S → A | B | C
– A → [A B] | [B B] | [C B] | [A C] | [B C] | [C C]
– B → ⟨A A⟩ | ⟨B A⟩ | ⟨C A⟩ | ⟨A C⟩ | ⟨B C⟩ | ⟨C C⟩
– C → e / f
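
The practical effect of the binary-branching constraint is that ITG covers almost, but not quite, all permutations. A small brute-force sketch that tests whether a permutation is ITG-reachable by trying every straight or inverted split:

    from itertools import permutations

    def is_itg(perm):
        """True if perm can be built from recursive straight/inverted splits."""
        if len(perm) <= 1:
            return True
        for k in range(1, len(perm)):
            left, right = perm[:k], perm[k:]
            # straight: left block precedes right block in the target order;
            # inverted: left block follows right block
            if max(left) < min(right) or min(left) > max(right):
                if is_itg(left) and is_itg(right):
                    return True
        return False

    bad = [p for p in permutations(range(4)) if not is_itg(p)]
    print(bad)   # [(1, 3, 0, 2), (2, 0, 3, 1)]: only the "inside-out" cases

So for four words, ITG excludes only 2 of the 24 permutations, which foreshadows the alignment-space comparison later in the talk.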

Alignment with ITG [Zhang and Gildea 2004]
Compared IBM Model 1, IBM Model 4, ITG, and tree-to-string (with and without cloning)
Training: Chinese–English (18,773) and French–English (20,000) sentences of fewer than 25 words
Test: hand-aligned Chinese–English (48) and French–English (447) sentences

Alignment with ITG [Zhang and Gildea 2004]
ITG is best, or at least as good as IBM or tree-to-string plus cloning
Yet ITG has no linguistic syntax…

Dependency Parsing
Discussed in other sessions this semester
Notion of violating "phrasal cohesion"
– Usually bad, but not always

Dependencies + ITG [Cherry and Lin 2006]
Find invalid dependency spans; assign a score of −∞ if one is used by the ITG parser
Simple model: maximize a co-occurrence score, with a penalty for distant words
ITG alone reduces AER by 13% relative; dependencies + ITG reduce it by 34%
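
A sketch of how the invalid spans might be precomputed, assuming (as a simplification of Cherry and Lin's cohesion constraint) that a span is invalid when it crosses some dependency subtree's span, i.e. overlaps it without either one containing the other:

    def subtree_spans(parent):
        """(leftmost, rightmost) descendant of each word in a projective tree;
        parent[w] is w's head index, or -1 for the root."""
        n = len(parent)
        lo, hi = list(range(n)), list(range(n))
        changed = True
        while changed:                         # propagate spans up to the heads
            changed = False
            for w in range(n):
                p = parent[w]
                if p >= 0:
                    if lo[w] < lo[p]: lo[p], changed = lo[w], True
                    if hi[w] > hi[p]: hi[p], changed = hi[w], True
        return set(zip(lo, hi))

    def invalid_spans(parent):
        """Spans (i, j) that cross some subtree span: overlap, no containment."""
        n, spans = len(parent), subtree_spans(parent)
        bad = set()
        for i in range(n):
            for j in range(i, n):
                for a, b in spans:
                    overlap = not (j < a or b < i)
                    nested = (i <= a and b <= j) or (a <= i and j <= b)
                    if overlap and not nested:
                        bad.add((i, j))
        return bad

    # "the man saw her": 'the' <- 'man' <- 'saw' -> 'her'
    print(sorted(invalid_spans([1, 2, -1, 2])))   # (1, 2) splits the NP "the man"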

Dependencies + ITG [Cherry and Lin 2006]
Discriminative training with an SVM
Feature vector for each ITG rule instance:
– Features from Taskar et al. [2005]
– A feature marking ITG inversion rules
– A feature (penalty) marking invalid spans based on the dependency tree

Dependencies + ITG [Cherry and Lin 2006]
Compared Taskar et al. to D-ITG with hard and soft constraints
Training: 50,000 French–English sentence pairs for counts and probabilities; 100 hand-annotated pairs with derived ITG trees for discriminative training
Test: 347 hand-annotated sentences from the 2003 parallel text workshop

Dependencies + ITG [Cherry and Lin 2006]
The relative improvement is smaller in the discriminative setting, where the objective function is already stronger
The hard constraint starts to hurt recall

Outline
The word alignment problem
Base approaches
Syntax-based approaches
– Distortion models
– Tree-to-string models
– Tree-to-tree models
Discussion

All These Tradeoffs…
Mathematical and statistical correctness vs. computability
Simple models vs. capturing linguistic phenomena
Not enough syntactic information vs. too much syntactic information
Ruling out bad alignments vs. keeping good alignments around

Alignment Spaces
Completely unconstrained: every alignment link (e_i, f_j) is either "on" or "off"
Permutation space: one-to-one alignment with reordering [Taskar et al. 2005]
ITG space: the permutation space satisfying a binary tree constraint [Wu 1997]
Dependency space: the permutation space maintaining phrasal cohesion

D-ITG space: Dependency ∩ ITG space [Cherry and Lin 2006]
HD-ITG space: D-ITG space where each span must contain a head [Cherry and Lin 2006a]

Examining Alignment Spaces [Cherry and Lin 2006a]
Alignment score:
– Learned co-occurrence score
– Gold-standard oracle score

Examining Alignment Spaces [Cherry and Lin 2006a]
Learned co-occurrence score:
– More restricted spaces give better results

Examining Alignment Spaces [Cherry and Lin 2006a]
Oracle score: subsets of the permutation space
– ITG rules out almost nothing correct
– Beam search in the dependency space does worst

Conclusions
Base alignment models rely on mathematical but limited notions of sentence structure
Syntax-aware alignment is helpful for syntax-aware MT [DeNero and Klein 2007]
Using structure as a hard constraint is harmful for divergent sentences; tweaking the trees [Gildea 2003] or using soft constraints [Cherry and Lin 2006] helps fix this

Conclusions
Surprise winner: ITG
– Computationally straightforward
– A permissive, simple grammar that mostly rules out only bad alignments [Cherry and Lin 2006a]
– Does a lot, even when it's not the best
The discriminative framework looks promising and flexible: it can incorporate generative models as features [Taskar et al. 2005]

Towards the Future
Easy-to-run GIZA++ made the complicated IBM models the norm; promising discriminative or syntax-based models currently lack such a toolkit
Syntax-based discriminative techniques: morphology, POS, semantic information…
Any other ideas?

References
Brown, P., J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, J. Lafferty, R. Mercer, and P. Roossin, "A statistical approach to machine translation," Computational Linguistics, 16(2):79-85, 1990.
Brown, P., S. Della Pietra, V. Della Pietra, and R. Mercer, "The mathematics of statistical machine translation: Parameter estimation," Computational Linguistics, 19(2), 1993.
Cherry, Colin and Dekang Lin, "Soft syntactic constraints for word alignment through discriminative training," Proceedings of the COLING/ACL Poster Session, 2006.
Cherry, Colin and Dekang Lin, "A comparison of syntactically motivated alignment spaces," Proceedings of EACL, 2006a.
DeNero, John and Dan Klein, "Tailoring word alignments to syntactic machine translation," Proceedings of ACL, 17-24, 2007.
Gildea, Daniel, "Loosely tree-based alignment for machine translation," Proceedings of ACL, 80-87, 2003.

References
Och, Franz and Hermann Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, 29(1):19-51, 2003.
Taskar, B., S. Lacoste-Julien, and D. Klein, "A discriminative matching approach to word alignment," Proceedings of HLT/EMNLP, 73-80, 2005.
Vogel, S., H. Ney, and C. Tillmann, "HMM-based word alignment in statistical translation," Proceedings of COLING, 1996.
Wu, Dekai, "Stochastic inversion transduction grammars and bilingual parsing of parallel corpora," Computational Linguistics, 23(3), 1997.
Yamada, Kenji and Kevin Knight, "A syntax-based statistical translation model," Proceedings of ACL, 2001.
Zhang, Hao and Daniel Gildea, "Syntax-based alignment: Supervised or unsupervised?" Proceedings of COLING, 2004.