
Transfer-based MT

Syntactic Transfer-based Machine Translation
Direct and example-based approaches are two ends of a spectrum: recombination of fragments gives better coverage. What if the matching/transfer is done at the level of the syntactic parse?
Three steps:
- Parse: syntactic parse of the source language sentence (a hierarchical representation of the sentence)
- Transfer: rules transform the source parse tree into a target parse tree (e.g., Subject-Verb-Object → Subject-Object-Verb)
- Generation: regenerate the target language sentence from the parse tree, handling the morphology of the target language
Tree structure provides better matching and longer-distance transformations than is possible in string-based EBMT.
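To make the transfer step concrete, here is a minimal Python sketch of one reordering rule (SVO → SOV) applied to a toy parse tree; the tree encoding, the single rule, and the example sentence are illustrative assumptions, not something given on the slides.

# Minimal sketch of the transfer step: reorder an English SVO parse into
# SOV order before generation. Trees are (label, children) tuples; leaves
# are plain words. The rule below is purely illustrative.

def transfer(node):
    """Recursively apply the reordering rule to a source parse tree."""
    if isinstance(node, str):                      # a leaf (word)
        return node
    label, children = node
    children = [transfer(c) for c in children]
    # Rule: S -> NP(subject) VP(V NP(object))  =>  move the object before the verb
    if label == "S" and len(children) == 2 and children[1][0] == "VP":
        subject, vp = children
        verb, obj = vp[1]
        return ("S", [subject, ("VP", [obj, verb])])
    return (label, children)

def yield_words(node):
    """Read off the terminal string of a (possibly transferred) tree."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in yield_words(child)]

source = ("S", [("NP", ["I"]),
                ("VP", [("V", ["use"]), ("NP", ["my", "card"])])])
print(" ".join(yield_words(transfer(source))))     # -> I my card use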

Examples of SynTran-MT
(Parallel parse trees, shown as figures, for "ajá quiero usar mi tarjeta de crédito" ↔ "yeah I wanna use my credit card".)
- Mostly parallel parse structures
- Might have to insert words: pronouns, morphological particles

Example of SynTran MT (2)
(Aligned trees, shown as figures, for 私は コレクト コールを かける 必要があります ↔ "I need to make a collect call".)
Pros:
- Allows for structure transfer
- Re-orderings are typically restricted to parent-child nodes
Cons:
- Transfer rules are needed for each language pair (N² rule sets)
- Hard to reuse rules when one of the languages is changed

Lexical-semantic Divergences
Linguistic divergences: structural differences between languages.
Categorial divergence: translation of words in one language into words that have different parts of speech in the other language.
Example: to be jealous → tener celos (to have jealousy)

Issues: Linguistic Divergences
Conflational divergence: translation of two or more words in one language into one word in the other language.
Example: to kick → dar una patada (to give a kick)

Issues: Linguistic Divergences
Structural divergence: realization of verb arguments in different syntactic configurations in different languages.
Example: to enter the house → entrar en la casa (to enter in the house)

Issues: Linguistic Divergences
Head-swapping divergence: inversion of a structural-dominance relation between two semantically equivalent words.
Example: to run in → entrar corriendo (to enter running)

Issues: Linguistic Divergences
Thematic divergence: realization of verb arguments that reflect different thematic-to-syntactic mapping orders.
Example: I like grapes → me gustan uvas (to-me please grapes)

Divergence counts (from Bonnie Dorr)
32% of sentences in a UN Spanish/English corpus (5K sentences) exhibit divergences:
  Categorial      X tener hambre       ↔ Y have hunger      98%
  Conflational    X dar puñaladas a Z  ↔ X stab Z           83%
  Structural      X entrar en Y        ↔ X enter Y          35%
  Head swapping   X cruzar Y nadando   ↔ X swim across Y     8%
  Thematic        X gustar a Y         ↔ Y likes X           6%

Transfer rules

Syntax-driven statistical machine translation
Slides from Deyi Xiong, CAS, Beijing

Why syntax-based SMT
Weaknesses of phrase-based SMT:
- Long-distance reordering (only phrase-level reordering)
- Discontinuous phrases
- Generalization
- …
Other methods using syntactic knowledge:
- Word alignment integrating syntactic constraints
- Pre-ordering source sentences
- Reranking n-best output of translation models

SSMT based on formal structures
Compared with phrase-based SMT:
- Translation proceeds hierarchically
- The target structures finally generated are not necessarily real linguistic structures, but they make long-distance reordering more feasible
- Non-terminals/variables are introduced, allowing discontinuous phrases (e.g., "put x on", "在 x 时") and better generalization

SCFG
Formulated as two CFGs and their correspondences (the formal definition and the probabilistic form P were given as formulas on the slide).

SCFG: an example

SCFG: derivation

ITG
Synchronous CFGs in which the links between nonterminals in a production are restricted to two possible configurations: straight (same order) or inverted (reversed order).
Any ITG can be converted into a synchronous CFG of rank two.
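As a quick illustration of the two ITG configurations, this minimal Python sketch shows the straight and inverted merges of two adjacent source spans; the English/Spanish phrase pair is an illustrative assumption.

# Minimal sketch of ITG's two merge operations: a larger translation unit is
# built from two adjacent source spans either in the same (straight) or
# reversed (inverted) target order.

def merge(left, right, inverted=False):
    """left/right: (source_phrase, target_phrase) pairs for adjacent source spans."""
    src = left[0] + " " + right[0]                                   # source order is fixed
    tgt = right[1] + " " + left[1] if inverted else left[1] + " " + right[1]
    return src, tgt

print(merge(("red", "rojo"), ("car", "coche")))                  # straight:  ('red car', 'rojo coche')
print(merge(("red", "rojo"), ("car", "coche"), inverted=True))   # inverted:  ('red car', 'coche rojo')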

BTG (Bracketing Transduction Grammar)

ITG as a reordering constraint
Two kinds of reordering: inverted and straight.
Coverage:
- Wu (1997): "been unable to find real examples" of cases where alignments would fail under this constraint, at least in "lightly inflected languages, such as English and Chinese."
- Wellington (2006): "we found examples" in "at least 5% of the Chinese/English sentence pairs".
Weakness: no strong mechanism for determining which order is better, inverted or straight.

Chiang’05: Hierarchical Phrase-based Model (HPM)
- Rules: synchronous rules over a single nonterminal X (shown as formulas on the slide)
- Glue rules: combine translated chunks monotonically (shown on the slide)
- Model: log-linear
- Decoder: CKY
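To show what a hierarchical rule with gaps does, here is a minimal Python sketch of applying a Chiang-style rule with two variables X1, X2 to already-translated sub-phrases; the rule and the filler phrases are illustrative assumptions.

# Minimal sketch of applying a hierarchical rule with two gaps.
# A rule pairs a source pattern and a target pattern over terminals and
# the variables X1, X2.

rule_src = ["X1", "与", "X2", "有", "邦交"]
rule_tgt = ["X1", "have", "diplomatic", "relations", "with", "X2"]

def apply_rule(src_pat, tgt_pat, fillers):
    """fillers maps 'X1'/'X2' to already-translated (src_words, tgt_words) pairs."""
    src = [w for s in src_pat for w in (fillers[s][0] if s in fillers else [s])]
    tgt = [w for s in tgt_pat for w in (fillers[s][1] if s in fillers else [s])]
    return src, tgt

fillers = {"X1": (["中国"], ["China"]), "X2": (["北韩"], ["North", "Korea"])}
src, tgt = apply_rule(rule_src, rule_tgt, fillers)
print(" ".join(src))   # 中国 与 北韩 有 邦交
print(" ".join(tgt))   # China have diplomatic relations with North Korea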

Chiang’05: rule extraction

Chiang’05: rule extraction restrictions
- Initial base phrases: at most 15 words on the source (French) side
- Final rules: at most 5 symbols on the source (French) side
- At most two non-terminals on each side, and they may not be adjacent on the source side
- At least one aligned terminal pair

Chiang’05: Model
Log-linear form (the model and feature definitions were shown as formulas on the slide).

Chiang’05: decoder

SSMT based on phrase structures
Using grammars with linguistic knowledge; the grammars are based on SCFG.
Two categories:
- Tree-string: tree-to-string and string-to-tree
- Tree-tree

Yamada & Knight 2001, 2003

Yamada’s work vs. SCFG
- Insertion operation: A → (w A1, A1)
- Reordering operation: A → (A1 A2 A3, A1 A3 A2)
- Translation operation: A → (x, y)

Yamada: weaknesses
- Single-level mapping: multi-level reordering is handled only by flattening the tree
- Word-based: phrases are handled only as "phrasal leaves"

Galley et al. 2004, 2006
- The translation model incorporates syntactic structure on the target language side
- Trained by learning "translation rules" from bilingual data
- The decoder uses a parser-like method to create syntactic trees as output hypotheses

Translation rules
- Target side: multi-level subtrees
- Source side: continuous or discontinuous phrases
Types of translation rules:
- Translating source phrases into target chunks, e.g.
  NPB(PRP/I) ↔ 我
  NP-C(NPB(DT/this NN/address)) ↔ 这个 地址

Types of translation rules (cont’d)
- Rules with variables, e.g.
  NP-C(NPB(PRP$/my x0:NN)) ↔ 我 的 x0
  PP(TO/to NP-C(NPB(x0:NNS NNP/park))) ↔ 去 x0 公园
- Rules that combine previously translated results, e.g.
  VP(x0:VBZ x1:NP-C) ↔ x1 x0
  (takes a noun phrase followed by a verb on the source side, switches their order, then combines them into a new verb phrase)
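To show how such a rule combines previously translated results, here is a minimal Python sketch of applying VP(x0:VBZ x1:NP-C) ↔ x1 x0; the (label, children) tree encoding and the lexical fillers are illustrative assumptions, not examples from the slides.

# Minimal sketch of a string-to-tree rule with variables: on the source
# (Chinese) side the NP comes before the verb, and the rule builds an
# English VP with the verb first.

def apply_vp_rule(x0_vbz, x1_npc):
    """x0_vbz, x1_npc: already-translated (chinese_words, english_subtree) pairs."""
    zh = x1_npc[0] + x0_vbz[0]                    # source order required by the rule: x1 x0
    en_tree = ("VP", [x0_vbz[1], x1_npc[1]])      # target tree order: VBZ NP-C
    return zh, en_tree

x0 = (["喜欢"], ("VBZ", ["likes"]))               # purely illustrative filler
x1 = (["这个", "地址"], ("NP-C", [("NPB", [("DT", ["this"]), ("NN", ["address"])])]))
zh, tree = apply_vp_rule(x0, x1)
print(" ".join(zh))    # the Chinese span covered by the rule
print(tree)            # ('VP', [('VBZ', ['likes']), ('NP-C', ...)])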

Rule extraction
- Word-align a parallel corpus
- Parse the target side
- Extract translation rules
  - Minimal rules: cannot be decomposed
  - Composed rules: composed of minimal rules
- Estimate probabilities
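The probability-estimation step is commonly done by relative frequency; here is a minimal Python sketch that conditions each extracted rule on the root label of its target tree. The toy rule counts and this particular conditioning choice are assumptions for illustration, not necessarily what Galley et al. used.

from collections import Counter

# Illustrative placeholder "rules": (target_tree, source_phrase) pairs.
extracted = [
    ("NPB(PRP/I)", "我"),
    ("NPB(PRP/I)", "我"),
    ("NPB(PRP/I)", "我们"),
    ("VP(x0:VBZ x1:NP-C)", "x1 x0"),
]

rule_counts = Counter(extracted)
root_counts = Counter(rule[0].split("(")[0] for rule in extracted)

for (tgt_tree, src), c in rule_counts.items():
    root = tgt_tree.split("(")[0]
    print(f"p({tgt_tree} <-> {src} | {root}) = {c}/{root_counts[root]} = {c / root_counts[root]:.2f}")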

Rule extraction Minimal rule

Composed rules

Format is Expressive [Knight & Graehl, 2005]
The rule format covers (example rules shown as tree figures on the slide):
- Phrasal translation (e.g., está cantando ↔ is singing)
- Non-constituent phrases (e.g., hay x0 ↔ there VB x0)
- Non-contiguous phrases (e.g., poner x0 ↔ put x0 on)
- Context-sensitive word insertion
- Multilevel re-ordering
- Lexicalized re-ordering

Decoder
- A probabilistic CYK-style parsing algorithm with beams
- Results in an English syntax tree corresponding to the Chinese sentence
- Guarantees that the output has some kind of globally coherent syntactic structure
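As a small illustration of "CYK-style with beams", here is a minimal Python sketch of beam pruning within one chart cell; the hypothesis representation, scores, and beam size are illustrative assumptions.

import heapq

def prune(cell, beam_size=5):
    """cell: list of (score, hypothesis) pairs for one source span; keep the best few."""
    return heapq.nlargest(beam_size, cell, key=lambda h: h[0])

cell = [(-2.3, "NP -> ..."), (-5.1, "VP -> ..."), (-1.7, "NP-C -> ...")]
print(prune(cell, beam_size=2))   # keeps the two highest-scoring hypotheses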

Decoding example (shown step by step as figures on successive slides)

Marcu et al. 2006 (SPMT)
- Integrating non-syntactifiable phrases
- Multiple features for each rule
- Decoding with multiple models

SSMT based on phrase structures
Two categories: tree-string (string-to-tree, tree-to-string) and tree-tree. Next: tree-to-string.

Tree-to-string Liu et al. 2006 Tree-to-string alignment template model

TAT
(Example alignment templates shown as tree figures, e.g. NP(NR 布什 NN 总统) ↔ President Bush, and a template covering 美国 和 … 间 ↔ between the United States and ….)

TAT: extraction
Constraints on the source tree:
- It must be a subtree of the parse
- It must be consistent with the word alignment
Restrictions on extraction:
- Both the first and last symbols in the target string must be aligned to some source symbols
- The height of T(z) is limited to no greater than h
- The number of direct descendants of a node of T(z) is limited to no greater than c
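A minimal Python sketch of checking the numeric and alignment restrictions listed above on a candidate TAT; the representation (a precomputed height, maximum branching factor, and set of aligned target positions) and the thresholds h and c are illustrative assumptions.

def tat_ok(height, max_children, tgt_len, aligned_tgt_positions, h=3, c=5):
    """height/max_children describe T(z); aligned_tgt_positions are the indices of
    target symbols in the TAT's target string that are aligned to a source symbol."""
    if height > h or max_children > c:
        return False
    # both the first and last target symbols must be aligned to some source symbol
    return 0 in aligned_tgt_positions and (tgt_len - 1) in aligned_tgt_positions

print(tat_ok(height=2, max_children=2, tgt_len=2, aligned_tgt_positions={0, 1}))   # True
print(tat_ok(height=5, max_children=2, tgt_len=2, aligned_tgt_positions={0, 1}))   # False (too tall)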

TAT: Model

Decoding

Tree-to-string vs. string-to-tree
- Tree-to-string: integrates source structures into translation and reordering, but the output is not guaranteed to be grammatical
- String-to-tree: guarantees that the output has some kind of globally coherent syntactic structure, but cannot use any knowledge from source structures

SSMT based on phrase structures
Two categories: tree-string (string-to-tree, tree-to-string) and tree-tree. Next: tree-tree.

Tree-Tree Synchronous tree-adjoining grammar (STAG) Synchronous tree substitution grammar (STSG)

STAG

STAG: derivation

STSG
(Example: the paired elementary trees for "beaucoup d’enfants donnent un baiser à Sam" ↔ "kids kiss Sam quite often", shown as tree figures.)

STSG: elementary trees
(The individual elementary tree pairs of the same example, shown as tree figures.)

Dependency structures
(Figures: (a) the phrase structure tree and (b) the dependency structure for the Chinese sentence 外商 投资 企业 成为 中国 外贸 重要 增长点.)

For MT: dependency structures vs. phrase structures
Advantages of dependency structures over phrase structures for machine translation:
- Inherent lexicalization
- Related to meaning
- Better representation of divergences across languages

SSMT based on dependency structures
- Lin 2004: A Path-based Transfer Model for Machine Translation
- Quirk et al. 2005: Dependency Treelet Translation: Syntactically Informed Phrasal SMT
- Ding et al. 2005: Machine Translation Using Probabilistic Synchronous Dependency Insertion Grammars

Lin 2004
- Translation model: trained by learning transfer rules from a bilingual corpus in which the source language sentences are parsed
- Decoding: finding the minimum path covering of the source language dependency tree

Lin 2004: path

Lin 2004: transfer rule

Quirk et al. 2005
- Translation model: trained by learning treelet pairs from a bilingual corpus in which the source language sentences are parsed
- Decoding: CKY-style

Treelet pairs

Quirk 2005: decoding

Ding 2005

Summary
(A diagram on the slide relates levels of representation between the source language and the target language: word, string (phrase or chunk), tree (formal, phrase, or dependency structure), semantic, interlingua.)

Introduction (slides from G. Satta)
- State-of-the-art machine translation systems are based on statistical models rooted in the theory of formal grammars/automata
- Translation models based on finite-state devices cannot easily model translations between languages with strong differences in word ordering
- Recently, several models based on context-free grammars have been investigated, borrowing from the theory of compilers the idea of synchronous rewriting

Introduction
Translation models based on synchronous rewriting:
- Inversion Transduction Grammars (Wu, 1997)
- Head Transducer Grammars (Alshawi et al., 2000)
- Tree-to-string models (Yamada & Knight, 2001; Galley et al., 2004)
- "Loosely tree-based" model (Gildea, 2003)
- Multi-Text Grammars (Melamed, 2003)
- Hierarchical phrase-based model (Chiang, 2005)
We use synchronous CFGs to study formal properties of all of these. (For Yamada & Knight, Gildea, and Chiang, the formalism names above were not proposed by the original authors.)

Synchronous CFG
A synchronous context-free grammar (SCFG) is based on three components:
- A context-free grammar (CFG) for the source language
- A CFG for the target language
- A pairing relation on the productions of the two grammars and on the nonterminals in their right-hand sides
("Pairing relation" is informal terminology; more precisely, there is a bijection between the nonterminals in the right-hand sides of paired productions, as the example on the next slides makes clear.)

Synchronous CFG
Example (Yamada & Knight, 2001):
English productions:
  VB  → PRP(1) VB1(2) VB2(3)
  VB2 → VB(1) TO(2)
  TO  → TO(1) NN(2)
  PRP → he    VB1 → adores    VB → listening    TO → to    NN → music
Japanese productions (paired line by line with the English ones):
  VB  → PRP(1) VB2(3) VB1(2)
  VB2 → TO(2) VB(1) ga
  TO  → NN(2) TO(1)
  PRP → kare ha    VB1 → daisuki desu    VB → kiku no    TO → wo    NN → ongaku
The superscripts pair the nonterminals of paired productions and should be read as a permutation. (The distinction between lexical and non-lexical productions is standard practice but not part of the definition.)
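A minimal Python sketch of synchronous rewriting with the example grammar above: co-indexed nonterminals are expanded together, so one derivation yields both the English and the Japanese string. The data encoding (tuples for nonterminals, plain strings for terminals) is an assumption made for illustration.

def apply_synchronous(rule_pair, children):
    """Expand one synchronous production.
    rule_pair: (source_rhs, target_rhs); nonterminals are ('SYM', index) tuples,
    terminals are plain strings. children maps index -> (src_words, tgt_words)."""
    def expand(rhs, side):
        out = []
        for sym in rhs:
            if isinstance(sym, tuple):       # co-indexed nonterminal
                out.extend(children[sym[1]][side])
            else:                            # terminal
                out.append(sym)
        return out
    return expand(rule_pair[0], 0), expand(rule_pair[1], 1)

# Lexical productions, already expanded to (English words, Japanese words) pairs
prp, vb1 = (["he"], ["kare", "ha"]), (["adores"], ["daisuki", "desu"])
vb_lex, to_lex, nn = (["listening"], ["kiku", "no"]), (["to"], ["wo"]), (["music"], ["ongaku"])

to = apply_synchronous(([("TO", 1), ("NN", 2)], [("NN", 2), ("TO", 1)]), {1: to_lex, 2: nn})
vb2 = apply_synchronous(([("VB", 1), ("TO", 2)], [("TO", 2), ("VB", 1), "ga"]), {1: vb_lex, 2: to})
root = apply_synchronous(([("PRP", 1), ("VB1", 2), ("VB2", 3)],
                          [("PRP", 1), ("VB2", 3), ("VB1", 2)]), {1: prp, 2: vb1, 3: vb2})
print(" ".join(root[0]))   # he adores listening to music
print(" ".join(root[1]))   # kare ha ongaku wo kiku no ga daisuki desu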

Synchronous CFG
Example (cont’d): the paired derivation of "he adores listening to music" and "kare ha ongaku wo kiku no ga daisuki desu" is shown step by step as tree figures on the slide.
The pairing between nonterminals is crucial to the definition of the derive relation: rewriting always targets two co-indexed nonterminals, and the superscripts are ignored once a nonterminal has been rewritten. Nonterminals with different superscripts cannot be targeted by a synchronous production; dropping this requirement yields rewriting systems with much more generative power (for instance matrix grammars). Note that requiring a bijection between phrases of the two derivation trees is quite strong, and some formalisms drop it.

Synchronous CFG
- A pair of CFG productions in an SCFG is called a synchronous production
- An SCFG generates pairs of trees/strings, where each component is a translation of the other
- An SCFG can be extended with probabilities, much like a PCFG: each pair of productions is assigned a probability, and the probability of a pair of trees is the product of the probabilities of the synchronous productions involved (i.e., the likelihood that one component translates into the other)
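A minimal sketch of the probabilistic extension: each synchronous production carries a probability, and a paired derivation scores as the product of the probabilities of the productions it uses. The probability values below are made up for illustration.

import math

rule_prob = {
    ("VB -> PRP VB1 VB2", "VB -> PRP VB2 VB1"): 0.6,
    ("PRP -> he",         "PRP -> kare ha"):    0.9,
    ("VB1 -> adores",     "VB1 -> daisuki desu"): 0.8,
}

def derivation_prob(used_rules):
    """used_rules: the synchronous productions used in one paired derivation."""
    return math.prod(rule_prob[r] for r in used_rules)

print(derivation_prob(list(rule_prob)))   # 0.6 * 0.9 * 0.8 = 0.432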

Membership
The membership problem (Wu, 1997) for SCFGs, analogous to the recognition problem for context-free languages but in two dimensions:
- Input: an SCFG and a pair of strings [w1, w2]
- Output: yes/no, depending on whether w1 translates into w2 under the SCFG
Applications in segmentation, word alignment and bracketing of parallel corpora.
The assumption that the SCFG is part of the input is made here to investigate how the complexity of the problem depends on grammar size.

Membership
Result: the membership problem for SCFGs is NP-complete.
The proof is a reduction from 3SAT (not presented here): SCFG derivations are used to explore the space of consistent truth assignments that satisfy the source 3SAT instance. The source of intractability is that permutations can be very "nasty", as the exponential lower bound discussed next makes apparent.
Remarks: the result transfers to (Yamada & Knight, 2001), (Gildea, 2003) and (Melamed, 2003), which are at least as powerful as SCFGs. Yamada & Knight (2001) use essentially SCFGs; Melamed (2003) uses SCFGs with multiple dimensions and independent rewriting (arguments can be dropped on one side, but in two dimensions the weak generative power is the same as SCFGs); Gildea (2003) uses a formalism strictly more powerful than SCFGs because of movement, simulated through copying and deletion. By contrast, inversion transduction grammars and head transducer grammars are restricted versions of SCFGs and can be parsed in polynomial time even for variable input grammars.

Membership
Remarks (cont’d): the problem can be solved in polynomial time if
- the input grammar is fixed, or the production length is bounded (Melamed, 2004: a general algorithm given for multi-text grammars that can easily be adapted to SCFGs)
- the grammar is an Inversion Transduction Grammar (Wu, 1997) or a Head Transducer Grammar (Alshawi et al., 2000)
For NLP applications, it is more realistic to assume a fixed grammar and a varying input string.

Chart parsing
Providing an exponential time lower bound for the membership problem would amount to showing P ≠ NP. But we can show such a lower bound if we make some assumptions on the class of algorithms and data structures used to solve the problem.
Result: if chart parsing techniques (dynamic programming, polynomial time for CFGs) are used to solve the membership problem for SCFGs, the number of partial analyses obtained grows exponentially with the production length of the input grammar.

Chart parsing
Chart parsing for CFGs works by extending partial analyses of a single rule A → B1 B2 B3 … Bn through combination with already completed constituents (shown as an animation on the slide). An index indicates a position within the input string; each such combination involves three indices, independently of the length of the rule, so at most O(n³) combinations must be checked to complete the parsing task for each rule (n the length of the input string). Chart algorithms therefore run in polynomial time for CFGs. For SCFGs, by contrast, the number of indices will depend on the length of the productions.
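For contrast with the synchronous case, here is a minimal Python sketch of the standard monolingual CKY recognizer, in which each combination indeed touches only the three string positions i, k, j; the toy grammar is an illustrative assumption.

def cky_recognize(words, lexical, binary):
    """lexical: word -> set of labels; binary: (B, C) -> set of labels A with A -> B C."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexical.get(w, []))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                     # the third index
                for B in chart[i][k]:
                    for C in chart[k][j]:
                        chart[i][j] |= binary.get((B, C), set())
    return "S" in chart[0][n]

lexical = {"he": {"PRP"}, "adores": {"VBZ"}, "music": {"NN"}}
binary = {("VBZ", "NN"): {"VP"}, ("PRP", "VP"): {"S"}}
print(cky_recognize(["he", "adores", "music"], lexical, binary))   # True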

Chart parsing
Consider the synchronous production [ A → B(1) B(2) B(3) B(4) , A → B(3) B(1) B(4) B(2) ], representing the permutation (1 2 3 4) ↦ (3 1 4 2). We focus on such "nasty" permutations: they implement uniform shuffling, the worst case for synchronous parsing, as the next slide shows.

Chart parsing
When applying chart parsing to such a production, there is no way to keep partial analyses "contiguous": combining any two constituents that are contiguous on the left (source) component always yields non-contiguous constituents on the right (target) component, and vice versa. The combination of two constituents therefore involves seven indices, more than the 3 × 2 = 6 we would expect if partial analyses on both sides could always be extended contiguously.

Chart parsing
The proof of the result generalizes the previous observations: for some worst-case permutations of length q (which realize uniform shuffling), any combination strategy leads to a number of indices growing at least as sqrt(q). Thus, for SCFGs with productions of size q, sqrt(q) is an asymptotic lower bound on the exponent of the running time when chart parsing algorithms are used; no fixed-degree polynomial bounds the running time.

Translation
A probabilistic SCFG provides the probability that tree t1 translates into tree t2: Pr([t1, t2]).
Accordingly, we can define the probability that string w1 translates into string w2:
  Pr([w1, w2]) = Σ_{t1 ⇒ w1, t2 ⇒ w2} Pr([t1, t2])
and the probability that string w translates into tree t:
  Pr([w, t]) = Σ_{t1 ⇒ w} Pr([t1, t])
(where t ⇒ w means that tree t yields string w).

Translation The string-to-tree translation problem for probabilistic SCFGs is defined as follows: Input: Probabilistic SCFG and string w Output: tree t such that Pr([w, t ]) is maximized Application in machine translation Again, assumption that SCFG is part of the input is made to investigate the dependency of problem complexity on grammar size

Translation
Result: the string-to-tree translation problem for probabilistic SCFGs (summing over possible source trees) is NP-hard.
The proof reduces from the consensus problem: strings generated by a probabilistic finite automaton or hidden Markov model have probabilities defined as a sum of probabilities over several paths, and maximizing such a summation is NP-hard (Casacuberta & de la Higuera, 2000; Lyngsø & Pedersen, 2002). (The consensus problem was first studied by Khalil Sima'an for DOP grammars and probabilistic CFGs, showing NP-hardness; Casacuberta proved NP-hardness for probabilistic regular grammars and Lyngsø for HMMs. The automata/HMMs must be nondeterministic; otherwise the consensus problem can be solved in polynomial time with Viterbi-like techniques.)

Translation
Remarks:
- The source of complexity is that several source trees can be translated into the same target tree: the source language has a "finer-grained" ambiguity than the target language (the target language does not preserve the ambiguity). It has nothing to do with the intricacy of the permutations implemented by the productions.
- The result persists if there is a constant bound on the length of synchronous productions.
- Open: can the problem be solved in polynomial time if the probabilistic SCFG is fixed? (A related question: can the maximum-probability string of length n be extracted from a fixed PCFG in time polynomial in n? Even for probabilistic finite automata this seems to be open.)

Learning Non-Isomorphic Tree Mappings for Machine Translation (slides from J. Eisner)
Running example, shown as aligned tree figures: "wrongly report events to-John" ↔ "him misinform of the events", illustrating that two words can become one (wrongly report → misinform), zero words can become one (the), and dependents can be reordered.

Syntax-Based Machine Translation
Previous work assumes essentially isomorphic trees (Wu 1995, Alshawi et al. 2000, Yamada & Knight 2000). But trees are not isomorphic! Reasons:
- Discrepancies between the languages
- Free translation in the training data
(A non-isomorphic tree pair for the running example is shown as a figure.)

Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English: "beaucoup d’enfants donnent un baiser à Sam" → "kids kiss Sam quite often" (shown as tree figures).
A possible alignment is shown in orange, together with a much worse alignment for contrast. The alignment shows how the two trees are generated synchronously from "little trees".

Grammar = Set of Elementary Trees
(The elementary tree pairs of the example grammar are shown as figures.) Notable pairs: an idiomatic translation (donnent un baiser à ↔ kiss); "beaucoup d’" deletes inside the tree and matches nothing in English; the adverbial subtree (quite often) matches nothing in French.

Probability model similar to PCFG
Probability of generating training trees T1, T2 with alignment A:
  P(T1, T2, A) = ∏ p(t1, t2, a | n)
i.e., the product of the probabilities of the "little" trees that are used; each p(t1, t2, a | n) is given by a maximum entropy model (illustrated on the slide with the little tree pair "wrongly report NP" ↔ "misinform NP").

Form of model of big tree pairs
Joint model P(T1, T2). It is wise to use the noisy-channel form P(T1 | T2) · P(T2): P(T2) could be trained on zillions of target-language trees, while P(T1 | T2) must be trained on paired trees (hard to get). But any joint model will do.
In synchronous TSG, an aligned big tree pair is generated by choosing a sequence of little tree pairs:
  P(T1, T2, A) = ∏ p(t1, t2, a | n)

Maxent model of little tree pairs
p(little tree pair), e.g. p("wrongly report NP" ↔ "misinform NP"), is modeled with features such as:
- report + wrongly → misinform? (use dictionary)
- report → misinform? (at root)
- wrongly → misinform?
- verb incorporates adverb child?
- verb incorporates child 1 of 3?
- children 2, 3 switch positions?
- common tree sizes & shapes?
- etc.
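A minimal, illustrative Python sketch of log-linear scoring of a little tree pair with binary features echoing the list above; the feature definitions and weights are made up, and the normalization over competing little tree pairs is omitted.

def features(t1, t2):
    """t1, t2: ('head word', [children]) little trees; returns a binary feature dict."""
    return {
        "roots_translate":          (t1[0], t2[0]) == ("report", "misinform"),
        "verb_incorporates_adverb": "wrongly" in t1[1],
        "same_arity":               len(t1[1]) == len(t2[1]),
    }

weights = {"roots_translate": 2.0, "verb_incorporates_adverb": 1.5, "same_arity": 0.3}

def maxent_score(t1, t2):
    """Unnormalized log-linear score: sum of weights of the features that fire."""
    return sum(weights[k] for k, v in features(t1, t2).items() if v)

t1 = ("report", ["wrongly", "NP"])     # source little tree: "wrongly report NP"
t2 = ("misinform", ["NP"])             # target little tree: "misinform NP"
print(maxent_score(t1, t2))            # 3.5 (before normalizing over alternatives)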

Inside Probabilities
(The recursion is shown as tree figures: the inside probability β of a pair of nodes sums, over the little tree pairs rooted there, the probability p of the little tree pair times the β values of its frontier node pairs. Only O(n²) such quantities are needed.)

P(T1, T2, A) = ∏ p(t1, t2, a | n)
- Alignment: find A to maximize P(T1, T2, A)
- Decoding: find T2, A to maximize P(T1, T2, A)
- Training: find θ to maximize Σ_A P(T1, T2, A)
Do everything on little trees instead! We only need to train and decode a model of p(t1, t2, a). But since we are not sure how to break up a big tree correctly, we try all possible little trees and all ways of combining them, by dynamic programming.

Alignment Pseudocode
for each node c1 of T1 (bottom-up)
  for each possible little tree t1 rooted at c1
    for each node c2 of T2 (bottom-up)
      for each possible little tree t2 rooted at c2
        for each matching a between frontier nodes of t1 and t2
          p = p(t1, t2, a)
          for each pair (d1, d2) of frontier nodes matched by a
            p = p * β(d1, d2)          // inside probability of kids
          β(c1, c2) = β(c1, c2) + p    // our inside probability
Nonterminal states are used in practice but not shown here.
For EM training, also find outside probabilities.

An MT Architecture
(Shown as a block diagram.) A dynamic programming engine underlies both components:
- Trainer: scores all alignments of two big trees T1, T2; inside-outside yields estimated counts of each possible (t1, t2, a), which update the parameters θ of the probability model p(t1, t2, a) of little trees.
- Decoder: scores all alignments between a big tree T1 and a forest of big trees T2; the probability model proposes translations t2 of each little tree t1 and scores each proposed (t1, t2, a); the Viterbi alignment yields the output T2.

Related Work
- Synchronous grammars (Shieber & Schabes 1990); statistical work has allowed only 1:1 (isomorphic trees): stochastic inversion transduction grammars (Wu 1995), head transducer grammars (Alshawi et al. 2000)
- Statistical tree translation: the noisy channel model (Yamada & Knight 2000) infers the tree, training on (string, tree) pairs rather than (tree, tree) pairs, but again allows only 1:1, plus 1:0 at leaves; data-oriented translation (Poutsma 2000) is a synchronous DOP model trained on already aligned trees
- Statistical tree generation, similar to our decoding (construct a forest of appropriate trees, pick by highest probability): dynamic programming search in a packed forest (Langkilde 2000), stack decoder (Ratnaparkhi 2000)

What Is New Here?
- Learning full elementary tree pairs, not rule pairs or subcat pairs (previous statistical formalisms have basically assumed isomorphic trees)
- Maximum-entropy modeling of elementary tree pairs
- A new, flexible formalization of synchronous Tree Substitution Grammar: allows either dependency trees or phrase-structure trees; "empty" trees permit insertion and deletion during translation; concrete enough for implementation (cf. informal previous descriptions); TSG is more powerful than CFG for modeling trees, but faster than TAG
- The observation that dynamic programming is surprisingly fast: find all possible decompositions into aligned elementary tree pairs in O(n²) if both input trees are fully known and elementary tree size is bounded