Quasi-Synchronous Grammars: Alignment by Soft Projection of Syntactic Dependencies
David A. Smith and Jason Eisner
Center for Language and Speech Processing, Department of Computer Science, Johns Hopkins University
Synchronous Grammars
Synchronous grammars elegantly model P(T1, T2, A).
Conditionalizing for alignment and translation.
Training? Observe parallel trees? Impute trees/links? Project known trees…
Example: "Im Anfang war das Wort" / "In the beginning was the word"
Projection
Train with bitext: parse one side, align words, project dependencies.
Many-to-one links? Non-projective and circular dependencies?
Proposals in Hwa et al., Quirk et al., etc.
Example: "Im Anfang war das Wort" / "In the beginning was the word"
Divergent Projection
Example: "Auf diese Frage habe ich leider keine Antwort bekommen" / "I did not unfortunately receive an answer to this question"
Alignment configurations shown: monotonic, NULL, head-swapping, siblings.
Free Translation
Example: "Tschernobyl könnte dann etwas später an die Reihe kommen" / "Then we could deal with Chernobyl sometime later"
Bad dependencies. Parent-ancestors? NULL.
Dependency Menagerie
Overview
Divergent & sloppy projection
Modeling motivation
Quasi-Synchronous Grammars (QG)
Basic parameterization
Modeling experiments
Alignment experiments
QG by Analogy
HMM: noisy channel generating states
MEMM: direct generative model of states
CRF: undirected, globally normalized
(Diagram: source and target sequences.)
Words with Senses
Example: "I have presented the paper about …" / "Ich habe die Veröffentlichung über … präsentiert"
I really mean "conference paper": the sense is "Veröffentlichung", not "das Papier" (cf. "mit" / "with").
Now: senses in a particular (German) sentence.
Quasi-Synchronous Grammar
QG: a target-language grammar that generates translations of a particular source-language sentence.
A direct, conditional model of translation: P(T2, A | T1).
This grammar can be a CFG, TSG, TAG, etc.
Generating a QCFG from T1
U = target-language grammar nonterminals; V = nodes of the given source tree T1 (the "senses").
Binarized QCFG, with A, B, C ∈ U and α, β, γ ∈ 2^V:
⟨A, α⟩ → ⟨B, β⟩ ⟨C, γ⟩
⟨A, α⟩ → w
Present modeling restrictions: |α| ≤ 1; dependency grammars (1 node per word); tie parameters that depend on α, β, γ.
"Model 1" property: reuse of senses. Why?
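As a minimal illustration (my sketch, not from the slides), under the |α| ≤ 1 restriction each QCFG nonterminal simply pairs a target nonterminal with a single source node or NULL; the toy U and V below are assumptions.

```python
from itertools import product

# Hypothetical toy inputs: target-grammar nonterminals U and source-tree nodes V.
U = ["S", "NP", "VP"]                        # target-language nonterminals
V = ["Im", "Anfang", "war", "das", "Wort"]   # nodes of the given source tree T1

# Under the |alpha| <= 1 restriction, a QCFG nonterminal pairs a target
# nonterminal with at most one source node ("sense"); None stands for NULL.
qcfg_nonterminals = list(product(U, V + [None]))

# "Model 1" property: the same source node may be reused as the sense of
# many target constituents, so nothing is removed from this set during parsing.
print(len(qcfg_nonterminals))   # |U| * (|V| + 1) = 3 * 6 = 18
```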
Modeling Assumptions
Example: "Im Anfang war das Wort" / "In the beginning was the word"
At most 1 sense per English word.
Dependency grammar: one node per word.
Allow sense "reuse".
Tie parameters for all tokens of "im".
Dependency Relations
Parent-child, child-parent, same node, siblings, grandparent, c-command, + "none of the above".
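A sketch (my own illustration, not code from the paper) of how the source-tree relation between a target head's sense and its dependent's sense might be classified; the toy tree, the function name, and the exact reading of "c-command" in dependency terms are assumptions.

```python
def classify_relation(parent_of, a_head, a_child):
    """Classify the source-tree relation between the sense a_head of a target
    head word and the sense a_child of its dependent.
    parent_of maps each source node to its parent (None for the root)."""
    if a_head is None or a_child is None:
        return "NULL"
    if a_head == a_child:
        return "same node"
    if parent_of.get(a_child) == a_head:
        return "parent-child"            # the projected dependency is preserved
    if parent_of.get(a_head) == a_child:
        return "child-parent"            # head-swapping
    if parent_of.get(a_head) is not None and \
       parent_of.get(a_head) == parent_of.get(a_child):
        return "siblings"
    if parent_of.get(parent_of.get(a_child)) == a_head:
        return "grandparent"
    # One reading of c-command: the parent of a_head is an ancestor of a_child.
    node = parent_of.get(a_child)
    while node is not None:
        if node == parent_of.get(a_head):
            return "c-command"
        node = parent_of.get(node)
    return "none of the above"

# Toy source tree for "Auf diese Frage habe ich leider keine Antwort bekommen"
# (heads assumed for illustration only):
parents = {"habe": None, "ich": "habe", "bekommen": "habe", "leider": "bekommen",
           "Antwort": "bekommen", "keine": "Antwort", "Auf": "bekommen",
           "Frage": "Auf", "diese": "Frage"}
print(classify_relation(parents, "habe", "ich"))      # parent-child
print(classify_relation(parents, "habe", "Antwort"))  # grandparent
```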
QCFG Generative Story
Observed source tree: "Auf diese Frage habe ich leider keine Antwort bekommen" (plus NULL); target: "I did not unfortunately receive an answer to this question".
Example factors: P(parent-child), P(PRP | no left children of did), P(I | ich), P(breakage).
Runtime: O(m²n³).
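One plausible reading of how the factors listed on this slide combine per dependency edge; the product form, the table values, and the helper name below are my assumptions, not the paper's exact parameterization.

```python
# Toy probability tables (values invented for illustration).
p_config = {"parent-child": 0.6, "child-parent": 0.1, "same node": 0.05,
            "siblings": 0.05, "grandparent": 0.05, "c-command": 0.05,
            "none of the above": 0.1}                          # P(breakage/configuration)
p_translate = {("I", "ich"): 0.7, ("answer", "Antwort"): 0.5}  # P(target word | sense)
p_mono = {("PRP", "did", "no left children"): 0.3}             # monolingual dependency factor

def edge_score(config, tgt_word, src_word, mono_event):
    """Score one target dependency edge as the product of a cross-lingual
    configuration factor, a lexical translation factor, and a monolingual
    dependency factor."""
    return (p_config.get(config, 1e-6)
            * p_translate.get((tgt_word, src_word), 1e-6)
            * p_mono.get(mono_event, 1e-6))

# E.g. the edge did -> I, with senses habe -> ich, in configuration parent-child:
print(edge_score("parent-child", "I", "ich", ("PRP", "did", "no left children")))
```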
Training the QCFG
Rough surrogates for translation performance: How can we best model the target given the source? How can we best match human alignments?
German-English Europarl from SMT05: 1k, 10k, 100k sentence pairs; German parsed with the Stanford parser.
EM training of monolingual/bilingual parameters.
For efficiency, select alignments in training (not test) from the IBM Model 4 union.
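A minimal sketch of the M-step for EM training of tied parameters, assuming the E-step has already produced expected counts from the O(m²n³) dynamic program; the event names and count values are invented for illustration.

```python
from collections import defaultdict

def normalize(expected_counts):
    """M-step: turn expected counts of (context, outcome) events into
    conditional probabilities P(outcome | context); parameters are tied
    across all tokens that share a context."""
    totals = defaultdict(float)
    for (context, outcome), c in expected_counts.items():
        totals[context] += c
    return {(context, outcome): c / totals[context]
            for (context, outcome), c in expected_counts.items()}

# Invented expected configuration counts from one E-step:
counts = {("config", "parent-child"): 70.0,
          ("config", "child-parent"): 10.0,
          ("config", "none of the above"): 20.0}
print(normalize(counts))   # parent-child -> 0.7, child-parent -> 0.1, none -> 0.2
```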
Cross-Entropy Results
AER Results
AER Comparison
(Chart comparing IBM Model 4 German-English, QG German-English, and IBM Model 4 English-German.)
Conclusions
Strict isomorphism hurts for modeling translations and for aligning bitext.
Breakages beyond local nodes help most: "none of the above" beats simple head-swapping and 2-to-1 alignments.
Insignificant gains from a further breakage taxonomy.
Continuing Research
Senses of more than one word should help, while maintaining O(m²n³).
Further refining monolingual features on monolingual data.
Comparison to other synchronizers.
Decoder in progress uses the same direct model of P(T2, A | T1), globally normalized and discriminatively trained.
Thanks
David Yarowsky, Sanjeev Khudanpur, Noah Smith, Markus Dreyer, David Chiang, our reviewers, and the National Science Foundation.
Synchronous Grammar as QG
Target nodes correspond to 1 or 0 source nodes. For every rule ⟨A, α0⟩ → ⟨B, α1⟩ … ⟨C, αk⟩:
(∀ i ≠ j) αi ≠ αj unless αi = NULL
(∀ i > 0) αi is a child of α0 in T1, unless αi = NULL
STSG and STAG operate on derivation trees.
Cf. Gildea's clone operation as a quasi-synchronous move.
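A small sketch checking the two strict-synchrony constraints above for a single rule; the child-set encoding of T1 and the assumed dependency heads for the example are my own.

```python
def is_strictly_synchronous(alpha_0, child_alphas, src_children):
    """Check, for one rule <A, alpha_0> -> <B, alpha_1> ... <C, alpha_k>, that
    non-NULL child senses are pairwise distinct and each is a child of alpha_0
    in the source tree T1 (src_children maps a node to its set of children)."""
    non_null = [a for a in child_alphas if a is not None]
    if len(non_null) != len(set(non_null)):
        return False                            # violates: alpha_i != alpha_j
    allowed = src_children.get(alpha_0, set())
    return all(a in allowed for a in non_null)  # alpha_i is a child of alpha_0

# Toy source tree for "Im Anfang war das Wort" (heads assumed for illustration):
src_children = {"war": {"Anfang", "Wort"}, "Anfang": {"Im"}, "Wort": {"das"}}
print(is_strictly_synchronous("war", ["Anfang", "Wort"], src_children))  # True
print(is_strictly_synchronous("war", ["Anfang", "das"], src_children))   # False: a breakage
```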
Say What You’ve Said
Projection
Synchronous grammars can explain the source-target relation, but may need fancy formalisms that are harder to learn.
Align as many fragments as possible; explain fragmentariness when target-language requirements override.
Some regular phenomena: head-swapping, c-command (STAG), traces.
Pipeline: monolingual parser, word alignment, project to the other language.
Empirical model vs. decoding: P(T2, A | T1) via a synchronous dependency grammar.
How do you train? Just look at your synchronous corpus … oops. Just look at your parallel corpus and infer the synchronous trees … oops. Just look at your parallel corpus aligned by Giza, and project dependencies over to infer synchronous tree fragments.
But how do you project over many-to-one links? How do you resolve non-projective links in the projected version? And can't we use syntax to align better than Giza did, anyway?
Deal with incompleteness in the alignments, unknown words (?)
Talking Points
Get the advantages of a synchronous grammar without being so darn rigid/expensive: conditional distribution, alignment, and decoding all taking syntax into account.
What is the generative process? How are the probabilities determined from parameters in a way that combines monolingual and cross-lingual preferences? How are these parameters trained?
Did it work? What are the most closely related ideas, and why is this one better?
Cross-Entropy Results
(Table: cross-entropy at 1k, 10k, and 100k sentence pairs for the configurations NULL, parent-child, child-parent, same node, all breakages, siblings, grandparent, and c-command.)
AER Results
(Table: alignment error rate at 1k, 10k, and 100k sentence pairs for the configurations parent-child, child-parent, same node, all breakages, siblings, grandparent, and c-command.)