1
1 Shuffling Non-Constituents. Jason Eisner, ACL SSST Workshop, June 2008; with David A. Smith and Roy Tromble. Themes: syntactically-flavored reordering search methods; a syntactically-flavored reordering model.
2
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 2 Starting point: Synchronous alignment. Synchronous grammars are very pretty. But does parallel text actually have parallel structure? Depends on what kind of parallel text: free translations? noisy translations? were the parsers trained on parallel annotation schemes? Depends on what kind of parallel structure: what kinds of divergences can your synchronous grammar formalism capture? E.g., wh-movement versus wh-in-situ.
3
Synchronous Tree Substitution Grammar. Two training trees, showing a free translation from French to English: “beaucoup d’enfants donnent un baiser à Sam” / “kids kiss Sam quite often”. [Figure: the two dependency trees; French words glossed as enfants (“kids”), d’ (“of”), beaucoup (“lots”), donnent (“give”), baiser (“kiss”), un (“a”), à (“to”).]
4
Synchronous Tree Substitution Grammar. Two training trees, showing a free translation from French to English (“beaucoup d’enfants donnent un baiser à Sam” / “kids kiss Sam quite often”). A possible alignment is shown in orange. [Figure: the aligned trees; the English adverbs quite and often align to null.]
5
Synchronous Tree Substitution Grammar. The same two training trees (“beaucoup d’enfants donnent un baiser à Sam” / “kids kiss Sam quite often”), now with a much worse alignment shown in orange.
6
Synchronous Tree Substitution Grammar. The same two training trees again, with the possible alignment shown in orange.
7
Grammar = Set of Elementary Trees. [Figure: the elementary tree pairs extracted from the aligned training trees, e.g., enfants (“kids”) ↔ kids, Sam ↔ Sam, donnent un baiser à ↔ kiss, beaucoup d’ ↔ null, null ↔ quite, null ↔ often.]
8
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 8 But many examples are harder. [Figure: aligned dependency trees, with a NULL node available, for the German sentence “Auf diese Frage habe ich leider keine Antwort bekommen” (glosses: Auf “to”, Frage “question”, diese “this”, bekommen “received”, ich “I”, habe “have”, leider “alas”, Antwort “answer”, keine “no”) and the English sentence “I did not unfortunately receive an answer to this question”.]
9
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 9 But many examples are harder. [Same German–English alignment figure.] Displaced modifier (negation).
10
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 10 But many examples are harder. [Same German–English alignment figure.] Displaced modifier (negation).
11
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 11 But many examples are harder. [Same German–English alignment figure.] Displaced argument (here, because of the projective parser).
12
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 12 But many examples are harder. [Same German–English alignment figure.] Head-swapping (here, due to different annotation conventions).
13
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 13 Free Translation. [Figure: aligned trees, with a NULL node available, for German “Tschernobyl könnte dann etwas später an die Reihe kommen” (glosses: Tschernobyl “Chernobyl”, könnte “could”, dann “then”, etwas “something”, später “later”, an “on”, die “the”, Reihe “queue”, kommen “come”) and English “Then we could deal with Chernobyl sometime later”.]
14
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 14 Free Translation. [Same German–English figure.] Probably not systematic (but the words are correctly aligned).
15
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 15 Free Translation. [Same German–English figure.] Erroneous parse.
16
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 16 What to do? Current practice: don’t try to model all systematic phenomena! Just use non-syntactic alignments (Giza++). Only care about the fragments that recur often: phrases or gappy phrases, sometimes even syntactic constituents (can favor these, e.g., Marton & Resnik 2008). Use these (gappy) phrases in a decoder, phrase-based or hierarchical.
17
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 17 What to do? Current practice: use non-syntactic alignments (Giza++); keep frequent phrases for a decoder. But could syntax give us better alignments? It would have to be “loose” syntax... Why do we want better alignments? 1. Throw away less of the parallel training data. 2. Help learn a smarter, syntactic reordering model (could help decoding: less reliance on the LM). 3. Some applications care about full alignments.
18
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 18 Quasi-synchronous grammar. How do we handle “loose” syntax? Translation story: generate target English by a monolingual grammar. Any grammar formalism is okay; pick a dependency grammar formalism for now. [Figure: a dependency parse of “I did not unfortunately receive an answer to this question”, with factors such as P(PRP | no previous left children of “did”) and P(I | did, PRP).] parsing: O(n³)
19
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 19 Quasi-synchronous grammar. How do we handle “loose” syntax? Translation story: generate target English by a monolingual grammar, but the probabilities are influenced by the source sentence. Each English node is aligned to some source node, and the grammar prefers to generate children aligned to nearby source nodes. [Figure: the dependency parse of “I did not unfortunately receive an answer to this question”.] parsing: O(n³)
20
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 20 QCFG Generative Story. [Figure: the observed German sentence (with NULL) and the English dependency parse, each English node aligned to a German node, e.g., “did” to habe and “I” to ich.] The factors now also condition on the aligned source word, e.g., P(PRP | no previous left children of “did”, habe) and P(I | did, PRP, ich), plus terms such as P(parent–child aligned) and P(breakage). parsing: O(m²n³)
21
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 21 What’s a “nearby node”? Given the parent’s alignment, where might the child be aligned? [Figure: the possible configurations, plus “none of the above”; one configuration is the synchronous grammar case.]
22
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 22 Quasi-synchronous grammar. How do we handle “loose” syntax? Translation story: generate target English by a monolingual grammar, but probabilities are influenced by the source sentence. Useful analogies: 1. Generative grammar with latent word senses. 2. MEMM: generate an n-gram tag sequence (target), but the probabilities are influenced by the word sequence (source).
23
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 23 Quasi-synchronous grammar. How do we handle “loose” syntax? Translation story: generate target English by a monolingual grammar, but probabilities are influenced by the source sentence. Useful analogies: 1. Generative grammar with latent word senses. 2. MEMM. 3. IBM Model 1: source nodes can be freely reused or unused. Future work: enforce 1-to-1 to allow good decoding (NP-hard to do exactly).
24
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 24 Some results: quasi-synchronous dependency grammar. Alignment (D. Smith & Eisner 2006): quasi-synchronous is much better than synchronous, and maybe also better than IBM Model 4. Question answering (Wang et al. 2007): align the question with a potential answer; mean average precision 43% (previous state of the art) → 48% (+ QG) → 60% (+ lexical features). Bootstrapping a parser for a new language (D. Smith & Eisner 2007 & ongoing): learn how parsed parallel text influences target dependencies, along with many other features (cf. co-training); unsupervised: German 30% → 69%, Spanish 26% → 65%.
25
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 25 Summary of part I. Current practice: use non-syntactic alignments (Giza++); some bits align nicely; use the frequent bits in a decoder. Suggestion: let syntax influence alignments. So far, loose syntax methods are like IBM Model 1; it is NP-hard to enforce 1-to-1 in any interesting model. Rest of talk: how to enforce 1-to-1 in interesting models? Can we do something smarter than beam search?
26
26 Shuffling Non-Constituents. Jason Eisner, ACL SSST Workshop, June 2008; with David A. Smith and Roy Tromble. Themes: a syntactically-flavored reordering model; syntactically-flavored reordering search methods.
27
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 27 Motivation MT is really easy! Just use a finite-state transducer! Phrases, morphology, the works!
28
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 28 Permutation search in MT. [Figure: the tagged French words NNP Marie, NEG ne, PRP m’, AUX a, NEG pas, VBN vu are permuted from their initial order (French) into a best order (French’), from which an easy transduction produces “Mary hasn’t seen me”.]
29
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 29 Motivation. MT is really easy! Just use a finite-state transducer! Phrases, morphology, the works! We just have to fix that pesky word order. Framing it this way lets us enforce 1-to-1 exactly at the permutation step. Deletion and fertility > 1 are still allowed in the subsequent transduction.
30
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 30 Often want to find an optimal permutation... Machine translation: reorder French to French-prime (Brown et al. 1992) so it’s easier to align or translate. MT eval: how much do you need to rearrange MT output so it scores well under an LM derived from reference translations? Discourse generation, e.g., multi-doc summarization: order the output sentences (Lapata 2003) so they flow nicely. Reconstruct the temporal order of events after information extraction. Learn rule ordering or constraint ranking for phonology? Multi-word anagrams that score well under an LM.
31
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 31 Other applications (there are many...). LOP (equivalently, maximum-weight acyclic subgraph): graph drawing, task scheduling, archaeology, aggregating ranked ballots, ... TSP: transportation scheduling (school bus, meals-on-wheels, service calls, ...), motion scheduling (drill head, space telescopes, ...), topology of a ring network, genome assembly.
32
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 32 Permutation search: the problem. How can we find this needle in the haystack of N! possible permutations? [Figure: an initial order and the best order according to some cost function.]
33
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 33 Traditional approach: beam search. Approximate the best path through a really big FSA: N! paths, one for each permutation, but only 2^N states. A state remembers what we’ve generated so far (but not in what order); an arc weight is, e.g., the cost of picking 5 next if we’ve seen {1,2,4} so far.
34
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 34 An alternative: local search (“hill climbing”). The SWAP neighborhood. [Figure: 1 2 3 4 5 6 (cost=22) and its adjacent-swap neighbors, e.g., 2 1 3 4 5 6 (cost=26), 1 3 2 4 5 6 (cost=20), 1 2 4 3 5 6 (cost=19), 1 2 3 5 4 6 (cost=25).]
35
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 35 An alternative: local search (“hill-climbing”). The SWAP neighborhood: move from 1 2 3 4 5 6 (cost=22) to the best neighbor, 1 2 4 3 5 6 (cost=19).
36
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 36 An alternative: local search (“hill-climbing”), like the “greedy decoder” of Germann et al. 2001. The SWAP neighborhood: cost=22 → cost=19 → cost=17 → cost=16 → ... Why are the costs always going down? Because we pick the best swap. How long does it take to pick the best swap? O(N) if you’re careful. How many swaps might you need to reach the answer? O(N²). What if you get stuck in a local min? Random restarts. (A minimal code sketch follows.)
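A minimal sketch of that loop, assuming only a black-box cost function over permutations; the function and variable names are illustrative, not from the paper, and a real implementation would compute each swap's Δcost incrementally instead of re-scoring the whole permutation.

    import random

    def best_swap_neighbor(perm, cost):
        """Return the lowest-cost neighbor under the SWAP neighborhood,
        or (None, current cost) if no adjacent swap improves."""
        best, best_c = None, cost(perm)
        for i in range(len(perm) - 1):
            cand = perm[:i] + [perm[i + 1], perm[i]] + perm[i + 2:]
            c = cost(cand)
            if c < best_c:
                best, best_c = cand, c
        return best, best_c

    def swap_hill_climb(perm, cost, restarts=10, seed=0):
        """Greedy hill climbing: repeatedly move to the best SWAP neighbor;
        escape local minima with random restarts."""
        rng = random.Random(seed)
        starts = [list(perm)] + [rng.sample(list(perm), len(perm)) for _ in range(restarts)]
        best, best_c = list(perm), cost(perm)
        for cur in starts:
            cur_c = cost(cur)
            while True:
                nxt, c = best_swap_neighbor(cur, cost)
                if nxt is None:              # local minimum under SWAP
                    break
                cur, cur_c = nxt, c
            if cur_c < best_c:
                best, best_c = list(cur), cur_c
        return best, best_c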
37
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 37 Larger neighborhood. [Figure: the same SWAP neighbors of 1 2 3 4 5 6 (cost=22) as before, setting up the comparison with larger neighborhoods.]
38
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 38 Larger neighborhood (well-known in the literature; reportedly works well): the INSERT neighborhood. [Figure: 1 2 3 4 5 6 (cost=22) → cost=17 by moving a single word.] Fewer local minima? Yes: 3 can move past 4 to get past 5. Graph diameter (max # moves needed)? O(N) rather than O(N²). How many neighbors, and how long to find the best neighbor? O(N²) rather than O(N).
39
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 39 Even larger neighborhood: the BLOCK neighborhood. [Figure: cost=22 → cost=14 by exchanging two adjacent blocks.] Fewer local minima? Yes: 2 can get past 4 5 without having to cross 3 or move 3 first. Graph diameter (max # moves needed)? Still O(N). How many neighbors, and how long to find the best neighbor? O(N³) rather than O(N) or O(N²).
40
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 40 Larger yet: via dynamic programming?? Fewer local minima? Graph diameter (max # moves needed)? Logarithmic. How many neighbors? Exponential. How long to find the best neighbor? Polynomial.
41
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 41 Unifying/generalizing the neighborhoods so far: exchange two adjacent blocks, of max widths w ≤ w’. A move is defined by an (i,j,k) triple. Runtime = # neighbors = O(ww’N): SWAP has w=1, w’=1 → O(N); INSERT has w=1, w’=N → O(N²); BLOCK has w=N, w’=N → O(N³). Everything in this talk can be generalized to other values of w, w’. (A sketch of this move family follows.)
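A sketch of the unified move family, with illustrative names: a move (i, j, k) exchanges the adjacent blocks perm[i:j] and perm[j:k], and the widths w ≤ w' control which triples are enumerated (SWAP, INSERT, and BLOCK are the three special cases named above).

    def block_exchange(perm, i, j, k):
        """Apply the (i, j, k) move: exchange adjacent blocks perm[i:j] and perm[j:k]."""
        return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]

    def block_moves(perm, w, w_prime):
        """Enumerate the O(w * w_prime * N) moves whose two block widths fit
        within w and w_prime (in either order).
        SWAP: w = w_prime = 1;  INSERT: w = 1, w_prime = N;  BLOCK: w = w_prime = N."""
        n, wmax = len(perm), max(w, w_prime)
        for i in range(n - 1):
            for j in range(i + 1, min(i + wmax, n - 1) + 1):
                for k in range(j + 1, min(j + wmax, n) + 1):
                    a, b = j - i, k - j          # the two block widths
                    if (a <= w and b <= w_prime) or (b <= w and a <= w_prime):
                        yield (i, j, k), block_exchange(perm, i, j, k)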
42
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 42 Very large-scale neighborhoods. What if we consider multiple simultaneous exchanges that are “independent”? The DYNASEARCH neighborhood (Potts & van de Velde 1995; Congram 2000). [Figure: a lattice over positions 1...6 whose arcs either keep a position or swap an adjacent pair; the cost of a swap arc is the Δcost of swapping that pair, here < 0.] The lowest-cost neighbor is the lowest-cost path.
43
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 43 Very large-scale neighborhoods. The lowest-cost neighbor is the lowest-cost path. Why would this be a good idea? Help get out of bad local minima? No; they’re still local minima. Help avoid getting into bad local minima? Yes: less greedy. [Example cost matrix B where DYNASEARCH takes two swaps at once (−20 + −20) while greedy SWAP would take the single −30 swap.]
44
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 44 Very large-scale neighborhoods. The lowest-cost neighbor is the lowest-cost path. Why would this be a good idea? Help get out of bad local minima? No; they’re still local minima. Help avoid getting into bad local minima? Yes: less greedy. More efficient? Yes! A shortest-path algorithm finds the best set of swaps in O(N) time, as fast as the best single swap. Up to N moves as fast as 1 move: no penalty for “parallelism”! It globally optimizes over exponentially many neighbors (paths). (A DP sketch follows.)
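A sketch of the shortest-path/Viterbi recurrence behind this, under the assumption that the Δcosts of non-overlapping adjacent swaps simply add; delta_swap and the other names are illustrative.

    def dynasearch_step(perm, delta_swap):
        """Apply the best *set* of non-overlapping adjacent swaps in one pass.
        delta_swap(perm, i) = change in total cost from swapping perm[i], perm[i+1]
        in the current permutation.  Runs in O(N), like finding the single best swap."""
        n = len(perm)
        best = [0.0] * (n + 1)     # best[i]: best total Delta-cost over the first i positions
        used = [False] * (n + 1)   # used[i]: did we swap positions (i-2, i-1)?
        for i in range(2, n + 1):
            best[i] = best[i - 1]
            with_swap = best[i - 2] + delta_swap(perm, i - 2)
            if with_swap < best[i]:
                best[i], used[i] = with_swap, True
        out, i = list(perm), n     # backtrack and apply the chosen swaps
        while i >= 2:
            if used[i]:
                out[i - 2], out[i - 1] = out[i - 1], out[i - 2]
                i -= 2
            else:
                i -= 1
        return out, best[n]

For pure LOP costs the additivity assumption holds exactly, since delta_swap(perm, i) is just B[perm[i+1]][perm[i]] − B[perm[i]][perm[i+1]].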
45
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 45 Can we extend this idea (up to N moves in parallel by dynamic programming) to neighborhoods beyond SWAP? Exchange two adjacent blocks, of max widths w ≤ w’; a move is defined by an (i,j,k) triple; runtime = # neighbors = O(ww’N): SWAP O(N), INSERT O(N²), BLOCK O(N³). Yes. The asymptotic runtime is always unchanged.
46
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 46 Let’s define each neighbor by a “colored tree”. Just like ITG! [Figure: a binary tree over the current permutation; a colored node means “swap children”.]
47
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 47 Let’s define each neighbor by a “colored tree”. Just like ITG! [Figure: the same tree; a colored node means “swap children”.]
48
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 48 Let’s define each neighbor by a “colored tree”. Just like ITG! [Figure: a tree over the current permutation with several colored nodes.] This is like the BLOCK neighborhood, but with multiple block exchanges, which may be nested.
49
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 49 If that was the optimal neighbor... now look for its optimal neighbor (a new tree!). [Figure: the permutation produced by the previous step.]
50
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 50 If that was the optimal neighbor... now look for its optimal neighbor (a new tree!). [Figure: a new tree over the new permutation.]
51
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 51 If that was the optimal neighbor... now look for its optimal neighbor... and repeat until we reach a local optimum. Each tree defines a neighbor. At each step, optimize over all possible trees by dynamic programming (CKY parsing); use your favorite parsing speedups (pruning, best-first, ...). (A DP sketch for LOP costs follows.)
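A sketch of one such step for pure LOP costs (WFSA costs need the state-indexed nonterminals discussed later). It relies on the fact that a pair's relative order is decided by the orientation of its lowest common ancestor in the tree, so each cross-block sum can be charged to a single node. Names are illustrative, and the cross sums are recomputed naively here, which adds an O(N²) factor that the paper's incremental trick removes.

    def best_tree_neighbor(perm, B):
        """One very-large-scale step: the best permutation reachable from `perm`
        by a single colored tree, under LOP costs B[u][v] (cost of u before v)."""
        n = len(perm)
        best = [[0.0] * (n + 1) for _ in range(n + 1)]   # best internal cost of span [i, j)
        back = [[None] * (n + 1) for _ in range(n + 1)]
        for width in range(2, n + 1):
            for i in range(n - width + 1):
                j = i + width
                best[i][j] = float("inf")
                for k in range(i + 1, j):
                    straight = sum(B[perm[u]][perm[v]] for u in range(i, k) for v in range(k, j))
                    inverted = sum(B[perm[v]][perm[u]] for u in range(i, k) for v in range(k, j))
                    for swapped, cross in ((False, straight), (True, inverted)):
                        c = best[i][k] + best[k][j] + cross
                        if c < best[i][j]:
                            best[i][j], back[i][j] = c, (k, swapped)
        def build(i, j):              # read the reordered span off the back-pointers
            if j - i == 1:
                return [perm[i]]
            k, swapped = back[i][j]
            left, right = build(i, k), build(k, j)
            return right + left if swapped else left + right
        return build(0, n), best[0][n]

Iterating perm = best_tree_neighbor(perm, B)[0] until the cost stops dropping is the local-search loop described on this slide.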
52
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 52 Very-large-scale versions of SWAP, INSERT, and BLOCK, all by the algorithm we just saw. A move exchanges two adjacent blocks, of max widths w ≤ w’, and is defined by an (i,j,k) triple. The runtime of the algorithm we just saw was O(N³) because we considered O(N³) distinct (i,j,k) triples. More generally, restrict to only the O(ww’N) triples of interest to define a smaller neighborhood with runtime O(ww’N). (Yes, the dynamic programming recurrences go through.)
53
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 53 How many steps to get from here (the initial order) to there (the best order)? One twisted-tree step? No: as you probably know, turning 3 1 4 2 into 1 2 3 4 in a single step is impossible.
54
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 54 Can you get to the answer in one step? (German–English, Giza++ alignment.) Often (yay, big neighborhood), but not always (yay, local search).
55
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 55 How many steps to the answer in the worst case? (What is the diameter of the search space?) Claim: only log₂ N steps at worst (if you know where to step). Let’s sketch the proof!
56
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 56 Quicksort anything into, e.g., 1 2 3 4 5 6 7 8. [Figure: the first step uses a right-branching tree.]
57
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 57 Quicksort anything into, e.g., 1 2 3 4 5 6 7 8, via a sequence of right-branching trees. Only log₂ N steps to get to 1 2 3 4 5 6 7 8 ... or to anywhere!
58
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 58 Defining “best order”: how can we find this needle in the haystack of N! possible permutations? What class of cost functions can we handle efficiently? How fast can we compute a subtree’s cost from its child subtrees?
59
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 59 Defining “best order”: what class of cost functions? [Figure: a 6×6 cost matrix A.] The “Traveling Salesperson Problem” (TSP): the cost of an order is the sum of bigram costs along it, e.g., a₁₄ + a₄₂ + a₂₅ + a₅₆ + a₆₃ + a₃₁.
60
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 60 Defining “best order”: what class of cost functions? [Figure: a 6×6 cost matrix B.] The “Linear Ordering Problem” (LOP): b₂₆ = the cost of 2 preceding 6 anywhere in the order (any order will incur either b₂₆ or b₆₂); add up the n(n−1)/2 such pairwise costs. (A scoring sketch follows.)
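For concreteness, a hedged sketch of the two scoring functions just defined, with A and B as plain Python matrices indexed by item:

    def tsp_cost(order, A):
        """TSP-style score: sum the bigram costs A[u][v] over consecutive items.
        (Add A[order[-1]][order[0]] if the tour is treated as cyclic.)"""
        return sum(A[u][v] for u, v in zip(order, order[1:]))

    def lop_cost(order, B):
        """LOP score: for each of the n(n-1)/2 pairs, pay B[u][v] if u precedes v."""
        return sum(B[order[i]][order[j]]
                   for i in range(len(order))
                   for j in range(i + 1, len(order)))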
61
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 61 Defining “best order”: what class of cost functions? TSP and LOP are both NP-complete; in fact, they are believed to be inapproximable, i.e., hard even to achieve C × the optimal cost (for any C ≥ 1). Practical approaches: correct answer, typically fast (branch-and-bound, ILP, ...); fast answer, typically close to correct (beam search, this talk, ...).
62
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 62 Moving small blocks helps on LOP (experiment on LOLIB collection of 250-“word” problems from economics)
63
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 63 Defining “best order”: what class of cost functions? Cost of an order: 1. Does my favorite WFSA like this string of numbers? (generalizes TSP) 2. Is the non-local pair order OK? (e.g., 4 before 3?) (generalizes LOP) 3. Is the non-local triple order OK? (e.g., ...1...2...3?) Can add these all up...
64
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 64 Costs are derived from source-sentence features. [Figure: the initial French order, with POS tags NNP Marie, NEG ne, PRP m’, AUX a, NEG pas, VBN vu, and the cost matrices A and B derived from it.] For example, ne would like to be brought adjacent to the next NEG word.
65
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 65 Costs are derived from source-sentence features. [Same figure; one matrix entry, 75, is the sum of weighted features:] 50: a verb (e.g., vu) shouldn’t precede its subject (e.g., Marie); +27: words at a distance of 5 shouldn’t swap order; −2: words with PRP between them ought to swap; ... Can also include phrase boundary symbols in the input!
66
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 66 Costs are derived from source-sentence features. [Same figure with matrices A and B.] FSA costs: a distortion model and a language model; the LM looks ahead to the next step (does this order permit a good finite-state translation into good English?).
67
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 67 The dynamic program must pick the tree that leads to the lowest-cost permutation. Cost of this order: 1. Does my favorite WFSA like it as a string?
68
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 68 Scoring with a weighted FSA. [Figure: a WFSA that implements TSP scoring for N=3.] After you read 1 you’re in state 1; after you read 2 you’re in state 2; after you read 3 you’re in state 3; and this state determines the cost of the next symbol you read. We’ll handle a WFSA with Q states by using a fancier grammar, with nonterminals. (Now the runtime goes up to O(N³Q³)...)
69
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 69 Including WFSA costs via nonterminals. [Figure: each word of the permutation is covered by a preterminal naming a WFSA arc.] A possible preterminal for word 2 is an arc in A that’s labeled with 2; that preterminal (e.g., the arc from state 4 to state 2) rewrites as word 2 with a cost equal to the arc’s cost.
70
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 70 Including WFSA costs via nonterminals. [Figure: larger spans carry nonterminals named by (entry state, exit state) pairs such as 6→3 or I→5, where I is the initial state.] A constituent’s total cost is the total cost of the best path between its two states (e.g., the best 6→3 path over its reordered yield); the root’s cost is the cost of the new permutation. (A small sketch of span combination follows.)
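A sketch of how such state-decorated spans can be represented and combined, assuming the WFSA is given as a map from words to (from-state, to-state, cost) arcs; this is illustrative code, not the paper's implementation, and a full parser would also fold in the pairwise reordering costs when two spans meet.

    def leaf_span(word, wfsa_arcs):
        """Base case: one word, scored by every WFSA arc labeled with it.
        wfsa_arcs[word] is a list of (from_state, to_state, cost) triples."""
        span = {}
        for p, q, c in wfsa_arcs.get(word, []):
            if c < span.get((p, q), float("inf")):
                span[(p, q)] = c
        return span

    def combine_spans(left, right, swapped=False):
        """Combine two adjacent spans.  Each span maps (entry_state, exit_state)
        to the best cost of some reordering of that span, read along a WFSA path
        between those states.  If this node swaps its children, the right span's
        string is read first."""
        first, second = (right, left) if swapped else (left, right)
        out = {}
        for (p, q1), c1 in first.items():
            for (q2, r), c2 in second.items():
                if q1 != q2:
                    continue               # the two paths must meet at a shared state
                if c1 + c2 < out.get((p, r), float("inf")):
                    out[(p, r)] = c1 + c2
        return out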
71
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 71 The dynamic program must pick the tree that leads to the lowest-cost permutation. Cost of this order: 1. Does my favorite WFSA like it as a string? 2. Is the non-local pair order OK? (e.g., 4 before 3?)
72
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 72 Incorporating the pairwise ordering costs. [Figure: a move that puts {5,6,7} before {1,2,3,4}.] So this hypothesis must add the costs 5<1, 5<2, 5<3, 5<4, 6<1, 6<2, 6<3, 6<4, 7<1, 7<2, 7<3, 7<4. Uh-oh! So now it takes O(N²) time to combine two subtrees, instead of O(1) time? Nope: dynamic programming to the rescue again!
73
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 73 Computing the LOP cost of a block move. This move puts {5,6,7} before {1,2,3,4}, so it seems we have to add O(N²) pairwise costs just to consider this single neighbor. Instead, reuse work from other, “narrower” block moves: the needed cross-block sum is obtained by an inclusion-exclusion combination of sums already computed at earlier steps of parsing, so the new cost is computed in O(1). (A prefix-sum sketch follows.)
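One concrete way to realize this reuse (a sketch under the simplest formulation, not necessarily the paper's exact bookkeeping) is a 2D cumulative-sum table over the current permutation, from which any cross-block sum, and hence the Δcost of any (i, j, k) block exchange, is an O(1) inclusion-exclusion query:

    def lop_prefix_sums(perm, B):
        """cum[i][j] = sum of B[perm[u]][perm[v]] over u < i, v < j (built once in O(N^2))."""
        n = len(perm)
        cum = [[0.0] * (n + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                cum[i][j] = (B[perm[i - 1]][perm[j - 1]]
                             + cum[i - 1][j] + cum[i][j - 1] - cum[i - 1][j - 1])
        return cum

    def block_before_block(cum, a, b, c, d):
        """Total pairwise cost of every position in [a, b) preceding every position in [c, d)."""
        return cum[b][d] - cum[a][d] - cum[b][c] + cum[a][c]

    def block_exchange_delta(cum, i, j, k):
        """Delta LOP cost of exchanging the adjacent blocks [i, j) and [j, k)."""
        return block_before_block(cum, j, k, i, j) - block_before_block(cum, i, j, j, k)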
74
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 74 Incorporating 3-way ordering costs: see the earlier paper (Eisner & Tromble 2006). A little tricky, but it comes “for free” if you’re willing to accept a certain restriction on these costs; it is more expensive without that restriction, but possible.
75
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 75 Another option: Markov chain Monte Carlo. Random walk in the space of permutations: interpret a permutation’s cost as a (negative) log-probability, and sample a permutation from the neighborhood instead of always picking the most probable. Why? Simulated annealing might beat greedy-with-random-restarts, and when learning the parameters of the distribution, we can use sampling to compute the feature expectations.
76
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 76 Another option: Markov chain Monte Carlo. Random walk in the space of permutations: interpret a permutation’s cost as a (negative) log-probability, and sample a permutation from the neighborhood instead of always picking the most probable. How? Pitfall: sampling a tree is not the same as sampling a permutation. Spurious ambiguity: some permutations have many trees. Solution: exclude some trees, leaving one per permutation. A normal form has long been known for colored trees; for restricted colored trees (which limit the size of the blocks to swap), we have devised a more complicated normal form.
77
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 77 Sampling from permutation space: p(π) = exp(−cost(π)) / Z. Why is this useful? To train the weights that determine the cost matrix, since −Z′/Z = Σ_π p(π)·cost(π)′ = E_p[cost(π)′] can be estimated by sampling from p, and to compute expectations of other quantities (e.g., how often does 2 precede 5?). It also gives a less greedy heuristic for finding the lowest-cost permutation, which is the mode of p, i.e., the highest-probability permutation: take a sample from p; if most of p’s probability mass is on the mode, you have a good chance of getting the mode. If not, boost the odds: sample instead from p_β for β > 1, defined as p_β(π) = exp(−β·cost(π)) / Z_β (so p_2(π) is proportional to p(π)²). As β → ∞, the chance of getting the mode → 1, but MCMC sampling gets slower and slower (no free lunch!). Simulated annealing: gradually increase β during MCMC sampling. (A sampler sketch follows.)
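A minimal Metropolis-style sketch over the plain SWAP proposal, with β annealed upward as described. Illustrative only: the talk's samplers walk over the much larger tree-shaped neighborhoods and need normal-form trees to avoid the spurious-ambiguity pitfall above.

    import math, random

    def annealed_sampler(perm, cost, steps=10000, beta0=0.1, beta1=5.0, seed=0):
        """Random walk on permutations: propose an adjacent swap and accept it with
        probability min(1, exp(-beta * Delta-cost)); beta rises from beta0 to beta1
        over the run (simulated annealing).  Returns the best state seen."""
        rng = random.Random(seed)
        cur, cur_c = list(perm), cost(perm)
        best, best_c = list(cur), cur_c
        for t in range(steps):
            beta = beta0 + (beta1 - beta0) * t / max(1, steps - 1)
            i = rng.randrange(len(cur) - 1)
            cand = cur[:i] + [cur[i + 1], cur[i]] + cur[i + 2:]
            cand_c = cost(cand)
            if cand_c <= cur_c or rng.random() < math.exp(-beta * (cand_c - cur_c)):
                cur, cur_c = cand, cand_c
                if cur_c < best_c:
                    best, best_c = list(cur), cur_c
        return best, best_c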
78
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 78 Learning the costs. Where do these costs come from? [Figure: the cost matrices A and B.] If we have some examples on which we know the true permutation, we could try to learn them.
79
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 79 Learning the costs. Where do these costs come from? If we have some examples on which we know the true permutation, we could try to learn them. More precisely, try to learn the weights θ (the knowledge that’s reused across examples), e.g.: 50: a verb (e.g., vu) shouldn’t precede its subject (e.g., Marie); 27: words at a distance of 5 shouldn’t swap order; −2: words with PRP between them ought to swap; ...
80
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 80 Learning the costs. Typical learning approach (details omitted): tune the weights θ to maximize the probability of the correct answer π*. Probability??? We were just trying to minimize the cost. But there’s a standard way to convert costs to probabilities: for every permutation π, define p(π) = exp(−cost(π)) / Z, where the “partition function” Z = Σ_π exp(−cost(π)), so that Σ_π p(π) = 1. Search is now argmax_π p(π). Learning is now argmax_θ log p(π*) (actually the log probability: a convex optimization with the same answer); increase log p(π*) by gradient ascent.
81
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 81 Learning the costs. Typical learning approach (details omitted): tune the weights θ to maximize the probability of the correct answer π*, where p(π) = exp(−cost(π)) / Z and Z = Σ_π exp(−cost(π)). Learning: increase log p(π*) by gradient ascent, i.e., find the gradient of log p(π*) with respect to the weights θ we’re trying to learn. Easy, since cost(π*) is typically just a sum of many weights; slow, since it involves a sum over all permutations!
82
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 82 Learning the costs. What is this gradient anyway? (log p(π*))′ = (−cost(π*) − log Z)′ = −cost(π*)′ − Z′/Z. Now Z′ = Σ_π exp(−cost(π)) · (−cost(π)′), so −Z′/Z = Σ_π p(π) · cost(π)′ = E_p[cost(π)′]. Aha! Estimate this expectation by sampling from p (more about this later). (A gradient-estimate sketch follows.)
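If cost(π) = θ · f(π) for a feature vector f, then cost(π)′ = f(π) and the gradient is −f(π*) + E_p[f(π)]. A small sketch of the sampled estimate (names illustrative; `samples` would come from an MCMC sampler like the one above):

    from collections import defaultdict

    def loglik_gradient(pi_star, samples, features):
        """Estimate grad log p(pi*) = -f(pi*) + E_p[f(pi)], where features(pi)
        returns a dict mapping feature names to values."""
        grad = defaultdict(float)
        for name, value in features(pi_star).items():
            grad[name] -= value
        for pi in samples:                      # Monte Carlo estimate of E_p[f]
            for name, value in features(pi).items():
                grad[name] += value / len(samples)
        return dict(grad)

A gradient-ascent step would then be theta[name] += learning_rate * grad[name] for each feature.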
83
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 83 Experimenting with training LOP params (LOP is quite fast: O(n³) with no grammar constant). [Example: the POS-tagged German sentence “Das kann ich so aus dem Stand nicht sagen .” with tags PDS VMFIN PPER ADV APPR ART NN PTKNEG VVINF $., illustrating the matrix entry B[7,9].]
84
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 84 LOP feature templates
85
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 85 LOP feature templates. Only LOP features so far, and they’re unnecessarily simple (they don’t examine syntactic constituency); the input sequence is only words (not interspersed with syntactic brackets).
86
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 86 Learning LOP costs for MT. Define German’ to be German in English word order. To get German’ for the training data, use Giza++ to align all German positions to English positions (disallow NULL). [Pipeline: German → (LOP reordering) → German’ → (MOSES) → English; baseline: German → (MOSES) → English.] (Interesting, if odd, to try to reorder with only the LOP costs.)
87
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 87 Learning LOP costs for MT. Easy first try: Naïve Bayes. Treat each feature in θ as independent; count and normalize over the training data. No real improvement over the baseline. (Interesting, if odd, to try to reorder with only the LOP costs.)
88
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 88 Learning LOP costs for MT. Easy second try: Perceptron. [Figure: the update pushes the model toward the gold standard; the gap between the global optimum and the local optimum found by search is search error, and the gap between the gold standard and the global optimum is model error.] Note: search error can be beneficial, e.g., just take 1 step from the identity permutation. (An update sketch follows.)
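A sketch of the structured-perceptron update in this setting (illustrative names; the "predicted" permutation is whatever the possibly-greedy search returned). Since we minimize cost(π) = θ · f(π), the gold permutation's features are subtracted and the prediction's are added.

    def perceptron_update(theta, features, gold, predicted, rate=1.0):
        """Lower the cost of the gold permutation and raise the cost of the
        model's current output; no change if the search already found the gold."""
        if predicted == gold:
            return theta
        for name, value in features(gold).items():
            theta[name] = theta.get(name, 0.0) - rate * value
        for name, value in features(predicted).items():
            theta[name] = theta.get(name, 0.0) + rate * value
        return theta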
89
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 89 [Figure: search error vs. model error. Warning: different data.]
90
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 90 Benefit from reordering.
Learning method | BLEU vs. German′ | BLEU vs. English
No reordering | 49.65 | 25.55
Naïve Bayes (POS) | 49.21 |
Naïve Bayes (POS+lexical) | 49.75 |
Perceptron (POS) | 50.05 | 25.92
Perceptron (POS+lexical) | 51.30 | 26.34
Obviously, not yet unscrambling German: need more features.
91
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 91 Contrastive estimation (Smith & Eisner 2005): maximize the probability of the desired permutation relative to its ITG neighborhood. This requires summing over all permutations in a neighborhood, so we must use normal-form trees here. Train by stochastic gradient descent. [Figure: the gold standard and its 1-step very-large-scale neighborhood.] Alternatively, work back from the gold standard.
92
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 92 k-best MIRA in the neighborhood: make the gold standard beat its local competitors, and beat the bad ones by a bigger margin. Good = close to gold in swap distance? Good = close to gold using BLEU? Good = translates into English that’s close to the reference? [Figure: the gold standard, its 1-step very-large-scale neighborhood, and the current winners in the neighborhood.] Alternatively, work back from the gold standard.
93
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 93 Alternatively, train each iterate: update the model so that the oracle in the neighborhood of π(0) beats the model’s best in the neighborhood of π(0). Or could do a k-best MIRA version of this, too; even use a loss measure based on lookahead to π(n).
94
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 94 Open Questions Search: Is there practical benefit to using larger neighborhoods (speed, quality of solution) for hill-climbing? For MCMC? Search: Are the large-scale versions worth the constant-factor runtime penalty? At some sizes? Learning: How should we learn the weights if we plan to use them in greedy search? Learning: Can we tune adaptive search methods that vary the neighborhood and the temperature dynamically from step to step? Theoretical: Can it be determined in polytime whether two permutations have a common neighbor (using the full colored tree neighborhood)? Theoretical: Mixing time of MCMC with these neighborhoods? Algorithmic: Is there a master theorem for normal forms?
95
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 95 Summary of part II Local search is fun and easy Popular elsewhere in AI Closely related to MCMC sampling Probably useful for translation Maybe other NP-hard problems too Can efficiently use huge local neighborhoods Algorithms are closely related to parsing and FSMs Our community knows that stuff better than anyone!