Dependency Parsing by Belief Propagation
David Smith (JHU → UMass Amherst) and Jason Eisner (JHU)
Great ideas in NLP: Log-linear models
(Berger, della Pietra & della Pietra 1996; Darroch & Ratcliff 1972)
In the beginning, we used generative models:
    p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * ...
- Each choice depends on a limited part of the history.
- But which dependencies to allow? What if they're all worthwhile?
    p(D | A,B,C)?  ... or p(D | A,B) * p(C | A,B,D)?
Great ideas in NLP: Log-linear models
(Berger, della Pietra & della Pietra 1996; Darroch & Ratcliff 1972)
In the beginning, we used generative models:
    p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * ...
- Which dependencies to allow? (given limited training data)
Solution: Log-linear (max-entropy) modeling
    (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * ...
- Features may interact in arbitrary ways: throw them all in!
- Iterative scaling keeps adjusting the feature weights until the model agrees with the training data.
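To make the factor-product idea concrete, here is a toy sketch (not from the talk): an unnormalized score is a product of local factor values, and the normalizer Z sums that product over every joint assignment. The variables, factor tables, and values below are invented for illustration.

```python
from itertools import product

# Toy log-linear (factor-product) model over three binary variables A, B, C.
# Each factor scores some subset of the variables; the tables are made up.
factors = [
    (("A",),     {(0,): 1.0, (1,): 2.0}),                 # phi(A)
    (("A", "B"), {(0, 0): 1.0, (0, 1): 0.5,
                  (1, 0): 0.5, (1, 1): 3.0}),             # phi(A, B)
    (("B", "C"), {(0, 0): 2.0, (0, 1): 1.0,
                  (1, 0): 1.0, (1, 1): 2.0}),             # phi(B, C)
]
variables = ["A", "B", "C"]

def unnormalized(assignment):
    """Product of all factor values on one joint assignment."""
    score = 1.0
    for vars_, table in factors:
        score *= table[tuple(assignment[v] for v in vars_)]
    return score

# Normalizer Z: brute-force sum over all assignments (fine for a toy model).
Z = sum(unnormalized(dict(zip(variables, vals)))
        for vals in product([0, 1], repeat=len(variables)))

p = unnormalized({"A": 1, "B": 1, "C": 1}) / Z
print(f"Z = {Z:.2f}, p(A=1, B=1, C=1) = {p:.3f}")
```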
How about structured outputs?
- Log-linear models are great for n-way classification.
- Also good for predicting sequences ("... find preferred tags ..."),
  but to allow fast dynamic programming, only use n-gram features.
- Also good for dependency parsing ("... find preferred links ..."),
  but to allow fast dynamic programming or MST parsing, only use single-edge features.
How about structured outputs?
With arbitrary features, runtime blows up.
With only single-edge features (to allow fast dynamic programming or MST parsing):
- Projective parsing: O(n^3) by dynamic programming
- Non-projective parsing: O(n^2) by minimum spanning tree
With richer features:
- O(n^4): grandparents
- O(n^5): grandparent + sibling bigrams
- O(n^3 g^6): POS trigrams
- O(2^n): sibling pairs (non-adjacent)
- NP-hard: any of the above features, soft penalties for crossing links, and pretty much anything else!
This paper in a nutshell: Let's reclaim our freedom (again!)
- Output probability is a product of local factors (log-linear model):
    (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * ...
- Throw in any factors we want! (certain global factors are OK too)
- Let local factors negotiate via "belief propagation":
  links (and tags) reinforce or suppress one another.
- Each global factor can be handled fast via some traditional parsing algorithm
  (e.g., inside-outside).
- Each iteration takes total time O(n^2) or O(n^3).
- Converges to a pretty good (but approximate) global parse.
This paper in a nutshell: Let's reclaim our freedom (again!)

    Training with many features                   | Decoding with many features (new!)
    ----------------------------------------------+----------------------------------------------
    Iterative scaling                             | Belief propagation
    Each weight in turn is influenced by others   | Each variable in turn is influenced by others
    Iterate to achieve globally optimal weights   | Iterate to achieve locally consistent beliefs
    To train a distribution over trees, use       | To decode a distribution over trees, use
    dynamic programming to compute normalizer Z   | dynamic programming to compute the messages
Local factors in a graphical model
First, a familiar example: a Conditional Random Field (CRF) for POS tagging.
- Observed input sentence (shaded): "... find preferred tags ..."
- A possible tagging (i.e., an assignment to the remaining variables): v v v
Local factors in a graphical model
First, a familiar example: a Conditional Random Field (CRF) for POS tagging.
- Observed input sentence (shaded): "... find preferred tags ..."
- Another possible tagging: v a n
Local factors in a graphical model
First, a familiar example: a Conditional Random Field (CRF) for POS tagging.
A "binary" factor measures the compatibility of two adjacent tags
(the model reuses the same parameters at each position):

         v  n  a
    v    0  2  1
    n    2  1  0
    a    0  3  1
Local factors in a graphical model
First, a familiar example: a Conditional Random Field (CRF) for POS tagging.
A "unary" factor evaluates this tag; its values depend on the corresponding word
(this word can't be an adjective, so a = 0):

    v  0.2
    n  0.2
    a  0
Local factors in a graphical model
First, a familiar example: a Conditional Random Field (CRF) for POS tagging.
A "unary" factor evaluates this tag; its values depend on the corresponding word
(and could be made to depend on the entire observed sentence):

    v  0.2
    n  0.2
    a  0
Local factors in a graphical model
First, a familiar example: a Conditional Random Field (CRF) for POS tagging.
There is a different unary factor at each position:

    find:       v 0.3   n 0.02   a 0
    preferred:  v 0.3   n 0      a 0.1
    tags:       v 0.2   n 0.2    a 0
Local factors in a graphical model
First, a familiar example: a Conditional Random Field (CRF) for POS tagging.
p(v a n) is proportional to the product of all factors' values on the assignment v a n:

    Unary factors:    find:       v 0.3   n 0.02   a 0
                      preferred:  v 0.3   n 0      a 0.1
                      tags:       v 0.2   n 0.2    a 0
    Binary factor between each pair of adjacent tags (rows = left tag, columns = right tag):
             v  n  a
        v    0  2  1
        n    2  1  0
        a    0  3  1
Local factors in a graphical model
First, a familiar example: a Conditional Random Field (CRF) for POS tagging.
p(v a n) is proportional to the product of all factors' values on the assignment v a n:
    = ... * 1 * 3 * 0.3 * 0.1 * 0.2 * ...
(binary(v, a) * binary(a, n) * unary values for find = v, preferred = a, tags = n;
the "..." are the factors touching the surrounding context)
Local factors in a graphical model
So far, a familiar example: a CRF for POS tagging.
Now let's do dependency parsing!
- O(n^2) boolean variables, one for each possible link in "... find preferred links ..."
Local factors in a graphical model
Now let's do dependency parsing! O(n^2) boolean variables for the possible links.
- A possible parse, encoded as an assignment to these variables: t f f t f f
Local factors in a graphical model
Now let's do dependency parsing! O(n^2) boolean variables for the possible links.
- Another possible parse: f f t f t f
Local factors in a graphical model
Now let's do dependency parsing! O(n^2) boolean variables for the possible links.
- An illegal parse: the chosen links contain a cycle.
Local factors in a graphical model
Now let's do dependency parsing! O(n^2) boolean variables for the possible links.
- Another illegal parse: some word ends up with multiple parents.
Local factors for parsing
So what factors shall we multiply to define the parse probability?
- Unary factors to evaluate each link in isolation, e.g.
    t 2, f 1                                       (a good link given this input sentence)
    t 1, f 2    t 1, f 2    t 1, f 6    t 1, f 3    t 1, f 8    (some other links aren't as good)
- As before, the goodness of a link can depend on the entire observed input context.
- But what if the best assignment isn't a tree??
Global factors for parsing
So what factors shall we multiply to define the parse probability?
- Unary factors to evaluate each link in isolation
- A global TREE factor to require that the links form a legal tree:
  a "hard constraint" whose value is either 0 or 1.

    f f f f f f    0
    f f f f f t    0
    f f f f t f    0
    ...
    f f t f f t    1
    ...
    t t t t t t    0
Global factors for parsing
So what factors shall we multiply to define the parse probability?
- Unary factors to evaluate each link in isolation
- A global TREE factor to require that the links form a legal tree:
  a "hard constraint" whose value is either 0 or 1
  (optionally, require the tree to be projective: no crossing links).
  For 6 link variables the factor has 64 entries (0/1);
  the assignment t f f t f f gets value 1: "we're legal!"
- So far, this is equivalent to edge-factored parsing (McDonald et al. 2005).
- Note: McDonald et al. (2005) don't loop through this table to consider exponentially
  many trees one at a time. They use combinatorial algorithms; so should we!
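As an illustration (not code from the paper), the hard TREE constraint is just an indicator function on the vector of link variables: 1 if the links set to "t" form a legal dependency tree (every word has exactly one parent and no cycles back to itself), else 0. The helper name and the toy link sets below are invented.

```python
def tree_factor(n, links):
    """Indicator sketch of the TREE constraint.
    n     -- number of words (node 0 is the artificial root).
    links -- set of (head, child) pairs whose boolean variable is 't' (True).
    Returns 1.0 if the links form a legal dependency tree, else 0.0."""
    parents = {}
    for head, child in links:
        if child in parents:                        # multiple parents: illegal
            return 0.0
        parents[child] = head
    if set(parents) != set(range(1, n + 1)):        # every word needs exactly one parent
        return 0.0
    for child in range(1, n + 1):                   # follow heads; a cycle never reaches the root
        seen, node = set(), child
        while node != 0:
            if node in seen:
                return 0.0                          # cycle: illegal
            seen.add(node)
            node = parents[node]
    return 1.0

# The 'f f t f f t'-style assignments on the slide are just such sets of true links.
print(tree_factor(3, {(0, 1), (1, 2), (1, 3)}))     # 1.0: a legal tree
print(tree_factor(3, {(1, 2), (2, 3), (3, 1)}))     # 0.0: contains a cycle
```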
Local factors for parsing
So what factors shall we multiply to define the parse probability?
- Unary factors to evaluate each link in isolation
- A global TREE factor to require that the links form a legal tree (a hard 0/1 constraint)
- Second-order effects: factors on 2 variables, e.g. grandparent
  (value 3 when both links are present, 1 otherwise):

          other link = f   other link = t
    f           1                1
    t           1                3
Local factors for parsing
So what factors shall we multiply to define the parse probability?
- Unary factors to evaluate each link in isolation
- A global TREE factor to require that the links form a legal tree (a hard 0/1 constraint)
- Second-order effects: factors on 2 variables, e.g. grandparent, no-cross
  (example sentence "... find preferred links ... by ..."; two crossing links are penalized):

          other link = f   other link = t
    f           1                1
    t           1                0.2
Local factors for parsing
So what factors shall we multiply to define the parse probability?
- Unary factors to evaluate each link in isolation
- A global TREE factor to require that the links form a legal tree (a hard 0/1 constraint)
- Second-order effects: factors on 2 variables:
  grandparent, no-cross, siblings, hidden POS tags, subcategorization, ...
Good to have lots of features, but ...
- Nice model. Shame about the NP-hardness.
- Can we approximate?
- Machine learning (a.k.a. statistical physics) to the rescue!
Great ideas in ML: Message passing (MacKay 2003)
Count the soldiers: in a single-file line, each soldier passes a count along.
- Messages going one way: "1 before you", "2 before you", ..., "5 before you"
- Messages going the other way: "1 behind you", "2 behind you", ..., "5 behind you"
- Each soldier adds "there's 1 of me".
Great ideas in ML: Message passing (MacKay 2003)
Count the soldiers: a soldier who hears "2 before you" and "3 behind you", and adds
"there's 1 of me", only sees his incoming messages (2, 1, 3), yet concludes:
Belief: there must be 2 + 1 + 3 = 6 of us.
Great ideas in ML: Message passing (MacKay 2003)
Count the soldiers: a different soldier hears "1 before you" and "4 behind you",
adds "there's 1 of me" (incoming messages 1, 4, 1), and reaches the same conclusion:
Belief: there must be 1 + 1 + 4 = 6 of us.
Great ideas in ML: Message passing (MacKay 2003)
Now the soldiers stand in a tree: each soldier receives reports from all branches of the tree.
A soldier who hears "7 here" and "3 here" from two branches, plus "1 of me",
passes "11 here (= 7 + 3 + 1)" down the remaining branch.
Great ideas in ML: Message passing (MacKay 2003)
Each soldier receives reports from all branches of the tree and adds 1 for himself
before passing the total down the remaining branch (e.g. "3 here" in, "6 here" out).
Great ideas in ML: Message passing (MacKay 2003)
Each soldier receives reports from all branches of the tree:
"7 here" and "3 here" come in, so "11 here (= 7 + 3 + 1)" goes out.
Great ideas in ML: Message passing (MacKay 2003)
Each soldier receives reports from all branches of the tree ("7 here", "3 here", ...).
Combining all of his incoming reports with "1 of me", this soldier's belief:
there must be 14 of us.
Great ideas in ML: Message passing (MacKay 2003)
Each soldier receives reports from all branches of the tree; combining them with
"1 of me" gives the belief: there must be 14 of us.
This wouldn't work correctly with a "loopy" (cyclic) graph.
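A toy sketch of the counting trick (the tree, node names, and function name are invented): each node's message to a neighbor is 1 plus the messages from all its other neighbors, and a node's belief is 1 plus all of its incoming messages. This only gives the right count because the graph is a tree.

```python
from collections import defaultdict

def count_by_messages(edges):
    """Count the nodes of an undirected tree by message passing (MacKay's soldiers)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def message(src, dst):
        # "There are this many of us behind me": 1 for src, plus the reports
        # from all of src's branches except the one leading toward dst.
        return 1 + sum(message(nbr, src) for nbr in adj[src] - {dst})

    return {node: 1 + sum(message(nbr, node) for nbr in adj[node]) for node in adj}

# A small tree: every node computes the same total from local messages only.
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("D", "E"), ("D", "F")]
print(count_by_messages(edges))   # every belief equals 6
```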
Great ideas in ML: Forward-backward
In the CRF, message passing = forward-backward.
Example at the variable for "preferred" (unary factor v 0.3, n 0, a 0.1):
- the two incoming messages (α and β): v 2, n 1, a 7 and v 3, n 1, a 6
- belief = unary * α * β (elementwise) = v 1.8, n 0, a 4.2
Other messages flowing along the chain (e.g. v 7, n 2, a 1 and v 3, n 6, a 1)
are computed through the binary factor table

         v  n  a
    v    0  2  1
    n    2  1  0
    a    0  3  1
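A minimal sketch of the same computation with vectors (illustrative, not the paper's code). The factor tables and the message values are taken from the slide; which incoming message is α and which is β is my guess and doesn't affect the belief.

```python
import numpy as np

TAGS = ["v", "n", "a"]
# Binary factor from the slides: compatibility of adjacent tags (rows = left, cols = right).
BINARY = np.array([[0, 2, 1],
                   [2, 1, 0],
                   [0, 3, 1]], dtype=float)

def forward_message(alpha_in, unary):
    """One sum-product step: fold the incoming alpha message and this position's
    unary factor through the binary factor to get the message for the next position."""
    return (alpha_in * unary) @ BINARY

def backward_message(beta_in, unary):
    return BINARY @ (beta_in * unary)

# Belief at a variable = unary * incoming alpha * incoming beta (all elementwise).
unary_preferred = np.array([0.3, 0.0, 0.1])
alpha_in = np.array([2.0, 1.0, 7.0])    # one incoming message from the slide
beta_in  = np.array([3.0, 1.0, 6.0])    # the other incoming message
belief = unary_preferred * alpha_in * beta_in
print(dict(zip(TAGS, belief.round(2))))                        # ~ {'v': 1.8, 'n': 0.0, 'a': 4.2}
print(forward_message(alpha_in, unary_preferred).round(2))     # ~ [0.  3.3  1.3]
```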
Great ideas in ML: Forward-backward
Extend the CRF to a "skip chain" to capture a non-local factor.
More influences on the belief at "preferred":
- the incoming α and β messages as before: v 2, n 1, a 7 and v 3, n 1, a 6
- the unary factor: v 0.3, n 0, a 0.1
- plus a message from the skip-chain factor: v 3, n 1, a 6
- new belief: v 5.4, n 0, a 25.2
Great ideas in ML: Forward-backward
Extend the CRF to a "skip chain" to capture a non-local factor.
- More influences on the belief (as on the previous slide: v 5.4, n 0, a 25.2)
- But the graph becomes loopy.
- The messages around the loop (shown in red on the slide) are not independent?
  Pretend they are!
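"Pretend they are" is exactly what loopy belief propagation does: keep applying the tree-style update rules even though the graph has cycles. Here is a generic sum-product sketch (not from the paper; the toy factor graph at the end is invented).

```python
import numpy as np

def loopy_bp(variables, factors, iters=20):
    """Generic sum-product loopy BP on a discrete factor graph (toy sketch).
    variables: {name: number_of_values}
    factors:   list of (tuple_of_var_names, numpy_table) pairs
    Returns approximate marginal beliefs for each variable."""
    msg_vf, msg_fv = {}, {}          # variable->factor and factor->variable messages
    for i, (scope, _) in enumerate(factors):
        for v in scope:
            msg_vf[(v, i)] = np.ones(variables[v])
            msg_fv[(i, v)] = np.ones(variables[v])

    for _ in range(iters):
        # Variable -> factor: product of messages from the variable's *other* factors
        # (this is where loopy BP pretends the incoming messages are independent).
        for i, (scope, _) in enumerate(factors):
            for v in scope:
                m = np.ones(variables[v])
                for j, (scope_j, _) in enumerate(factors):
                    if j != i and v in scope_j:
                        m = m * msg_fv[(j, v)]
                msg_vf[(v, i)] = m / m.sum()     # normalize for numerical stability
        # Factor -> variable: multiply in the other variables' messages, then sum them out.
        for i, (scope, table) in enumerate(factors):
            for k, v in enumerate(scope):
                t = table.astype(float).copy()
                for k2, v2 in enumerate(scope):
                    if v2 != v:
                        shape = [1] * table.ndim
                        shape[k2] = variables[v2]
                        t = t * msg_vf[(v2, i)].reshape(shape)
                m = t.sum(axis=tuple(a for a in range(table.ndim) if a != k))
                msg_fv[(i, v)] = m / m.sum()

    beliefs = {}
    for v, size in variables.items():
        b = np.ones(size)
        for i, (scope, _) in enumerate(factors):
            if v in scope:
                b = b * msg_fv[(i, v)]
        beliefs[v] = b / b.sum()
    return beliefs

# A loopy toy graph: three binary variables in a triangle of "prefer to agree" factors,
# plus one unary factor that pulls A toward value 0.
agree = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
beliefs = loopy_bp({"A": 2, "B": 2, "C": 2},
                   [(("A",), np.array([3.0, 1.0])),
                    (("A", "B"), agree), (("B", "C"), agree), (("A", "C"), agree)])
print(beliefs)   # A leans strongly toward 0; B and C lean the same way via the loop
```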
Loopy belief propagation for parsing
- Higher-order factors (e.g., grandparent) induce loops.
- Let's watch a loop around one triangle of the graph ...
- Strong links suppress or promote other links.
Loopy belief propagation for parsing
How did we compute the outgoing message to the green link?
"Does the TREE factor think that the green link is probably t,
given the messages it receives from all the other links?"

    TREE factor:
    f f f f f f    0
    f f f f f t    0
    f f f f t f    0
    ...
    f f t f f t    1
    ...
    t t t t t t    0
Loopy belief propagation for parsing
How did we compute the outgoing message to the green link?
"Does the TREE factor think that the green link is probably t,
given the messages it receives from all the other links?"
- But this is just the outside probability of the green link!
- Belief propagation assumes the incoming messages to TREE are independent,
  so the outgoing messages can be computed with first-order parsing algorithms
  (fast, with no grammar constant).
- The TREE factor computes all of its outgoing messages at once (given all incoming messages):
  - Projective case: total O(n^3) time by inside-outside.
  - Non-projective case: total O(n^3) time by inverting the Kirchhoff matrix (Smith & Smith, 2007).
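For the non-projective case, here is a sketch of how one Kirchhoff (Laplacian) matrix inversion yields all the quantities the TREE factor needs, under my reading of the Matrix-Tree construction in Smith & Smith (2007); it is not the paper's code, and the toy weights are invented.

```python
import numpy as np

def edge_marginals(w):
    """Edge marginals of an edge-factored distribution over non-projective trees.
    w[h][m] = multiplicative weight of the link h -> m, for h, m in 0..n with
    node 0 the artificial root (column 0 and the diagonal are ignored).
    Returns (Z, mu) where mu[h][m] = P(link h -> m is in the tree)."""
    n = w.shape[0] - 1
    L = np.zeros((n, n))                       # Kirchhoff / Laplacian over nodes 1..n
    for m in range(1, n + 1):
        L[m - 1, m - 1] = sum(w[h, m] for h in range(0, n + 1) if h != m)
        for h in range(1, n + 1):
            if h != m:
                L[h - 1, m - 1] = -w[h, m]
    Z = np.linalg.det(L)                       # total weight of all trees (Matrix-Tree theorem)
    Linv = np.linalg.inv(L)
    mu = np.zeros_like(w)
    for m in range(1, n + 1):
        mu[0, m] = w[0, m] * Linv[m - 1, m - 1]
        for h in range(1, n + 1):
            if h != m:
                mu[h, m] = w[h, m] * (Linv[m - 1, m - 1] - Linv[m - 1, h - 1])
    return Z, mu

# Tiny sanity check: 2 words, 3 possible trees.
w = np.array([[0.0, 1.0, 2.0],     # root -> 1 has weight 1, root -> 2 has weight 2
              [0.0, 0.0, 3.0],     # 1 -> 2 has weight 3
              [0.0, 4.0, 0.0]])    # 2 -> 1 has weight 4
Z, mu = edge_marginals(w)
print(Z)             # ~ 13: trees {0->1, 1->2} (3) + {0->2, 2->1} (8) + {0->1, 0->2} (2)
print(mu.round(3))   # each word's incoming-link marginals sum to 1
```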
Runtimes for each factor type, per iteration (see paper):

    Factor type        degree    runtime    count     total
    Tree               O(n^2)    O(n^3)     1         O(n^3)
    Proj. Tree         O(n^2)    O(n^3)     1         O(n^3)
    Individual links   1         O(1)       O(n^2)    O(n^2)
    Grandparent        2         O(1)       O(n^3)    O(n^3)
    Sibling pairs      2         O(1)       O(n^3)    O(n^3)
    Sibling bigrams    O(n)      O(n^2)     O(n)      O(n^3)
    NoCross            O(n)      O(n)       O(n^2)    O(n^3)
    Tag                1         O(n)       O(n^2)    O(n^3)
    TagLink            3         O(g^2)     O(n^2)    O(n^2 g^2)
    TagTrigram         O(n)      O(ng)      1         O(ng)
    TOTAL                                             O(n^3)

    The totals are additive (+=), not multiplicative!
Runtimes for each factor type, per iteration (see paper):
(same table as on the previous slide; the total O(n^3) is additive, not multiplicative)
- Each "global" factor coordinates an unbounded number of variables.
- Standard belief propagation would take exponential time to iterate over
  all configurations of those variables.
- See the paper for efficient propagators.
Experimental details
Decoding:
- Run several iterations of belief propagation.
- Get the final beliefs at the link variables.
- Feed them into a first-order parser; this gives the minimum Bayes risk tree.
Training:
- BP computes beliefs about each factor, too ...
- ... which give us gradients for maximum conditional likelihood
  (as in the forward-backward algorithm).
Features used in the experiments:
- First-order: individual links, just as in McDonald et al. (2005).
- Higher-order: Grandparent, Sibling bigrams, NoCross.
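A sketch of the decoding step under one possible realization: networkx's Chu-Liu/Edmonds implementation stands in for "a first-order parser", and the belief values are invented for the example. Scoring each link by its belief and extracting the maximum-weight arborescence picks the tree with the highest expected number of correct links.

```python
import networkx as nx

def mbr_tree(beliefs):
    """Minimum-Bayes-risk-style decode: pick the tree with the highest total
    link belief.  beliefs[(h, m)] = BP's final belief that link h -> m is 't',
    with node 0 as the artificial root."""
    G = nx.DiGraph()
    for (h, m), p in beliefs.items():
        G.add_edge(h, m, weight=p)
    # Chu-Liu/Edmonds stands in here for any first-order (MST) parser.
    tree = nx.maximum_spanning_arborescence(G, attr="weight")
    return sorted(tree.edges())

# Illustrative beliefs for a 3-word sentence (values made up).
beliefs = {(0, 1): 0.9, (0, 2): 0.2, (0, 3): 0.1,
           (1, 2): 0.7, (2, 1): 0.1, (1, 3): 0.3,
           (2, 3): 0.8, (3, 2): 0.1, (3, 1): 0.05}
print(mbr_tree(beliefs))   # [(0, 1), (1, 2), (2, 3)]
```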
Dependency accuracy
The extra, higher-order features help! (non-projective parsing)

                     Danish   Dutch   English
    Tree+Link         85.5     87.3     88.6
    +NoCross          86.1     88.3     89.1
    +Grandparent      86.1     88.6     89.4
    +ChildSeq         86.5     88.5     90.1
Dependency accuracy
The extra, higher-order features help! (non-projective parsing)

                                             Danish   Dutch   English
    Tree+Link                                 85.5     87.3     88.6
    +NoCross                                  86.1     88.3     89.1
    +Grandparent                              86.1     88.6     89.4
    +ChildSeq                                 86.5     88.5     90.1
    Best projective parse with all factors
      (exact, slow)                           86.0     84.5     90.2
    +hill-climbing
      (doesn't fix enough edges)              86.1     87.6     90.2
Time vs. projective search error
[Plots: search error as a function of the number of BP iterations, compared with
O(n^4) DP and with O(n^5) DP.]
This paper in a nutshell: Freedom regained
- Output probability is defined as a product of local and global factors.
  Throw in any factors we want! (log-linear model)
  Each factor must be fast, but they run independently.
- Let local factors negotiate via "belief propagation".
  Each bit of syntactic structure is influenced by others.
  Some factors need combinatorial algorithms to compute messages fast,
  e.g., existing parsing algorithms using dynamic programming.
- Each iteration takes total time O(n^3) or even O(n^2); see the paper.
  (Compare reranking and stacking.)
- Converges to a pretty good (but approximate) global parse.
- Fast parsing for formerly intractable or slow models; the extra features help accuracy.
Future opportunities
- Modeling hidden structure: POS tags, link roles, secondary links (DAG-shaped parses)
- Beyond dependencies: constituency parsing, traces, lattice parsing
- Beyond parsing: alignment, translation; bipartite matching and network flow;
  joint decoding of parsing and other tasks (IE, MT, ...)
- Beyond text: image tracking and retrieval; social networks
Thanks!
The TREE factor
What is this message? P(3 -> 2 link | other links).
So, if P(2 -> 3 link) = 1, then P(3 -> 2 link | other links) = 0.
- For projective trees: the outside probability.
- For non-projective trees: the inverse Kirchhoff matrix.
- Either way: edge-factored parsing!
Runtime: BP vs. DP
[Plots: runtime of belief propagation vs. O(n^4) DP and vs. O(n^5) DP.]