Slide 1 (28 June 2007, EMNLP-CoNLL)
Probabilistic Models of Nonprojective Dependency Trees
David A. Smith, Center for Language and Speech Processing, Computer Science Dept., Johns Hopkins University
Noah A. Smith, Language Technologies Institute and Machine Learning Dept., School of Computer Science, Carnegie Mellon University
Slide 2 (28 June 2007, EMNLP-CoNLL)
See Also
- On the Complexity of Non-Projective Data-Driven Dependency Parsing. R. McDonald and G. Satta. IWPT 2007.
- Structured Prediction Models via the Matrix-Tree Theorem. T. Koo, A. Globerson, X. Carreras, and M. Collins. EMNLP-CoNLL 2007. (Coming up next!)
Slide 3 (28 June 2007, EMNLP-CoNLL)
Nonprojective Syntax
[Dependency-tree figures for two nonprojective sentences, each with an artificial ROOT node:]
- Latin: "ista meam norit gloria canitiem" (gloss: that-NOM my-ACC may-know glory-NOM going-gray-ACC; "That glory shall last till I go gray")
- English: "I 'll give a talk tomorrow on bootstrapping"
How would we parse this?
Slide 4 (28 June 2007, EMNLP-CoNLL)
Edge-Factored Models (McDonald et al., 2005)
- A non-negative score for each edge (parent, child); edges are scored in isolation
- Find the best-scoring combination of edges among legal trees
- Decode with the maximum spanning tree algorithm of Chu-Liu-Edmonds
- NP-hard to add sibling or degree constraints, or hidden node variables
- What about training? (Unlabeled parsing for now)
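In symbols, the edge-factored model the slide describes can be sketched as follows (notation assumed rather than copied from the talk: s(i, j) >= 0 scores the edge from parent i to child j, and Y(x) is the set of legal dependency trees for sentence x):

```latex
% Edge-factored tree score and the decoding problem (notation assumed)
s(\mathbf{x}, y) \;=\; \prod_{(i,j) \in y} s(i,j),
\qquad
\hat{y} \;=\; \operatorname*{arg\,max}_{y \in \mathcal{Y}(\mathbf{x})} \; s(\mathbf{x}, y)
```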
Slide 5 (28 June 2007, EMNLP-CoNLL)
If Only It Were Projective...
[Projective dependency tree over "I 'll give a talk tomorrow on bootstrapping", with ROOT]
An Inside-Outside algorithm gives us:
- the normalizing constant for globally normalized models
- posterior probabilities of edges
- sums over hidden variables
But we can't use Inside-Outside for nonprojective parsing!
Slide 6 (28 June 2007, EMNLP-CoNLL)
Graph Theory to the Rescue!
Tutte's Matrix-Tree Theorem (1948): the determinant of the Kirchhoff (a.k.a. Laplacian) matrix of a directed graph G, with row and column r removed, equals the sum of the scores of all directed spanning trees of G rooted at node r.
Exactly the Z we need, and in O(n^3) time!
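Stated as an equation (a sketch consistent with the theorem above; K is the Kirchhoff matrix built on the next slide, s(i, j) is the score of the edge from parent i to child j, and K^(r) is K with row and column r deleted):

```latex
% Matrix-Tree Theorem: the tree sum Z is a determinant
K_{jj} = \sum_{i} s(i,j), \qquad K_{ij} = -s(i,j) \;\; (i \neq j),
\qquad
Z \;=\; \sum_{y \in \mathcal{Y}} \prod_{(i,j) \in y} s(i,j) \;=\; \det K^{(r)}
```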
Slide 7 (28 June 2007, EMNLP-CoNLL)
Building the Kirchhoff (Laplacian) Matrix
- Negate the edge scores
- Sum the columns (children) to form the diagonal
- Strike the root row and column
- Take the determinant
N.B.: This allows multiple children of the root, but see Koo et al. 2007.
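A minimal numpy sketch of this recipe (not the authors' code; it assumes scores[i, j] holds the non-negative score of the edge from parent i to child j, node 0 is the artificial ROOT, and diagonal entries are ignored):

```python
import numpy as np

def matrix_tree_partition(scores):
    """Z = sum over all directed spanning trees rooted at node 0 of the
    product of their edge scores (Tutte's Matrix-Tree Theorem)."""
    n = scores.shape[0]
    K = -scores.copy()                                  # 1. negate edge scores
    np.fill_diagonal(K, 0.0)
    K[np.arange(n), np.arange(n)] = -K.sum(axis=0)      # 2. column sums on the diagonal
    K_hat = K[1:, 1:]                                   # 3. strike the root row and column
    return np.linalg.det(K_hat)                         # 4. determinant = Z, in O(n^3) time

# Tiny usage example: three words plus ROOT, random positive scores.
rng = np.random.default_rng(0)
scores = rng.uniform(0.1, 1.0, size=(4, 4))
scores[:, 0] = 0.0                                      # nothing may govern ROOT
print(matrix_tree_partition(scores))
```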
Slide 8 (28 June 2007, EMNLP-CoNLL)
Why Should This Work?
- Analogy to Chu-Liu-Edmonds: every node selects its best parent; if there are cycles, contract them and recur
- Proof sketch: clear for a 1x1 matrix; use induction for larger ones
- The classical result is for the undirected case; the directed case needs special handling of the root
Slide 9 (28 June 2007, EMNLP-CoNLL)
When You Have a Hammer... (the Matrix-Tree Theorem)
- Sequence-normalized log-linear models (Lafferty et al. '01)
- Minimum Bayes-risk parsing (cf. Goodman '96)
- Hidden-variable models
- O(n) inference with length constraints (cf. N. Smith & Eisner '05)
- Minimum-risk training (D. Smith & Eisner '06)
- Tree (Rényi) entropy (Hwa '01; S & E '07)
Slide 10 (28 June 2007, EMNLP-CoNLL)
Analogy to Other Models
[Table comparing sequence, projective, and non-projective models]
- Projective: PCFGs; Shift-Reduce (action-based); projective global log-linear; max-margin (or error-driven, e.g. McDonald, Collins)
- Non-projective: ?; parent-predicting (K. Hall '07); This Work
Slide 11 (28 June 2007, EMNLP-CoNLL)
More Machinery: The Gradient
- Since Z is the determinant of the root-struck Kirchhoff matrix, the gradient of log Z with respect to each edge score can be read off the matrix inverse
- Invert the Kirchhoff matrix K in O(n^3) time via LU factorization
- The edge gradient is also the edge posterior probability
- Use the chain rule to backpropagate into s(i, j), whatever its internal structure may be
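A sketch of the same computation in numpy, under the assumptions of the earlier sketch (scores[i, j] scores parent i -> child j, node 0 is ROOT). The identities used follow from differentiating log det: d log Z / d s(i,j) equals (K_hat^-1)_{jj} - (K_hat^-1)_{ji} for a non-root parent i and (K_hat^-1)_{jj} for the root, and the edge posterior is s(i, j) times that gradient:

```python
import numpy as np

def edge_posteriors(scores):
    """Posterior probability of each edge (parent i -> child j) under the
    distribution over spanning trees rooted at node 0 that is proportional
    to the product of edge scores. Sketch only."""
    n = scores.shape[0]
    K = -scores.copy()
    np.fill_diagonal(K, 0.0)
    K[np.arange(n), np.arange(n)] = -K.sum(axis=0)
    Kinv = np.linalg.inv(K[1:, 1:])          # root row/column struck; indices shift by 1

    post = np.zeros_like(scores)
    for j in range(1, n):                    # child j (never ROOT)
        post[0, j] = scores[0, j] * Kinv[j - 1, j - 1]          # ROOT as parent
        for i in range(1, n):                                   # non-ROOT parent
            if i != j:
                post[i, j] = scores[i, j] * (Kinv[j - 1, j - 1] - Kinv[j - 1, i - 1])
    return post

# Sanity check: each child's posteriors over candidate parents sum to 1.
rng = np.random.default_rng(0)
scores = rng.uniform(0.1, 1.0, size=(4, 4))
scores[:, 0] = 0.0
print(edge_posteriors(scores).sum(axis=0))   # approx. [0., 1., 1., 1.]
```

In practice one would reuse an LU factorization of the root-struck matrix, as the slide says, rather than forming the explicit inverse; np.linalg.inv is used here only to keep the sketch short.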
Slide 12 (28 June 2007, EMNLP-CoNLL)
Nonprojective Conditional Log-Linear Training

train | Arabic | Czech | Danish | Dutch
MIRA  |  79.9  |  81.4 |  86.6  |  90.0
CL    |  80.4  |  80.2 |  87.5  |  90.0

- Data: CoNLL 2006 (Danish, Dutch) and CoNLL 2007 (Arabic, Czech)
- Features from McDonald et al. 2005
- Compared with MSTParser's MIRA max-margin training
- Conditional log-linear (CL) weights trained with stochastic gradient descent
- Same number of iterations and stopping criteria as MIRA
- Significance assessed with a paired permutation test
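The slides only name the optimizer, so the following is a hypothetical sketch of one stochastic-gradient step on the conditional log-likelihood, assuming a linear edge scorer s(i, j) = exp(theta . feats[i, j]); feats, gold_edges, and eta are illustrative names not taken from the talk, and edge_posteriors is the sketch above:

```python
import numpy as np

def sgd_step(theta, feats, gold_edges, eta=0.1):
    """One stochastic-gradient step on log p(gold tree | sentence).
    feats: array of shape (n, n, d), a feature vector per candidate edge.
    gold_edges: set of (parent, child) pairs in the annotated tree.
    Gradient = observed feature counts minus expected counts (edge posteriors)."""
    n = feats.shape[0]
    scores = np.exp(np.einsum('ijd,d->ij', feats, theta))
    scores[:, 0] = 0.0                               # nothing may govern ROOT
    post = edge_posteriors(scores)

    grad = np.zeros_like(theta)
    for i in range(n):
        for j in range(1, n):
            if i == j:
                continue
            observed = 1.0 if (i, j) in gold_edges else 0.0
            grad += (observed - post[i, j]) * feats[i, j]
    return theta + eta * grad
```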
Slide 13 (28 June 2007, EMNLP-CoNLL)
Minimum Bayes-Risk Parsing

parse | train | Arabic | Czech | Danish | Dutch
MAP   | MIRA  |  79.9  |  81.4 |  86.6  |  90.0
MAP   | CL    |  80.4  |  80.2 |  87.5  |  90.0
mBr   | MIRA  |  79.4  |  80.3 |  85.0  |  87.2
mBr   | CL    |  80.5  |  80.4 |  87.5  |  90.0

- Select the tree with not the highest probability but the most expected correct edges
- Plug the edge posteriors into the MST algorithm
- MIRA doesn't estimate probabilities
- N.B.: one could do mBr inside MIRA
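A sketch of the decoding step described here, reusing edge_posteriors from above; chu_liu_edmonds is a hypothetical maximum-spanning-arborescence decoder (not shown, and not part of the talk's materials) assumed to map an n x n weight matrix over parent -> child edges to a tree:

```python
def mbr_parse(scores, chu_liu_edmonds):
    """Minimum Bayes-risk parsing sketch: weight each candidate edge by its
    posterior probability, then take the maximum spanning tree over those
    weights, i.e. the tree with the most expected correct edges."""
    post = edge_posteriors(scores)       # expected correctness of each edge
    return chu_liu_edmonds(post)         # hypothetical MST decoder over posteriors
```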
Slide 14 (28 June 2007, EMNLP-CoNLL)
Edge Clustering
[Figures: "Franz loves Milena" with labeled edges (SUBJ, OBJ) vs. with latent edge clusters (A, B, C / X, Y, Z)]
- (Supervised) labeled dependency parsing, OR
- Simple idea: conjoin each model feature with a cluster
- Sum out all possible edge labelings if we don't care about the labels per se
Slide 15 (28 June 2007, EMNLP-CoNLL)
Edge Clustering
No significant gains or losses from clustering.
Slide 16 (28 June 2007, EMNLP-CoNLL)
What's Wrong with Edge Clustering?
- Edge labels don't interact: in "Franz loves Milena", the clusters A and B on the two edges don't constrain each other
- Unlike clusters on PCFG nonterminals (e.g. Matsuzaki et al. '05), which do interact within a rewrite rule such as NP-A -> NP-B NP-A
- Cf. the small or no gains in unlabeled accuracy from supervised labeled parsers
Slide 17 (28 June 2007, EMNLP-CoNLL)
Constraints on Link Length
- Maximum left/right child distances L and R (cf. Eisner & N. Smith '05); figure shows an example with L=1, R=2
- The Kirchhoff matrix is band-diagonal once the root row and column are removed
- Inversion in O(min(L^3 R^2, L^2 R^3) n) time
Slide 18 (28 June 2007, EMNLP-CoNLL)
Conclusions
- O(n^3) inference for edge-factored nonprojective dependency models
- Performance closely comparable to MIRA
- Learned edge clustering doesn't seem to help unlabeled parsing
- Many other applications to hit
Slide 19 (28 June 2007, EMNLP-CoNLL)
Thanks
- Jason Eisner
- Keith Hall
- Sanjeev Khudanpur
- The anonymous reviewers
- Ryan McDonald, Michael Collins, and colleagues, for sharing drafts