CS 388: Natural Language Processing: Neural Shift-Reduce Dependency Parsing Raymond J. Mooney University of Texas at Austin
Shift Reduce Parser Deterministically builds a parse incrementally, bottom up, and left to right, without backtracking. Maintains buffer of input words and a stack of constructed constituents. Perform sequence of operations/actions: Shift: Push the next word in the buffer onto the stack. Reduce: Replace a set of the top elements on the stack with a constituent composed of them.
Sample Parse of “Bob eats pasta” Buffer: [Bob, eats, pasta] Stack (top first): empty
Sample Parse of “Bob eats pasta” Action: Shift Buffer: [eats, pasta] Stack (top first): Bob
Sample Parse of “Bob eats pasta” Action: Reduce(Bob → NP) Buffer: [eats, pasta] Stack (top first): (NP Bob)
Sample Parse of “Bob eats pasta” Action: Shift Buffer: [pasta] Stack (top first): eats (NP Bob)
Sample Parse of “Bob eats pasta” Action: Reduce(eats → VB) Buffer: [pasta] Stack (top first): (VB eats) (NP Bob)
Sample Parse of “Bob eats pasta” Action: Shift Buffer: [] Stack (top first): pasta (VB eats) (NP Bob)
Sample Parse of “Bob eats pasta” Action: Reduce(pasta → NP) Buffer: [] Stack (top first): (NP pasta) (VB eats) (NP Bob)
Sample Parse of “Bob eats pasta” Action: Reduce(VB NP → VP) Buffer: [] Stack (top first): (VP (VB eats) (NP pasta)) (NP Bob)
Sample Parse of “Bob eats pasta” Action: Reduce(NP VP → S) Buffer: [] Stack (top first): (S (NP Bob) (VP (VB eats) (NP pasta)))
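The sample parse above can be replayed with a few lines of Python. This is only a minimal sketch: the action sequence is supplied by hand rather than chosen by a parser, and the names `shift`, `reduce`, `bracket`, `run`, and `ACTIONS` are illustrative assumptions, not part of any library. The stack is a Python list whose last element is the top.

```python
from typing import List, Tuple, Union

# A tree is either a word (str) or a (label, children) pair.
Tree = Union[str, Tuple[str, list]]


def shift(stack: List[Tree], buffer: List[str]) -> None:
    """Move the next buffer word onto the top of the stack."""
    stack.append(buffer.pop(0))


def reduce(stack: List[Tree], n: int, label: str) -> None:
    """Replace the top n stack items with one constituent labelled `label`."""
    children = stack[-n:]
    del stack[-n:]
    stack.append((label, children))


def bracket(tree: Tree) -> str:
    """Render a tree in bracketed notation, e.g. (NP Bob)."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    return "(" + label + " " + " ".join(bracket(c) for c in children) + ")"


def run(words: List[str], actions: List[tuple]) -> List[Tree]:
    """Apply a hand-written action sequence and return the final stack."""
    stack: List[Tree] = []
    buffer = list(words)
    for action in actions:
        if action[0] == "shift":
            shift(stack, buffer)
        else:
            _, n, label = action
            reduce(stack, n, label)
    return stack


# The eight actions of the sample parse, in order.
ACTIONS = [
    ("shift",), ("reduce", 1, "NP"),
    ("shift",), ("reduce", 1, "VB"),
    ("shift",), ("reduce", 1, "NP"),
    ("reduce", 2, "VP"), ("reduce", 2, "S"),
]
final_stack = run(["Bob", "eats", "pasta"], ACTIONS)
print(bracket(final_stack[0]))  # (S (NP Bob) (VP (VB eats) (NP pasta)))
```

A real parser would pick each action from the current stack and buffer instead of reading it from a fixed list.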
Shift Reduce Parsing Must use “look ahead” at the next words in the buffer to pick the correct action. Originally introduced to parse programming languages, which are deterministic context-free languages (DCFLs). Use for NLP requires heuristics to pick an action at each step, and (due to ambiguity) the chosen action could be wrong, resulting in a “garden path.” Can back up when an impasse is reached in order to search for a parse.
Shift-Reduce Dependency Parser Easily adapted to dependency parsing by using reduce operators that introduce dependency arcs. In addition to a stack and buffer, maintain a set of dependency arcs created.
Arc-Standard System (Nivre, 2004) Buffer b = [b1, b2,… bn] Stack s = [s1, s2,… sm] Arcs A = {label(wi, wj), …} Configuration c = (s, b, A) Initial Config: ([ROOT], [w1, w2, … wn], {}) Final Config: ([ROOT], [], {label(wi, wj), …})
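Arc Standard Actions Shift: move b1 from the buffer onto the stack. LeftArc(label): add the arc label(s1, s2) and remove s2 from the stack. RightArc(label): add the arc label(s2, s1) and remove s1 from the stack. Here s1 is the top of the stack and s2 is the item just below it.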
Sample Parse of “He has good control” Stack (top first): ROOT Buffer: [He, has, good, control] Arcs: {}
Sample Parse of “He has good control” Action: Shift Stack (top first): He, ROOT Buffer: [has, good, control] Arcs: {}
Sample Parse of “He has good control” Action: Shift Stack (top first): has, He, ROOT Buffer: [good, control] Arcs: {}
Sample Parse of “He has good control” Action: LeftArc(nsubj) Stack (top first): has, ROOT Buffer: [good, control] Arcs: {nsubj(has,He)}
Sample Parse of “He has good control” Action: Shift Stack (top first): good, has, ROOT Buffer: [control] Arcs: {nsubj(has,He)}
Sample Parse of “He has good control” Action: Shift Stack (top first): control, good, has, ROOT Buffer: [] Arcs: {nsubj(has,He)}
Sample Parse of “He has good control” Action: LeftArc(amod) Stack (top first): control, has, ROOT Buffer: [] Arcs: {nsubj(has,He), amod(control,good)}
Sample Parse of “He has good control” Action: RightArc(dobj) Stack (top first): has, ROOT Buffer: [] Arcs: {nsubj(has,He), amod(control,good), dobj(has,control)}
Sample Parse of “He has good control” Action: RightArc(root) Stack (top first): ROOT Buffer: [] Arcs: {nsubj(has,He), amod(control,good), dobj(has,control), root(ROOT,has)}
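The trace above can be reproduced with a small sketch of the three transitions. The function name `step` and the string action names are assumptions made for illustration. As a simplification, tokens are identified by their word strings, which only works because no word repeats in this sentence; a real parser would use token positions. The stack is a list whose last element is the top.

```python
from typing import List, Optional, Tuple

Arc = Tuple[str, str, str]  # (label, head, dependent)


def step(stack: List[str], buffer: List[str], arcs: List[Arc],
         action: str, label: Optional[str] = None) -> None:
    """Apply one transition in place to the configuration (stack, buffer, arcs)."""
    if action == "shift":
        stack.append(buffer.pop(0))
    elif action == "left_arc":
        # The top item becomes the head of the item below it,
        # and the dependent leaves the stack.
        dependent = stack.pop(-2)
        arcs.append((label, stack[-1], dependent))
    elif action == "right_arc":
        # The item below the top becomes the head of the top item,
        # and the dependent leaves the stack.
        dependent = stack.pop()
        arcs.append((label, stack[-1], dependent))
    else:
        raise ValueError("unknown action: " + action)


stack: List[str] = ["ROOT"]
buffer: List[str] = ["He", "has", "good", "control"]
arcs: List[Arc] = []

for action, label in [
    ("shift", None), ("shift", None), ("left_arc", "nsubj"),
    ("shift", None), ("shift", None), ("left_arc", "amod"),
    ("right_arc", "dobj"), ("right_arc", "root"),
]:
    step(stack, buffer, arcs, action, label)

print(stack, buffer)
for label, head, dep in arcs:
    print(f"{label}({head},{dep})")
```

After the last action the stack holds only ROOT and the buffer is empty, which is the final configuration of the arc-standard system.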
Stanford Neural Dependency Parser (Chen and Manning, 2014) Train a neural net to choose the best shift-reduce parser action to take at each step. Uses features (words, POS tags, arc labels) extracted from the current stack, buffer, and arcs as context. History (through the citation trail): neural shift-reduce parser (Mayberry & Miikkulainen, 1999); decision-tree shift-reduce parser (Hermjakob & Mooney, 1997); simple learned shift-reduce parser (Simmons & Yu, 1992).
Neural Architecture [Figure: neural network for parse action classification]
Context Features Used (rc = right-child, lc = left-child) The top 3 words on the stack and buffer: s1, s2, s3, b1, b2, b3. The first and second leftmost / rightmost children of the top two words on the stack: lc1(si), rc1(si), lc2(si), rc2(si), i = 1, 2. The leftmost-of-leftmost and rightmost-of-rightmost children of the top two words on the stack: lc1(lc1(si)), rc1(rc1(si)), i = 1, 2. Also include the POS tag and parent arc label (where available) for these same items.
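As a rough sketch, the list above yields 6 + 8 + 4 = 18 token slots per configuration. The code below collects them for one configuration from the “He has good control” parse (just after LeftArc(amod)). All names (`left_child`, `right_child`, `context_items`) are hypothetical, tokens are integer positions with 0 for ROOT, and `heads` maps each already-attached dependent to its head. It assumes that “leftmost children” means dependents to the left of the word, and likewise for rightmost; empty slots are reported as NULL.

```python
from typing import Dict, List, Optional

Tok = Optional[int]  # token position, or None when the slot is empty


def left_child(k: int, tok: Tok, heads: Dict[int, int]) -> Tok:
    """k-th leftmost child of `tok`, counting only dependents to its left."""
    if tok is None:
        return None
    kids = sorted(d for d, h in heads.items() if h == tok and d < tok)
    return kids[k - 1] if len(kids) >= k else None


def right_child(k: int, tok: Tok, heads: Dict[int, int]) -> Tok:
    """k-th rightmost child of `tok`, counting only dependents to its right."""
    if tok is None:
        return None
    kids = sorted(d for d, h in heads.items() if h == tok and d > tok)
    return kids[-k] if len(kids) >= k else None


def context_items(stack: List[int], buffer: List[int],
                  heads: Dict[int, int]) -> List[Tok]:
    """Return the 18 token slots; the last element of `stack` is its top."""
    s = [stack[-i] if len(stack) >= i else None for i in (1, 2, 3)]
    b = [buffer[i] if len(buffer) > i else None for i in (0, 1, 2)]
    items: List[Tok] = s + b
    for tok in s[:2]:  # first and second left/right children of s1, s2
        items += [left_child(1, tok, heads), right_child(1, tok, heads),
                  left_child(2, tok, heads), right_child(2, tok, heads)]
    for tok in s[:2]:  # leftmost-of-leftmost, rightmost-of-rightmost
        items += [left_child(1, left_child(1, tok, heads), heads),
                  right_child(1, right_child(1, tok, heads), heads)]
    return items


WORDS = ["ROOT", "He", "has", "good", "control"]
stack, buffer = [0, 2, 4], []   # ROOT, has, control (top); empty buffer
heads = {1: 2, 3: 4}            # nsubj(has,He), amod(control,good)
items = context_items(stack, buffer, heads)
names = [WORDS[t] if t is not None else "NULL" for t in items]
print(len(items), names[:3], names[6], names[10])
```

The same slots would then be looked up three ways (word, POS tag, arc label) to form the network input.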
Input Embeddings Instead of using one-hot input encodings, words and POS tags are each “embedded” as a 50-dimensional vector of input features. Embedding POS tags is unusual since there are relatively few; however, it allows similar tags (e.g. NN and NNS) to have similar embeddings and thereby behave similarly.
Cube Activation Function Alternative non-linear activation function for the hidden units, g(x) = x^3, used instead of sigmoid or tanh. Allows modeling the product terms x_i x_j x_k for any three different input elements. Based on previous empirical results, capturing interactions of three elements seems important for shift-reduce dependency parsing.
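To see where the three-way product terms come from, note that expanding (w1 x1 + w2 x2 + w3 x3 + b)^3 produces, among other terms, 6 w1 w2 w3 x1 x2 x3. The sketch below checks this numerically for a single hypothetical hidden unit with made-up integer weights. The helper `three_way_interaction` uses inclusion-exclusion over the corners of the unit cube, which cancels every term that does not involve all three inputs.

```python
from itertools import product
from typing import Callable, Sequence


def cube_unit(x: Sequence[int], w: Sequence[int], b: int) -> int:
    """One hidden unit with cube activation: (w . x + b) ** 3."""
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) ** 3


def linear_unit(x: Sequence[int], w: Sequence[int], b: int) -> int:
    """The same unit without the non-linearity, for comparison."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def three_way_interaction(f: Callable[[Sequence[int]], int]) -> int:
    """Signed sum of f over {0,1}^3; isolates the x1*x2*x3 contribution."""
    total = 0
    for x in product((0, 1), repeat=3):
        sign = (-1) ** (3 - sum(x))
        total += sign * f(x)
    return total


W, B = (2, 3, 5), 1  # illustrative weights and bias, not learned values

cubic = three_way_interaction(lambda x: cube_unit(x, W, B))
linear = three_way_interaction(lambda x: linear_unit(x, W, B))
print(cubic, 6 * W[0] * W[1] * W[2])  # 180 180
print(linear)                         # 0
```

The linear unit has no three-way term at all, while the cube unit's term is exactly 6 w1 w2 w3, independent of the bias.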
Training Data Automatically construct dependency parses from treebank phrase-structure parse trees. Compute correct sequence of “oracle” shift-reduce parse actions (transitions, ti) at each step from gold-standard parse trees. Determine correct parse sequence by using a “shortest stack” oracle which always prefers LeftArc over Shift.
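One way such an oracle could be written is sketched below. It is an assumption-laden illustration, not the parser's actual code: the function name `oracle` and the dictionaries `gold_head` and `gold_label` are hypothetical, tokens are positions 1..n with 0 for ROOT, and the gold tree is assumed to be projective. At each step it tries LeftArc first, then RightArc (only once the top word has collected all of its own dependents), and shifts otherwise.

```python
from typing import Dict, List


def oracle(n_words: int, gold_head: Dict[int, int],
           gold_label: Dict[int, str]) -> List[str]:
    """Derive an arc-standard action sequence from a gold dependency tree."""
    stack = [0]                            # last element is the top
    buffer = list(range(1, n_words + 1))
    attached = set()                       # dependents that already have a head
    actions: List[str] = []
    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            s1, s2 = stack[-1], stack[-2]
            # Prefer LeftArc whenever it matches the gold tree.
            if s2 != 0 and gold_head[s2] == s1:
                actions.append(f"LeftArc({gold_label[s2]})")
                attached.add(s2)
                del stack[-2]
                continue
            # RightArc removes s1, so s1 must have all its dependents first.
            s1_done = all(d in attached
                          for d, h in gold_head.items() if h == s1)
            if gold_head[s1] == s2 and s1_done:
                actions.append(f"RightArc({gold_label[s1]})")
                attached.add(s1)
                stack.pop()
                continue
        if not buffer:
            raise ValueError("gold tree not reachable (non-projective?)")
        actions.append("Shift")
        stack.append(buffer.pop(0))
    return actions


# Gold tree for "He has good control" (1=He, 2=has, 3=good, 4=control).
gold_head = {1: 2, 2: 0, 3: 4, 4: 2}
gold_label = {1: "nsubj", 2: "root", 3: "amod", 4: "dobj"}
oracle_actions = oracle(4, gold_head, gold_label)
print(oracle_actions)
```

Each (configuration, action) pair visited along the way becomes one training example for the action classifier.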
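Training Algorithm Training objective is to minimize the cross-entropy loss plus an L2-regularization term: $L(\theta) = -\sum_i \log p_{t_i} + \frac{\lambda}{2}\lVert\theta\rVert^2$, where $p_{t_i}$ is the predicted probability of the oracle transition $t_i$, $\theta$ is the set of model parameters, and $\lambda$ is the regularization weight. Initialize word embeddings to precomputed values such as Word2Vec. Use AdaGrad with dropout to compute model parameters that approximately minimize this objective.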
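The objective and update rule described above can be sketched in plain Python. Everything here is illustrative: the function names, the λ/2 scaling of the penalty, and the learning rate and epsilon values are assumptions, and dropout and the network itself are omitted. The scores stand in for the network's pre-softmax outputs over the candidate actions.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn action scores into probabilities (shifted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def objective(score_rows: Sequence[Sequence[float]],
              oracle_ids: Sequence[int],
              params: Sequence[float], lam: float) -> float:
    """Cross-entropy of the oracle actions plus an L2 penalty on params."""
    nll = 0.0
    for scores, t in zip(score_rows, oracle_ids):
        nll -= math.log(softmax(scores)[t])
    l2 = 0.5 * lam * sum(p * p for p in params)  # assumed lambda/2 scaling
    return nll + l2


def adagrad_step(params: List[float], grads: Sequence[float],
                 accum: List[float], lr: float = 0.01,
                 eps: float = 1e-6) -> None:
    """One AdaGrad update in place: per-parameter step sizes shrink as
    squared gradients accumulate."""
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)


# One training example with three equally scored actions: loss = log 3.
print(objective([[0.0, 0.0, 0.0]], [1], [], 0.0))

params, accum = [1.0], [0.0]
adagrad_step(params, [2.0], accum, lr=0.1, eps=0.0)
print(params, accum)
```

In practice the gradients would come from backpropagation through the network rather than being supplied by hand.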
Evaluation Metrics for Dependency Parsing Unlabeled Attachment Score (UAS): % of tokens for which a system has predicted the correct parent. Labeled Attachment Score (LAS): % of tokens for which a system has predicted the correct parent with the correct arc label.
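Both scores follow directly from the definitions above. The sketch below uses a hypothetical helper, `attachment_scores`, that takes one (head, label) pair per token for the gold parse and for the predicted parse. The predicted parse is invented purely to exercise the function and is not a real system output.

```python
from typing import List, Tuple

Attachment = Tuple[int, str]  # (head position, arc label) for one token


def attachment_scores(gold: List[Attachment],
                      pred: List[Attachment]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over all tokens."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equal length")
    n = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))  # head only
    las_hits = sum(g == p for g, p in zip(gold, pred))        # head and label
    return 100.0 * uas_hits / n, 100.0 * las_hits / n


# "He has good control": positions 1..4, 0 is ROOT.
gold = [(2, "nsubj"), (0, "root"), (4, "amod"), (2, "dobj")]
# Made-up prediction: wrong head for "good", wrong label for "control".
pred = [(2, "nsubj"), (0, "root"), (2, "amod"), (2, "iobj")]

uas, las = attachment_scores(gold, pred)
print(uas, las)  # 75.0 50.0
```

LAS can never exceed UAS, since a labeled hit requires the head to be correct as well.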
Sample Results on Penn WSJ Treebank
Conclusions Shift-reduce parsing is an efficient and effective alternative to standard PCFG parsing. Particularly effective for dependency parsing. Models deterministic, left-to-right parsing that seems to characterize human parsing (therefore subject to garden paths). Neural methods to select parse operations give state-of-the-art results.