Statistical NLP Winter 2009
Lecture 11: Parsing II
Roger Levy
Thanks to Jason Eisner & Dan Klein for slides

PCFGs as language models
[Viterbi CKY chart for “time flies like an arrow”, with per-cell categories and weights (NP 3, Vst 3, NP 10, S 8, …, S 22)]
What does the goal weight (neg. log-prob) represent?
It is the probability of the most probable tree whose yield is the sentence.
Suppose we want to do language modeling: we are interested in the probability of all trees.
“Put the file in the folder” vs. “Put the file and the folder”

Could just add up the parse probabilities
Grammar (rule weights = negative log-probs):
1 S → NP VP    6 S → Vst NP    2 S → S PP
1 VP → V NP    2 VP → VP PP
1 NP → Det N   2 NP → NP PP    3 NP → NP NP
0 PP → P NP
[Chart for “time flies like an arrow” with two S analyses in some cells (e.g. S 8 and S 13, S 22 and S 27) and their probabilities 2^-22 and 2^-27]
oops, back to finding exponentially many parses

Any more efficient way?
Grammar as before.
[Same chart, with cell entries written as probabilities, e.g. N 2^-8, PP 2^-12, S 2^-13, and goal entries 2^-22 and 2^-27]

Add as we go … (the “inside algorithm”)
Grammar as before.
[Same chart, now accumulating sums of parse probabilities in each cell, e.g. 2^-8 + 2^-13 and 2^-22 + 2^-27]

Add as we go … (the “inside algorithm”)
Grammar as before.
[Final step of the same animation: the goal cell holds the sum over all parses, 2^-22 + 2^-27 + …]
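
As a rough illustration of “adding as we go”, here is a minimal Python sketch of the inside algorithm. The toy grammar, probabilities, and names are invented for the example (the slides use negative log2 weights instead of raw probabilities):

    from collections import defaultdict

    def inside(words, lexical, binary):
        # chart[(i, j)][X] = total probability of all parses of words[i:j]
        # rooted in X (a sum where Viterbi CKY would take a max).
        n = len(words)
        chart = defaultdict(lambda: defaultdict(float))
        for i, w in enumerate(words):                      # width-1 spans
            for X, p in lexical.get(w, []):
                chart[(i, i + 1)][X] += p
        for width in range(2, n + 1):                      # wider spans
            for i in range(n - width + 1):
                j = i + width
                for k in range(i + 1, j):                  # split point
                    for Y, p_y in chart[(i, k)].items():
                        for Z, p_z in chart[(k, j)].items():
                            for X, p_rule in binary.get((Y, Z), []):
                                chart[(i, j)][X] += p_rule * p_y * p_z
        return chart

    # Toy grammar in probability space (numbers made up for the sketch).
    lexical = {"time": [("NP", 0.1)], "flies": [("V", 0.2)],
               "like": [("P", 0.3)], "an": [("Det", 0.5)], "arrow": [("N", 0.4)]}
    binary = {("NP", "VP"): [("S", 0.6)], ("V", "PP"): [("VP", 0.5)],
              ("P", "NP"): [("PP", 1.0)], ("Det", "N"): [("NP", 0.3)]}

    chart = inside("time flies like an arrow".split(), lexical, binary)
    print(chart[(0, 5)]["S"])   # total probability of the sentence under S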

Charts and lattices
You can equivalently represent a parse chart as a lattice constructed over some initial arcs.
This will also set the stage for Earley parsing later.
[Figure: lattice over the words “salt flies scratch”, with arcs labeled N, NP, V, VP, S built from rules such as S → NP VP, VP → V NP, NP → N]

(Speech) Lattices
There was nothing magical about words spanning exactly one position.
When working with speech, we generally don’t know how many words there are, or where they break.
We can represent the possibilities as a lattice and parse these just as easily.
[Figure: speech lattice with word hypotheses including “I”, “Ivan”, “eyes”, “saw”, “’ve”, “a”, “an”, “of”, “awe”, “van”]

Speech parsing mini-example
Grammar (negative log-probs): [We’ll do it on the board if there’s time]
1 S → NP VP     1 VP → V NP     2 VP → V PP
2 NP → DT NN    2 NP → DT NNS   3 NP → NNP    2 NP → PRP
0 PP → IN NP
1 PRP → I       9 NNP → Ivan    6 NNP → eyes
6 V → saw       4 V → ’ve       7 V → awe
2 DT → a        2 DT → an       3 IN → of
9 NN → awe      6 NN → saw      5 NN → van
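
A tiny sketch of how these negative log-prob weights combine: a derivation’s weight is the sum of its rule weights, so its probability is 2 to the minus that sum. The example derivation of “I saw a van” below is assembled from the rules above purely for illustration:

    # Rules and weights taken from the grammar above;
    # probability of the derivation = 2 ** -(sum of weights).
    derivation = [
        ("S -> NP VP", 1), ("NP -> PRP", 2), ("PRP -> I", 1),
        ("VP -> V NP", 1), ("V -> saw", 6),
        ("NP -> DT NN", 2), ("DT -> a", 2), ("NN -> van", 5),
    ]
    total = sum(w for _, w in derivation)
    print(total, 2.0 ** -total)    # 20, i.e. probability 2^-20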

Better parsing
We’ve now studied how to do correct parsing: this is the problem of inference given a model.
The other half of the question is how to estimate a good model.
That’s what we’ll spend the rest of the day on.

Problems with PCFGs?
If we do no annotation, these trees differ only in one rule: VP → VP PP vs. NP → NP PP.
The parse will go one way or the other, regardless of words.
We’ll look at two ways to address this:
Sensitivity to specific words through lexicalization
Sensitivity to structural configuration with unlexicalized methods

Problems with PCFGs? [insert performance statistics for a vanilla parser]

Problems with PCFGs What’s different between basic PCFG scores here? What (lexical) correlations need to be scored?

Problems with PCFGs
Another example of PCFG indifference: the left structure is far more common.
How to model this?
Really structural: “chicken with potatoes with gravy”
Lexical parsers model this effect, though not by virtue of being lexical.

PCFGs and Independence
Symbols in a PCFG define conditional independence assumptions:
At any node, the material inside that node is independent of the material outside that node, given the label of that node.
Any information that statistically connects behavior inside and outside a node must flow through that node.
[Figure: an S → NP VP tree with the NP expanded by NP → DT NN, illustrating the inside/outside split at the NP node]

Solution(s) to problems with PCFGs
Two common solutions seen for PCFG badness:
(1) Lexicalization: put head-word information into categories, and use it to condition rewrite probabilities.
(2) State-splitting: distinguish sub-instances of more general categories (e.g., NP into NP-under-S vs. NP-under-VP).
You can probably see that (1) is a special case of (2).
More generally, the solution involves information propagation through PCFG nodes.

Lexicalized Trees
Add “headwords” to each phrasal node.
Syntactic vs. semantic heads
Headship not in (most) treebanks
Usually use head rules, e.g.:
NP: take the leftmost NP; take the rightmost N*; take the rightmost JJ; take the right child.
VP: take the leftmost VB*; take the leftmost VP; take the left child.
How is this information propagation?
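
A rough sketch of how head rules like these can be applied, with trees represented as (label, children) tuples; the tag prefixes standing in for “N*” and “VB*” and all names are illustrative simplifications, not from the slides:

    # Trees are (label, children); a preterminal is (tag, word) with a string word.
    HEAD_RULES = {
        # Tried in priority order: (search direction, label prefixes); None = any child.
        "NP": [("left", ["NP"]), ("right", ["NN"]), ("right", ["JJ"]), ("right", None)],
        "VP": [("left", ["VB"]), ("left", ["VP"]), ("left", None)],
    }

    def head_child(label, children):
        # Pick the head child of a local tree using the rule table above.
        for direction, prefixes in HEAD_RULES.get(label, [("right", None)]):
            order = children if direction == "left" else list(reversed(children))
            for child in order:
                if prefixes is None or any(child[0].startswith(p) for p in prefixes):
                    return child
        return children[-1]

    def head_word(tree):
        # Propagate head words bottom-up: a phrase's head word is the
        # head word of its head child.
        label, children = tree
        if isinstance(children, str):        # preterminal (tag, word)
            return children
        return head_word(head_child(label, children))

    tree = ("NP", [("DT", "the"), ("JJ", "red"), ("NN", "ball")])
    print(head_word(tree))                   # -> "ball" (rightmost N*)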

Lexicalized PCFGs?
Problem: we now have to estimate probabilities like P(VP[saw] → VBD[saw] NP[her] NP[today] PP[on]).
Never going to get these atomically off of a treebank.
Solution: break up the derivation into smaller steps.

Lexical Derivation Steps
Simple derivation of a local tree [simplified Charniak 97]:
[Figure: the local tree VP[saw] → VBD[saw] NP[her] NP[today] PP[on], generated one child at a time through intermediate symbols (VP→VBD)[saw], (VP→VBD…NP)[saw], (VP→VBD…PP)[saw]]
It’s markovization again!
Still have to smooth with mono- and non-lexical backoffs.

Lexical Derivation Steps
Another derivation of a local tree [Collins 99]:
Choose a head tag and word
Choose a complement bag
Generate children (incl. adjuncts)
Recursively derive children

Naïve Lexicalized Parsing
Can, in principle, use CKY on lexicalized PCFGs:
O(Rn³) time and O(Sn²) memory, but R = rV² and S = sV.
The result is completely impractical (why?)
Memory: 10K rules * 50K words * (40 words)² * 8 bytes ≈ 6TB
Can modify CKY to exploit lexical sparsity:
Lexicalized symbols are a base grammar symbol and a pointer into the input sentence, not any arbitrary word.
Result: O(rn⁵) time, O(sn³) memory.
Memory: 10K rules * (40 words)³ * 8 bytes ≈ 5GB
Now, why do we get these space & time complexities?
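
The back-of-the-envelope memory figures above can be checked directly; a tiny sketch using the constants quoted on the slide:

    # Naive lexicalized CKY: one chart entry per (rule, head word, span),
    # with head words drawn from the whole vocabulary vs. from the sentence.
    rules, vocab, words, bytes_per = 10_000, 50_000, 40, 8

    naive = rules * vocab * words ** 2 * bytes_per    # arbitrary head words
    sparse = rules * words ** 3 * bytes_per           # heads restricted to the sentence

    print(f"naive:  {naive / 1e12:.1f} TB")   # ~6.4 TB
    print(f"sparse: {sparse / 1e9:.1f} GB")   # ~5.1 GB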

Another view of CKY
The fundamental operation is edge-combining: combine an edge Y over [i,k] and an edge Z over [k,j] into an edge X over [i,j] (e.g. VP(-NP) + NP → VP).

bestScore(X,i,j)
  if (j = i+1)
    return tagScore(X, s[i])
  else
    return max over k, X→YZ of score(X → Y Z) * bestScore(Y,i,k) * bestScore(Z,k,j)

Two string-position indices are required to characterize each edge in memory.
Three string-position indices are required to characterize each edge combination in time.
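
A minimal runnable rendering of the bestScore recursion, reusing the same toy grammar format as the inside-algorithm sketch earlier; the grammar contents and names are invented for the example:

    from functools import lru_cache

    def viterbi_cky(words, lexical, binary, goal="S"):
        # bestScore(X, i, j): probability of the best parse of words[i:j]
        # rooted in X, following the recursion on the slide.
        n = len(words)

        @lru_cache(maxsize=None)
        def best_score(X, i, j):
            if j == i + 1:                              # preterminal cell
                return dict(lexical.get(words[i], [])).get(X, 0.0)
            best = 0.0
            for k in range(i + 1, j):                   # split point
                for (Y, Z), parents in binary.items():
                    for parent, p_rule in parents:
                        if parent == X:
                            best = max(best, p_rule * best_score(Y, i, k)
                                                    * best_score(Z, k, j))
            return best

        return best_score(goal, 0, n)

    # Same toy grammar as in the inside-algorithm sketch above.
    lexical = {"time": [("NP", 0.1)], "flies": [("V", 0.2)],
               "like": [("P", 0.3)], "an": [("Det", 0.5)], "arrow": [("N", 0.4)]}
    binary = {("NP", "VP"): [("S", 0.6)], ("V", "PP"): [("VP", 0.5)],
              ("P", "NP"): [("PP", 1.0)], ("Det", "N"): [("NP", 0.3)]}
    print(viterbi_cky("time flies like an arrow".split(), lexical, binary))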

Lexicalized CKY
Lexicalized CKY has the same fundamental operation, just more edges: e.g. VP(-NP)[saw] + NP[her] → VP[saw].

bestScore(X,i,j,h)
  if (j = i+1)
    return tagScore(X, s[i])
  else
    return max of:
      max over k, X→YZ of score(X[h] → Y[h] Z[h’]) * bestScore(Y,i,k,h) * bestScore(Z,k,j,h’)
      max over k, X→YZ of score(X[h] → Y[h’] Z[h]) * bestScore(Y,i,k,h’) * bestScore(Z,k,j,h)

Three string positions for each edge in space; five string positions for each edge combination in time.

Quartic Parsing
Turns out, you can do better [Eisner 99]:
Gives an O(n⁴) algorithm.
Still prohibitive in practice if not pruned.
[Figure: the five-index combination (i, h, k, h’, j) on the left vs. a restructured four-index combination (i, h, k, j) on the right]

Dependency Parsing
Lexicalized parsers can be seen as producing dependency trees.
Each local binary tree corresponds to an attachment in the dependency graph.
[Figure: dependency tree for “the lawyer questioned the witness”, with “questioned” governing “lawyer” and “witness”, each of which governs its “the”]

Dependency Parsing
Pure dependency parsing is only cubic [Eisner 99].
Some work on non-projective dependencies:
Common in, e.g., Czech parsing.
Can do with MST algorithms [McDonald and Pereira 05], leading to O(n³) or even O(n²) [McDonald et al., 2005].
[Figure: the lexicalized combination over (i, h, k, h’, j) reduced to a bare head-to-head attachment h → h’]

Pruning with Beams
The Collins parser prunes with per-cell beams [Collins 99]:
Essentially, run the O(n⁵) CKY, but remember only a few hypotheses for each span <i,j>.
If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?)
Keeps things more or less cubic.
Side note/hack: certain spans are forbidden entirely on the basis of punctuation (crucial for speed).
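
A minimal sketch of per-cell beam pruning, assuming each chart cell is a dict from (label, head) edges to Viterbi scores; the cell layout, scores, and K are illustrative, not from the slides:

    import heapq

    def prune_cell(cell, K):
        # Keep only the K highest-scoring hypotheses in a chart cell.
        # `cell` maps an edge key, e.g. (label, head word), to its Viterbi score.
        if len(cell) <= K:
            return cell
        return dict(heapq.nlargest(K, cell.items(), key=lambda kv: kv[1]))

    cell = {("VP", "saw"): 1e-7, ("S", "saw"): 3e-9,
            ("VP", "her"): 1e-12, ("NP", "saw"): 2e-13}
    print(prune_cell(cell, K=2))   # only the two best hypotheses survive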

Pruning with a PCFG
The Charniak parser prunes using a two-pass approach [Charniak 97+]:
First, parse with the base grammar; for each X:[i,j] calculate P(X|i,j,s).
This isn’t trivial, and there are clever speed-ups.
Second, do the full O(n⁵) CKY, but skip any X:[i,j] which had low (say, < 0.0001) posterior.
Avoids almost all work in the second phase!
Currently the fastest lexicalized parser.
Charniak et al 06: can use more passes. Petrov et al 07: can use many more passes.
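
A minimal sketch of the posterior test used in the second pass, assuming inside and outside scores from the coarse first pass are available as dictionaries keyed by (label, i, j); the names and table layout are assumptions, and the 1e-4 cutoff is the one quoted above:

    def keep(X, i, j, inside, outside, sentence_prob, threshold=1e-4):
        # Posterior of constituent X:[i,j] given the sentence:
        #   P(X over [i,j] | s) = inside(X,i,j) * outside(X,i,j) / P(s)
        return inside[(X, i, j)] * outside[(X, i, j)] / sentence_prob >= threshold

    # Hypothetical usage with first-pass charts:
    #   if keep("NP", 3, 8, inside_chart, outside_chart, p_sentence):
    #       ...build lexicalized edges over this span in the second pass...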

Pruning with A*
You can also speed up the search without sacrificing optimality.
For agenda-based parsers:
Can select which items to process first.
Can do with any “figure of merit” [Charniak 98].
If your figure-of-merit is a valid A* heuristic, no loss of optimality [Klein and Manning 03].
[Figure: an item X over span (i,j) inside a sentence of length n]

Projection-Based A*
[Figure: the sentence “Factory payrolls fell in Sept.” with lexicalized items (S:fell, VP:fell, NP:payrolls, PP:in) and their two projections — the grammar projection (S, VP, NP, PP) and the lexical projection (fell, payrolls, in)]

A* Speedup
Total time dominated by calculation of A* tables in each projection… O(n³)

Typical Experimental Setup
Corpus: Penn Treebank, WSJ
Evaluation by precision, recall, F1 (harmonic mean)
Training: sections 02-21
Development: section 22
Test: section 23
[Figure: two bracketings of an 8-word sentence compared — {[NP,0,2], [NP,3,5], [NP,6,8], [PP,5,8], [VP,2,5], [VP,2,8], [S,1,8]} vs. {[NP,0,2], [NP,3,5], [NP,6,8], [PP,5,8], [NP,3,8], [VP,2,8], [S,1,8]} — giving Precision = Recall = 7/8]
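
As a rough sketch of how the bracket-based metric is computed, assuming gold and guessed brackets are represented as Python sets of (label, start, end) triples; the function name is an assumption, and the sets below are copied from the transcribed example purely for illustration:

    def bracket_prf(gold, guess):
        # Labeled precision, recall, and F1 over sets of (label, start, end).
        correct = len(gold & guess)
        p = correct / len(guess)
        r = correct / len(gold)
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    gold  = {("NP", 0, 2), ("NP", 3, 5), ("NP", 6, 8), ("PP", 5, 8),
             ("VP", 2, 5), ("VP", 2, 8), ("S", 1, 8)}
    guess = {("NP", 0, 2), ("NP", 3, 5), ("NP", 6, 8), ("PP", 5, 8),
             ("NP", 3, 8), ("VP", 2, 8), ("S", 1, 8)}
    print(bracket_prf(gold, guess))   # precision == recall since |gold| == |guess|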

Results
Some results:
Collins 99 – 88.6 F1 (generative lexical)
Petrov et al 06 – 90.7 F1 (generative unlexical)
However, bilexical counts rarely make a difference (why?)
Gildea 01 – removing bilexical counts costs < 0.5 F1
Bilexical vs. monolexical vs. smart smoothing

Unlexicalized methods
So far we have looked at the use of lexicalized methods to fix PCFG independence assumptions.
Lexicalization creates new complications of inference (computational complexity) and estimation (sparsity).
There are lots of improvements to be made without resorting to lexicalization.

PCFGs and Independence
Symbols in a PCFG define independence assumptions:
At any node, the material inside that node is independent of the material outside that node, given the label of that node.
Any information that statistically connects behavior inside and outside a node must flow through that node.
[Figure: an S → NP VP tree with the NP expanded by NP → DT NN, illustrating the inside/outside split at the NP node]

Non-Independence I
Independence assumptions are often too strong.
Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects).
Also: the subject and object expansions are correlated!
[Figure: histograms of NP expansion types for all NPs, NPs under S, and NPs under VP]

Non-Independence II
Who cares?
NB, HMMs, all make false assumptions!
For generation, consequences would be obvious.
For parsing, does it impact accuracy?
Symptoms of overly strong assumptions:
Rewrites get used where they don’t belong.
Rewrites get used too often or too rarely.
In the PTB, this construction is for possessives.

Breaking Up the Symbols
We can relax independence assumptions by encoding dependencies into the PCFG symbols.
What are the most useful “features” to encode?
Examples shown: parent annotation [Johnson 98]; marking possessive NPs.

Annotations
Annotations split the grammar categories into sub-categories (in the original sense).
Conditioning on history vs. annotating:
P(NP^S → PRP) is a lot like P(NP → PRP | S), or equivalently P(PRP | NP, S).
P(NP-POS → NNP POS) isn’t history conditioning.
Feature / unification grammars vs. annotation:
Can think of a symbol like NP^NP-POS as NP [parent:NP, +POS].
After parsing with an annotated grammar, the annotations are then stripped for evaluation.
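
As a concrete illustration of parent annotation (the NP^S style above), a minimal sketch over trees represented as (label, children) tuples; the representation and function name are assumptions made for the example:

    def parent_annotate(tree, parent=None):
        # Append ^PARENT to every nonterminal label (parent annotation).
        # Preterminals (tag, word) are left alone in this simple sketch.
        label, children = tree
        if isinstance(children, str):               # preterminal: (tag, word)
            return tree
        new_label = f"{label}^{parent}" if parent else label
        return (new_label, [parent_annotate(c, label) for c in children])

    tree = ("S", [("NP", [("PRP", "She")]),
                  ("VP", [("VBD", "saw"),
                          ("NP", [("DT", "the"), ("NN", "van")])])])
    print(parent_annotate(tree))
    # ('S', [('NP^S', [('PRP', 'She')]),
    #        ('VP^S', [('VBD', 'saw'), ('NP^VP', [('DT', 'the'), ('NN', 'van')])])])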

Lexicalization
Lexical heads are important for certain classes of ambiguities (e.g., PP attachment).
Lexicalizing the grammar creates a much larger grammar (cf. next week):
Sophisticated smoothing needed
Smarter parsing algorithms
More data needed
How necessary is lexicalization?
Bilexical vs. monolexical selection
Closed vs. open class lexicalization

Unlexicalized PCFGs
What is meant by an “unlexicalized” PCFG?
Grammar not systematically specified down to the level of lexical items:
NP[stocks] is not allowed; NP^S-CC is fine.
Closed vs. open class words (e.g. NP^S[the]):
Long tradition in linguistics of using function words as features or markers for selection.
Contrary to the bilexical idea of semantic heads.
Open-class selection is really a proxy for semantics.
It’s kind of a gradual transition from unlexicalized to lexicalized (but heavily smoothed) grammars.

Typical Experimental Setup
Corpus: Penn Treebank, WSJ
Accuracy – F1: harmonic mean of per-node labeled precision and recall.
Here: also size – number of symbols in the grammar.
Passive / complete symbols: NP, NP^S
Active / incomplete symbols: NP → NP CC •
Training: sections 02-21
Development: section 22 (here, first 20 files)
Test: section 23

Multiple Annotations
Each annotation is done in succession.
Order does matter.
Too much annotation and we’ll have sparsity issues.

Horizontal Markovization
[Charts comparing horizontal Markov orders, order ∞ vs. order 1]

Vertical Markovization
Vertical Markov order: rewrites depend on past k ancestor nodes (cf. parent annotation).
[Trees comparing order 1 vs. order 2]
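
To make the horizontal side of markovization concrete (the vertical side is just the parent annotation sketched earlier), here is a minimal sketch of order-h binarization; the @-symbol naming scheme and the example rule are illustrative, not the scheme of any particular parser:

    def markovized_binarize(parent, children, h=1):
        # Right-binarize parent -> children into a chain of intermediate
        # symbols that record only the last h already-generated sisters
        # (horizontal Markov order h).  Returns the list of binary rules.
        if len(children) <= 2:
            return [(parent, tuple(children))]
        rules = []
        left_label = parent
        generated = []
        for child in children[:-2]:
            generated.append(child)
            context = ",".join(generated[-h:]) if h > 0 else ""
            intermediate = f"@{parent}[{context}]"
            rules.append((left_label, (child, intermediate)))
            left_label = intermediate
        rules.append((left_label, (children[-2], children[-1])))
        return rules

    for rule in markovized_binarize("VP", ["VBD", "NP", "NP", "PP"], h=1):
        print(rule)
    # ('VP', ('VBD', '@VP[VBD]'))
    # ('@VP[VBD]', ('NP', '@VP[NP]'))
    # ('@VP[NP]', ('NP', 'PP'))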

Markovization
This leads to a somewhat more general view of generative probabilistic models over trees.
Main goal: estimate P(tree).
A bit of an interlude: Tree-Insertion Grammars deal with this problem more directly.

TIG: Insertion

Data-oriented parsing (Bod 1992)
A case of Tree-Insertion Grammars:
Rewrite large (possibly lexicalized) subtrees in a single step.
Derivational ambiguity: whether subtrees were generated atomically or compositionally.
Finding the most probable parse is NP-complete due to the unbounded number of “rules”.

Markovization, cont.
So the question is, how do we estimate these tree probabilities?
What type of tree-insertion grammar do we use?
Equivalently, what type of independence assumptions do we impose?
Traditional PCFGs are only one type of answer to this question.

Vertical and Horizontal
Examples:
Raw treebank: v=1, h=∞
Johnson 98: v=2, h=∞
Collins 99: v=2, h=2
Best F1: v=3, h≤2

Model           F1    Size
Base: v=h=2     77.8  7.5K

Unary Splits
Problem: unary rewrites are used to transmute categories so a high-probability rule can be used.
Solution: mark unary rewrite sites with -U.

Annotation   F1    Size
Base         77.8  7.5K
UNARY        78.3  8.0K

Tag Splits
Problem: treebank tags are too coarse.
Example: sentential, PP, and other prepositions are all marked IN.
Partial solution: subdivide the IN tag.

Annotation   F1    Size
Previous     78.3  8.0K
SPLIT-IN     80.3  8.1K

Other Tag Splits
UNARY-DT: mark demonstratives as DT^U (“the X” vs. “those”)
UNARY-RB: mark phrasal adverbs as RB^U (“quickly” vs. “very”)
TAG-PA: mark tags with non-canonical parents (“not” is an RB^VP)
SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
SPLIT-CC: separate “but” and “&” from other conjunctions
SPLIT-%: “%” gets its own tag

Annotation   F1    Size
UNARY-DT     80.4  8.1K
UNARY-RB     80.5  –
TAG-PA       81.2  8.5K
SPLIT-AUX    81.6  9.0K
SPLIT-CC     81.7  9.1K
SPLIT-%      81.8  9.3K

Treebank Splits
The treebank comes with some annotations (e.g., -LOC, -SUBJ, etc.).
The whole set together hurt the baseline.
One in particular is very useful (NP-TMP) when pushed down to the head tag (why?).
Can mark gapped S nodes as well.

Annotation   F1    Size
Previous     81.8  9.3K
NP-TMP       82.2  9.6K
GAPPED-S     82.3  9.7K

Yield Splits
Problem: sometimes the behavior of a category depends on something inside its future yield.
Examples: possessive NPs; finite vs. infinite VPs; lexical heads!
Solution: annotate future elements into nodes.
Lexicalized grammars do this (in very careful ways – why?).

Annotation   F1    Size
Previous     82.3  9.7K
POSS-NP      83.1  9.8K
SPLIT-VP     85.7  10.5K

Distance / Recursion Splits
Problem: vanilla PCFGs cannot distinguish attachment heights.
Solution: mark a property of higher or lower sites:
Contains a verb.
Is (non)-recursive:
Base NPs [cf. Collins 99]
Right-recursive NPs
[Figure: VP vs. NP attachment configuration, with nodes marked for whether they dominate a verb (-v / v)]

Annotation      F1    Size
Previous        85.7  10.5K
BASE-NP         86.0  11.7K
DOMINATES-V     86.9  14.1K
RIGHT-REC-NP    87.0  15.2K

A Fully Annotated (Unlex) Tree

Some Test Set Results

Parser               LP    LR    F1    CB    0 CB
Magerman 95          84.9  84.6  84.7  1.26  56.6
Collins 96           86.3  85.8  86.0  1.14  59.9
Klein & Manning 03   86.9  85.7  86.3  1.10  60.3
Charniak 97          87.4  87.5  87.4  1.00  62.1
Collins 99           88.7  88.6  88.6  0.90  67.1

Beats “first generation” lexicalized parsers.
Lots of room to improve – more complex models next.

Unlexicalized grammars: SOTA
Klein & Manning 2003’s “symbol splits” were hand-coded.
Petrov and Klein (2007) used a hierarchical splitting process to learn symbol inventories.
Reminiscent of decision trees/CART.
Coarse-to-fine parsing makes it very fast.
Performance is state of the art!

Parse Reranking
Nothing we’ve seen so far allows arbitrarily non-local features.
Assume the number of parses is very small.
We can represent each parse T as an arbitrary feature vector φ(T):
Typically, all local rules are features.
Also non-local features, like how right-branching the overall tree is.
[Charniak and Johnson 05] gives a rich set of features.

Parse Reranking
Since the number of parses is no longer huge:
Can enumerate all parses efficiently.
Can use simple machine learning methods to score trees.
E.g. maxent reranking: learn a binary classifier over trees where the top candidates are positive, all others are negative, and rank trees by P(+|T).
The best parsing numbers have mostly been from reranking systems:
Charniak and Johnson 05 – 89.7 / 91.3 F1 (generative lexical / reranked)
McClosky et al 06 – 92.1 F1 (gen + rerank + self-train)
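
A minimal sketch of the reranking step, assuming candidate parses have already been mapped to feature vectors; the features, weights, and the logistic form of P(+|T) below are invented for illustration and stand in for a trained maxent model:

    import math

    def rerank(candidates, weights):
        # Score each candidate parse T by P(+|T) under a logistic model over
        # its feature vector, and return the candidates best-first.
        def p_plus(features):
            z = sum(weights.get(f, 0.0) * v for f, v in features.items())
            return 1.0 / (1.0 + math.exp(-z))
        return sorted(candidates, key=lambda c: p_plus(c["features"]), reverse=True)

    # Two hypothetical candidate parses for one sentence, with local-rule
    # features and a non-local right-branching feature.
    candidates = [
        {"tree": "parse A", "features": {"rule:VP->VBD NP PP": 1, "right_branching": 3}},
        {"tree": "parse B", "features": {"rule:VP->VBD NP": 1, "rule:NP->NP PP": 1,
                                         "right_branching": 5}},
    ]
    weights = {"rule:VP->VBD NP PP": 0.4, "rule:NP->NP PP": -0.2, "right_branching": 0.1}
    print([c["tree"] for c in rerank(candidates, weights)])   # parse A ranks first here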

Shift-Reduce Parsers
Another way to derive a tree: a sequence of shift and reduce actions.
Parsing:
No useful dynamic programming search.
Can still use beam search [Ratnaparkhi 97].
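
A minimal sketch of deriving a tree by shift and reduce actions; the action sequence below is hand-written for one sentence, whereas a real parser would score actions and keep a beam of partial derivations:

    def run(words, actions):
        # Execute a shift-reduce derivation.  "shift" moves the next word onto
        # the stack; ("reduce", label, k) pops k items and pushes (label, items).
        stack, buffer = [], list(words)
        for act in actions:
            if act == "shift":
                stack.append(buffer.pop(0))
            else:
                _, label, k = act
                children = stack[-k:]
                del stack[-k:]
                stack.append((label, children))
        return stack

    words = ["the", "lawyer", "questioned", "the", "witness"]
    actions = ["shift", "shift", ("reduce", "NP", 2),
               "shift", "shift", "shift", ("reduce", "NP", 2),
               ("reduce", "VP", 2), ("reduce", "S", 2)]
    print(run(words, actions))   # a single S tree left on the stack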

Derivational Representations
Generative derivational models.
How is a PCFG a generative derivational model?
Distinction between parses and parse derivations.
How could there be multiple derivations?

Tree-adjoining grammar (TAG)
Start with local trees.
Can insert structure with adjunction operators.
Mildly context-sensitive.
Models long-distance dependencies naturally… as well as other weird stuff that CFGs don’t capture well (e.g. cross-serial dependencies).

TAG: Adjunction

TAG: Long Distance

TAG: complexity
Recall that CFG parsing is O(n³).
TAG parsing is O(n⁶).
However, lexicalization causes the same kinds of complexity increases as in CFG.

CCG Parsing
Combinatory Categorial Grammar:
Fully (mono-) lexicalized grammar.
Categories encode argument sequences.
Very closely related to the lambda calculus (more later).
Can have spurious ambiguities (why?)

Digression: Is NL a CFG? Cross-serial dependencies in Dutch