Statistical NLP Spring 2011

Statistical NLP Spring 2011 Lecture 16: Parsing II Dan Klein – UC Berkeley

Classical NLP: Parsing Write symbolic or logical rules and use deduction systems to prove parses from words. Minimal grammar on the “Fed raises” sentence: 36 parses. Simple 10-rule grammar: 592 parses. Real-size grammar: many millions of parses. This scaled very badly and didn’t yield broad-coverage tools. Grammar (CFG): ROOT → S, S → NP VP, NP → DT NN, NP → NN NNS, NP → NP PP, VP → VBP NP, VP → VBP NP PP, PP → IN NP. Lexicon: NN → interest, NNS → raises, VBP → interest, VBZ → raises, …

A Recursive Parser Will this parser work? Why or why not? What are its memory requirements? bestScore(X,i,j,s): if (j = i+1) return tagScore(X,s[i]); else return max over rules X->YZ and split points k of score(X->YZ) * bestScore(Y,i,k) * bestScore(Z,k,j)
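
For concreteness, here is a small Python sketch of this recursive scorer on a toy grammar (the grammar, lexicon, and function names are illustrative assumptions, not the lecture's code). It does compute the right Viterbi scores, because every recursive call is over a strictly smaller span, but it recomputes the same (X, i, j) subproblems over and over, so its running time blows up exponentially with sentence length.

    # A naive recursive Viterbi scorer for a binary PCFG (hypothetical toy data).
    # grammar: parent symbol -> list of (left child, right child, rule probability)
    grammar = {
        "S":  [("NP", "VP", 0.9)],
        "NP": [("DT", "NN", 0.5)],
        "VP": [("VBP", "NP", 0.7)],
    }
    lexicon = {("DT", "the"): 1.0, ("NN", "cat"): 0.5,
               ("NN", "dog"): 0.5, ("VBP", "saw"): 1.0}

    def tag_score(tag, word):
        return lexicon.get((tag, word), 0.0)

    def best_score(X, i, j, sent):
        # Base case: a single word is scored by the lexicon.
        if j == i + 1:
            return tag_score(X, sent[i])
        # Recursive case: try every binary rule X -> Y Z and every split point k.
        best = 0.0
        for Y, Z, p in grammar.get(X, []):
            for k in range(i + 1, j):
                best = max(best, p * best_score(Y, i, k, sent)
                                 * best_score(Z, k, j, sent))
        return best

    print(best_score("S", 0, 5, "the cat saw the dog".split()))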

A Memoized Parser One small change: bestScore(X,i,j,s): if (scores[X][i][j] == null) { if (j = i+1) score = tagScore(X,s[i]); else score = max over rules X->YZ and split points k of score(X->YZ) * bestScore(Y,i,k) * bestScore(Z,k,j); scores[X][i][j] = score } return scores[X][i][j]
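
The memoization really is a one-line idea; in Python, functools.lru_cache can do the bookkeeping. This is a sketch under the same toy-grammar assumptions as above (the sentence becomes a tuple so the cached arguments are hashable); each (X, i, j) triple is now computed once, which is what gives the O(|rules| · n³) behavior discussed below.

    from functools import lru_cache

    def parse(sent, grammar, tag_score):
        """Memoized Viterbi scoring; grammar and tag_score as in the sketch above."""
        sent = tuple(sent)  # hashable, so it can sit behind the cached call

        @lru_cache(maxsize=None)
        def best_score(X, i, j):
            if j == i + 1:
                return tag_score(X, sent[i])
            best = 0.0
            for Y, Z, p in grammar.get(X, []):
                for k in range(i + 1, j):
                    best = max(best, p * best_score(Y, i, k) * best_score(Z, k, j))
            return best

        return best_score("S", 0, len(sent))

    # e.g. parse("the cat saw the dog".split(), grammar, tag_score) with the toy grammar above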

A Bottom-Up Parser (CKY) Can also organize things bottom-up: bestScore(s): for (i : [0,n-1]) for (X : tags[s[i]]) score[X][i][i+1] = tagScore(X,s[i]); for (diff : [2,n]) for (i : [0,n-diff]) { j = i + diff; for (X->YZ : rules) for (k : [i+1, j-1]) score[X][i][j] = max(score[X][i][j], score(X->YZ) * score[Y][i][k] * score[Z][k][j]) }
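
A bottom-up transcription of the same idea in Python, assuming a CNF grammar given as a list of binary rules plus a tag lexicon (the data layout and names are assumptions):

    from collections import defaultdict

    def cky(sent, binary_rules, tags, tag_score):
        """binary_rules: list of (X, Y, Z, prob); tags[word]: candidate tags for word."""
        n = len(sent)
        score = defaultdict(float)            # (X, i, j) -> best probability so far

        # Width-1 spans come straight from the lexicon.
        for i, w in enumerate(sent):
            for X in tags[w]:
                score[X, i, i + 1] = tag_score(X, w)

        # Fill wider spans in order of increasing width.
        for width in range(2, n + 1):
            for i in range(0, n - width + 1):
                j = i + width
                for X, Y, Z, p in binary_rules:
                    for k in range(i + 1, j):
                        cand = p * score[Y, i, k] * score[Z, k, j]
                        if cand > score[X, i, j]:
                            score[X, i, j] = cand
        return score

    # After running, score["S", 0, n] holds the Viterbi probability of the best parse.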

Unary Rules How do we handle unary rules? bestScore(X,i,j,s): if (j = i+1) return tagScore(X,s[i]); else return the larger of max over X->YZ and k of score(X->YZ) * bestScore(Y,i,k) * bestScore(Z,k,j), and max over X->Y of score(X->Y) * bestScore(Y,i,j)

CNF + Unary Closure We need unaries to be non-cyclic. Can address this by pre-calculating the unary closure: rather than having zero or more unaries, always have exactly one – alternate unary and binary layers, and reconstruct the unary chains afterwards.
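
The unary closure itself can be precomputed once from the unary rules alone. Below is a minimal max-product sketch (the rule-table format is an assumption, and backpointers for reconstructing the chains afterwards are omitted); because rule probabilities are at most 1, the best chain never needs to repeat a symbol, so the relaxation converges.

    def unary_closure(unary_rules, symbols):
        """unary_rules: dict (X, Y) -> P(X -> Y), probabilities in (0, 1].
           Returns best[(X, Y)]: score of the best unary chain from X down to Y,
           including the empty chain X -> X with score 1."""
        best = {(X, X): 1.0 for X in symbols}
        for xy, p in unary_rules.items():
            best[xy] = max(best.get(xy, 0.0), p)
        changed = True
        while changed:                           # relax until no chain improves
            changed = False
            for (X, Y), p in unary_rules.items():
                for Z in symbols:
                    via = p * best.get((Y, Z), 0.0)
                    if via > best.get((X, Z), 0.0) + 1e-12:
                        best[X, Z] = via
                        changed = True
        return best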

Alternating Layers bestScoreB(X,i,j,s): return max over X->YZ and k of score(X->YZ) * bestScoreU(Y,i,k) * bestScoreU(Z,k,j). bestScoreU(X,i,j,s): if (j = i+1) return tagScore(X,s[i]); else return max over X->Y of score(X->Y) * bestScoreB(Y,i,j)

Memory How much memory does this require? We have to store the score cache. Cache size: |symbols| * n² doubles. For the plain treebank grammar: |symbols| ≈ 20K, n = 40, a double is ~8 bytes, so ~256MB. Big, but workable. Pruning: Beams – score[X][i][j] can get too large (when?); we can keep beams (truncated maps score[i][j]) which only store the best few scores for the span [i,j]. Pruning: Coarse-to-Fine – use a smaller grammar to rule out most X[i,j]. Much more on this later…
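
The ~256MB figure is just the arithmetic |symbols| × n² × 8 bytes; a quick check:

    symbols, n, bytes_per_double = 20_000, 40, 8
    cache_bytes = symbols * n * n * bytes_per_double
    print(f"{cache_bytes / 1e6:.0f} MB")   # -> 256 MB, the figure on the slide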

Time: Theory How much time will it take to parse? For each diff (≤ n), for each i (≤ n), for each rule X → Y Z, for each split point k: do constant work. Total time: |rules| * n³. Something like 5 seconds for an unoptimized parse of a 20-word sentence.

Time: Practice Parsing with the vanilla treebank grammar (~20K rules; not an optimized parser!): observed exponent 3.6. Why is it worse in practice than the theory suggests? Longer sentences “unlock” more of the grammar, and all kinds of systems issues don’t scale.

Same-Span Reachability [Diagram: same-span reachability among the treebank categories TOP, SQ, X, RRC, NX, LST, ADJP, ADVP, FRAG, INTJ, NP, PP, PRN, QP, S, SBAR, UCP, VP, WHNP, CONJP, NAC, SINV, PRT, SBARQ, WHADJP, WHPP, WHADVP.]

Rule State Reachability Many states are more likely to match larger spans! Example: NP → NP CC: 1 alignment (fenceposts n-1, n). Example: NP → NP CC NP: n alignments (fenceposts n-k-1, n-k, …, n).

Agenda-Based Parsing Agenda-based parsing is like graph search (but over a hypergraph). Concepts: Numbering: we number fenceposts between words. “Edges” or items: spans with labels, e.g. PP[3,5], represent the sets of trees over those words rooted at that label (cf. search states). A chart: records edges we’ve expanded (cf. closed set). An agenda: a queue which holds edges (cf. a fringe or open set). Running example: “critics write reviews with computers”, fenceposts 0–5; PP[3,5] covers “with computers”.

Word Items Building an item for the first time is called discovery. Items go onto the agenda on discovery. To initialize, we discover all word items (with score 1.0). AGENDA: critics[0,1], write[1,2], reviews[2,3], with[3,4], computers[4,5]. CHART: [EMPTY].

Unary Projection When we pop a word item, the lexicon tells us the tag item successors (and their scores), which go onto the agenda: critics[0,1] → NNS[0,1], write[1,2] → VBP[1,2], reviews[2,3] → NNS[2,3], with[3,4] → IN[3,4], computers[4,5] → NNS[4,5].

Item Successors When we pop items off of the agenda: Graph successors: unary projections (NNS → critics, NP → NNS), i.e. Y[i,j] with a rule X → Y forms X[i,j]. Hypergraph successors: combine with items already in our chart, i.e. Y[i,j] and Z[j,k] with a rule X → Y Z form X[i,k]. Enqueue / promote the resulting items (if not in the chart already), record backtraces as appropriate, then stick the popped edge in the chart (closed set). Queries a chart must support: Is edge X[i,j] in the chart? (With what score?) What edges with label Y end at position j? What edges with label Z start at position i?
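
To make the pop/combine loop concrete, here is a uniform-cost sketch in Python (the rule-table formats and names are assumptions, not the lecture's code). Because every rule probability and tag score is at most 1, the first time an edge is popped it already has its best (Viterbi) score, which is what makes the chart a valid closed set.

    import heapq
    from collections import defaultdict

    def agenda_parse(sent, tags, tag_score, unaries, bin_by_left, bin_by_right, goal="ROOT"):
        """unaries[Y]      -> [(X, p)]     for rules X -> Y
           bin_by_left[Y]  -> [(X, Z, p)]  for rules X -> Y Z
           bin_by_right[Z] -> [(X, Y, p)]  for rules X -> Y Z"""
        n = len(sent)
        agenda, chart = [], {}                       # chart: (X, i, j) -> Viterbi score
        ends_at = defaultdict(list)                  # j -> [(label, start)]
        starts_at = defaultdict(list)                # i -> [(label, end)]

        def discover(X, i, j, score):
            if (X, i, j) not in chart:               # only promote unfinished edges
                heapq.heappush(agenda, (-score, X, i, j))

        for i, w in enumerate(sent):                 # initialize with tag items
            for T in tags[w]:
                discover(T, i, i + 1, tag_score(T, w))

        while agenda:
            neg, X, i, j = heapq.heappop(agenda)
            if (X, i, j) in chart:
                continue                             # stale copy; best one already popped
            s = -neg
            chart[X, i, j] = s
            ends_at[j].append((X, i)); starts_at[i].append((X, j))
            if (X, i, j) == (goal, 0, n):
                return s
            for P, p in unaries.get(X, []):          # graph successors: unary projections
                discover(P, i, j, p * s)
            for P, Z, p in bin_by_left.get(X, []):   # hypergraph successors: X is left child
                for Z2, k in starts_at[j]:
                    if Z2 == Z:
                        discover(P, i, k, p * s * chart[Z, j, k])
            for P, Y, p in bin_by_right.get(X, []):  # hypergraph successors: X is right child
                for Y2, h in ends_at[i]:
                    if Y2 == Y:
                        discover(P, h, j, p * chart[Y, h, i] * s)
        return 0.0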

An Example For “critics write reviews with computers”, edges get discovered and popped roughly in this order: the tag items NNS[0,1], VBP[1,2], NNS[2,3], IN[3,4], NNS[4,5]; then NP[0,1], VP[1,2], NP[2,3], NP[4,5]; then S[0,2], VP[1,3], PP[3,5], ROOT[0,2], S[0,3], VP[1,5], NP[2,5], ROOT[0,3], S[0,5], and finally the goal edge ROOT[0,5].

Empty Elements Sometimes we want to posit nodes in a parse tree that don’t contain any pronounced words: “I want you to parse this sentence” vs. “I want [ ] to parse this sentence”. These are easy to add to a chart parser! For each position i, add the “word” edge ε:[i,i]. Add rules like NP → ε to the grammar. That’s it!

UCS / A* With weighted edges, order matters: we must expand an optimal parse from the bottom up (subparses first). CKY does this by processing smaller spans before larger ones; UCS pops items off the agenda in order of decreasing Viterbi score; A* search is also well defined. You can also speed up the search without sacrificing optimality: you can select which items to process first, with any “figure of merit” [Charniak 98]; if your figure of merit is a valid A* heuristic, there is no loss of optimality [Klein and Manning 03].

(Speech) Lattices There was nothing magical about words spanning exactly one position. When working with speech, we generally don’t know how many words there are, or where they break. We can represent the possibilities as a lattice and parse these just as easily. [Example lattice over hypotheses such as “Ivan”, “eyes”, “of”, “awe”, “an”, “I”, “a”, “saw”, “’ve”, “van”.]

Treebank PCFGs [Charniak 96] Use PCFGs for broad-coverage parsing. Can take a grammar right off the trees (doesn’t work well): ROOT → S 1, S → NP VP . 1, NP → PRP 1, VP → VBD ADJP 1, … Model F1: Baseline 72.0
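
Taking a grammar “right off the trees” is plain relative-frequency estimation over local trees. A sketch assuming trees are encoded as nested (label, children...) tuples (the Penn Treebank itself ships as bracketed text, so this encoding is an assumption):

    from collections import Counter

    def read_off_pcfg(trees):
        """trees: nested tuples like ("S", ("NP", ("PRP", "She")), ("VP", ...)).
           Returns P(rule | parent) by relative frequency."""
        rule_counts, parent_counts = Counter(), Counter()

        def visit(node):
            label, children = node[0], node[1:]
            if len(children) == 1 and isinstance(children[0], str):
                rhs = (children[0],)                     # preterminal -> word
            else:
                rhs = tuple(child[0] for child in children)
                for child in children:
                    visit(child)
            rule_counts[label, rhs] += 1
            parent_counts[label] += 1

        for t in trees:
            visit(t)
        return {(lhs, rhs): c / parent_counts[lhs] for (lhs, rhs), c in rule_counts.items()}

    # e.g. read_off_pcfg([("S", ("NP", ("PRP", "She")), ("VP", ("VBD", "slept")))])
    # -> {("S", ("NP", "VP")): 1.0, ("NP", ("PRP",)): 1.0, ("PRP", ("She",)): 1.0, ...}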

Conditional Independence? Not every NP expansion can fill every NP slot A grammar with symbols like “NP” won’t be context-free Statistically, conditional independence too strong

Non-Independence Independence assumptions are often too strong. Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects). Also: the subject and object expansions are correlated! [Chart comparing expansion distributions for all NPs, NPs under S, and NPs under VP.]

Grammar Refinement Example: PP attachment

Grammar Refinement What is parsing about, and why do we need whole sessions devoted to it? Even though we have labeled corpora with observed parse trees, designing grammars is difficult because the observed categories are too coarse to model the true underlying process. Consider this short sentence: there are two NP nodes, one in subject and one in object position. Even though the nodes have the same label in the tree, they have very different properties; for example, you couldn’t exchange them and say “The noise heard she.” Learning grammars or building parsers therefore involves refining the grammar categories. This can be done by structural annotation in the form of parent annotation, which propagates information top-down, or by lexicalization, which propagates information bottom-up. Lately, latent variable methods have been shown to achieve state-of-the-art performance. This is the route we will take here, but we will present new estimation techniques which allow us to drastically reduce the size of the grammars. Structural Annotation [Johnson ’98, Klein & Manning ’03]; Lexicalization [Collins ’99, Charniak ’00]; Latent Variables [Matsuzaki et al. ’05, Petrov et al. ’06].

The Game of Designing a Grammar Annotation refines base treebank symbols to improve statistical fit of the grammar Structural annotation

Typical Experimental Setup Corpus: Penn Treebank, WSJ. Training: sections 02-21; Development: section 22 (here, first 20 files); Test: section 23. Accuracy – F1: harmonic mean of per-node labeled precision and recall. Here, also size: the number of symbols in the grammar. Passive / complete symbols: NP, NP^S. Active / incomplete symbols: NP → NP CC •.
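
For reference, F1 here is simply the harmonic mean of labeled precision and recall:

    def f1(precision, recall):
        return 2 * precision * recall / (precision + recall)

    print(round(f1(86.9, 85.7), 1))   # 86.3, e.g. the unlexicalized parser's LP/LR below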

Vertical Markovization Vertical Markov order: rewrites depend on the past k ancestor nodes (cf. parent annotation). [Tree examples: order 1 vs. order 2.]
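
Order-2 vertical markovization is just parent annotation of the phrasal labels; a sketch on the same tuple-tree encoding assumed earlier (an illustration, not the lecture's code):

    def parent_annotate(node, parent=None):
        """Order-2 vertical markovization: relabel X as X^PARENT (tags and words untouched)."""
        label, children = node[0], node[1:]
        if len(children) == 1 and isinstance(children[0], str):
            return node                                    # preterminal: leave the tag alone
        new_label = label if parent is None else f"{label}^{parent}"
        return (new_label,) + tuple(parent_annotate(c, label) for c in children)

    tree = ("S", ("NP", ("PRP", "She")),
                 ("VP", ("VBD", "heard"), ("NP", ("DT", "the"), ("NN", "noise"))))
    print(parent_annotate(tree))
    # ('S', ('NP^S', ('PRP', 'She')),
    #       ('VP^S', ('VBD', 'heard'), ('NP^VP', ('DT', 'the'), ('NN', 'noise'))))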

Horizontal Markovization Order ∞ vs. Order 1: horizontal Markov order controls how much of the already-generated sibling context a rewrite remembers.

Vertical and Horizontal Examples: raw treebank: v=1, h=∞; Johnson ’98: v=2, h=∞; Collins ’99: v=2, h=2; best F1 here: v=3, h=2. Base model (v=2, h=2): F1 77.8, size 7.5K.
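
Horizontal markovization is usually implemented during binarization: the intermediate symbols remember only the last h sibling labels already generated. A sketch with hypothetical symbol names:

    def binarize(label, children, h=1):
        """Left-to-right binarization with horizontal Markov order h: intermediate
        symbols remember only the last h sibling labels already generated."""
        if len(children) <= 2:
            return [(label, tuple(children))]
        rules, prev, lhs = [], [], label
        for child in children[:-1]:
            prev = (prev + [child])[-h:] if h > 0 else []
            new_lhs = f"@{label}->...{'_'.join(prev)}"
            rules.append((lhs, (child, new_lhs)))
            lhs = new_lhs
        rules.append((lhs, (children[-1],)))
        return rules

    print(binarize("VP", ["VBD", "NP", "PP", "PP"], h=1))
    # [('VP', ('VBD', '@VP->...VBD')), ('@VP->...VBD', ('NP', '@VP->...NP')),
    #  ('@VP->...NP', ('PP', '@VP->...PP')), ('@VP->...PP', ('PP',))]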

Unary Splits Problem: unary rewrites are used to transmute categories so a high-probability rule can be used. Solution: mark unary rewrite sites with -U. Annotation F1 Size: Base 77.8 7.5K; UNARY 78.3 8.0K.

Tag Splits Problem: Treebank tags are too coarse. Example: Sentential, PP, and other prepositions are all marked IN. Partial Solution: Subdivide the IN tag. Annotation F1 Size Previous 78.3 8.0K SPLIT-IN 80.3 8.1K

Other Tag Splits (cumulative F1 and grammar size): UNARY-DT: mark demonstratives as DT^U (“the X” vs. “those”) – 80.4, 8.1K. UNARY-RB: mark phrasal adverbs as RB^U (“quickly” vs. “very”) – 80.5. TAG-PA: mark tags with non-canonical parents (“not” is an RB^VP) – 81.2, 8.5K. SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97] – 81.6, 9.0K. SPLIT-CC: separate “but” and “&” from other conjunctions – 81.7, 9.1K. SPLIT-%: “%” gets its own tag – 81.8, 9.3K.

A Fully Annotated (Unlex) Tree

Some Test Set Results (Parser: LP / LR / F1 / CB / 0 CB): Magerman 95: 84.9 / 84.6 / 84.7 / 1.26 / 56.6. Collins 96: 86.3 / 85.8 / 86.0 / 1.14 / 59.9. Unlexicalized: 86.9 / 85.7 / 86.3 / 1.10 / 60.3. Charniak 97: 87.4 / 87.5 / 87.4 / 1.00 / 62.1. Collins 99: 88.7 / 88.6 / 88.6 / 0.90 / 67.1. Beats “first generation” lexicalized parsers; lots of room to improve – more complex models next.

The Game of Designing a Grammar Annotation refines base treebank symbols to improve statistical fit of the grammar Structural annotation [Johnson ’98, Klein and Manning 03] Head lexicalization [Collins ’99, Charniak ’00]

Problems with PCFGs What’s different between basic PCFG scores here? What (lexical) correlations need to be scored?

Lexicalized Trees Add “headwords” to each phrasal node (syntactic vs. semantic heads; headship is not in most treebanks). Usually use head rules, e.g.: NP: take the leftmost NP; else the rightmost N*; else the rightmost JJ; else the right child. VP: take the leftmost VB*; else the leftmost VP; else the left child.
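
Head rules like these are typically encoded as per-category priority lists searched left-to-right or right-to-left. A simplified sketch of the two lists on this slide (the exact rule sets and fallbacks here are illustrative, not Collins' full tables):

    HEAD_RULES = {
        # (search direction, label prefixes in priority order) -- simplified from the slide
        "NP": [("left", ["NP"]), ("right", ["NN"]), ("right", ["JJ"])],
        "VP": [("left", ["VB"]), ("left", ["VP"])],
    }

    def find_head(label, child_labels):
        """Return the index of the head child; fall back to the slide's default child."""
        for direction, prefixes in HEAD_RULES.get(label, []):
            indices = range(len(child_labels))
            if direction == "right":
                indices = reversed(indices)
            for i in indices:
                if any(child_labels[i].startswith(p) for p in prefixes):
                    return i
        return 0 if label == "VP" else len(child_labels) - 1

    print(find_head("VP", ["VBD", "NP", "PP"]))   # 0: the verb heads the VP
    print(find_head("NP", ["DT", "JJ", "NN"]))    # 2: the rightmost N*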

Lexicalized PCFGs? Problem: we now have to estimate probabilities for fully lexicalized rules, and we are never going to get these atomically off of a treebank. Solution: break the derivation up into smaller steps.

Lexical Derivation Steps A derivation of a local tree [Collins 99] Choose a head tag and word Choose a complement bag Generate children (incl. adjuncts) Recursively derive children

Lexicalized CKY bestScore(X,i,j,h): if (j = i+1) return tagScore(X,s[i]); else return the larger of max over k, h’, X->YZ of score(X[h]->Y[h] Z[h’]) * bestScore(Y,i,k,h) * bestScore(Z,k,j,h’), and max over k, h’, X->YZ of score(X[h]->Y[h’] Z[h]) * bestScore(Y,i,k,h’) * bestScore(Z,k,j,h). (The two cases cover the head coming from the left child vs. the right child, e.g. (VP->VBD...NP)[saw] built from (VP->VBD)[saw] and NP[her].)

Pruning with Beams The Collins parser prunes with per-cell beams [Collins 99]. Essentially, run the O(n⁵) lexicalized CKY, but remember only a few hypotheses for each span <i,j>: if we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?), which keeps things more or less cubic. Also: certain spans are forbidden entirely on the basis of punctuation (crucial for speed).
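
The per-cell beam is just a truncated map per span; a sketch of keeping the K best hypotheses for one cell (the (label, head) keying is an assumption):

    import heapq

    def prune_cell(cell, K=10):
        """cell: dict mapping a hypothesis (e.g. (label, head word)) -> score.
           Keep only the K best entries for this span, as in a per-cell beam."""
        if len(cell) <= K:
            return cell
        kept = heapq.nlargest(K, cell.items(), key=lambda kv: kv[1])
        return dict(kept)

    cell = {("NP", "cat"): 0.03, ("NP", "the"): 0.001, ("S", "cat"): 0.0004}
    print(prune_cell(cell, K=2))   # keeps the two highest-scoring hypotheses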

Pruning with a PCFG The Charniak parser prunes using a two-pass approach [Charniak 97+]. First, parse with the base grammar and, for each X[i,j], calculate P(X|i,j,s) – this isn’t trivial, and there are clever speedups. Second, do the full O(n⁵) CKY, skipping any X[i,j] which had a low (say, < 0.0001) posterior. This avoids almost all work in the second phase! Charniak et al 06: can use more passes. Petrov et al 07: can use many more passes.
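
Schematically, the coarse pass keeps only the edges whose posterior under the coarse grammar clears a threshold. A sketch assuming inside and outside scores for the coarse grammar are already available (names and data layout are assumptions):

    def allowed_spans(inside, outside, sentence_prob, threshold=1e-4):
        """inside / outside: dicts (X, i, j) -> score from the coarse grammar.
           Keep only edges whose posterior P(X spans [i,j] | sentence) clears the
           threshold; the fine pass then skips every (X, i, j) not in this set."""
        keep = set()
        for (X, i, j), a in inside.items():
            posterior = a * outside.get((X, i, j), 0.0) / sentence_prob
            if posterior >= threshold:
                keep.add((X, i, j))
        return keep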

Pruning with A* You can also speed up the search without sacrificing optimality. For agenda-based parsers: you can select which items to process first, with any “figure of merit” [Charniak 98]; if your figure of merit is a valid A* heuristic, there is no loss of optimality [Klein and Manning 03].

Projection-Based A* [Figure: “Factory payrolls fell in Sept.” scored under the full lexicalized edges (S:fell, VP:fell, NP:payrolls, PP:in) and its two projections – the structural projection (S, VP, NP, PP) and the lexical projection (payrolls, fell, in).]

A* Speedup Total time dominated by calculation of A* tables in each projection… O(n³)

Results Some results: Collins 99 – 88.6 F1 (generative lexical); Charniak and Johnson 05 – 89.7 / 91.3 F1 (generative lexical / reranked); Petrov et al 06 – 90.7 F1 (generative unlexicalized); McClosky et al 06 – 92.1 F1 (gen + rerank + self-train). However, bilexical counts rarely make a difference (why?): Gildea 01 – removing bilexical counts costs < 0.5 F1.

The Game of Designing a Grammar Annotation refines base treebank symbols to improve statistical fit of the grammar Parent annotation [Johnson ’98]

The Game of Designing a Grammar Annotation refines base treebank symbols to improve statistical fit of the grammar Parent annotation [Johnson ’98] Head lexicalization [Collins ’99, Charniak ’00]

The Game of Designing a Grammar Annotation refines base treebank symbols to improve statistical fit of the grammar Parent annotation [Johnson ’98] Head lexicalization [Collins ’99, Charniak ’00] Automatic clustering?

Manual Annotation Manually split categories: NP: subject vs. object; DT: determiners vs. demonstratives; IN: sentential vs. prepositional. Advantages: fairly compact grammar; linguistic motivations. Disadvantages: performance leveled out; manually annotated. Model F1: Naïve Treebank Grammar 72.6; Klein & Manning ’03 86.3. Let me first introduce some previous work that we will build upon. Klein & Manning start by annotating a grammar with structural information, in the form of parent and sibling annotation, and then further split the grammar categories by hand, which is a very tedious process and does not guarantee the best performance. Nonetheless they showed for the first time that unlexicalized systems can achieve high parsing accuracies. In addition to parsing well, their grammar was fairly compact, since the splitting was done in a linguistically motivated manual hill-climbing loop.

Automatic Annotation Induction Label all nodes with latent variables, with the same number k of subcategories for all categories. Advantages: automatically learned. Disadvantages: the grammar gets too large; most categories are oversplit while others are undersplit. Model F1: Klein & Manning ’03 86.3; Matsuzaki et al. ’05 86.7.

Learning Latent Annotations EM algorithm: brackets are known, base categories are known, only induce the subcategories. [Diagram: a tree over “He was right .” with latent subcategory variables X1…X7 at each node, with forward and backward passes.] Just like Forward-Backward for HMMs.

Refinement of the DT tag

Hierarchical refinement

Adaptive Splitting Want to split complex categories more Idea: split everything, roll back splits which were least useful

Adaptive Splitting Evaluate the loss in likelihood from removing each split: loss = (data likelihood with the split reversed) / (data likelihood with the split). No loss in accuracy when 50% of the splits are reversed.

Adaptive Splitting Results Model F1 Previous 88.4 With 50% Merging 89.5

Number of Phrasal Subcategories Here are the numbers of subcategories for each of the 25 phrasal categories. In general, phrasal categories are split less heavily than lexical categories, a trend that one could have expected.

Number of Lexical Subcategories Or have little variance.

Final Results (F1, ≤ 40 words / all words): Klein & Manning ’03: 86.3 / 85.7. Matsuzaki et al. ’05: 86.7 / 86.1. Collins ’99: 88.6 / 88.2. Charniak & Johnson ’05: 90.1 / 89.6. Petrov et al. ’06: 90.2 / 89.7. Not only that, we even outperform the fully lexicalized systems of Collins ’99 and the generative component of Charniak and Johnson ’05.

Learned Splits Proper nouns (NNP): NNP-14: Oct., Nov., Sept.; NNP-12: John, Robert, James; NNP-2: J., E., L.; NNP-1: Bush, Noriega, Peters; NNP-15: New, San, Wall; NNP-3: York, Francisco, Street. Personal pronouns (PRP): PRP-0: It, He, I; PRP-1: it, he, they; PRP-2: them, him. Before I finish, let me give you some interesting examples of what our grammars learn. These are intended to highlight some interesting observations; the full list of subcategories and more details are in the paper. The subcategories sometimes capture syntactic and sometimes semantic differences, but very often they represent syntactico-semantic relations, which are similar to those found in distributional clustering results. For example, for the proper nouns the system learns a subcategory for months, first names, last names and initials. It also learns which words typically come first and second in multi-word units. For personal pronouns there is a subcategory for accusative case and ones for sentence-initial and sentence-medial nominative case.

Learned Splits Relative adverbs (RBR): RBR-0: further, lower, higher; RBR-1: more, less, More; RBR-2: earlier, Earlier, later. Cardinal numbers (CD): CD-7: one, two, Three; CD-4: 1989, 1990, 1988; CD-11: million, billion, trillion; CD-0: 1, 50, 100; CD-3: 30, 31; CD-9: 78, 58, 34. Relative adverbs are divided into distance, degree and time. For the cardinal numbers the system learns to distinguish between spelled-out numbers, dates and others.
