Statistical NLP Spring 2011 Lecture 17: Parsing III Dan Klein – UC Berkeley

Treebank PCFGs [Charniak 96] Use PCFGs for broad coverage parsing. Can take a grammar right off the trees (doesn't work well):
  ROOT  S          1
  S  NP VP .       1
  NP  PRP          1
  VP  VBD ADJP     1
  …
  Model      F1
  Baseline   72.0
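To make "taking a grammar right off the trees" concrete, here is a minimal sketch (not Charniak's implementation) that counts rule expansions in a toy treebank and converts them to relative-frequency probabilities; the nested-tuple tree format is just an assumed stand-in.

```python
# Hedged sketch: estimate a treebank PCFG by relative frequency.
# Trees are hypothetical nested tuples: (label, child, child, ...),
# with a plain string as the single child of a preterminal.
from collections import Counter, defaultdict

def rules(tree, out):
    """Recursively collect one (parent, child-labels) rule per internal node."""
    label, *children = tree
    if children and isinstance(children[0], tuple):
        out.append((label, tuple(c[0] for c in children)))
        for c in children:
            rules(c, out)
    return out

treebank = [
    ("ROOT", ("S", ("NP", ("PRP", "He")),
                   ("VP", ("VBD", "was"), ("ADJP", ("JJ", "right"))),
                   (".", "."))),
]

counts = Counter()
for t in treebank:
    counts.update(rules(t, []))

lhs_totals = defaultdict(int)
for (lhs, rhs), c in counts.items():
    lhs_totals[lhs] += c

pcfg = {(lhs, rhs): c / lhs_totals[lhs] for (lhs, rhs), c in counts.items()}
print(pcfg)   # e.g. ('S', ('NP', 'VP', '.')) -> 1.0
```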

Conditional Independence? Not every NP expansion can fill every NP slot. A grammar with symbols like "NP" won't be context-free. Statistically, conditional independence is too strong.

The Game of Designing a Grammar Annotation refines base treebank symbols to improve statistical fit of the grammar: structural annotation.

A Fully Annotated (Unlex) Tree

Some Test Set Results
  Parser         LP    LR    F1    CB    0 CB
  Magerman 95    84.9  84.6  84.7  1.26  56.6
  Collins 96     86.3  85.8  86.0  1.14  59.9
  Unlexicalized  86.9  85.7  86.3  1.10  60.3
  Charniak 97    87.4  87.5  87.4  1.00  62.1
  Collins 99     88.7  88.6  88.6  0.90  67.1
Beats "first generation" lexicalized parsers. Lots of room to improve – more complex models next.

The Game of Designing a Grammar Annotation refines base treebank symbols to improve statistical fit of the grammar: structural annotation [Johnson '98, Klein and Manning '03]; head lexicalization [Collins '99, Charniak '00].

Problems with PCFGs If we do no annotation, these trees differ only in one rule: VP  VP PP vs. NP  NP PP. The parse will go one way or the other, regardless of the words. We addressed this in one way with unlexicalized grammars (how?). Lexicalization allows us to be sensitive to specific words.

Problems with PCFGs What’s different between basic PCFG scores here? What (lexical) correlations need to be scored?

Lexicalized Trees Add "headwords" to each phrasal node. Syntactic vs. semantic heads. Headship not in (most) treebanks. Usually use head rules, e.g.:
  NP: take the leftmost NP; else the rightmost N*; else the rightmost JJ; else the right child.
  VP: take the leftmost VB*; else the leftmost VP; else the left child.
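As a concrete illustration, here is a small sketch of head percolation with rules in this style; the rule table is a toy fragment, not the full Collins head table.

```python
# Hedged sketch of head-rule percolation. Each entry lists (direction, labels)
# searched in order; the table is a toy fragment for illustration only.
HEAD_RULES = {
    "NP": [("leftmost", ["NP"]),
           ("rightmost", ["NN", "NNS", "NNP", "NX"]),
           ("rightmost", ["JJ"])],
    "VP": [("leftmost", ["VB", "VBD", "VBZ", "VBP", "VBN", "VBG"]),
           ("leftmost", ["VP"])],
}

def find_head(parent, child_labels):
    """Return the index of the head child under simple percolation rules."""
    for direction, wanted in HEAD_RULES.get(parent, []):
        order = range(len(child_labels))
        if direction == "rightmost":
            order = reversed(order)
        for i in order:
            if child_labels[i] in wanted:
                return i
    # fallback: right child for NP, left child otherwise
    return len(child_labels) - 1 if parent == "NP" else 0

print(find_head("VP", ["VBD", "NP", "PP"]))   # -> 0 (the VBD)
print(find_head("NP", ["DT", "NN"]))          # -> 1 (the NN)
```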

Lexicalized PCFGs? Problem: we now have to estimate the probability of each fully lexicalized rule. We are never going to get these atomically off of a treebank. Solution: break up the derivation into smaller steps.

Lexical Derivation Steps A derivation of a local tree [Collins 99]: choose a head tag and word; choose a complement bag; generate children (incl. adjuncts); recursively derive children.
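A hedged sketch of this kind of decomposition, in the spirit of Collins 99 but with deliberately simplified conditioning; P_head and P_mod stand in for the smaller distributions that would actually be estimated from the treebank.

```python
# Hedged sketch: a lexicalized rule probability as a product of smaller steps
# (choose the head child, then generate modifiers outward until STOP).
# P_head(H | parent, head_word) and P_mod(M | parent, H, head_word, side)
# are assumed, simplified component distributions.
def lexical_rule_prob(parent, head_word, head_child,
                      left_mods, right_mods, P_head, P_mod):
    p = P_head(head_child, parent, head_word)
    for side, mods in (("L", left_mods), ("R", right_mods)):
        for mod in list(mods) + ["STOP"]:
            p *= P_mod(mod, parent, head_child, head_word, side)
    return p

# Toy usage with made-up constant distributions:
p = lexical_rule_prob("VP", "saw", "VBD", [], ["NP"],
                      P_head=lambda H, P, w: 0.7,
                      P_mod=lambda M, P, H, w, side: 0.5)
print(p)   # 0.7 * 0.5 (left STOP) * 0.5 (NP) * 0.5 (right STOP) = 0.0875
```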

Lexicalized CKY
bestScore(X, i, j, h)
  if (j == i+1)
    return tagScore(X, s[i])
  else
    return max over k, h', rules X[h] -> Y[h] Z[h'] of
             score(X[h] -> Y[h] Z[h']) * bestScore(Y, i, k, h) * bestScore(Z, k, j, h')
       and max over k, h', rules X[h] -> Y[h'] Z[h] of
             score(X[h] -> Y[h'] Z[h]) * bestScore(Y, i, k, h') * bestScore(Z, k, j, h)
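A minimal executable sketch of this recursion, memoized over (X, i, j, h); the toy grammar, tag scores, and head-side encoding below are all assumptions for illustration, and a real parser would index rules rather than scanning them.

```python
# Hedged sketch of the lexicalized CKY recursion, memoized over (X, i, j, h).
# head_side marks whether the parent's head comes from the left child Y (0)
# or the right child Z (1).
from functools import lru_cache

sent = ["she", "saw", "her"]

tag_score = {("NP", "she"): 0.4, ("VBD", "saw"): 0.5, ("NP", "her"): 0.4}
rules = {("S", "NP", "VP", 1): 1.0,    # S[h]  -> NP[h'] VP[h]
         ("VP", "VBD", "NP", 0): 1.0}  # VP[h] -> VBD[h] NP[h']

@lru_cache(maxsize=None)
def best_score(X, i, j, h):
    """Best probability of X spanning words i..j-1 with head word index h."""
    if j == i + 1:
        return tag_score.get((X, sent[i]), 0.0) if h == i else 0.0
    best = 0.0
    for (P, Y, Z, head_side), p in rules.items():
        if P != X:
            continue
        for k in range(i + 1, j):          # split point
            for h2 in range(i, j):         # head of the non-head child
                if head_side == 0:         # head stays in Y = [i, k)
                    if not (i <= h < k <= h2 < j):
                        continue
                    s = p * best_score(Y, i, k, h) * best_score(Z, k, j, h2)
                else:                      # head comes from Z = [k, j)
                    if not (i <= h2 < k <= h < j):
                        continue
                    s = p * best_score(Y, i, k, h2) * best_score(Z, k, j, h)
                best = max(best, s)
    return best

# Best parse probability for the whole sentence, over all head positions.
print(max(best_score("S", 0, len(sent), h) for h in range(len(sent))))  # 0.08
```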

Quartic Parsing Turns out, you can do (a little) better [Eisner 99]: gives an O(n^4) algorithm. Still prohibitive in practice if not pruned.

Pruning with Beams The Collins parser prunes with per-cell beams [Collins 99]. Essentially, run the O(n^5) CKY, but remember only a few hypotheses for each span <i,j>. If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?), which keeps things more or less cubic. Also: certain spans are forbidden entirely on the basis of punctuation (crucial for speed).
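A tiny sketch of the per-cell beam idea (not the Collins parser's actual data structures): each chart cell keeps only its K highest-scoring (symbol, head) hypotheses.

```python
# Hedged sketch of per-cell beam pruning: keep the K best hypotheses per span.
def prune_cell(cell, K=10):
    """cell maps (symbol, head_index) -> score for one span <i, j>."""
    kept = sorted(cell.items(), key=lambda kv: kv[1], reverse=True)[:K]
    return dict(kept)

cell = {("NP", 2): 0.03, ("VP", 1): 0.20, ("S", 1): 0.05}
print(prune_cell(cell, K=2))   # keeps VP and S, drops the weakest hypothesis
```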

Pruning with a PCFG The Charniak parser prunes using a two-pass approach [Charniak 97+]. First, parse with the base grammar; for each X:[i,j] calculate P(X|i,j,s). This isn't trivial, and there are clever speed-ups. Second, do the full O(n^5) CKY, skipping any X:[i,j] which had low (say, < 0.0001) posterior. This avoids almost all work in the second phase! Charniak et al 06: can use more passes. Petrov et al 07: can use many more passes.

Prune? For each chart item X[i,j], compute its posterior probability and compare it to a threshold. E.g. consider the span 5 to 12: in the coarse grammar we have constituents like QP, NP, VP over that span. To make this explicit, what do I mean by pruning? For each item we compute its posterior probability using the inside/outside scores. When we prune, we compute the probability of the coarse symbols. If this probability is below the threshold, we can be pretty confident that there won't be such a constituent in the final parse tree. Each of those symbols corresponds to a set of symbols in the refined grammar. We can therefore safely remove all its refined versions from the chart of the refined grammar. How much does this help?
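A sketch of that pruning test, assuming inside and outside scores from the coarse pass are already available; the dictionary-based interface is a stand-in, not a real parser API.

```python
# Hedged sketch of posterior pruning with the coarse grammar. inside/outside
# map (X, i, j) -> scores from the first pass; all names are stand-ins.
def prune_chart(inside, outside, sentence_prob, threshold=1e-4):
    """Return the set of (X, i, j) items whose posterior clears the threshold."""
    keep = set()
    for (X, i, j), alpha in inside.items():
        beta = outside.get((X, i, j), 0.0)
        posterior = alpha * beta / sentence_prob     # P(X spans i..j | sentence)
        if posterior >= threshold:
            keep.add((X, i, j))
    return keep

# Toy numbers for the span 5..12: QP gets pruned, NP and VP survive.
inside  = {("QP", 5, 12): 1e-9, ("NP", 5, 12): 1e-4, ("VP", 5, 12): 2e-4}
outside = {("QP", 5, 12): 1e-3, ("NP", 5, 12): 1e-2, ("VP", 5, 12): 1e-2}
print(prune_chart(inside, outside, sentence_prob=1e-5))
```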

Pruning with A* You can also speed up the search without sacrificing optimality. For agenda-based parsers: can select which items to process first; can do with any "figure of merit" [Charniak 98]. If your figure of merit is a valid A* heuristic, no loss of optimality [Klein and Manning 03].

Projection-Based A* Example: "Factory payrolls fell in Sept.", shown with full lexicalized symbols (S:fell, VP:fell, NP:payrolls, PP:in), with its constituent projection (S, VP, NP, PP), and with its head-word projection (payrolls, fell, in).

A* Speedup Total time dominated by calculation of A* tables in each projection… O(n^3).

Results Collins 99 – 88.6 F1 (generative lexical). Charniak and Johnson 05 – 89.7 / 91.3 F1 (generative lexical / reranked). Petrov et al 06 – 90.7 F1 (generative unlexicalized). McClosky et al 06 – 92.1 F1 (gen + rerank + self-train). However, bilexical counts rarely make a difference (why?): Gildea 01 – removing bilexical counts costs < 0.5 F1.

The Game of Designing a Grammar Annotation refines base treebank symbols to improve statistical fit of the grammar: structural annotation; head lexicalization; automatic clustering?

Latent Variable Grammars So, what are latent variable grammars? The idea is that we take our observed parse trees and augment them with latent variables at each node. Each observed category, like NP, is thereby split into a set of subcategories NP-0 through NP-k. This also creates an exponentially large set of derivations. We then learn distributions over the new subcategories and these derivations. The resulting grammars have many rules for each of the original rules that we started with. The learning process determines the exact set of rules, as well as the parameter values.

Learning Latent Annotations EM algorithm: brackets are known; base categories are known; only induce subcategories. Just like Forward-Backward for HMMs (forward and backward passes over the fixed tree, e.g. latent nodes X1–X7 above "He was right .").

Refinement of the DT tag

Hierarchical refinement

Hierarchical Estimation Results As we had hoped, the hierarchical training leads to better parameter estimates, which in turn results in a 1% improvement in parsing accuracy.
  Model                  F1
  Flat Training          87.3
  Hierarchical Training  88.4

Refinement of the , tag Splitting all categories equally is wasteful: This being said, there are still some things that are broken in our model. Let’s look at another category, e.g. the , tag. Whether we apply hierarchical training or not, the result will be the same. We are generating several subcategories that are exactly the same.

Adaptive Splitting Want to split complex categories more. Idea: split everything, then roll back the splits which were least useful.
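A hedged sketch of the roll-back step, assuming some precomputed estimate of how much each split contributes to the training likelihood (the split_gain dictionary and its numbers are hypothetical).

```python
# Hedged sketch of adaptive splitting's merge step: undo the least useful half
# of the splits, ranked by an (assumed, precomputed) likelihood-gain estimate.
def select_merges(split_gain, merge_fraction=0.5):
    """split_gain: {(category, split_id): estimated gain from keeping the split}."""
    ranked = sorted(split_gain.items(), key=lambda kv: kv[1])
    n_merge = int(len(ranked) * merge_fraction)
    return [split for split, _ in ranked[:n_merge]]

gains = {("NP", 1): 2.3, (",", 1): 0.0, ("VP", 1): 1.7, ("DT", 1): 0.9}
print(select_merges(gains))   # the comma split and the DT split get rolled back
```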

Adaptive Splitting Results Here you can see how the grammar size has been dramatically reduced. The arrows connect models that have undergone the same number of split iterations and therefore have the same maximal number of subcategories. Because we have merged some of the subcategories back in, the blue model has far fewer total subcategories. The adaptive splitting allows us to do two more split iterations and further refine more complicated categories. This additional expressiveness gives us in turn another 1% improvement in parsing accuracy. We are now close to 90% with a grammar that has a total of barely more than 1000 subcategories. This grammar can allocate up to 64 subcategories, and here is where they get allocated.
  Model             F1
  Previous          88.4
  With 50% Merging  89.5

Number of Phrasal Subcategories Here is the number of subcategories for each of the 25 phrasal categories. In general, phrasal categories are split less heavily than lexical categories, a trend that one could have expected.

Number of Lexical Subcategories Or have little variance.

Learned Splits Proper nouns (NNP):
  NNP-14  Oct. Nov. Sept.
  NNP-12  John Robert James
  NNP-2   J. E. L.
  NNP-1   Bush Noriega Peters
  NNP-15  New San Wall
  NNP-3   York Francisco Street
Personal pronouns (PRP):
  PRP-0   It He I
  PRP-1   it he they
  PRP-2   them him
Before I finish, let me give you some interesting examples of what our grammars learn. These are intended to highlight some interesting observations; the full list of subcategories and more details are in the paper. The subcategories sometimes capture syntactic and sometimes semantic differences. But very often they represent syntactico-semantic relations, which are similar to those found in distributional clustering results. For example, for the proper nouns the system learns a subcategory for months, first names, last names and initials. It also learns which words typically come first and second in multi-word units. For personal pronouns there is a subcategory for accusative case and one for sentence-initial and sentence-medial nominative case.

Learned Splits Relative adverbs (RBR):
  RBR-0   further lower higher
  RBR-1   more less More
  RBR-2   earlier Earlier later
Cardinal numbers (CD):
  CD-7    one two Three
  CD-4    1989 1990 1988
  CD-11   million billion trillion
  CD-0    1 50 100
  CD-3    30 31
  CD-9    78 58 34
Relative adverbs are divided into distance, degree and time. For the cardinal numbers the system learns to distinguish between spelled-out numbers, dates and others.

Coarse-to-Fine Inference Example: PP attachment.

Hierarchical Pruning
  coarse:          … QP NP VP …
  split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
  split in four:   … QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
  split in eight:  …
Consider again the same span as before. We had decided that QP is not a valid constituent. In hierarchical pruning we do several pre-parses instead of just one. Instead of going directly to our most refined grammar, we can use a grammar where the categories have been split in two. We already said that QPs are not valid, so we won't need to consider them. We compute the posterior probability of the different NPs and VPs, and we might decide that NP1 is valid but NP2 is not. We can then go to the next, more refined grammar, where some other subcategories will get pruned, and so on. This is just an example, but it is pretty representative. The pruning allows us to keep the number of active chart items roughly constant even though we are doubling the size of the grammar.
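A sketch of the projection step this relies on: a refined item is only considered if its coarser projection survived the previous pass. The symbol naming (NP-3 etc.) and the halving projection are just one illustrative encoding.

```python
# Hedged sketch of hierarchical pruning: a refined item survives only if its
# coarser projection survived the previous pass.
def project(symbol):
    """Map a refined symbol one level up, e.g. NP-3 -> NP-1, NP-1 -> NP-0."""
    base, idx = symbol.rsplit("-", 1)
    return f"{base}-{int(idx) // 2}"

def allowed(refined_items, kept_coarse):
    """Keep (symbol, i, j) items whose coarser projection was not pruned."""
    return [(X, i, j) for (X, i, j) in refined_items
            if (project(X), i, j) in kept_coarse]

kept_coarse = {("NP-1", 5, 12), ("VP-0", 5, 12)}          # survived last pass
candidates  = [("NP-2", 5, 12), ("NP-3", 5, 12), ("QP-1", 5, 12)]
print(allowed(candidates, kept_coarse))   # only the NP-2 and NP-3 items remain
```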

Bracket Posteriors Well, actually every pass is useful. After each pass we double the number of chart items, but by pruning we are able to keep the number of valid chart items roughly constant. You can also see that some ambiguities are resolved early on, while others, like some attachment ambiguities, are not resolved until the end. At the end we have this extremely sparse chart which contains most of the probability mass. What we need to do is extract a single parse tree from it.

Parsing time: 1621 min → 111 min → 35 min → 15 min (no search error). Wow, that's fast now: 15 min without search error; remember that's at 91.2 F-score. And if you are willing to sacrifice 0.1 in accuracy, you can cut this time in half again.

Final Results (Accuracy)
                                             ≤ 40 words F1   all F1
  ENG  Charniak & Johnson '05 (generative)        90.1         89.6
       Split / Merge                              90.6
  GER  Dubey '05                                  76.3          -
       Split / Merge                              80.8         80.1
  CHN  Chiang et al. '02                          80.0         76.6
       Split / Merge                              86.3         83.4
In terms of accuracy we are better than the generative component of the Brown parser, but still worse than the reranking parser. However, we could extract n-best lists and rerank them too, if we wanted… The nice thing about our approach is that we do not require any knowledge about the target language; we just need a treebank. We therefore trained some grammars for German and Chinese, and we surpass the previous state of the art by a huge margin there, mainly because most work on parsing has been done on English and parsing techniques for other languages are still somewhat behind. Still higher numbers from reranking / self-training methods.

Parse Reranking Assume the number of parses is very small. We can represent each parse T as an arbitrary feature vector φ(T). Typically, all local rules are features; also non-local features, like how right-branching the overall tree is. [Charniak and Johnson 05] gives a rich set of features.
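A minimal sketch of linear reranking over a small k-best list; the feature names and weights below are made up for illustration.

```python
# Hedged sketch of linear reranking: score each candidate parse by the dot
# product of its sparse feature vector phi(T) with a weight vector.
def rerank(kbest, weights):
    """kbest: list of (parse, phi) pairs, phi a dict of feature counts."""
    def score(phi):
        return sum(weights.get(f, 0.0) * c for f, c in phi.items())
    return max(kbest, key=lambda pair: score(pair[1]))[0]

kbest = [("tree_A", {"rule:VP->VBD NP": 1, "right_branching_depth": 4}),
         ("tree_B", {"rule:VP->VBD NP PP": 1, "right_branching_depth": 2})]
weights = {"rule:VP->VBD NP PP": 0.8, "right_branching_depth": -0.1}
print(rerank(kbest, weights))   # -> "tree_B"
```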

1-Best Parsing

K-Best Parsing [Huang and Chiang 05, Pauls, Klein, Quirk 10]

K-Best Parsing

Dependency Parsing Lexicalized parsers can be seen as producing dependency trees. Each local binary tree corresponds to an attachment in the dependency graph (e.g. questioned → lawyer, questioned → witness, lawyer → the, witness → the).
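A small sketch of reading those attachments off a lexicalized binary tree; the (label, head_word, children) tuple format is an assumed stand-in for whatever the parser actually produces.

```python
# Hedged sketch: read dependency arcs off a lexicalized tree. A node is
# (label, head_word, children); preterminals have an empty child list.
def arcs(node, deps=None):
    if deps is None:
        deps = []
    label, head, children = node
    for child in children:
        _, child_head, _ = child
        if child_head != head:
            deps.append((head, child_head))    # head -> dependent attachment
        arcs(child, deps)
    return deps

tree = ("S", "questioned",
        [("NP", "lawyer", [("DT", "the", []), ("NN", "lawyer", [])]),
         ("VP", "questioned",
          [("VBD", "questioned", []),
           ("NP", "witness", [("DT", "the", []), ("NN", "witness", [])])])])
print(arcs(tree))
# [('questioned', 'lawyer'), ('lawyer', 'the'),
#  ('questioned', 'witness'), ('witness', 'the')]
```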

Dependency Parsing Pure dependency parsing is only cubic [Eisner 99]. Some work on non-projective dependencies: common in, e.g., Czech parsing; can do with MST algorithms [McDonald and Pereira 05].

Shift-Reduce Parsers Another way to derive a tree. Parsing: no useful dynamic programming search, but can still use beam search [Ratnaparkhi 97].
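A toy shift-reduce skeleton to show the mechanics; choose_action stands in for whatever scoring model (with beam search, in practice) picks the next move, and the action encoding is made up for illustration.

```python
# Hedged sketch of shift-reduce constituency parsing. choose_action is an
# assumed callback returning "SHIFT" or ("REDUCE", label, arity).
def shift_reduce(words, choose_action):
    stack, buf = [], list(words)
    while buf or len(stack) > 1:
        action = choose_action(stack, buf)
        if action == "SHIFT":
            stack.append(buf.pop(0))
        else:
            _, label, arity = action
            stack[-arity:] = [(label, stack[-arity:])]
    return stack[0]

# A hard-coded action sequence just to show the mechanics on "she saw her".
script = iter(["SHIFT", ("REDUCE", "NP", 1), "SHIFT", "SHIFT",
               ("REDUCE", "NP", 1), ("REDUCE", "VP", 2), ("REDUCE", "S", 2)])
print(shift_reduce(["she", "saw", "her"], lambda s, b: next(script)))
```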

Data-oriented parsing: rewrite large (possibly lexicalized) subtrees in a single step. Formally, a tree-insertion grammar. Derivational ambiguity: whether subtrees were generated atomically or compositionally. Finding the most probable parse is NP-complete.

TIG: Insertion

Tree-adjoining grammars Start with local trees. Can insert structure with adjunction operators. Mildly context-sensitive. Models long-distance dependencies naturally … as well as other weird stuff that CFGs don't capture well (e.g. cross-serial dependencies).

TAG: Long Distance

CCG Parsing Combinatory Categorial Grammar: fully (mono-) lexicalized grammar. Categories encode argument sequences. Very closely related to the lambda calculus (more later). Can have spurious ambiguities (why?).
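A tiny sketch of CCG function application, using a made-up tuple encoding (result, slash, argument) for complex categories.

```python
# Hedged sketch of CCG function application. A complex category is a tuple
# (result, slash, argument); e.g. a transitive verb (S\NP)/NP is encoded as
# (("S", "\\", "NP"), "/", "NP"). The encoding is illustrative only.
def forward_apply(fn, arg):
    """X/Y  Y  =>  X"""
    if isinstance(fn, tuple) and fn[1] == "/" and fn[2] == arg:
        return fn[0]
    return None

def backward_apply(arg, fn):
    """Y  X\\Y  =>  X"""
    if isinstance(fn, tuple) and fn[1] == "\\" and fn[2] == arg:
        return fn[0]
    return None

tv = (("S", "\\", "NP"), "/", "NP")   # transitive verb category (S\NP)/NP
vp = forward_apply(tv, "NP")          # combine with the object NP -> S\NP
print(backward_apply("NP", vp))       # combine with the subject NP -> "S"
```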