Seven Lectures on Statistical Parsing Christopher Manning LSA Linguistic Institute 2007 LSA 354 Lecture 5

A bit more on lexicalized parsing

Complexity of lexicalized PCFG parsing
- Running time is O(g³ · n⁵)!!
- [Figure: chart item A[d₂] over span (i, j), built by combining B[d₁] over (i, k) with C[d₂] over (k, j)]
- Time charged:
  - i, k, j → n³
  - A, B, C → g³ (naively, g becomes huge)
  - d₁, d₂ → n²

Complexity of exhaustive lexicalized PCFG parsing
[Figure not preserved in transcript]

Complexity of lexicalized PCFG parsing
- Work such as Collins (1997) and Charniak (1997) is O(n⁵), but uses heuristic search to be fast in practice
- Eisner and Satta (2000, etc.) have explored various ways to parse more restricted classes of bilexical grammars in O(n⁴) or O(n³) time
- Neat algorithmic stuff!!! See the example later from dependency parsing
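Where the g³n⁵ bound comes from can be read off the loop structure of naive exhaustive lexicalized CKY. Below is a minimal sketch, mine rather than anything from the lecture: the `binary_rules` interface, the chart keying, and the assumption that the parent takes its head from the right child are all simplifications (a real lexicalized grammar also conditions the rule probability on the head words at d1 and d2).

```python
# Five nested position loops over n (i, j via span, split point k, and the
# two child head positions d1, d2), times a loop over ~g^3 binary rules.
from collections import defaultdict

def lexicalized_cky_skeleton(words, binary_rules):
    """binary_rules: {(A, B, C): prob} for rules A -> B C; for this sketch
    the parent A inherits its head position from the right child C."""
    n = len(words)
    chart = defaultdict(float)   # (label, i, j, head_pos) -> best score
    # (base case omitted: seed chart[(T, i, i+1, i)] from lexical rules)
    for span in range(2, n + 1):
        for i in range(n - span + 1):                      # i: n choices
            j = i + span
            for k in range(i + 1, j):                      # k: n choices
                for d1 in range(i, k):                     # left head: n
                    for d2 in range(k, j):                 # right head: n
                        for (A, B, C), p in binary_rules.items():  # ~g^3
                            score = (p * chart[(B, i, k, d1)]
                                       * chart[(C, k, j, d2)])
                            if score > chart[(A, i, j, d2)]:
                                chart[(A, i, j, d2)] = score
    return chart
```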

Refining the node expansion probabilities
- Charniak (1997) expands each node of the phrase structure tree in a single step
- This is good for capturing dependencies between child nodes
- But it is bad because of data sparseness
- A pure dependency model, generating one child at a time, is worse
- But one can do better with in-between models, such as generating the children as a Markov process on both sides of the head (Collins 1997; Charniak 2000), as sketched below
- Cf. the accurate unlexicalized parsing discussion
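Here is a minimal sketch of that in-between, head-outward Markov option. The interface (`p_mod` and its conditioning variables) is assumed for illustration and is not Collins' or Charniak's exact parameterization; generation of the head child itself is omitted.

```python
# Given parent P with head child H, each modifier is generated one at a
# time on each side, conditioned on (parent, head, side, previous
# modifier), until a STOP symbol terminates that side.
import math

STOP = "<STOP>"

def markovized_rule_logprob(parent, head_child, left_mods, right_mods, p_mod):
    """p_mod(mod, parent, head_child, side, prev) -> probability.
    left_mods and right_mods are ordered from the head outward."""
    logp = 0.0
    for side, mods in (("L", left_mods), ("R", right_mods)):
        prev = head_child                  # first modifier sees the head child
        for mod in list(mods) + [STOP]:    # always end each side with STOP
            logp += math.log(p_mod(mod, parent, head_child, side, prev))
            prev = mod
    return logp
```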

Collins (1997, 1999); Bikel (2004)
- Collins (1999): also a generative model
- Underlying lexicalized PCFG has rules of the form P → Lₙ … L₁ H R₁ … Rₘ
- A more elaborate set of grammar transforms and factorizations to deal with data sparseness and interesting linguistic properties
- So, generate each child in turn: given P has been generated, generate H, then generate the modifying nonterminals from head-adjacent outward, with some limited conditioning
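For concreteness, here is roughly how Collins (1997) Model 1 factors a rule's probability. This is reconstructed from the paper rather than from the slide (whose equation did not survive transcription); Δ is the distance measure, and Model 2 additionally conditions modifiers on generated subcat frames.

```latex
% Roughly Collins (1997) Model 1, with L_{n+1} = R_{m+1} = STOP:
P(P \to L_n \dots L_1\, H\, R_1 \dots R_m \mid P, h)
  = P_h(H \mid P, h)
    \times \prod_{i=1}^{n+1} P_l\bigl(L_i(l_i) \mid P, H, h, \Delta_l(i-1)\bigr)
    \times \prod_{j=1}^{m+1} P_r\bigl(R_j(r_j) \mid P, H, h, \Delta_r(j-1)\bigr)
```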

Overview of Collins' Model
[Figure: parent P(t_h, w_h) generates head child H(t_h, w_h); left modifiers L₁ … L_{i−1}, L_i are generated head-outward, with L_i generated conditioning on the left subcat frame {subcat_L}]

Modifying nonterminals generated in two steps
[Figure: S(VBD–sat) expanding to NP(NNP–John) and VP(VBD–sat); the modifier is generated first as the labeled node NP(NNP), and then its head word John is filled in]

Smoothing for head words of modifying nonterminals
- [Table: back-off levels for the mod-head-word model; the conditioning contexts at each level did not survive transcription]
- Other parameter classes have similar or more elaborate back-off schemes (see the sketch below)
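As an illustration of the general recipe behind such schemes, here is a minimal sketch of deleted interpolation down a chain of back-off levels, with the λ = c / (c + 5u) weighting that Collins (1999) uses; the tuple interface is my assumption for the example.

```python
# Each level's MLE is mixed in with weight lambda = c / (c + 5u), where c
# is the count of the conditioning context and u the number of distinct
# outcomes observed with that context; leftover mass flows to coarser levels.
def backoff_prob(levels):
    """levels: (event_count, context_count, distinct_outcomes) per back-off
    level, ordered from the most specific context to the least specific."""
    estimate, remaining = 0.0, 1.0
    for c_event, c_ctx, u in levels:
        if c_ctx == 0:
            continue                       # nothing seen: back off entirely
        lam = c_ctx / (c_ctx + 5.0 * u)
        estimate += remaining * lam * (c_event / c_ctx)
        remaining *= 1.0 - lam
    return estimate    # a real model would also add a uniform floor term

# e.g. three levels, finest (conditioning includes the head word) to coarsest:
# backoff_prob([(2, 10, 6), (15, 80, 30), (300, 5000, 400)])
```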

Collins model … and linguistics
- Collins had 3 generative models: Models 1 to 3
- Especially as you work up from Model 1 to 3, significant linguistic modeling is present:
  - Distance measure: favors close attachments
  - Model is sensitive to punctuation
  - Distinguish base NP from full NP with post-modifiers
  - Coordination feature
  - Mark gapped subjects
  - Model of subcategorization; arguments vs. adjuncts
  - Slash feature/gap-threading treatment of displaced constituents
- Didn't really get clear gains from this.

Bilexical statistics: is use of the maximal context of P_Mw useful?
- Collins (1999): “Most importantly, the model has parameters corresponding to dependencies between pairs of headwords.”
- Gildea (2001) reproduced Collins' Model 1 (like the regular model, but no subcats)
- Removing the maximal back-off level from P_Mw resulted in only a 0.5% reduction in F-measure
- Gildea's experiment is somewhat unconvincing to the extent that his model's performance was lower than Collins' reported results

Choice of heads
- If not bilexical statistics, then surely the choice of heads is important to parser performance…
- Chiang and Bikel (2002): parsers performed decently even when all head rules were of the form “if parent is X, choose the left/rightmost child” (see the sketch below)
- Parsing engine in Collins Model 2–emulation mode: LR 88.55% and LP 88.80% on §00 (sentence length ≤ 40 words), compared to LR 89.9%, LP 90.1%
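The baseline really is that simple. A tiny illustration, with hypothetical rules rather than an actual head-rule table:

```python
# Head rules that only say, per parent category, whether the head is the
# leftmost or rightmost child (values here are illustrative, not a real table).
SIMPLE_HEAD_RULES = {"NP": "right", "VP": "left", "PP": "left", "S": "right"}

def find_head(parent, children, default="left"):
    direction = SIMPLE_HEAD_RULES.get(parent, default)
    return children[0] if direction == "left" else children[-1]

# find_head("NP", ["DT", "JJ", "NN"])  ->  "NN"
```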

Use of maximal context of P_Mw [Bikel 2004]
[Table: LR, LP, CBs, 0 CBs, and ≤2 CBs for the full model vs. a no-bigrams model; the numbers did not survive transcription]
Performance on §00 of the Penn Treebank, on sentences of length ≤ 40 words

Use of maximal context of P_Mw
[Table: number of accesses and percentage for each back-off level; roughly 219 million accesses in total, with the per-level figures garbled in transcription]
Number of times the parsing engine was able to deliver a probability for the various back-off levels of the mod-word generation model, P_Mw, when testing on §00 having trained on §§02–21

Bilexical statistics are used often [Bikel 2004]
- The 1.49% use of bilexical dependencies suggests they don't play much of a role in parsing
- But the parser pursues many (very) incorrect theories
- So, instead of asking how often the decoder can use bigram probability on average, ask how often it does so while pursuing its top-scoring theory
- Answer the question by having the parser constrain-parse its own output:
  - train as normal on §§02–21
  - parse §00
  - feed the resulting parse trees back in as constraints
- The percentage of time the parser made use of bigram statistics shot up to 28.8%
- So bilexical statistics are used often, but their use barely affects overall parsing accuracy
- Exploratory data analysis suggests an explanation: distributions that include head words are usually so similar to those that do not that conditioning on them makes almost no difference in terms of accuracy

Charniak (2000) NAACL: A Maximum-Entropy-Inspired Parser
- There was nothing maximum entropy about it; it was a cleverly smoothed generative model
- Smoothes estimates by smoothing ratios of conditional terms (which are a bit like maxent features) [equation not preserved in transcript]
- Biggest improvement is actually that the generative model predicts the head tag first and then does P(w|t,…), like Collins (1999)
- Markovizes rules similarly to Collins (1999)
- Gets 90.1% LP/LR F-score on sentences ≤ 40 words

Petrov and Klein (2006): Learning Latent Annotations
- Can you automatically find good symbols?
- Brackets are known, base categories are known; induce the subcategories
- Clever split/merge category refinement
- EM algorithm, like Forward-Backward for HMMs, but constrained by the tree: inside and outside scores are computed over the fixed bracketing (sketched below)
[Figure: fixed parse tree over “He was right.” with latent symbols X₁ … X₇ at the nodes, annotated with inside and outside probabilities]
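As a sketch of what “Forward-Backward constrained by the tree” means, here is a minimal inside pass over a fixed binary tree with k latent subcategories per node. The data representation is my assumption, not Petrov and Klein's code.

```python
# The tree is fixed; only the latent subcategory assignments are summed
# over, exactly like the forward pass of Forward-Backward.
import numpy as np

def inside(node, params, k):
    """node is ("cat", "word") for a leaf or ("cat", left, right) otherwise.
    params["emit"][(cat, word)] is a length-k vector;
    params["rule"][(cat, lcat, rcat)] is a (k, k, k) tensor indexed by
    (parent_sub, left_sub, right_sub)."""
    if len(node) == 2 and isinstance(node[1], str):            # leaf
        return params["emit"][(node[0], node[1])]
    cat, left, right = node
    li, ri = inside(left, params, k), inside(right, params, k)
    rule = params["rule"][(cat, left[0], right[0])]
    # sum over the children's latent subcategories:
    return np.einsum("abc,b,c->a", rule, li, ri)

# A mirrored top-down outside pass then gives posterior counts for each
# latent rule; those expected counts are the E-step statistics that the
# split/merge EM procedure maximizes over.
```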

Number of phrasal subcategories
[Figure: chart of the number of learned subcategories per phrasal category; not preserved in transcript]

POS tag splits, commonest words: effectively a class-based model
- Proper nouns (NNP):
  - NNP-14: Oct., Nov., Sept.
  - NNP-12: John, Robert, James
  - NNP-2: J., E., L.
  - NNP-1: Bush, Noriega, Peters
  - NNP-15: New, San, Wall
  - NNP-3: York, Francisco, Street
- Personal pronouns (PRP):
  - PRP-0: It, He, I
  - PRP-1: it, he, they
  - PRP-2: it, them, him

The Latest Parsing Results…
[Table: F1 on sentences ≤ 40 words and F1 on all words for: Klein & Manning unlexicalized; Matsuzaki et al. simple EM latent states; Charniak generative (“maxent inspired”); Petrov and Klein NAACL; Charniak & Johnson discriminative reranker. The scores themselves did not survive transcription]

Treebanks and linguistic theory

Treebanks
- Treebank and parsing experimentation has been dominated by the Penn Treebank
- If you parse other languages, parser performance generally heads downhill, even if you normalize for treebank size:
  - WSJ (small): 82.5%
  - Chinese TB: 75.2%
- What do we make of this? We're changing several variables at once.
- “Is it harder to parse Chinese, or the Chinese Treebank?” [Levy and Manning 2003]
- What is the basis of the structure of the Penn English Treebank anyway?

TiGer Treebank
- Annotation on the word level: part-of-speech, morphology, lemmata
- Node labels: phrase categories
- Edge labels: syntactic functions
- Crossing branches for discontinuous constituency types
[Figure: TiGer graph for “Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen” (“Next year the government wants to implement its reform plans”), with POS, morphology, and lemma annotation on each word and functions such as SB, OA, HD, NK on the edges]

Prague Dependency Bank
- Annotation on the word level: lemmata, morphology
- Dependency structure with syntactic functions
- Semantic information on constituent roles, theme/rheme, etc.
[Figure: dependency analysis of “Kdo chce investovat sto korun do automobilu” (“Who wants to invest a hundred crowns in a car”), with syntactic functions such as Sb, Obj, Atr, AuxP, Adv and tectogrammatical labels such as ACT, PAT, DIR, RESTR]

Treebanks
- A good treebank has an extensive manual for each stage of annotation
- Consistency is often at least as important as being right
- But is what's in these treebanks right?
- Tokenization: has n't, I 'll (contractions are split; a rough sketch follows)
- Hyphenated terms:
  - cancer-causing/JJ asbestos/NN
  - the back-on-terra-firma/JJ toast/NN
  - the/DT nerd-and-geek/JJ club/NN
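For the tokenization point, here is a rough sketch of the contraction splitting, heavily simplified (the real PTB tokenizer has many more rules):

```python
# Split PTB-style contractions: "hasn't" -> "has n't", "I'll" -> "I 'll".
import re

def ptb_split_contractions(text):
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)                # has n't, do n't
    text = re.sub(r"(\w)'(ll|re|ve|s|m|d)\b", r"\1 '\2", text)  # I 'll, we 're
    return text

# ptb_split_contractions("I'll say it hasn't worked")
# -> "I 'll say it has n't worked"
```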

Treebank: POS
- Some POS tagging errors reflect not only human inconsistency but problems in the definition of POS tags, suggestive of clines/blends
- Example: near
  - In Middle English, an adjective
  - Today, is it an adjective or a preposition?
    - The near side of the moon
    - We were near the station
  - Not just a word with multiple parts of speech!
  - Evidence of blending: I was nearer the bus stop than the train

Criteria for Part of Speech
- In some cases functional/notional tagging dominates structure in the Penn Treebank, even against explicit instructions to the contrary
- worth: 114 instances
  - 10 tagged IN (8 placed in ADJP!)
  - 65 tagged JJ (48 in ADJP, 13 in PP, 4 NN/NP errors)
  - 39 tagged NN (2 IN/JJ errors)
- Linguist hat on: I tend to agree with the IN choice (when not a noun): tagging accuracy is only 41% for worth!

Where a rule was followed: “marginal prepositions”
- Fowler (1926): “there is a continual change going on by which certain participles or adjectives acquire the character of prepositions or adverbs, no longer needing the prop of a noun to cling to”
- It is easy to have no tagging ambiguity in such cases. Penn Treebank (Santorini 1991): “Putative prepositions ending in -ed or -ing should be tagged as past participles (VBN) or gerunds (VBG), respectively, not as prepositions (IN).”
  - According/VBG to reliable sources
  - Concerning/VBG your request of last week

Preposition IN ⇄ Verb (VBG)
- But this makes no real sense; rather, we see “a development caught in the act” (Fowler 1926):
  - They moved slowly, toward the main gate, following the wall
  - Repeat the instructions following the asterisk
  - This continued most of the week following that ill-starred trip to church
  - He bled profusely following circumcision
  - Following a telephone call, a little earlier, Winter had said …
- IN: during [cf. endure], pending, notwithstanding
- ??: concerning, excepting, considering, …