Albert Gatt Corpora and Statistical Methods Lecture 11

Statistical parsing Part 2

Preliminary issues How parsers are evaluated

Evaluation
The issue: what objective criterion are we trying to maximise? That is, under what objective function can I say that my parser does "well" (and how well)? We need a gold standard.
Possibilities:
- strict match of the candidate parse against the gold standard
- match of components of the candidate parse against gold standard components

Evaluation
A classic evaluation metric is PARSEVAL:
- an initiative to compare parsers on the same data
- not initially concerned with stochastic parsers
- evaluates parser output piece by piece
Main components: PARSEVAL compares the gold standard tree (typically the tree in a treebank) to the parser's tree and computes:
- precision
- recall
- crossing brackets

PARSEVAL: labelled recall
A correct node is a node in the candidate parse which:
- has the same node label as the gold-standard node (label matching was originally omitted from PARSEVAL to avoid theoretical conflicts)
- spans the same words
Labelled recall is the proportion of nodes in the gold-standard tree that are correctly found (same label, same span) in the candidate parse.

PARSEVAL: labeled precision The proportion of correctly labelled and correctly spanning nodes in the candidate.
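Putting the two definitions together, here is a minimal sketch (not from the lecture; the tuple representation and function name are illustrative assumptions) of how labelled precision, recall and F1 could be computed for one sentence, treating each constituent as a (label, start, end) tuple and ignoring duplicate constituents, which is a simplification of the official evalb scorer:

# Minimal sketch: labelled PARSEVAL scores for a single sentence.
# Constituents are (label, start, end) tuples; duplicates are ignored.
def parseval_scores(candidate, gold):
    cand, gold = set(candidate), set(gold)
    correct = len(cand & gold)                    # same label, same span
    precision = correct / len(cand) if cand else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Gold: (S (NP the flight) (VP includes (NP a meal))); candidate mis-spans one NP.
gold = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)}
cand = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 4, 5)}
print(parseval_scores(cand, gold))                # (0.75, 0.75, 0.75)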

Combining Precision and Recall
As usual, precision and recall can be combined into a single F-measure:
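The formula itself appears only as an image in the original slides; the standard (balanced) F-measure it refers to is:

F_1 = \frac{2PR}{P + R}

(The weighted version is F_\beta = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}; setting \beta = 1 gives equal weight to precision P and recall R.)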

PARSEVAL: crossed brackets
The number of brackets in the candidate parse which cross brackets in the treebank parse, e.g. the treebank has ((X Y) Z) and the candidate has (X (Y Z)). Unlike precision/recall, this is an objective function to minimise.

Current performance
Current parsers achieve:
- ca. 90% precision
- >90% recall
- 1% cross-bracketed constituents

Some issues with PARSEVAL
1. These measures evaluate parses at the level of individual decisions (nodes); they ignore the difficulty of getting a globally correct solution by carrying out a correct sequence of decisions.
2. Success on crossing brackets depends on the kind of parse trees used: the Penn Treebank has very flat trees (not much embedding), so the likelihood of crossed brackets decreases.
3. In PARSEVAL, if a constituent is attached lower in the tree than in the gold standard, all its daughters are counted wrong.

Probabilistic parsing with PCFGs The basic algorithm

The basic PCFG parsing algorithm
Many statistical parsers use a version of the CYK algorithm.
Assumptions:
- CFG productions are in Chomsky Normal Form: A → B C or A → a
- indices are used between words: (0) Book (1) the (2) flight (3) through (4) Houston (5)
Procedure (bottom-up):
- traverse the input sentence left-to-right
- use a chart to store constituents and their span, together with their probability

Probabilistic CYK: example PCFG
S → NP VP [.80]
NP → Det N [.30]
VP → V NP [.20]
V → includes [.05]
Det → the [.4]
Det → a [.4]
N → meal [.01]
N → flight [.02]

Probabilistic CYK: initialisation
The flight includes a meal.

// Lexical lookup
for j = 1 to length(string) do:
    chart[j-1, j] := {X : X → word_j in G}
    // Syntactic lookup
    for i = j-2 down to 0 do:
        chart[i, j] := {}
        for k = i+1 to j-1 do:
            for each A → B C in G do:
                if B in chart[i, k] and C in chart[k, j]:
                    chart[i, j] := chart[i, j] ∪ {A}

Probabilistic CYK: lexical step
The flight includes a meal.
Lexical lookup (as in the algorithm above) for "The": chart[0,1] = {Det: .4}

Probabilistic CYK: lexical step
The flight includes a meal.
Lexical lookup for "flight": chart[1,2] = {N: .02}

Probabilistic CYK: syntactic step
The flight includes a meal.
Syntactic lookup: NP → Det N applies over chart[0,1] and chart[1,2], so chart[0,2] = {NP: .30 × .4 × .02 = .0024}

Probabilistic CYK: lexical step
The flight includes a meal.
Lexical lookup for "includes": chart[2,3] = {V: .05}

Probabilistic CYK: lexical step
The flight includes a meal.
Lexical lookup for "a": chart[3,4] = {Det: .4}

Probabilistic CYK: lexical step
The flight includes a meal.
Lexical lookup for "meal": chart[4,5] = {N: .01}

Probabilistic CYK: syntactic step
The flight includes a meal.
Syntactic lookup: NP → Det N applies over chart[3,4] and chart[4,5], so chart[3,5] = {NP: .30 × .4 × .01 = .0012}

Probabilistic CYK: syntactic step
The flight includes a meal.
Syntactic lookup: VP → V NP applies over chart[2,3] and chart[3,5], so chart[2,5] = {VP: .20 × .05 × .0012 = .000012}

Probabilistic CYK: syntactic step
The flight includes a meal.
Syntactic lookup: S → NP VP applies over chart[0,2] and chart[2,5], so chart[0,5] = {S: .80 × .0024 × .000012 ≈ 2.3 × 10⁻⁸}, completing the parse of the whole sentence.

Probabilistic CYK: summary
- Cells in the chart hold probabilities.
- The bottom-up procedure computes the probability of a parse incrementally.
- To obtain parse trees, cells need to be augmented with backpointers (see the sketch below).
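As a rough sketch (not the lecture's own code), the example grammar and the pseudocode above can be turned into a small probabilistic CYK parser in Python; the data structures and the function name are illustrative choices, and each chart cell stores, for every non-terminal, its best probability plus a backpointer:

# Minimal probabilistic CYK sketch, using the example grammar above.
# chart[(i, j)] maps each non-terminal to (best probability, backpointer).
lexical = {                          # A -> word rules, with probabilities
    "the": [("Det", 0.4)], "a": [("Det", 0.4)],
    "flight": [("N", 0.02)], "meal": [("N", 0.01)],
    "includes": [("V", 0.05)],
}
binary = [                           # A -> B C rules (Chomsky Normal Form)
    ("S", "NP", "VP", 0.80),
    ("NP", "Det", "N", 0.30),
    ("VP", "V", "NP", 0.20),
]

def pcyk(words):
    chart = {}
    n = len(words)
    for j in range(1, n + 1):
        # lexical lookup: fill the cell spanning word j
        chart[(j - 1, j)] = {A: (p, words[j - 1])
                             for A, p in lexical.get(words[j - 1], [])}
        # syntactic lookup: combine smaller spans bottom-up
        for i in range(j - 2, -1, -1):
            cell = chart.setdefault((i, j), {})
            for k in range(i + 1, j):
                for A, B, C, p in binary:
                    if B in chart[(i, k)] and C in chart[(k, j)]:
                        prob = p * chart[(i, k)][B][0] * chart[(k, j)][C][0]
                        if prob > cell.get(A, (0.0, None))[0]:
                            cell[A] = (prob, (k, B, C))   # backpointer
    return chart

chart = pcyk("the flight includes a meal".split())
print(chart[(0, 5)]["S"][0])   # ~2.3e-08, matching the hand-computed value

Because each cell keeps only the highest-probability entry per non-terminal, following the backpointers from chart[(0, n)]["S"] reconstructs the Viterbi (most probable) parse.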

Probabilistic parsing with lexicalised PCFGs
Main approaches: focus on Collins (1997, 1999); see also Charniak (1997).

Unlexicalised PCFG estimation
Charniak (1996) used Penn Treebank POS and phrasal categories to induce a maximum likelihood PCFG:
- only the relative frequency of local trees was used as the estimate for rule probabilities
- no smoothing or other techniques were applied (a sketch of the estimation step follows)
This works surprisingly well: 80.4% recall and 78.8% precision (crossed brackets not estimated). It suggests that most parsing decisions are mundane and can be handled well by an unlexicalised PCFG.
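As a minimal sketch of the relative-frequency idea (not Charniak's code; the input format, a list of (lhs, rhs) local trees read off a treebank, is an assumption made for illustration), maximum likelihood rule probabilities can be read directly off treebank counts:

# Minimal sketch of relative-frequency (maximum likelihood) PCFG estimation.
from collections import Counter

def estimate_pcfg(local_trees):
    rule_counts = Counter(local_trees)
    lhs_counts = Counter(lhs for lhs, _ in local_trees)
    # P(lhs -> rhs) = count(lhs -> rhs) / count(lhs), with no smoothing
    return {(lhs, rhs): count / lhs_counts[lhs]
            for (lhs, rhs), count in rule_counts.items()}

local_trees = [("NP", ("Det", "N")), ("NP", ("Det", "N")), ("NP", ("NP", "PP"))]
print(estimate_pcfg(local_trees)[("NP", ("Det", "N"))])   # 0.666...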

Probabilistic lexicalised PCFGs
Standard format of lexicalised rules:
- associate the head word with each non-terminal, e.g. for "dumped sacks into ...": VP(dumped) → VBD(dumped) NP(sacks) PP(into)
- associate the head tag with each non-terminal as well: VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)
Types of rules:
- lexical rules expand pre-terminals to words, e.g. NNS(sacks,NNS) → sacks; their probability is always 1
- internal rules expand non-terminals, e.g. VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)

Estimating probabilities
Non-generative model: take an MLE estimate of the probability of an entire rule. Non-generative models suffer from serious data sparseness problems.
Generative model: estimate the probability of a rule by breaking it up into sub-rules.

Collins Model 1
Main idea: represent CFG rules as expansions into head + left modifiers + right modifiers:
LHS → L_n ... L_1 H R_1 ... R_m
Each L_i / R_i is of the form L/R(word, tag), e.g. NP(sacks,NNS). STOP is a special symbol indicating the left/right boundary.
Parsing: given the LHS, generate the head of the rule, then the left modifiers (until STOP) and the right modifiers (until STOP), inside-out. Each step has a probability (the decomposition is written out below).
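Written out in the standard notation (a reconstruction; the formula itself is not in this transcript), with hw and ht the head word and head tag of the LHS and L_{n+1} = R_{m+1} = STOP, the rule probability decomposes as:

P\big(LHS(hw,ht) \rightarrow L_n \dots L_1\, H\, R_1 \dots R_m\big)
    = P_H(H \mid LHS, hw, ht)
      \times \prod_{i=1}^{n+1} P_L(L_i \mid LHS, H, hw, ht)
      \times \prod_{j=1}^{m+1} P_R(R_j \mid LHS, H, hw, ht)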

Collins Model 1: example
VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)
1. Generate the head H(hw,ht).
2. Generate the left modifiers (until STOP).
3. Generate the right modifiers (until STOP).
4. Total probability: the multiplication of (1)-(3).
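The individual probability terms appear only as formulas in the original slides; instantiating the decomposition given earlier for this rule (a reconstruction, not copied from the slides) gives:

P_H(VBD \mid VP, dumped, VBD)
    \times P_L(STOP \mid VP, VBD, dumped, VBD)
    \times P_R(NP(sacks,NNS) \mid VP, VBD, dumped, VBD)
    \times P_R(PP(into,IN) \mid VP, VBD, dumped, VBD)
    \times P_R(STOP \mid VP, VBD, dumped, VBD)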

Variations on Model 1: distance
Collins proposed to extend the rules by conditioning on the distance of modifiers from the head: a function of the yield of the modifiers seen so far. For example, the distance used for the R_2 probability is a function of the words spanned by R_1.

Using a distance function
The simplest kind of distance function is a tuple of binary features:
- Is the string of length 0?
- Does the string contain a verb?
- ...
Example uses (see the sketch below):
- if the string has length 0, P_R should be higher: English is right-branching and most right modifiers are adjacent to the head verb
- if the string contains a verb, P_R should be higher: this accounts for the preference to attach dependencies to the main verb
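A minimal sketch of such a two-feature distance function (the verb tag set and function name are illustrative assumptions, not from the lecture):

# Binary distance features of the kind described above.
VERB_TAGS = {"VB", "VBD", "VBZ", "VBP", "VBG", "VBN", "MD"}

def distance_features(intervening_tags):
    # intervening_tags: POS tags of the material generated so far between
    # the head and the modifier about to be generated
    return (len(intervening_tags) == 0,                       # adjacent to the head?
            any(t in VERB_TAGS for t in intervening_tags))    # contains a verb?

print(distance_features([]))              # (True, False)
print(distance_features(["DT", "VBD"]))   # (False, True)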

Further additions
- Collins Model 2: subcategorisation preferences; distinction between complements and adjuncts.
- Model 3: augmented to deal with long-distance (WH) dependencies.

Smoothing and backoff
Rules may condition on words that never occur in the training data. Collins used a 3-level backoff model, combined using linear interpolation (see the sketch below):
1. use the head word
2. use the head tag
3. use the parent only
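A sketch of the interpolation scheme in the usual notation (a reconstruction, not copied from the slides): if e_1, e_2, e_3 are the estimates at the three backoff levels, from most to least specific, they are combined as

\hat{e} = \lambda_1 e_1 + (1 - \lambda_1)\big(\lambda_2 e_2 + (1 - \lambda_2)\, e_3\big)

where the \lambda weights reflect how much training data supports each level.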

Other parsing approaches

Data-oriented parsing
An alternative to "grammar-based" models:
- does not attempt to derive a grammar from a treebank
- treebank data is stored as fragments of trees
- the parser uses whichever trees seem to be useful

Data-oriented parsing
Suppose we want to parse Sue heard Jim. The corpus contains potentially useful tree fragments (shown graphically on the original slide), which the parser can combine to give a parse.

Data-oriented parsing
There are multiple fundamentally distinct derivations of a single tree. Parsing uses Monte Carlo simulation methods:
- randomly produce a large sample of derivations
- use these to find the most probable parse
Disadvantage: very large samples are needed to make parses accurate, so the method is potentially slow.

Data-oriented parsing vs. PCFGs
Possible advantages:
- using partial trees directly accounts for lexical dependencies
- it also accounts for multi-word expressions and idioms (e.g. take advantage of)
- while PCFG rules only represent trees of depth 1, DOP fragments can represent trees of arbitrary depth
Similarities to PCFGs:
- tree fragments can be equivalent to PCFG rules (a depth-1 fragment is a PCFG rule)
- the probabilities estimated for grammar rules are exactly the same as for the corresponding tree fragments

History Based Grammars (HBG)
General idea: any derivational step can be influenced by any earlier derivational step (Black et al. 1993). The probability of expanding the current node is conditioned on all previous nodes along the path from the root.

History Based Grammars (HBG)
Black et al. lexicalise their grammar: every phrasal node inherits two words,
- its lexical head H1
- a secondary head H2, deemed to be useful; e.g. the PP "in the bank" might have H1 = in and H2 = bank
Every non-terminal is also assigned:
- a syntactic category (Syn), e.g. PP
- a semantic category (Sem), e.g. with-Data
They also use an index I that indicates which child of the parent node is being expanded.

HBG example (Black et al. 1993)

History Based Grammars (HBG)
Estimation of the probability of a rule R. The model estimates the probability of:
- the current rule R being applied
- its Syn and Sem category
- its heads H1 and H2
conditioned on:
- the Syn and Sem of the parent node
- the rule that gave rise to the parent
- the index of this child relative to the parent
- the heads H1 and H2 of the parent
(See the formula sketch below.)
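In symbols (a reconstruction in the notation used above, not copied from the slides), with the subscript p marking the parent node and I_{pc} the index of the child being expanded, each expansion contributes a factor of roughly:

P\big(Syn, Sem, R, H1, H2 \mid Syn_p, Sem_p, R_p, I_{pc}, H1_p, H2_p\big)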

Summary
This concludes our overview of statistical parsing. We have looked at three important models, and also considered basic search techniques and algorithms.