Semi-Supervised Approaches for Learning to Parse Natural Languages
Slides are from Rebecca Hwa, Ray Mooney.

Presentation transcript:

1 Semi-Supervised Approaches for Learning to Parse Natural Languages Slides are from Rebecca Hwa, Ray Mooney

2 The Role of Parsing in Language Applications…
As a stand-alone application
–Grammar checking
As a pre-processing step
–Question Answering
–Information extraction
As an integral part of a model
–Speech Recognition
–Machine Translation

3 Parsing
Parsers provide syntactic analyses of sentences.
Input: “I saw her” → [S [NP [PN I]] [VP [VB saw] [NP [PN her]]]]

4 Challenges in Building Parsers
Disambiguation
–Lexical disambiguation
–Structural disambiguation
Rule Exceptions
–Many lexical dependencies
Manual Grammar Construction
–Limited coverage
–Difficult to maintain

5 Meeting these Challenges: Statistical Parsing
Disambiguation?
–Resolve local ambiguities with global likelihood
Rule Exceptions?
–Lexicalized representation
Manual Grammar Construction?
–Automatic induction from large corpora
–A new challenge: how to obtain training corpora?
–Make better use of unlabeled data with machine learning techniques and linguistic knowledge

6 Roadmap
Parsing as a learning problem
Semi-supervised approaches
–Sample selection
–Co-training
–Corrected Co-training
Conclusion and further directions

7 Parsing Ambiguities
Input: “I saw her duck with a telescope”
(Slide shows two parse trees, T1 and T2, which differ in whether the PP “with a telescope” attaches to the verb “saw” or to the noun “duck”.)

8 Disambiguation with Statistical Parsing
W = “I saw her duck with a telescope”
(Slide shows the same two parse trees, T1 and T2, whose probabilities are now to be compared.)

9 A Statistical Parsing Model
Probabilistic Context-Free Grammar (PCFG)
–Associate probabilities with production rules
–Likelihood of the parse is computed from the rules used
–Learn rule probabilities from training data
Example of PCFG rules:
NP → DET N
NP → PN (0.3)
DET → a (0.5)
DET → an (0.1)
DET → the (0.4)
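To make the representation concrete, here is a minimal sketch in Python of the example rules above as a rule table whose probabilities sum to 1 per left-hand side. The slide gives no probability for NP → DET N, so the 0.7 below is an assumed value chosen only so that the NP rules normalize.

```python
# A PCFG as a map from LHS non-terminal to a list of (RHS, probability) pairs.
# Probabilities of all rules with the same LHS should sum to 1.
# Rules follow the slide's example; the 0.7 for NP -> DET N is an assumed value.
pcfg = {
    "NP":  [(("DET", "N"), 0.7), (("PN",), 0.3)],
    "DET": [(("a",), 0.5), (("an",), 0.1), (("the",), 0.4)],
}

def check_normalized(grammar, tol=1e-9):
    """Verify that each non-terminal's rule probabilities sum to 1."""
    for lhs, rules in grammar.items():
        total = sum(p for _, p in rules)
        assert abs(total - 1.0) < tol, f"{lhs} rules sum to {total}"

check_normalized(pcfg)
print(pcfg["DET"])
```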

10 Sentence Probability
Assume productions for each node are chosen independently.
Probability of a derivation is the product of the probabilities of its productions.
P(D1) = 0.1 x 0.5 x 0.5 x 0.6 x 0.6 x 0.5 x 0.3 x 1.0 x 0.2 x 0.2 x 0.5 x 0.8 = 0.0000216
(Slide shows the tree for derivation D1 of “book the flight through Houston”, in which the PP “through Houston” attaches to the Nominal “flight”.)

11 Syntactic Disambiguation
Resolve ambiguity by picking the most probable parse tree.
P(D2) = 0.1 x 0.3 x 0.5 x 0.6 x 0.5 x 0.6 x 0.3 x 1.0 x 0.5 x 0.2 x 0.2 x 0.8 = 0.00001296
(Slide shows the tree for derivation D2, in which the PP “through Houston” instead attaches to the VP.)

12 Sentence Probability
Probability of a sentence is the sum of the probabilities of all of its derivations.
P("book the flight through Houston") = P(D1) + P(D2) = 0.0000216 + 0.00001296 = 0.00003456
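Since the slides list the individual rule probabilities, the derivation and sentence probabilities can be checked with a few lines of Python. This is only a sketch of the arithmetic; each derivation is represented simply as the list of rule probabilities it uses, and running it reproduces the 0.0000216, 0.00001296, and 0.00003456 figures above.

```python
from math import prod

# Rule probabilities used by each derivation of "book the flight through Houston",
# as listed on the slides (D1 attaches the PP to the Nominal, D2 to the VP).
d1_rule_probs = [0.1, 0.5, 0.5, 0.6, 0.6, 0.5, 0.3, 1.0, 0.2, 0.2, 0.5, 0.8]
d2_rule_probs = [0.1, 0.3, 0.5, 0.6, 0.5, 0.6, 0.3, 1.0, 0.5, 0.2, 0.2, 0.8]

def derivation_prob(rule_probs):
    """Productions are chosen independently, so P(D) is the product of its rule probabilities."""
    return prod(rule_probs)

p_d1 = derivation_prob(d1_rule_probs)   # 0.0000216
p_d2 = derivation_prob(d2_rule_probs)   # 0.00001296
p_sentence = p_d1 + p_d2                # sum over all derivations: 0.00003456
print(p_d1, p_d2, p_sentence)
```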

13 PCFG: Supervised Training
If parse trees are provided for training sentences, a grammar and its parameters can all be estimated directly from counts accumulated from the treebank (with appropriate smoothing).
(Slide shows a treebank of parses such as “John put the dog in the pen” feeding supervised PCFG training, which yields an English grammar with rules like: S → NP VP, S → VP, NP → Det A N, NP → NP PP, NP → PropN, A → ε, A → Adj A, PP → Prep NP, VP → V NP, VP → VP PP.)

Estimating Production Probabilities
The set of production rules can be taken directly from the set of rewrites in the treebank.
Parameters can be directly estimated from frequency counts in the treebank.
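Counting each rewrite and normalizing by its left-hand-side count gives the relative-frequency estimate P(A → β) = Count(A → β) / Count(A). A minimal sketch, assuming trees are given as nested (label, children) structures; the two toy trees below are illustrative, not taken from a real treebank.

```python
from collections import Counter

def rules(tree):
    """Yield (LHS, RHS) productions from a tree of the form (label, [children]);
    leaves are plain strings (words)."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for c in children:
        if not isinstance(c, str):
            yield from rules(c)

def estimate_pcfg(treebank):
    """Relative-frequency (maximum likelihood) estimates from treebank counts."""
    rule_counts = Counter()
    lhs_counts = Counter()
    for tree in treebank:
        for lhs, rhs in rules(tree):
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {(lhs, rhs): c / lhs_counts[lhs] for (lhs, rhs), c in rule_counts.items()}

# Two toy trees, for illustration only.
toy_treebank = [
    ("S", [("NP", [("PN", ["I"])]),
           ("VP", [("VB", ["saw"]), ("NP", [("PN", ["her"])])])]),
    ("S", [("NP", [("PN", ["she"])]),
           ("VP", [("VB", ["left"])])]),
]
for rule, p in estimate_pcfg(toy_treebank).items():
    print(rule, round(p, 2))
```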

15 Vanilla PCFG Limitations Since probabilities of productions do not rely on specific words or concepts, only general structural disambiguation is possible (e.g. prefer to attach PPs to Nominals). Consequently, vanilla PCFGs cannot resolve syntactic ambiguities that require semantics to resolve, e.g. ate with fork vs. meatballs. In order to work well, PCFGs must be lexicalized, i.e. productions must be specialized to specific words by including their head-word in their LHS non-terminals (e.g. VP-ate).

16 Example of Importance of Lexicalization
A general preference for attaching PPs to NPs rather than VPs can be learned by a vanilla PCFG. But the desired preference can depend on specific words.
(Slide shows the English PCFG parser and the parse of “John put the dog in the pen” in which the PP “in the pen” attaches to the VP.)

17 Example of Importance of Lexicalization
A general preference for attaching PPs to NPs rather than VPs can be learned by a vanilla PCFG. But the desired preference can depend on specific words.
(Slide shows the parse of “John put the dog in the pen” marked with an X, in which the PP instead attaches to the NP “the dog”.)

Head Words
Syntactic phrases usually have a word in them that is most “central” to the phrase. Linguists have defined the concept of a lexical head of a phrase.
Simple rules can identify the head of any phrase by percolating head words up the parse tree.
–Head of a VP is the main verb
–Head of an NP is the main noun
–Head of a PP is the preposition
–Head of a sentence is the head of its VP
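The percolation idea can be sketched as a small table mapping each phrase label to the child categories that can supply its head. The table below is a simplification based only on the rules listed above; real head-finder tables (for example, the ones used by Collins) are considerably more detailed.

```python
# Which child category supplies the head, per phrase type (simplified from the slide).
HEAD_RULES = {
    "S":  ["VP"],           # head of a sentence is the head of its VP
    "VP": ["VBD", "VB"],    # head of a VP is the main verb
    "NP": ["NN", "NNP"],    # head of an NP is the main noun
    "PP": ["IN"],           # head of a PP is the preposition
}

def find_head(label, children):
    """Return the head word of a phrase by percolating heads up the tree.
    `children` is a list of (label, head_word) pairs already computed for the daughters."""
    for wanted in HEAD_RULES.get(label, []):
        for child_label, head_word in children:
            if child_label == wanted:
                return head_word
    # Fallback: take the head of the rightmost child.
    return children[-1][1]

# Example: an NP over DT "the" and NN "dog" percolates "dog" as its head.
print(find_head("NP", [("DT", "the"), ("NN", "dog")]))  # -> dog
```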

Lexicalized Productions
Specialized productions can be generated by including each non-terminal’s head word and its POS as part of that non-terminal’s symbol.
(Slide shows the lexicalized tree for “John liked the dog in the pen”; the NP attachment of the PP gives the production Nominal[dog-NN] → Nominal[dog-NN] PP[in-IN].)

Lexicalized Productions
(Slide shows the lexicalized tree for “John put the dog in the pen”; here the PP attaches to the VP, giving the production VP[put-VBD] → VP[put-VBD] PP[in-IN].)

Parameterizing Lexicalized Productions Accurately estimating parameters on such a large number of very specialized productions could require enormous amounts of treebank data. Need some way of estimating parameters for lexicalized productions that makes reasonable independence assumptions so that accurate probabilities for very specific rules can be learned.

Collins’ Parser
Collins’ (1999) parser assumes a simple generative model of lexicalized productions.
Models productions based on context to the left and the right of the head daughter:
–LHS → L_n L_{n-1} … L_1 H R_1 … R_{m-1} R_m
First generate the head (H) and then repeatedly generate left (L_i) and right (R_i) context symbols until the symbol STOP is generated.

Sample Production Generation
VP[put-VBD] → VBD[put-VBD] NP[dog-NN] PP[in-IN]
Generated as: L1 = STOP, H = VBD[put-VBD], R1 = NP[dog-NN], R2 = PP[in-IN], R3 = STOP
P(VP[put-VBD] → VBD[put-VBD] NP[dog-NN] PP[in-IN]) =
P_L(STOP | VP[put-VBD]) * P_H(VBD | VP[put-VBD]) * P_R(NP[dog-NN] | VP[put-VBD]) * P_R(PP[in-IN] | VP[put-VBD]) * P_R(STOP | VP[put-VBD])
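The product above reads as one head event plus independent left/right context events, each side terminated by STOP. A minimal sketch of that scoring, assuming the conditional probability tables P_H, P_L, and P_R are supplied as dictionaries; the parameter values below are made up for illustration.

```python
STOP = "STOP"

def production_prob(parent, head, left, right, P_H, P_L, P_R):
    """Probability of a lexicalized production under the Collins-style generative story:
    generate the head daughter, then each left/right context symbol, then STOP on each side."""
    p = P_H.get((head, parent), 0.0)
    for sym in list(left) + [STOP]:
        p *= P_L.get((sym, parent), 0.0)
    for sym in list(right) + [STOP]:
        p *= P_R.get((sym, parent), 0.0)
    return p

# Illustrative (made-up) values for VP[put-VBD] -> VBD[put-VBD] NP[dog-NN] PP[in-IN]
P_H = {("VBD[put-VBD]", "VP[put-VBD]"): 0.9}
P_L = {("STOP", "VP[put-VBD]"): 1.0}
P_R = {("NP[dog-NN]", "VP[put-VBD]"): 0.3,
       ("PP[in-IN]", "VP[put-VBD]"): 0.2,
       ("STOP", "VP[put-VBD]"): 0.4}

print(production_prob("VP[put-VBD]", "VBD[put-VBD]",
                      left=[], right=["NP[dog-NN]", "PP[in-IN]"],
                      P_H=P_H, P_L=P_L, P_R=P_R))
```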

Estimating Production Generation Parameters
Estimate P_H, P_L, and P_R parameters from treebank data.
P_R(PP[in-IN] | VP[put-VBD]) = Count(PP[in-IN] right of the head in a VP[put-VBD] production) / Count(any symbol right of the head in a VP[put-VBD] production)
P_R(NP[dog-NN] | VP[put-VBD]) = Count(NP[dog-NN] right of the head in a VP[put-VBD] production) / Count(any symbol right of the head in a VP[put-VBD] production)
Smooth estimates by linearly interpolating with simpler models conditioned on just the POS tag or no lexical info:
smP_R(PP[in-IN] | VP[put-VBD]) = λ1 P_R(PP[in-IN] | VP[put-VBD]) + (1 − λ1) (λ2 P_R(PP[in-IN] | VP[VBD]) + (1 − λ2) P_R(PP[in-IN] | VP))
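The smoothed estimate is a two-level linear interpolation of the fully lexicalized estimate with two backed-off estimates. A minimal sketch; the fixed λ weights below are illustrative assumptions (in practice they are themselves set from counts).

```python
def smoothed_P_R(p_lex, p_pos, p_unlex, lam1=0.7, lam2=0.6):
    """Linearly interpolate three estimates of P_R(symbol | parent):
    p_lex   conditioned on the fully lexicalized parent, e.g. VP[put-VBD]
    p_pos   conditioned on the parent with only its head POS, e.g. VP[VBD]
    p_unlex conditioned on the bare parent category, e.g. VP
    lam1/lam2 are illustrative fixed weights, not the values a real parser would use."""
    return lam1 * p_lex + (1 - lam1) * (lam2 * p_pos + (1 - lam2) * p_unlex)

# e.g. P_R(PP[in-IN] | VP[put-VBD]) with made-up component estimates
print(smoothed_P_R(p_lex=0.1, p_pos=0.15, p_unlex=0.2))
```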

25 Parsing Evaluation Metrics
PARSEVAL metrics measure the fraction of the constituents that match between the computed and human parse trees.
If P is the system’s parse tree and T is the human parse tree (the “gold standard”):
–Recall = (# correct constituents in P) / (# constituents in T)
–Precision = (# correct constituents in P) / (# constituents in P)
Labeled precision and labeled recall require getting the non-terminal label on the constituent node correct to count as correct.
F1 is the harmonic mean of precision and recall.
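Treating each parse as a set of labeled constituents turns the PARSEVAL computation into a set intersection. A minimal sketch, assuming constituents are represented as (label, start, end) span triples.

```python
def parseval(predicted, gold):
    """Labeled precision, recall, and F1 over constituents, each a (label, start, end) triple."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy example: the system gets 2 of the 3 gold constituents right and adds one spurious NP.
gold = [("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5)]
pred = [("S", 0, 5), ("NP", 0, 1), ("NP", 3, 5)]
print(parseval(pred, gold))  # approximately (0.667, 0.667, 0.667)
```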

26 Treebank Results Results of current state-of-the-art systems on the English Penn WSJ treebank are slightly greater than 90% labeled precision and recall.

27 Supervised Learning Avoids Manual Construction
Training examples are pairs of problems and answers.
Training examples for parsing: a collection of (sentence, parse tree) pairs (a treebank)
–From the treebank, get maximum likelihood estimates for the parsing model
New challenge: treebanks are difficult to obtain
–Needs human experts
–Takes years to complete

28 Learning to Classify vs. Learning to Parse
Learning to classify: train a model to decide whether a prepositional phrase should modify the verb before it or the noun. Training examples:
–(v, saw, duck, with, telescope)
–(n, saw, duck, with, feathers)
–(v, saw, stars, with, telescope)
–(n, saw, stars, with, Oscars)
–…
Learning to parse: train a model to decide what the most likely parse is for a sentence W. Training examples are treebank trees:
–[S [NP-SBJ [NNP Ford] [NNP Motor] [NNP Co.]] [VP [VBD acquired] [NP [NP [CD 5] [NN %]] [PP [IN of] [NP [NP [DT the] [NNS shares]] [PP [IN in] [NP [NNP Jaguar] [NNP PLC]]]]]]] .]
–[S [NP-SBJ [NNP Pierre] [NNP Vinken]] [VP [MD will] [VP [VB join] [NP [DT the] [NN board]] [PP [IN as] [NP [DT a] [NN director]]]]] .]
–…

29 Hwa’s Approach
Sample selection
–Reduce the amount of training data by picking more useful examples
Co-training
–Improve parsing performance from unlabeled data
Corrected Co-training
–Combine ideas from both sample selection and co-training

30 Roadmap
Parsing as a learning problem
Semi-supervised approaches
–Sample selection
  Overview
  Scoring functions
  Evaluation
–Co-training
–Corrected Co-training
Conclusion and further directions

31 Sample Selection
Assumption
–Have lots of unlabeled data (cheap resource)
–Have a human annotator (expensive resource)
Iterative training session
–Learner selects sentences to learn from
–Annotator labels these sentences
Goal: predict the benefit of annotation
–Learner selects sentences with the highest Training Utility Values (TUVs)
–Key issue: scoring function to estimate TUV

32 Algorithm
Initialize: Train the parser on a small treebank (seed data) to get the initial parameter values.
Repeat:
–Create a candidate set by randomly sampling the unlabeled pool.
–Estimate the TUV of each sentence in the candidate set with a scoring function, f.
–Pick the n sentences with the highest score (according to f).
–A human labels these n sentences, and they are added to the training set.
–Re-train the parser with the updated training set.
Until (no more data).
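The loop above is a standard active-learning cycle, so it can be sketched generically. In the sketch below, `train`, `score` (the TUV estimate f), and `annotate` (the human annotator) are assumed, user-supplied callables, not parts of any particular parser.

```python
import random

def sample_selection(seed_labeled, unlabeled_pool, train, score, annotate,
                     candidate_size=500, n=100, iterations=10):
    """Active learning by sample selection.
    train(labeled)     -> parser model
    score(model, sent) -> estimated training utility value (TUV) of `sent`
    annotate(sent)     -> gold parse supplied by the human annotator
    All three are assumed, user-supplied callables."""
    labeled = list(seed_labeled)
    pool = list(unlabeled_pool)
    model = train(labeled)
    for _ in range(iterations):
        if not pool:
            break
        # Candidate set: a random sample of the unlabeled pool.
        candidates = random.sample(pool, min(candidate_size, len(pool)))
        # Pick the n candidates with the highest estimated TUV.
        chosen = sorted(candidates, key=lambda s: score(model, s), reverse=True)[:n]
        # Human labels the chosen sentences; add them to the training set.
        labeled.extend((s, annotate(s)) for s in chosen)
        for s in chosen:
            pool.remove(s)
        model = train(labeled)
    return model
```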

33 Scoring Function
Approximate the TUV of each sentence
–True TUVs are not known
–Need relative ranking
Ranking criteria
–Knowledge about the domain, e.g., sentence clusters, sentence length, …
–Output of the hypothesis, e.g., error rate of the parse, uncertainty of the parse, …

34 Proposed Scoring Functions
–f_len: uses domain knowledge (long sentences tend to be complex)
–f_te: uncertainty about the output of the parser (tree entropy)
–f_error: minimizes mistakes made by the parser (an oracle scoring function that finds sentences with the most parsing inaccuracies)

35 Entropy
Measure of uncertainty in a distribution
–Uniform distribution: very uncertain
–Spiked distribution: very certain
Expected number of bits for encoding a probability distribution X:
H(X) = − Σ_x P(x) log2 P(x)

36 Tree Entropy Scoring Function
Distribution over parse trees for sentence W: the parser defines a probability distribution P(t | W) over the candidate parses t of W.
Tree entropy: the uncertainty of this parse distribution.
Scoring function f_te: the ratio of the actual parse-tree entropy to that of a uniform distribution over the same parses.
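A minimal sketch of the tree-entropy score, under the assumption that the parser can return a normalized probability distribution over its candidate parses for W; the score divides the entropy by that of a uniform distribution over the same number of parses (log2 K for K parses), so it falls between 0 and 1.

```python
from math import log2

def tree_entropy_score(parse_probs):
    """f_te for one sentence, given the (normalized) probabilities of its candidate parses.
    Returns entropy divided by the entropy of a uniform distribution over the same parses;
    higher means the parser is more uncertain about this sentence."""
    probs = [p for p in parse_probs if p > 0]
    if len(probs) <= 1:
        return 0.0  # a single parse carries no uncertainty
    entropy = -sum(p * log2(p) for p in probs)
    return entropy / log2(len(probs))

print(tree_entropy_score([0.25, 0.25, 0.25, 0.25]))  # 1.0: maximally uncertain
print(tree_entropy_score([0.97, 0.01, 0.01, 0.01]))  # about 0.12: much more certain
```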

37 Oracle Scoring Function
f_error = 1 − the accuracy rate of the most-likely parse
Parse accuracy metric: f-score
–Precision = (# of correctly labeled constituents) / (# of constituents generated)
–Recall = (# of correctly labeled constituents) / (# of constituents in the correct answer)
–f-score = harmonic mean of precision and recall

38 Experimental Setup
Parsing model: Collins Model 2
Candidate pool: WSJ sec 02-21, with the annotation stripped
Initial labeled examples: 500 sentences
Per iteration: add 100 sentences
Testing metric: f-score (precision/recall)
Test data: ~2000 unseen sentences (from WSJ sec 00)
Baseline: annotate data in sequential order

39 Training Examples Vs. Parsing Performance

40 Parsing Performance Vs. Constituents Labeled

41 Co-Training [Blum and Mitchell, 1998]
Assumptions
–Have a small treebank
–No further human assistance
–Have two different kinds of parsers
A subset of each parser’s output becomes new training data for the other.
Goal: select sentences that are labeled with confidence by one parser but with uncertainty by the other parser.

42 Algorithm
Initialize: Train two parsers on a small treebank (seed data) to get the initial models.
Repeat:
–Create a candidate set by randomly sampling the unlabeled pool.
–Each parser labels the candidate set and estimates the accuracy of its output with a scoring function, f.
–Choose examples according to some selection method, S (using the scores from f).
–Add them to the parsers’ training sets.
–Re-train the parsers with the updated training sets.
Until (no more data).
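A minimal sketch of the co-training loop, under assumptions: `train_a`/`train_b` build the two different parsers, `score` plays the role of f, and `select` implements one of the selection methods S described on the next slide. Each parser's selected output becomes new training data for the other parser.

```python
import random

def co_train(seed_labeled, unlabeled_pool, train_a, train_b, score, select,
             candidate_size=500, iterations=10):
    """Co-training two parsers from one small seed treebank.
    train_a / train_b : labeled data -> parser model (two different parser types)
    score(model, sent) : returns (parse, estimated accuracy of that parse)
    select(teacher_scored, student_scored) : indices of sentences whose teacher parse
        should be added to the student's training data (an assumed selection method S).
    All callables are placeholders for the real components."""
    data_a, data_b = list(seed_labeled), list(seed_labeled)
    pool = list(unlabeled_pool)
    model_a, model_b = train_a(data_a), train_b(data_b)
    for _ in range(iterations):
        if not pool:
            break
        candidates = random.sample(pool, min(candidate_size, len(pool)))
        scored_a = [score(model_a, s) for s in candidates]  # (parse, score) per sentence
        scored_b = [score(model_b, s) for s in candidates]
        # A teaches B, and B teaches A.
        for i in select(scored_a, scored_b):
            data_b.append((candidates[i], scored_a[i][0]))
        for i in select(scored_b, scored_a):
            data_a.append((candidates[i], scored_b[i][0]))
        for s in candidates:
            pool.remove(s)
        model_a, model_b = train_a(data_a), train_b(data_b)
    return model_a, model_b
```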

43 Scoring Functions
Evaluates the quality of each parser’s output
Ideally, the function measures accuracy
–Oracle f_F-score: combined precision/recall of the parse
Practical scoring functions
–Conditional probability f_cprob: Prob(parse | sentence)
–Others (joint probability, entropy, etc.)

44 Selection Methods
–Above-n (S_above-n): the score of the teacher’s parse is greater than n
–Difference (S_diff-n): the score of the teacher’s parse is greater than that of the student’s parse by n
–Intersection (S_int-n): the score of the teacher’s parse is one of its n% highest, while the score of the student’s parse for the same sentence is one of the student’s n% lowest
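The three selection methods can be written as simple filters over the teacher and student scores. A minimal sketch, assuming each input is a list of scores aligned by sentence index; percentile cutoffs stand in for the "n% highest/lowest" sets.

```python
def s_above_n(teacher_scores, student_scores, n):
    """Teacher's parse scores greater than n."""
    return [i for i, t in enumerate(teacher_scores) if t > n]

def s_diff_n(teacher_scores, student_scores, n):
    """Teacher's score exceeds the student's score by more than n."""
    return [i for i, (t, s) in enumerate(zip(teacher_scores, student_scores)) if t - s > n]

def s_int_n(teacher_scores, student_scores, n):
    """Teacher's score is among its top n% while the student's score for the same
    sentence is among the student's bottom n%."""
    k = max(1, int(len(teacher_scores) * n / 100))
    top_teacher = set(sorted(range(len(teacher_scores)),
                             key=lambda i: teacher_scores[i], reverse=True)[:k])
    low_student = set(sorted(range(len(student_scores)),
                             key=lambda i: student_scores[i])[:k])
    return sorted(top_teacher & low_student)

teacher = [0.9, 0.6, 0.95, 0.4]
student = [0.5, 0.7, 0.2, 0.3]
print(s_above_n(teacher, student, 0.8))   # [0, 2]
print(s_diff_n(teacher, student, 0.3))    # [0, 2]
print(s_int_n(teacher, student, 50))      # [2]
```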

45 Experimental Setup
Co-training parsers:
–Lexicalized Tree Adjoining Grammar parser [Sarkar, 2002]
–Lexicalized Context Free Grammar parser [Collins, 1997]
Seed data: 1000 parsed sentences from WSJ sec 02
Unlabeled pool: rest of WSJ sec 02-21, stripped
Consider 500 unlabeled sentences per iteration
Development set: WSJ sec 00
Test set: WSJ sec 23
Results: graphs for the Collins parser

46 Selection Methods and Co-Training
Two scoring functions: f_F-score (oracle), f_cprob
Multiple-view selection vs. one-view selection
–Three selection methods: S_above-n, S_diff-n, S_int-n
Maximizing utility vs. minimizing error
–For f_F-score, we vary n to control the accuracy rate of the training data
–Loose control: more sentences (avg. F-score = 85%)
–Tight control: fewer sentences (avg. F-score = 95%)

47 Co-Training using f_F-score with Loose Control

48 Co-Training using f_F-score with Tight Control

49 Co-Training using f_cprob

50 Roadmap
Parsing as a learning problem
Semi-supervised approaches
–Sample selection
–Co-training
–Corrected Co-training
Conclusion and further directions

51 Corrected Co-Training
Human reviews and corrects the machine outputs before they are added to the training set
Can be seen as a variant of sample selection [cf. Muslea et al., 2000]
Applied to Base NP detection [Pierce & Cardie, 2001]

52 Algorithm
Initialize: Train two parsers on a small treebank (seed data) to get the initial models.
Repeat:
–Create a candidate set by randomly sampling the unlabeled pool.
–Each parser labels the candidate set and estimates the accuracy of its output with a scoring function, f.
–Choose examples according to some selection method, S (using the scores from f).
–A human reviews and corrects the chosen examples.
–Add them to the parsers’ training sets.
–Re-train the parsers with the updated training sets.
Until (no more data).
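Corrected co-training changes only one step of the co-training loop: the selected parses pass through the human reviewer before entering either training set. A minimal sketch of that step, assuming a `correct(sentence, parse)` callable that returns the parse unchanged if it was right, or a corrected parse otherwise.

```python
def corrected_examples(candidates, scored_teacher, selected_indices, correct):
    """Turn the selected machine-parsed sentences into human-reviewed training examples.
    correct(sentence, parse) is the assumed human-in-the-loop step: it returns the
    parse unchanged if it was right, or a corrected parse otherwise."""
    return [(candidates[i], correct(candidates[i], scored_teacher[i][0]))
            for i in selected_indices]
```

In the co-training loop sketched earlier, these reviewed examples would replace the raw machine-labeled output before being added to either parser's training set.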

53 Selection Methods and Corrected Co-Training
Two scoring functions: f_F-score, f_cprob
Three selection methods: S_above-n, S_diff-n, S_int-n

54 Corrected Co-Training using f_F-score (Reviews)

55 Corrected Co-Training using f_F-score (Corrections)

56 Corrected Co-Training using f_cprob (Reviews)

57 Corrected Co-Training using f_cprob (Corrections)