684.02 Probabilistic Context Free Grammars. Chris Brew, Ohio State University. 22/02/2016

Context Free Grammars
• HMMs are sophisticated tools for language modelling based on finite state machines (FSMs).
• Context-free grammars go beyond FSMs
• They can encode longer range dependencies than FSMs
• They too can be made probabilistic

An example
s -> np vp
s -> np vp pp
np -> det n
np -> np pp
vp -> v np
pp -> p np
n -> girl
n -> boy
n -> park
n -> telescope
v -> saw
p -> with
p -> in
Sample sentence: “The boy saw the girl in the park with the telescope”

Multiple analyses
• 2 of the 5 are: [tree diagrams not preserved in the transcript]

How serious is this ambiguity?
• Very serious: ambiguities in different places multiply
• It is easy to get millions of analyses for simple-seeming sentences
• Maybe we can use probabilities to disambiguate, just as we chose from exponentially many paths through an FSM
• Fortunately, similar techniques apply

Probabilistic Context Free Grammars
• Same as context free grammars, with one extension
  – Where there is a choice of productions for a non-terminal, give each alternative a probability.
  – For each choice point, the sum of the probabilities of the available options is 1
  – i.e. the production probability is p(rhs|lhs)

An example
s -> np vp : 0.8
s -> np vp pp : 0.2
np -> det n : 0.5
np -> np pp : 0.5
vp -> v np : 1.0
pp -> p np : 1.0
n -> girl : 0.25
n -> boy : 0.25
n -> park : 0.25
n -> telescope : 0.25
v -> saw : 1.0
p -> with : 0.5
p -> in : 0.5
Sample sentence: “The boy saw the girl in the park with the telescope”
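
The constraint that the options at each choice point sum to 1 can be checked mechanically. The following is a minimal sketch, not part of the original slides: the triple representation and the helper names `lhs_totals` and `is_normalized` are assumptions made for illustration. The slide lists no rule for `det`, so none is included here.

```python
from collections import defaultdict

# The slide's rules as (lhs, rhs, probability) triples.
# This representation is an illustrative assumption.
RULES = [
    ("s", ("np", "vp"), 0.8), ("s", ("np", "vp", "pp"), 0.2),
    ("np", ("det", "n"), 0.5), ("np", ("np", "pp"), 0.5),
    ("vp", ("v", "np"), 1.0),
    ("pp", ("p", "np"), 1.0),
    ("n", ("girl",), 0.25), ("n", ("boy",), 0.25),
    ("n", ("park",), 0.25), ("n", ("telescope",), 0.25),
    ("v", ("saw",), 1.0),
    ("p", ("with",), 0.5), ("p", ("in",), 0.5),
]

def lhs_totals(rules):
    """Sum p(rhs|lhs) over all right-hand sides of each left-hand side."""
    totals = defaultdict(float)
    for lhs, _rhs, prob in rules:
        totals[lhs] += prob
    return dict(totals)

def is_normalized(rules, tol=1e-9):
    """True if every left-hand side's probabilities sum to 1 (within tol)."""
    return all(abs(total - 1.0) <= tol for total in lhs_totals(rules).values())

print(lhs_totals(RULES))
print(is_normalized(RULES))
```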

The “low” attachment
p(“np vp”|s) * p(“det n”|np) * p(“the”|det) * p(“boy”|n) * p(“v np”|vp) * p(“det n”|np) * p(“the”|det) * ...

The “high” attachment
p(“np vp pp”|s) * p(“det n”|np) * p(“the”|det) * p(“boy”|n) * p(“v np”|vp) * p(“det n”|np) * p(“the”|det) * ...
Note: I’m not claiming that this matches any particular set of psycholinguistic claims, only that the formalism allows such distinctions to be made.
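
The two products can be computed directly by multiplying the probability of every rule used in a tree. This is a sketch under stated assumptions: the nested-tuple tree encoding and the helpers `tree_prob`, `np_` and `pp_` are hypothetical, the rule `det -> the` with probability 1.0 is assumed because the example grammar does not list it, and a shorter sentence ("the boy saw the girl in the park") is used to keep the trees small.

```python
# Rule probabilities from the example grammar, keyed by (lhs, rhs).
RULE_PROB = {
    ("s", ("np", "vp")): 0.8, ("s", ("np", "vp", "pp")): 0.2,
    ("np", ("det", "n")): 0.5, ("np", ("np", "pp")): 0.5,
    ("vp", ("v", "np")): 1.0, ("pp", ("p", "np")): 1.0,
    ("n", ("girl",)): 0.25, ("n", ("boy",)): 0.25,
    ("n", ("park",)): 0.25, ("n", ("telescope",)): 0.25,
    ("v", ("saw",)): 1.0,
    ("p", ("with",)): 0.5, ("p", ("in",)): 0.5,
    ("det", ("the",)): 1.0,  # assumption: not listed on the slide
}

def tree_prob(tree):
    """Probability of a tree (label, child, ...); a child is a subtree or a word."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = RULE_PROB[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            prob *= tree_prob(child)
    return prob

def np_(noun):
    """Hypothetical helper: the tree for 'the <noun>'."""
    return ("np", ("det", "the"), ("n", noun))

def pp_(prep, noun):
    """Hypothetical helper: the tree for '<prep> the <noun>'."""
    return ("pp", ("p", prep), np_(noun))

# "the boy saw the girl in the park"
low = ("s", np_("boy"), ("vp", ("v", "saw"), ("np", np_("girl"), pp_("in", "park"))))
high = ("s", np_("boy"), ("vp", ("v", "saw"), np_("girl")), pp_("in", "park"))

print(tree_prob(low), tree_prob(high))
```

With these numbers the low attachment comes out twice as probable as the high one; that reflects the rule probabilities chosen for the example, not a linguistic claim.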

Generating from Probabilistic Context Free Grammars
• Start with the distinguished symbol “s”
• Choose a way of expanding “s”
  – This introduces new non-terminals (e.g. “np”, “vp”)
• Choose ways of expanding these
• Carry on until there are no more non-terminals
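
The generation procedure above can be sketched as a short sampler. This is an illustration under assumptions, not the author's code: the dictionary layout, the function name `generate` and the `max_steps` cap are all hypothetical, and `det -> the` is added so that every non-terminal can be expanded. The cap is there because `np -> np pp` is recursive and an individual derivation can become very long.

```python
import random

# Example grammar: non-terminal -> list of (rhs, probability).
RULES = {
    "s": [(("np", "vp"), 0.8), (("np", "vp", "pp"), 0.2)],
    "np": [(("det", "n"), 0.5), (("np", "pp"), 0.5)],
    "vp": [(("v", "np"), 1.0)],
    "pp": [(("p", "np"), 1.0)],
    "det": [(("the",), 1.0)],  # assumption: not listed on the slide
    "n": [(("girl",), 0.25), (("boy",), 0.25),
          (("park",), 0.25), (("telescope",), 0.25)],
    "v": [(("saw",), 1.0)],
    "p": [(("with",), 0.5), (("in",), 0.5)],
}

def generate(rng, start="s", max_steps=10_000):
    """Expand the leftmost symbol repeatedly; return the words, or None if the cap is hit."""
    pending = [start]  # stack with the leftmost unexpanded symbol on top
    words = []
    for _ in range(max_steps):
        if not pending:
            break
        symbol = pending.pop()
        if symbol not in RULES:  # anything without rules is a word
            words.append(symbol)
            continue
        options = RULES[symbol]
        rhs = rng.choices([r for r, _ in options],
                          weights=[p for _, p in options])[0]
        pending.extend(reversed(rhs))  # keep left-to-right order on the stack
    return words if not pending else None

sentence = generate(random.Random(0))
print(" ".join(sentence) if sentence is not None else "(derivation exceeded the cap)")
```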

Issues
• The space of possible trees is infinite.
  – But the sum of probabilities over all trees is 1 (provided the grammar is consistent, i.e. no probability mass is lost to derivations that never terminate)
• There is a strong assumption built in to the model
  – Expansion probability is independent of the position of the non-terminal within the tree
  – This assumption is questionable.

Training for Probabilistic Context Free Grammars
• Supervised: you have a treebank
• Unsupervised: you have only words
• In between: Pereira and Schabes

Supervised Training
• Look at the trees in your corpus
• Count the number of times each lhs -> rhs occurs
• Divide these counts by the number of times each lhs occurs
• Maybe smooth as described in the lecture on probability estimation from counts
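
The counting recipe above amounts to relative-frequency estimation. Here is a minimal, unsmoothed sketch; the tuple-based tree encoding, the function names `rules_in` and `estimate`, and the two-tree toy treebank are assumptions for illustration only.

```python
from collections import Counter

def rules_in(tree):
    """Yield (lhs, rhs) for every local tree; leaves are plain word strings."""
    label, *children = tree
    yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
    for child in children:
        if not isinstance(child, str):
            yield from rules_in(child)

def estimate(treebank):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs), no smoothing."""
    rule_counts = Counter(rule for tree in treebank for rule in rules_in(tree))
    lhs_counts = Counter()
    for (lhs, _rhs), count in rule_counts.items():
        lhs_counts[lhs] += count
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}

# Hypothetical two-tree treebank.
TREEBANK = [
    ("s", ("np", ("det", "the"), ("n", "boy")),
          ("vp", ("v", "saw"), ("np", ("det", "the"), ("n", "girl")))),
    ("s", ("np", ("det", "the"), ("n", "girl")),
          ("vp", ("v", "saw"), ("np", ("det", "the"), ("n", "boy")))),
]

for rule, prob in sorted(estimate(TREEBANK).items()):
    print(rule, prob)
```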

Unsupervised Training
• These are Rabiner’s problems, but for PCFGs
  – Calculate the probability of a corpus given a model
  – Guess the sequence of states passed through
  – Adapt the model to the corpus

Hidden Trees
• All you see is the output:
  – “The boy saw the girl in the park”
• But you can’t tell which of several trees led to that sentence
• Each tree may have a different probability, although trees which use the same rules the same number of times must have the same probability.
• You don’t know which state you are in.

The three problems
• Probability estimation
  – Given a sequence of observations O and a grammar G, find P(O|G)
• Best tree estimation
  – Given a sequence of observations O and a grammar G, find a Tree which maximizes P(O,Tree|G).

The third problem
• Training
  – Adjust the model parameters so that P(O|G) is as large as possible for given O.
  – Hard problem because there are so many adjustable parameters which could vary.
  – Worse than for HMMs. More local maxima.

Probability estimation
• Easy in principle. Marginalize out the trees, leaving the probability of strings.
• But this involves a sum over exponentially many trees.
• An efficient algorithm keeps track of inside and outside probabilities.

Inside Probability
• The probability that non-terminal NT expands to the words between i and j
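
Written out, the inside probability satisfies a simple recursion. The notation below is an assumption, not the slides' own: the symbol $\beta$, a grammar restricted to binary rules $A \to B\,C$ and lexical rules $A \to w$, and fencepost positions in which word $w_i$ sits between positions $i$ and $i+1$ of an $n$-word sentence.

```latex
\begin{align*}
\beta_A(i,j) &= P(A \Rightarrow^{*} w_i \cdots w_{j-1} \mid G) \\
\beta_A(i,i+1) &= P(A \to w_i) \\
\beta_A(i,j) &= \sum_{B,C} \; \sum_{k=i+1}^{j-1}
    P(A \to B\,C)\, \beta_B(i,k)\, \beta_C(k,j) \qquad (j > i+1) \\
P(O \mid G) &= \beta_S(1,n+1)
\end{align*}
```

Each cell is filled from strictly shorter spans, so the sum over exponentially many trees is replaced by a polynomial number of additions and multiplications.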

Outside probability
• Dual of inside probability.
[Figure: diagram with labels NP, “SENT A LETTER ...”, i, j, “... A MAN”]

Corpus probability
• The inside probability of the S node over the entire string is the total probability of all ways of deriving that string as a sentence
• The product over all strings in the corpus is the corpus probability
• Can also get the corpus probability from outside probabilities

Training
• Uses inside and outside probabilities
• Starts from an initial guess
• Improves the initial guess using data
• Stops at a (locally) best model
• Specialization of the EM algorithm

Expected rule counts
• Consider p(uses rule lhs -> rhs to cover i through j)
• Four things need to happen
  – Generate the outside words, leaving a hole for lhs
  – Choose the correct rhs
  – Generate the words seen between i and k from the first item in rhs (inside probability)
  – Generate the words seen between k and j using the other items in rhs (more inside probabilities)
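
The four factors listed above multiply together. As a sketch in assumed notation (not the slides' own): $\alpha_A(i,j)$ is the outside probability of $A$ over the span from $i$ to $j$, $\beta$ is the inside probability, rules are binary, and $O$ is a single sentence.

```latex
\begin{align*}
P(A \to B\,C \text{ covers } i..j,\; O \mid G)
  &= \alpha_A(i,j)\; P(A \to B\,C) \sum_{k=i+1}^{j-1} \beta_B(i,k)\, \beta_C(k,j) \\
E[\operatorname{count}(A \to B\,C)]
  &= \frac{1}{P(O \mid G)} \sum_{i<j}
     \alpha_A(i,j)\; P(A \to B\,C) \sum_{k=i+1}^{j-1} \beta_B(i,k)\, \beta_C(k,j) \\
\hat{P}(A \to B\,C)
  &= \frac{E[\operatorname{count}(A \to B\,C)]}
          {\sum_{\mathit{rhs}'} E[\operatorname{count}(A \to \mathit{rhs}')]}
\end{align*}
```

The last line is the re-estimation step: expected counts play the role that observed counts play in supervised training.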

Refinements
• In practice there are very many local maxima, so strategies which involve generating hundreds of thousands of rules may fail badly.
• Pereira and Schabes discovered that giving the system some limited information about bracketing is enough to guide it to correct answers
• Different grammar formalisms (TAGs, Categorial Grammars...)

A basic parsing algorithm
• The simplest statistical parsing algorithm is called CYK or CKY.
• It is a statistical variant of a bottom-up tabular parsing algorithm that you should have seen in
• It (somewhat surprisingly) turns out to be closely related to the problem of multiplying matrices.

Basic CKY (review)
• Assume we have organized the lexicon as a function
    lexicon: string -> nonterminal set
• Organize these nonterminals into the relevant parts of a two-dimensional array indexed by the left and right end of the item
    for i = 1 to length(sentence) do
      chart[i,i+1] = lexicon(sentence[i])
    endfor

Basic CKY
• Assume we have organized the grammar as a function
    grammar: nonterminal -> nonterminal -> nonterminal set

Basic CKY
• Build up new entries from existing entries, working from shorter entries to longer ones
    for l = 2 to length(sentence) do              // l is length of constituent
      for s = 1 to length(sentence) - l + 1 do    // s is start of rhs1
        for t = 1 to l-1 do                       // t is length of rhs1
          (left,mid,right) = (s,s+t,s+l)
          // accumulate over split points rather than overwriting the cell
          chart[left,right] = union chart[left,right] (combine(chart[left,mid],chart[mid,right]))
        endfor
      endfor
    endfor

Basic CKY
• Combine is
    fun combine(set1,set2)
      result = empty
      for item1 in set1 do
        for item2 in set2 do
          result = union result (grammar item1 item2)
        endfor
      endfor
      return result
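
The pseudocode on the Basic CKY slides translates almost line for line into Python. This is a sketch under stated assumptions rather than the lecture's own code: positions are 0-based instead of 1-based, the lexicon and grammar are dictionaries instead of functions, `the -> det` is added to the lexicon, and the ternary rule `s -> np vp pp` from the earlier example is replaced by a hypothetical binary rule `vp -> vp pp`, because CKY as presented combines exactly two items.

```python
from collections import defaultdict

# word -> set of labels ("the" -> det is an assumption)
LEXICON = {
    "the": {"det"}, "boy": {"n"}, "girl": {"n"}, "park": {"n"},
    "telescope": {"n"}, "saw": {"v"}, "with": {"p"}, "in": {"p"},
}

# (left label, right label) -> set of parent labels
GRAMMAR = {
    ("np", "vp"): {"s"},
    ("det", "n"): {"np"},
    ("np", "pp"): {"np"},
    ("v", "np"): {"vp"},
    ("vp", "pp"): {"vp"},  # assumption: stands in for s -> np vp pp
    ("p", "np"): {"pp"},
}

def combine(set1, set2):
    """All labels that can be built from a left item and a right item."""
    result = set()
    for item1 in set1:
        for item2 in set2:
            result |= GRAMMAR.get((item1, item2), set())
    return result

def cky(words):
    """Return the chart: (left, right) -> set of labels spanning those positions."""
    n = len(words)
    chart = defaultdict(set)
    for i, word in enumerate(words):
        chart[i, i + 1] = set(LEXICON.get(word, ()))
    for length in range(2, n + 1):            # length of constituent
        for left in range(n - length + 1):    # start of constituent
            right = left + length
            for mid in range(left + 1, right):  # split point
                chart[left, right] |= combine(chart[left, mid], chart[mid, right])
    return chart

words = "the boy saw the girl in the park with the telescope".split()
print("s" in cky(words)[0, len(words)])
```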

Going statistical
• The basic algorithm tracks labels for each substring of the input
• The cell contents are sets of labels
• A statistical version keeps track of labels and their probabilities
• Now the cell contents must be weighted sets

Going statistical
• Make the grammar and lexicon produce weighted sets.
    lexicon: word -> real*nt set
    grammar: real*nt -> real*nt -> real*nt set
• We now need an operation corresponding to set union for weighted sets.
• {s:0.1,np:0.2} WU {s:0.2,np:0.1} = ???

Going statistical (one way)
{s:0.1,np:0.2} WU {s:0.2,np:0.1} = {s:0.3,np:0.3}
If we implement this, we get a parser that calculates the inside probability for each label on each span.

Going statistical (another way)
{s:0.1,np:0.2} WU {s:0.2,np:0.1} = {s:0.2,np:0.2}
If we implement this, we get a parser that calculates the best parse probability for each label on each span.
The difference is that in the first case we are combining weights with +, while in the second we use max.
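
Both variants can share one parser if the weighted union takes the combining operation as a parameter. The sketch below makes several assumptions that are not in the slides: the function names `weighted_union` and `weighted_cky`, 0-based positions, `the -> det` with probability 1.0, and a binarized toy grammar in which `s -> np vp` has probability 1.0 and the invented rules `vp -> v np : 0.7` and `vp -> vp pp : 0.3` stand in for the ternary `s -> np vp pp`.

```python
import operator
from collections import defaultdict

def weighted_union(a, b, op):
    """Merge two weighted sets; weights of shared labels are combined with op."""
    out = dict(a)
    for label, weight in b.items():
        out[label] = op(out[label], weight) if label in out else weight
    return out

# word -> {label: probability}; "the" -> det is an assumption
LEXICON = {
    "the": {"det": 1.0}, "saw": {"v": 1.0},
    "boy": {"n": 0.25}, "girl": {"n": 0.25},
    "park": {"n": 0.25}, "telescope": {"n": 0.25},
    "with": {"p": 0.5}, "in": {"p": 0.5},
}

# (left, right) -> {parent: probability}; the vp probabilities are invented
GRAMMAR = {
    ("np", "vp"): {"s": 1.0},
    ("det", "n"): {"np": 0.5}, ("np", "pp"): {"np": 0.5},
    ("v", "np"): {"vp": 0.7}, ("vp", "pp"): {"vp": 0.3},
    ("p", "np"): {"pp": 1.0},
}

def weighted_cky(words, op):
    """op=operator.add gives inside probabilities; op=max gives best-parse probabilities."""
    n = len(words)
    chart = defaultdict(dict)
    for i, word in enumerate(words):
        chart[i, i + 1] = dict(LEXICON.get(word, {}))
    for length in range(2, n + 1):
        for left in range(n - length + 1):
            right = left + length
            for mid in range(left + 1, right):
                for l_label, l_w in chart[left, mid].items():
                    for r_label, r_w in chart[mid, right].items():
                        rules = GRAMMAR.get((l_label, r_label), {})
                        for parent, p in rules.items():
                            chart[left, right] = weighted_union(
                                chart[left, right], {parent: p * l_w * r_w}, op)
    return chart[0, n]

words = "the boy saw the girl in the park".split()
print(weighted_cky(words, operator.add).get("s"))  # inside probability of s
print(weighted_cky(words, max).get("s"))           # probability of the best parse
```

Swapping `operator.add` for `max` is the only change between the two parsers, which is the point the slides make.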

Building trees
• Make the cell contents be sets of trees
• Make the lexicon be a function from words to little trees
• Make the grammar be a function from pairs of trees to sets of newly created (bigger) trees
• Set union is now over sets of trees
• Nothing else needs to change

Building weighted trees
• Make the cell contents be sets of trees, labelled with probabilities
• Make the lexicon be a function from words to weighted (little) trees
• Make the grammar be a function from pairs of weighted trees to sets of newly created (bigger) weighted trees
• Set union is now over sets of weighted trees
• Again we have a choice of + or max, to get either the parse forest or just the best parse
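
For the max case, each cell only needs to remember the best weighted tree per label. The following sketch illustrates that idea under assumptions that are not in the slides: the function name `best_parse`, nested tuples as trees, 0-based positions, `the -> det` with probability 1.0, and the same invented binarization (`vp -> v np : 0.7`, `vp -> vp pp : 0.3`, `s -> np vp : 1.0`) in place of the ternary rule.

```python
# word -> {label: probability}; "the" -> det is an assumption
LEXICON = {
    "the": {"det": 1.0}, "saw": {"v": 1.0},
    "boy": {"n": 0.25}, "girl": {"n": 0.25},
    "park": {"n": 0.25}, "telescope": {"n": 0.25},
    "with": {"p": 0.5}, "in": {"p": 0.5},
}

# (left, right) -> {parent: probability}; the vp probabilities are invented
GRAMMAR = {
    ("np", "vp"): {"s": 1.0},
    ("det", "n"): {"np": 0.5}, ("np", "pp"): {"np": 0.5},
    ("v", "np"): {"vp": 0.7}, ("vp", "pp"): {"vp": 0.3},
    ("p", "np"): {"pp": 1.0},
}

def best_parse(words, goal="s"):
    """Return (probability, tree) of the best goal-labelled parse, or None."""
    n = len(words)
    chart = {}  # (left, right) -> {label: (probability, tree)}
    for i, word in enumerate(words):
        # the lexicon yields weighted little trees such as (0.25, ("n", "boy"))
        chart[i, i + 1] = {label: (p, (label, word))
                           for label, p in LEXICON.get(word, {}).items()}
    for length in range(2, n + 1):
        for left in range(n - length + 1):
            right = left + length
            cell = chart.setdefault((left, right), {})
            for mid in range(left + 1, right):
                for l_label, (l_p, l_tree) in chart[left, mid].items():
                    for r_label, (r_p, r_tree) in chart[mid, right].items():
                        rules = GRAMMAR.get((l_label, r_label), {})
                        for parent, p in rules.items():
                            cand = p * l_p * r_p
                            # max: keep only the most probable tree per label
                            if parent not in cell or cand > cell[parent][0]:
                                cell[parent] = (cand, (parent, l_tree, r_tree))
    return chart.get((0, n), {}).get(goal)

print(best_parse("the boy saw the girl in the park".split()))
```

Keeping every candidate tree instead of only the best one would give the forest variant, at the cost of cells that can grow exponentially unless subtrees are shared.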

Where to get more information
• Roark and Sproat ch 7
• Charniak chapters 5 and 6
• Allen Natural Language Understanding ch 7
• Lisp code associated with Natural Language Understanding
• Goodman: Semiring parsing (