Page 1: Probabilistic Parsing and Treebanks
L545, Spring 2000

Page 2: Motivation and Outline
- Previously, we parsed with CFGs, but:
  - Some ambiguous sentences could not be disambiguated, and we would like to know the most likely parse
  - We did not discuss how to obtain such grammars
- Where we're going:
  - Probabilistic Context-Free Grammars (PCFGs), plus some discussion of treebanks
  - Lexicalized PCFGs
- We'll only cover PCFGs in very broad strokes; L645 offers more details

Page 3: Statistical Parsing
- Basic idea:
  - Start with a treebank: a collection of sentences with syntactic annotation, i.e., already-parsed sentences
  - Examine which parse trees occur frequently
  - Extract grammar rules corresponding to those parse trees, estimating the probability of each grammar rule from its frequency
- Result: a CFG augmented with probabilities

Page 4: Probabilistic Context-Free Grammars (PCFGs)
- Definition of a CFG (review):
  - Set of non-terminals (N)
  - Set of terminals (T)
  - Set of rules/productions (P), of the form A → β
  - Designated start symbol (S)
- Definition of a PCFG:
  - Same as a CFG, but with one more function, D
  - D assigns a probability to each rule in P

Page 5: Probabilities
- The function D gives the probability that a non-terminal A is expanded to a sequence β
  - Written as P(A → β)
  - or as P(A → β | A)
- Idea: given A as the mother non-terminal (LHS), what is the likelihood that β is the correct RHS?
- Note that $\sum_i P(A \to \beta_i \mid A) = 1$
  - i.e., the β_i are all the ways of expanding A
  - For example, we would augment a CFG with these probabilities:
    - P(S → NP VP | S) = .80
    - P(S → Aux NP VP | S) = .15
    - P(S → VP | S) = .05
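
The sum-to-one constraint can be checked mechanically. The following is a minimal sketch, assuming rules are stored in a dictionary keyed by (LHS, RHS); the names `pcfg` and `is_normalized` are illustrative and do not come from the slides.

```python
import math
from collections import defaultdict

# Rule probabilities from the slide, keyed by (LHS, RHS).
pcfg = {
    ("S", ("NP", "VP")): 0.80,
    ("S", ("Aux", "NP", "VP")): 0.15,
    ("S", ("VP",)): 0.05,
}

def is_normalized(rules, tol=1e-9):
    """Return True if, for every LHS, the rule probabilities sum to 1 (within tol)."""
    totals = defaultdict(float)
    for (lhs, _rhs), prob in rules.items():
        totals[lhs] += prob
    return all(math.isclose(total, 1.0, abs_tol=tol) for total in totals.values())

print(is_normalized(pcfg))
```

A tolerance is used because floating-point sums of values such as .80, .15 and .05 are not guaranteed to equal 1 exactly.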

Page 6: Estimating Probabilities using a Treebank
- Given a corpus of sentences with syntactic annotation (e.g., the Penn Treebank):
  - Consider all parse trees
  - (1) Each time a rule of the form A → β is applied in a parse tree, increment a counter for that rule
  - (2) Also count the number of times A is on the left-hand side of a rule
  - Divide (1) by (2):
    - P(A → β | A) = Count(A → β) / Count(A)
- If you don't have annotated data, parse the corpus (as we'll describe next) and estimate the probabilities, which are then used to re-parse
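
The counting procedure can be sketched in a few lines. This is only an illustration under stated assumptions: the two-tree `toy_treebank` is invented, trees are encoded as nested tuples `(label, child, ...)` with words as plain strings, and the function names are hypothetical.

```python
from collections import Counter

# Toy "treebank" (made up for illustration): a tree is (label, child, ...),
# and a leaf child is a plain string (the word).
toy_treebank = [
    ("S", ("VP", ("V", "book"), ("NP", ("Det", "that"), ("N", "flight")))),
    ("S", ("NP", ("N", "flights")), ("VP", ("V", "leave"))),
]

def count_rules(tree, rule_counts, lhs_counts):
    """Walk one tree, counting each rule A -> beta and each occurrence of A as LHS."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_counts[(label, rhs)] += 1
    lhs_counts[label] += 1
    for child in children:
        if not isinstance(child, str):
            count_rules(child, rule_counts, lhs_counts)

def estimate_pcfg(treebank):
    """Relative-frequency estimate: P(A -> beta | A) = Count(A -> beta) / Count(A)."""
    rule_counts, lhs_counts = Counter(), Counter()
    for tree in treebank:
        count_rules(tree, rule_counts, lhs_counts)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

probs = estimate_pcfg(toy_treebank)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}  {p:.2f}")
```

In this toy data S is expanded twice, once as VP and once as NP VP, so each S rule receives probability 0.5.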

Page 7: Using Probabilities to Parse
- P(T): probability of a particular parse tree
- $P(T) = \prod_{n \in T} p(r(n))$
  - i.e., the product of the probabilities of the rules r used to expand each node n in the parse tree

Page 8: Computing probabilities
- We have the following rules and probabilities (adapted from Figure 14.1):
  - S → VP  .05
  - VP → V NP  .40
  - NP → Det N  .20
  - V → book  .30
  - Det → that  .05
  - N → flight  .25
- P(T) = P(S → VP) * P(VP → V NP) * … * P(N → flight) = .05 * .40 * .20 * .30 * .05 * .25 = .000015, or $1.5 \times 10^{-5}$
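
The same arithmetic can be reproduced by walking the tree and multiplying rule probabilities. This is a minimal sketch; the nested-tuple tree encoding and the name `tree_prob` are assumptions made for illustration, while the rule probabilities are the ones listed on the slide.

```python
import math

# Rule probabilities from the slide, keyed by (LHS, RHS).
rule_prob = {
    ("S", ("VP",)): 0.05,
    ("VP", ("V", "NP")): 0.40,
    ("NP", ("Det", "N")): 0.20,
    ("V", ("book",)): 0.30,
    ("Det", ("that",)): 0.05,
    ("N", ("flight",)): 0.25,
}

# Parse tree for "book that flight": (label, child, ...), words are strings.
tree = ("S", ("VP", ("V", "book"), ("NP", ("Det", "that"), ("N", "flight"))))

def tree_prob(tree, rule_prob):
    """Multiply the probability of the rule used at every node of the tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = rule_prob[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            prob *= tree_prob(child, rule_prob)
    return prob

p = tree_prob(tree, rule_prob)
print(f"{p:.1e}")  # 1.5e-05
```

For longer sentences these products become very small, so implementations commonly sum log probabilities instead; that variation is not shown here.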

Page 9: Using probabilities
- So, the probability for that parse is 0.000015. What does this mean?
  - Probabilities are useful for comparing with other probabilities
- Whereas we couldn't decide between two parses using a regular CFG, we now can
- For example, TWA flights is ambiguous between being two separate NPs (cf. I gave [NP John] [NP money]) or one NP:
  - A: [book [TWA] [flights]]
  - B: [book [TWA flights]]
- Probabilities allow us to choose parse B

Page 10: Obtaining the best parse
- Call the best parse T(S), where S is your sentence
  - Get the tree which has the highest probability, i.e.,
  - $T(S) = \arg\max_{T \in \text{parse-trees}(S)} P(T)$
- Can use the Cocke-Younger-Kasami (CYK) algorithm to calculate the best parse

Page 11: The CYK algorithm
- Base case
  - Add words to the chart
  - Store P(A → w_i) for every category A in the chart
- Recursive case
  - Get the probability for A at this node by multiplying the probabilities for B and for C by P(A → B C):
    - P(B) * P(C) * P(A → B C)
  - Only calculate this once
  - Non-lexical rules must be of the form A → B C, i.e., exactly two items on the RHS (Chomsky Normal Form, CNF)
- For a given A, only keep the maximum probability
  - Previously, we kept A, but without any probabilities
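
The base and recursive cases can be sketched as a small probabilistic CYK parser. This is an illustration under stated assumptions, not a reference implementation: the CNF grammar below is a hypothetical toy (the slide grammar with S → VP is not in CNF), and the chart layout and the name `cyk_best_parse` are invented for the example.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky Normal Form (not from the slides).
lexical = {  # P(A -> word)
    ("V", "book"): 1.0,
    ("Det", "that"): 1.0,
    ("N", "flight"): 0.8,
    ("N", "book"): 0.2,
}
binary = {  # P(A -> B C)
    ("S", "V", "NP"): 1.0,
    ("NP", "Det", "N"): 1.0,
}

def cyk_best_parse(words, lexical, binary, start="S"):
    """Return (probability, tree) for the best parse, or None if there is none."""
    n = len(words)
    # best[(i, j)][A] = (probability, tree) of the best A spanning words[i:j]
    best = defaultdict(dict)
    # Base case: store P(A -> w_i) for every category A that can produce w_i.
    for i, word in enumerate(words):
        for (cat, w), p in lexical.items():
            if w == word:
                best[(i, i + 1)][cat] = (p, (cat, word))
    # Recursive case: combine two adjacent spans with a rule A -> B C.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point
                for (a, b, c), p_rule in binary.items():
                    if b in best[(i, k)] and c in best[(k, j)]:
                        p_b, t_b = best[(i, k)][b]
                        p_c, t_c = best[(k, j)][c]
                        p = p_b * p_c * p_rule
                        # For a given A over this span, keep only the maximum.
                        if p > best[(i, j)].get(a, (0.0, None))[0]:
                            best[(i, j)][a] = (p, (a, t_b, t_c))
    return best[(0, n)].get(start)

result = cyk_best_parse("book that flight".split(), lexical, binary)
print(result)
```

Storing the subtree next to each probability keeps the sketch short; a larger implementation would more likely store backpointers and rebuild the tree at the end.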

Page 12: Problems with PCFGs
- It's still only a CFG, so dependencies on non-CFG info are not captured
  - e.g., pronouns are more likely to be subjects than objects:
  - P[(NP → Pronoun) | NP=subj] >> P[(NP → Pronoun) | NP=obj]
- Ignores lexical information (statistics), which is usually crucial for disambiguation
  - (T1) America sent [[250,000 soldiers] [into Iraq]]
  - (T2) America sent [250,000 soldiers] [into Iraq]
  - send with into-PP is the correct analysis (T2) because they "go well" together
- To handle lexical information, we'll turn to lexicalized PCFGs

Page 13: Lexicalized Grammars
- Remember how head information is passed up in a syntactic analysis?
  - e.g., VP[head [1]] → V[head [1]] NP
  - If you follow this down all the way to the bottom of a tree, you wind up with a head word
- In some sense, we can say that Book that flight is not just an S, but an S rooted in book
  - Thus, book is the headword of the whole sentence
- By adding headword information to nonterminals, we wind up with a lexicalized grammar

Page 14: Lexicalized PCFGs
- Lexicalized Parse Trees
  - Each PCFG rule in a tree is augmented to identify one RHS constituent to be the head daughter
  - The headword for a node is set to the head word of its head daughter
[Figure: lexicalized parse tree; surviving node labels include the headwords [book] and [flight]]

Page 15: Incorporating Head Probabilities: Wrong Way
- Simply adding headword w to a node won't work:
  - So, the node A becomes A[w]
  - e.g., P(A[w] → β | A) = Count(A[w] → β) / Count(A)
- The probabilities are too small, i.e., we don't have a big enough corpus to calculate these probabilities
  - VP(dumped) → VBD(dumped) NP(sacks) PP(into)  $3 \times 10^{-10}$
  - VP(dumped) → VBD(dumped) NP(cats) PP(into)  $8 \times 10^{-11}$
- These probabilities are tiny, and other rules will never occur in the corpus at all

Page 16: Incorporating head probabilities: Right way
- Previously, we conditioned on the mother node (A):
  - P(A → β | A)
- Now, we can condition on the mother node and the headword of A (h(A)):
  - P(A → β | A, h(A))
- We're no longer conditioning simply on the mother category A, but on the mother category when h(A) is the head
  - e.g., P(VP → VBD NP PP | VP, dumped)
  - The likelihood of VP expanding to VBD NP PP when dumped is the head differs from the likelihood when ate is the head

Page 17: Calculating rule probabilities
- We'll write the probability more generally as:
  - P(r(n) | n, h(n))
  - where n = node, r = rule, and h = headword
- We calculate this by comparing how many times the rule occurs with h(n) as the headword versus how many times the mother/headword combination appears in total:
  - $P(\text{VP} \to \text{VBD NP PP} \mid \text{VP}, \text{dumped}) = \dfrac{C(\text{VP(dumped)} \to \text{VBD NP PP})}{\sum_{\beta} C(\text{VP(dumped)} \to \beta)}$
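
The count ratio can be sketched directly. All counts below are hypothetical numbers invented for illustration (they are not treebank statistics), and the name `lexicalized_rule_prob` is likewise an assumption.

```python
from collections import Counter

# Hypothetical counts C(A(head) -> beta), invented purely for illustration.
lex_rule_counts = Counter({
    ("VP", "dumped", ("VBD", "NP", "PP")): 6,
    ("VP", "dumped", ("VBD", "NP")): 3,
    ("VP", "dumped", ("VBD", "PP")): 1,
    ("VP", "ate", ("VBD", "NP")): 8,
    ("VP", "ate", ("VBD", "NP", "PP")): 2,
})

def lexicalized_rule_prob(cat, head, rhs, counts):
    """P(cat -> rhs | cat, head) = C(cat(head) -> rhs) / sum over beta of C(cat(head) -> beta)."""
    total = sum(n for (c, h, _rhs), n in counts.items() if c == cat and h == head)
    if total == 0:
        return 0.0  # unseen (category, headword) pair; a real parser would smooth here
    return counts[(cat, head, rhs)] / total

p_dumped = lexicalized_rule_prob("VP", "dumped", ("VBD", "NP", "PP"), lex_rule_counts)
p_ate = lexicalized_rule_prob("VP", "ate", ("VBD", "NP", "PP"), lex_rule_counts)
print(p_dumped, p_ate)
```

With these made-up counts the VBD NP PP expansion is more likely under dumped (6 of 10) than under ate (2 of 10), which is the kind of contrast conditioning on the headword is meant to capture.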

Page 18: Adding info about word-word dependencies
- We want to take into account one other factor: the probability of being a head word (in a given context)
  - P(h(n)=word | …)
- We condition this probability on two things: 1. the category of the node (n), and 2. the headword of the mother (h(m(n)))
  - P(h(n)=word | n, h(m(n))), shortened as: P(h(n) | n, h(m(n)))
  - P(sacks | NP, dumped)
- What we're really doing is factoring in how words relate to each other
- These are dependency relations: sacks is dependent on dumped, in this case

Page 19: Putting it all together
- See sec. 14.6 for an example lexicalized parse tree for workers dumped sacks into a bin
- For rules r, category n, head h, mother m:
  - $P(T) = \prod_{n \in T} p(r(n) \mid n, h(n)) \cdot p(h(n) \mid n, h(m(n)))$
  - p(r(n) | n, h(n)): subcategorization info, e.g., P(VP → VBD NP PP | VP, dumped)
  - p(h(n) | n, h(m(n))): dependency info between words, e.g., P(sacks | NP, dumped)
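
The two-factor product can be sketched over a small tree fragment. Everything numeric here is hypothetical: the probability tables are invented, only two nodes of the example tree are scored, and each node is given as a flat record (category, headword, RHS, mother's headword) rather than as a real tree structure.

```python
import math

# Hypothetical probability tables, invented for illustration only.
rule_given_head = {  # p(r(n) | n, h(n)): subcategorization info
    ("VP", "dumped", ("VBD", "NP", "PP")): 0.6,
    ("NP", "sacks", ("NNS",)): 0.5,
}
head_given_mother = {  # p(h(n) | n, h(m(n))): dependency info, keyed (head, category, mother head)
    ("dumped", "VP", "dumped"): 1.0,
    ("sacks", "NP", "dumped"): 0.2,
}

# Two nodes from "workers dumped sacks into a bin":
# (category, headword, RHS of the rule used, headword of the mother)
nodes = [
    ("VP", "dumped", ("VBD", "NP", "PP"), "dumped"),
    ("NP", "sacks", ("NNS",), "dumped"),
]

def lexicalized_tree_prob(nodes, rule_given_head, head_given_mother):
    """Product over nodes of p(r(n) | n, h(n)) * p(h(n) | n, h(m(n)))."""
    prob = 1.0
    for cat, head, rhs, mother_head in nodes:
        prob *= rule_given_head[(cat, head, rhs)]
        prob *= head_given_mother[(head, cat, mother_head)]
    return prob

fragment_prob = lexicalized_tree_prob(nodes, rule_given_head, head_given_mother)
print(fragment_prob)
```

Scoring a full tree would extend `nodes` to every node and would need a convention for the root, which has no mother; that choice is left open in this sketch.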

Page 20: Evaluating Parser Output
- Traditional measures of parser accuracy:
  - Labeled bracketing precision: # correct constituents in parse / # constituents in parse
  - Labeled bracketing recall: # correct constituents in parse / # (correct) constituents in treebank parse
- There are known problems with these measures, so people are trying to use dependency-based measures instead
  - How many dependency relations did the parse get correct?
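
Labeled bracketing precision and recall can be sketched by treating each constituent as a (label, start, end) triple. The spans below are made up for illustration, and representing a parse as a set is a simplifying assumption (it ignores repeated identical constituents).

```python
def labeled_precision_recall(predicted, gold):
    """Return (precision, recall) over labeled constituents given as (label, start, end)."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Made-up constituents for a three-word sentence.
gold = {("S", 0, 3), ("VP", 0, 3), ("NP", 1, 3)}
predicted = {("S", 0, 3), ("VP", 0, 3), ("NP", 0, 2), ("NP", 2, 3)}

precision, recall = labeled_precision_recall(predicted, gold)
print(precision, recall)
```

Here two of the four predicted constituents match the treebank parse, so precision is 2/4, and two of the three treebank constituents are recovered, so recall is 2/3.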

