Tree Kernels for Parsing (Collins & Duffy, 2001)
Advanced Statistical Methods in NLP
Ling 572
February 28, 2012
Roadmap
- Collins & Duffy, 2001, Tree Kernels for Parsing:
  - Motivation
  - Parsing as reranking
  - Tree kernels for similarity
  - Case study: Penn Treebank parsing
Motivation: Parsing
- Parsing task: given a natural language sentence, extract its syntactic structure
  - Specifically, generate a corresponding parse tree
- Approaches:
  - "Classical" approach: hand-write CFG productions; parse with a standard algorithm, e.g. CKY
  - Probabilistic approach:
    - Build a large treebank of parsed sentences
    - Learn production probabilities
    - Parse with probabilistic versions of the standard algorithms
    - Pick the highest-probability parse
Parsing Issues
- Main issues:
  - Robustness: get a reasonable parse for any input
  - Ambiguity: select the best parse among alternatives
- "Classical" approach:
  - Hand-coded grammars are often fragile
  - No obvious mechanism to select among alternatives
- Probabilistic approach:
  - Fairly good robustness: any parse receives at least a small probability
  - Selects by probability, but decisions are local
  - Hard to capture more global structure
Approach: Parsing by Reranking
- Intuition:
  - Identify a collection of candidate parses for each sentence, e.g. the output of a PCFG parser
  - For training, identify gold-standard (parse, sentence) pairs
  - Create a parse tree vector representation
    - Identify (more global) parse tree features
  - Train a reranker to rank the gold standard highest
  - Apply it to rerank candidate parses for new sentences
Parsing as Reranking, Formally
- Training data pairs: {(s_i, t_i)}, where s_i is a sentence and t_i is its parse tree
- C(s_i) = {x_ij}: candidate parses for s_i
  - WLOG, x_i1 is the correct parse for s_i
- h(x_ij): feature vector representation of x_ij
- Training: learn a scoring function over candidates (see the reconstruction below)
- Decoding: compute the highest-scoring candidate parse
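The equations on this slide do not survive the transcription; a plausible reconstruction, following the standard linear reranking setup used by Collins & Duffy, is:

\[
\text{Training: learn } \mathbf{w} \text{ such that } f(x) = \mathbf{w} \cdot \mathbf{h}(x) \text{ ranks } x_{i1} \text{ above every } x_{ij},\ j \ge 2
\]
\[
\text{Decoding: } \hat{x} = \arg\max_{x \in C(s)} \mathbf{w} \cdot \mathbf{h}(x)
\]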
Parsing as Reranking: Training
- Consider the hard-margin SVM model: minimize ||w||^2 subject to constraints
- What constraints? Here, ranking constraints:
  - Specifically, the correct parse must outrank all other candidates
  - Formally: see the reconstruction below
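The formal constraint set is an image in the original; under the hard-margin ranking framing described above, it would take roughly this form (with a unit margin assumed):

\[
\min_{\mathbf{w}} \|\mathbf{w}\|^2 \quad \text{s.t.} \quad \mathbf{w} \cdot \mathbf{h}(x_{i1}) - \mathbf{w} \cdot \mathbf{h}(x_{ij}) \ge 1 \quad \forall i,\ \forall j \ge 2
\]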
Reformulating with α
- Training learns dual parameters α_ij that express the weight vector in terms of the training candidates (see below)
- Note: just like the SVM formulation, with a different constraint
- Parse scoring:
  - After substitution, the score is written in terms of inner products of feature vectors
  - After the kernel trick, those inner products become kernel evaluations
- Note: with a suitable kernel K, we never need to compute the h(x) vectors explicitly
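The substituted equations are missing from the transcription; in the dual representation used by Collins & Duffy for reranking, they take roughly this shape:

\[
\mathbf{w} = \sum_{i,j} \alpha_{ij} \left( \mathbf{h}(x_{i1}) - \mathbf{h}(x_{ij}) \right)
\]
\[
f(x) = \mathbf{w} \cdot \mathbf{h}(x) = \sum_{i,j} \alpha_{ij} \left( K(x_{i1}, x) - K(x_{ij}, x) \right), \qquad K(x, y) = \mathbf{h}(x) \cdot \mathbf{h}(y)
\]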
Parsing as Reranking: Perceptron Algorithm
- Similar to the SVM, learns a separating hyperplane
  - Modeled with a weight vector w
  - Using a simple iterative procedure
  - Based on correcting errors in the current model
- Algorithm (sketched in code below):
  - Initialize α_ij = 0
  - For i = 1,…,n; for j = 2,…,|C(s_i)|:
    - If f(x_i1) > f(x_ij): continue
    - Else: α_ij += 1
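A minimal Python sketch of this dual perceptron update, assuming a tree-kernel function kernel(x, y) and a list of candidate parses per sentence with the gold parse at index 0 (these names and the data layout are illustrative, not from the slides):

```python
def train_reranking_perceptron(candidates, kernel, epochs=1):
    # alpha[(i, j)] counts how often candidate j of sentence i was mistakenly
    # ranked at least as high as the gold parse candidates[i][0].
    alpha = {}

    def score(x):
        # f(x) = sum_{i,j} alpha_ij * (K(x_i1, x) - K(x_ij, x))
        return sum(a * (kernel(candidates[i][0], x) - kernel(candidates[i][j], x))
                   for (i, j), a in alpha.items())

    for _ in range(epochs):
        for i, cands in enumerate(candidates):
            gold = cands[0]
            for j in range(1, len(cands)):          # 0-indexed: j=1.. are the non-gold candidates
                # Update only when the gold parse does not outscore candidate j.
                if score(gold) <= score(cands[j]):
                    alpha[(i, j)] = alpha.get((i, j), 0) + 1
    return alpha, score
```

Decoding a new sentence then amounts to scoring each of its candidate parses with the returned score function and taking the argmax.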
Defining the Kernel
- So we have a model:
  - Framework for training
  - Framework for decoding
- But we still need to define a kernel K
- We need: K: X × X → R
  - Recall that here X is a parse tree, and K is a similarity function
What's in a Kernel?
- What are good attributes of a kernel here?
  - Captures similarity between instances (here, between parse trees)
  - Captures more global parse information than a PCFG
  - Is computable tractably, even over complex, large trees
Tree Kernel Proposal
- Idea:
  - PCFG models learn MLE probabilities for rewrite rules
    - e.g. NP → N vs. NP → DT N vs. NP → PN vs. NP → DT JJ N
    - Local to parent:children levels
  - The new measure incorporates all tree fragments of a parse
    - Captures higher-order, longer-distance dependencies
    - Tracks counts of individual rules, plus much more
Tree Fragment Example
- Fragments of the NP over 'apple' (not exhaustive; example figure not reproduced here)
Tree Representation
- Tree fragments: any subgraph with more than one node
  - Restriction: must include full (not partial) rule productions
- Parse tree representation: h(T) = (h_1(T), h_2(T), …, h_n(T))
  - n: number of distinct tree fragments in the training data
  - h_i(T): number of occurrences of the ith tree fragment in the current tree
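For concreteness (an illustrative enumeration, not taken from the slides), a two-word NP such as (NP (DT the) (N apple)) contains six fragments under this definition:

(NP DT N)
(NP (DT the) N)
(NP DT (N apple))
(NP (DT the) (N apple))
(DT the)
(N apple)

so h(T) for this tree has a count of 1 in each of the six corresponding dimensions and 0 elsewhere.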
Tree Representation
- Pros:
  - Fairly intuitive model
  - Natural inner-product interpretation
  - Captures long- and short-range dependencies
- Cons:
  - Size: the number of subtrees is exponential in the size of the tree
  - Direct computation of the inner product is intractable
Key Challenge
- Efficient computation: find a kernel that computes similarity efficiently, in terms of common subtrees
- Pure enumeration is clearly intractable
- Instead, compute recursively over subtrees, using a polynomial-time process
Counting Common Subtrees
- Example: C(n1,n2) = the number of common subtrees rooted at n1 and n2
- (Example trees over the phrase 'a red apple', due to F. Xia, not reproduced here)
Calculating C(n1,n2)
- Given two subtrees rooted at n1 and n2:
  - If the productions at n1 and n2 are different: C(n1,n2) = 0
  - If the productions at n1 and n2 are the same, and n1 and n2 are preterminals: C(n1,n2) = 1
  - Else: recurse over the children (see the sketch below), where
    - nc(n1) = number of children of n1
      - What about n2? The same production implies the same number of children
    - ch(n1,j) = the jth child of n1
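The "Else" case is an image in the original; in Collins & Duffy's recursion it is C(n1,n2) = prod over j = 1..nc(n1) of (1 + C(ch(n1,j), ch(n2,j))). A minimal Python sketch of the whole recursion, assuming a simple Node class where a preterminal is a node whose single child is a word (this class and its helpers are illustrative, not the paper's code):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str                                   # nonterminal or word
    children: List["Node"] = field(default_factory=list)

def production(n: Node):
    # The rewrite rule at n, e.g. ('NP', ('DT', 'N')).
    return (n.label, tuple(c.label for c in n.children))

def is_preterminal(n: Node) -> bool:
    # A preterminal rewrites to a single word (a childless leaf).
    return len(n.children) == 1 and not n.children[0].children

def common_subtrees(n1: Node, n2: Node) -> int:
    """C(n1, n2): the number of common fragments rooted at n1 and n2."""
    if not n1.children or not n2.children:
        return 0                                 # bare words root no fragments
    if production(n1) != production(n2):
        return 0
    if is_preterminal(n1) and is_preterminal(n2):
        return 1
    total = 1
    # The same production implies the same number of children, so zip is safe.
    for c1, c2 in zip(n1.children, n2.children):
        total *= 1 + common_subtrees(c1, c2)
    return total
```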
Components of Tree Representation
- Tree representation: h(T) = (h_1(T), h_2(T), …, h_n(T))
- Consider two trees T_1, T_2 with N_1 and N_2 nodes, respectively
- Define: I_i(n) = 1 if the ith tree fragment is rooted at n, and 0 otherwise
- Then h_i(T) and the inner product h(T_1) · h(T_2) can be rewritten as sums over nodes (see below)
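The slide's equations are images in the original; reconstructed from the definitions above, following Collins & Duffy (2001), they are:

\[
h_i(T_1) = \sum_{n_1 \in T_1} I_i(n_1), \qquad h_i(T_2) = \sum_{n_2 \in T_2} I_i(n_2)
\]
\[
\mathbf{h}(T_1) \cdot \mathbf{h}(T_2)
  = \sum_i h_i(T_1)\, h_i(T_2)
  = \sum_{n_1 \in T_1} \sum_{n_2 \in T_2} \sum_i I_i(n_1)\, I_i(n_2)
  = \sum_{n_1 \in T_1} \sum_{n_2 \in T_2} C(n_1, n_2)
\]

where the sums over n_1, n_2 range over the nodes of T_1 and T_2, and C(n_1, n_2) = \sum_i I_i(n_1) I_i(n_2) is exactly the count of common fragments computed by the recursion above.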
Tree Kernel Computation
- Running time: K(T_1, T_2) is O(N_1 N_2), based on the recursive computation of C
- Remaining issues:
  - K(T_1, T_2) depends on the sizes of T_1 and T_2
  - K(T_1, T_1) >> K(T_1, T_2) for T_1 != T_2
    - e.g. on the order of 10^6 vs. 10^2: the kernel is very 'peaked'
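Continuing the illustrative sketch from above (same Node and common_subtrees), the full kernel is the sum of C over all node pairs; memoizing common_subtrees over node pairs (e.g. keyed by id(n1), id(n2), omitted here for brevity) gives the stated O(N_1 N_2) bound:

```python
def nodes(t: Node):
    # All internal nodes of the tree (bare words root no fragments).
    stack, out = [t], []
    while stack:
        n = stack.pop()
        if n.children:
            out.append(n)
            stack.extend(n.children)
    return out

def tree_kernel(t1: Node, t2: Node) -> int:
    # K(T1, T2) = sum over all node pairs of C(n1, n2).
    return sum(common_subtrees(n1, n2) for n1 in nodes(t1) for n2 in nodes(t2))

# Sanity check on the earlier example: the NP 'the apple' has six fragments.
np_tree = Node("NP", [Node("DT", [Node("the")]), Node("N", [Node("apple")])])
assert tree_kernel(np_tree, np_tree) == 6
```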
Improvements
- Managing tree size:
  - Normalize! (like cosine similarity; see below)
  - Downweight large trees:
    - Restrict depth: just threshold
    - Rescale with a weight λ
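The slide gives no formula for the normalization step; the usual cosine-style form it refers to would be:

\[
K'(T_1, T_2) = \frac{K(T_1, T_2)}{\sqrt{K(T_1, T_1)\, K(T_2, T_2)}}
\]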
Rescaling
- Given two subtrees rooted at n1 and n2, and a weight 0 < λ <= 1:
  - If the productions at n1 and n2 are different: C(n1,n2) = 0
  - If the productions at n1 and n2 are the same, and n1 and n2 are preterminals: C(n1,n2) = λ
  - Else: the recursive case is scaled by λ (see below)
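The "Else" case is again an image in the original; in Collins & Duffy's rescaled recursion it becomes:

\[
C(n_1, n_2) = \lambda \prod_{j=1}^{nc(n_1)} \bigl(1 + C(\mathrm{ch}(n_1, j), \mathrm{ch}(n_2, j))\bigr)
\]

which has the effect of downweighting larger fragments, since each rule production in a shared fragment contributes another factor of λ.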
Case Study
Parsing Experiment
- Data: Penn Treebank ATIS corpus segment
  - Training: 800 sentences, top 20 parses per sentence
  - Development: 200 sentences
  - Test: 336 sentences; select the best candidate from the top 100 parses
- Classifier: voted perceptron
  - Kernelized like an SVM, but more computationally tractable
- Evaluation: 10 runs, average parse score reported
Experimental Results
- Baseline system: 74%
- Substantial improvement: 6% absolute in parse score
Summary
- Parsing as a reranking problem
- Tree kernel:
  - Computes similarity between trees based on shared fragments
  - Efficient recursive computation procedure
- Yields improved performance on the parsing task