The Cocke-Kasami-Younger Algorithm An example of a CFG in CNF An example of bottom-up parsing, for CFG in Chomsky normal form G : S AB | BB A CC | AB | a B BB | CA | b C BA | AA | b 2 possibilities for first production S S S A B A B A B a abb aa bb aab b S S S Possible splits for the string aabb B B B B B B a abb aa bb aab b
The CKYounger Algorithm Provides an efficient way of generating substring devisions and checking whether each substring can be legally derived Thus if the cell (4,1) contains S, string L(G) A non terminal will be placed in the cell (i,j) if it can derive i consecutive symbols of the string starting at jth position 2,1 4,1 3,2 3,1 2,3 2,2 1,1 1,4 1,3 1,2 b a If the cell (i,j) contains the nonterminal A1 and the cell (i’,i+j) contains the nonterminal A2 and there is a production A A1 A2 then the cell (i+i’,j) will contain the nonterminal A
The CKYounger Algorithm Provides an efficient way of generating substring devisions and checking whether each substring can be legally derived G : S AB | BB A CC | AB | a B BB | CA | b C BA | AA | b A nonterminal will be placed in the cell (i,j) if it can derive i consecutive symbols of the string starting at jth position 2,1 4,1 3,2 3,1 2,3 2,2 1,1 1,4 1,3 1,2 b a
The Cocke-Kasami-Younger Algorithm Relation derivation tree and pyramid S S S A B A B A B a abb aa bb aab b B S A a A S B a A S B a
S S S B B B B B B a abb aa bb aab b B S a B S a B S a
The Cocke-Kasami-Younger Algorithm 5/3/2019 Builds up the pyramid in a bottom-up fashion G : S AB | BB A CC | AB | a B BB | CA | b C BA | AA | b A B,C a Step 1, fill the cell at row 1 Because of A a Because of B b, and C b
The Cocke-Kasami-Younger Algorithm Builds up the pyramid in a bottom-up fashion G : S AB | BB A CC | AB | a B BB | CA | b C BA | AA | b B is in cell (2,3) Because of B BB and B is in cell (1,3) and B is in cell (1,4) A C S,B B,C a Step 2, fill the cell at row 2 C is in cell (2,1) Because of C AA and A is in cell (1,1) and A is in cell (1,2) A is in cell (2,2) Because of A AB and A is in cell (1,2) and B is in cell (1,3) S is in cell (2,3) Because of S BB and B is in cell (1,3) and B is in cell (1,4)
The Cocke-Kasami-Younger Algorithm Builds up the pyramid in a bottom-up fashion G : S AB | BB A CC | AB | a B BB | CA | b C BA | AA | b C is in cell (3,1) Because of C AA and A is in cell (1,1) and A is in cell (2,2) A C A,C S,A,C S,B B,C a Step 3, fill the cell at row 3 ? is in cell (3,1) Because of ? XY X is in cell (1,1) Y is in cell (2,2) or X is in cell (2,1) Y is in cell (1,3) or A is in cell (3,1) Because of C CC and C is in cell (2,1) and C is in cell (1,3)
The Cocke-Kasami-Younger Algorithm Builds up the pyramid in a bottom-up fashion G : S AB | BB A CC | AB | a B BB | CA | b C BA | AA | b Since S is at the top, aabb L(G) A C A,C S,A,C A,B,C,S S,B B,C a Step 4, fill the cell at row 4 S General rule ? is in cell (i,j) Because of ? XY X is in cell (m,j) Y is in cell (i-m,j+m) with 1 ≤ m ≤ i-1 Step i A B b C C A A b a a
The CKY algorithm is correct Theorem The CKY algorithm is correct Given a grammar (T, N, P, S) in Chomsky normal form and w = x1 ... xn T* then A N is in cell (i,j) of the CKY pyramid if and only if A xj ... xj+i-1 Proof by induction on the row number Base step i= 1 in row 1 we get the nonterminals from which length 1 substrings of the string to parse can be derived. This is only possible by using productions of type A a. Thus if A is in cell (1,i), 1 ≤ i ≤ n, then A xi P, thus A xi Induction hypothesis theorem applies for all rows < i, i.e. all substrings of length < i. * *
* * * * * Induction step we first prove Assume a derivation of a substring of length i, i>1, A BC xj ... xj+i-1, then for some m > 0 there must hold that B xj ... xj+m-1 and C xj+m ... xj+i-1. Thus by the induction hypothesis if B is in cell (m,j) and C in the cell (i-m, j+m). Since there is a production A BC, A is in the cell (i,j). We now prove Assume A is in the cell (i,j), then form A we can derive a string xj ... xj+i-1, with length i > 1, therefore there must be a production of the form A BC with B,C N, and for some m, 1 ≤ m ≤ i-1, B is in cell (m,j) and C in the cell (i-m, j+m). By the induction hypothesis we have B xj ... xj+m-1 and C xj+m ... xj+i-1. Therefore we can write A BC xj ... xj+i-1 and conclude A xj ... xj+i-1 * * * * * Both cells have a lower row #, so induction hypothesis applies
The complexity of the CKY algorithm The time complexity for wL(G)? Let G = (T, N, P, S) be a CFG in Chomsky normal form, with k = #N. Then using the CKY algorithm, w L(G) can be decided in time proportional to n3 , where n = |w|. Proof First notice that the number of entries in a cell is at most k. maximum number of productions is k3, I Complexity for row 1 cells For each A N, we have to check if it can be placed in cell(1,i), i.e. if A derives (in 1 step) the terminal on position i. There are k nonterminals, thus cost per cell is k X 1. There are n row 1 cells, thus total cost for row 1 = kn. Each nonterminal can only occur once in a cell A BC Cfr. 3
Since k is independent of n II Complexity for cell in a row > 1 The content of a cell is the result of at most n-1 pairings of lower cells. For each paring at most k nonterminals are paired with at most k other nonterminals, and each pairing is checked against at most k3 productions. Thus for each cell : cost ≤ k X k X k3 X 1 X (n-1) = k5 X (n-1) There are (n-1)+ (n-2) + …. + 1 = n(n-1)/2 cells in rows 2 to n, thus total cost for these rows is bounded above by n(n-1)/2 X k5 X (n-1) To conclude : The total cost is bounded above by : kn + n(n-1)/2 X k5 X (n-1) See slide 119 Cfr. 1 and 2 Since k is independent of n the conclusion is O(n3)
See course on compilers for faster algorithms Some remarks Not really of practical use since O(n3) is too slow the grammar must be converted to CNF only tests membership, this is not the complexity for building the derivation tree See course on compilers for faster algorithms Semantics!!!! To think about : CKY and unambiguous grammars.