CYK (Cocke-Younger-Kasami) Parsing Algorithm
Amirkabir University of Technology, Department of Computer Engineering
Seyed Mohammad Hossein Moattar
Natural Language Processing
Parsing Algorithms
CFGs are the basis for describing the (syntactic) structure of NL sentences
Thus, parsing algorithms are the core of NL analysis systems
Recognition vs. parsing:
  Recognition: deciding whether a sentence is a member of the language
  Parsing: recognition plus producing a parse tree for the sentence
Is parsing more “difficult” than recognition? (time complexity)
Ambiguity: an input may have exponentially many parses
Parsing Algorithms
Parsing general CFLs vs. limited forms
Efficiency:
  Deterministic (LR) languages can be parsed in linear time
  A number of parsing algorithms for general CFLs require O(n^3) time
  The asymptotically best parsing algorithm for general CFLs requires O(n^2.37) time, but is not practical
Utility: why parse general grammars and not just CNF?
  A grammar is intended to reflect the actual structure of the language
  Conversion to CNF completely destroys the parse structure
CYK (Cocke-Younger-Kasami)
One of the earliest recognition and parsing algorithms
The standard version of CYK can only recognize languages defined by context-free grammars in Chomsky Normal Form (CNF)
It is also possible to extend the CYK algorithm to handle some grammars that are not in CNF
  Harder to understand
Based on a “dynamic programming” approach:
  Build solutions compositionally from sub-solutions
  Store sub-solutions and re-use them whenever necessary
Uses the grammar directly (no PDA is used)
Recognition version: decide whether S ⇒* w
CYK Algorithm
The CYK algorithm for the membership problem is as follows:
  Let the input string be a sequence of n letters a1 ... an.
  Let the grammar contain r nonterminal symbols R1 ... Rr, and let R1 be the start symbol.
  Let P[n,n,r] be an array of booleans. Initialize all elements of P to false.
  For each i = 1 to n
    For each unit production Rj -> ai, set P[i,1,j] = true.
  For each i = 2 to n                -- Length of span
    For each j = 1 to n-i+1          -- Start of span
      For each k = 1 to i-1          -- Partition of span
        For each production RA -> RB RC
          If P[j,k,B] and P[j+k,i-k,C] then set P[j,i,A] = true
  If P[1,n,1] is true
    Then string is member of language
    Else string is not member of language
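The pseudocode above can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the function name `cyk_recognize` and the encoding of rules as tuples are assumptions made for illustration, and the empty string is assumed to be handled separately by the caller. The grammar used is the example grammar G that appears later in these slides.

```python
def cyk_recognize(word, nonterminals, unit_rules, binary_rules, start):
    """Return True if `start` derives `word`; the grammar must be in CNF."""
    n = len(word)
    if n == 0:
        return False  # assumption: S -> ε is checked by the caller, not here
    index = {name: j for j, name in enumerate(nonterminals)}
    r = len(nonterminals)
    # P[i][l][j]: nonterminal j derives the substring of length l that starts
    # at position i (1-based, as on the slide); index 0 is left unused.
    P = [[[False] * r for _ in range(n + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        for head, terminal in unit_rules:
            if terminal == word[i - 1]:
                P[i][1][index[head]] = True
    for length in range(2, n + 1):                  # length of span
        for begin in range(1, n - length + 2):      # start of span
            for split in range(1, length):          # partition of span
                for head, left, right in binary_rules:
                    if (P[begin][split][index[left]]
                            and P[begin + split][length - split][index[right]]):
                        P[begin][length][index[head]] = True
    return P[1][n][index[start]]


# Example grammar G from the slides (without the S -> ε rule).
NONTERMINALS = ["S", "T", "X", "A", "B"]
UNIT_RULES = [("A", "a"), ("B", "b")]
BINARY_RULES = [("S", "A", "B"), ("S", "X", "B"),
                ("T", "A", "B"), ("T", "X", "B"), ("X", "A", "T")]

print(cyk_recognize("aaabbb", NONTERMINALS, UNIT_RULES, BINARY_RULES, "S"))  # True
print(cyk_recognize("aaabb", NONTERMINALS, UNIT_RULES, BINARY_RULES, "S"))   # False
```

The three nested loops over length, start, and partition mirror the slide one-to-one, which makes the O(n^3) behaviour easy to see.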
CYK Pseudocode
On input x = x1 x2 … xn:
  for (i = 1 to n)                          // create middle diagonal
    for (each var. A)
      if (A → xi)
        add A to table[i-1][i]
  for (d = 2 to n)                          // d'th diagonal
    for (i = 0 to n-d)
      for (k = i+1 to i+d-1)
        for (each var. B in table[i][k])
          for (each var. C in table[k][i+d])
            if (A → BC)
              add A to table[i][i+d]
  return S ∈ table[0][n] ? ACCEPT : REJECT
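A set-based version of this pseudocode might look like the sketch below. It assumes that table[i][j] holds the variables deriving the substring between boundaries i and j, so a span on diagonal d runs from i to i+d. The name `cyk_table`, the dictionary keyed by `(i, j)`, and the tuple encoding of rules are illustrative choices, not part of the slides.

```python
from collections import defaultdict


def cyk_table(x, unit_rules, binary_rules):
    """Fill the CYK table; table[(i, j)] is the set of variables deriving x[i:j]."""
    n = len(x)
    table = defaultdict(set)
    for i in range(1, n + 1):                 # middle diagonal
        for head, terminal in unit_rules:
            if terminal == x[i - 1]:
                table[(i - 1, i)].add(head)
    for d in range(2, n + 1):                 # d'th diagonal
        for i in range(0, n - d + 1):
            for k in range(i + 1, i + d):     # split point between i and i+d
                for head, left, right in binary_rules:
                    if left in table[(i, k)] and right in table[(k, i + d)]:
                        table[(i, i + d)].add(head)
    return table


# Example grammar G from the slides (S -> ε is not represented here).
UNIT_RULES = [("A", "a"), ("B", "b")]
BINARY_RULES = [("S", "A", "B"), ("S", "X", "B"),
                ("T", "A", "B"), ("T", "X", "B"), ("X", "A", "T")]

table = cyk_table("aaabbb", UNIT_RULES, BINARY_RULES)
print("ACCEPT" if "S" in table[(0, 6)] else "REJECT")  # ACCEPT
```

Looping over productions instead of over pairs (B, C) gives the same table and keeps the sketch short.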
CYK Algorithm
This algorithm considers every possible consecutive subsequence of the sequence of letters and sets P[i,j,k] to true if the sequence of letters starting at position i and of length j can be generated from Rk. Once it has considered sequences of length 1, it goes on to sequences of length 2, and so on. For subsequences of length 2 and greater, it considers every possible partition of the subsequence into two parts, and checks whether there is some production RA -> RB RC such that RB matches the first part and RC matches the second part. If so, it records RA as matching the whole subsequence. Once this process is completed, the sentence is recognized by the grammar if the subsequence containing the entire string is matched by the start symbol.
CYK Algorithm for Deciding Context Free Languages
Q: Consider the grammar G given by
  S → ε | AB | XB
  T → AB | XB
  X → AT
  A → a
  B → b
Is x = aaabb in L(G)?
Is x = aaabbb in L(G)?
CYK Algorithm for Deciding Context Free Languages
The algorithm is “bottom-up” in that we start with the bottom of the derivation tree.
Grammar: S → ε | AB | XB    T → AB | XB    X → AT    A → a    B → b
Input: a a a b b
CYK Algorithm for Deciding Context Free Languages
1) Write variables for all length 1 substrings
Grammar: S → ε | AB | XB    T → AB | XB    X → AT    A → a    B → b
Input: a a a b b
  length 1:  a: A    a: A    a: A    b: B    b: B
CYK Algorithm for Deciding Context Free Languages
2) Write variables for all length 2 substrings
Grammar: S → ε | AB | XB    T → AB | XB    X → AT    A → a    B → b
Input: a a a b b
  length 1:  a: A    a: A    a: A    b: B    b: B
  length 2:  aa: -    aa: -    ab: S,T    bb: -
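CYK Algorithm for Deciding Context Free Languages
3) Write variables for all length 3 substrings
Grammar: S → ε | AB | XB    T → AB | XB    X → AT    A → a    B → b
Input: a a a b b
  length 1:  a: A    a: A    a: A    b: B    b: B
  length 2:  aa: -    aa: -    ab: S,T    bb: -
  length 3:  aaa: -    aab: X    abb: -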
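CYK Algorithm for Deciding Context Free Languages
4) Write variables for all length 4 substrings
Grammar: S → ε | AB | XB    T → AB | XB    X → AT    A → a    B → b
Input: a a a b b
  length 1:  a: A    a: A    a: A    b: B    b: B
  length 2:  aa: -    aa: -    ab: S,T    bb: -
  length 3:  aaa: -    aab: X    abb: -
  length 4:  aaab: -    aabb: S,T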
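CYK Algorithm for Deciding Context Free Languages
5) Write variables for all length 5 substrings
Grammar: S → ε | AB | XB    T → AB | XB    X → AT    A → a    B → b
Input: a a a b b
  length 1:  a: A    a: A    a: A    b: B    b: B
  length 2:  aa: -    aa: -    ab: S,T    bb: -
  length 3:  aaa: -    aab: X    abb: -
  length 4:  aaab: -    aabb: S,T
  length 5:  aaabb: X
S is not included for the whole string, so aaabb is rejected. REJECT!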
CYK Algorithm for Deciding Context Free Languages
Now look at aaabbb:
Grammar: S → ε | AB | XB    T → AB | XB    X → AT    A → a    B → b
Input: a a a b b b
CYK Algorithm for Deciding Context Free Languages
1) Write variables for all length 1 substrings
Grammar: S → ε | AB | XB    T → AB | XB    X → AT    A → a    B → b
Input: a a a b b b
  length 1:  a: A    a: A    a: A    b: B    b: B    b: B
CYK Algorithm for Deciding Context Free Languages
2) Write variables for all length 2 substrings
Grammar: S → ε | AB | XB    T → AB | XB    X → AT    A → a    B → b
Input: a a a b b b
  length 1:  a: A    a: A    a: A    b: B    b: B    b: B
  length 2:  aa: -    aa: -    ab: S,T    bb: -    bb: -
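CYK Algorithm for Deciding Context Free Languages
3) Write variables for all length 3 substrings
Grammar: S → ε | AB | XB    T → AB | XB    X → AT    A → a    B → b
Input: a a a b b b
  length 1:  a: A    a: A    a: A    b: B    b: B    b: B
  length 2:  aa: -    aa: -    ab: S,T    bb: -    bb: -
  length 3:  aaa: -    aab: X    abb: -    bbb: -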
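CYK Algorithm for Deciding Context Free Languages
4) Write variables for all length 4 substrings
Grammar: S → ε | AB | XB    T → AB | XB    X → AT    A → a    B → b
Input: a a a b b b
  length 1:  a: A    a: A    a: A    b: B    b: B    b: B
  length 2:  aa: -    aa: -    ab: S,T    bb: -    bb: -
  length 3:  aaa: -    aab: X    abb: -    bbb: -
  length 4:  aaab: -    aabb: S,T    abbb: -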
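CYK Algorithm for Deciding Context Free Languages
5) Write variables for all length 5 substrings
Grammar: S → ε | AB | XB    T → AB | XB    X → AT    A → a    B → b
Input: a a a b b b
  length 1:  a: A    a: A    a: A    b: B    b: B    b: B
  length 2:  aa: -    aa: -    ab: S,T    bb: -    bb: -
  length 3:  aaa: -    aab: X    abb: -    bbb: -
  length 4:  aaab: -    aabb: S,T    abbb: -
  length 5:  aaabb: X    aabbb: -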
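CYK Algorithm for Deciding Context Free Languages
6) Write variables for all length 6 substrings
Grammar: S → ε | AB | XB    T → AB | XB    X → AT    A → a    B → b
Input: a a a b b b
  length 1:  a: A    a: A    a: A    b: B    b: B    b: B
  length 2:  aa: -    aa: -    ab: S,T    bb: -    bb: -
  length 3:  aaa: -    aab: X    abb: -    bbb: -
  length 4:  aaab: -    aabb: S,T    abbb: -
  length 5:  aaabb: X    aabbb: -
  length 6:  aaabbb: S,T
S is included, so aaabbb is accepted!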
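The two walkthroughs can be checked mechanically. The sketch below computes, for each substring length, the variable sets from left to right, which is the same row-by-row order used on the slides. The function name `cyk_rows` and the rule encoding are assumptions for illustration only.

```python
def cyk_rows(word, unit_rules, binary_rules):
    """Return one list per substring length; entry i is the variable set
    for the substring of that length starting at 0-based position i."""
    n = len(word)
    cell = {}
    for i in range(n):
        cell[(i, 1)] = {head for head, terminal in unit_rules if terminal == word[i]}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            found = set()
            for split in range(1, length):
                for head, left, right in binary_rules:
                    if (left in cell[(i, split)]
                            and right in cell[(i + split, length - split)]):
                        found.add(head)
            cell[(i, length)] = found
    return [[cell[(i, length)] for i in range(n - length + 1)]
            for length in range(1, n + 1)]


# Example grammar G from the slides (S -> ε is not represented here).
UNIT_RULES = [("A", "a"), ("B", "b")]
BINARY_RULES = [("S", "A", "B"), ("S", "X", "B"),
                ("T", "A", "B"), ("T", "X", "B"), ("X", "A", "T")]

for word in ("aaabb", "aaabbb"):
    rows = cyk_rows(word, UNIT_RULES, BINARY_RULES)
    for length, row in enumerate(rows, start=1):
        print(length, [",".join(sorted(v)) or "-" for v in row])
    print(word, "accepted" if "S" in rows[-1][0] else "rejected")
```

The last row always has a single entry, the cell for the whole string, and acceptance depends only on whether S is in it.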
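CYK Algorithm for Deciding Context Free Languages
Can also use a table for the same purpose. The entry in row “start at i” and column “end at j” holds the variables that derive the substring of aaabbb between positions i and j.
            end at 1   end at 2   end at 3   end at 4   end at 5   end at 6
start at 0
start at 1
start at 2
start at 3
start at 4
start at 5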
CYK Algorithm for Deciding Context Free Languages
1. Variables for length 1 substrings.
            end at 1   end at 2   end at 3   end at 4   end at 5   end at 6
start at 0  A
start at 1             A
start at 2                        A
start at 3                                   B
start at 4                                              B
start at 5                                                         B
CYK Algorithm for Deciding Context Free Languages
2. Variables for length 2 substrings.
            end at 1   end at 2   end at 3   end at 4   end at 5   end at 6
start at 0  A          -
start at 1             A          -
start at 2                        A          S,T
start at 3                                   B          -
start at 4                                              B          -
start at 5                                                         B
CYK Algorithm for Deciding Context Free Languages
3. Variables for length 3 substrings.
            end at 1   end at 2   end at 3   end at 4   end at 5   end at 6
start at 0  A          -          -
start at 1             A          -          X
start at 2                        A          S,T        -
start at 3                                   B          -          -
start at 4                                              B          -
start at 5                                                         B
CYK Algorithm for Deciding Context Free Languages
4. Variables for length 4 substrings.
            end at 1   end at 2   end at 3   end at 4   end at 5   end at 6
start at 0  A          -          -          -
start at 1             A          -          X          S,T
start at 2                        A          S,T        -          -
start at 3                                   B          -          -
start at 4                                              B          -
start at 5                                                         B
CYK Algorithm for Deciding Context Free Languages
5. Variables for length 5 substrings.
            end at 1   end at 2   end at 3   end at 4   end at 5   end at 6
start at 0  A          -          -          -          X
start at 1             A          -          X          S,T        -
start at 2                        A          S,T        -          -
start at 3                                   B          -          -
start at 4                                              B          -
start at 5                                                         B
CYK Algorithm for Deciding Context Free Languages
6. Variables for aaabbb. ACCEPTED!
            end at 1   end at 2   end at 3   end at 4   end at 5   end at 6
start at 0  A          -          -          -          X          S,T
start at 1             A          -          X          S,T        -
start at 2                        A          S,T        -          -
start at 3                                   B          -          -
start at 4                                              B          -
start at 5                                                         B
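The start/end table can also be produced programmatically. The sketch below fills a grid indexed as grid[start][end] and renders it as text; the names `cyk_grid` and `format_grid` and the column width are hypothetical choices made only for this illustration.

```python
def cyk_grid(word, unit_rules, binary_rules):
    """grid[i][j] is the variable set for the substring from boundary i to j,
    or None for cells that are never used (j <= i)."""
    n = len(word)
    grid = [[None] * (n + 1) for _ in range(n)]
    for i, ch in enumerate(word):
        grid[i][i + 1] = {head for head, terminal in unit_rules if terminal == ch}
    for d in range(2, n + 1):
        for i in range(n - d + 1):
            j = i + d
            grid[i][j] = {head
                          for k in range(i + 1, j)
                          for head, left, right in binary_rules
                          if left in grid[i][k] and right in grid[k][j]}
    return grid


def format_grid(grid):
    """Render the grid with one row per start position and one column per end."""
    n = len(grid)
    lines = ["start/end " + "".join(f"{j:<6}" for j in range(1, n + 1))]
    for i, row in enumerate(grid):
        cells = []
        for j in range(1, n + 1):
            if row[j] is None:
                cells.append(" " * 6)          # below the diagonal: unused
            else:
                text = ",".join(sorted(row[j])) or "-"
                cells.append(f"{text:<6}")
        lines.append(f"{i:<10}" + "".join(cells).rstrip())
    return "\n".join(lines)


# Example grammar G from the slides (S -> ε is not represented here).
UNIT_RULES = [("A", "a"), ("B", "b")]
BINARY_RULES = [("S", "A", "B"), ("S", "X", "B"),
                ("T", "A", "B"), ("T", "X", "B"), ("X", "A", "T")]

print(format_grid(cyk_grid("aaabbb", UNIT_RULES, BINARY_RULES)))
```

Only the cells on or above the diagonal are ever written, which matches the triangular shape of the table on the slide.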
Parsing results
We keep the results for every substring w_ij in a table.
Note that we only need to fill in entries up to the diagonal: the longest substring starting at position i has length n-i+1.
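As a worked count, here is how the triangular shape of the table translates into numbers. The symbol |R| for the number of binary productions is an assumption introduced here, and the second sum assumes that every production is tested at every split point, as in the pseudocode that loops over productions.

```latex
\[
\#\text{cells} \;=\; \sum_{\ell=1}^{n} (n-\ell+1) \;=\; \frac{n(n+1)}{2},
\qquad
\#\text{rule checks} \;=\; |R| \sum_{\ell=2}^{n} (n-\ell+1)(\ell-1)
\;=\; |R|\,\frac{n^{3}-n}{6} \;=\; O\!\left(n^{3}\,|R|\right).
\]
```

For aaabbb (n = 6) this gives 21 cells and 35 (cell, split) pairs, each of which is tested against every binary production.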
Constructing the parse tree
We need to construct parse trees for the string w
Idea:
  Keep back-pointers to the table entries that we combine
  At the end, reconstruct a parse from the back-pointers
This allows us to find all parse trees
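A minimal sketch of the back-pointer idea follows. It is a simplification: it keeps only the first back-pointer found for each (span, variable) pair, so it recovers a single parse tree rather than all of them. The name `cyk_parse` and the nested-tuple tree format are assumptions made for this example.

```python
def cyk_parse(word, unit_rules, binary_rules, start="S"):
    """Return one parse tree as nested tuples, or None if `word` is rejected."""
    n = len(word)
    # back[(i, j, A)] is either the terminal itself (for spans of length 1)
    # or a back-pointer (k, B, C) meaning A -> B C with B over [i, k), C over [k, j).
    back = {}
    for i, ch in enumerate(word):
        for head, terminal in unit_rules:
            if terminal == ch:
                back[(i, i + 1, head)] = ch
    for d in range(2, n + 1):
        for i in range(n - d + 1):
            j = i + d
            for k in range(i + 1, j):
                for head, left, right in binary_rules:
                    if (i, k, left) in back and (k, j, right) in back:
                        # keep only the first way of building head over [i, j)
                        back.setdefault((i, j, head), (k, left, right))

    def build(i, j, var):
        entry = back[(i, j, var)]
        if isinstance(entry, str):            # leaf: A -> a
            return (var, entry)
        k, left, right = entry
        return (var, build(i, k, left), build(k, j, right))

    return build(0, n, start) if (0, n, start) in back else None


# Example grammar G from the slides (S -> ε is not represented here).
UNIT_RULES = [("A", "a"), ("B", "b")]
BINARY_RULES = [("S", "A", "B"), ("S", "X", "B"),
                ("T", "A", "B"), ("T", "X", "B"), ("X", "A", "T")]

print(cyk_parse("aaabbb", UNIT_RULES, BINARY_RULES))
print(cyk_parse("aaabb", UNIT_RULES, BINARY_RULES))   # None
```

Storing every back-pointer instead of only the first one is what makes it possible to enumerate all parse trees.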
Ambiguity
Efficient representation of ambiguities
Local ambiguity packing:
  A local ambiguity: multiple ways to derive the same substring from a non-terminal
  All possible ways to derive each non-terminal are stored together
  When creating back-pointers, create a single back-pointer to the “packed” representation
Allows a very large number of ambiguities (even exponentially many) to be represented efficiently
Unpacking: producing one or more of the packed parse trees by following the back-pointers
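To illustrate packing, the sketch below stores, for each (span, variable) pair, the list of all ways to derive it, and then counts the packed parse trees without unpacking them. The ambiguous grammar S → S S | a is an assumption chosen for the example (it is not from the slides), as are the names `build_packed_chart` and `count_trees`.

```python
from collections import defaultdict


def build_packed_chart(word, unit_rules, binary_rules):
    """chart[(i, j, A)] lists every way to derive A over [i, j): a terminal,
    or a back-pointer (k, B, C) to the packed entries for B and C."""
    n = len(word)
    chart = defaultdict(list)
    for i, ch in enumerate(word):
        for head, terminal in unit_rules:
            if terminal == ch:
                chart[(i, i + 1, head)].append(ch)
    for d in range(2, n + 1):
        for i in range(n - d + 1):
            j = i + d
            for k in range(i + 1, j):
                for head, left, right in binary_rules:
                    if (i, k, left) in chart and (k, j, right) in chart:
                        chart[(i, j, head)].append((k, left, right))
    return chart


def count_trees(chart, i, j, var, memo=None):
    """Count the parse trees packed under (i, j, var) without enumerating them."""
    memo = {} if memo is None else memo
    key = (i, j, var)
    if key not in memo:
        total = 0
        for alternative in chart.get(key, []):
            if isinstance(alternative, str):
                total += 1
            else:
                k, left, right = alternative
                total += (count_trees(chart, i, k, left, memo)
                          * count_trees(chart, k, j, right, memo))
        memo[key] = total
    return memo[key]


# Hypothetical ambiguous grammar in CNF: S -> S S | a
AMBIGUOUS_UNIT = [("S", "a")]
AMBIGUOUS_BINARY = [("S", "S", "S")]

for n in range(1, 6):
    chart = build_packed_chart("a" * n, AMBIGUOUS_UNIT, AMBIGUOUS_BINARY)
    print(n, count_trees(chart, 0, n, "S"))   # 1, 1, 2, 5, 14
```

The chart holds at most one alternative per (span, split point, production), so its size stays polynomial even though the number of trees it represents grows exponentially with the input length.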
References
Hopcroft and Ullman, “Introduction to Automata Theory, Languages, and Computation”, Section 6.3, pp. 139-141
“CYK algorithm”, Wikipedia, the free encyclopedia
A presentation by Zeph Grunschlag