CSC 3130: Automata theory and formal languages Andrej Bogdanov The Chinese University of Hong Kong Normal forms and parsing Fall 2009
Testing membership and parsing. Given a grammar such as S → 0S1 | 1S0S1 | T, T → S | ε: How can we know if a string x is in its language? If so, can we obtain a parse tree for x? Can we tell if the parse tree is unique?
First attempt. Maybe we can try all possible derivations and look for x. Grammar: S → 0S1 | 1S0S1 | T, T → S | ε. Starting from S: S ⇒ 0S1, 1S0S1, T; 0S1 ⇒ 00S11, 01S0S11, 0T1; T ⇒ S; 1S0S1 ⇒ 10S10S1, ... When do we stop?
Problems. How do we know when to stop? With S → 0S1 | 1S0S1 | T, T → S | ε, the derivations from S keep branching: S ⇒ 0S1, 1S0S1; 0S1 ⇒ 00S11, 01S0S11, 0T1; 1S0S1 ⇒ 10S10S1, ...
Problems. Idea: stop a derivation when its length exceeds |x|. This is not right because of ε-productions: with S → 0S1 | 1S0S1 | T, T → S | ε, a sentential form can shrink again, as in S ⇒ 0S1 ⇒ 01S0S11 ⇒* 01S011 (via S ⇒ T ⇒ ε). So we might want to eliminate ε-productions.
Problems. Loops among the variables (S → T → S) might make us go forever, so we want to eliminate such loops. Grammar: S → 0S1 | 1S0S1 | T, T → S | ε; x = 00111.
Removal of ε-productions. A variable N is nullable if there is a derivation N ⇒* ε. How to remove ε-productions (except from S): (1) Find all nullable variables N_1, ..., N_k. (2) For every production of the form A → αN_iβ, add another production A → αβ. (3) If N_i → ε is a production, remove it. (4) If S is nullable, add the special production S → ε.
Example. Find all nullable variables N_1, ..., N_k. Grammar: S → ACD, A → a, B → ε, C → ED | ε, D → BC | b, E → b. Nullable variables: B, C, D.
Finding nullable variables. To find nullable variables, we work backwards. First, mark all variables A such that A → ε is a production as nullable. Then, as long as there are productions of the form A → A_1…A_k where all of A_1, …, A_k are marked as nullable, mark A as nullable.
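The marking procedure above can be sketched as a small fixed-point loop. This is only an illustrative sketch: the dictionary encoding of the grammar (single-character variables, right-hand sides as strings, the empty string standing for ε) and the name `nullable_variables` are assumptions made here, not part of the slides.

```python
def nullable_variables(grammar):
    """Return the set of nullable variables.

    grammar: dict mapping each variable (one character) to a list of
    right-hand sides given as strings; "" stands for ε (assumed encoding).
    """
    # First, mark every variable A that has a production A → ε.
    nullable = {A for A, rhss in grammar.items() if "" in rhss}
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            if A in nullable:
                continue
            # Mark A if some right-hand side consists only of marked variables.
            if any(rhs and all(sym in nullable for sym in rhs) for rhs in rhss):
                nullable.add(A)
                changed = True
    return nullable


# Example grammar from the slides: S → ACD, A → a, B → ε, C → ED | ε, D → BC | b, E → b
example = {"S": ["ACD"], "A": ["a"], "B": [""], "C": ["ED", ""], "D": ["BC", "b"], "E": ["b"]}
print(sorted(nullable_variables(example)))  # ['B', 'C', 'D']
```

The loop stops once a full pass marks nothing new, which happens after at most one pass per variable.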
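Eliminating ε-productions. Grammar: S → ACD, A → a, B → ε, C → ED | ε, D → BC | b, E → b; nullable variables: B, C, D. For every production of the form A → αN_iβ, add another production A → αβ: this adds D → C, S → AD, D → B, D → ε, S → AC, S → A, C → E. If N_i → ε is a production, remove it: this removes B → ε, C → ε, D → ε.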
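A minimal sketch of this elimination step, assuming the nullable set has already been computed and is passed in. The grammar encoding (strings of single-character symbols, "" for ε) and the function name `eliminate_epsilon` are illustrative assumptions. Instead of adding productions one at a time, the sketch generates every way of dropping nullable symbols from a right-hand side in one go, which reaches the same set of productions.

```python
from itertools import product


def eliminate_epsilon(grammar, start, nullable):
    """Return a grammar without ε-productions (except possibly start → ε).

    grammar: dict variable -> list of right-hand-side strings, "" is ε.
    nullable: set of nullable variables, computed beforehand.
    """
    new = {A: set() for A in grammar}
    for A, rhss in grammar.items():
        for rhs in rhss:
            # Each nullable symbol may either be kept or dropped.
            options = [(sym, "") if sym in nullable else (sym,) for sym in rhs]
            for choice in product(*options):
                candidate = "".join(choice)
                if candidate:  # never keep an ε-production here
                    new[A].add(candidate)
    if start in nullable:
        new[start].add("")  # the special production S → ε
    return new


example = {"S": ["ACD"], "A": ["a"], "B": [""], "C": ["ED", ""], "D": ["BC", "b"], "E": ["b"]}
result = eliminate_epsilon(example, "S", {"B", "C", "D"})
print(sorted(result["S"]))  # ['A', 'AC', 'ACD', 'AD']
print(sorted(result["D"]))  # ['B', 'BC', 'C', 'b']
```

On the example grammar this leaves B with no productions at all, since its only production was B → ε.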
Dealing with loops. A unit production is a production of the form A_1 → A_2 where A_1 and A_2 are both variables. Example grammar: S → 0S1 | 1S0S1 | T, T → S | R | ε, R → 0SR. Unit productions: S → T, T → S, T → R.
Removal of unit productions. If there is a cycle of unit productions A_1 → A_2 → ... → A_k → A_1, delete it and replace everything with A_1. Example: S → 0S1 | 1S0S1 | T, T → S | R | ε, R → 0SR becomes S → 0S1 | 1S0S1 | R | ε, R → 0SR. T is replaced by S in the {S, T} cycle.
Removal of unit productions. For other unit productions, replace every chain A_1 → A_2 → ... → A_k → α by productions A_1 → α, ..., A_k → α. Example: S → R → 0SR is replaced by S → 0SR, R → 0SR, so S → 0S1 | 1S0S1 | R | ε, R → 0SR becomes S → 0S1 | 1S0S1 | 0SR | ε, R → 0SR.
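One way to sketch unit-production removal in code is via the closure of the unit relation: for each variable A, collect every variable reachable from A through unit productions, then give A all their non-unit right-hand sides. Note that this is a different technique from the cycle-merging step on the slides: it does not rename T to S, so T survives with the same productions as S. The encoding and the name `remove_unit_productions` are assumptions for illustration.

```python
def remove_unit_productions(grammar):
    """Return an equivalent grammar without unit productions.

    grammar: dict variable -> list of right-hand-side strings; variables are
    single characters and exactly the dict keys (assumed encoding).
    """
    variables = set(grammar)
    # reach[A] = variables reachable from A by unit productions (including A).
    reach = {A: {A} for A in variables}
    changed = True
    while changed:
        changed = False
        for A in variables:
            for B in list(reach[A]):
                for rhs in grammar[B]:
                    if rhs in variables and rhs not in reach[A]:
                        reach[A].add(rhs)
                        changed = True
    # A inherits every non-unit right-hand side of the variables it reaches.
    return {
        A: {rhs for B in reach[A] for rhs in grammar[B] if rhs not in variables}
        for A in variables
    }


example = {"S": ["0S1", "1S0S1", "T"], "T": ["S", "R", ""], "R": ["0SR"]}
result = remove_unit_productions(example)
print(sorted(result["S"]))  # ['', '0S1', '0SR', '1S0S1']
print(sorted(result["R"]))  # ['0SR']
```

For S and R the output agrees with the grammar obtained on the slides, S → 0S1 | 1S0S1 | 0SR | ε and R → 0SR.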
Recap. After eliminating ε-productions and unit productions, we know that every derivation S ⇒* a_1…a_k, where a_1, …, a_k are terminals, doesn't shrink in length and doesn't go into cycles. Exception: S → ε. We will not use this rule at all, except to check if ε ∈ L. Note: ε-productions must be eliminated before unit productions.
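Example: testing membership. Original grammar: S → 0S1 | 1S0S1 | T, T → S | ε. After eliminating unit and ε-productions: S → ε | 01 | 101 | 0S1 | 10S1 | 1S01 | 1S0S1. Search for an x of length 5: S ⇒ 01, 101, 0S1, 10S1, 1S01, 1S0S1. 0S1 ⇒ 0011, 01011, 00S11, and strings of length ≥ 6; 00S11 ⇒ only strings of length ≥ 6. 10S1 ⇒ 10011, and strings of length ≥ 6. 1S01 ⇒ 10101, and strings of length ≥ 6. 1S0S1 ⇒ only strings of length ≥ 6.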
Algorithm 1 for testing membership. How to check if a string x ≠ ε is in L(G): (1) Eliminate all ε-productions and unit productions. (2) Let X := S. (3) While some new rule R can be applied to X: apply R to X; if X = x, you have found a derivation for x; if |X| > |x|, backtrack. (4) If no more rules can be applied to X, x is not in L(G).
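A rough sketch of this search, under the assumption that the grammar has already had its ε-productions and unit productions removed (and that the special rule S → ε is left out). The sketch always expands the leftmost variable and uses an explicit stack for backtracking; the function name `derives` and the string encoding of the grammar are illustrative choices, not the slides' notation.

```python
def derives(grammar, start, x):
    """Check whether the nonempty string x is derivable from start.

    grammar: dict variable -> list of right-hand-side strings, assumed to
    contain no ε-productions and no unit productions.
    """
    seen = set()
    stack = [start]
    while stack:
        form = stack.pop()
        if form == x:
            return True  # found a derivation for x
        # Position of the leftmost variable, if any.
        pos = next((i for i, c in enumerate(form) if c in grammar), None)
        if pos is None:
            continue  # all terminals, but not equal to x
        for rhs in grammar[form[pos]]:
            new = form[:pos] + rhs + form[pos + 1:]
            # Backtrack as soon as the sentential form is longer than x.
            if len(new) <= len(x) and new not in seen:
                seen.add(new)
                stack.append(new)
    return False


# Grammar from the membership example, without the rule S → ε.
G = {"S": ["01", "101", "0S1", "10S1", "1S01", "1S0S1"]}
print(derives(G, "S", "0011"))   # True
print(derives(G, "S", "00111"))  # False
```

Because no rule shrinks a sentential form, only finitely many forms of length at most |x| exist, so the search terminates.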
Practical limitations of Algorithm 1. This method can be very slow if x is long. Example: for G = CFG of the Java programming language and x = code for a 200-line Java program, the algorithm might take about … steps! There is a faster algorithm, but it requires that we do some more transformations on the grammar.
Chomsky Normal Form. A grammar is in Chomsky Normal Form if every production (except possibly S → ε) is of the type A → BC or A → a. Conversion to Chomsky Normal Form is easy. Start with, say, A → BcDE. Replace terminals with new variables: A → BCDE, C → c. Break up sequences with new variables: A → BX_1, X_1 → CX_2, X_2 → DE, C → c.
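The two conversion steps can be sketched as follows, assuming ε-productions and unit productions are already gone. Right-hand sides are tuples of symbols so that fresh variables can have multi-character names; the names `T_c` and `X1`, `X2`, ... are a hypothetical naming scheme chosen here and are assumed not to clash with existing variables.

```python
def to_cnf(grammar):
    """Convert to Chomsky Normal Form.

    grammar: dict variable -> list of right-hand sides, each a tuple of
    symbols; symbols that are not dict keys are treated as terminals.
    Assumes ε-productions and unit productions were removed beforehand.
    """
    variables = set(grammar)
    result = {A: [] for A in grammar}
    term_var = {}  # terminal -> fresh variable deriving it
    counter = 0

    def var_for(t):
        if t not in term_var:
            term_var[t] = "T_" + t  # hypothetical name for the new variable
            result[term_var[t]] = [(t,)]
        return term_var[t]

    for A, rhss in grammar.items():
        for rhs in rhss:
            if len(rhs) == 1:
                result[A].append(rhs)  # A → a is already in normal form
                continue
            # Step 1: replace terminals with new variables.
            syms = [s if s in variables else var_for(s) for s in rhs]
            # Step 2: break up long sequences with new variables.
            head = A
            while len(syms) > 2:
                counter += 1
                fresh = f"X{counter}"
                result[head].append((syms[0], fresh))
                result[fresh] = []
                head = fresh
                syms = syms[1:]
            result[head].append(tuple(syms))
    return result


cnf = to_cnf({"A": [("B", "c", "D", "E")], "B": [("b",)], "D": [("d",)], "E": [("e",)]})
print(cnf["A"], cnf["X1"], cnf["X2"], cnf["T_c"])
# [('B', 'X1')] [('T_c', 'X2')] [('D', 'E')] [('c',)]
```

The output mirrors the slide's A → BX_1, X_1 → CX_2, X_2 → DE, C → c, with `T_c` playing the role of the new variable C.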
Exercise. Convert this CFG into Chomsky Normal Form: S → ε | ADDA, A → a, C → c, D → bCb.
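Algorithm 2 for testing membership. Grammar: S → AB | BC, A → BA | a, B → CC | b, C → AB | a; x = baaba. Idea: we generate each substring of x bottom up, recording the variables that derive it. Length 1: b: B; a: A, C; a: A, C; b: B; a: A, C. Length 2: ba: S, A; aa: B; ab: S, C; ba: S, A. Length 3: baa: ∅; aab: B; aba: B. Length 4: baab: ∅; aaba: S, A, C. Length 5: baaba: S, A, C.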
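Parse tree reconstruction. Grammar: S → AB | BC, A → BA | a, B → CC | b, C → AB | a; x = baaba. Tracing back the derivations recorded in the table, we obtain the parse tree. For example, one parse tree corresponds to the derivation S ⇒ BC ⇒ bC ⇒ bAB ⇒ baB ⇒ baCC ⇒ baABC ⇒ baaBC ⇒ baabC ⇒ baaba.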
Cocke-Younger-Kasami algorithm. Input: grammar G in CNF, string x = x_1…x_k. Cell ij remembers all possible derivations of the substring x_i…x_j; the last row holds cells 11, 22, ..., kk and the top cell is 1k. For cells in the last row: if there is a production A → x_i, put A in table cell ii. For cells st in other rows: if there is a production A → BC where B is in cell sj and C is in cell (j+1)t for some s ≤ j < t, put A in cell st.
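A compact sketch of the table-filling procedure, assuming a CNF grammar encoded as a dictionary from single-character variables to right-hand-side strings (one terminal, or two variables). Indices are 0-based here, so `table[s][t]` plays the role of cell (s+1)(t+1); the function name `cyk` and this encoding are illustrative assumptions. The sketch only records which variables derive each substring, not the back-pointers needed to rebuild a parse tree.

```python
def cyk(grammar, x):
    """Fill the CYK table for a nonempty string x.

    grammar: dict variable -> list of right-hand sides, each either a single
    terminal character or a string of two variables (CNF, assumed encoding).
    Returns table, where table[s][t] is the set of variables deriving x[s..t].
    """
    k = len(x)
    table = [[set() for _ in range(k)] for _ in range(k)]
    # Last row: substrings of length 1.
    for i, ch in enumerate(x):
        for A, rhss in grammar.items():
            if ch in rhss:
                table[i][i].add(A)
    # Other rows: longer substrings, shortest first.
    for length in range(2, k + 1):
        for s in range(k - length + 1):
            t = s + length - 1
            for j in range(s, t):  # split into x[s..j] and x[j+1..t]
                for A, rhss in grammar.items():
                    for rhs in rhss:
                        if (len(rhs) == 2
                                and rhs[0] in table[s][j]
                                and rhs[1] in table[j + 1][t]):
                            table[s][t].add(A)
    return table


G = {"S": ["AB", "BC"], "A": ["BA", "a"], "B": ["CC", "b"], "C": ["AB", "a"]}
table = cyk(G, "baaba")
print(sorted(table[0][4]))  # ['A', 'C', 'S']
print("S" in table[0][4])   # True, so baaba is in the language
```

The three nested loops over length, start position, and split point give a running time cubic in |x|, times the size of the grammar.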