Chapter 6
Simplification of Context-free Grammars and Normal Forms
These class notes are based on material from our textbook, An Introduction to Formal Languages and Automata, 4th ed., by Peter Linz.
12/3/2015 9:55 PM
Parsing
Given a string w and a grammar G, a parser finds a derivation of the string w from the grammar G, or else determines that the string is not part of the language.
Thus, a parser solves the membership problem for a language, which is the problem of deciding, for any string w and grammar G, whether w belongs to the language generated by G.
Typically, a parser also constructs a parse tree for the string (which can be used by a compiler for code generation).
Two questions
Can we solve the membership problem for context-free languages? That is, can we develop a parsing algorithm for any context-free language?
If so, can we develop an efficient parsing algorithm?
We saw in the previous chapter that we can, if we place restrictions on the grammar.
Simplified forms and normal forms
Simplified forms can eliminate ambiguity and otherwise “improve” a grammar.
What we would like is for every production in a CFG to be in a form such that applying it never decreases the length of the sentential form.
Once the productions are in this form, whenever the sentential form in a derivation becomes longer than the input string, we know that this derivation cannot produce the input string.
Simplified forms and normal forms
Normal forms of context-free grammars are interesting in that, although they are restricted forms, it can be shown that every CFG can be converted to a normal form.
The two types of normal forms that we will look at are Chomsky normal form and Greibach normal form.
The empty string
The empty string often complicates things, so we would like to define (and work with) a version of a language that excludes the empty string.
Let L be a context-free language that contains λ, and let G’ = (V, T, S, P) be a context-free grammar for L - {λ}. Then we can construct a grammar G that generates L by adding the following to G’:
Create a new start variable, S_0.
Add two new production rules:
  S_0 → S
  S_0 → λ
The empty string
Most of the proofs for context-free languages are demonstrated by using λ-free languages.
It usually can be shown quite easily that the proof can also be extended to “equivalent” languages for which the only difference is the acceptance of the empty string. (Yes, this is handwaving, but...)
Simplified forms
Theorem 6.1: Let G = (V, T, S, P) be a context-free grammar. Suppose that P contains a production rule of the form
  A → x_1 B x_2
Assume that A and B are different variables and that
  B → y_1 | y_2 | ... | y_n
is the set of all productions in P which have B as the left side.
Simplified forms
Theorem 6.1 (continued): Let G’ = (V, T, S, P’) be the grammar in which P’ is constructed by deleting
  A → x_1 B x_2
from P, and adding to it
  A → x_1 y_1 x_2 | x_1 y_2 x_2 | ... | x_1 y_n x_2
Then it may be shown that L(G’) = L(G) (see the Linz textbook for the proof).
Simplified forms
Example:
  A → a | aaA | abBc
  B → abbA | b
Here we can’t eliminate all rules with B on the left side, but we can eliminate B from the right side of any A rule. The equivalent productions would be:
  A → a | aaA | ababbAc | abbc
  B → abbA | b
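The substitution in Theorem 6.1 can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the notes: a grammar is a dict from a variable to a list of right-hand-side strings, every symbol is a single character, and the function name substitute is hypothetical. It replaces the first occurrence of B in the chosen right-hand side.

```python
def substitute(productions, a, rhs, b):
    """Replace the production a -> rhs (which contains variable b) by one
    production per b-rule, as in Theorem 6.1. Returns a new grammar dict."""
    i = rhs.index(b)                      # position of the first occurrence of b
    x1, x2 = rhs[:i], rhs[i + 1:]         # rhs = x1 b x2
    new = {v: list(rules) for v, rules in productions.items()}
    new[a].remove(rhs)                    # delete a -> x1 b x2
    for y in productions[b]:              # add a -> x1 y x2 for every b -> y
        candidate = x1 + y + x2
        if candidate not in new[a]:
            new[a].append(candidate)
    return new


# Assumed encoding of the example grammar: A -> a | aaA | abBc, B -> abbA | b
grammar = {"A": ["a", "aaA", "abBc"], "B": ["abbA", "b"]}
result = substitute(grammar, "A", "abBc", "B")
print(result["A"])  # ['a', 'aaA', 'ababbAc', 'abbc']
```

The B rules are left in place; whether they are still needed is a separate question (reachability), which the substitution itself does not decide.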
Simplified forms
Example: Suppose that our complete simplified grammar is:
  S → A
  A → a | aaA | ababbAc | abbc
  B → abbA | b
Since you can’t get to B from S, there is no longer any way that any B rules can play a part in any derivation; they are useless.
Simplified forms
Another example: Suppose that our grammar is:
  S → aSb | λ | A
  A → aA
Notice that the production rule A → aA can never be used to produce a string of all terminals. It is therefore useless.
The production rule S → A is also useless. (Why?)
Both of these rules may be deleted without changing the language generated by the grammar.
Reachable
Definition: A variable A in a context-free grammar G = (V, T, S, P) is reachable if S ⇒* xAy for some x, y ∈ (V ∪ T)*.
Reachable variables are variables that appear in strings derivable from S.
Example
  S → EA
  A → abA | ab
  C → EC | Ab
  E → bC
  G → EbE | CE | ba
Reachable variables:
  R_0 = {S}
  R_1 = {S, E, A}
  R_2 = {S, E, A, C}
  R_3 = {S, E, A, C}
The set stops growing at R_3, so G is not reachable.
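The reachable sets R_0, R_1, ... can be computed with a simple worklist. The sketch below assumes (this is not from the notes) that a grammar is a dict from a variable to a list of right-hand-side strings with single-character symbols, and that every variable appears as a key; the function name reachable is hypothetical.

```python
def reachable(productions, start):
    """Return the set of variables that occur in some string derivable from start."""
    reached = {start}
    frontier = [start]
    while frontier:
        v = frontier.pop()
        for rhs in productions.get(v, []):
            for sym in rhs:
                # Only dict keys count as variables; everything else is a terminal.
                if sym in productions and sym not in reached:
                    reached.add(sym)
                    frontier.append(sym)
    return reached


grammar = {
    "S": ["EA"],
    "A": ["abA", "ab"],
    "C": ["EC", "Ab"],
    "E": ["bC"],
    "G": ["EbE", "CE", "ba"],
}
reached = reachable(grammar, "S")
print(sorted(reached))  # ['A', 'C', 'E', 'S']
```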
Useful variables
Definition: Let G = (V, T, S, P) be a context-free grammar. Let A ∈ V; then A is live iff there is at least one string w ∈ T* such that A ⇒* w.
Informally, live variables are those from which strings of terminals can be derived.
Variables which are not live are said to be dead.
Example
  S → AB | CD | ADF | CF | EA
  A → abA | ab
  B → bB | aD | BF | aF
  C → cB | EC | Ab
  D → bB | FFB
  E → bC | AB
  F → abbF | baF | bD | BB
  G → EbE | CE | ba
Live variables:
  L_0 = {A, G}
  L_1 = {A, G, C}
  L_2 = {A, G, C, E}
  L_3 = {A, G, C, E, S}
No further variables can be added, so B, D, and F are dead.
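The live set can be computed by iterating to a fixed point: a variable becomes live as soon as one of its right-hand sides contains only terminals and variables already known to be live. The sketch below assumes a dict-of-strings grammar with single-character symbols in which every variable is a key; the name live_variables is hypothetical. A single pass may add several variables at once, so the intermediate sets need not match L_0, L_1, ... step for step, but the final set is the same.

```python
def live_variables(productions):
    """Return the set of variables that can derive a string of terminals."""
    alive = set()
    changed = True
    while changed:
        changed = False
        for v, rules in productions.items():
            if v in alive:
                continue
            # A symbol is harmless if it is a terminal or an already-live variable.
            if any(all(s in alive or s not in productions for s in rhs) for rhs in rules):
                alive.add(v)
                changed = True
    return alive


grammar = {
    "S": ["AB", "CD", "ADF", "CF", "EA"],
    "A": ["abA", "ab"],
    "B": ["bB", "aD", "BF", "aF"],
    "C": ["cB", "EC", "Ab"],
    "D": ["bB", "FFB"],
    "E": ["bC", "AB"],
    "F": ["abbF", "baF", "bD", "BB"],
    "G": ["EbE", "CE", "ba"],
}
alive = live_variables(grammar)
print(sorted(alive))  # ['A', 'C', 'E', 'G', 'S']
```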
Useful variables
Definition 6.1 (modified): A variable A in a context-free grammar G = (V, T, S, P) is useful if, for some string w ∈ L(G), there is a derivation of w that takes the form S ⇒* xAy ⇒* w.
Informally, a variable is useful if it can be used in a derivation of a string in the language L(G).
A variable which is not useful is said to be useless.
Variables which are dead are useless.
Variables which are not reachable are useless.
Useless variables
So a variable is useless if either:
1. it is not live (i.e., cannot derive a terminal string), or
2. it is not reachable from the start symbol.
A production is useless if it involves any useless variables.
Exercise
Example: Given G = ({S, A, B, C}, {a, b}, S, P), with P =
  S → aS | A | C
  A → a
  B → aa
  C → aCb
eliminate all useless variables and productions.
First, we find any dead variables. The only production for C, C → aCb, always reintroduces C, so C can never generate a string of all terminals. C is dead.
Exercise
Delete any productions involving C. New grammar:
  S → aS | A
  A → a
  B → aa
Next, we check to see if there are any variables which cannot be reached from the start symbol. To do this, we may use a dependency graph.
Exercise
Example:
  S → aS | A | C
  A → a
  B → aa
  C → aCb
Dependency graph (figure): nodes S, A, C, and B, with edges from S to A and from S to C; no edge leads to B.
Clearly, B is not reachable from S.
Exercise
Delete any productions involving B. New grammar:
  S → aS | A
  A → a
The only productions that were deleted from the original grammar were useless. This new grammar generates all and only the strings generated by the original grammar. It is equivalent to the original grammar.
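The two-phase cleanup in this exercise (drop dead variables first, then drop unreachable ones) can be sketched as one function. The representation is an assumption: a dict from variable to a list of right-hand-side strings with single-character symbols, every variable being a key. The name remove_useless is hypothetical.

```python
def remove_useless(productions, start):
    """Remove dead variables, then unreachable variables, and every
    production that mentions a removed variable."""
    variables = set(productions)

    # Phase 1: find live variables by iterating to a fixed point.
    alive = set()
    changed = True
    while changed:
        changed = False
        for v in variables - alive:
            if any(all(s in alive or s not in variables for s in rhs)
                   for rhs in productions[v]):
                alive.add(v)
                changed = True
    # Keep only productions whose symbols are terminals or live variables.
    pruned = {
        v: [rhs for rhs in productions[v]
            if all(s in alive or s not in variables for s in rhs)]
        for v in alive
    }

    # Phase 2: keep only variables reachable from the start symbol.
    reached, frontier = {start}, [start]
    while frontier:
        v = frontier.pop()
        for rhs in pruned.get(v, []):
            for s in rhs:
                if s in pruned and s not in reached:
                    reached.add(s)
                    frontier.append(s)
    return {v: rules for v, rules in pruned.items() if v in reached}


grammar = {"S": ["aS", "A", "C"], "A": ["a"], "B": ["aa"], "C": ["aCb"]}
cleaned = remove_useless(grammar, "S")
print(cleaned)
```

The order of the phases matters: deleting dead variables can make further variables unreachable, so reachability is checked second.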
Useless variables
Theorem 6.2: Let G = (V, T, S, P) be a context-free grammar. Then there exists an equivalent grammar G’ = (V’, T’, S, P’) that does not contain any useless variables or productions.
Note that useless variables may be removed from V to give V’, and any terminals not occurring in any useful production may be removed from T to give T’.
Simplified forms and normal forms
Two undesirable types of productions in a CFG can keep the length of sentential forms from increasing:
λ-productions: these productions are of the form A → λ, and they actually decrease the length of the string.
Unit productions: these productions are of the form A → B, and they allow rules to be applied to a string without increasing the length of the string and without getting us any closer to the goal of ending up with a string of all terminals.
λ-productions
Definition 6.2: Any production of a context-free grammar of the form A → λ is called a λ-production.
Any variable A for which the derivation A ⇒* λ is possible is called nullable.
Nullable variables
A nullable variable in a context-free grammar G = (V, T, S, P) is defined as follows:
1. Any variable A for which P contains the production A → λ is nullable.
2. If P contains the production A → B_1 B_2 … B_n and B_1, B_2, …, B_n are nullable variables, then A is nullable.
3. No other variables in V are nullable.
The nullable variables in V are precisely those variables A for which A ⇒* λ.
The effect of λ-productions
Suppose we are trying to see if our CFG generates the string aabaa, which contains 5 terminal characters. In the process of applying productions, we have generated an intermediate string, aaYbYaa, containing 7 characters.
Since λ-productions decrease the length of the string, it might still be possible to generate aabaa from aaYbYaa (if there were a derivation Y ⇒* λ).
λ-productions
Note that without λ-productions, a grammar would have no way to reduce the number of characters in its intermediate strings.
In such a grammar, we could stop processing intermediate strings as soon as they exceeded the length of the target string.
λ-productions
So, given a CFG G without λ-productions, we could determine whether a given string x of length |x| belongs to L(G) simply by applying production rules and generating all sentential forms of length at most |x|. If x has not been generated by that point, it cannot belong to the language.
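A minimal sketch of this exhaustive search follows. It assumes a grammar with no λ-productions, encoded (as an assumption, not from the notes) as a dict from variable to right-hand-side strings with single-character symbols. It expands only the leftmost variable, which is enough because every derivable string has a leftmost derivation, and it remembers sentential forms it has already seen so that unit productions cannot cause an endless loop. The name derives is hypothetical.

```python
from collections import deque


def derives(productions, start, target):
    """Exhaustive membership test for a grammar without lambda-productions.
    Sentential forms longer than the target are discarded."""
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        for i, sym in enumerate(form):
            if sym in productions:            # leftmost variable found
                for rhs in productions[sym]:
                    new = form[:i] + rhs + form[i + 1:]
                    if len(new) <= len(target) and new not in seen:
                        seen.add(new)
                        queue.append(new)
                break                         # expand only the leftmost variable
    return False


# Illustrative lambda-free grammar S -> aTb | ab, T -> aTb | ab
# (T is a stand-in name for a second variable).
grammar = {"S": ["aTb", "ab"], "T": ["aTb", "ab"]}
print(derives(grammar, "S", "aabb"), derives(grammar, "S", "aab"))  # True False
```

The search always terminates because there are only finitely many strings of length at most |x| over the grammar's symbols, but it is far too slow to be a practical parser.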
λ-productions
Given the grammar
  S → aS_1b
  S_1 → aS_1b | λ
what is the effect of the production S_1 → λ?
Its effect is to delete S_1 from any sentential form in which S_1 occurs, which amounts to deleting S_1 from the right-hand side of a production rule that contains it.
λ-productions
If we apply the production S_1 → λ to S → aS_1b, the resulting production rule is S → ab.
If we apply the production S_1 → λ to S_1 → aS_1b, the resulting production rule is S_1 → ab.
λ-productions
Therefore, we can eliminate any λ-productions from this grammar by adding the new productions obtained by substituting λ for S_1 wherever S_1 appears on the right-hand side of the production rules, and then deleting the λ-production. When we do this, we obtain the equivalent grammar:
  S → aS_1b | ab
  S_1 → aS_1b | ab
λ-productions
Theorem 6.3: Let G be any context-free grammar with λ not in L(G). Then there exists an equivalent grammar G’ having no λ-productions.
Algorithm FindNull
Establish the set N_0, which is the set of all variables A in the grammar that have a production A → λ.
Now loop: on each pass through the loop, add to this set every variable B that has a production whose right-hand side consists entirely of variables already in the set.
Stop when no new variables were added to the set during the last iteration of the loop.
Example
Let G be the CFG with the productions:
  S → ABCBCDA
  A → CD
  B → Cb
  C → a | λ
  D → bD | λ
Here, C and D are nullable because there are production rules C → λ and D → λ.
But A is also nullable, because A → CD, and both C and D are nullable.
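FindNull is a fixed-point computation and can be sketched as below. The encoding is an assumption: a dict from variable to right-hand-side strings with single-character symbols, with λ written as the empty string "". The name find_nullable is hypothetical.

```python
def find_nullable(productions):
    """Return the set of nullable variables (those A with A =>* lambda)."""
    # N_0: variables with a direct lambda-production (empty right-hand side).
    nullable = {v for v, rules in productions.items() if "" in rules}
    changed = True
    while changed:
        changed = False
        for v, rules in productions.items():
            if v in nullable:
                continue
            # v is nullable if some right-hand side has only nullable variables.
            if any(all(s in nullable for s in rhs) for rhs in rules):
                nullable.add(v)
                changed = True
    return nullable


grammar = {
    "S": ["ABCBCDA"],
    "A": ["CD"],
    "B": ["Cb"],
    "C": ["a", ""],
    "D": ["bD", ""],
}
nullable = find_nullable(grammar)
print(sorted(nullable))  # ['A', 'C', 'D']
```

S is not nullable here because B → Cb always produces the terminal b.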
Algorithm: Eliminate λ-productions
Given a CFG G = (V, T, S, P), construct a CFG G’ = (V, T, S, P’) with no λ-productions as follows:
1. Initialize P’ = P.
2. Find all nullable variables in V, using FindNull.
3. For every production A → x in P (x ∈ (V ∪ T)*), where x contains nullable variables, add to P’ every production that can be obtained from this one by deleting from x one or more of the occurrences in x of nullable variables.
4. Delete all λ-productions from P’.
5. In addition, delete any duplicates and delete productions of the form A → A.
Implications of Theorem 6.3
Let G = (V, T, S, P) be any context-free grammar, and let G’ be the grammar obtained from G by the previous algorithm. Then:
1. G’ has no λ-productions, and
2. L(G’) = L(G) - {λ}.
3. Moreover, if G is unambiguous, then so is G’.
Example
Given a context-free grammar with the following production rules, find the nullable variables:
  S → ABC
  A → B | a
  B → C | b | λ
  C → AB | D
  D → Cd
  N_0 = {B}
  N_1 = {B, A}
  N_2 = {B, A, C}
  N_3 = {B, A, C, S}
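Example (continued)
Nullable variables: N = {A, B, C, S}
Productions before and after deleting occurrences of nullable variables and removing the λ-production:
  S → ABC          becomes   S → ABC | BC | AC | AB | A | B | C
  A → B | a        becomes   A → B | a
  B → C | b | λ    becomes   B → C | b
  C → AB | D       becomes   C → AB | A | B | D
  D → Cd           becomes   D → Cd | d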
Example (continued)
  S → ABC | AB | AC | BC | A | B | C
  A → B | a
  B → C | b
  C → AB | A | B | D
  D → Cd | d
Note that we have gotten rid of all λ-productions. However, other beneficial changes can still be made.
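The elimination algorithm can be sketched by treating each occurrence of a nullable variable as a keep-or-drop choice and enumerating all combinations. The encoding is an assumption (dict from variable to right-hand-side strings, single-character symbols, λ written as ""), and the name eliminate_lambda is hypothetical.

```python
from itertools import product


def eliminate_lambda(productions):
    """Return an equivalent grammar (up to lambda) with no lambda-productions."""
    # Nullable variables, computed by iterating to a fixed point.
    nullable = set()
    changed = True
    while changed:
        changed = False
        for v, rules in productions.items():
            if v not in nullable and any(all(s in nullable for s in rhs) for rhs in rules):
                nullable.add(v)
                changed = True

    result = {}
    for v, rules in productions.items():
        new_rules = []
        for rhs in rules:
            # Each nullable occurrence may be kept or dropped; other symbols stay.
            options = [(s, "") if s in nullable else (s,) for s in rhs]
            for choice in product(*options):
                candidate = "".join(choice)
                # Skip lambda-productions, A -> A, and duplicates.
                if candidate and candidate != v and candidate not in new_rules:
                    new_rules.append(candidate)
        result[v] = new_rules
    return result


grammar = {"S": ["ABC"], "A": ["B", "a"], "B": ["C", "b", ""],
           "C": ["AB", "D"], "D": ["Cd"]}
result = eliminate_lambda(grammar)
for v, rules in result.items():
    print(v, "->", " | ".join(rules))
```

On the example grammar this reproduces the productions listed above, possibly in a different order.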
Unit productions
Definition 6.3: Any production of a context-free grammar of the form A → B, where A, B ∈ V, is called a unit-production.
Unit productions
Theorem 6.4: Let G = (V, T, S, P) be any context-free grammar without λ-productions. Then there exists a context-free grammar G’ = (V’, T’, S, P’) that does not have any unit-productions and that is equivalent to G.
Proof: See p. 159 in the Linz text.
Definition of A-derivable variables
The set of “A-derivable variables” is the set of all variables B ≠ A for which A ⇒* B using only unit productions.
1. If A → B is a production, then B is A-derivable.
2. If C is A-derivable, C → B is a production, and B ≠ A, then B is A-derivable.
3. No other variables are A-derivable.
Algorithm: Eliminating Unit Productions
Given a context-free grammar G = (V, T, S, P) with no λ-productions, construct a grammar G’ = (V, T, S, P’) having no unit productions as follows:
1. Initialize P’ to be P.
2. For each A ∈ V, find the set of A-derivable variables.
3. For every pair (A, B) such that B is A-derivable, and every non-unit production B → x (where x ∈ (V ∪ T)+), add the production A → x to P’.
4. Delete all unit productions from P’.
Example
Original grammar:
  S → S+T | T
  T → T*F | F
  F → (S) | a
S-derivable variables: {T} at first, then {T, F} once F is added through T.
T-derivable variables: {F}
Resulting grammar:
  S → S+T | T*F | (S) | a
  T → T*F | (S) | a
  F → (S) | a
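The unit-production algorithm can be sketched as follows, assuming a λ-free grammar encoded as a dict from variable to right-hand-side strings with single-character symbols (an assumption, as is the name eliminate_units). A right-hand side counts as a unit production when it is exactly one symbol and that symbol is a variable.

```python
def eliminate_units(productions):
    """Return an equivalent grammar with no unit productions.
    Assumes the input grammar has no lambda-productions."""
    def is_unit(rhs):
        return len(rhs) == 1 and rhs in productions

    # Step 2: for each variable A, collect the A-derivable variables.
    derivable = {}
    for a in productions:
        found, frontier = set(), [a]
        while frontier:
            c = frontier.pop()
            for rhs in productions[c]:
                if is_unit(rhs) and rhs != a and rhs not in found:
                    found.add(rhs)
                    frontier.append(rhs)
        derivable[a] = found

    # Steps 3 and 4: copy non-unit rules of every A-derivable B to A,
    # and leave out the unit productions themselves.
    result = {}
    for a in productions:
        rules = [r for r in productions[a] if not is_unit(r)]
        for b in sorted(derivable[a]):
            for r in productions[b]:
                if not is_unit(r) and r not in rules:
                    rules.append(r)
        result[a] = rules
    return result


grammar = {"S": ["S+T", "T"], "T": ["T*F", "F"], "F": ["(S)", "a"]}
result = eliminate_units(grammar)
for v, rules in result.items():
    print(v, "->", " | ".join(rules))
```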
Summary
Theorem 6.5: Let L be a context-free language that does not contain λ. Then there exists a context-free grammar that generates L and that does not have any useless productions, λ-productions, or unit-productions.
Proof: Find a CFG that generates L. Apply the procedures in Theorems 6.2, 6.3, and 6.4. The result is an equivalent CFG that generates L but does not have any useless productions, λ-productions, or unit-productions.
Summary
Note that the procedures specified above must be applied in a particular order. The procedure for removing λ-productions can create new unit-productions, and the procedure for eliminating unit-productions must start with a CFG that has no λ-productions.
The required sequence is:
1. Remove λ-productions
2. Remove unit productions
3. Remove useless productions
Unit productions
Given a context-free grammar G’ without λ-productions or unit productions, any production rule must either:
convert a non-terminal to a terminal, or
replace a non-terminal with at least two other symbols.
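Unit productions
Let:
  l = the length of the current string
  t = the number of terminals in the current string
The value of l + t is 1 for the starting string S and 2k for a string (all terminals) of length k in the language.
In a grammar without λ-productions or unit productions, every step of a derivation increases l + t by at least 1: replacing a variable with a terminal increases t by 1, and replacing a variable with two or more symbols increases l by at least 1.
The value of l + t for an intermediate string of length k containing one or more variables would be less than 2k.
Any intermediate string with l + t > 2k cannot generate a string of length k in the language.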
Simplified forms
What does this mean for us?
Given a grammar G with no λ-productions or unit-productions, it means that if you have a string x in L(G) and |x| = k, then starting from S there are no more than 2k - 1 steps in the derivation of x.
Proof:
At the beginning of the derivation of x, the length of the intermediate string, S, is 1. Somehow you need to generate a string of length k. If G has no λ-productions or unit-productions, then there are 2 possible kinds of rules:
1. The rule transforms one non-terminal into some combination of two or more non-terminals and/or terminals.
2. The rule transforms one non-terminal into one terminal.
Rules of the first type will increase the length of the derivation string by at least one character at each step. So it will take no more than k - 1 steps to increase the size of the string to k.
Proof:
Once the intermediate string has k symbols in it, any additional rules involved in the derivation of x must simply replace variable symbols with terminals. The “worst-case scenario” is if all the symbols are variables; in that case, we will need at most k steps (of rules of the second type, which replace a single variable with a single terminal) to convert the intermediate string into a string of all terminals.
It will take no more than 2k - 1 applications of the production rules to derive x. These rules can be applied in any order. (We don’t have to expand the string first and then convert it to terminals.)
Chomsky Normal Form
There are other ways to limit the form a grammar can have.
A context-free grammar in Chomsky Normal Form (CNF) has all of its rules restricted so that there are no more than two symbols, either one terminal or two variables, on the right-hand side of a production rule.
This seems very restrictive, but actually every context-free grammar can be converted into Chomsky Normal Form.
Chomsky Normal Form
Definition 6.4: A context-free grammar is in Chomsky Normal Form (CNF) if every production is one of these two types:
  A → BC
  A → a
where A, B, and C are variables and a is a terminal symbol.
Chomsky Normal Form
For languages that include the empty string λ, the rule S → λ may also be allowed, where S is the start symbol, as long as S does not occur on the right-hand side of any rule.
Chomsky Normal Form
Theorem 6.6: Any context-free grammar G = (V, T, S, P) with λ ∉ L(G) has an equivalent grammar G’ = (V’, T’, S, P’) in Chomsky Normal Form.
Chomsky Normal Form: Proof by construction
Given a context-free grammar G = (V, T, S, P), to convert it to Chomsky Normal Form:
1. Eliminate λ-productions and unit-productions from G, producing a CFG G’ = (V, T, S, P’) such that L(G’) = L(G) - {λ}.
2. Convert G’ into G’’ = (V’’, T, S, P’’) so that every production is either of the form A → B_1 B_2 … B_k (where k ≥ 2 and each B_i is a variable in V’’), or of the form A → a.
Chomsky Normal Form
Basically, what you are doing in step 2 is restricting the right sides of productions to be either single terminals or strings of two or more variables. What we don’t want is right-hand sides of length 2 or more that have one or more terminals in them. If we have strings like this, for every terminal a appearing in such a string:
1. Add a new variable, X_a, and add a new production, X_a → a.
2. Replace a by X_a in all the productions where it appears (except those of the form A → a).
Chomsky Normal Form (continued)
3. Convert G’’ into G’’’ = (V’’’, T, S, P’’’). To do this, replace each production having more than two variables on the right by an equivalent set of productions, each one having exactly two variables on the right. (Create new variables as necessary to accomplish this.)
For example, the production A → BCD would be replaced with
  A → BZ_1
  Z_1 → CD
Done!
Example
Original grammar:
  S → AB | ab
  A → ABAB | BA
  B → ab | b
After step 2:
  S → AB | X_a X_b
  X_a → a
  X_b → b
  A → ABAB | BA
  B → X_a X_b | b
Example (continued)
After step 3:
  S → AB | X_a X_b
  X_a → a
  X_b → b
  A → AY_1 | BA
  Y_1 → BY_2
  Y_2 → AB
  B → X_a X_b | b
Example
If you recognize that A → ABAB has two copies of the same pair of variables, you could substitute the following instead (but the first procedure works equally well).
After step 3:
  S → AB | X_a X_b
  X_a → a
  X_b → b
  A → Y_1 Y_1 | BA
  Y_1 → AB
  B → X_a X_b | b
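Steps 2 and 3 of the construction can be sketched as below. Because new variables such as X_a need multi-character names, this sketch assumes right-hand sides are tuples of symbol strings rather than plain strings. The input is assumed to have no λ-productions or unit-productions already, and the names to_cnf, X_… and Y_… are illustrative choices that are assumed not to clash with existing variables.

```python
def to_cnf(productions):
    """Steps 2 and 3 of the CNF construction.
    productions: dict mapping a variable to a list of tuples of symbols."""
    variables = set(productions)
    result = {v: [] for v in productions}
    counter = 0

    def var_for(terminal):
        # Step 2: one fresh variable X_a per terminal a, with X_a -> a.
        name = "X_" + terminal
        if name not in result:
            result[name] = [(terminal,)]
        return name

    for v, rules in productions.items():
        for rhs in rules:
            if len(rhs) >= 2:
                rhs = tuple(s if s in variables else var_for(s) for s in rhs)
            # Step 3: split long right-hand sides into a chain of pairs.
            head = v
            while len(rhs) > 2:
                counter += 1
                fresh = "Y_%d" % counter
                result[head].append((rhs[0], fresh))
                result[fresh] = []
                head, rhs = fresh, rhs[1:]
            result[head].append(rhs)
    return result


grammar = {
    "S": [("A", "B"), ("a", "b")],
    "A": [("A", "B", "A", "B"), ("B", "A")],
    "B": [("a", "b"), ("b",)],
}
cnf = to_cnf(grammar)
for v, rules in cnf.items():
    print(v, "->", " | ".join(" ".join(r) for r in rules))
```

On the example grammar this yields the first "After step 3" result (with Y_1 and Y_2); it does not attempt the shortcut of reusing a variable for a repeated pair.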
Proof (concluded)
This constitutes a proof by construction that any CFG whose language does not contain λ can be converted to CNF.
Later, this will be used to prove that there are languages which are not context-free.
Greibach Normal Form
Greibach Normal Form is similar to Chomsky Normal Form, except that every production is of the form A → ax, where a is a terminal symbol and x is a string of zero or more variables.
Note that GNF restricts where terminals and variables can appear (their relative positions), rather than the number of symbols on the right-hand side of the production rules.
Greibach Normal Form
Definition 6.5: A context-free grammar is said to be in Greibach Normal Form if all productions have the form
  A → ax
where a ∈ T and x ∈ V*.
Greibach Normal Form
Example: Convert the following grammar into GNF:
  S → abSb | aa
Introduce new variables A and B to stand for a and b respectively, and substitute:
  S → aBSB | aA
  A → a
  B → b
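Definition 6.5 is easy to check mechanically. The sketch below only tests whether a grammar is already in GNF; it does not perform the conversion. It assumes a dict from variable to right-hand-side strings with single-character symbols, and the name is_gnf is hypothetical.

```python
def is_gnf(productions):
    """True if every production has the form A -> a x, where a is a
    terminal and x is a (possibly empty) string of variables."""
    variables = set(productions)
    for rules in productions.values():
        for rhs in rules:
            if not rhs or rhs[0] in variables:
                return False          # must start with a terminal
            if any(s not in variables for s in rhs[1:]):
                return False          # the rest must be variables only
    return True


original = {"S": ["abSb", "aa"]}
converted = {"S": ["aBSB", "aA"], "A": ["a"], "B": ["b"]}
print(is_gnf(original), is_gnf(converted))  # False True
```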
Greibach Normal Form
Theorem 6.7: Any context-free grammar G = (V, T, S, P) with λ ∉ L(G) has an equivalent grammar G’ = (V’, T’, S, P’) in Greibach Normal Form.
It is hard to prove this, and it is hard to construct an easy-to-implement algorithm for performing the conversion.
A membership algorithm for CFGs
The famous linguist Noam Chomsky showed that every context-free grammar can be converted to an equivalent grammar in Chomsky normal form.
Why should you care about this? The fact that any CFG can be converted to Chomsky normal form lets us develop a parsing algorithm that shows that the membership problem can be solved for context-free languages (CFLs).
Some motivation
Here is the idea of the algorithm:
For a grammar in Chomsky normal form, any derivation of a string w has exactly 2n - 1 steps, where n ≥ 1 is the length of w. (Why?)
So, it is only necessary to check derivations of 2n - 1 steps to decide whether G generates w.
Of course, this parsing algorithm is inefficient! It would never be used in practice. But it solves the membership problem for CFLs.
The CYK algorithm
The membership algorithm for CFGs that is usually cited is the CYK algorithm, named for its three developers (Cocke, Younger, and Kasami). It works by breaking down the problem into a sequence of smaller problems and solving them. Details may be found in the Linz textbook.
This algorithm can be shown to run in O(|w|^3) time.
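A compact sketch of CYK is given below. It assumes the grammar is already in Chomsky Normal Form and is encoded as a dict from variable to a list of tuples (either one terminal or two variables). The function name cyk and the small grammar for the language a^n b^n (n ≥ 1) are illustrative assumptions, not taken from the notes or the textbook. The table entry for (i, length) holds every variable that derives the substring of that length starting at position i.

```python
def cyk(productions, start, w):
    """CYK membership test for a grammar in Chomsky Normal Form."""
    n = len(w)
    if n == 0:
        return False                   # a CNF grammar without S -> lambda
    table = {}
    # Substrings of length 1: variables with a production A -> a.
    for i, ch in enumerate(w):
        table[i, 1] = {v for v, rules in productions.items() if (ch,) in rules}
    # Longer substrings: try every split point and every production A -> BC.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            cell = set()
            for split in range(1, length):
                left = table[i, split]
                right = table[i + split, length - split]
                for v, rules in productions.items():
                    if any(len(r) == 2 and r[0] in left and r[1] in right
                           for r in rules):
                        cell.add(v)
            table[i, length] = cell
    return start in table[0, n]


# Illustrative CNF grammar for {a^n b^n : n >= 1}:
#   S -> AB | AC,  C -> SB,  A -> a,  B -> b
grammar = {
    "S": [("A", "B"), ("A", "C")],
    "C": [("S", "B")],
    "A": [("a",)],
    "B": [("b",)],
}
print(cyk(grammar, "S", "aabb"), cyk(grammar, "S", "aab"))  # True False
```

The three nested loops over length, start position, and split point give the cubic dependence on |w|; the inner scan over productions contributes a factor that depends only on the grammar.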
LL grammars
A top-down parser finds a leftmost derivation of a string. “Top-down” means to start with the start symbol and show how to derive the string from it.
An LL(k) grammar allows a parser to perform a left-to-right scan of the input to find a leftmost derivation, using k symbols of lookahead to select the next rule.
Many compilers have been written using LL parsers. But LL grammars are not sufficiently general to generate all deterministic CFLs. This led to study of more general deterministic grammars, especially LR grammars.
LR grammars
A bottom-up parser finds a rightmost derivation of a string, in reverse. “Bottom-up” means to start with a string and “reduce” it to the start symbol.
An LR(k) grammar allows a parser to perform a left-to-right scan of the input to produce a rightmost derivation (in reverse), using k symbols of lookahead to select the next rule.
The class of languages generated by LR(1) grammars is exactly the deterministic CFLs.
Two subclasses of LR(1) grammars, called SLR(1) (for “simple” LR) and LALR(1) (for “lookahead” LR), are commonly used for programming languages.
Parsing algorithms
Parsing is an extremely important topic in the design and compilation of programming languages. You will study parsing algorithms based on various LL and LR grammars in a course on compiler design.
Most of what we have studied in these chapters about regular and context-free languages provides the mathematical foundation for designing good compilers. (It has many other applications as well.)
Efficient parsing
The syntax of most programming languages is specified by context-free grammars, and parsing is central to any programming language compiler.
Many parsing algorithms for context-free grammars have been developed over the years. Most simulate pushdown automata.
However, some PDAs cannot be simulated efficiently by computer programs because they are nondeterministic.
Efficient parsers simulate deterministic PDAs.
Regular grammar CFGs
A word is a string of all terminals.
A semiword is a string of 0 or more terminals concatenated with exactly one nonterminal on the right. So, for example, abcA is a semiword.
A CFG is called a regular grammar if each of its productions is one of the two forms:
  Nonterminal → semiword
  Nonterminal → word
Regular grammars
All regular languages can be generated by regular grammars.
All regular grammars generate regular languages.
Context-free grammars are more powerful than regular grammars. Regular languages are a proper subset of context-free languages, so CFGs can generate all regular languages (as well as non-regular context-free languages).