Chapter 5 Context-free Languages


Context-free Grammars Definition 5.1: A grammar G = (V, T, S, P) is said to be context-free if all production rules in P have the form A → x, where A ∈ V and x ∈ (V ∪ T)*. A language L is said to be context-free iff there is a context-free grammar G such that L = L(G).

Context-free Grammars Context-free means that there is a single variable on the left side of each grammar rule. You can imagine other kinds of grammar rules where this condition does not hold. For example: 1Z1 → 101 In this rule, the variable Z goes to 0 only in the context of a 1 on its left and a 1 on its right. This is a context-sensitive rule.

Non-regular languages There are non-regular languages that can be generated by context-free grammars. The language {a^n b^n : n ≥ 0} is generated by the grammar S → aSb | λ. The language L = {w : n_a(w) = n_b(w)} is generated by the grammar S → SS | λ | aSb | bSa.
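The derivation S ⇒ aSb ⇒ aaSbb ⇒ ... can be simulated mechanically. A minimal Python sketch (the function name is ours, not from the text) that enumerates the first few strings of {a^n b^n : n ≥ 0}:

```python
def anbn(max_n):
    """Generate a^n b^n for n = 0 .. max_n via the derivation
    S => aSb => aaSbb => ... , ending each time with S -> lambda."""
    sentential = "S"
    strings = []
    for n in range(max_n + 1):
        # Apply S -> lambda to turn the sentential form into a terminal string.
        strings.append(sentential.replace("S", ""))
        # Apply S -> aSb to go one derivation step deeper.
        sentential = sentential.replace("S", "aSb")
    return strings

print(anbn(3))  # ['', 'ab', 'aabb', 'aaabbb']
```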

Example of a Context-free Grammar The grammar G = ({S}, {a, b}, S, P), with production rules: S → aSa | bSb | λ is context-free. This grammar is linear (that is, there is at most a single variable on the right-hand side of every rule), but it is neither right-linear nor left-linear (the variable is not always the rightmost [leftmost] character on the right-hand side of the rule), so it is not regular.

Example of a Context-free Grammar Given the grammar G = ({S}, {a, b}, S, P), with production rules: S → aSa | bSb | λ a typical derivation in this grammar might be: S ⇒ aSa ⇒ aaSaa ⇒ aabSbaa ⇒ aabbaa The language generated by this grammar is: L(G) = {ww^R : w ∈ {a, b}*}

Palindromes Palindromes are strings that read the same backwards and forwards. The language of palindromes, PAL, is not regular. For any two distinct strings x, y with |x| = |y|, a string z can be found which distinguishes them: if z = x^R (x reversed) is appended to each, then xz is in PAL and yz is not. Therefore any FA accepting PAL would need an infinite number of states, so PAL is not regular.
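The distinguishing argument can be checked concretely. A short Python sketch (the example strings are ours):

```python
def is_palindrome(s):
    """True iff s reads the same forwards and backwards."""
    return s == s[::-1]

# Two distinct strings of equal length...
x, y = "ab", "aa"
# ...are separated by appending z = x reversed:
z = x[::-1]                  # z = "ba"
print(is_palindrome(x + z))  # True:  "abba" is in PAL
print(is_palindrome(y + z))  # False: "aaba" is not
```

Since infinitely many pairwise-distinguishable strings exist, no finite automaton can accept PAL.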

Example of a Non-linear Context-free Grammar Consider the grammar G = ({S}, {a, b}, S, P), with production rules: S → aSa | SS | λ This grammar is context-free. Why? Is this grammar linear? Why or why not?

Regular vs. context-free Are regular languages context-free? Yes. Every regular language is generated by a right-linear grammar, which has a single variable on the left side of each rule, so it satisfies the context-free condition. But, as we have seen, not every context-free language is regular. So the regular languages are a proper subset of the context-free languages.

Derivation Given the grammar S → aaSB | λ B → bB | b the string aab can be derived in different ways: S ⇒ aaSB ⇒ aaB ⇒ aab S ⇒ aaSB ⇒ aaSb ⇒ aab

Parse tree Both derivations on the previous slide correspond to the following parse (or derivation) tree:

        S
      __|__
     / | | \
    a  a S  B
         |  |
         λ  b

The tree structure shows the rule applied to each nonterminal, without showing the order of rule applications. Each internal node of the tree corresponds to a nonterminal, and the leaves of the derivation tree represent the string of terminals.

Derivation In the derivation S ⇒ aaSB ⇒ aaB ⇒ aab, after the first step we replaced the leftmost variable S with λ, and then replaced B with b. We moved from left to right, replacing the leftmost variable at each step. This is called a leftmost derivation. Similarly, the derivation S ⇒ aaSB ⇒ aaSb ⇒ aab, in which the rightmost variable is replaced at each step, is called a rightmost derivation.

Leftmost (rightmost) derivation Definition 5.2: In a leftmost derivation, the leftmost nonterminal is replaced at each step. In a rightmost derivation, the rightmost nonterminal is replaced at each step. Many derivations are neither leftmost nor rightmost. If there is a single parse tree, there is also a single leftmost derivation.
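A leftmost derivation can be carried out mechanically. This Python sketch (the helper name is ours) replays the leftmost derivation of aab in the grammar S → aaSB | λ, B → bB | b from the earlier slide:

```python
def leftmost_step(form, variables, rhs):
    """Replace the leftmost variable occurring in `form` with `rhs`."""
    i = next(i for i, sym in enumerate(form) if sym in variables)
    return form[:i] + rhs + form[i + 1:]

variables = {"S", "B"}
form = "S"
# The rule right-hand sides to apply, in order:
# S -> aaSB, then S -> lambda, then B -> b.
for rhs in ["aaSB", "", "b"]:
    form = leftmost_step(form, variables, rhs)
    print(form or "lambda")
# prints: aaSB, aaB, aab
```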

Parse (derivation) trees Definition 5.3: Let G = (V, T, S, P) be a context-free grammar. An ordered tree is a derivation tree for G iff it has the following properties: 1. The root is labeled S. 2. Every leaf has a label from T ∪ {λ}. 3. Every interior vertex (not a leaf) has a label from V. 4. If a vertex has label A ∈ V, and its children are labeled (from left to right) a_1, a_2, ..., a_n, then P must contain a production of the form A → a_1 a_2 ... a_n. 5. A leaf labeled λ has no siblings; that is, a vertex with a child labeled λ can have no other children.

Parse (derivation) trees A partial derivation tree is one in which property 1 does not necessarily hold and in which property 2 is replaced by: 2a. Every leaf has a label from V ∪ T ∪ {λ}. The yield of the tree is the string of symbols in the order they are encountered when the tree is traversed in a depth-first manner, always taking the leftmost unexplored branch.

Parse (derivation) trees A partial derivation tree yields a sentential form of the grammar G that the tree is associated with. A derivation tree yields a sentence of the grammar G that the tree is associated with.

Parse (derivation) trees Theorem 5.1: Let G = (V, T, S, P) be a context-free grammar. Then for every w ∈ L(G) there exists a derivation tree of G whose yield is w. Conversely, the yield of any derivation tree of G is in L(G). If t_G is any partial derivation tree for G whose root is labeled S, then the yield of t_G is a sentential form of G. Any w ∈ L(G) has a leftmost and a rightmost derivation. The leftmost derivation is obtained by always expanding the leftmost variable in the derivation tree at each step, and similarly for the rightmost derivation.

Ambiguity A grammar is ambiguous if there is a string with two possible parse trees. (A string has more than one parse tree if and only if it has more than one leftmost derivation.) English can be ambiguous. Example: “Disabled fly to see Carter.”

Example V = {S} T = {+, *, (, ), 0, 1} P = {S → S + S | S * S | (S) | 1 | 0} The string 0 * 0 + 1 has two different parse trees. The derivation begins like this: S What is the leftmost variable? What can we replace it with? S + S or S * S or (S) or 1 or 0. Pick one of these at random, say S + S.

S  S + S | S * S | (S) | 1 | 0 Here is the parse tree: 0 0 Our string is 0 * 0 + 1. This parse corresponds to: compute 0 * 0 first, then add it to 1, which equals 1

Example S  S + S | S * S | (S) | 1 | 0 But there is another different parse tree that also generates the string 0 * 0 + 1 The derivation begins like this: S What is the leftmost variable? What can we replace it with? S + S or S * S or (S) or 1 or 0 Pick another one of these at random, say S * S

S  S + S | S * S | (S) | 1 | 0 Here is the parse tree: S S * S 0 S + S 0 1 Our string is still 0 * 0 + 1, but this parse corresponds to: take 0, and then multiply it by the sum of 0 + 1, which equals 0

S  S + S | S * S | (S) | 1 | 0 We can clearly indicate that the addition is to be done first. Here is the parse tree: S S * S 0 ( S ) S + S 0 1 Our string is now 0 * (0 + 1). This parse corresponds to: take 0, and then multiply it by the sum of 0 + 1, which equals 0

Equivalent grammars Here is an unambiguous grammar that generates the same language: S → S + A | A A → A * B | B B → (S) | 1 | 0 Two grammars that generate the same language are said to be equivalent. To make parsing easier, we prefer grammars that are not ambiguous.
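A hand-written parser for this unambiguous grammar shows why it is easier to parse. This Python sketch (function names are ours) unrolls the left-recursive rules S → S + A and A → A * B into loops, a standard transformation, and evaluates as it parses:

```python
def parse(tokens):
    """Parse and evaluate a token list under
    S -> S + A | A,  A -> A * B | B,  B -> (S) | 1 | 0."""
    pos = [0]

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def eat(tok):
        assert peek() == tok, f"expected {tok}"
        pos[0] += 1

    def S():                     # S -> A (+ A)*
        value = A()
        while peek() == "+":
            eat("+")
            value += A()
        return value

    def A():                     # A -> B (* B)*
        value = B()
        while peek() == "*":
            eat("*")
            value *= B()
        return value

    def B():                     # B -> (S) | 1 | 0
        if peek() == "(":
            eat("(")
            value = S()
            eat(")")
            return value
        value = int(peek())
        pos[0] += 1
        return value

    result = S()
    assert pos[0] == len(tokens), "trailing input"
    return result

print(parse(list("0*0+1")))    # 1: * binds tighter than +
print(parse(list("0*(0+1)")))  # 0: parentheses force + first
```

Because each string now has exactly one parse tree, the grammar itself fixes the precedence of * over +.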

Ambiguous grammars & equivalent grammars There is no general algorithm for determining whether a given CFG is ambiguous. There is no general algorithm for determining whether a given CFG is equivalent to another CFG.

Dangling else x = 3; if x > 2 then if x > 4 then x = 1; else x = 0; What value does x have at the end? If the else attaches to the inner if, the else branch runs and x = 0; if it attaches to the outer if, the else branch is skipped and x = 3. This is the dangling-else ambiguity.

Ambiguous grammar <statement> ::= IF <expression> THEN <statement> | IF <expression> THEN <statement> ELSE <statement> | <otherstatement> Unambiguous grammar <statement> ::= <st1> | <st2> <st1> ::= IF <expression> THEN <st1> ELSE <st1> | <otherstatement> <st2> ::= IF <expression> THEN <statement> | IF <expression> THEN <st1> ELSE <st2>

Ambiguous grammars Definition 5.6: If L is a context-free language for which there exists an unambiguous grammar, then L is said to be unambiguous. If every grammar that generates L is ambiguous, then the language is called inherently ambiguous. Example: L = {a^n b^n c^m} ∪ {a^n b^m c^m}, with n and m non-negative, is inherently ambiguous. See p. 144 in the text for discussion.

Exercise Show that the following grammar is ambiguous: S → AB | aaB A → a | Aa B → b Construct an equivalent grammar that is unambiguous.

Parsing In practical applications, it is usually not enough to decide whether a string belongs to a language. It is also important to know how to derive the string from the language. Parsing uncovers the syntactical structure of a string, which is represented by a parse tree. (The syntactical structure is important for assigning semantics to the string -- for example, if it is a program)

Parsing Let G be a context-free grammar for C++. Let the string w be a C++ program. One thing a compiler does - in particular, the part of the compiler called the “parser” - is determine whether w is a syntactically correct C++ program. It also constructs a parse tree for the program that is used in code generation. There are many sophisticated and efficient algorithms for parsing. You may study them in more advanced classes (for example, on compilers).

The Decision question for CFL’s If a string w belongs to L(G) generated by a CFG, can we always decide that it does belong to L(G)? Yes. Just do top-down parsing, in which we list all the sentential forms that can be generated in one step, two steps, three steps, etc. This is a type of exhaustive search parsing. Eventually, w will be generated. What if w does not belong to L(G)? Can we always decide that it doesn’t? Not unless we restrict the kinds of rules we can have in our grammar. Suppose we ask if w = aab is a string in L(G). If we have λ-rules, such as B → λ, in G, we might have a sentential form like aabB^4000 (B repeated 4000 times) and still be able to end up with aab.

The Decision question for CFL’s What we need to do is restrict the kinds of rules in our CFG’s so that each rule, when applied, is guaranteed either to increase the length of the sentential form or to increase the number of terminals in it. That means that we don’t want rules of the following two forms in our CFG’s: A → λ A → B If we have a CFG that lacks these kinds of rules, then as soon as a sentential form is generated that is longer than our string w, we can abandon any attempt to generate w from this sentential form.

The Decision question for CFL’s If the grammar does not have these two kinds of rules, then, in a finite number of steps, applying our exhaustive search parsing technique to G will generate all possible sentential forms of G with length ≤ |w|. If w has not been generated by this point, then w is not a string in the language, and we can stop generating sentential forms.

The Decision question for CFL’s Consider the grammar G = ({S}, {a, b}, S, P), where P is: S → SS | aSb | bSa | ab | ba Looking at the production rules, it is easy to see that the sentential form grows by at least one symbol during each derivation step. Thus, in ≤ |w| derivation steps, G will produce either a string of all terminals, which may be compared directly to w, or a sentential form too long to be capable of producing w. Hence, given any w ∈ {a, b}+, the exhaustive search parsing technique will always terminate in a finite number of steps.

The Decision question for CFL’s Theorem 5.2: Assume that G = (V, T, S, P) is a context-free grammar with no rules of the form A → λ or A → B, where A, B ∈ V. Then the exhaustive search parsing technique can be made into an algorithm which, for any w ∈ T*, either produces a parsing for w or tells us that no parsing is possible.
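The exhaustive search technique of Theorem 5.2 can be sketched directly. This Python version (names are ours) expands the leftmost variable of each sentential form round by round and prunes forms longer than w, which is safe precisely because the grammar has no A → λ or A → B rules:

```python
def exhaustive_parse(rules, start, w):
    """Decide whether w is derivable from `start` by breadth-first
    exhaustive search over leftmost derivations. Assumes no A -> lambda
    and no A -> B rules, so any form longer than w can be discarded."""
    frontier = {start}
    seen = set()
    while frontier:
        next_round = set()
        for form in frontier:
            if form == w:
                return True
            if form in seen or len(form) > len(w):
                continue            # prune: this form can never yield w
            seen.add(form)
            # Leftmost derivation: expand only the first variable.
            for i, sym in enumerate(form):
                if sym in rules:
                    for rhs in rules[sym]:
                        next_round.add(form[:i] + rhs + form[i + 1:])
                    break
        frontier = next_round
    return False

rules = {"S": ["SS", "aSb", "bSa", "ab", "ba"]}
print(exhaustive_parse(rules, "S", "aabb"))  # True:  S => aSb => aabb
print(exhaustive_parse(rules, "S", "aab"))   # False: search terminates
```

Without the pruning step, the λ-rule problem from the earlier slide would make the search run forever on strings outside the language.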

The Decision question for CFL’s Since we don’t know ahead of time which derivation sequences to try, we have to try all of the possible applications of rules which result in one of two conditions: a string of all terminals of length |w|, or a sentential form of length |w| + 1. The application of any one rule must result in either: replacing a variable with one or more terminals, or increasing the length of a sentential form by one or more characters. The worst case scenario is applying |w| rules that increase the length of a sentential form to |w|, and then applying |w| rules that replace each variable with a terminal symbol, and ending up with a string of |w| terminals that doesn’t match w. This takes 2|w| operations.

The Decision question for CFL’s How many sentential forms will we have to examine? Restricting ourselves to leftmost derivations, with |P| production rules, applying each rule one time to S gives us |P| sentential forms. Example: Given the 5 production rules S → SS | aSb | bSa | ab | ba, one round of leftmost derivations produces 5 sentential forms: S ⇒ SS S ⇒ aSb S ⇒ bSa S ⇒ ab S ⇒ ba

The Decision question for CFL’s The second round of leftmost derivations produces 15 sentential forms: SS ⇒ SSS SS ⇒ aSbS SS ⇒ bSaS SS ⇒ abS SS ⇒ baS aSb ⇒ aSSb aSb ⇒ aaSbb aSb ⇒ abSab aSb ⇒ aabb aSb ⇒ abab bSa ⇒ bSSa bSa ⇒ baSba bSa ⇒ bbSaa bSa ⇒ baba bSa ⇒ bbaa ab and ba don’t produce any new sentential forms, since they consist of all terminals. If they had contained variables, then the second round of leftmost derivations would have produced 25, or |P|^2, sentential forms. Similarly, the third round of leftmost derivations can produce |P|^3 sentential forms.

The Decision question for CFL’s We know from our worst-case scenario that we never have to run through more than 2|w| rounds of rule applications in any one derivation sequence before being able to stop the derivation. Therefore, the total number of sentential forms that we may have to generate to decide whether string w belongs to L(G), generated by grammar G = (V, T, S, P), is ≤ |P| + |P|^2 + ... + |P|^(2|w|). Unfortunately, this means that the work we might have to do to answer the decision question for CFG’s could grow exponentially with the length of the string.

The Decision question for CFL’s It can be shown that more efficient parsing techniques for CFG’s exist. Theorem 5.3: For every context-free grammar there exists an algorithm that parses any w ∈ L(G) in a number of steps proportional to |w|^3. Your textbook does not offer a proof for this theorem. Anyway, what is really needed is a linear-time parsing algorithm for CFG’s. Such an algorithm exists for some special cases of CFG’s but not for the class of CFG’s in general.

S-grammars Definition 5.4: A context-free grammar G = (V, T, S, P) is said to be a simple grammar or s-grammar if all of its productions are of the form A → ax, where A ∈ V, a ∈ T, x ∈ V*, and any pair (A, a) occurs at most once in P. Example: The following grammar is an s-grammar: S → aS | bSS | c The following grammar is not an s-grammar. Why not? S → aS | bSS | aSS | c

S-grammars If G is an s-grammar, then any string w in L(G) can be parsed with an effort proportional to |w|.

S-grammars Let’s consider the grammar expressed by the following production rules: S → aS | bSS | c Since G is an s-grammar, all rules have the form A → ax. Assume that w = abcc. Due to the restrictive condition that any pair (A, a) may occur at most once in P, we know immediately which production rule must have generated the a in abcc: the rule S → aS. Similarly, there is only one way to produce the b and the two c’s. So we can parse w in no more than |w| steps.
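That deterministic strategy can be sketched in a few lines of Python (names are ours): keep a prediction stack of variables, and for each input symbol apply the unique rule A → ax selected by the pair (A, a):

```python
def s_parse(rules, start, w):
    """Deterministic top-down parse for an s-grammar: every rule is
    A -> a x with x in V* and (A, a) unique, so each input symbol
    selects exactly one rule. Runs in O(|w|) steps."""
    stack = [start]                  # predicted variables, top at end
    for ch in w:
        if not stack:
            return False             # input left over, nothing predicted
        top = stack.pop()
        rhs = rules[top].get(ch)     # the unique rule top -> ch rhs
        if rhs is None:
            return False             # no rule matches this symbol
        stack.extend(reversed(rhs))  # push remaining variables of the rule
    return not stack                 # accept iff everything was matched

# S -> aS | bSS | c, keyed by the (variable, terminal) pair:
rules = {"S": {"a": "S", "b": "SS", "c": ""}}
print(s_parse(rules, "S", "abcc"))  # True, found in |w| = 4 steps
print(s_parse(rules, "S", "abc"))   # False, an S is left unexpanded
```

Each of the |w| input symbols triggers exactly one rule choice, which is where the linear-time bound comes from.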

Exercise Let G be the grammar S → abSc | A A → cAd | cd 1) Give a derivation of ababccddcc. 2) Build the parse tree for the derivation of (1). 3) Use set notation to define L(G).

Programming languages Programming languages are context-free, but not regular Programming languages have the following features that require infinite “stack memory” matching parentheses in algebraic expressions nested if .. then .. else statements, and nested loops block structure

Programming languages Programming languages are often defined using a convention for specifying grammars called Backus-Naur form, or BNF. Example: <expression> ::= <term> | <expression> + <term>

Programming languages Backus-Naur form is very similar to the standard CFG grammar form, but variables are listed within angle brackets, ::= is used instead of →, and {X} is used to mean 0 or more occurrences of X. The | is still used to mean “or”. Pascal’s if statement: <if-statement> ::= if <expression> <then-clause> <else-clause>

Programming languages S-grammars are not sufficiently powerful to handle all the syntactic features of a typical programming language LL grammars and LR grammars (see next chapter) are normally used for specifying programming languages. They are more complicated than s-grammars, but still permit parsing in linear time. Some aspects of programming languages (i.e., semantics) cannot be handled by context-free grammars.