
Context Free Grammars

Context Free Languages (CFL) The pumping lemma showed there are languages that are not regular –There are many classes “larger” than that of the regular languages –One of these classes is the “Context Free” languages, described by Context-Free Grammars (CFG) –Why the name context-free? We may substitute a string for a variable regardless of the surrounding context (which implies that context-sensitive languages also exist) CFGs are useful in many applications –Describing the syntax of programming languages –Parsing –Describing the structure of documents, e.g., XML Analogy of the day: –DFA : Regular Expression :: Pushdown Automaton : CFG

CFG Example Language of palindromes –We can easily show using the pumping lemma that the language L = { w | w = w^R } is not regular. –However, we can describe this language by the following context-free grammar over the alphabet {0,1}: P → ε, P → 0, P → 1, P → 0P0, P → 1P1 (an inductive definition) More compactly: P → ε | 0 | 1 | 0P0 | 1P1
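To make the inductive definition concrete, here is a small Python sketch (not from the slides) that enumerates the strings this grammar generates, length by length, and confirms they are exactly palindromes over {0,1}:

```python
def gen(n):
    """Strings of length n generated by P -> e | 0 | 1 | 0P0 | 1P1."""
    if n == 0:
        return {""}          # P -> e
    if n == 1:
        return {"0", "1"}    # P -> 0 | 1
    # P -> 0P0 | 1P1 wraps a shorter palindrome in matching symbols
    return {c + w + c for c in "01" for w in gen(n - 2)}

words = set().union(*(gen(n) for n in range(5)))
assert all(w == w[::-1] for w in words)   # every generated string is a palindrome
assert "0110" in words and "01" not in words
```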

Formal Definition of a CFG There is a finite set of symbols that form the strings, i.e., there is a finite alphabet. The alphabet symbols are called terminals (think of the leaves of a parse tree). There is a finite set of variables, sometimes called non-terminals or syntactic categories. Each variable represents a language (i.e., a set of strings). –In the palindrome example, the only variable is P. One of the variables is the start symbol; other variables may exist to help define the language. There is a finite set of productions or production rules that represent the recursive definition of the language. Each production: 1. Has a single variable that is being defined, to the left of the production 2. Has the production symbol → 3. Has a string of zero or more terminals or variables, called the body of the production. To form strings we can substitute a production body in for a variable wherever that variable appears.

CFG Notation A CFG G may then be represented by these four components, denoted G=(V,T,P,S) –V is the set of variables –T is the set of terminals –P is the set of productions –S is the start symbol.

Sample CFG 1. E → I // Expression is an identifier 2. E → E+E // Add two expressions 3. E → E*E // Multiply two expressions 4. E → (E) // Parenthesized expression 5. I → L // Identifier is a letter 6. I → ID // Identifier followed by a digit 7. I → IL // Identifier followed by a letter 8. D → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 // Digits 9. L → a | b | c | … | A | B | … | Z // Letters Note that identifiers are regular; we could describe them as (letter)(letter + digit)*

Recursive Inference The process of coming up with strings that satisfy individual productions and then concatenating them together according to more general rules is called recursive inference. This is a bottom-up process. For example, inferring that the identifier “r5” is an expression: –Rule 8 tells us that D ⇒ 5 –Rule 9 tells us that L ⇒ r –Rule 5 tells us that I → L, so I ⇒ r –Apply recursive inference using rule 6 for I → ID to get I ⇒ rD; use D ⇒ 5 to get I ⇒ r5. –Finally, we know from rule 1 that E → I, so r5 is also an expression.
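A bottom-up membership check like this can be sketched generically. The Python code below is an illustration, not from the text: it encodes the sample grammar with each variable as a single uppercase letter, and (as a simplifying assumption) restricts rule 9 to lowercase letters so the terminals never clash with the variable names E, I, D, L:

```python
from functools import lru_cache

GRAMMAR = {
    "E": ["I", "E+E", "E*E", "(E)"],
    "I": ["L", "ID", "IL"],
    "D": list("0123456789"),
    "L": list("abcdefghijklmnopqrstuvwxyz"),  # simplified: lowercase only
}

@lru_cache(maxsize=None)
def derives(var, s):
    """Can variable `var` derive the terminal string `s`? (recursive inference)"""
    return any(match(body, s) for body in GRAMMAR[var])

@lru_cache(maxsize=None)
def match(body, s):
    """Can the production body derive s, symbol by symbol?"""
    if not body:
        return s == ""
    head, rest = body[0], body[1:]
    if head not in GRAMMAR:                      # terminal symbol
        return s.startswith(head) and match(rest, s[1:])
    # variable: try every split point; check the cheap remainder first
    return any(match(rest, s[i:]) and derives(head, s[:i])
               for i in range(len(s) + 1))

assert derives("E", "r5")         # rules 8, 9, 5, 6, 1 as in the slide
assert derives("E", "a*(a+b1)")
assert not derives("E", "5r")     # identifiers cannot start with a digit
```

This brute-force search is exponential in general, but fine for short strings like these.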

Recursive Inference Exercise Show the recursive inference for arriving at (x+int1)*10 is an expression

Derivation Similar to recursive inference, but top-down instead of bottom-up –Expand the start symbol first and work our way down in such a way that it matches the input string For example, given a*(a+b1) we can derive it as follows: –E ⇒ E*E ⇒ I*E ⇒ L*E ⇒ a*E ⇒ a*(E) ⇒ a*(E+E) ⇒ a*(I+E) ⇒ a*(L+E) ⇒ a*(a+E) ⇒ a*(a+I) ⇒ a*(a+ID) ⇒ a*(a+LD) ⇒ a*(a+bD) ⇒ a*(a+b1) Note that at each step we could have chosen any one of the variables to replace with the body of one of its productions.
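This derivation discipline can be checked mechanically. The sketch below (illustrative; it reuses the sample expression grammar with rule 9 restricted to lowercase letters) replays the derivation and verifies that each step rewrites the leftmost variable with one of its production bodies:

```python
GRAMMAR = {
    "E": ["I", "E+E", "E*E", "(E)"],
    "I": ["L", "ID", "IL"],
    "D": list("0123456789"),
    "L": list("abcdefghijklmnopqrstuvwxyz"),
}

def valid_leftmost_step(prev, nxt):
    """True if nxt is prev with its leftmost variable replaced by a body."""
    for i, ch in enumerate(prev):
        if ch in GRAMMAR:  # leftmost variable found
            return any(prev[:i] + body + prev[i + 1:] == nxt
                       for body in GRAMMAR[ch])
    return False           # prev had no variable left to expand

derivation = ["E", "E*E", "I*E", "L*E", "a*E", "a*(E)", "a*(E+E)",
              "a*(I+E)", "a*(L+E)", "a*(a+E)", "a*(a+I)", "a*(a+ID)",
              "a*(a+LD)", "a*(a+bD)", "a*(a+b1)"]

assert all(valid_leftmost_step(p, n)
           for p, n in zip(derivation, derivation[1:]))
assert not any(ch in GRAMMAR for ch in derivation[-1])  # all terminals
```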

Formal Description of Derivation First we need some new terminology! The process of deriving a string by applying a production from head to body is denoted by ⇒ If α and β are strings consisting of terminals and variables, and A is a variable, then let A → γ be a production of grammar G. –We can then say αAβ ⇒G αγβ –Often we will assume we are working with grammar G, and leave it off: αAβ ⇒ αγβ

Multiple Derivation Steps Just as we defined δ̂, the extended transition function that accepts a string, we can define a similar notion for the derivation ⇒ If we take multiple derivation steps, we use ⇒* to indicate “zero or more steps,” defined inductively: –Basis: For any string α of terminals and variables, α ⇒* α. That is, any string derives itself. –Induction: If α ⇒* β and β ⇒ γ, then α ⇒* γ. That is, if α can become β in zero or more steps, then we can take one more step to γ, meaning α derives γ. The proof is straightforward.
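A length-bounded search makes the inductive definition of ⇒* concrete. This sketch (illustrative, not from the text) computes every sentential form the palindrome grammar can reach from P, so ⇒* membership can be checked directly:

```python
from collections import deque

PRODS = ["", "0", "1", "0P0", "1P1"]  # bodies of P -> e | 0 | 1 | 0P0 | 1P1

def reachable(start="P", max_len=6):
    """All sentential forms alpha with start =>* alpha, up to max_len symbols."""
    seen = {start}                    # basis: start =>* start in zero steps
    frontier = deque([start])
    while frontier:
        form = frontier.popleft()
        i = form.find("P")
        if i < 0:
            continue                  # all terminals: nothing left to rewrite
        for body in PRODS:            # induction: take one more => step
            new = form[:i] + body + form[i + 1:]
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                frontier.append(new)
    return seen

forms = reachable()
assert "P" in forms                   # any string derives itself
assert "0P0" in forms and "0110" in forms
assert "01" not in forms              # 01 is not a palindrome
```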

Multiple Derivation We already saw an example of ⇒ in deriving a*(a+b1) We could have used ⇒* to condense the derivation. –E.g., we could just go straight to E ⇒* E*(E+E), or even straight to the final step: E ⇒* a*(a+b1) Going straight to the end is not recommended on a homework or exam problem if you are supposed to show the derivation

Leftmost Derivation In the previous example we used a derivation called a leftmost derivation. We can specifically denote a leftmost derivation using the subscript “lm”, as in ⇒lm or ⇒*lm A leftmost derivation is simply one in which we always replace the leftmost variable in the sentential form by one of its production bodies, working from left to right.

Rightmost Derivation Not surprisingly, we also have a rightmost derivation, which we can specifically denote via ⇒rm or ⇒*rm A rightmost derivation is one in which we always replace the rightmost variable by one of its production bodies, working from right to left.

Rightmost Derivation Example A leftmost derivation of a*(a+b1) was already shown. We can also come up with a rightmost derivation, but we must make replacements in a different order: –E ⇒rm E*E ⇒rm E*(E) ⇒rm E*(E+E) ⇒rm E*(E+I) ⇒rm E*(E+ID) ⇒rm E*(E+I1) ⇒rm E*(E+L1) ⇒rm E*(E+b1) ⇒rm E*(I+b1) ⇒rm E*(L+b1) ⇒rm E*(a+b1) ⇒rm I*(a+b1) ⇒rm L*(a+b1) ⇒rm a*(a+b1)

Left or Right? Does it matter which method you use? Answer: No. Any derivation has an equivalent leftmost and rightmost derivation. That is, A ⇒* w iff A ⇒*lm w and A ⇒*rm w.

Language of a Context Free Grammar The language represented by a CFG G=(V,T,P,S), denoted L(G), is a Context Free Language (CFL) and consists of the terminal strings that have derivations from the start symbol: L(G) = { w in T* | S ⇒*G w } Note that the CFL L(G) consists solely of terminals from G.

Sentential Forms A sentential form is a special name given to strings derivable from the start symbol. If we have a string α that consists entirely of terminals or variables and S ⇒* α, where S is the start symbol, then α is a sentential form. Note that we can have leftmost or rightmost sentential forms based on which type of derivation we are using.

CFG Exercises

Parse Trees A parse tree is a top-down representation of a derivation –Good way to visualize the derivation process –Will also be useful for some proofs coming up! If we can generate multiple parse trees for the same string, then there is ambiguity in the grammar –This is often undesirable; for example, in a programming language we would not like the computer to interpret a line of code in a way different from what the programmer intends. –But sometimes ambiguity is difficult or impossible to avoid.

Parse Tree Construction

Sample Parse Tree Sample parse tree for the palindrome CFG: P → ε | 0 | 1 | 0P0 | 1P1

Sample Parse Tree Using a leftmost derivation generates the parse tree for a*(a+b1) Does using a rightmost derivation produce a different tree? The yield of the parse tree is the string that results when we concatenate the leaves from left to right (e.g., doing a leftmost depth-first search). –The yield is always a string that is derived from the root and is guaranteed to be a string in the language L.
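A parse tree and its yield are easy to represent directly. In this sketch (illustrative), a variable node is a (name, children) pair and a leaf is a terminal character; concatenating the leaves left to right reproduces a*(a+b1):

```python
def tree_yield(node):
    """Concatenate the tree's leaves from left to right."""
    if isinstance(node, str):
        return node                  # a leaf: a single terminal
    _name, children = node
    return "".join(tree_yield(c) for c in children)

# Parse tree for a*(a+b1) under the sample expression grammar
tree = ("E", [
    ("E", [("I", [("L", ["a"])])]),
    "*",
    ("E", ["(",
           ("E", [
               ("E", [("I", [("L", ["a"])])]),
               "+",
               ("E", [("I", [("I", [("L", ["b"])]),
                             ("D", ["1"])])]),
           ]),
           ")"]),
])

assert tree_yield(tree) == "a*(a+b1)"
```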

Inference, Derivations, and Parse Trees We have used the following forms to describe the processing of CFGs, i.e., to decide whether or not a string s is in the language of a CFG with start symbol A: 1. The recursive inference procedure run on s can determine that s is in the language 2. A ⇒* s 3. A ⇒*lm s 4. A ⇒*rm s 5. The parse tree rooted at A contains s as its yield All of these forms are equivalent for strings consisting of terminal symbols. All of these forms except #1 are equivalent for strings consisting of terminals or variables (because we only defined recursive inference for terminal symbols). –However, derivations and parse trees are equivalent even including variables. This means that if we can create a parse tree of some sort, we can create a corresponding derivation, whether leftmost, rightmost, or mixed, that expresses the same behavior as the parse tree.

Proof of Equivalence between Derivation, Recursive Inference, and Parse Trees We skip the full proofs; they are given in the text. General strategy: Recursive Inference → Parse Tree → (Leftmost | Rightmost derivation) → Derivation → Recursive Inference The loop back to recursive inference completes the equivalence. –To go from recursive inference to parse trees, we create a child/parent relationship each time we make a recursive inference. –The parse tree can generate a leftmost derivation by following leftmost children in the tree first, while the rightmost derivation examines rightmost children first. –A derivation to recursive inference is shown by building individual productions of the form A → w into A ⇒* w.

Ambiguous Grammars A CFG is ambiguous if one or more terminal strings have multiple leftmost derivations from the start symbol. –Equivalently: multiple rightmost derivations, or multiple parse trees. Example –E → E+E | E*E –E+E*E can be parsed as: E ⇒ E+E ⇒ E+E*E (the + at the root of the parse tree) or E ⇒ E*E ⇒ E+E*E (the * at the root), giving two different parse trees for the same string
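The two derivations correspond to two different parse trees with the same yield. A small sketch, using `a` as a stand-in terminal (an assumption, since the slide's grammar fragment gives no terminal productions):

```python
def tree_yield(node):
    """Concatenate the leaves of a (name, children) tree left to right."""
    if isinstance(node, str):
        return node
    _name, children = node
    return "".join(tree_yield(c) for c in children)

A = ("E", ["a"])                                     # E => a (stand-in terminal)
plus_at_root = ("E", [A, "+", ("E", [A, "*", A])])   # from E => E+E first
star_at_root = ("E", [("E", [A, "+", A]), "*", A])   # from E => E*E first

assert tree_yield(plus_at_root) == tree_yield(star_at_root) == "a+a*a"
assert plus_at_root != star_at_root   # same string, two parse trees: ambiguous
```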

Ambiguous Grammar Is the following grammar ambiguous? –S → AS | ε –A → A1 | 0A1 | 01 –Try deriving 00111: S ⇒ AS ⇒ A1S ⇒ 0A11S ⇒ 00111S ⇒ 00111 S ⇒ AS ⇒ 0A1S ⇒ 0A11S ⇒ 00111S ⇒ 00111 Two different leftmost derivations of the same string, so yes.
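Both derivations can be replayed and checked mechanically. This sketch (illustrative) verifies that each is a valid leftmost derivation of the same string 00111:

```python
GRAMMAR = {"S": ["AS", ""], "A": ["A1", "0A1", "01"]}  # "" is epsilon

def valid_leftmost_step(prev, nxt):
    """True if nxt rewrites the leftmost variable of prev with a body."""
    for i, ch in enumerate(prev):
        if ch in GRAMMAR:
            return any(prev[:i] + body + prev[i + 1:] == nxt
                       for body in GRAMMAR[ch])
    return False

d1 = ["S", "AS", "A1S", "0A11S", "00111S", "00111"]
d2 = ["S", "AS", "0A1S", "0A11S", "00111S", "00111"]

for d in (d1, d2):
    assert all(valid_leftmost_step(p, n) for p, n in zip(d, d[1:]))
assert d1 != d2 and d1[-1] == d2[-1] == "00111"   # two leftmost derivations
```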

Removing Ambiguity No algorithm can tell us whether an arbitrary CFG is ambiguous –Related to the Halting / Post Correspondence Problems Why care? –Ambiguity can be a problem in things like programming languages, where we want agreement between the programmer and compiler over what happens Solutions –Apply precedence –E.g., instead of: E → E+E | E*E –Use: E → T | E + T, T → F | T * F This layering places the + production above the * production in the parse tree, so * binds tighter (which means we multiply before adding)
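The effect of the layered grammar can be seen in a tiny recursive-descent evaluator. This is an illustrative sketch: F is filled in as digit | (E), which the slide leaves implicit, and the left recursion E → E + T is handled iteratively as E → T ('+' T)*:

```python
def evaluate(expr):
    """Evaluate expr using E -> T | E+T, T -> F | T*F, F -> digit | (E)."""
    pos = 0

    def peek():
        return expr[pos] if pos < len(expr) else None

    def parse_E():            # E -> T ('+' T)*
        nonlocal pos
        val = parse_T()
        while peek() == "+":
            pos += 1
            val += parse_T()
        return val

    def parse_T():            # T -> F ('*' F)*
        nonlocal pos
        val = parse_F()
        while peek() == "*":
            pos += 1
            val *= parse_F()
        return val

    def parse_F():            # F -> digit | (E)
        nonlocal pos
        if peek() == "(":
            pos += 1
            val = parse_E()
            pos += 1          # consume ")"
            return val
        val = int(expr[pos])
        pos += 1
        return val

    return parse_E()

assert evaluate("2+3*4") == 14    # * is grouped below +, so it binds tighter
assert evaluate("(2+3)*4") == 20  # parentheses override precedence
```

Because parse_T is called from inside parse_E, every * is reduced before the surrounding +; the grammar's shape, not a separate precedence table, enforces the order of operations.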

Inherent Ambiguity A CFL is said to be inherently ambiguous if all of its grammars are ambiguous –Obviously such languages would be bad choices for programming languages –Such languages exist; see the book for details