Finite State Automata are Limited. Let us use (context-free) grammars!

Context-Free Grammar for a^n b^n
S ::= ε        (a grammar rule)
S ::= a S b    (another grammar rule)
Example of a derivation:
S => aSb => aaSbb => aaaSbbb => aaabbb
Parse tree: leaves give us the result

Context-Free Grammars
G = (A, N, S, R)
A - terminals (alphabet for generated words w ∈ A*)
N - non-terminals: symbols with (recursive) definitions
S ∈ N - the starting symbol
Grammar rules in R are pairs (n, v), written n ::= v, where n ∈ N is a non-terminal and v ∈ (A ∪ N)* is a sequence of terminals and non-terminals.
A derivation in G starts from the starting symbol S. Each step replaces a non-terminal with one of its right-hand sides.
Example from before: G = ({a,b}, {S}, S, {(S, ε), (S, aSb)})
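As an illustration (not part of the slides), the example grammar can be encoded as plain data, and one derivation step is then just a substring replacement; the encoding and helper names below are my own.

```python
# Sketch: the example grammar G = ({a,b}, {S}, S, R) as plain Python data.
A = {"a", "b"}                  # terminals
N = {"S"}                       # non-terminals
start = "S"                     # starting symbol S
R = [("S", ""), ("S", "aSb")]   # rules (n, v), written n ::= v

def step(word, rule_index, pos):
    """One derivation step: replace the non-terminal at position pos
    by the right-hand side of the chosen rule."""
    n, v = R[rule_index]
    assert word[pos] == n, "rule must match the non-terminal being replaced"
    return word[:pos] + v + word[pos + 1:]

w = start
w = step(w, 1, 0)   # S => aSb
w = step(w, 1, 1)   # aSb => aaSbb
w = step(w, 0, 2)   # aaSbb => aabb
print(w)            # aabb
```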

Parse Tree
Given a grammar G = (A, N, S, R), t is a parse tree of G iff t is a node-labelled tree with ordered children that satisfies:
- the root is labelled by S
- leaves are labelled by elements of A
- each non-leaf node is labelled by an element of N
- for each non-leaf node labelled by n whose children, left to right, are labelled by p1 ... pk, we have a rule (n ::= p1 ... pk) ∈ R
The yield of a parse tree t is the unique word in A* obtained by reading the leaves of t from left to right.
Language of a grammar G: the yields of all parse trees of G
L(G) = { yield(t) | isParseTree(G, t) }
isParseTree is an easy-to-check condition. Harder: knowing whether a word has a parse tree.
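A small sketch (representation is my own, not from the slides): a parse tree as a nested tuple ("label", [children]), where leaves are terminal strings and ("S", []) stands for applying S ::= ε. The yield is computed exactly as described above, by reading leaves left to right.

```python
# yield_of reads the leaves of a parse tree from left to right.
def yield_of(t):
    if isinstance(t, str):      # leaf: a terminal symbol
        return t
    label, children = t         # non-leaf: non-terminal with ordered children
    return "".join(yield_of(c) for c in children)

# Parse tree for aabb in the grammar S ::= ε | aSb:
tree = ("S", ["a", ("S", ["a", ("S", []), "b"]), "b"])
print(yield_of(tree))   # aabb
```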

Grammar Derivation
A derivation for G is any sequence of words p_i ∈ (A ∪ N)* such that:
- the first word is S
- each subsequent word is obtained from the previous one by replacing one of its letters by the right-hand side of a rule in R: p_i = u n v, (n ::= q) ∈ R, p_{i+1} = u q v
- the last word has only letters from A
Each parse tree of a grammar has one or more derivations, which expand the tree gradually from S. Different orders of expanding non-terminals may generate the same tree.

Remark
We abbreviate the two rules
  S ::= p
  S ::= q
as S ::= p | q

Example: Parse Tree vs Derivation
Consider the grammar G = ({a,b}, {S,P,Q}, S, R) where R is:
S ::= PQ
P ::= a | aP
Q ::= ε | aQb
Show a parse tree for aaaabb.
Show at least two derivations that correspond to that tree.

Balanced Parentheses Grammar
Consider the language L consisting of precisely those words over the parentheses "(" and ")" that are balanced (each parenthesis has a matching one).
Example sequences of parentheses:
( ( () ) ())  - balanced, belongs to the language
( ) ) ( ( )   - not balanced, does not belong
Exercise: give the grammar and an example derivation for the first string.

Balanced Parentheses Grammar

Proving that a Grammar Defines a Language
Grammar G:  S ::= ε | (S)S  defines the language L(G).
Theorem: L(G) = L_b, where
L_b = { w | for every pair u, v of words such that uv = w, the number of ( symbols in u is greater than or equal to the number of ) symbols in u, and these numbers are equal in w }
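The defining condition of L_b can be checked directly by scanning the word once and tracking the difference between the number of ( and ) seen so far; a minimal sketch (function name is mine):

```python
# Direct check of the L_b condition: every prefix has at least as many
# '(' as ')', and the totals are equal in the whole word.
def in_Lb(w):
    depth = 0
    for c in w:
        depth += 1 if c == "(" else -1
        if depth < 0:        # some prefix has more ')' than '('
            return False
    return depth == 0        # counts are equal in w itself

print(in_Lb("((())())"))     # True  (the balanced example from before)
print(in_Lb("())(()"))       # False (the unbalanced example)
```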

L(G) ⊆ L_b: If w ∈ L(G), then it has a parse tree. We show w ∈ L_b by induction on the size of the parse tree deriving w using G.
If the tree has one node, the derived word is ε, and ε ∈ L_b, so we are done.
Suppose the property holds for all trees of size less than n, and consider a tree of size n. The root of the tree uses the rule S ::= (S)S. The sub-trees for the first and second S derive words p and q that belong to L_b by the induction hypothesis, so the derived word w is of the form (p)q with p, q ∈ L_b.
Let us check that (p)q ∈ L_b. Let (p)q = uv and count the number of ( and ) in u:
- if u = ε, then u satisfies the property;
- if 0 < |u| ≤ |p| + 1, then u consists of ( followed by a prefix of p, so u has at least one more ( than );
- otherwise u is of the form (p)q1 where q1 is a prefix of q; because the parentheses balance out in p, and thus in (p), the difference between the number of ( and ) in u equals this difference in q1, which satisfies the property since q1 is a prefix of q ∈ L_b.
Thus every prefix of w satisfies the property, and since p and q are balanced, the counts of ( and ) are equal in w itself, so w ∈ L_b.

L_b ⊆ L(G): If w ∈ L_b, we need to show that it has a parse tree. We do so by (strong) induction on |w|.
If w = ε, then it has a tree of size one (only the root, using S ::= ε).
Otherwise, suppose every word of L_b shorter than w has a parse tree. (It helps to picture a plot of the difference between the number of ( and ) as a prefix of w grows.) We split w in the following way: let p1 be the shortest non-empty prefix of w such that the number of ( in p1 equals the number of ). Such a prefix always exists and is non-empty, but it could be equal to w itself. Note that it must be that p1 = (p) for some p: because p1 is a prefix of a word in L_b, its first symbol must be (, and because the counts become equal exactly at its end, its last symbol must be ). Therefore w = (p)q for some shorter words p and q.
Because we chose p1 to be shortest, every non-empty prefix of (p has strictly more ( than ). Therefore every prefix of p has a greater or equal number of ( than ), and the counts are equal in p itself, so p ∈ L_b.
Next, for prefixes of w of the form (p)v, the difference between the number of ( and ) equals this difference in v itself, since (p) is balanced. Thus every prefix of q has at least as many ( as ), the counts are equal in q, and so q ∈ L_b.
We have thus shown that w is of the form (p)q where p, q ∈ L_b. By the induction hypothesis, p and q have parse trees, so by the rule S ::= (S)S there is a parse tree for w.

Exercise: Grammar Equivalence
Show that each string that can be derived by grammar G1
  B ::= ε | ( B ) | B B
can also be derived by grammar G2
  B ::= ε | ( B ) B
and vice versa. In other words, L(G1) = L(G2).
Remark: there is no algorithm to check equivalence of arbitrary grammars. We must be clever.

Grammar Equivalence
G1: B ::= ε | ( B ) | B B
G2: B ::= ε | ( B ) B
(Easy) Lemma: Each word over the alphabet A = {(, )} that can be derived by G2 can also be derived by G1.
Proof: Consider a derivation of a word w from G2. We construct a derivation of w in G1 by replacing each step with one or more steps that produce the same effect. We have two cases, depending on the step in the G2 derivation:
  uBv => uv        replace by (the same step)  uBv => uv
  uBv => u(B)Bv    replace by                  uBv => uBBv => u(B)Bv
This constructs a valid derivation in G1.
Corollary: L(G2) ⊆ L(G1)
Lemma: L(G1) ⊆ L_b (words derived by G1 are balanced parentheses). Proof: very similar to the proof of L(G2) ⊆ L_b from before.
Lemma: L_b ⊆ L(G2) - this was one direction of the proof that L_b = L(G2) before.
Corollary: L(G1) = L(G2) = L_b

Regular Languages and Grammars
Exercise: give a grammar describing the same language as this regular expression: (a|b) (ab)* b*

Translating a Regular Expression into a Grammar
Suppose we first allow the regular expression operators * and | within grammars.
Then a regular expression R becomes simply the rule S ::= R.
Then we give rules to remove * and | by introducing new non-terminal symbols.

Eliminating Additional Notation
- Alternatives: s ::= P | Q becomes the two rules s ::= P and s ::= Q
- Parenthesized groups: introduce a fresh non-terminal for the group, as in expr (&& | < | == | + | - | * | / | %) expr
- Kleene star, as in { statmt* }: introduce a fresh non-terminal with an epsilon alternative and a recursive one
- Option: use an alternative with epsilon, as in if ( expr ) statmt (else statmt)?
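The first elimination rule is purely mechanical; a hypothetical helper (not from the slides) can split top-level alternatives into separate rules. Note this textual split handles only rules without nested parenthesized groups.

```python
# Split s ::= P | Q into the two rules s ::= P and s ::= Q.
# Rules are (nonterminal, right-hand-side-string) pairs; names are mine.
def split_alternatives(rules):
    out = []
    for n, rhs in rules:
        for alt in rhs.split("|"):      # assumes no nested (... | ...) groups
            out.append((n, alt.strip()))
    return out

print(split_alternatives([("S", "P | Q")]))
# [('S', 'P'), ('S', 'Q')]
```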

Grammars for Natural Language
Statement ::= Sentence "."
Sentence ::= Simple | Belief
Simple ::= Person liking Person
liking ::= "likes" | "does" "not" "like"
Person ::= "Barack" | "Helga" | "John" | "Snoopy"
Belief ::= Person believing "that" Sentence but
believing ::= "believes" | "does" "not" "believe"
but ::= "" | "," "but" Sentence
Exercise: draw the parse tree for:
John does not believe that Barack believes that Helga likes Snoopy, but Snoopy believes that Helga likes Barack.
Such grammars can also be used to automatically generate essays.
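The "generate essays" remark can be sketched by expanding non-terminals with randomly chosen alternatives. The encoding below is my own: to avoid a clash between the non-terminal but and the terminal word "but", the non-terminal is renamed ButPart.

```python
import random

# The slide's sentence grammar as data: non-terminal -> list of alternatives,
# each alternative a list of symbols; anything not a key is a terminal word.
grammar = {
    "Sentence":  [["Simple"], ["Belief"]],
    "Simple":    [["Person", "liking", "Person"]],
    "liking":    [["likes"], ["does", "not", "like"]],
    "Person":    [["Barack"], ["Helga"], ["John"], ["Snoopy"]],
    "Belief":    [["Person", "believing", "that", "Sentence", "ButPart"]],
    "believing": [["believes"], ["does", "not", "believe"]],
    "ButPart":   [[], [",", "but", "Sentence"]],   # "" | "," "but" Sentence
}

def generate(symbol):
    if symbol not in grammar:                # terminal word
        return [symbol]
    alternative = random.choice(grammar[symbol])
    return [w for s in alternative for w in generate(s)]

print(" ".join(generate("Sentence")) + ".")
```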

While Language Syntax
This syntax is given by a context-free grammar:
program ::= statmt*
statmt ::= println ( stringConst , ident )
         | ident = expr
         | if ( expr ) statmt (else statmt)?
         | while ( expr ) statmt
         | { statmt* }
expr ::= intLiteral | ident
       | expr (&& | < | == | + | - | * | / | %) expr
       | ! expr | - expr

Compiler Phases (figure)
source code (characters) -> lexer -> words (tokens) -> parser -> trees
Example source: id3 = 0 while (id3 < 10) { println("", id3); id3 = id3 + 1 }
The lexer turns the stream of characters (a lazy List[Char]) into tokens such as id3, =, 0, while, (, id3, <, 10, ); the parser builds trees such as assignment and while nodes.

Recursive Descent Parsing - Manually
- a weak but useful parsing technique
- to make it work, we might need to transform the grammar

Recursive Descent is Decent
descent = a movement downward
decent = adequate, good enough
Recursive descent is a decent parsing technique:
- it can be easily implemented manually based on the grammar (which may require transformation)
- it is efficient (linear) in the size of the token sequence
Correspondence between grammar and code:
- concatenation -> ;
- alternative (|) -> if
- repetition (*) -> while
- nonterminal -> recursive procedure
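The correspondence above can be shown on a toy rule before the full While-language parser. This sketch (my own, not from the slides) parses sum ::= "a" ("+" "a")*: concatenation becomes sequencing, repetition becomes a while loop, and the non-terminal becomes a function.

```python
# Recursive-descent parser for the toy rule: sum ::= "a" ("+" "a")*
def parse_sum(tokens):
    pos = 0
    def expect(t):
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == t:
            pos += 1
        else:
            raise SyntaxError(f"expected {t!r} at position {pos}")
    expect("a")                                        # "a"
    while pos < len(tokens) and tokens[pos] == "+":    # ("+" "a")*
        expect("+")
        expect("a")
    return pos == len(tokens)                          # consumed all input?

print(parse_sum(["a", "+", "a", "+", "a"]))   # True
```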

A Rule of While Language Syntax
// Where things work very nicely for recursive descent!
statmt ::= println ( stringConst , ident )
         | ident = expr
         | if ( expr ) statmt (else statmt)?
         | while ( expr ) statmt
         | { statmt* }

Parser for the statmt (rule -> code)
def skip(t: Token) = if (lexer.token == t) lexer.next else error("Expected " + t)
// statmt ::=
def statmt = {
  // println ( stringConst , ident )
  if (lexer.token == Println) {
    lexer.next; skip(openParen); skip(stringConst); skip(comma)
    skip(identifier); skip(closedParen)
  // | ident = expr
  } else if (lexer.token == Ident) {
    lexer.next; skip(equality); expr
  // | if ( expr ) statmt (else statmt)?
  } else if (lexer.token == ifKeyword) {
    lexer.next; skip(openParen); expr; skip(closedParen); statmt
    if (lexer.token == elseKeyword) { lexer.next; statmt }

Continuing Parser for the Rule
  // | while ( expr ) statmt
  } else if (lexer.token == whileKeyword) {
    lexer.next; skip(openParen); expr; skip(closedParen); statmt
  // | { statmt* }
  } else if (lexer.token == openBrace) {
    lexer.next
    while (isFirstOfStatmt) { statmt }
    skip(closedBrace)
  } else {
    error("Unknown statement, found token " + lexer.token)
  }
}

How to Construct the if Conditions?
statmt ::= println ( stringConst , ident )
         | ident = expr
         | if ( expr ) statmt (else statmt)?
         | while ( expr ) statmt
         | { statmt* }
Look at what each alternative starts with to decide what to parse. Here we have terminals at the beginning of each alternative. More generally, we have a 'first' computation, as for regular expressions.
Consider a grammar G and a non-terminal N:
L_G(N) = { set of strings that N can derive }, e.g. L(statmt) is the set of all statements of the While language
first(N) = { a | aw ∈ L_G(N), a a terminal, w a string of terminals }
first(statmt) = { println, ident, if, while, { }
first(while ( expr ) statmt) = { while }
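Since each alternative here starts with a symbol that never derives the empty string, the first computation reduces to looking at the first symbol of each alternative. A sketch (the grammar encoding and the restriction to epsilon-free alternatives are my own simplifications):

```python
# first() for a grammar whose alternatives never derive the empty string.
# Non-terminals are dictionary keys; everything else is a terminal.
rules = {
    "statmt": [["println", "(", "stringConst", ",", "ident", ")"],
               ["ident", "=", "expr"],
               ["if", "(", "expr", ")", "statmt"],
               ["while", "(", "expr", ")", "statmt"],
               ["{", "statmt", "}"]],
}

def first(symbol):
    if symbol not in rules:          # terminal: first is the symbol itself
        return {symbol}
    out = set()
    for alternative in rules[symbol]:
        out |= first(alternative[0])  # no epsilon, so the first symbol decides
    return out

print(sorted(first("statmt")))
# ['ident', 'if', 'println', 'while', '{']
```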