Compiler Construction
Chapter 4: Syntax Analysis
Topics to cover:
  Context-Free Grammars: Concepts and Notation
  Writing and Rewriting a Grammar
  Syntax Error Handling and Recovery
Introduction
Why CFG?
  A CFG gives a precise syntactic specification of a programming language.
  An efficient parser can be generated automatically from the grammar.
  It enables automatic generation of translators.
  Language extension becomes easier.
The role of the parser
  Takes tokens from the scanner, parses them, and reports syntax errors.
  In a syntax-directed translator the parser does more than parse: it also drives type checking, semantic analysis, and IR generation.
Example of CFG
A C– program is made out of functions, a function out of declarations and blocks, a block out of statements, a statement out of expressions, etc.
[Grammar figure: productions for the program, function, type (void | int | float), and block ({ … }) levels, with ε alternatives; the nonterminal names did not survive extraction.]
Notational Conventions
The following symbols are terminals:
  Lowercase letters such as a, b, c
  Operators (+, -, etc.) and punctuation symbols (parentheses, commas, etc.)
  Digits such as 0, 1, 2
  Boldface strings such as id or if
Notational Conventions (cont.)
Nonterminals:
  Uppercase letters such as A, B, C
  The letter S, the start symbol
  Lowercase italic names such as expr or stmt
Grammar symbols (terminals or nonterminals):
  Uppercase letters late in the alphabet, such as X, Y, Z
Strings of terminals:
  Lowercase letters late in the alphabet, such as u, v, …, z
Strings of grammar symbols:
  Lowercase Greek letters, such as α, β, γ
Example
  expr → expr op expr
  expr → ( expr )
  expr → - expr
  expr → id
  op → +
  op → -
  op → *
  op → /
  op → ↑
Using the notational shorthand:
  E → E A E | (E) | -E | id
  A → + | - | * | / | ↑
Nonterminals: E and A
Start symbol: E
Derivation
Given a string αAβ: if A → γ is a production, then we can replace A by γ, written αAβ ⇒ αγβ.
  ⇒ means "derives in one step"
  ⇒+ means "derives in one or more steps"
  ⇒* means "derives in zero or more steps"
The language L(G) generated by G is the set of terminal strings w such that S ⇒+ w. The string w is called a sentence of G.
If S ⇒* α, where α may contain nonterminals, we say α is a sentential form of G.
Exercise
What is a sentence of the language L defined by the C++ grammar G?
  Answer: a C++ program.
Is the following string a sentence or a sentential form?
  int parse( ) {}
  Answer: a sentential form.
Derivation (cont.)
Consider the following grammar G0:
  E → E + E | E * E | (E) | -E | id
The string -(id + id) is a sentence of G0 because there is a derivation
  E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(id + E) ⇒ -(id + id)
Leftmost derivation: at each step only the leftmost nonterminal is replaced.
Rightmost derivation: at each step only the rightmost nonterminal is replaced.
Exercise: Is id-id a sentence of G0? Is -id+id a sentence?
  Answers: No (G0 has no binary minus). Yes (E ⇒ E + E ⇒ -E + E ⇒ -id + E ⇒ -id + id).
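The leftmost derivation of -(id + id) can be replayed mechanically. The following is a minimal Python sketch, not part of the original slides; the function name `leftmost_derivation` and the list-of-symbols representation of sentential forms are assumptions made for illustration.

```python
# Replay a leftmost derivation in G0: E -> E+E | E*E | (E) | -E | id
NONTERMINALS = {"E"}

def leftmost_derivation(start, productions):
    """Apply each (lhs, rhs) production to the leftmost nonterminal in turn.

    Returns every sentential form visited, beginning with [start].
    """
    form = [start]
    steps = [list(form)]
    for lhs, rhs in productions:
        # Locate the leftmost nonterminal in the current sentential form.
        index = next((i for i, sym in enumerate(form) if sym in NONTERMINALS), None)
        if index is None:
            raise ValueError("no nonterminal left to expand")
        if form[index] != lhs:
            raise ValueError(f"leftmost nonterminal is {form[index]}, not {lhs}")
        form = form[:index] + list(rhs) + form[index + 1:]
        steps.append(list(form))
    return steps

steps = leftmost_derivation("E", [
    ("E", ["-", "E"]),
    ("E", ["(", "E", ")"]),
    ("E", ["E", "+", "E"]),
    ("E", ["id"]),   # expands the left E
    ("E", ["id"]),   # expands the remaining E
])
for form in steps:
    print(" ".join(form))
```

Each printed line is one sentential form; the last one contains only terminals, so it is a sentence of G0.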
Parse Tree and Derivation
A parse tree can be viewed as a graphical representation of a derivation that ignores the replacement order.
  E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(id + E) ⇒ -(id + id)
Parse tree for -(id + id), shown by indentation (children listed under their parent):
  E
    -
    E
      (
      E
        E
          id
        +
        E
          id
      )
Interior node: nonterminal
Leaves: terminals
Children: the right-hand side of the production applied at the parent
CFG is more powerful than RE
Every RE can be described by a CFG.
  Example: (a|b)*abb
    A → aA | bA | abb
Converting an NFA into a CFG:
  For each state i of the NFA, create a nonterminal symbol Ai.
  If state i goes to state j on input a, add the production Ai → aAj.
  If state i goes to state j on ε, add the production Ai → Aj.
  If state i is an accepting state, add the production Ai → ε.
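The NFA-to-CFG rules above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `nfa_to_cfg`, the dictionary encoding of the NFA, and the use of the empty string as the ε label are assumptions, and the four-state NFA for (a|b)*abb is one common choice of automaton rather than something given on the slide.

```python
EPSILON = ""  # label used in this sketch for an epsilon move (an assumption)

def nfa_to_cfg(transitions, accepting):
    """transitions maps (state, label) to the set of target states.

    Returns a dict from nonterminal name to a list of right-hand sides,
    each right-hand side being a tuple of symbols (() stands for epsilon).
    """
    productions = {}

    def add(lhs, rhs):
        alternatives = productions.setdefault(lhs, [])
        if rhs not in alternatives:
            alternatives.append(rhs)

    for (state, label), targets in transitions.items():
        for target in sorted(targets):
            if label == EPSILON:
                add(f"A{state}", (f"A{target}",))        # Ai -> Aj
            else:
                add(f"A{state}", (label, f"A{target}"))  # Ai -> a Aj
    for state in sorted(accepting):
        add(f"A{state}", ())                              # Ai -> epsilon
    return productions

# An NFA for (a|b)*abb: state 0 loops on a and b, then reads a, b, b.
nfa = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
grammar = nfa_to_cfg(nfa, accepting={3})
for lhs, alternatives in grammar.items():
    print(lhs, "->", " | ".join(" ".join(rhs) or "ε" for rhs in alternatives))
```

The resulting grammar is right-linear: every production has at most one nonterminal, and it appears at the right end.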
Why do we need RE?
  REs are sufficiently powerful for lexical rules.
  REs are more concise and easier to understand than grammars.
  A more efficient lexical analyzer can be constructed from REs than from a CFG.
  Separating the lexical from the nonlexical part has advantages such as modularity and easier porting.
Exercise: What if we don't have token definitions?
Defects in CFG
Useless nonterminals
  S → A | B
  A → a
  B → Bb
  C → c
  (B can never derive a terminal string, and C is unreachable from S.)
Ambiguity
Top-down parsing issues
  Left recursion
  Left factoring
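Useless nonterminals can be found with two fixed-point passes: first collect the nonterminals that derive some terminal string, then collect those reachable from the start symbol. The sketch below is one way to do this under assumed conventions (the function name `useless_nonterminals` and the dict-of-tuples grammar encoding are not from the slides).

```python
def useless_nonterminals(grammar, start):
    """grammar maps each nonterminal to a list of right-hand sides (tuples).

    Any symbol that is not a key of grammar is treated as a terminal.
    """
    def usable(rhs, generating):
        # A right-hand side is usable if all of its nonterminals are generating.
        return all(sym not in grammar or sym in generating for sym in rhs)

    # Pass 1: nonterminals that can derive a string of terminals.
    generating = set()
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            if nt not in generating and any(usable(rhs, generating) for rhs in alternatives):
                generating.add(nt)
                changed = True

    # Pass 2: nonterminals reachable from the start symbol through usable productions.
    reachable = set()
    worklist = [start] if start in generating else []
    while worklist:
        nt = worklist.pop()
        if nt in reachable:
            continue
        reachable.add(nt)
        for rhs in grammar[nt]:
            if usable(rhs, generating):
                worklist.extend(sym for sym in rhs if sym in grammar)

    return set(grammar) - reachable

example = {
    "S": [("A",), ("B",)],
    "A": [("a",)],
    "B": [("B", "b")],
    "C": [("c",)],
}
print(sorted(useless_nonterminals(example, "S")))
```

On the slide's grammar this reports B (it never terminates) and C (it is never reached).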
Ambiguity
A grammar is ambiguous if it produces more than one parse tree for some sentence.
Example 1: A+B+C (is it (A+B)+C or A+(B+C)?)
  Improper production: expr → expr + expr | id
Example 2: A+B*C (is it (A+B)*C or A+(B*C)?)
  Improper production: expr → expr + expr | expr * expr
Example 3: if E1 then if E2 then S1 else S2 (which "then" does the "else" match?)
  Improper production: stmt → if expr then stmt
                             | if expr then stmt else stmt
Two parse trees of example 3
Tree 1 (the else matches the inner if):
  stmt
    if E1 then stmt
      if E2 then S1 else S2
Tree 2 (the else matches the outer if):
  stmt
    if E1 then stmt else S2
      if E2 then S1
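Ambiguity can be made concrete by counting parse trees. The sketch below counts the parse trees that the improper production expr → expr + expr | id assigns to a token string; the function name `count_parse_trees` and the span-based counting scheme are assumptions chosen for illustration, not a method from the slides.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count parse trees for tokens under expr -> expr + expr | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of ways tokens[i:j] can be derived from expr.
        total = 0
        if j - i == 1 and tokens[i] == "id":
            total += 1                      # expr -> id
        for k in range(i + 1, j - 1):
            if tokens[k] == "+":            # expr -> expr + expr, split at k
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))

print(count_parse_trees("id + id".split()))
print(count_parse_trees("id + id + id".split()))
```

A count greater than one for any sentence shows that the grammar is ambiguous; for id + id + id the two trees correspond to the two groupings in example 1.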
Eliminating Ambiguity
Operator associativity
  expr → expr + term | term
Operator precedence
  expr → expr + term | term
  term → term * factor | factor
Dangling else
  stmt → matched | unmatched
  matched → if expr then matched else matched
          | other
  unmatched → if expr then stmt
            | if expr then matched else unmatched
  (other stands for any statement that is not an if statement.)
Eliminating Left Recursion
Immediate left recursion
  Example: A → Aα | β
Transformation
  Given
    A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn
  where no βi begins with A, we replace the A-productions by
    A → β1A' | β2A' | … | βnA'
    A' → α1A' | α2A' | … | αmA' | ε
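The transformation for immediate left recursion is short enough to write down directly. This is a minimal sketch under assumed conventions: right-hand sides are tuples of symbols, the empty tuple stands for ε, and the new nonterminal is named by appending an apostrophe. The function name is hypothetical.

```python
def eliminate_immediate_left_recursion(nt, alternatives):
    """Rewrite A -> A a1 | ... | A am | b1 | ... | bn without left recursion.

    Returns a dict holding the new productions for A and, if needed, A'.
    """
    # The alpha parts: what follows A in each left-recursive alternative.
    alphas = [rhs[1:] for rhs in alternatives if rhs and rhs[0] == nt]
    # The beta parts: alternatives that do not begin with A.
    betas = [rhs for rhs in alternatives if not rhs or rhs[0] != nt]
    if not alphas:
        return {nt: list(alternatives)}      # nothing to do
    fresh = nt + "'"
    return {
        nt: [beta + (fresh,) for beta in betas],                # A  -> beta A'
        fresh: [alpha + (fresh,) for alpha in alphas] + [()],   # A' -> alpha A' | epsilon
    }

result = eliminate_immediate_left_recursion("E", [("E", "+", "T"), ("T",)])
for lhs, rhs_list in result.items():
    print(lhs, "->", " | ".join(" ".join(rhs) or "ε" for rhs in rhs_list))
```

For E → E + T | T this yields E → T E' and E' → + T E' | ε, which generate the same strings with right recursion instead.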
Indirect Left Recursion
Example:
  S → Aa | b
  A → Ac | Sd | ε
Transformation (assuming no cycles A ⇒+ A):
  1. Arrange the nonterminals in some order A1, A2, …, An.
  2. for i := 1 to n do begin
       for j := 1 to i-1 do begin
         replace each production Ai → Ajγ by Ai → δ1γ | δ2γ | … | δkγ,
         where Aj → δ1 | δ2 | … | δk are the current Aj-productions
       end
       eliminate the immediate left recursion among the Ai-productions
     end
In the above example,
  S → Aa | b
  A → Ac | Sd | ε
the production A → Sd is replaced by A → Aad | bd, giving
  A → Ac | Aad | bd | ε
Eliminating the immediate left recursion among the A-productions then yields
  S → Aa | b
  A → bdA' | A'
  A' → cA' | adA' | ε
Algorithm 4.1: Eliminating Left Recursion
This algorithm systematically eliminates left recursion, including indirect left recursion, from a grammar.
Precondition: the grammar has no cycles and no ε-productions.
  A cycle is a derivation A ⇒+ A. Excluding cycles avoids producing productions of the form A → A during nonterminal replacement.
  For example, with A → BA and B → Ab | ε, deriving ε from B turns A → BA into A → A, and a cycle shows up.
  ε-productions also make the algorithm more complex: A → BCD may become A → CD when B derives ε, so handling only the leftmost nonterminal is not sufficient.
Indirect Left Recursion
  A → Bb | a
  B → Cc | b
  C → Dd | c
  D → Aa | d
  A ⇒ Bb ⇒ Ccb ⇒ Ddcb ⇒ Aadcb
  C ⇒ Dd ⇒ Aad ⇒ Bbad ⇒ Ccbad
We need to expose the immediate left recursions and then eliminate them. Some ordering is needed. Suppose we replace A → Bb by A → Ccb and then start with B ⇒ Cc ⇒ Ddc ⇒ Aadc ⇒ Ccbadc: this would never expose the immediate left recursion in this example.
Algorithm 4.1
  for i := 1 to n do begin
    for j := 1 to i-1 do begin
      replace each production of the form Ai → Ajγ by the productions
        Ai → δ1γ | δ2γ | … | δkγ,
      where Aj → δ1 | δ2 | … | δk are the current Aj-productions
    end
    eliminate the immediate left recursion among the Ai-productions
  end
Key idea: for each nonterminal Ai, every production that begins with a lower-numbered nonterminal Aj (j < i) is rewritten so that it begins with a terminal or a higher-numbered nonterminal.
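The nested loops of Algorithm 4.1 translate almost line for line into Python. The sketch below is illustrative rather than definitive: the function name `eliminate_left_recursion`, the dict-of-tuples grammar encoding, the empty tuple for ε, and the apostrophe naming of new nonterminals are all assumptions. It is run on the grammar S → Aa | b, A → Ac | Sd | ε used earlier in this chapter.

```python
def eliminate_left_recursion(grammar, order):
    """grammar maps nonterminals to lists of right-hand sides (tuples).

    order lists the nonterminals as A1, A2, ..., An.
    """
    g = {nt: [tuple(rhs) for rhs in alternatives] for nt, alternatives in grammar.items()}
    for i, ai in enumerate(order):
        # Substitute every earlier nonterminal Aj that appears in leading position.
        for aj in order[:i]:
            expanded = []
            for rhs in g[ai]:
                if rhs and rhs[0] == aj:
                    # Ai -> Aj gamma becomes Ai -> delta gamma for each Aj -> delta.
                    expanded.extend(delta + rhs[1:] for delta in g[aj])
                else:
                    expanded.append(rhs)
            g[ai] = expanded
        # Eliminate immediate left recursion among the Ai-productions.
        alphas = [rhs[1:] for rhs in g[ai] if rhs and rhs[0] == ai]
        if alphas:
            betas = [rhs for rhs in g[ai] if not rhs or rhs[0] != ai]
            fresh = ai + "'"
            g[ai] = [beta + (fresh,) for beta in betas]
            g[fresh] = [alpha + (fresh,) for alpha in alphas] + [()]
    return g

source = {
    "S": [("A", "a"), ("b",)],
    "A": [("A", "c"), ("S", "d"), ()],
}
transformed = eliminate_left_recursion(source, ["S", "A"])
for lhs, rhs_list in transformed.items():
    print(lhs, "->", " | ".join(" ".join(rhs) or "ε" for rhs in rhs_list))
```

The output reproduces the hand-derived result: S → Aa | b, A → bdA' | A', A' → cA' | adA' | ε. The algorithm's precondition excludes ε-productions, and this sketch does not check for it; the A → ε alternative in this particular grammar happens not to cause trouble.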
[Figure: the nonterminals A1, A2, …, Ai-1, Ai, …, An listed in order; productions of Ai that begin with A2 … Ai-1 are rewritten to begin with some Ai+k.]
After replacement, there will be no backward references.
Left Factoring
Consider the following grammar:
  A → αβ1 | αβ2
It is not easy to determine whether to expand A to αβ1 or to αβ2.
A transformation called left factoring can be applied. The grammar becomes:
  A → αA'
  A' → β1 | β2
Exercise
  stmt → if expr then stmt
       | if expr then stmt else stmt
For the grammar form A → αβ1 | αβ2, what is α? β1? β2?
  α: if expr then stmt
  β1: ε
  β2: else stmt
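One round of left factoring can be sketched as follows. This is a hedged illustration: the function names `common_prefix` and `left_factor_once`, the tuple encoding of right-hand sides, and the apostrophe naming of new nonterminals are assumptions. It performs a single pass only; a complete left-factoring procedure would repeat until no two alternatives share a prefix.

```python
def common_prefix(sequences):
    """Longest prefix shared by every sequence."""
    prefix = []
    for symbols in zip(*sequences):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return tuple(prefix)

def left_factor_once(nt, alternatives):
    """Factor out the common prefix of alternatives that start with the same symbol."""
    # Group alternatives by their first symbol (None for the empty alternative).
    groups = {}
    for rhs in alternatives:
        groups.setdefault(rhs[0] if rhs else None, []).append(rhs)

    result = {nt: []}
    for first, group in groups.items():
        if first is None or len(group) == 1:
            result[nt].extend(group)             # nothing to factor
            continue
        alpha = common_prefix(group)
        fresh = nt + "'" * len(result)           # A', A'', ... one per factored group
        result[nt].append(alpha + (fresh,))      # A  -> alpha A'
        result[fresh] = [rhs[len(alpha):] for rhs in group]  # A' -> beta1 | beta2
    return result

factored = left_factor_once("stmt", [
    ("if", "expr", "then", "stmt"),
    ("if", "expr", "then", "stmt", "else", "stmt"),
])
for lhs, rhs_list in factored.items():
    print(lhs, "->", " | ".join(" ".join(rhs) or "ε" for rhs in rhs_list))
```

Applied to the if-then-else productions, the shared prefix if expr then stmt is factored out and the new nonterminal derives either ε or else stmt.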
Syntax Error Handling
Different types of errors
  Lexical
  Syntactic
  Semantic
  Logical
Error handling goals
  Report errors clearly and accurately.
  Recover quickly enough to detect subsequent errors.
  Be fast: do not significantly slow down the processing of correct programs.
Error Handling Strategies
  Don't quit after detecting the first error.
  Avoid introducing "spurious" errors.
  Inhibit error messages that stem from errors uncovered too close together.
  Simple error repair is sufficient, given the increasing emphasis on interactive computing and good programming environments.
Error Recovery Strategies
  Panic mode: delete input tokens until one of a designated set of synchronizing tokens is found.
  Phrase level: apply a local correction to repair punctuation errors.
  Error productions: augment the grammar with error productions.
  Global correction: find a globally least-cost correction to the input string; costly to implement.
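Panic-mode recovery is the simplest of these strategies to show in code. The sketch below parses a toy language in which every statement has the shape id = id ; and uses ; as the only synchronizing token. The toy language, the function name `parse_statements`, and the error message format are all assumptions made for illustration.

```python
def parse_statements(tokens, sync=frozenset({";"})):
    """Parse statements of the form: id = id ;

    Returns (number of well-formed statements, list of error messages).
    """
    expected = ["id", "=", "id", ";"]
    parsed, errors, pos = 0, [], 0
    while pos < len(tokens):
        ok = True
        for want in expected:
            if pos < len(tokens) and tokens[pos] == want:
                pos += 1
            else:
                found = tokens[pos] if pos < len(tokens) else "end of input"
                errors.append(f"token {pos}: expected {want!r}, found {found!r}")
                ok = False
                break
        if ok:
            parsed += 1
            continue
        # Panic mode: discard tokens until a synchronizing token is found,
        # then consume it and resume with the next statement.
        while pos < len(tokens) and tokens[pos] not in sync:
            pos += 1
        if pos < len(tokens):
            pos += 1
    return parsed, errors

count, messages = parse_statements("id = id ; id id ; id = id ;".split())
print(count)
for message in messages:
    print(message)
```

The parser reports one error for the malformed middle statement and still recognizes the statement that follows it, which is the point of not quitting after the first error.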