Bernd Fischer RW713: Compiler and Software Language Engineering
Top-Down Parsing
Top-down parsing searches for the (leftmost) derivation.
Context-free grammars derive individual words by recursively applying productions:
– start with ω = S
– pick an occurrence of a non-terminal A in ω (top-down parsing: always the leftmost one)
– pick a production A → α in P (but which one?)
– replace A by α in ω
– repeat until ω ∈ T*, then check whether ω = x
How can this be done efficiently?
Top-down parsing searches for the (leftmost) derivation using a stack.
Use a parse stack s to represent the derivation (x is the remaining input, x_i the current token, tos the top of the stack):
– initialize s = S
– if x = ε: if s = ε then accept else reject
– if tos ∈ T: if x_i = tos then pop and skip x_i, else reject
– if tos ∈ N: pick a production tos → α in P; pop; push(α) symbol by symbol, in reverse order
The parser stack can be explicit or implicit.
Pop-Quiz: Consider the following grammar:
  stmt → ( stmt tail | if ( expr ) stmt else stmt | print expr
  tail → ; stmt tail | )
  expr → ID
Show the evolution of the parse stack for the input
  (if(x)print y else (print x; print y))
How do you pick the right production?
– based on the already read input?
– based on the remaining input? All of it, or a subset of it?
– based on the stack? All of it, or a subset of it?
Answer: based on a prefix of the remaining input and on the top of the stack.
Table-driven top-down parser
Input: (if(x)print y else (print x; print y))
Grammar:
  stmt → ( stmt tail | if ( expr ) stmt else stmt | print expr
  tail → ; stmt tail | )
  expr → ID
[Figure: a generic table interpreter reads the current token from the input tape (previously read | current token | still unread), consults the parsing table and the top of the parse stack (tos), and outputs an AST or a syntax error.]
Parse stack in successive frames (bottom to top):
– tail stmt else expr print
– tail stmt else expr
– tail stmt else ID
– tail stmt else
– tail stmt
– stmt is expanded with stmt → ( stmt tail for the current token (
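The table interpreter can be sketched in a few lines of Java. This is a minimal illustration for the stmt/tail/expr grammar only, not the course's reference implementation: the class name TableDrivenParser, the hand-written TABLE, the tiny tokenize helper (which maps every non-keyword word to ID and silently skips characters it does not know), and the end marker "$" are all assumptions made for this example.

```java
import java.util.*;
import java.util.regex.*;

class TableDrivenParser {
    static final Set<String> NONTERMINALS = Set.of("stmt", "tail", "expr");

    // Parsing table (written by hand here): non-terminal -> (current token -> right-hand side).
    static final Map<String, Map<String, List<String>>> TABLE = Map.of(
        "stmt", Map.of(
            "(", List.of("(", "stmt", "tail"),
            "if", List.of("if", "(", "expr", ")", "stmt", "else", "stmt"),
            "print", List.of("print", "expr")),
        "tail", Map.of(
            ";", List.of(";", "stmt", "tail"),
            ")", List.of(")")),
        "expr", Map.of(
            "ID", List.of("ID")));

    // Hypothetical mini-lexer: every word that is not a keyword becomes ID.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("[A-Za-z]+|[();]").matcher(input);
        while (m.find()) {
            String t = m.group();
            boolean isWord = Character.isLetter(t.charAt(0));
            boolean isKeyword = t.equals("if") || t.equals("else") || t.equals("print");
            tokens.add(isWord && !isKeyword ? "ID" : t);
        }
        return tokens;
    }

    // Generic table interpreter: returns true iff the token list is derivable from stmt.
    static boolean parse(List<String> tokens) {
        Deque<String> stack = new ArrayDeque<>();
        stack.push("stmt");
        int pos = 0;
        while (!stack.isEmpty()) {
            String tos = stack.pop();
            String tok = pos < tokens.size() ? tokens.get(pos) : "$";  // "$" marks end of input
            if (NONTERMINALS.contains(tos)) {
                List<String> rhs = TABLE.get(tos).get(tok);
                if (rhs == null) return false;                         // empty table cell: syntax error
                for (int i = rhs.size() - 1; i >= 0; i--) stack.push(rhs.get(i));  // push in reverse order
            } else if (tos.equals(tok)) {
                pos++;                                                 // terminal matches: skip the token
            } else {
                return false;
            }
        }
        return pos == tokens.size();                                   // accept only if all input was read
    }

    public static void main(String[] args) {
        String input = "(if(x)print y else (print x; print y))";
        System.out.println(parse(tokenize(input)) ? "accept" : "reject");
    }
}
```

Only parse is generic; a different LL(1) grammar would need a different TABLE and set of non-terminals, which is the point of the table-driven design.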
LL-Parsing
LL(k) grammars allow a deterministic top-down parser.
Definition: A context-free grammar G = (N, T, P, S) is LL(k) for k ∈ Nat if for any two leftmost derivations
  S ⇒_l* uAα ⇒_l uβα ⇒_l* ux and
  S ⇒_l* uAα ⇒_l uγα ⇒_l* uy
the following holds: if prefix_k(x) = prefix_k(y), then β = γ.
(⇒_l is a leftmost derivation step; prefix_k(x) is the first k tokens of x.)
– the next k tokens determine the rule
– defined in terms of derivations, not rules!
A language L is LL(k) if there exists an LL(k) grammar G such that L(G) = L.
Not all context-free grammars are LL(k).
Consider the following left-recursive grammar G:
  S → E
  E → E + T | T
  T → T * F | F
  F → ( E ) | id
Problem: the parser needs to "look over E" to see whether a + follows (and the first alternative must be chosen).
Fix by (immediate) left-recursion elimination: replace each rule A → Aα | β (where β ≠ Aγ) by two new rules A → βA' and A' → αA' | ε (where A' is a fresh non-terminal).
– L(G) = L(G'), but the parse trees change
– generalizes to multiple alternatives
– an algorithm for the indirect case exists as well
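The rewriting rule for immediate left recursion can be tried out with a small Java sketch. It is an illustration under stated assumptions: the class name LeftRecursion, the representation of a grammar as a map from non-terminal to a list of alternatives, and the choice of A' as the fresh name (assumed not to be in use already) are all made up for this example. It handles the generalization to multiple alternatives but not indirect left recursion.

```java
import java.util.*;

class LeftRecursion {
    // A grammar maps each non-terminal to its alternatives; an alternative is a list of symbols.
    // The empty list stands for epsilon.
    static Map<String, List<List<String>>> eliminate(Map<String, List<List<String>>> g) {
        Map<String, List<List<String>>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<List<String>>> entry : g.entrySet()) {
            String a = entry.getKey();
            List<List<String>> alphas = new ArrayList<>();  // the alpha parts of A -> A alpha
            List<List<String>> betas = new ArrayList<>();   // the alternatives not starting with A
            for (List<String> alt : entry.getValue()) {
                if (!alt.isEmpty() && alt.get(0).equals(a)) alphas.add(alt.subList(1, alt.size()));
                else betas.add(alt);
            }
            if (alphas.isEmpty()) {                         // not left-recursive: keep as is
                result.put(a, entry.getValue());
                continue;
            }
            String fresh = a + "'";                         // assumed to be an unused name
            List<List<String>> newA = new ArrayList<>();
            for (List<String> beta : betas) {               // A -> beta A'
                List<String> alt = new ArrayList<>(beta);
                alt.add(fresh);
                newA.add(alt);
            }
            List<List<String>> newFresh = new ArrayList<>();
            for (List<String> alpha : alphas) {             // A' -> alpha A'
                List<String> alt = new ArrayList<>(alpha);
                alt.add(fresh);
                newFresh.add(alt);
            }
            newFresh.add(List.of());                        // A' -> epsilon
            result.put(a, newA);
            result.put(fresh, newFresh);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("S", List.of(List.of("E")));
        g.put("E", List.of(List.of("E", "+", "T"), List.of("T")));
        g.put("T", List.of(List.of("T", "*", "F"), List.of("F")));
        g.put("F", List.of(List.of("(", "E", ")"), List.of("id")));
        eliminate(g).forEach((lhs, alts) -> System.out.println(lhs + " -> " + alts));
    }
}
```

Each alternative is printed as a bracketed list, with [] standing for ε.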
Pop-Quiz: apply left-recursion elimination to G:
  S → E
  E → E + T | E - T | T
  T → T * F | F
  F → ( E ) | id
Result G':
  S → E
  E → T E'
  E' → + T E' | - T E' | ε
  T → F T'
  T' → * F T' | ε
  F → ( E ) | id
Pop-Quiz: use G and G' to construct parse trees for the input string "a-b-c".
Consider the following grammar G:
  stmt → if expr then stmt end ; | if expr then stmt else stmt end ;
Problem: the parser needs to look over an arbitrarily long common prefix to find the distinguishing token.
Fix by left factoring: replace each rule A → αβ | αγ by two new rules A → αA' and A' → β | γ (where A' is a fresh non-terminal).
For G this gives stmt → if expr then stmt stmt' and stmt' → end ; | else stmt end ;
Developing an LL(k) check on rules (k=1)
A grammar is trivially LL(1) if all alternatives of each non-terminal start with a different token:
  stmt → ( stmt tail | if ( expr ) stmt else stmt | print expr
  tail → ; stmt tail | )
  expr → ID
But what happens if an alternative starts with a non-terminal?
  stmt → ( stmt tail | if ( expr ) stmt else stmt | call
  tail → ; stmt tail | )
  expr → ID
  call → print expr | open ID | close ID
Need to check the first token of all possible right-hand sides for call: print, open, and close are disjoint from the first tokens of the other stmt alternatives, so this is ok.
Computing FIRST(α)
For a grammar without ε-productions, this is straightforward:
– FIRST(s) = { s } for s ∈ T
– FIRST(A) = ∪ FIRST(β) over all productions A → β
– FIRST(Aα) = FIRST(A), and FIRST(αA) = FIRST(α) for non-empty α
ε-productions make things more complicated. E.g., for
  S → A x
  A → z A | ε
FIRST(A) = { z } but FIRST(S) = { x, z }.
Another problem with ε
FIRST is no longer sufficient!
  FDef → Fun ( Arg ) : Type ;
  Arg → ID : Type ; Arg
  Arg → ε
  Type → integer | char | boolean
  Fun → function ID
Input: function ID ( ) : char ;
– Derivation: FDef ⇒ Fun ( Arg ) : Type ; ⇒ function ID ( Arg ) : Type ;
– Which production for Arg? The next token is ")", but FIRST(ID : Type ; Arg) = {ID} and FIRST(ε) = {}.
– Report an error in the input? No: Arg is nullable and ")" can follow it.
Ingredients for RD parsing:
– nullable(X): true if non-terminal X can produce ε
– FIRST(γ): set of terminal symbols that can begin any string produced by γ
– FOLLOW(X): set of terminal symbols t that can immediately follow X; i.e., there exists a derivation from the start symbol to a string αXtβ
Computing nullable(X)
  for all symbols X do nullable(X) := false;
  repeat
    change := false;
    for every production X → s_1 … s_k do
      if nullable(X) = false and s_1 … s_k are all nullable   (true if k = 0!)
      then { nullable(X) := true; change := true; }
  until change = false;
Example grammar:
  Z → d | X Y Z
  Y → ε | c
  X → Y | a
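The fixed-point loop translates almost literally into Java. The sketch below is one possible encoding, not the only one: the class name Nullable and the representation of a production as a string array (element 0 is the left-hand side, the rest is the right-hand side) are assumptions made for this example.

```java
import java.util.*;

class Nullable {
    // Example grammar from the slide. Each production is an array:
    // element 0 is the left-hand side, the remaining elements are the right-hand side.
    static final String[][] GRAMMAR = {
        {"Z", "d"}, {"Z", "X", "Y", "Z"},
        {"Y"},      {"Y", "c"},
        {"X", "Y"}, {"X", "a"},
    };

    // Returns the set of nullable symbols; everything not in the set is not nullable.
    static Set<String> nullable(String[][] productions) {
        Set<String> nullable = new TreeSet<>();
        boolean change;
        do {
            change = false;
            for (String[] p : productions) {
                if (nullable.contains(p[0])) continue;
                boolean allNullable = true;   // stays true for an empty right-hand side (k = 0)
                for (int i = 1; i < p.length; i++) {
                    if (!nullable.contains(p[i])) { allNullable = false; break; }
                }
                if (allNullable) { nullable.add(p[0]); change = true; }
            }
        } while (change);
        return nullable;
    }

    public static void main(String[] args) {
        System.out.println(nullable(GRAMMAR));   // prints [X, Y]
    }
}
```

Terminals are never added to the set, so they count as non-nullable without any special case.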
Extending nullable to strings
Very straightforward:
– nullable(s) = false for s ∈ T
– nullable(ε) = true
– nullable(s_1 s_2 … s_k) = nullable(s_1) ∧ nullable(s_2) ∧ … ∧ nullable(s_k)
Computing FIRST
  for all non-terminals X do FIRST(X) := {};
  for all terminals t do FIRST(t) := {t};
  repeat
    for every production X → s_1 … s_k with k ≥ 1 do {
      FIRST(X) := FIRST(X) ∪ FIRST(s_1);
      for i := 1 to k-1 do
        if nullable(s_i) then FIRST(X) := FIRST(X) ∪ FIRST(s_{i+1});
        else exit;
    }
  until no more changes;
Example grammar:
  Z → d | X Y Z
  Y → ε | c
  X → Y | a
Result: FIRST(Y) = {c}, FIRST(X) = {a,c}, FIRST(Z) = {a,c,d}
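A possible Java rendering of this iteration is sketched below. The class name FirstSets and the array encoding of productions are assumptions for this example, and the set of nullable symbols {X, Y} is passed in as a given rather than recomputed. Symbols that never occur as a left-hand side are treated as terminals.

```java
import java.util.*;

class FirstSets {
    // Example grammar: element 0 is the left-hand side, the rest is the right-hand side.
    static final String[][] GRAMMAR = {
        {"Z", "d"}, {"Z", "X", "Y", "Z"},
        {"Y"},      {"Y", "c"},
        {"X", "Y"}, {"X", "a"},
    };
    // Nullable symbols of the example grammar, taken as given here.
    static final Set<String> NULLABLE = Set.of("X", "Y");

    static Map<String, Set<String>> first(String[][] productions, Set<String> nullable) {
        Map<String, Set<String>> first = new TreeMap<>();
        for (String[] p : productions) first.put(p[0], new TreeSet<>());  // non-terminals: {}
        for (String[] p : productions)
            for (int i = 1; i < p.length; i++)                            // terminals: {t}
                first.putIfAbsent(p[i], new TreeSet<>(List.of(p[i])));
        boolean change;
        do {
            change = false;
            for (String[] p : productions) {
                Set<String> target = first.get(p[0]);
                // Add FIRST(s_1), then FIRST(s_2) if s_1 is nullable, and so on.
                for (int i = 1; i < p.length; i++) {
                    if (!p[i].equals(p[0])) change |= target.addAll(first.get(p[i]));
                    if (!nullable.contains(p[i])) break;
                }
            }
        } while (change);
        return first;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> f = first(GRAMMAR, NULLABLE);
        for (String x : List.of("X", "Y", "Z")) System.out.println(x + ": " + f.get(x));
    }
}
```

The inner loop is the slide's loop in a slightly different shape: it adds FIRST(s_i) and stops at the first s_i that is not nullable, which also covers ε-productions (k = 0) without a special case.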
Extending FIRST(γ) to strings
– FIRST(Xγ) = FIRST(X) if not nullable(X)
– FIRST(Xγ) = FIRST(X) ∪ FIRST(γ) if nullable(X)
Example:
  FIRST(XYZ) = {a,c} ∪ FIRST(YZ)
             = {a,c} ∪ {c} ∪ FIRST(Z)
             = {a,c} ∪ {c} ∪ {a,c,d}
             = {a,c,d}
Example grammar:
  Z → d | X Y Z
  Y → ε | c
  X → Y | a
FIRST: Y: {c}, X: {a,c}, Z: {a,c,d}
Computing FOLLOW
  for all non-terminals X do FOLLOW(X) := {};
  repeat
    for every non-terminal X do
      for each production of the form A → α X β do {
        FOLLOW(X) := FOLLOW(X) ∪ FIRST(β);
        if nullable(β) then FOLLOW(X) := FOLLOW(X) ∪ FOLLOW(A);
      }
  until no more changes;
[where α and β are possibly empty strings of terminals and non-terminals]
Example grammar:
  Z → d | X Y Z
  Y → ε | c
  X → Y | a
FIRST: Y: {c}, X: {a,c}, Z: {a,c,d}
Result: FOLLOW(Y) = {a,c,d}, FOLLOW(X) = {a,c,d}, FOLLOW(Z) = {}
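The FOLLOW iteration can be sketched in Java in the same style. Again this is only an illustration: the class name FollowSets and the array encoding of productions are assumptions, and the nullable set and the FIRST sets of the example grammar are passed in as given values instead of being recomputed. FIRST(β) and nullable(β) are evaluated symbol by symbol while walking over β.

```java
import java.util.*;

class FollowSets {
    // Example grammar: element 0 is the left-hand side, the rest is the right-hand side.
    static final String[][] GRAMMAR = {
        {"Z", "d"}, {"Z", "X", "Y", "Z"},
        {"Y"},      {"Y", "c"},
        {"X", "Y"}, {"X", "a"},
    };
    // Nullable symbols and FIRST sets of the example grammar, taken as given here.
    static final Set<String> NULLABLE = Set.of("X", "Y");
    static final Map<String, Set<String>> FIRST = Map.of(
        "X", Set.of("a", "c"), "Y", Set.of("c"), "Z", Set.of("a", "c", "d"),
        "a", Set.of("a"), "c", Set.of("c"), "d", Set.of("d"));

    static Map<String, Set<String>> follow(String[][] productions, Set<String> nullable,
                                           Map<String, Set<String>> first) {
        Map<String, Set<String>> follow = new TreeMap<>();
        for (String[] p : productions) follow.put(p[0], new TreeSet<>());
        boolean change;
        do {
            change = false;
            for (String[] p : productions) {              // production A -> alpha X beta, A = p[0]
                for (int i = 1; i < p.length; i++) {      // X = p[i]
                    if (!follow.containsKey(p[i])) continue;   // terminals have no FOLLOW set
                    Set<String> target = follow.get(p[i]);
                    boolean betaNullable = true;          // beta = p[i+1..], empty beta is nullable
                    for (int j = i + 1; j < p.length && betaNullable; j++) {
                        change |= target.addAll(first.get(p[j]));   // contributes to FIRST(beta)
                        betaNullable = nullable.contains(p[j]);
                    }
                    if (betaNullable && !p[i].equals(p[0]))
                        change |= target.addAll(follow.get(p[0]));  // add FOLLOW(A)
                }
            }
        } while (change);
        return follow;
    }

    public static void main(String[] args) {
        follow(GRAMMAR, NULLABLE, FIRST).forEach((x, set) -> System.out.println(x + ": " + set));
    }
}
```

Like the pseudocode, this sketch adds no end-of-input marker to the FOLLOW set of the start symbol, which is why FOLLOW(Z) stays empty.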
Constructing the parser
Make a table of applicable productions:
– Rows: non-terminals X
– Columns: terminal symbols c (= next input token)
– Production X → γ is applicable iff c ∈ FIRST(γ), or nullable(γ) and c ∈ FOLLOW(X)
If there is more than one applicable production for some pair (X, c), the grammar is not LL(1).
Example
Grammar:
  Z → d | X Y Z
  Y → ε | c
  X → Y | a
FIRST: Y: {c}, X: {a,c}, Z: {a,c,d}
FOLLOW: Y: {a,c,d}, X: {a,c,d}, Z: {}
Predictive parsing table:
       a               c               d
  X    X → a, X → Y    X → Y           X → Y
  Y    Y → ε           Y → ε, Y → c    Y → ε
  Z    Z → XYZ         Z → XYZ         Z → d, Z → XYZ
Conflicts (cells with more than one production) => not LL(1)
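The table construction and the LL(1) check can be sketched as follows. The class name LL1Table, the array encoding of productions, and the textual form of table entries ("X -> Y", with "eps" for ε) are assumptions made for this example; the nullable, FIRST, and FOLLOW values of the example grammar are hard-coded as givens.

```java
import java.util.*;

class LL1Table {
    static final String[][] GRAMMAR = {
        {"Z", "d"}, {"Z", "X", "Y", "Z"},
        {"Y"},      {"Y", "c"},
        {"X", "Y"}, {"X", "a"},
    };
    // Values for the example grammar, taken as given here.
    static final Set<String> NULLABLE = Set.of("X", "Y");
    static final Map<String, Set<String>> FIRST = Map.of(
        "X", Set.of("a", "c"), "Y", Set.of("c"), "Z", Set.of("a", "c", "d"),
        "a", Set.of("a"), "c", Set.of("c"), "d", Set.of("d"));
    static final Map<String, Set<String>> FOLLOW = Map.of(
        "X", Set.of("a", "c", "d"), "Y", Set.of("a", "c", "d"), "Z", Set.of());

    // FIRST of the right-hand side of p, using the extension of FIRST to strings.
    static Set<String> firstOfRhs(String[] p) {
        Set<String> result = new TreeSet<>();
        for (int i = 1; i < p.length; i++) {
            result.addAll(FIRST.get(p[i]));
            if (!NULLABLE.contains(p[i])) break;
        }
        return result;
    }

    static boolean nullableRhs(String[] p) {
        for (int i = 1; i < p.length; i++) if (!NULLABLE.contains(p[i])) return false;
        return true;
    }

    // table.get(X).get(c) lists all productions applicable for non-terminal X and token c.
    static Map<String, Map<String, List<String>>> build() {
        Map<String, Map<String, List<String>>> table = new TreeMap<>();
        for (String[] p : GRAMMAR) {
            Set<String> lookaheads = firstOfRhs(p);
            if (nullableRhs(p)) lookaheads.addAll(FOLLOW.get(p[0]));
            String rhs = p.length == 1 ? "eps" : String.join(" ", Arrays.copyOfRange(p, 1, p.length));
            for (String c : lookaheads)
                table.computeIfAbsent(p[0], k -> new TreeMap<>())
                     .computeIfAbsent(c, k -> new ArrayList<>())
                     .add(p[0] + " -> " + rhs);
        }
        return table;
    }

    // The grammar is LL(1) iff no cell holds more than one production.
    static boolean isLL1(Map<String, Map<String, List<String>>> table) {
        for (Map<String, List<String>> row : table.values())
            for (List<String> cell : row.values())
                if (cell.size() > 1) return false;
        return true;
    }

    public static void main(String[] args) {
        Map<String, Map<String, List<String>>> table = build();
        table.forEach((x, row) -> System.out.println(x + ": " + row));
        System.out.println("LL(1): " + isLL1(table));
    }
}
```

For the example grammar the cells (X, a), (Y, c), and (Z, d) each receive two productions, so isLL1 returns false.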
Panic Mode Error Recovery
Recursive Descent Parsing
Recursive descent parsing implements the productions directly as methods.
For every non-terminal:
– one procedure, with a switch on the next token
– one case per production
Grammar:
  S → if E then S else S
  S → begin S L
  S → print E
  L → end
  L → ; S L
  E → num = num

  void S() {
      switch (tok) {
          case IF:    eat(IF); E(); eat(THEN); S(); eat(ELSE); S(); break;
          case BEGIN: eat(BEGIN); S(); L(); break;
          case PRINT: eat(PRINT); E(); break;
          default:    error();
      }
  }

  void eat(int ex) {
      // On a match, advance to the next token; getToken() stands for an assumed lexer call.
      if (tok == ex) tok = getToken();
      else { System.out.print("expected: "); System.out.print(ex); ... }
  }
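A complete recognizer for this grammar fits in one small class. The following is a sketch under assumptions, not the lecture's code: the class name RecursiveDescent, the Tok enum, the pre-tokenized input list, the accepts helper, and error handling by exception (instead of panic-mode recovery) are all choices made for this example.

```java
import java.util.*;

class RecursiveDescent {
    enum Tok { IF, THEN, ELSE, BEGIN, END, PRINT, SEMI, NUM, EQ, EOF }

    private final List<Tok> tokens;
    private int pos = 0;
    private Tok tok;                       // current token

    RecursiveDescent(List<Tok> tokens) { this.tokens = tokens; advance(); }

    private void advance() { tok = pos < tokens.size() ? tokens.get(pos++) : Tok.EOF; }

    private void eat(Tok expected) {
        if (tok == expected) advance();
        else throw new IllegalStateException("expected " + expected + " but found " + tok);
    }

    // S -> if E then S else S | begin S L | print E
    void S() {
        switch (tok) {
            case IF:    eat(Tok.IF); E(); eat(Tok.THEN); S(); eat(Tok.ELSE); S(); break;
            case BEGIN: eat(Tok.BEGIN); S(); L(); break;
            case PRINT: eat(Tok.PRINT); E(); break;
            default: throw new IllegalStateException("unexpected " + tok + " at start of S");
        }
    }

    // L -> end | ; S L
    void L() {
        switch (tok) {
            case END:  eat(Tok.END); break;
            case SEMI: eat(Tok.SEMI); S(); L(); break;
            default: throw new IllegalStateException("unexpected " + tok + " at start of L");
        }
    }

    // E -> num = num
    void E() { eat(Tok.NUM); eat(Tok.EQ); eat(Tok.NUM); }

    // Returns true iff the whole token list is a statement S.
    static boolean accepts(List<Tok> tokens) {
        try {
            RecursiveDescent parser = new RecursiveDescent(tokens);
            parser.S();
            parser.eat(Tok.EOF);           // no trailing tokens allowed
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // begin print num = num ; print num = num end
        List<Tok> input = List.of(Tok.BEGIN, Tok.PRINT, Tok.NUM, Tok.EQ, Tok.NUM, Tok.SEMI,
                                  Tok.PRINT, Tok.NUM, Tok.EQ, Tok.NUM, Tok.END);
        System.out.println(accepts(input) ? "accept" : "reject");
    }
}
```

Here the parse stack is implicit: it is the Java call stack of the S, L, and E methods.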
Further Reading Terence Parr, Kathleen Fisher: LL(*): the foundation of the ANTLR parser generator. PLDI 2011: ⇒ describes theory behind ANTLR