Bernd Fischer RW713: Compiler and Software Language Engineering.

Bernd Fischer bfischer@cs.sun.ac.za RW713: Compiler and Software Language Engineering

Top-Down Parsing

Top-down parsing searches for the (leftmost) derivation.
Context-free grammars derive individual words by recursively applying productions:
–start with ω = S
–pick an occurrence of a non-terminal A in ω (top-down parsing always picks the leftmost one)
–pick a production A → α in P (but which one?)
–replace A by α in ω
–repeat until ω ∈ T*
Parsing a word x means finding such a derivation that ends in ω = x. The open question is how to do this efficiently.

Top-down parsing searches for the (leftmost) derivation using a stack.
Use a parse stack s to represent the derivation (tos is the top of the stack, x_i the current input token):
–initialize s = S
–if x = ε (no input left): if s = ε then accept else reject
–if tos ∈ T: if x_i = tos then pop and skip x_i, else reject
–if tos ∈ N: pick a production tos → α in P; pop; push(α), symbol by symbol, in reverse order
The parser stack can be explicit or implicit.

Top-down parsing searches for the (leftmost) derivation using a stack.
Pop-Quiz: Consider the following grammar
  stmt → ( stmt tail | if ( expr ) stmt else stmt | print expr
  tail → ; stmt tail | )
  expr → ID
Show the evolution of the parse stack for the input
  (if(x)print y else (print x; print y))

Top-down parsing searches for the (leftmost) derivation using a stack.
How do you pick the right production?
–based on already read input? Not needed.
–based on remaining input? Only on a prefix of the remaining input.
–based on stack? Only on the top of the stack.
So neither all of the available information nor an arbitrary subset is used: the top of the stack together with a prefix of the remaining input decides.

Table-driven top-down parser
Input: (if(x)print y else (print x; print y))
Grammar:
  stmt → ( stmt tail | if ( expr ) stmt else stmt | print expr
  tail → ; stmt tail | )
  expr → ID
Components:
–generic table interpreter, driven by the parsing table
–parse stack, with tos as its top element
–input tape, split into previously read tokens, the current token and the still unread tokens
–output (AST / syntax error)
Parse stack snapshots from the animation (bottom of the stack on the left, tos on the right):
–tail stmt else expr print
–tail stmt else expr
–tail stmt else ID
–tail stmt else
–tail stmt
–tail tail stmt ( (after expanding stmt → ( stmt tail)
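The table interpreter described above can be sketched in Java as follows. This is a minimal recogniser under stated assumptions, not the course's implementation: the class name TableParser, the string encoding of the table, the "$" end marker and the rule that every lexeme other than a keyword or punctuation mark counts as ID are all illustrative choices. Tokens must be separated by whitespace.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Sketch of a table-driven LL(1) recogniser for the stmt/tail/expr grammar. */
final class TableParser {
    // Parsing table: key "non-terminal token" -> right-hand side (space-separated symbols).
    private static final Map<String, String> TABLE = new HashMap<String, String>();
    static {
        TABLE.put("stmt (", "( stmt tail");
        TABLE.put("stmt if", "if ( expr ) stmt else stmt");
        TABLE.put("stmt print", "print expr");
        TABLE.put("tail ;", "; stmt tail");
        TABLE.put("tail )", ")");
        TABLE.put("expr ID", "ID");
    }

    private static boolean isNonTerminal(String s) {
        return s.equals("stmt") || s.equals("tail") || s.equals("expr");
    }

    /** Maps a lexeme to its token class (assumption: everything else is an ID). */
    private static String classify(String lexeme) {
        switch (lexeme) {
            case "(": case ")": case ";": case "if": case "else": case "print":
                return lexeme;
            default:
                return "ID";
        }
    }

    /** Returns true iff the whitespace-separated input is derivable from stmt. */
    static boolean accepts(String input) {
        String[] tokens = input.trim().split("\\s+");
        Deque<String> stack = new ArrayDeque<String>();
        stack.push("stmt");
        int pos = 0;
        while (!stack.isEmpty()) {
            String tos = stack.pop();
            String tok = pos < tokens.length ? classify(tokens[pos]) : "$";
            if (isNonTerminal(tos)) {
                String rhs = TABLE.get(tos + " " + tok);
                if (rhs == null) {
                    return false;                 // empty table cell: syntax error
                }
                String[] symbols = rhs.split(" ");
                for (int i = symbols.length - 1; i >= 0; i--) {
                    stack.push(symbols[i]);       // push symbol by symbol, in reverse order
                }
            } else if (tos.equals(tok)) {
                pos++;                            // terminal matches: skip the token
            } else {
                return false;                     // terminal mismatch: syntax error
            }
        }
        return pos == tokens.length;              // accept only if the input is used up as well
    }
}
```

Calling TableParser.accepts("( if ( x ) print y else ( print x ; print y ) )") walks through the same stack configurations as the slide animation.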

LL-Parsing

LL(k) grammars allow a deterministic top-down parser.
Definition: A context-free grammar G = (N, T, P, S) is LL(k) for k ∈ Nat if for any two leftmost derivations
  S ⇒_l* uAα ⇒_l uβα ⇒_l* ux and
  S ⇒_l* uAα ⇒_l uγα ⇒_l* uy
the following holds: if prefix_k(x) = prefix_k(y), then β = γ.
In other words, the next k tokens determine the rule. Note that the property is defined in terms of derivations, not rules!
A language L is LL(k) if there exists an LL(k) grammar G such that L(G) = L.

Not all context-free grammars are LL(k).
Consider the following left-recursive grammar G:
  S → E
  E → E + T | T
  T → T * F | F
  F → ( E ) | id
Problem: the parser needs to "look over E" to see whether a + follows (in which case the first alternative must be chosen).
Fix by (immediate) left-recursion elimination: replace each rule A → Aα | β (where β ≠ Aγ) by two new rules A → βA' and A' → αA' | ε (where A' is a fresh non-terminal).
–L(G) = L(G'), but the transformation changes the parse trees
–generalizes to multiple alternatives
–an algorithm for the indirect case exists as well
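The rewriting step for immediate left recursion, generalized to several alternatives, can be sketched as below. The class name LeftRecursion, the method eliminate and the space-separated string encoding of alternatives are assumptions made for illustration. The sketch also assumes that the fresh name A' is not already in use and that at least one alternative is not left-recursive.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: elimination of immediate left recursion for one non-terminal. */
final class LeftRecursion {
    static final String EPSILON = "\u03b5";  // Greek small letter epsilon

    /**
     * Rewrites A -> A a1 | ... | A am | b1 | ... | bn into
     * A -> b1 A' | ... | bn A' and A' -> a1 A' | ... | am A' | epsilon.
     * Each alternative is a non-empty string of space-separated symbols.
     */
    static String eliminate(String a, String... alternatives) {
        String fresh = a + "'";                        // assumed to be an unused name
        List<String> alphas = new ArrayList<String>(); // tails of the left-recursive alternatives
        List<String> betas = new ArrayList<String>();  // the remaining alternatives
        for (String alt : alternatives) {
            if (alt.startsWith(a + " ")) {
                alphas.add(alt.substring(a.length() + 1) + " " + fresh);
            } else {
                betas.add(alt + " " + fresh);
            }
        }
        if (alphas.isEmpty()) {
            return a + " -> " + String.join(" | ", alternatives);  // not left-recursive
        }
        alphas.add(EPSILON);
        return a + " -> " + String.join(" | ", betas) + "\n"
             + fresh + " -> " + String.join(" | ", alphas);
    }
}
```

For example, LeftRecursion.eliminate("E", "E + T", "T") yields the two rules E -> T E' and E' -> + T E' | ε.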

Not all context-free grammars are LL(k).
Pop-Quiz: apply left-recursion elimination to G:
  S → E
  E → E + T | E - T | T
  T → T * F | F
  F → ( E ) | id
Result G':
  S → E
  E → T E'
  E' → + T E' | - T E' | ε
  T → F T'
  T' → * F T' | ε
  F → ( E ) | id
Pop-Quiz: use G and G' to construct parse trees for the input string "a-b-c".

Not all context-free grammars are LL(k).
Consider the following grammar G:
  stmt → if expr then stmt end ;
       | if expr then stmt else stmt end ;
Problem: the parser needs to look over an arbitrarily long common prefix to find the distinguishing token.
Fix by left factoring: replace each rule A → αβ | αγ by two new rules A → αA' and A' → β | γ (where A' is a fresh non-terminal).
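A single left-factoring step for two alternatives might look as follows. The names LeftFactoring, factor and rest are hypothetical, alternatives are encoded as space-separated symbol strings, and the fresh name A' is assumed to be unused. The sketch factors out the longest common prefix, measured in whole symbols.

```java
import java.util.Arrays;

/** Sketch: one left-factoring step A -> alpha beta | alpha gamma. */
final class LeftFactoring {
    static final String EPSILON = "\u03b5";  // Greek small letter epsilon

    static String factor(String a, String alt1, String alt2) {
        String[] s1 = alt1.split(" ");
        String[] s2 = alt2.split(" ");
        int n = 0;  // length of the longest common prefix, counted in symbols
        while (n < s1.length && n < s2.length && s1[n].equals(s2[n])) {
            n++;
        }
        if (n == 0) {
            return a + " -> " + alt1 + " | " + alt2;  // no common prefix: nothing to factor
        }
        String fresh = a + "'";  // assumed not to clash with an existing non-terminal
        String prefix = String.join(" ", Arrays.copyOfRange(s1, 0, n));
        return a + " -> " + prefix + " " + fresh + "\n"
             + fresh + " -> " + rest(s1, n) + " | " + rest(s2, n);
    }

    /** The symbols after the common prefix, or epsilon if none remain. */
    private static String rest(String[] symbols, int from) {
        if (from == symbols.length) {
            return EPSILON;
        }
        return String.join(" ", Arrays.copyOfRange(symbols, from, symbols.length));
    }
}
```

Applied to the two stmt alternatives, the common prefix is "if expr then stmt", so the new non-terminal stmt' only has to choose between "end ;" and "else stmt end ;", which differ in their first token.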


Developing an LL(k) check on rules (k=1)
A grammar is trivially LL(1) if all alternatives start with a different token:
  stmt → ( stmt tail | if ( expr ) stmt else stmt | print expr
  tail → ; stmt tail | )
  expr → ID

Developing an LL(k) check on rules (k=1)
But what happens if there are non-terminals?
  stmt → ( stmt tail | if ( expr ) stmt else stmt | call
  tail → ; stmt tail | )
  expr → ID
  call → print expr | open ID | close ID
We need to check the first token of all possible right-hand sides for call. Here these tokens (print, open, close) are disjoint from those of the other alternatives of stmt, so the grammar is ok.

Computing FIRST(α)
For a grammar without ε-productions, this is straightforward:
–FIRST(s) = { s } for s ∈ T
–FIRST(A) = ∪ FIRST(β) for all A → β
–FIRST(Aα) = FIRST(A), and FIRST(αA) = FIRST(α) for non-empty α
ε-productions make things more complicated. E.g., for
  S → A x
  A → z A | ε
we have FIRST(A) = { z } but FIRST(S) = { x, z }.

Another problem with ε
FIRST is no longer sufficient!
Grammar:
  FDef → Fun ( Arg ) : Type ;
  Arg → ID : Type ; Arg
  Arg → ε
  Type → integer
  Type → char
  Type → boolean
  Fun → function ID
Input: function ID ( ) : char ;
–Derivation: FDef ⇒ Fun ( Arg ) : Type ; ⇒ function ID ( Arg ) : Type ;
–Which production for Arg? FIRST(ID : Type ; Arg) = {ID} and FIRST(ε) = {}, but the next token is ")". Report an error in the input?
–No: Arg is nullable and ")" can follow it.

Ingredients for RD parsing:
nullable(X):
–true if non-terminal X can produce ε
FIRST(α):
–set of terminal symbols that can begin any string produced by α
FOLLOW(X):
–set of terminal symbols t that can immediately follow X; i.e., there exists a derivation from the start symbol to a string αXtβ

Computing nullable(X)
  for all symbols X do nullable(X) := false;
  repeat
    change := false;
    for every production X → s_1 … s_k do
      if nullable(X) = false and s_1 … s_k are all nullable then
        { nullable(X) := true; change := true; }
  until change = false;
Note: "s_1 … s_k are all nullable" is true if k = 0!
Example grammar:
  Z → d        Y → ε      X → Y
  Z → X Y Z    Y → c      X → a
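The fixpoint iteration for nullable can be sketched in Java as follows. The encoding of productions as string arrays and the names Nullable, GRAMMAR and compute are assumptions made for illustration. The grammar is the slide's example; instead of a boolean per symbol, the sketch keeps the set of symbols currently known to be nullable.

```java
import java.util.Set;
import java.util.TreeSet;

/** Sketch: the nullable fixpoint iteration over an array-encoded grammar. */
final class Nullable {
    // Each production is {lhs, s1, ..., sk}; k = 0 encodes an epsilon production.
    static final String[][] GRAMMAR = {
        {"Z", "d"}, {"Z", "X", "Y", "Z"}, {"Y"}, {"Y", "c"}, {"X", "Y"}, {"X", "a"}
    };

    /** Returns the set of nullable non-terminals. */
    static Set<String> compute(String[][] productions) {
        Set<String> nullable = new TreeSet<String>();  // initially nothing is nullable
        boolean change;
        do {
            change = false;
            for (String[] p : productions) {
                boolean allNullable = true;            // stays true if k = 0
                for (int i = 1; i < p.length; i++) {
                    if (!nullable.contains(p[i])) {
                        allNullable = false;
                        break;
                    }
                }
                // add() returns true only if the left-hand side was not yet in the set
                if (allNullable && nullable.add(p[0])) {
                    change = true;
                }
            }
        } while (change);
        return nullable;
    }
}
```

Terminals never appear on a left-hand side, so they are never added to the set. For the example grammar the result contains X and Y but not Z.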

Extending nullable to strings
Very straightforward:
–nullable(s) = false for s ∈ T
–nullable(ε) = true
–nullable(s_1 s_2 … s_k) = nullable(s_1) ∧ nullable(s_2) ∧ … ∧ nullable(s_k)

Computing FIRST
  for all non-terminals X do FIRST(X) := {};
  for all terminals t do FIRST(t) := {t};
  repeat
    for every production X → s_1 … s_k with k ≥ 1 do {
      FIRST(X) := FIRST(X) ∪ FIRST(s_1);
      for i := 1 to k-1 do
        if nullable(s_i) then FIRST(X) := FIRST(X) ∪ FIRST(s_i+1);
        else exit;
    }
  until no more changes;
Example grammar:
  Z → d        Y → ε      X → Y
  Z → X Y Z    Y → c      X → a
Result: Y: {c}  X: {a,c}  Z: {a,c,d}
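A possible Java rendering of the FIRST computation is sketched below. The names FirstSets, nullable and first as well as the array encoding of the grammar are illustrative assumptions. So that the block stands on its own, it recomputes the nullable set with the same kind of iteration before computing FIRST.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Sketch: FIRST sets by fixpoint iteration. */
final class FirstSets {
    // Each production is {lhs, s1, ..., sk}; k = 0 encodes an epsilon production.
    static final String[][] GRAMMAR = {
        {"Z", "d"}, {"Z", "X", "Y", "Z"}, {"Y"}, {"Y", "c"}, {"X", "Y"}, {"X", "a"}
    };

    /** Nullable non-terminals, needed to decide how far to look into a right-hand side. */
    static Set<String> nullable(String[][] g) {
        Set<String> nullable = new TreeSet<String>();
        boolean change;
        do {
            change = false;
            for (String[] p : g) {
                boolean all = true;
                for (int i = 1; i < p.length; i++) {
                    all &= nullable.contains(p[i]);
                }
                if (all && nullable.add(p[0])) {
                    change = true;
                }
            }
        } while (change);
        return nullable;
    }

    /** Returns FIRST for every symbol of the grammar (terminals map to themselves). */
    static Map<String, Set<String>> first(String[][] g) {
        Set<String> nullable = nullable(g);
        Map<String, Set<String>> first = new TreeMap<String, Set<String>>();
        for (String[] p : g) {                      // FIRST(X) := {} for all non-terminals
            if (!first.containsKey(p[0])) {
                first.put(p[0], new TreeSet<String>());
            }
        }
        for (String[] p : g) {                      // FIRST(t) := {t} for all terminals
            for (int i = 1; i < p.length; i++) {
                if (!first.containsKey(p[i])) {
                    Set<String> single = new TreeSet<String>();
                    single.add(p[i]);
                    first.put(p[i], single);
                }
            }
        }
        boolean change;
        do {
            change = false;
            for (String[] p : g) {
                Set<String> target = first.get(p[0]);
                for (int i = 1; i < p.length; i++) {
                    // copy before adding: p[i] may be the left-hand side itself
                    change |= target.addAll(new TreeSet<String>(first.get(p[i])));
                    if (!nullable.contains(p[i])) {
                        break;                      // later symbols cannot start the string
                    }
                }
            }
        } while (change);
        return first;
    }
}
```

The inner loop is a slightly restructured form of the slide's "for i := 1 to k-1" loop: it adds FIRST(s_i) and stops after the first symbol that is not nullable, which also covers the case k = 0 without special treatment.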

Extending FIRST(α) to strings
–FIRST(Xα) = FIRST(X) if not nullable(X)
–FIRST(Xα) = FIRST(X) ∪ FIRST(α) if nullable(X)
Example (with Y: {c}, X: {a,c}, Z: {a,c,d} for the grammar Z → d | X Y Z, Y → ε | c, X → Y | a):
  FIRST(XYZ) = {a,c} ∪ FIRST(YZ)
             = {a,c} ∪ {c} ∪ FIRST(Z)
             = {a,c} ∪ {c} ∪ {a,c,d}
             = {a,c,d}

Computing FOLLOW
  for all non-terminals X do FOLLOW(X) := {};
  repeat
    for every non-terminal X do
      for each production of the form A → αXβ do {
        FOLLOW(X) := FOLLOW(X) ∪ FIRST(β);
        if nullable(β) then FOLLOW(X) := FOLLOW(X) ∪ FOLLOW(A);
      }
  until no more changes;
[where α and β are possibly empty strings of terminals and non-terminals]
Example grammar:
  Z → d        Y → ε      X → Y
  Z → X Y Z    Y → c      X → a
FIRST: Y: {c}  X: {a,c}  Z: {a,c,d}
FOLLOW: Y: {a,c,d}  X: {a,c,d}  Z: {}
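The FOLLOW iteration might be coded as in the following sketch. To keep the block short, the nullable set and the FIRST sets of the example grammar are hard-coded from the values given on the slides rather than computed; this shortcut, the names FollowSets and follow, and the array encoding are assumptions of the sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Sketch: FOLLOW sets by fixpoint iteration for the example grammar. */
final class FollowSets {
    // Each production is {lhs, s1, ..., sk}; k = 0 encodes an epsilon production.
    static final String[][] GRAMMAR = {
        {"Z", "d"}, {"Z", "X", "Y", "Z"}, {"Y"}, {"Y", "c"}, {"X", "Y"}, {"X", "a"}
    };
    // Assumption: nullable and FIRST are taken from the slides instead of being computed.
    static final Set<String> NULLABLE = new HashSet<String>(Arrays.asList("X", "Y"));
    static final Map<String, Set<String>> FIRST = new TreeMap<String, Set<String>>();
    static {
        FIRST.put("X", new TreeSet<String>(Arrays.asList("a", "c")));
        FIRST.put("Y", new TreeSet<String>(Arrays.asList("c")));
        FIRST.put("Z", new TreeSet<String>(Arrays.asList("a", "c", "d")));
    }

    static Map<String, Set<String>> follow() {
        Map<String, Set<String>> follow = new TreeMap<String, Set<String>>();
        for (String nt : FIRST.keySet()) {
            follow.put(nt, new TreeSet<String>());       // FOLLOW(X) := {}
        }
        boolean change;
        do {
            change = false;
            for (String[] p : GRAMMAR) {                 // production A -> s1 ... sk
                for (int i = 1; i < p.length; i++) {
                    if (!follow.containsKey(p[i])) {
                        continue;                        // terminals have no FOLLOW set
                    }
                    Set<String> target = follow.get(p[i]);
                    boolean betaNullable = true;         // beta = s(i+1) ... sk
                    for (int j = i + 1; j < p.length && betaNullable; j++) {
                        Set<String> f = FIRST.get(p[j]);
                        if (f == null) {
                            change |= target.add(p[j]);  // terminal: FIRST(t) = {t}
                        } else {
                            change |= target.addAll(f);
                        }
                        betaNullable = NULLABLE.contains(p[j]);
                    }
                    if (betaNullable) {                  // everything after X can vanish
                        change |= target.addAll(new TreeSet<String>(follow.get(p[0])));
                    }
                }
            }
        } while (change);
        return follow;
    }
}
```

The inner loop over j accumulates FIRST(β) symbol by symbol and at the same time determines whether β is nullable, so both cases of the pseudocode are handled in one pass.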

Constructing the parser
Make a table of applicable productions:
–Rows: non-terminals X
–Columns: terminal symbols c (= next input token)
–Production X → α is applicable iff c ∈ FIRST(α), or nullable(α) and c ∈ FOLLOW(X)
If there is more than one applicable production for some pair (X, c), the grammar is not LL(1).

Example
Grammar:
  Z → d        Y → ε      X → Y
  Z → X Y Z    Y → c      X → a
FIRST: Y: {c}  X: {a,c}  Z: {a,c,d}
FOLLOW: Y: {a,c,d}  X: {a,c,d}  Z: {}
Predictive parsing table:
       a               c               d
  X    X → a, X → Y    X → Y           X → Y
  Y    Y → ε           Y → ε, Y → c    Y → ε
  Z    Z → X Y Z       Z → X Y Z       Z → X Y Z, Z → d
Conflicts !! (cells with more than one production) => not LL(1)
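The table construction and the LL(1) check can be sketched along the following lines. The nullable, FIRST and FOLLOW values are hard-coded from the slides, and all names (LL1Table, lookaheads, build, conflicts) as well as the "X,c" cell keys and the "eps" spelling of ε are hypothetical choices for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Sketch: predictive parsing table and LL(1) conflict check for the example grammar. */
final class LL1Table {
    // Each production is {lhs, s1, ..., sk}; k = 0 encodes an epsilon production.
    static final String[][] GRAMMAR = {
        {"Z", "d"}, {"Z", "X", "Y", "Z"}, {"Y"}, {"Y", "c"}, {"X", "Y"}, {"X", "a"}
    };
    // Assumption: these sets are copied from the slides, not computed here.
    static final Set<String> NULLABLE = set("X", "Y");
    static final Map<String, Set<String>> FIRST = new HashMap<String, Set<String>>();
    static final Map<String, Set<String>> FOLLOW = new HashMap<String, Set<String>>();
    static {
        FIRST.put("X", set("a", "c"));
        FIRST.put("Y", set("c"));
        FIRST.put("Z", set("a", "c", "d"));
        FOLLOW.put("X", set("a", "c", "d"));
        FOLLOW.put("Y", set("a", "c", "d"));
        FOLLOW.put("Z", set());
    }

    static Set<String> set(String... xs) {
        return new TreeSet<String>(Arrays.asList(xs));
    }

    /** Tokens c for which the production p = {X, s1, ..., sk} is applicable. */
    static Set<String> lookaheads(String[] p) {
        Set<String> result = new TreeSet<String>();
        boolean nullableSoFar = true;
        for (int i = 1; i < p.length && nullableSoFar; i++) {
            Set<String> f = FIRST.get(p[i]);
            if (f == null) {
                result.add(p[i]);                    // terminal: FIRST(t) = {t}
            } else {
                result.addAll(f);
            }
            nullableSoFar = NULLABLE.contains(p[i]);
        }
        if (nullableSoFar) {
            result.addAll(FOLLOW.get(p[0]));         // whole right-hand side is nullable
        }
        return result;
    }

    /** Maps each cell "X,c" to the list of applicable productions. */
    static Map<String, List<String>> build() {
        Map<String, List<String>> table = new TreeMap<String, List<String>>();
        for (String[] p : GRAMMAR) {
            String rhs = p.length == 1 ? "eps" : String.join(" ", Arrays.copyOfRange(p, 1, p.length));
            for (String c : lookaheads(p)) {
                String cell = p[0] + "," + c;
                if (!table.containsKey(cell)) {
                    table.put(cell, new ArrayList<String>());
                }
                table.get(cell).add(p[0] + " -> " + rhs);
            }
        }
        return table;
    }

    /** Space-separated list of cells with more than one production; empty iff LL(1). */
    static String conflicts() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> e : build().entrySet()) {
            if (e.getValue().size() > 1) {
                sb.append(e.getKey()).append(' ');
            }
        }
        return sb.toString().trim();
    }
}
```

For the example grammar, conflicts() reports the cells (X, a), (Y, c) and (Z, d), which are exactly the cells holding two productions in the table.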

Panic Mode Error Recovery

Recursive Descent Parsing

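Recursive descent parsing implements the productions directly as methods.
For every non-terminal:
–One procedure, with a switch on next token
–One case per production
Grammar:
  S → if E then S else S
  S → begin S L
  S → print E
  L → end
  L → ; S L
  E → num = num
Parser code for S (Java):
  // One case per production of S, selected by the current token tok.
  void S() {
    switch (tok) {
      case IF:    eat(IF); E(); eat(THEN); S(); eat(ELSE); S(); break;
      case BEGIN: eat(BEGIN); S(); L(); break;
      case PRINT: eat(PRINT); E(); break;
      default:    error();
    }
  }
  // Consumes the current token if it is the expected one.
  void eat(int expected) {
    if (tok == expected) {
      // Assumption: nextToken() stands for the lexer call. The slide uses
      // System.in.read(), which returns a single byte rather than a token.
      tok = nextToken();
    } else {
      System.out.print("expected: ");
      System.out.print(expected);
      ...
    }
  }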
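A complete, runnable variant of this scheme, covering all three non-terminals, is sketched below. It is not the slide's code: tokens are plain strings obtained by splitting the input at whitespace, syntax errors are signalled with an exception instead of error(), and the names RecursiveDescent, accepts, tok and eatNum are assumptions of the sketch.

```java
/** Sketch: recursive descent recogniser for the S/L/E grammar. */
final class RecursiveDescent {
    private final String[] tokens;
    private int pos = 0;

    private RecursiveDescent(String input) {
        tokens = input.trim().split("\\s+");   // assumption: tokens are whitespace-separated
    }

    /** The current token, or a marker once the input is exhausted. */
    private String tok() {
        return pos < tokens.length ? tokens[pos] : "<eof>";
    }

    private void eat(String expected) {
        if (tok().equals(expected)) {
            pos++;
        } else {
            throw new IllegalStateException("expected: " + expected + ", found: " + tok());
        }
    }

    private void eatNum() {
        if (tok().matches("[0-9]+")) {
            pos++;
        } else {
            throw new IllegalStateException("expected: num, found: " + tok());
        }
    }

    // S -> if E then S else S | begin S L | print E
    private void S() {
        switch (tok()) {
            case "if":    eat("if"); E(); eat("then"); S(); eat("else"); S(); break;
            case "begin": eat("begin"); S(); L(); break;
            case "print": eat("print"); E(); break;
            default: throw new IllegalStateException("unexpected token: " + tok());
        }
    }

    // L -> end | ; S L
    private void L() {
        switch (tok()) {
            case "end": eat("end"); break;
            case ";":   eat(";"); S(); L(); break;
            default: throw new IllegalStateException("unexpected token: " + tok());
        }
    }

    // E -> num = num
    private void E() {
        eatNum();
        eat("=");
        eatNum();
    }

    /** Returns true iff the whole input is derivable from S. */
    static boolean accepts(String input) {
        RecursiveDescent parser = new RecursiveDescent(input);
        try {
            parser.S();
            return parser.pos == parser.tokens.length;   // no trailing tokens allowed
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

Here the parse stack is implicit: it is the Java call stack of the mutually recursive methods S, L and E. For example, RecursiveDescent.accepts("begin print 1 = 1 ; print 2 = 2 end") returns true.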

Further Reading Terence Parr, Kathleen Fisher: LL(*): the foundation of the ANTLR parser generator. PLDI 2011: 425-436. ⇒ describes theory behind ANTLR
