COMP3190: Principle of Programming Languages Formal Language Syntax
Motivation The problem of parsing structured text is very common. Consider the structure of addresses (using a grammar): := := |. Describe and recognize addresses in arbitrary text.
Outline v DFA & NFA v Regular expression v Regular languages v Context-free languages & PDA v Scanner v Parser
Deterministic Finite Automata (DFA) v Q: finite set of states v Σ: finite set of “letters” (alphabet) v δ: Q × Σ -> Q (transition function) v q0: start state (in Q) v F: set of accept states (subset of Q) v Acceptance: the input is consumed with the automaton in an accept state.
Example of DFA
States: q1 (start), q2 (accept)
δ     0     1
q1    q1    q2
q2    q1    q2
Accepts all strings that end in 1
Another Example of a DFA [State diagram: start state S; states q1, q2, r1, r2; transitions labelled a and b] Accepts all strings that start and end with “a” OR start and end with “b”
Non-deterministic Finite Automata (NFA) The transition function is different: v δ: Q × Σε -> P(Q) v P(Q) is the powerset of Q (set of all subsets) v Σε is the union of Σ and the special symbol ε (denoting empty) A string is accepted if at least one path consumes the whole input and leads to an accept state.
Example of an NFA
Edges: q1 -> q1 on 0, 1; q1 -> q2 on 1; q2 -> q3 on 0, ε; q3 -> q4 on 1; q4 -> q4 on 0, 1
δ     0       1           ε
q1    {q1}    {q1, q2}    ∅
q2    {q3}    ∅           {q3}
q3    ∅       {q4}        ∅
q4    {q4}    {q4}        ∅
What strings does this NFA accept?
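One way to explore the question is to simulate the NFA by tracking the set of states it could be in. The sketch below is only an illustration: the transition table is read off the edge labels of the example, and taking q1 as the start state and q4 as the only accept state is an assumption, since the diagram itself is not reproduced here.

```python
# Transitions read off the example's edge labels; "" stands for ε.
# Assumption: q1 is the start state and q4 is the only accept state.
DELTA = {
    ("q1", "0"): {"q1"},
    ("q1", "1"): {"q1", "q2"},
    ("q2", "0"): {"q3"},
    ("q2", ""): {"q3"},
    ("q3", "1"): {"q4"},
    ("q4", "0"): {"q4"},
    ("q4", "1"): {"q4"},
}
START, ACCEPT = "q1", {"q4"}


def eps_closure(states):
    """All states reachable from `states` using only ε-moves."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in DELTA.get((q, ""), set()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def accepts(word):
    """Follow every possible path at once by keeping a set of current states."""
    current = eps_closure({START})
    for ch in word:
        moved = set()
        for q in current:
            moved |= DELTA.get((q, ch), set())
        current = eps_closure(moved)
    return bool(current & ACCEPT)
```

Trying a few inputs, e.g. `accepts("101")` and `accepts("100")`, is a quick way to form a hypothesis about the accepted language before proving it.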
Regular Expressions R is a regular expression if R is v “a” for some a in Σ. v ε (the empty string). v ∅ (the empty language). v the union of two regular expressions. v the concatenation of two regular expressions. v R1* (Kleene closure: zero or more repetitions of R1).
Regular Expression Notation v a: an ordinary letter v ε: the empty string v M | N: choosing from M or N v MN: concatenation of M and N v M*: zero or more times (Kleene star) v M+: one or more times v M?: zero or one occurrence v [a-zA-Z]: character set alternation (choice) v . (period): stands for any single character except newline
Examples of Regular Expressions
{0, 1}* 0: all strings that end in 0
{0, 1} 0*: strings that start with 1 or 0, followed by zero or more 0s
{0, 1}*: all strings
{0^n 1^n, n >= 0}: not a regular language, so no regular expression describes it!
Converting a Regular Expression to an NFA [Diagrams: NFA fragments for a, ε, M|N, MN and M*, glued together with ε-transitions]
Regular expression->NFA Language: Strings of 0s and 1s in which the number of 0s is even Regular expression: (1*01*0)*1*
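The regular expression on this slide can be checked mechanically. The sketch below uses Python's `re` module (whose syntax happens to coincide with the slide's notation for this expression) and compares it against a direct count of 0s; the helper names are made up for the illustration.

```python
import re
from itertools import product

# The slide's expression: pairs of 0s separated by any number of 1s, then trailing 1s.
EVEN_ZEROS = re.compile(r"(1*01*0)*1*")


def has_even_zeros(s):
    """True if the whole string matches the regular expression."""
    return EVEN_ZEROS.fullmatch(s) is not None


def agrees_up_to(n):
    """Brute-force check against counting 0s, for all binary strings of length <= n."""
    for length in range(n + 1):
        for bits in product("01", repeat=length):
            s = "".join(bits)
            if has_even_zeros(s) != (s.count("0") % 2 == 0):
                return False
    return True
```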
Converting an NFA to a DFA v For a set of states S, closure(S) is the set of states that can be reached from S without consuming any input. v For a set of states S, DFAedge(S, c) is the set of states that can be reached from S by consuming input symbol c. v Each set of NFA states corresponds to one DFA state (hence at most 2^n states).
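The definitions above translate almost directly into code. The following is a minimal sketch of the subset construction, not a production implementation; the small NFA (strings over {a, b} ending in "ab") and all function names are assumptions made for the example, and `dfa_edge` applies `closure` to its result so that ε-moves are followed after consuming c.

```python
# Hypothetical NFA: strings over {a, b} that end in "ab". "" would denote ε.
NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}


def closure(nfa, states):
    """States reachable from `states` without consuming input."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in nfa.get((q, ""), set()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def dfa_edge(nfa, states, c):
    """States reachable from `states` by consuming c (then any ε-moves)."""
    moved = set()
    for q in states:
        moved |= nfa.get((q, c), set())
    return closure(nfa, moved)


def subset_construction(nfa, start, alphabet):
    """Each reachable set of NFA states becomes one DFA state."""
    start_set = frozenset(closure(nfa, {start}))
    dfa, seen, worklist = {}, {start_set}, [start_set]
    while worklist:
        s = worklist.pop()
        for c in alphabet:
            t = frozenset(dfa_edge(nfa, s, c))
            if not t:            # no move: leave the dead state implicit
                continue
            dfa[(s, c)] = t
            if t not in seen:
                seen.add(t)
                worklist.append(t)
    return start_set, dfa, seen


def run(word, accept_state=2):
    """Run the constructed DFA; accept if the final set contains the NFA accept state."""
    state, dfa, _ = subset_construction(NFA, 0, "ab")
    for ch in word:
        state = dfa.get((state, ch), frozenset())
    return accept_state in state
```

Only reachable subsets are generated, which is why the result is usually far smaller than the 2^n worst case.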
NFA -> DFA Initial classes: {A, B, E}, {C, D} No class requires partitioning! Hence a two-state DFA is obtained.
Obtaining the minimal equivalent DFA v Initially two equivalence classes: final and nonfinal states. Search for an equivalence class C and an input letter a such that with a as input, the states in C make transitions to states in k>1 different equivalence classes. v Partition C into k classes accordingly v Repeat until unable to find a class to partition.
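The partitioning loop can be sketched as follows. This is an illustrative version that assumes a complete DFA (every state has a transition on every letter); the example automaton and the function name are made up for the sketch.

```python
def minimize(states, alphabet, delta, finals):
    """Return the equivalence classes of a complete DFA as a list of sets."""
    partition = [p for p in (set(finals), set(states) - set(finals)) if p]
    changed = True
    while changed:
        changed = False
        for block in partition:
            for a in alphabet:
                # Group the states of this class by the class their a-successor lies in.
                groups = {}
                for q in block:
                    target = next(i for i, p in enumerate(partition)
                                  if delta[(q, a)] in p)
                    groups.setdefault(target, set()).add(q)
                if len(groups) > 1:          # k > 1 target classes: split into k classes
                    partition.remove(block)
                    partition.extend(groups.values())
                    changed = True
                    break
            if changed:
                break                        # restart the search on the new partition
    return partition


# Hypothetical DFA for "strings ending in 1" with a redundant state C.
DELTA = {
    ("A", "0"): "C", ("A", "1"): "B",
    ("B", "0"): "A", ("B", "1"): "B",
    ("C", "0"): "A", ("C", "1"): "B",
}
CLASSES = minimize({"A", "B", "C"}, "01", DELTA, {"B"})
```

Here A and C end up in the same class, so the minimal DFA has two states.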
Example (cont.)
Regular Grammar v Later definitions build on earlier ones v Nothing defined in terms of itself (no recursion)
Regular grammar for numeric literals in Pascal:
digit -> 0 | 1 | 2 | ... | 8 | 9
unsigned_integer -> digit digit*
unsigned_number -> unsigned_integer ((. unsigned_integer) | ε) ((e (+ | - | ε) unsigned_integer) | ε)
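Because no rule refers to itself, each definition can be substituted into the next, giving a single regular expression. A small sketch in Python's `re` syntax, with the constant names chosen here to mirror the grammar:

```python
import re

# Each definition only uses the ones above it, exactly as in the grammar.
DIGIT = r"[0-9]"
UNSIGNED_INTEGER = rf"{DIGIT}{DIGIT}*"
UNSIGNED_NUMBER = (
    rf"{UNSIGNED_INTEGER}"
    rf"(\.{UNSIGNED_INTEGER})?"          # optional fraction: (. unsigned_integer) | ε
    rf"(e[+-]?{UNSIGNED_INTEGER})?"      # optional exponent: e (+ | - | ε) unsigned_integer
)


def is_unsigned_number(text):
    """True if the whole text is an unsigned_number."""
    return re.fullmatch(UNSIGNED_NUMBER, text) is not None
```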
Languages and Automata in Programming Languages v Regular languages »Recognized(accepted) by finite automata »Useful for tokenizing program text (lexical analysis) v Context-free languages »Recognized(accepted) by pushdown automata »Useful for parsing the syntax of a program
Important Theorems v A language is regular if and only if a regular expression describes it. v A language is regular if and only if a finite automaton recognizes it. v DFAs and NFAs are equally powerful.
Context-free Grammars
Context-free grammars are defined by substitution rules:
S -> P V P
P -> N
P -> A P
A -> big | green
N -> cheese | Jim
V -> ate
Example sentences: Big Jim ate green cheese; green Jim ate green cheese; Jim ate cheese; Cheese ate Jim
Context-free Grammars v Context-free grammars are used to formally describe the syntax of programming languages. v Every syntactically correct program is derived using the context-free grammar of the language. v Parsing a program involves tracing such derivation, given the context-free grammar and the program.
Context-free Grammars A context-free grammar consists of v V: a finite set of variables v Σ: a finite set of terminals v R: a finite set of rules of the form variable -> {variable, terminal}* v S: the start variable
Pushdown Automata (PDA) v A pushdown automaton consists of v Q: a set of states v Σ: input alphabet (of terminals) v Γ: stack alphabet v δ: a set of transition rules Q × Σε × Γε -> P(Q × Γε), read as: currentState, inputSymbol, headOfStack -> newState, pushSymbolOnStack v q0: the start state v F: the set of accept states (subset of Q) Deterministic: at most one move is possible from any configuration
How does a PDA accept? v By final state: »Consume all the input while »Reaching a final state v By empty stack: »Consume all the input while »Having an empty stack »Set of final states is irrelevant
Example of a PDA
Notation: a, b->c: when the PDA reads “a” from the input, it replaces “b” at the top of the stack with “c”.
[State diagram: states q1, q2, q3, q4] Transition labels: ε, ε->$; 0, ε->0; 1, 0->ε; ε, $->ε
What does this PDA accept?
Important Theorems v A language is context-free iff a pushdown automaton recognizes it v Non-deterministic PDAs are more powerful than deterministic ones
Example of Context-free Language That Requires a Non-deterministic PDA {w w^R | w belongs to {0, 1}*}, where w^R is w written backwards. Idea: non-deterministically guess the middle of the input string
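The Solution
q1 -> q2: ε, ε->$
q2 -> q2: 0, ε->0 and 1, ε->1 (push the first half)
q2 -> q3: ε, ε->ε (guess the middle)
q3 -> q3: 0, 0->ε and 1, 1->ε (match the second half against the stack)
q3 -> q4: ε, $->ε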
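A deterministic program cannot guess, but it can try every guess in turn. The sketch below simulates the idea of this PDA (push, guess the middle, then pop while matching) by backtracking over all possible middle positions; it is an illustration of the non-deterministic choice rather than a general PDA simulator, and the function name is made up.

```python
def accepts_w_wr(word):
    """True if word = w w^R for some w over {0, 1}."""

    def pop_phase(i, stack):
        # After the guessed middle: every input symbol must match the popped symbol.
        while i < len(word):
            if not stack or stack[-1] != word[i]:
                return False
            stack = stack[:-1]
            i += 1
        return not stack        # accept only if the pushed half is fully matched

    # Before the middle: the first i symbols are pushed; try each i as the guess.
    return any(pop_phase(i, list(word[:i])) for i in range(len(word) + 1))
```

The `any(...)` over all split points plays the role of non-determinism: the string is accepted if at least one guess leads to acceptance.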
Derivations and Parse Trees Nested constructs require recursion, i.e. context-free grammars CFG for arithmetic expressions expression -> identifier | number | - expression | (expression) | expression operator expression operator -> + | - | * | /
Parse Tree for Slope*x + Intercept Is this the only parse tree for this expression and grammar?
A Better Expression Grammar 1. expression -> term | expression add_op term 2. term -> factor | term mult_op factor 3. factor -> identifier | number | - factor | (expression) 4. add_op -> + | - 5. mult_op -> * | / A good grammar reflects the internal structure of programs. This grammar is unambiguous and captures (HOW?): - operator precedence (*,/ bind tighter than +,- ) - associativity (ops group left to right)
And Better Parse Trees *
Syntax-directed Compilation v Parser calls scanner to obtain tokens. v Assembles tokens into parse tree. v Passes tree to later phases of compilation. v Scanner: deterministic finite automaton. v Parser: pushdown automaton. v Scanners and parsers can be generated automatically from regular expressions and CFGs (e.g., lex/yacc).
Scanning v Accept the longest possible token in each invocation of the scanner. v Implementation. »Capture finite automata. Case(switch) statements. Table and driver.
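The "table and driver" style can be sketched as below. The token set (identifiers, integers, ":" and ":=") and all names are assumptions chosen for illustration, not the Pascal scanner from the slides; the point is the driver loop, which remembers the last accepting state so that the longest possible token is returned.

```python
def char_class(ch):
    """Map a character to the column used in the transition table."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return ch


# DFA as a table: TABLE[state][character class] -> next state
TABLE = {
    "start": {"letter": "ident", "digit": "int", ":": "colon"},
    "ident": {"letter": "ident", "digit": "ident"},
    "int": {"digit": "int"},
    "colon": {"=": "assign"},
    "assign": {},
}
ACCEPTING = {"ident": "ID", "int": "INT", "colon": "COLON", "assign": "ASSIGN"}


def next_token(text, pos):
    """Driver: run the DFA as far as possible, return the longest accepted token."""
    state, i, last_accept = "start", pos, None
    while i < len(text) and char_class(text[i]) in TABLE[state]:
        state = TABLE[state][char_class(text[i])]
        i += 1
        if state in ACCEPTING:
            last_accept = (ACCEPTING[state], i)
    if last_accept is None:
        raise ValueError(f"unexpected character at position {pos}")
    kind, end = last_accept
    return kind, text[pos:end], end


def scan(text):
    """Split text into (kind, lexeme) pairs, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        kind, lexeme, pos = next_token(text, pos)
        tokens.append((kind, lexeme))
    return tokens
```

With this table, ":=" is scanned as one ASSIGN token rather than COLON followed by an error, because the driver keeps going after reaching the accepting state for ":".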
Scanner for Pascal
Scanner for Pascal(case Statements)
Scanner (Table&driver)
Scanner Generators v Start with a regular expression. v Construct an NFA from it. v Use a set of subsets construction to obtain an equivalent DFA. v Construct the minimal equivalent DFA.
Outline v DFA & NFA v Regular expression v Regular languages v Context free languages &PDA v Scanner v Parser »Top-down parsing »Bottom-up Parsing »Comparison
Parsing approaches v Parsing in general has O(n^3) cost. v Need classes of grammars that can be parsed in linear time »Top-down or predictive parsing or recursive descent parsing or LL parsing (Left-to-right Left-most) »Bottom-up or shift-reduce parsing or LR parsing (Left-to-right Right-most)
A Simple Grammar for a Comma-separated List of Identifiers
id_list -> id id_list_tail
id_list_tail -> , id id_list_tail
id_list_tail -> ;
String to be parsed: A, B, C;
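For this grammar, a top-down parser needs one routine per non-terminal and one token of lookahead. The following sketch is one possible recursive-descent version; the token representation ((kind, text) pairs) and the helper `tokenize` are assumptions made for the example.

```python
import re


def tokenize(text):
    """Turn 'A, B, C;' into (kind, text) pairs; kinds are 'id', ',' and ';'."""
    return [(tok if tok in ",;" else "id", tok)
            for tok in re.findall(r"[A-Za-z_]\w*|[,;]", text)]


def parse_id_list(tokens):
    """Return the identifiers in the list, or raise SyntaxError."""
    pos = 0
    names = []

    def match(kind):
        nonlocal pos
        if pos < len(tokens) and tokens[pos][0] == kind:
            pos += 1
            return tokens[pos - 1][1]
        raise SyntaxError(f"expected {kind} at token {pos}")

    def id_list():                       # id_list -> id id_list_tail
        names.append(match("id"))
        id_list_tail()

    def id_list_tail():
        # The lookahead token selects the production.
        if pos < len(tokens) and tokens[pos][0] == ",":
            match(",")                   # id_list_tail -> , id id_list_tail
            names.append(match("id"))
            id_list_tail()
        else:
            match(";")                   # id_list_tail -> ;

    id_list()
    if pos != len(tokens):
        raise SyntaxError("trailing input after ';'")
    return names
```

For example, `parse_id_list(tokenize("A, B, C;"))` walks the same derivation a top-down parser would predict for the string on this slide.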
Top-down/bottom-up Parsing
Top-down Parsing v Predicts a derivation v Matches non-terminal against token observed in input
LL(1) Grammar v A grammar for which a top-down deterministic parser can be produced with one token of lookahead. v LL(1) grammar: »For a given non-terminal, the lookahead symbol uniquely determines the production to apply »Top-down parsing = predictive parsing »Driven by predictive parsing table of non-terminals × terminals -> productions
From Last Time: Parsing with Table
Grammar: S -> ES’   S’ -> ε | +S   E -> num | (S)
Parse table:
      num       +        (         )       $
S     -> ES’             -> ES’
S’              -> +S              -> ε    -> ε
E     -> num             -> (S)
Partly-derived string   Lookahead   Input (parsed part, then unparsed part)
ES’                     (           (1+2+(3+4))+5
(S)S’                   1           (1+2+(3+4))+5
(ES’)S’                 1           (1+2+(3+4))+5
(1S’)S’                 +           (1+2+(3+4))+5
(1+ES’)S’               2           (1+2+(3+4))+5
(1+2S’)S’               +           (1+2+(3+4))+5
How to Construct Parsing Tables?
Needed: an algorithm for automatically generating a predictive parse table from a grammar
Grammar: S -> ES’   S’ -> ε | +S   E -> number | (S)
Target table:
      num       +        (         )       $
S     -> ES’             -> ES’
S’              -> +S              -> ε    -> ε
E     -> num             -> (S)
Constructing Parse Tables v Can construct a predictive parser if: »For every non-terminal, every lookahead symbol can be handled by at most 1 production v FIRST(γ) for an arbitrary string γ of terminals and non-terminals is: »Set of symbols that might begin the fully expanded version of γ v FOLLOW(X) for a non-terminal X is: »Set of symbols that might follow the derivation of X in the input stream
Parse Table Entries v Consider a production X -> γ v Add -> γ to the X row for each symbol in FIRST(γ) v If γ can derive ε (γ is nullable), add -> γ for each symbol in FOLLOW(X) v Grammar is LL(1) if no conflicting entries
Grammar: S -> ES’   S’ -> ε | +S   E -> number | (S)
      num       +        (         )       $
S     -> ES’             -> ES’
S’              -> +S              -> ε    -> ε
E     -> num             -> (S)
Computing Nullable v X is nullable if it can derive the empty string: »If it derives ε directly (X -> ε) »If it has a production X -> YZ... where all RHS symbols (Y, Z) are nullable v Algorithm: assume all non-terminals are non-nullable, apply rules repeatedly until no change
Grammar: S -> ES’   S’ -> ε | +S   E -> number | (S)
Only S’ is nullable
Computing FIRST v Determining FIRST(X)
1. if X is a terminal, then add X to FIRST(X)
2. if X -> ε then add ε to FIRST(X)
3. if X is a nonterminal and X -> Y1Y2...Yk then a is in FIRST(X) if a is in FIRST(Yi) and ε is in FIRST(Yj) for j = 1...i-1 (i.e., it is possible to have an empty prefix Y1...Yi-1)
4. if ε is in FIRST(Y1Y2...Yk) then ε is in FIRST(X)
FIRST Example
Grammar: S -> ES’   S’ -> ε | +S   E -> number | (S)
Apply rule 1: FIRST(num) = {num}, FIRST(+) = {+}, etc.
Apply rule 2: FIRST(S’) = {ε}
Apply rule 3: FIRST(S) = FIRST(E) = {}
              FIRST(S’) = FIRST(‘+’) + {ε} = {ε, +}
              FIRST(E) = FIRST(num) + FIRST(‘(‘) = {num, (}
Rule 3 again: FIRST(S) = FIRST(E) = {num, (}
              FIRST(S’) = {ε, +}
              FIRST(E) = {num, (}
Computing FOLLOW v Determining FOLLOW(X)
1. if S is the start symbol then $ is in FOLLOW(S)
2. if A -> αBβ then add all of FIRST(β) except ε to FOLLOW(B)
3. if A -> αB, or A -> αBβ and ε is in FIRST(β), then add FOLLOW(A) to FOLLOW(B)
FOLLOW Example
Grammar: S -> ES’   S’ -> ε | +S   E -> number | (S)
FIRST(S) = {num, (}   FIRST(S’) = {ε, +}   FIRST(E) = {num, (}
Apply rule 1: FOL(S) = {$}
Apply rule 2: S -> ES’: FOL(E) += {FIRST(S’) - ε} = {+}
              S’ -> ε | +S: -
              E -> num | (S): FOL(S) += {FIRST(‘)’) - ε} = {$, )}
Apply rule 3: S -> ES’: FOL(E) += FOL(S) = {+, $, )} (because S’ is nullable)
                        FOL(S’) += FOL(S) = {$, )}
Putting it all Together
Grammar: S -> ES’   S’ -> ε | +S   E -> number | (S)
FOLLOW(S) = {$, )}   FOLLOW(S’) = {$, )}   FOLLOW(E) = {+, ), $}
FIRST(S) = {num, (}   FIRST(S’) = {ε, +}   FIRST(E) = {num, (}
v Consider a production X -> γ v Add -> γ to the X row for each symbol in FIRST(γ) v If γ can derive ε (γ is nullable), add -> γ for each symbol in FOLLOW(X)
      num       +        (         )       $
S     -> ES’             -> ES’
S’              -> +S              -> ε    -> ε
E     -> num             -> (S)
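The whole pipeline (nullable, FIRST, FOLLOW, table entries) fits in a short fixed-point computation. The sketch below is one possible formulation for the sum grammar; it deviates from the slides in one labelled way: ε is not stored inside FIRST sets, and nullability is tracked in a separate set instead. All names are chosen for the illustration.

```python
# Productions as lists of symbols; [] is the ε-production.
GRAMMAR = {
    "S": [["E", "S'"]],
    "S'": [[], ["+", "S"]],
    "E": [["num"], ["(", "S", ")"]],
}
START = "S"


def first_of(seq, first, nullable):
    """FIRST of a symbol string (without ε) and whether the string is nullable."""
    out = set()
    for sym in seq:
        if sym not in GRAMMAR:           # terminal
            out.add(sym)
            return out, False
        out |= first[sym]
        if sym not in nullable:
            return out, False
    return out, True


def analyse():
    """Iterate the nullable/FIRST/FOLLOW rules until nothing changes."""
    nullable = set()
    first = {x: set() for x in GRAMMAR}
    follow = {x: set() for x in GRAMMAR}
    follow[START].add("$")
    changed = True
    while changed:
        changed = False
        for x, prods in GRAMMAR.items():
            for rhs in prods:
                f, nul = first_of(rhs, first, nullable)
                if nul and x not in nullable:
                    nullable.add(x)
                    changed = True
                if not f <= first[x]:
                    first[x] |= f
                    changed = True
                for i, sym in enumerate(rhs):
                    if sym in GRAMMAR:   # FOLLOW rules 2 and 3 for non-terminal sym
                        f2, nul2 = first_of(rhs[i + 1:], first, nullable)
                        new = f2 | (follow[x] if nul2 else set())
                        if not new <= follow[sym]:
                            follow[sym] |= new
                            changed = True
    return nullable, first, follow


def parse_table():
    """Map (non-terminal, lookahead) to the list of applicable productions."""
    nullable, first, follow = analyse()
    table = {}
    for x, prods in GRAMMAR.items():
        for rhs in prods:
            f, nul = first_of(rhs, first, nullable)
            for t in f | (follow[x] if nul else set()):
                table.setdefault((x, t), []).append(rhs)
    return table
```

A cell holding more than one production signals a conflict, i.e. the grammar is not LL(1); for this grammar every filled cell holds exactly one.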
Ambiguous Grammars Construction of a predictive parse table for an ambiguous grammar results in conflicts in the table (i.e., 2 or more productions to apply in the same cell) S -> S + S | S * S | num   FIRST(S+S) = FIRST(S*S) = FIRST(num) = {num}
Class Problem
E -> E + T | T
T -> T * F | F
F -> (E) | num | ε
1. Compute FIRST and FOLLOW sets for this grammar 2. Compute parse table entries
Top-Down Parsing Up to This Point v Now we know »How to build parsing table for an LL(1) grammar (ie FIRST/FOLLOW) »How to construct recursive-descent parser from parsing table »Call tree = parse tree v Open question – Can we generate the AST?
Creating the Abstract Syntax Tree
v Some class definitions to assist with AST construction
class Expr {}
class Add extends Expr {
    Expr left, right;
    Add(Expr L, Expr R) { left = L; right = R; }
}
class Num extends Expr {
    int value;
    Num(int v) { value = v; }
}
Class hierarchy: Num and Add are subclasses of Expr
Creating the AST Input: (1 + 2 + (3 + 4)) + 5 [Figure: parse tree for the input] We got the parse tree from the call tree. Just add code to each parsing routine to create the appropriate nodes. This works because the parse tree and the call tree have the same shape, and the AST is just a compressed form of the parse tree.
AST Creation: parse_E
Grammar: S -> ES’   S’ -> ε | +S   E -> number | (S)
Expr parse_E() {
    Expr result;
    switch (token) {              // remember, token is the lookahead token
        case num:                 // E -> number
            result = new Num(token.value);
            token = input.read();
            return result;
        case '(':                 // E -> (S)
            token = input.read();
            result = parse_S();
            if (token != ')') ParseError();
            token = input.read();
            return result;
        default:
            ParseError();
    }
}
AST Creation: parse_S
Grammar: S -> ES’   S’ -> ε | +S   E -> number | (S)
Expr parse_S() {
    switch (token) {
        case num: case '(':       // S -> ES’
            Expr left = parse_E();
            Expr right = parse_S’();
            if (right == null) return left;
            else return new Add(left, right);
        default:
            ParseError();
    }
}
Grammars v Have been using grammar for language “sums with parentheses”, e.g. (1+2+(3+4))+5
v Started with simple, right-associative grammar
»S -> E + S | E
»E -> num | (S)
v Transformed it to an LL(1) grammar by left factoring:
»S -> ES’
»S’ -> ε | +S
»E -> num | (S)
v What if we start with a left-associative grammar?
»S -> S + E | E
»E -> num | (S)
Reminder: Left vs Right Associativity
Right recursion, right associative: S -> E + S   S -> E   E -> num
Left recursion, left associative: S -> S + E   S -> E   E -> num
Consider a simpler string on a simpler grammar: “ ”
Left Recursion
Grammar: S -> S + E   S -> E   E -> num
Input: “ ”
Derived string (shown with lookahead and read/unread columns): S, S+E, S+E+E, S+E+E+E, E+E+E+E, E+E+E, E+E, E, $
Is this right? If not, what’s the problem?
Left-Recursive Grammars v Left-recursive grammars don’t work with top-down parsers: we don’t know when to stop the recursion v Left-recursive grammars are NOT LL(1)! »S -> Sα »S -> β v In parse table »Both productions will appear in the predictive table at row S in all the columns corresponding to FIRST(β)
Eliminate Left Recursion v Replace »X -> Xγ1 | ... | Xγm »X -> α1 | ... | αn v With »X -> α1X’ | ... | αnX’ »X’ -> γ1X’ | ... | γmX’ | ε v See complete algorithm in Dragon book
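The replacement rule above handles immediate left recursion for a single non-terminal, and that case is easy to write down. The sketch below covers only that case (not the complete algorithm for indirect recursion); productions are lists of symbols, [] stands for ε, and the function name is made up.

```python
def eliminate_immediate_left_recursion(x, productions):
    """Rewrite X -> X γ | α into X -> α X', X' -> γ X' | ε."""
    gammas = [rhs[1:] for rhs in productions if rhs and rhs[0] == x]
    alphas = [rhs for rhs in productions if not rhs or rhs[0] != x]
    if not gammas:                       # nothing to do: X is not left-recursive
        return {x: productions}
    x_new = x + "'"
    return {
        x: [alpha + [x_new] for alpha in alphas],
        x_new: [gamma + [x_new] for gamma in gammas] + [[]],
    }
```

Applied to S -> S + E | E, this yields S -> E S' and S' -> + E S' | ε.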
Class Problem Transform the following grammar to eliminate left recursion:
E -> E + T | T
T -> T * F | F
F -> (E) | num
Creating an LL(1) Grammar
v Start with a left-recursive grammar: S -> S + E   S -> E
»and apply left-recursion elimination algorithm: S -> ES’   S’ -> +ES’ | ε
v Start with a right-recursive grammar: S -> E + S   S -> E
»and apply left-factoring to eliminate common prefixes: S -> ES’   S’ -> +S | ε
Top-Down Parsing Summary Language grammar -> (left-recursion elimination, left factoring) -> LL(1) grammar -> (FIRST, FOLLOW) -> predictive parsing table -> recursive-descent parser -> parser with AST generation
New Topic: Bottom-Up Parsing v A more powerful parsing technology v LR grammars: more expressive than LL »Construct right-most derivation of program »Handle left-recursive grammars; the grammars of virtually all programming languages are left-recursive »Easier to express syntax v Shift-reduce parsers »Parsers for LR grammars »Automatic parser generators (yacc, bison)
Bottom-Up Parsing (2) v Right-most derivation, backward »Start with the tokens »End with the start symbol »Match substring on RHS of production, replace by LHS
Grammar: S -> S + E | E   E -> num | (S)
(1+2+(3+4))+5
(E+2+(3+4))+5
(S+2+(3+4))+5
(S+E+(3+4))+5
(S+(3+4))+5
(S+(E+4))+5
(S+(S+4))+5
(S+(S+E))+5
(S+(S))+5
(S+E)+5
(S)+5
E+5
S+E
S
Shift-Reduce Parsing v Parsing actions: A sequence of shift and reduce operations v Parser state: A stack of terminals and non-terminals (grows to the right) v Current derivation step = stack + input
Derivation step    Stack    Unconsumed input
(1+2+(3+4))+5               (1+2+(3+4))+5
(E+2+(3+4))+5      (E       +2+(3+4))+5
(S+2+(3+4))+5      (S       +2+(3+4))+5
(S+E+(3+4))+5      (S+E     +(3+4))+5
...
Shift-Reduce Actions v Parsing is a sequence of shifts and reduces v Shift: move look-ahead token to stack v Reduce: Replace symbols γ from top of stack with non-terminal symbol X corresponding to the production X -> γ (e.g., pop γ, push X)
Shift example:
stack    input            action
(        1+2+(3+4))+5     shift 1
(1       +2+(3+4))+5
Reduce example:
stack    input            action
(S+E     +(3+4))+5        reduce S -> S + E
(S       +(3+4))+5
Shift-Reduce Parsing
Grammar: S -> S + E | E   E -> num | (S)
derivation       stack    input stream     action
(1+2+(3+4))+5             (1+2+(3+4))+5    shift
(1+2+(3+4))+5    (1       +2+(3+4))+5      reduce E -> num
(E+2+(3+4))+5    (E       +2+(3+4))+5      reduce S -> E
(S+2+(3+4))+5    (S       +2+(3+4))+5      shift
(S+2+(3+4))+5    (S+2     +(3+4))+5        reduce E -> num
(S+E+(3+4))+5    (S+E     +(3+4))+5        reduce S -> S+E
(S+(3+4))+5      (S       +(3+4))+5        shift
(S+(3+4))+5      (S+(3    +4))+5           reduce E -> num
...
Potential Problems v How do we know which action to take: whether to shift or reduce, and which production to apply v Issues »Sometimes can reduce but should not »Sometimes can reduce in different ways
Action Selection Problem v Given stack σ and look-ahead symbol b, should parser: »Shift b onto the stack making it σb? »Reduce X -> γ assuming that the stack has the form σ = αγ, making it αX? v If stack has the form αγ, should apply reduction X -> γ (or shift) depending on stack prefix α? »α is different for different possible reductions since γ’s have different lengths
LR Parsing Engine v Basic mechanism »Use a set of parser states »Use stack with alternating symbols and states E.g., 1 ( 6 S (blue = state numbers) »Use parsing table to: Determine what action to apply (shift/reduce) Determine next state v The parser actions can be precisely determined from the table
LR Parsing Table v Algorithm: look at entry for current state S and input terminal C »If Table[S,C] = s(S’) then shift: push(C), push(S’) »If Table[S,C] = X -> γ then reduce: pop(2*|γ|), S’ = top(), push(X), push(Table[S’,X])
Table layout: one row per state. The action table (one column per terminal) gives the next action and next state; the goto table (one column per non-terminal) gives the next state.
LR Parsing Table Example
Columns ( ) id , $ are input terminals; columns S and L are non-terminals.
State   (          )          id         ,          $          S    L
1       s3                    s2                               g4
2       S->id      S->id      S->id      S->id      S->id
3       s3                    s2                               g7   g5
4                                                   accept
5                  s6                    s8
6       S->(L)     S->(L)     S->(L)     S->(L)     S->(L)
7       L->S       L->S       L->S       L->S       L->S
8       s3                    s2                               g9
9       L->L,S     L->L,S     L->L,S     L->L,S     L->L,S
We want to derive this in an algorithmic fashion
Parsing Example ((a),b)
Grammar: S -> (L) | id   L -> S | L,S
derivation   stack          input      action
((a),b)      1              ((a),b)    shift, goto 3
((a),b)      1(3            (a),b)     shift, goto 3
((a),b)      1(3(3          a),b)      shift, goto 2
((a),b)      1(3(3a2        ),b)       reduce S -> id
((S),b)      1(3(3S7        ),b)       reduce L -> S
((L),b)      1(3(3L5        ),b)       shift, goto 6
((L),b)      1(3(3L5)6      ,b)        reduce S -> (L)
(S,b)        1(3S7          ,b)        reduce L -> S
(L,b)        1(3L5          ,b)        shift, goto 8
(L,b)        1(3L5,8        b)         shift, goto 2
(L,b)        1(3L5,8b2      )          reduce S -> id
(L,S)        1(3L5,8S9      )          reduce L -> L,S
(L)          1(3L5          )          shift, goto 6
(L)          1(3L5)6        $          reduce S -> (L)
S            1S4            $          done
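The trace above can be reproduced with a small table-driven driver. The sketch below encodes the example parsing table for S -> (L) | id, L -> S | L,S as Python dictionaries. One simplification is an assumption of this sketch: the stack holds states only (the symbols are implied by the states), so a reduction pops |γ| entries rather than 2*|γ|.

```python
# Production name -> (left-hand side, length of right-hand side)
PRODUCTIONS = {
    "S->(L)": ("S", 3), "S->id": ("S", 1), "L->S": ("L", 1), "L->L,S": ("L", 3),
}
TERMINALS = ["(", ")", "id", ",", "$"]
ACTION = {
    1: {"(": ("s", 3), "id": ("s", 2)},
    3: {"(": ("s", 3), "id": ("s", 2)},
    4: {"$": ("acc",)},
    5: {")": ("s", 6), ",": ("s", 8)},
    8: {"(": ("s", 3), "id": ("s", 2)},
}
# LR(0) reduce states reduce on every terminal.
for state, prod in {2: "S->id", 6: "S->(L)", 7: "L->S", 9: "L->L,S"}.items():
    ACTION[state] = {t: ("r", prod) for t in TERMINALS}
GOTO = {1: {"S": 4}, 3: {"S": 7, "L": 5}, 8: {"S": 9}}


def parse(tokens):
    """Return the sequence of reductions performed, or raise SyntaxError."""
    tokens = list(tokens) + ["$"]
    stack = [1]                          # stack of states; state 1 is the start state
    reductions, i = [], 0
    while True:
        act = ACTION[stack[-1]].get(tokens[i])
        if act is None:
            raise SyntaxError(f"unexpected {tokens[i]!r} at position {i}")
        if act[0] == "s":                # shift: push the next state, consume the token
            stack.append(act[1])
            i += 1
        elif act[0] == "r":              # reduce: pop |γ| states, then follow the goto
            lhs, length = PRODUCTIONS[act[1]]
            del stack[-length:]
            stack.append(GOTO[stack[-1]][lhs])
            reductions.append(act[1])
        else:                            # accept
            return reductions
```

Calling `parse(["(", "(", "id", ")", ",", "id", ")"])` corresponds to the input ((a),b), with both a and b scanned as the token id.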
LR(k) Grammars v LR(k) = Left-to-right scanning, right-most derivation, k lookahead chars v Main cases »LR(0), LR(1) »Some variations SLR and LALR(1) v Parsers for LR(0) Grammars: »Determine the actions without any lookahead »Will help us understand shift-reduce parsing
Building LR(0) Parsing Tables v To build the parsing table: »Define states of the parser »Build a DFA to describe transitions between states »Use the DFA to build the parsing table v Each LR(0) state is a set of LR(0) items »An LR(0) item: X -> α . β where X -> αβ is a production in the grammar »The LR(0) items keep track of the progress on all of the possible upcoming productions »The item X -> α . β abstracts the fact that the parser already matched the string α at the top of the stack
Example LR(0) State v An LR(0) item is a production from the language with a separator “.” somewhere in the RHS of the production v Sub-string before “.” is already on the stack (beginnings of possible γ’s to be reduced) v Sub-string after “.”: what we might see next
Example state with two items: E -> num .   E -> ( . S )
Class Problem For the productions E -> num | (S), two items are: E -> num . and E -> ( . S ). Are there any others? If so, what are they? If not, why?
LR(0) Grammar v Nested lists »S -> (L) | id »L -> S | L,S v Examples »(a,b,c) »((a,b), (c,d), (e,f)) »(a, (b,c,d), ((f,g))) [Figure: parse tree for (a, (b,c), d)]
Start State and Closure v Start state »Augment grammar with production: S’ -> S $ »Start state of DFA has empty stack: S’ -> . S $ v Closure of a parser state: »Start with Closure(S) = S »Then for each item in S of the form X -> α . Y β, add items for all the productions Y -> γ to the closure of S: Y -> . γ
Closure Example
Grammar: S -> (L) | id   L -> S | L,S
DFA start state: S’ -> . S $
Closure:
S’ -> . S $
S -> . (L)
S -> . id
- Set of possible productions to be reduced next
- Added items have the “.” located at the beginning: no symbols for these items on the stack yet
The Goto Operation v Goto operation = describes transitions between parser states, which are sets of items v Algorithm: for state I and a symbol Y »If the item [X -> α . Y β] is in I, then »Goto(I, Y) = Closure( [X -> α Y . β] )
Example: for I = { S’ -> . S $, S -> . (L), S -> . id }, Goto(I, ‘(‘) = Closure( { S -> ( . L) } )
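Closure and goto are both short set computations. The sketch below represents an LR(0) item as a tuple (lhs, rhs, dot), with the dot sitting before rhs[dot]; the representation and the function names are choices made for this illustration, using the nested-list grammar.

```python
GRAMMAR = {
    "S'": [("S", "$")],
    "S": [("(", "L", ")"), ("id",)],
    "L": [("S",), ("L", ",", "S")],
}


def closure(items):
    """Add Y -> . γ for every item with the dot in front of a non-terminal Y."""
    result = set(items)
    work = list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for prod in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], prod, 0)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return frozenset(result)


def goto(items, symbol):
    """Move the dot over `symbol` in every item that allows it, then close."""
    moved = {(lhs, rhs, dot + 1)
             for lhs, rhs, dot in items
             if dot < len(rhs) and rhs[dot] == symbol}
    return closure(moved)


def show(item):
    """Render an item as text, e.g. 'S -> ( . L )'."""
    lhs, rhs, dot = item
    return f"{lhs} -> {' '.join(rhs[:dot] + ('.',) + rhs[dot:])}"


START_STATE = closure({("S'", ("S", "$"), 0)})
```

Repeatedly applying `goto` to every state and every grammar symbol, starting from `START_STATE`, enumerates the states of the LR(0) DFA.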
Class Problem
Grammar: E’ -> E   E -> E + T | T   T -> T * F | F   F -> (E) | id
1. If I = { [E’ -> . E] }, then Closure(I) = ??
2. If I = { [E’ -> E .], [E -> E . + T] }, then Goto(I,+) = ??
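Applying Reduce Actions
Grammar: S -> (L) | id   L -> S | L,S
[DFA fragment] States shown:
{ S’ -> . S $ ; S -> . (L) ; S -> . id }
{ S -> ( . L) ; L -> . S ; L -> . L,S ; S -> . (L) ; S -> . id }
{ S -> id . }
{ S -> (L . ) ; L -> L . , S }
{ L -> S . }
States causing reductions are those where the dot has reached the end. Pop the RHS off the stack, replace it with the LHS X (for X -> β), then rerun the DFA (e.g., on (x))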
Reductions v On reducing X -> β with stack αβ »Pop β off stack, revealing prefix α and state »Take single step in DFA from top state »Push X onto stack with new DFA state v Example
derivation   stack            input    action
((a),b)      1 ( 3 ( 3        a),b)    shift, goto 2
((a),b)      1 ( 3 ( 3 a 2    ),b)     reduce S -> id
((S),b)      1 ( 3 ( 3 S 7    ),b)     reduce L -> S
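Full DFA
Grammar: S -> (L) | id   L -> S | L,S
States (numbered as in the parsing table):
1: S’ -> . S $ ; S -> . (L) ; S -> . id
2: S -> id .
3: S -> ( . L) ; L -> . S ; L -> . L,S ; S -> . (L) ; S -> . id
4: S’ -> S . $ (final state on $)
5: S -> (L . ) ; L -> L . , S
6: S -> (L) .
7: L -> S .
8: L -> L , . S ; S -> . (L) ; S -> . id
9: L -> L,S .
Transitions: 1 -(-> 3, 1 -id-> 2, 1 -S-> 4, 3 -(-> 3, 3 -id-> 2, 3 -S-> 7, 3 -L-> 5, 5 -)-> 6, 5 -,-> 8, 8 -(-> 3, 8 -id-> 2, 8 -S-> 9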
Building the Parsing Table v States in the table = states in the DFA v For transition S -> S’ on terminal C: »Table[S,C] += Shift(S’) v For transition S -> S’ on non-terminal N: »Table[S,N] += Goto(S’) v If S is a reduction state X -> β then: »Table[S,*] += Reduce(X -> β)
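Computed LR Parsing Table
Columns ( ) id , $ are input terminals; columns S and L are non-terminals. sN = shift and go to state N, gN = goto state N, a production = reduce.
State   (          )          id         ,          $          S    L
1       s3                    s2                               g4
2       S->id      S->id      S->id      S->id      S->id
3       s3                    s2                               g7   g5
4                                                   accept
5                  s6                    s8
6       S->(L)     S->(L)     S->(L)     S->(L)     S->(L)
7       L->S       L->S       L->S       L->S       L->S
8       s3                    s2                               g9
9       L->L,S     L->L,S     L->L,S     L->L,S     L->L,S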
LR(0) Summary v LR(0) parsing recipe: »Start with LR(0) grammar »Compute LR(0) states and build DFA: Use the closure operation to compute states Use the goto operation to compute transitions »Build the LR(0) parsing table from the DFA v This can be done automatically
Class Problem Generate the DFA for the following grammar:
S -> E + S | E
E -> num
LR(0) Limitations v An LR(0) machine only works if states with reduce actions have a single reduce action »Always reduce regardless of lookahead v With a more complex grammar, construction gives states with shift/reduce or reduce/reduce conflicts v Need to use lookahead to choose
[Figure: example states labelled OK, shift/reduce and reduce/reduce, built from the items L -> L , S .   S -> S . , L   L -> S , L .   L -> S .]
A Non-LR(0) Grammar v Grammar for addition of numbers »S -> S + E | E »E -> num v Left-associative version is LR(0) v Right-associative version is not LR(0), as you saw with the previous class problem »S -> E + S | E »E -> num
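LR(0) Parsing Table
Grammar: S -> E + S | E   E -> num
DFA states:
1: S’ -> . S $ ; S -> . E + S ; S -> . E ; E -> . num
2: S -> E . + S ; S -> E .
3: S -> E + . S ; S -> . E + S ; S -> . E ; E -> . num
4: E -> num .
5: S -> E + S .
6: S’ -> S . $
7: S’ -> S $ .
Parsing table (first two rows):
     num       +              $         E    S
1    s4                                 g2   g6
2    S -> E    s3 / S -> E    S -> E
Shift or reduce in state 2?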
Solve Conflict With Lookahead v 3 popular techniques for employing lookahead of 1 symbol with bottom-up parsing »SLR: Simple LR »LALR: LookAhead LR »LR(1) v Each has a different means of utilizing the lookahead »Results in different processing capabilities
SLR Parsing v SLR Parsing = Easy extension of LR(0) »For each reduction X -> β, look at next symbol C »Apply reduction only if C is in FOLLOW(X) v SLR parsing table eliminates some conflicts »Same as LR(0) table except reduction rows »Adds reductions X -> β only in the columns of symbols in FOLLOW(X)
Grammar: S -> E + S | E   E -> num
Example: FOLLOW(S) = {$}
     num    +     $         E    S
1    s4                     g2   g6
2           s3    S -> E
SLR Parsing Table v Reductions do not fill entire rows as before v Otherwise, same as LR(0)
Grammar: S -> E + S | E   E -> num
     num    +           $           E    S
1    s4                             g2   g6
2           s3          S -> E
3    s4                             g2   g5
4           E -> num    E -> num
5                       S -> E+S
6                       s7
7                       accept
Class Problem
Consider:
S -> L = R
S -> R
L -> *R
L -> ident
R -> L
Think of L as l-value, R as r-value, and * as a pointer dereference.
When you create the states in the SLR(1) DFA, 2 of the states are the following: { S -> L . = R ; R -> L . } and { S -> R . }
Do you have any shift/reduce conflicts? (Not as easy as it looks)
LR(1) Parsing v Get as much as possible out of 1 lookahead symbol parsing table v LR(1) grammar = recognizable by a shift/reduce parser with 1 lookahead v LR(1) parsing uses similar concepts as LR(0) »Parser states = set of items »LR(1) item = LR(0) item + lookahead symbol possibly following production
LR(0) item: S -> . S + E
LR(1) item: S -> . S + E, +
Lookahead only has impact upon REDUCE operations: apply the reduction when lookahead = next input
LR(1) States v LR(1) state = set of LR(1) items v LR(1) item = (X -> α . β, y) »Meaning: α already matched at top of the stack, next expect to see βy v Shorthand notation »(X -> α . β, {x1, ..., xn}) »means: (X -> α . β, x1) ... (X -> α . β, xn) v Need to extend closure and goto operations
Example state: S -> S . + E, {+, $} ; S -> S + . E, num
LR(1) Closure v LR(1) closure operation: »Start with Closure(S) = S »For each item in S of the form X -> α . Y β, z and for each production Y -> γ, add the following item to the closure of S: Y -> . γ, FIRST(βz) »Repeat until nothing changes v Similar to LR(0) closure, but also keeps track of lookahead symbol
LR(1) Start State v Initial state: start with (S’ -> . S, $), then apply closure operation v Example: sum grammar
Grammar: S’ -> S $   S -> E + S | E   E -> num
Closure of { S’ -> . S, $ }:
S’ -> . S, $
S -> . E + S, $
S -> . E, $
E -> . num, {+, $}
LR(1) Goto Operation v LR(1) goto operation = describes transitions between LR(1) states v Algorithm: for a state I and a symbol Y (as before) »If the item [X -> α . Y β, z] is in I, then »Goto(I, Y) = Closure( [X -> α Y . β, z] )
Grammar: S’ -> S $   S -> E + S | E   E -> num
Example: S1 = { S -> E . + S, $ ; S -> E ., $ } and S2 = Goto(S1, ‘+’) = Closure({ S -> E + . S, $ })
Class Problem
Grammar: S’ -> S $   S -> E + S | E   E -> num
1. Compute: Closure(I = {S -> E + . S, $}) 2. Compute: Goto(I, num) 3. Compute: Goto(I, E)
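LR(1) DFA Construction
Grammar: S’ -> S $   S -> E + S | E   E -> num
States shown in the diagram (items with lookaheads):
{ S’ -> . S, $ ; S -> . E + S, $ ; S -> . E, $ ; E -> . num, {+, $} }
{ E -> num ., {+, $} }
{ S’ -> S ., $ }
{ S -> E . + S, $ ; S -> E ., $ }
{ S -> E + . S, $ ; S -> . E + S, $ ; S -> . E, $ ; E -> . num, {+, $} }
{ S -> E+S ., {+, $} }
Edge labels: S, E, num, +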
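LR(1) Reductions
Grammar: S’ -> S $   S -> E + S | E   E -> num
States shown in the diagram (items with lookaheads):
{ S’ -> . S, $ ; S -> . E + S, $ ; S -> . E, $ ; E -> . num, {+, $} }
{ E -> num ., {+, $} }
{ S’ -> S ., $ }
{ S -> E . + S, $ ; S -> E ., $ }
{ S -> E + . S, $ ; S -> . E + S, $ ; S -> . E, $ ; E -> . num, {+, $} }
{ S -> E ., {+, $} }
Reductions correspond to LR(1) items of the form (X -> γ ., y)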
LR(1) Parsing Table Construction v Same as construction of LR(0), except for reductions v For a transition S -> S’ on terminal x: »Table[S,x] += Shift(S’) v For a transition S -> S’ on non-terminal N: »Table[S,N] += Goto(S’) v If I contains {(X -> γ ., y)} then: »Table[I,y] += Reduce(X -> γ)
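LR(1) Parsing Table Example
Grammar: S’ -> S $   S -> E + S | E   E -> num
State 1: { S’ -> . S, $ ; S -> . E + S, $ ; S -> . E, $ ; E -> . num, {+, $} }
State 2 (from 1 on E): { S -> E . + S, $ ; S -> E ., $ }
State 3 (from 2 on +): { S -> E + . S, $ ; S -> . E + S, $ ; S -> . E, $ ; E -> . num, {+, $} }
Fragment of the parsing table:
     +     $         E
1                    g2
2    s3    S -> E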
Class Problem Compute the LR(1) DFA for the following grammar:
E -> E + T | T
T -> TF | F
F -> F* | a | b
LALR(1) Grammars v Problem with LR(1): too many states v LALR(1) parsing (aka LookAhead LR) »Constructs the LR(1) DFA and then merges any 2 LR(1) states whose items are identical except for lookahead »Results in smaller parser tables »Theoretically less powerful than LR(1) v LALR(1) grammar = a grammar whose LALR(1) parsing table has no conflicts
Example: { S -> id ., + ; S -> E ., $ } + { S -> id ., $ ; S -> E ., + } = ??
LALR Parsers v LALR(1) »Generally same number of states as SLR (far fewer than LR(1)) »But with the same lookahead capability as LR(1) (much better than SLR) »Example: Pascal programming language: in SLR, several hundred states; in LR(1), several thousand states
Automate the Parsing Process v Can automate: »The construction of LR parsing tables »The construction of shift-reduce parsers based on these parsing tables v LALR(1) parser generators »yacc, bison »Not much difference compared to LR(1) in practice »Smaller parsing tables than LR(1) »Augment LALR(1) grammar specification with declarations of precedence, associativity »Output: LALR(1) parser program
Associativity
Grammars: S -> S + E | E   E -> num   and   E -> E + E   E -> num
What happens if we run the E -> E + E grammar through LALR construction?
A state contains both E -> E + E ., + and E -> E . + E, {+, $}: shift/reduce conflict on +
shift: 1+(2+3)
reduce: (1+2)+3
Associativity (2) v If an operator is left associative »Assign a slightly higher value to its precedence if it is on the parse stack than if it is in the input stream »Since stack precedence is higher, reduce will take priority (which is correct for left associative) v If operator is right associative »Assign a slightly higher value if it is in the input stream »Since input stream is higher, shift will take priority (which is correct for right associative)
Precedence
Grammars: E -> E + E | T   T -> T x T | num | (E)   and   E -> E + E | E x E | num | (E)
What happens if we run this grammar through LALR construction? Shift/reduce conflict results:
{ E -> E . + E, ... ; E -> E x E ., + }
{ E -> E + E ., x ; E -> E . x E, ... }
Precedence: attach precedence indicators to terminals. Shift/reduce conflict resolved by:
1. If precedence of the input token is greater than the last terminal on the parse stack, favor shift over reduce
2. If the precedence of the input token is less than or equal to the last terminal on the parse stack, favor reduce over shift
Abstract Syntax Tree (AST) - Review v Derivation = sequence of applied productions »S -> E+S -> 1+S -> 1+E -> 1+2 v Parse tree = graph representation of a derivation »Doesn’t capture the order of applying the productions v AST discards unnecessary information from the parse tree [Figure: parse tree and the corresponding AST]
Implicit AST Construction v LL/LR parsing techniques implicitly build AST v The parse tree is captured in the derivation »LL parsing: AST represented by applied productions »LR parsing: AST represented by applied reductions v We want to explicitly construct the AST during the parsing phase
AST Construction - LL
Grammar: S -> ES’   S’ -> ε | +S   E -> num | (S)
LL parsing: extend procedures for non-terminals
// Recognizer only
void parse_S() {
    switch (token) {
        case num: case '(':
            parse_E();
            parse_S’();
            return;
        default:
            ParseError();
    }
}
// Extended to build the AST
Expr parse_S() {
    switch (token) {
        case num: case '(':
            Expr left = parse_E();
            Expr right = parse_S’();
            if (right == null) return left;
            else return new Add(left, right);
        default:
            ParseError();
    }
}
AST Construction - LR v We again need to add code for explicit AST construction v AST construction mechanism »Store parts of the tree on the stack »For each nonterminal symbol X on stack, also store the sub-tree rooted at X on stack »Whenever the parser performs a reduce operation for a production X -> γ, create an AST node for X
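AST Construction for LR - Example
Grammar: S -> E + S | S   E -> num | (S)
Input string: “ ”
[Figure: parser stack before and after the reduction S -> E + S. Before: the stack holds S + E, with the sub-trees Add(Num(1), Num(2)) and Num(3) attached. After: the stack holds S, with a new Add node over Add(Num(1), Num(2)) and Num(3).]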
Problems v Unstructured code: mixing parsing code with AST construction code v Automatic parser generators »The generated parser needs to contain AST construction code »How to construct a customized AST data structure using an automatic parser generator? v May want to perform other actions concurrently with parsing phase »E.g., semantic checks »This can reduce the number of compiler passes
Syntax-Directed Definition v Solution: Syntax-directed definition »Extends each grammar production with an associated semantic action (code): S -> E + S {action} »The parser generator adds these actions into the generated parser »Each action is executed when the corresponding production is reduced
Semantic Actions v Actions = C code (for bison/yacc) v The actions access the parser stack »Parser generators extend the stack of symbols with entries for user-defined structures (e.g., parse trees) v The action code should be able to refer to the grammar symbols in the productions »Need to refer to multiple occurrences of the same non-terminal symbol and distinguish RHS vs LHS occurrences, e.g. E -> E + E »Use dollar variables in yacc/bison ($$, $1, $2, etc.)
expr ::= expr PLUS expr   {$$ = $1 + $3;}
Building the AST v Use semantic actions to build the AST v AST is built bottom-up along with parsing
expr ::= NUM              {$$ = new Num($1.val); }
expr ::= expr PLUS expr   {$$ = new Add($1, $3); }
expr ::= expr MULT expr   {$$ = new Mul($1, $3); }
expr ::= LPAR expr RPAR   {$$ = $2; }
Recall: User-defined type for objects on the stack (%union)
LL/LR Grammar Summary v LL parsing tables »Non-terminals × terminals -> productions »Computed using FIRST/FOLLOW v LR parsing tables »LR states × terminals -> {shift/reduce} »LR states × non-terminals -> goto »Computed using closure/goto operations on LR states v A grammar is: »LL(1) if its LL(1) parsing table has no conflicts »same for LR(0), SLR, LALR(1), LR(1)
Top-Down Parsing
Grammar: S -> S + E | E   E -> num | (S)
Left-most derivation: S -> S+E -> E+E -> (S)+E -> (S+E)+E -> (S+E+E)+E -> (E+E+E)+E -> (1+E+E)+E -> (1+2+E)+E ...
In a left-most derivation, the entire tree above a token (2) has been expanded when the token is encountered [Figure: parse tree for (1+2+(3+4))+5]
Top-Down vs Bottom-Up [Figure: parse trees split into scanned and unscanned input, for top-down and for bottom-up parsing] Bottom-up: Don’t need to figure out as much of the parse tree for a given amount of input. More time to decide what rules to apply.
Terminology: LL vs LR v LL(k) »Left-to-right scan of input »Left-most derivation »k symbol lookahead »[Top-down or predictive] parsing or LL parser »Performs pre-order traversal of parse tree v LR(k) »Left-to-right scan of input »Right-most derivation »k symbol lookahead »[Bottom-up or shift-reduce] parsing or LR parser »Performs post-order traversal of parse tree
Classification of Grammars LR(0) ⊂ SLR ⊂ LALR(1) ⊂ LR(1); LR(k) ⊂ LR(k+1); LL(k) ⊂ LL(k+1); LL(k) ⊂ LR(k) [Figure: nested regions for LR(0), SLR, LALR(1), LR(1), with LL(1) overlapping, not to scale]
Bottom-Up Parsing
Grammar: S -> S + E | E   E -> num | (S)
(1+2+(3+4))+5
(E+2+(3+4))+5
(S+2+(3+4))+5
(S+E+(3+4))+5
Advantage of bottom-up parsing: can postpone the selection of productions until more of the input is scanned [Figure: parse tree for (1+2+(3+4))+5]