1 CMPSC 160 Translation of Programming Languages Fall 2002 slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon Lecture-Module #10 Parsing
2 [Figure: top-down versus bottom-up parsing, both scanning the input string left to right with a lookahead token. Top-down parsing builds a left-most derivation, growing the fringe of the parse tree downward from the start symbol S. Bottom-up parsing builds a right-most derivation in reverse, growing the upper fringe of the parse tree upward from the input toward S.]
3 LR Parsers
LR(k) parsers are table-driven, bottom-up, shift-reduce parsers that use a limited right context (k-token lookahead) for handle recognition
LR(k): Left-to-right scan of the input, Rightmost derivation in reverse, with k tokens of lookahead
A grammar is LR(k) if, given a rightmost derivation
    S ⇒ γ0 ⇒ γ1 ⇒ γ2 ⇒ … ⇒ γn-1 ⇒ γn = sentence
we can
    1. isolate the handle of each right-sentential form γi, and
    2. determine the production by which to reduce,
by scanning γi from left to right, going at most k symbols beyond the right end of the handle of γi
4 LR Parsers
A table-driven LR parser looks like
[Figure: source code → Scanner → Table-driven Parser → IR. A Parser Generator reads the grammar and produces the ACTION & GOTO tables used by the parser; the parser also maintains a Stack.]
5 LR Shift-Reduce Parsers
    push($);                       // $ is the end-of-file symbol
    push(s0);                      // s0 is the start state of the DFA that recognizes handles
    lookahead = get_next_token();
    repeat forever
        s = top_of_stack();
        if ( ACTION[s,lookahead] == reduce α→β ) then
            pop 2*|β| symbols;
            s = top_of_stack();
            push(α);
            push(GOTO[s,α]);
        else if ( ACTION[s,lookahead] == shift si ) then
            push(lookahead);
            push(si);
            lookahead = get_next_token();
        else if ( ACTION[s,lookahead] == accept and lookahead == $ ) then
            return success;
        else error();
The skeleton parser
    uses ACTION & GOTO
    does |words| shifts
    does |derivation| reductions
    does 1 accept
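The skeleton above can be sketched in Python. This is a minimal illustration, not the course's implementation: the ACTION and GOTO dictionaries were worked out by hand for the right-recursive expression grammar used later in this lecture, and the names `PRODS`, `parse`, and the production numbering (2-6) are assumptions made for this example.

```python
# Productions as number -> (lhs, length of rhs); numbering is an assumption of this sketch.
PRODS = {
    2: ("Expr", 3),    # Expr -> Term - Expr
    3: ("Expr", 1),    # Expr -> Term
    4: ("Term", 3),    # Term -> Factor * Term
    5: ("Term", 1),    # Term -> Factor
    6: ("Factor", 1),  # Factor -> id
}
# Hand-built tables for the expression grammar (missing entries mean "error").
ACTION = {
    (0, "id"): ("shift", 4),
    (1, "$"): ("accept", None),
    (2, "-"): ("shift", 5), (2, "$"): ("reduce", 3),
    (3, "*"): ("shift", 6), (3, "-"): ("reduce", 5), (3, "$"): ("reduce", 5),
    (4, "*"): ("reduce", 6), (4, "-"): ("reduce", 6), (4, "$"): ("reduce", 6),
    (5, "id"): ("shift", 4),
    (6, "id"): ("shift", 4),
    (7, "$"): ("reduce", 2),
    (8, "-"): ("reduce", 4), (8, "$"): ("reduce", 4),
}
GOTO = {
    (0, "Expr"): 1, (0, "Term"): 2, (0, "Factor"): 3,
    (5, "Expr"): 7, (5, "Term"): 2, (5, "Factor"): 3,
    (6, "Term"): 8, (6, "Factor"): 3,
}

def parse(tokens):
    """Return the production numbers in reduction order; raise SyntaxError on bad input."""
    stream = iter(list(tokens) + ["$"])
    stack = ["$", 0]                      # alternating symbols and states
    lookahead = next(stream)
    reductions = []
    while True:
        s = stack[-1]
        kind, arg = ACTION.get((s, lookahead), ("error", None))
        if kind == "reduce":
            lhs, n = PRODS[arg]
            del stack[len(stack) - 2 * n:]   # pop 2*|rhs| entries
            s = stack[-1]
            stack += [lhs, GOTO[(s, lhs)]]
            reductions.append(arg)
        elif kind == "shift":
            stack += [lookahead, arg]
            lookahead = next(stream)
        elif kind == "accept":
            return reductions
        else:
            raise SyntaxError(f"unexpected {lookahead!r} in state {s}")
```

For the input `id - id * id` the driver performs five shifts and eight reductions, and the reductions come out in the reverse of a rightmost derivation.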
6 LR Parsers
How does this LR stuff work?
Unambiguous grammar ⇒ unique rightmost derivation
Keep upper fringe on a stack
– All active handles include TOS
– Shift inputs until TOS is right end of a handle
Language of handles is regular
– Build a handle-recognizing DFA
– ACTION & GOTO tables encode the DFA
To match subterms, recurse & leave DFA's state on stack
Final states of the DFA correspond to reduce actions
– New state is GOTO[state at TOS, lhs]
7 Building LR Parsers
How do we generate the ACTION and GOTO tables?
    Use the grammar to build a model of the handle-recognizing DFA
    Use the DFA model to build the ACTION & GOTO tables
    If construction succeeds, the grammar is LR
How do we build the handle-recognizing DFA?
    Encode the set of productions that can be used as handles in the DFA state: use LR(k) items
    Use two functions, goto( s, α ) and closure( s )
    – goto() is analogous to move() in the NFA to DFA conversion
    – closure() is analogous to ε-closure
    Build up the states and transition functions of the DFA
    Use this information to fill in the ACTION and GOTO tables
8 LR(k) items
An LR(k) item is a pair [A, B], where
    A is a production α→βγδ with a • at some position in the rhs
    B is a lookahead string of length ≤ k (terminal symbols or $)
Examples: [α→•βγδ, a], [α→β•γδ, a], [α→βγ•δ, a], & [α→βγδ•, a]
The • in an item indicates the position of the top of the stack
    LR(0) items [α→β•γδ] (no lookahead symbol)
    LR(1) items [α→β•γδ, a] (one token lookahead)
    LR(2) items [α→β•γδ, a b] (two token lookahead) ...
9 LR(1) Items
The production α→βγδ, with lookahead a, generates 4 items
    [α→•βγδ, a], [α→β•γδ, a], [α→βγ•δ, a], & [α→βγδ•, a]
The set of LR(1) items for a grammar is finite
What's the point of all these lookahead symbols?
    Carry them along to choose the correct reduction
    Lookaheads are bookkeeping, unless the item has • at its right end
    – The lookahead has no direct use in [α→βγ•δ, a]
    – In [α→βγδ•, a], a lookahead of a implies a reduction by α→βγδ
    – For { [α→βγδ•, a], [β→γ•δ, b] }: lookahead = a ⇒ reduce to α; lookahead ∈ FIRST(δ) ⇒ shift
Limited right context is enough to pick the actions
10 LR(1) Table Construction
High-level overview
1 Build the handle-recognizing DFA (aka Canonical Collection of sets of LR(1) items), C = { I0, I1, ..., In }
    a Introduce a new start symbol S' which has only one production, S'→S
    b The initial state, I0, should include [S'→•S, $], along with any equivalent items
      Derive equivalent items as closure( I0 )
    c Repeatedly compute, for each Ik and each grammar symbol α, goto( Ik, α )
      If the set is not already in the collection, add it
      Record all the transitions created by goto( )
      This eventually reaches a fixed point
2 Fill in the ACTION and GOTO tables using the DFA
The canonical collection completely encodes the transition diagram for the handle-finding DFA
11 Computing Closures
closure(I) adds all the items implied by items already in I
    Any item [α→β•γδ, a] implies [γ→•τ, x] for each production with γ on the lhs, and each x ∈ FIRST(δa)
    Since βγδ is valid, any way to derive βγδ is valid, too
The algorithm
    Closure( I )
        while ( I is still changing )
            for each item [α→β•γδ, a] ∈ I
                for each production γ→τ ∈ P
                    for each terminal b ∈ FIRST(δa)
                        if [γ→•τ, b] ∉ I
                            then add [γ→•τ, b] to I
Fixpoint computation
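A minimal Python sketch of the closure computation, under stated assumptions: items are encoded as tuples `(lhs, rhs, dot, lookahead)`, the grammar is the right-recursive expression grammar used in this lecture's example, the FIRST sets are written out by hand, and the grammar has no ε-productions (so FIRST of a string is FIRST of its first symbol). The names `GRAMMAR`, `FIRST`, `first_of`, and `closure` are illustrative.

```python
GRAMMAR = {
    "S": [("Expr",)],
    "Expr": [("Term", "-", "Expr"), ("Term",)],
    "Term": [("Factor", "*", "Term"), ("Factor",)],
    "Factor": [("id",)],
}
# FIRST sets of the nonterminals, written out by hand for this grammar.
FIRST = {"S": {"id"}, "Expr": {"id"}, "Term": {"id"}, "Factor": {"id"}}

def first_of(symbols):
    """FIRST of a non-empty symbol string; valid only without epsilon-productions."""
    head = symbols[0]
    return FIRST.get(head, {head})       # a terminal (or $) is its own FIRST set

def closure(items):
    """Add every item implied by the items already present, until nothing changes."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot, la in list(items):
            if dot == len(rhs) or rhs[dot] not in GRAMMAR:
                continue                  # dot at the end, or before a terminal
            for b in first_of(rhs[dot + 1:] + (la,)):
                for prod in GRAMMAR[rhs[dot]]:
                    new = (rhs[dot], prod, 0, b)
                    if new not in items:
                        items.add(new)
                        changed = True
    return frozenset(items)

I0 = closure({("S", ("Expr",), 0, "$")})
```

Here `I0` holds ten individual items, because each lookahead is stored as a separate item rather than grouped into a set such as {$,-}.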
12 Computing Gotos
goto(I, x) computes the state that the parser would reach if it recognized an x while in state I
    goto( { [α→β•γδ, a] }, γ ) produces [α→βγ•δ, a]
    It also includes closure( [α→βγ•δ, a] ) to fill out the state
The algorithm
    Goto( I, x )
        new = Ø
        for each [α→β•xδ, a] ∈ I
            new = new ∪ [α→βx•δ, a]
        return closure(new)
Not a fixpoint method
Uses closure
13 Building the Canonical Collection of LR(1) Items
This is where we build the handle-recognizing DFA!
Start from I0 = closure( [S'→•S, $] )
Repeatedly construct new states, until no new states are generated
The algorithm
    I0 = closure( [S'→•S, $] )
    C = { I0 }
    while ( C is still changing )
        for each Ii ∈ C and for each x ∈ ( T ∪ NT )
            Inew = goto( Ii, x )
            if Inew ∉ C then
                C = C ∪ { Inew }
            record transition Ii → Inew on x
Fixed-point computation
    Loop adds to C
    C ⊆ 2^ITEMS, so C is finite
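The construction can be sketched in Python as follows. This is an illustration under assumptions, not a definitive implementation: it uses the right-recursive expression grammar from this lecture's example, encodes items as `(lhs, rhs, dot, lookahead)` tuples, assumes no ε-productions, and replaces the "while C is still changing" loop with an equivalent worklist. All names are hypothetical.

```python
GRAMMAR = {
    "S": [("Expr",)],
    "Expr": [("Term", "-", "Expr"), ("Term",)],
    "Term": [("Factor", "*", "Term"), ("Factor",)],
    "Factor": [("id",)],
}
FIRST = {"S": {"id"}, "Expr": {"id"}, "Term": {"id"}, "Factor": {"id"}}
SYMBOLS = ["Expr", "Term", "Factor", "id", "-", "*"]   # T ∪ NT, minus the goal symbol

def first_of(symbols):
    return FIRST.get(symbols[0], {symbols[0]})          # no epsilon-productions assumed

def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot, la in list(items):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                for b in first_of(rhs[dot + 1:] + (la,)):
                    for prod in GRAMMAR[rhs[dot]]:
                        if (rhs[dot], prod, 0, b) not in items:
                            items.add((rhs[dot], prod, 0, b))
                            changed = True
    return frozenset(items)

def goto(items, x):
    """Advance the dot over x in every item that allows it, then close the result."""
    moved = {(lhs, rhs, dot + 1, la)
             for lhs, rhs, dot, la in items
             if dot < len(rhs) and rhs[dot] == x}
    return closure(moved) if moved else frozenset()

def canonical_collection():
    states = [closure({("S", ("Expr",), 0, "$")})]      # list index = state number
    transitions = {}                                     # (state, symbol) -> state
    work = [0]
    while work:
        i = work.pop(0)
        for x in SYMBOLS:
            target = goto(states[i], x)
            if not target:
                continue
            if target not in states:
                states.append(target)
                work.append(len(states) - 1)
            transitions[(i, x)] = states.index(target)
    return states, transitions

states, transitions = canonical_collection()
```

With this symbol order the breadth-first numbering yields nine states and thirteen transitions, for example state 2 goes to state 5 on `-` and state 6 goes to state 8 on `Term`.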
14 Algorithms for LR(1) Parsers
Canonical collection construction algorithm is the algorithm for constructing the handle-recognizing DFA
    Uses Closure to compute the states of the DFA
    Uses Goto to compute the transitions of the DFA
Constructing canonical collection of LR(1) items:
    I0 = closure( [S'→•S, $] )
    C = { I0 }
    while ( C is still changing )
        for each Ii ∈ C and for each x ∈ ( T ∪ NT )
            Inew = goto( Ii, x )
            if Inew ∉ C then
                C = C ∪ { Inew }
            record transition Ii → Inew on x
Computing closure of set of LR(1) items:
    Closure( I )
        while ( I is still changing )
            for each item [α→β•γδ, a] ∈ I
                for each production γ→τ ∈ P
                    for each terminal b ∈ FIRST(δa)
                        if [γ→•τ, b] ∉ I
                            then add [γ→•τ, b] to I
Computing goto for set of LR(1) items:
    Goto( I, x )
        new = Ø
        for each [α→β•xδ, a] ∈ I
            new = new ∪ [α→βx•δ, a]
        return closure(new)
15 Example
Simplified, right recursive expression grammar
    S → Expr
    Expr → Term - Expr
    Expr → Term
    Term → Factor * Term
    Term → Factor
    Factor → id
    Symbol    FIRST
    S         { id }
    Expr      { id }
    Term      { id }
    Factor    { id }
    -         { - }
    *         { * }
    id        { id }
16 Canonical collection for the example grammar
    I0 = { [S→•Expr, $], [Expr→•Term - Expr, $], [Expr→•Term, $], [Term→•Factor * Term, {$,-}], [Term→•Factor, {$,-}], [Factor→•id, {$,-,*}] }
    I1 = { [S→Expr•, $] }
    I2 = { [Expr→Term•- Expr, $], [Expr→Term•, $] }
    I3 = { [Term→Factor•* Term, {$,-}], [Term→Factor•, {$,-}] }
    I4 = { [Factor→id•, {$,-,*}] }
    I5 = { [Expr→Term -•Expr, $], [Expr→•Term - Expr, $], [Expr→•Term, $], [Term→•Factor * Term, {$,-}], [Term→•Factor, {$,-}], [Factor→•id, {$,-,*}] }
    I6 = { [Term→Factor *•Term, {$,-}], [Term→•Factor * Term, {$,-}], [Term→•Factor, {$,-}], [Factor→•id, {$,-,*}] }
    I7 = { [Expr→Term - Expr•, $] }
    I8 = { [Term→Factor * Term•, {$,-}] }
Transitions (from the goto sets):
    I0: Expr → I1, Term → I2, Factor → I3, id → I4
    I2: - → I5
    I3: * → I6
    I5: Expr → I7, Term → I2, Factor → I3, id → I4
    I6: Term → I8, Factor → I3, id → I4
17 Constructing the ACTION and GOTO Tables
The algorithm
    for each set of items Ix ∈ C
        for each item ∈ Ix
            if item is [α→β•aγ, b] and a ∈ T and goto(Ix, a) = Ik,
                then ACTION[x,a] ← “shift k”
            else if item is [S'→S•, $]
                then ACTION[x,$] ← “accept”
            else if item is [α→β•, a]
                then ACTION[x,a] ← “reduce α→β”
        for each n ∈ NT
            if goto(Ix, n) = Ik
                then GOTO[x,n] ← k
x is the state number
Each state corresponds to a set of LR(1) items
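A compact Python sketch of the table-filling step, again for the example expression grammar. It is self-contained, so it repeats a condensed closure/goto construction; the item encoding `(lhs, rhs, dot, lookahead)`, the assumption of no ε-productions, and every name here are choices made for this illustration. The helper `set_action` raises an error when two different actions land in the same cell, which is how a conflict would surface.

```python
GRAMMAR = {
    "S": [("Expr",)],
    "Expr": [("Term", "-", "Expr"), ("Term",)],
    "Term": [("Factor", "*", "Term"), ("Factor",)],
    "Factor": [("id",)],
}
FIRST = {"S": {"id"}, "Expr": {"id"}, "Term": {"id"}, "Factor": {"id"}}
SYMBOLS = ["Expr", "Term", "Factor", "id", "-", "*"]

def closure(items):
    items, work = set(items), list(items)
    while work:
        lhs, rhs, dot, la in_item = work.pop() + (None,) if False else work.pop() + ()
        lhs, rhs, dot, la = lhs, rhs, dot, la
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            rest = rhs[dot + 1:] + (la,)
            for b in FIRST.get(rest[0], {rest[0]}):      # no epsilon-productions assumed
                for prod in GRAMMAR[rhs[dot]]:
                    new = (rhs[dot], prod, 0, b)
                    if new not in items:
                        items.add(new)
                        work.append(new)
    return frozenset(items)
```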
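18 Example (Constructing the LR(1) tables)
The algorithm produces the following table for the example grammar, with states numbered as in the canonical collection and productions numbered 1-6 in the order listed in the grammar (sk = shift and go to state k, rn = reduce by production n, acc = accept)
                 ACTION                      GOTO
    State    id     -      *      $      Expr   Term   Factor
      0      s4                           1      2      3
      1                           acc
      2             s5            r3
      3             r5     s6     r5
      4             r6     r6     r6
      5      s4                           7      2      3
      6      s4                                  8      3
      7                           r2
      8             r4            r4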
19 What can go wrong in LR(1) parsing?
What if state s contains [α→β•aγ, b] and [δ→β•, a]?
    First item generates “shift”, second generates “reduce”
    Both define ACTION[s,a]; the parser cannot do both actions
    This is called a shift/reduce conflict
    Modify the grammar to eliminate it
    Shifting will often resolve it correctly (remember the dangling else problem?)
What if state s contains [α→γ•, a] and [β→γ•, a]?
    Each generates “reduce”, but with a different production
    Both define ACTION[s,a]; the parser cannot do both reductions
    This is called a reduce/reduce conflict
    Modify the grammar to eliminate it
In either case, the grammar is not LR(1)
20 Algorithms for LR(0) Parsers
Basically the same algorithms we use for LR(1)
The only difference is that there is no lookahead symbol in LR(0) items, and therefore there is no lookahead propagation
LR(0) parsers generate fewer states
    They cannot recognize all the grammars that LR(1) parsers recognize
    They are more likely to get conflicts in the parse table since there is no lookahead
Computing closure of set of LR(0) items:
    Closure( I )
        while ( I is still changing )
            for each item [α→β•γδ] ∈ I
                for each production γ→τ ∈ P
                    if [γ→•τ] ∉ I
                        then add [γ→•τ] to I
Computing goto for set of LR(0) items:
    Goto( I, x )
        new = Ø
        for each [α→β•xδ] ∈ I
            new = new ∪ [α→βx•δ]
        return closure(new)
Constructing canonical collection of LR(0) items:
    I0 = closure( [S'→•S] )
    C = { I0 }
    while ( C is still changing )
        for each Ii ∈ C and for each x ∈ ( T ∪ NT )
            Inew = goto( Ii, x )
            if Inew ∉ C then
                C = C ∪ { Inew }
            record transition Ii → Inew on x
21 SLR (Simple LR) Table Construction
SLR(1) algorithm uses the canonical collection of LR(0) items
It also uses the FOLLOW sets to decide when to do a reduction
    for each set of items Ix ∈ C
        for each item ∈ Ix
            if item is [α→β•aγ] and a ∈ T and goto(Ix, a) = Ik,
                then ACTION[x,a] ← “shift k”
            else if item is [S'→S•]
                then ACTION[x,$] ← “accept”
            else if item is [α→β•] then
                for each a ∈ FOLLOW(α)
                    ACTION[x,a] ← “reduce α→β”
        for each n ∈ NT
            if goto(Ix, n) = Ik
                then GOTO[x,n] ← k
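A minimal Python sketch of SLR(1) table construction for the example expression grammar. Assumptions to note: LR(0) items are encoded as `(lhs, rhs, dot)` tuples, the FOLLOW sets are written out by hand rather than computed, and all function names are hypothetical. Reductions are entered for every terminal in FOLLOW of the left-hand side, which is the only place this differs from the LR(1) table filler.

```python
GRAMMAR = {
    "S": [("Expr",)],
    "Expr": [("Term", "-", "Expr"), ("Term",)],
    "Term": [("Factor", "*", "Term"), ("Factor",)],
    "Factor": [("id",)],
}
# FOLLOW sets worked out by hand for this grammar.
FOLLOW = {"Expr": {"$"}, "Term": {"-", "$"}, "Factor": {"*", "-", "$"}}
SYMBOLS = ["Expr", "Term", "Factor", "id", "-", "*"]

def closure0(items):
    items, work = set(items), list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for prod in GRAMMAR[rhs[dot]]:
                new = (rhs[dot], prod, 0)
                if new not in items:
                    items.add(new)
                    work.append(new)
    return frozenset(items)

def goto0(items, x):
    moved = {(l, r, d + 1) for l, r, d in items if d < len(r) and r[d] == x}
    return closure0(moved) if moved else frozenset()

def lr0_collection():
    states, transitions, i = [closure0({("S", ("Expr",), 0)})], {}, 0
    while i < len(states):                       # states grows while we scan it
        for x in SYMBOLS:
            target = goto0(states[i], x)
            if target:
                if target not in states:
                    states.append(target)
                transitions[(i, x)] = states.index(target)
        i += 1
    return states, transitions

def slr_tables():
    states, transitions = lr0_collection()
    action, goto_table = {}, {}

    def set_action(key, value):
        if action.get(key, value) != value:      # two different actions in one cell
            raise ValueError(f"conflict at {key}: {action[key]} vs {value}")
        action[key] = value

    for x, items in enumerate(states):
        for lhs, rhs, dot in items:
            if dot < len(rhs):
                if rhs[dot] not in GRAMMAR:      # terminal after the dot: shift
                    set_action((x, rhs[dot]), ("shift", transitions[(x, rhs[dot])]))
            elif lhs == "S":
                set_action((x, "$"), ("accept", None))
            else:
                for a in FOLLOW[lhs]:            # reduce on every terminal in FOLLOW(lhs)
                    set_action((x, a), ("reduce", (lhs, rhs)))
    for (x, sym), k in transitions.items():
        if sym in GRAMMAR:
            goto_table[(x, sym)] = k
    return action, goto_table

action, goto_table = slr_tables()
```

For this grammar the construction finishes without raising, so the grammar is SLR(1).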
22 LALR (LookAhead LR) Table Construction
LR(1) parsers have many more states than SLR(1) parsers
Idea: Merge the LR(1) states
– Look at the LR(0) core of the LR(1) items (meaning ignore the lookahead)
– If two sets of LR(1) items have the same core, merge them and change the ACTION and GOTO tables accordingly
LALR(1) parsers can be constructed in two ways
    1. Build the canonical collection of LR(1) items and then merge the sets with the same core (you should be able to do this)
    2. Ignore the items with the dot at the beginning of the rhs and construct kernels of sets of LR(0) items. Use a lookahead propagation algorithm to compute lookahead symbols.
The second approach is more efficient since it avoids the construction of the large LR(1) table.
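The first approach (merge LR(1) sets with the same core) can be sketched in Python. This is an illustration only: items are `(lhs, rhs, dot, lookahead)` tuples, and the three item sets below are written by hand for a hypothetical grammar S → C C, C → c C | d, not taken from this lecture's example. A full LALR(1) construction would also apply the returned renumbering to the recorded transitions before filling the tables.

```python
def core_of(items):
    """The LR(0) core of a set of LR(1) items: drop the lookaheads."""
    return frozenset((lhs, rhs, dot) for lhs, rhs, dot, _ in items)

def merge_by_core(states):
    """Merge item sets with equal cores.

    Returns (merged item sets, renumber) where renumber[i] is the new
    state number of the original state i.
    """
    merged = {}                       # core -> union of items, in first-seen order
    renumber = []
    for items in states:
        core = core_of(items)
        if core not in merged:
            merged[core] = set()
        merged[core] |= set(items)
        renumber.append(list(merged).index(core))
    return [frozenset(s) for s in merged.values()], renumber

# Hand-written example: the first and third sets share the core { C -> d . }.
lr1_states = [
    frozenset({("C", ("d",), 1, "c"), ("C", ("d",), 1, "d")}),
    frozenset({("S", ("C", "C"), 2, "$")}),
    frozenset({("C", ("d",), 1, "$")}),
]
merged, renumber = merge_by_core(lr1_states)
```

After merging, the single state for C → d• carries the union of the lookaheads, {c, d, $}.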
23 Properties of LALR(1) Parsers
An LALR(1) parser for a grammar G has the same number of states as the SLR(1) parser for that grammar
If an LR(1) parser for a grammar G does not have any shift/reduce conflicts, then the LALR(1) parser for that grammar will not have any shift/reduce conflicts
It is possible for an LALR(1) parser for a grammar G to have a reduce/reduce conflict while an LR(1) parser for that grammar does not have any conflicts
24 Error recovery
Panic-mode recovery: On discovering an error, discard input symbols one at a time until a synchronizing token is found
– For example, delimiters such as “;” or “}” can be used as synchronizing tokens
Phrase-level recovery: On discovering an error, make local corrections to the input
– For example, replace “,” with “;”
Error-productions: If we have a good idea about what types of errors occur, we can augment the grammar with error productions and generate appropriate error messages when an error production is used
Global correction: Given an incorrect input string, try to find a correct string that requires the minimum number of changes to the input string
– In general too costly
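Panic-mode recovery is the simplest of these to sketch. The Python fragment below only shows the token-skipping step, assuming the token stream is a list of strings and that “;” and “}” are the synchronizing tokens; the name `panic_mode_skip` is hypothetical, and a real parser would also have to pop its stack to a state from which parsing can resume.

```python
SYNC_TOKENS = {";", "}"}   # assumed synchronizing tokens

def panic_mode_skip(tokens, pos):
    """Discard tokens starting at pos until a synchronizing token is found.

    Returns the index of the first token after the synchronizing token,
    or len(tokens) if the input ends first.
    """
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1
    return min(pos + 1, len(tokens))

tokens = ["x", "=", "+", ";", "y", "=", "1", ";"]
resume_at = panic_mode_skip(tokens, 2)   # error detected at the stray "+"
```

In this example parsing would resume at `y`, the first token of the next statement.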
25 Direct Encoding of Parse Tables
Rather than using a table-driven interpreter …
    Generate spaghetti code that implements the logic
    Each state becomes a small case statement or if-then-else
    Analogous to direct coding a scanner
Advantages
    No table lookups and address calculations
    No representation for don't care states
    No outer loop; it is implicit in the code for the states
This produces a faster parser with more code but no table
26 LR(k) versus LL(k) (Top-down Recursive Descent)
Finding Reductions
LR(k) ⇒ Each reduction in the parse is detectable with
    the complete left context,
    the reducible phrase itself, and
    the k terminal symbols to its right
LL(k) ⇒ Parser must select the production to apply based on
    the complete left context
    the next k terminals
Thus, LR(k) examines more context
27 Left Recursion versus Right Recursion
Right recursion
    Required for termination in top-down parsers
    Uses (on average) more stack space
    Produces right-associative operators
    [Figure: right-leaning tree for w * x * y * z, grouped as w * ( x * ( y * z ) )]
Left recursion
    More efficient bottom-up parsers
    Limits required stack space
    Produces left-associative operators
    [Figure: left-leaning tree for w * x * y * z, grouped as ( ( w * x ) * y ) * z]
Rule of thumb
    Left recursion for bottom-up parsers (bottom-up parsers can also deal with right recursion, but left recursion is more efficient)
    Right recursion for top-down parsers
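The two tree shapes can be reproduced with a short Python sketch. These functions are not parsers; they only mirror the grouping that a right-recursive rule (Expr → id * Expr | id) and a left-recursive rule (Expr → Expr * id | id) would impose on a list of operands. The function names and the string rendering of the trees are assumptions made for illustration.

```python
def right_recursive(operands):
    """Grouping produced by Expr -> id * Expr | id: the tree nests to the right."""
    head, rest = operands[0], operands[1:]
    if not rest:
        return head
    return f"({head} * {right_recursive(rest)})"

def left_recursive(operands):
    """Grouping produced by Expr -> Expr * id | id: the tree nests to the left."""
    tree = operands[0]
    for operand in operands[1:]:
        tree = f"({tree} * {operand})"
    return tree

right_tree = right_recursive(["w", "x", "y", "z"])   # "(w * (x * (y * z)))"
left_tree = left_recursive(["w", "x", "y", "z"])     # "(((w * x) * y) * z)"
```

The recursive version keeps one pending call per operand until the last operand is reached, which echoes the extra stack space that right recursion costs a parser; the iterative version combines operands as soon as they arrive.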
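28 Hierarchy of Context-Free Languages
[Figure: the inclusion hierarchy for context-free languages. The context-free languages contain the deterministic languages (LR(k)), with LR(k) ≡ LR(1). The deterministic languages contain the LL(k) languages, which contain the LL(1) languages, and the simple precedence languages, which contain the operator precedence languages.]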
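29 Hierarchy of Context-Free Grammars
[Figure: inclusion hierarchy for context-free grammars. Context-free grammars contain the unambiguous CFGs, which contain LR(k) ⊃ LR(1) ⊃ LALR(1) ⊃ SLR(1) ⊃ LR(0). LR(k) also contains LL(k), which contains LL(1). Operator precedence grammars overlap the unambiguous CFGs.]
Operator precedence includes some ambiguous grammars
Every LL(1) grammar is LR(1), but an LL(1) grammar is not necessarily SLR(1) or LALR(1)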
30 Summary
Top-down recursive descent
    Advantages: Fast; Simplicity; Good error detection
    Disadvantages: Hand-coded; High maintenance; Right associativity
LR(1)
    Advantages: Fast; Deterministic langs.; Automatable; Left associativity
    Disadvantages: Large working sets; Poor error messages; Large table sizes
