Lecture 4: Syntax Analysis: Parsing Noam Rinetzky

Lecture 4: Syntax Analysis: Parsing Noam Rinetzky
Compilation Lecture 4: Syntax Analysis: Parsing Noam Rinetzky

The Real Anatomy of a Compiler
Source text txt Process text input Lexical Analysis Lexical Analysis Syntax Analysis Syntax Analysis Sem. Analysis characters tokens AST Annotated AST Intermediate code generation Intermediate code optimization Code generation IR IR Symbolic Instructions Executable code exe Target code optimization SI Machine code generation MI Write executable output

Broad kinds of parsers Parsers for arbitrary grammars Top-Down parsers
Earley’s method, CYK method Usually, not used in practice (though might change) Top-Down parsers Construct parse tree in a top-down matter Find the leftmost derivation Bottom-Up parsers Construct parse tree in a bottom-up manner Find the rightmost derivation in a reverse order

CFG terminology S  S ; S S  id := E E  id E  num E  E + E
Symbols: Terminals (tokens): ; := ( ) id num print Non-terminals: S E L Start non-terminal: S Convention: the non-terminal appearing in the first derivation rule Grammar productions (rules) N  μ

CFG terminology Derivation - a sequence of replacements of non-terminals using the derivation rules Language - the set of strings of terminals derivable from the start symbol Sentential form - the result of a partial derivation May contain non-terminals

Derivations Show that a sentence ω is in a grammar G
Start with the start symbol Repeatedly replace one of the non-terminals by a right-hand side of a production Stop when the sentence contains only terminals Given a sentence αNβ and rule Nµ αNβ => αµβ ω is in L(G) if S =>* ω

Predictive parsing Recursive descent LL(k) grammars

Predictive parsing Given a grammar G and a word w attempt to derive w using G Idea Apply production to leftmost nonterminal Pick production rule based on next input token General grammar More than one option for choosing the next production based on a token Restricted grammars (LL) Know exactly which single rule to apply May require some lookahead to decide

Boolean expressions example
E  LIT | (E OP E) | not E LIT  true | false OP  and | or | xor not ( not true or false ) production to apply known from next token E => not E => not ( E OP E ) => not ( not E OP E ) => not ( not LIT OP E ) => not ( not true OP E ) => not ( not true or E ) => not ( not true or LIT ) => not ( not true or false ) not E ( OP ) LIT or true false

Implementation via recursion
if (current  {TRUE, FALSE}) LIT(); else if (current == LPAREN) match(LPARENT); E(); OP(); E(); match(RPAREN); else if (current == NOT) match(NOT); E(); else error; } E → LIT | ( E OP E ) | not E LIT → true | false OP → and | or | xor LIT() { if (current == TRUE) match(TRUE); else if (current == FALSE) match(FALSE); else error; } OP() { if (current == AND) match(AND); else if (current == OR) match(OR); else if (current == XOR) match(XOR); else error; }

FIRST sets FIRST(X) = { t | X * t β} ∪{ℇ | X * ℇ} Example:
FIRST(X) = all terminals that α can appear as first in some derivation for X + ℇ if can be derived from X Example: FIRST( LIT ) = { true, false } FIRST( ( E OP E ) ) = { ‘(‘ } FIRST( not E ) = { not } First(α) can be defined for any sequence of symbols

Computing FIRST sets FIRST (t) = { t } // “t” terminal ℇ∈ FIRST(X) if
X  ℇ or X  A1 .. Ak and ℇ∈ FIRST(Ai) i=1…k FIRST(α) ⊆ FIRST(X) if X A1 .. Ak α and ℇ∈ FIRST(Ai) i=1…k First(X) is computed for non-terminals

Follow sets Follow(X) = { t | S * α X t β} t – Terminal or $

FOLLOW sets: Constraints
FIRST(β) – {ℇ} ⊆ FOLLOW(X) For each A  α X β FOLLOW(A) ⊆ FOLLOW(X) For each A  α X β and ℇ ∈ FIRST(β)

Example: FOLLOW sets E TX X+ E | ℇ T (E) | int Y Y  * T | ℇ
FIRST(β) – {ℇ} ⊆ FOLLOW(X) For each A  α X β FOLLOW(A) ⊆ FOLLOW(X) For each A  α X β and ℇ ∈ FIRST(β) Non. Term. E T X Y FOLLOW ), $ +, ), $ $, )

Prediction Table A  α T[A,t] = α if t ∈FIRST(α)
T[A,t] = α if ℇ ∈ FIRST(α) and t ∈ FOLLOW(A) t can also be $ T is not well defined  the grammar is not LL(1)

LL(k) grammars A grammar is in the class LL(K) when it can be derived via: Top-down derivation Scanning the input from left to right (L) Producing the leftmost derivation (L) With lookahead of k tokens (k) A language is said to be LL(k) when it has an LL(k) grammar

LL(1) grammars A grammar is in the class LL(1) iff
For every two productions A  α and A  β we have FIRST(α) ∩ FIRST(β) = {} // including  If  ∈ FIRST(α) then FIRST(β) ∩ FOLLOW(A) = {} If  ∈ FIRST(β) then FIRST(α) ∩ FOLLOW(A) = {}

Problem: Non LL Grammars

Back to problem 1: common prefix
term  ID | indexed_elem indexed_elem  ID [ expr ] FIRST(term) = { ID } FIRST(indexed_elem) = { ID } FIRST/FIRST conflict

Solution: left factoring
Rewrite the grammar to be in LL(1) term  ID | indexed_elem indexed_elem  ID [ expr ] term  ID after_ID After_ID  [ expr ] |  Intuition: just like factoring x*y + x*z into x*(y+z)

Left factoring – another example
S  if E then S else S | if E then S | T S  if E then S S’ | T S’  else S | 

Problem: null production
S  A a b A  a |  bool S() { return A() && match(token(‘a’)) && match(token(‘b’)); } bool A() { return match(token(‘a’)) || true; What happens for input “ab”? What happens if you flip order of alternatives and try “aab”?

Problem: null production
S  A a b A  a |  FIRST(S) = { a } FOLLOW(S) = { $ } FIRST(A) = { a,  } FOLLOW(A) = { a } FIRST/FOLLOW conflict

Back to problem 2: null production
S  A a b A  a |  FIRST(S) = { a } FOLLOW(S) = { } FIRST(A) = { a ,  } FOLLOW(A) = { a } FIRST/FOLLOW conflict

Solution: substitution
S  A a b A  a |  Substitute A in S S  a a b | a b Left factoring S  a after_A after_A  a b | b

Back to problem 3: Left recursion
E  E - term | term Left recursion cannot be handled with a bounded lookahead What can we do?

Left recursion removal
p. 130 N  Nα | β N  βN’ N’  αN’ |  G1 G2 L(G1) = β, βα, βαα, βααα, … L(G2) = same Can be done algorithmically. Problem: grammar becomes mangled beyond recognition For our 3rd example: E  E - term | term E  term TE | term TE  - term TE | 

LL(k) Parsers Recursive Descent Wanted Manual construction
Uses recursion Wanted A parser that can be generated automatically Does not use recursion

LL(k) parsing via pushdown automata
Pushdown automaton uses Prediction stack Input stream Transition table nonterminals x tokens -> production alternative Entry indexed by nonterminal N and token t contains the alternative of N that must be predicated when current input starts with t

LL(k) parsing via pushdown automata
Two possible moves Prediction When top of stack is nonterminal N, pop N, lookup table[N,t]. If table[N,t] is not empty, push table[N,t] on prediction stack, otherwise – syntax error Match When top of prediction stack is a terminal T, must be equal to next input token t. If (t == T), pop T and consume t. If (t ≠ T) syntax error Parsing terminates when prediction stack is empty If input is empty at that point, success. Otherwise, syntax error

Example transition table
(1) E → LIT (2) E → ( E OP E ) (3) E → not E (4) LIT → true (5) LIT → false (6) OP → and (7) OP → or (8) OP → xor Which rule should be used Input tokens ( ) not true false and or xor $ E 2 3 1 LIT 4 5 OP 6 7 8 Nonterminals

Model of non-recursive predictive parser
$ b + a Predictive Parsing program Stack X Y Z $ Output Parsing Table

Running parser example
aacbb$ A  aAb | c Input suffix Stack content Move aacbb$ A$ predict(A,a) = A  aAb aAb$ match(a,a) acbb$ Ab$ aAbb$ cbb$ Abb$ predict(A,c) = A  c match(c,c) bb$ match(b,b) b$ $ match($,$) – success a b c A A  aAb A  c

Erorrs

Handling Syntax Errors
Report and locate the error Diagnose the error Correct the error Recover from the error in order to discover more errors without reporting too many “strange” errors

Error Diagnosis Line number The current token The expected tokens
may be far from the actual error The current token The expected tokens Parser configuration

Error Recovery Becomes less important in interactive environments
Example heuristics: Search for a semi-column and ignore the statement Try to “replace” tokens for common errors Refrain from reporting 3 subsequent errors Globally optimal solutions For every input w, find a valid program w’ with a “minimal-distance” from w

Illegal input example abcbb$ A  aAb | c Input suffix Stack content
Move abcbb$ A$ predict(A,a) = A  aAb aAb$ match(a,a) bcbb$ Ab$ predict(A,b) = ERROR a b c A A  aAb A  c

Error handling in LL parsers
c$ S  a c | b S Input suffix Stack content Move c$ S$ predict(S,c) = ERROR Now what? Predict b S anyway “missing token b inserted in line XXX” a b c S S  a c S  b S

Error handling in LL parsers
c$ S  a c | b S Input suffix Stack content Move bc$ S$ predict(b,c) = S  bS bS$ match(b,b) c$ Looks familiar? Result: infinite loop a b c S S  a c S  b S

Error handling and recovery
x = a * (p+q * ( -b * (r-s); Where should we report the error? The valid prefix property

The Valid Prefix Property
For every prefix tokens t1, t2, …, ti that the parser identifies as legal: there exists tokens ti+1, ti+2, …, tn such that t1, t2, …, tn is a syntactically valid program If every token is considered as single character: For every prefix word u that the parser identifies as legal there exists w such that u.w is a valid program

Recovery is tricky Heuristics for dropping tokens, skipping to semicolon, etc.

Building the Parse Tree

Adding semantic actions
Can add an action to perform on each production rule Can build the parse tree Every function returns an object of type Node Every Node maintains a list of children Function calls can add new children

Building the parse tree
Node E() { result = new Node(); result.name = “E”; if (current  {TRUE, FALSE}) // E  LIT result.addChild(LIT()); else if (current == LPAREN) // E  ( E OP E ) result.addChild(match(LPAREN)); result.addChild(E()); result.addChild(OP()); result.addChild(match(RPAREN)); else if (current == NOT) // E  not E result.addChild(match(NOT)); else error; return result; }

Parser for Fully Parenthesized Expers
static int Parse_Expression(Expression **expr_p) { Expression *expr = *expr_p = new_expression() ; /* try to parse a digit */ if (Token.class == DIGIT) { expr->type=‘D’; expr->value=Token.repr –’0’; get_next_token(); return 1; } /* try parse parenthesized expression */ if (Token.class == ‘(‘) { expr->type=‘P’; get_next_token(); if (!Parse_Expression(&expr->left)) Error(“missing expression”); if (!Parse_Operator(&expr->oper)) Error(“missing operator”); if (Token.class != ‘)’) Error(“missing )”); return 1; } return 0; }

Bottom Up parsing

Bottom-Up Parsing Goal: Build a parse tree How:
Report error if input is not a legal program How: Read input left-to-right Construct a subtree for the first left-most tree node whose childern have been constructed

(Non standard precedence)
Bottom-up parsing E  E * T E  T T  T + F T  F F  id F  num F  ( E ) E E T T T F F F 1 + 2 * 3 (Non standard precedence)

Bottom-up parsing: LR(k) Grammars
A grammar is in the class LR(K) when it can be derived via: Bottom-up derivation Scanning the input from left to right (L) Producing the rightmost derivation (R) In reverse oreder With lookahead of k tokens (k) A language is said to be LR(k) if it has an LR(k) grammar The simplest case is LR(0), which we will discuss

Terminology: Reductions & Handles
The opposite of derivation is called reduction Let A  α be a production rule Derivation: βAµ  βαµ Reduction: βαµ  βAµ A handle is the reduced substring α is the handles for βαµ

Use Shift & Reduce In each stage, we shift a symbol from the input to the stack, or reduce according to one of the rules.

How does the parser know what to do?
token stream ) x * 7 + 23 ( RP Id OP Num LP Input Action Table Parser Stack Output Goto table Op(*) Id(b) Num(23) Num(7) Op(+)

How does the parser know what to do?
A state will keep the info gathered on handle(s) A state in the “control” of the PDA Also (part of) the stack alpha bet A table will tell it “what to do” based on current state and next token The transition function of the PDA A stack will records the “nesting level” Stack contains a sequence of prefixes of handles Set of LR(0) items

Important Bottom-Up LR-Parsers
LR(0) – simplest, explains basic ideas SLR(1) – simple, exaplins lookahead LR(1) – complictated, very powerful, expensive LALR(1) – complicated, powerful enough, used by automatic tools

LR(0) vs SLR(1) vs LR(1) vs LALR(1)
All use shift / reduce Main difference: how to identify a handle Technically: Using different sets of states More expsensive  more states  more specific choice of which reduction rule to use But the usage of the states is the same in all parsers Reduction is the same in all techniques Once the handle is determined

LR(0) Parsing

N  αβ LR item Already matched To be matched Input
Hypothesis about αβ being a possible handle: so far we’ve matched α, expecting to see β

Example: LR(0) Items All items can be obtained by placing a dot at every position for every production: LR(0) items 1: S  E$ 2: S  E  $ 3: S  E $  4: E   T 5: E  T  6: E   E + T 7: E  E  + T 8: E  E +  T 9: E  E + T  10: T   i 11: T  i  12: T   (E) 13: T  ( E) 14: T  (E ) 15: T  (E)  Grammar (1) S  E $ (2) E  T (3) E  E + T (4) T  id (5) T  ( E )

Example: LR(0) Items All items can be obtained by placing a dot at every position for every production: Before  = reduced matched prefix After  = may be reduced May be matched by suffix LR(0) items 1: S  E$ 2: S  E  $ 3: S  E $  4: E   T 5: E  T  6: E   E + T 7: E  E  + T 8: E  E +  T 9: E  E + T  10: T   id 11: T  id  12: T   (E) 13: T  ( E) 14: T  (E ) 15: T  (E)  Grammar (1) S  E $ (2) E  T (3) E  E + T (4) T  id (5) T  ( E )

LR(0) items N  αβ N  αβ States are sets of items Shift Item
Reduce Item States are sets of items

LR(0) Items E → E * B | E + B | B B → 0 | 1
A derivation rule with a location marker (●) is called LR(0) item

PDA States E → E * B | E + B | B B → 0 | 1
A PDA state is a set of LR(0) items. E.g., q13 = { E → E ● * B , E → E ● + B, B → 1 ●} Intuitively, if we matched 1, Then the state will remember the 3 possible alternatives rules and where we are in each of them (1) E → E ● * B (2) E → E ● + B (3) B → 1 ●

LR(0) Shift/Reduce Items
N  αβ Shift Item N  αβ Reduce Item

Intuition Read input tokens left-to-right and remember them in the stack When a right hand side of a rule is found, remove it from the stack and replace it with the non-terminal it derives Remembering token is called shift Each shift moves to a state that remembers what we’ve seen so far Replacing RHS with LHS is called reduce Each reduce goes to a state that determines the context of the derivation

Model of an LR parser Input Stack LR Parser Output $ id + T 2 + 7 id 5
state T 2 + 7 id 5 Output symbol Terminals and Non-terminals Goto Table Action

LR parser stack Sequence made of state, symbol pairs
For instance a possible stack for the grammar S  E $ E  T E  E + T T  id T  ( E ) could be: 0 T id 5 Stack grows this way

Form of LR parsing table
state terminals non-terminals Shift/Reduce actions Goto part 1 . . . acc gm rk sn error shift state n reduce by rule k goto state m accept

LR parser table example
goto action STATE T E $ ) ( + id g6 g1 s7 s5 acc s3 1 2 g4 3 r3 4 r4 5 r2 6 g8 7 s9 8 r5 9

Shift move Input Stack action[q, a] = sn LR Parsing program $ … a q
. . . goto action action[q, a] = sn

Result of shift Input Stack action[q, a] = sn LR Parsing program $ … a
. . . goto action action[q, a] = sn

Reduce move Input Stack 2*n action[qn, a] = rk
$ … a Stack LR Parsing program qn σn … q1 σ1 q 2*n goto action action[qn, a] = rk Production: (k) A  σ1… σn Top of stack looks like q1σ1…qnσn for some q1… qn goto[q, A] = qm

Result of reduce move Input Stack action[qn, a] = rk
$ … a Stack LR Parsing program qm A q … goto action action[qn, a] = rk Production: (k) A  σ1… σn Top of stack looks like q1σ1…qnσn for some q1… qn goto[q, A] = qm

Accept move If action[q, a] = accept parsing completed Input Stack
$ a … Stack LR Parsing program q . . . goto action If action[q, a] = accept parsing completed

Error move If action[q, a] = error (usually empty)
Input $ … a Stack LR Parsing program q . . . goto action If action[q, a] = error (usually empty) parsing discovered a syntactic error

Example Z  E $ E  T | E + T T  i | ( E )

Example: parsing with LR items
Z  E $ E  T | E + T T  i | ( E ) i + i $ Z  E $ Why do we need these additional LR items? Where do they come from? What do they mean? E  T E  E + T T  i T  ( E )

-closure Given a set S of LR(0) items If P  αNβ is in state S
then for each rule N  in the grammar state S must also contain N   -closure({Z  E $}) = { Z  E $, E  T, Z  E $ E  T | E + T T  i | ( E ) E  E + T, T  i , T  ( E ) }

Z  E $ E  T | E + T T  i | ( E ) i + i $ Remember position from which we’re trying to reduce Items denote possible future handles Z  E $ E  T E  E + T T  i T  ( E )

Z  E $ E  T | E + T T  i | ( E ) i + i $ Match items with current token Z  E $ T  i Reduce item! E  T E  E + T T  i T  ( E )

Z  E $ E  T | E + T T  i | ( E ) T + i $ i Z  E $ Reduce item! E  T E  T E  E + T T  i T  ( E )

Z  E $ E  T | E + T T  i | ( E ) E + i $ T i Z  E $ Reduce item! E  T E  T E  E + T T  i T  ( E )

Z  E $ E  T | E + T T  i | ( E ) E + i $ T i Z  E $ Z  E$ E  T E  E + T E  E+ T T  i T  ( E )

Z  E $ E  T | E + T T  i | ( E ) E + i $ T i Z  E $ Z  E$ E  E+T E  T T  i E  E + T E  E+ T T  ( E ) T  i T  ( E )

Z  E $ E  T | E + T T  i | ( E ) E + T $ i T i Z  E $ Z  E$ E  E+T E  T T  i E  E + T E  E+ T T  ( E ) T  i T  ( E )

Z  E $ E  T | E + T T  i | ( E ) E + T $ T i i Reduce item! Z  E $ Z  E$ E  E+T E  E+T E  T T  i E  E + T E  E+ T T  ( E ) T  i T  ( E )

Z  E $ E  T | E + T T  i | ( E ) E $ E + T T i i Z  E $ Z  E$ E  T E  E + T E  E+ T T  i T  ( E )

Z  E $ E  T | E + T T  i | ( E ) E $ E + T T i Reduce item! i Z  E $ Z  E$ Z  E$ E  T E  E + T E  E+ T T  i T  ( E )

Z  E $ E  T | E + T T  i | ( E ) Z E $ E + T Reduce item! T i Z  E $ i Z  E$ Z  E$ E  T E  E + T E  E+ T T  i T  ( E )

GOTO/ACTION tables ACTION Table GOTO Table empty – error move State i
+ ( ) $ E T action q0 q5 q7 q1 q6 shift q3 q2 ZE$ q4 Shift EE+T Ti ET q8 q9 TE

LR(0) parser tables Two types of rows:
Shift row – tells which state to GOTO for current token Reduce row – tells which rule to reduce (independent of current token) GOTO entries are blank

LR parser data structures
Input – remainder of text to be processed Stack – sequence of pairs N, qi N – symbol (terminal or non-terminal) qi – state at which decisions are made Initial stack contains q0 Input suffix + i $ stack q0 i q5 Stack grows this way

LR(0) pushdown automaton
Two moves: shift and reduce Shift move Remove first token from input Push it on the stack Compute next state based on GOTO table Push new state on the stack If new state is error – report error shift input i + i $ input + i $ stack q0 stack q0 i q5 Stack grows this way State i + ( ) $ E T action q0 q5 q7 q1 q6 shift

LR(0) pushdown automaton
Reduce move Using a rule N α Symbols in α and their following states are removed from stack New state computed based on GOTO table (using top of stack, before pushing N) N is pushed on the stack New state pushed on top of N Reduce T  i input + i $ input + i $ stack q0 i q5 stack q0 T q6 Stack grows this way State i + ( ) $ E T action q0 q5 q7 q1 q6 shift

GOTO/ACTION table Warning: numbers mean different things!
State i + ( ) $ E T q0 s5 s7 s1 s6 q1 s3 s2 q2 r1 q3 s4 q4 r3 q5 r4 q6 r2 q7 s8 q8 s9 q9 r5 Z  E $ E  T E  E + T T  i T ( E ) Warning: numbers mean different things! rn = reduce using rule number n sm = shift to state m

Parsing id+id$ Initialize with state 0 Stack Input Action id + id $ s5
(1) S  E $ (2) E  T (3) E  E + T (4) T  id (5) T  ( E ) Parsing id+id$ Stack grows this way Stack Input Action id + id $ s5 goto action S T E $ ) ( + id g6 g1 s7 s5 acc s3 1 2 g4 3 r3 4 r4 5 r2 6 g8 7 s9 8 r5 9 Initialize with state 0 rn = reduce using rule number n sm = shift to state m

Parsing id+id$ Stack Input Action id + id $ s5 0 id 5 + id $ r4
(1) S  E $ (2) E  T (3) E  E + T (4) T  id (5) T  ( E ) Parsing id+id$ Stack grows this way Stack Input Action id + id $ s5 0 id 5 + id $ r4 goto action S T E $ ) ( + id g6 g1 s7 s5 acc s3 1 2 g4 3 r3 4 r4 5 r2 6 g8 7 s9 8 r5 9 rn = reduce using rule number n sm = shift to state m

Parsing id+id$ pop id 5 Stack Input Action id + id $ s5 0 id 5 + id $
(1) S  E $ (2) E  T (3) E  E + T (4) T  id (5) T  ( E ) Parsing id+id$ Stack grows this way Stack Input Action id + id $ s5 0 id 5 + id $ r4 goto action S T E $ ) ( + id g6 g1 s7 s5 acc s3 1 2 g4 3 r3 4 r4 5 r2 6 g8 7 s9 8 r5 9 pop id 5 rn = reduce using rule number n sm = shift to state m

Parsing id+id$ push T 6 Stack Input Action id + id $ s5 0 id 5 + id $
(1) S  E $ (2) E  T (3) E  E + T (4) T  id (5) T  ( E ) Parsing id+id$ Stack grows this way Stack Input Action id + id $ s5 0 id 5 + id $ r4 goto action S T E $ ) ( + id g6 g1 s7 s5 acc s3 1 2 g4 3 r3 4 r4 5 r2 6 g8 7 s9 8 r5 9 push T 6 rn = reduce using rule number n sm = shift to state m

Parsing id+id$ Stack Input Action id + id $ s5 0 id 5 + id $ r4 0 T 6
(1) S  E $ (2) E  T (3) E  E + T (4) T  id (5) T  ( E ) Parsing id+id$ Stack grows this way Stack Input Action id + id $ s5 0 id 5 + id $ r4 0 T 6 r2 goto action S T E $ ) ( + id g6 g1 s7 s5 acc s3 1 2 g4 3 r3 4 r4 5 r2 6 g8 7 s9 8 r5 9 rn = reduce using rule number n sm = shift to state m

(1) S  E $ (2) E  T (3) E  E + T (4) T  id (5) T  ( E ) Parsing id+id$ Stack grows this way Stack Input Action id + id $ s5 0 id 5 + id $ r4 0 T 6 r2 0 E 1 s3 goto action S T E $ ) ( + id g6 g1 s7 s5 acc s3 1 2 g4 3 r3 4 r4 5 r2 6 g8 7 s9 8 r5 9 rn = reduce using rule number n sm = shift to state m

(1) S  E $ (2) E  T (3) E  E + T (4) T  id (5) T  ( E ) Parsing id+id$ Stack grows this way Stack Input Action id + id $ s5 0 id 5 + id $ r4 0 T 6 r2 0 E 1 s3 0 E 1 + 3 id $ goto action S T E $ ) ( + id g6 g1 s7 s5 acc s3 1 2 g4 3 r3 4 r4 5 r2 6 g8 7 s9 8 r5 9 rn = reduce using rule number n sm = shift to state m

(1) S  E $ (2) E  T (3) E  E + T (4) T  id (5) T  ( E ) Parsing id+id$ Stack grows this way Stack Input Action id + id $ s5 0 id 5 + id $ r4 0 T 6 r2 0 E 1 s3 0 E 1 + 3 id $ 0 E id 5 $ goto action S T E $ ) ( + id g6 g1 s7 s5 acc s3 1 2 g4 3 r3 4 r4 5 r2 6 g8 7 s9 8 r5 9 rn = reduce using rule number n sm = shift to state m

(1) S  E $ (2) E  T (3) E  E + T (4) T  id (5) T  ( E ) Parsing id+id$ Stack grows this way Stack Input Action id + id $ s5 0 id 5 + id $ r4 0 T 6 r2 0 E 1 s3 0 E 1 + 3 id $ 0 E id 5 $ 0 E T 4 r3 goto action S T E $ ) ( + id g6 g1 s7 s5 acc s3 1 2 g4 3 r3 4 r4 5 r2 6 g8 7 s9 8 r5 9 rn = reduce using rule number n sm = shift to state m

(1) S  E $ (2) E  T (3) E  E + T (4) T  id (5) T  ( E ) Parsing id+id$ Stack grows this way Stack Input Action id + id $ s5 0 id 5 + id $ r4 0 T 6 r2 0 E 1 s3 0 E 1 + 3 id $ 0 E id 5 $ 0 E T 4 r3 s2 goto action S T E $ ) ( + id g6 g1 s7 s5 acc s3 1 2 g4 3 r3 4 r4 5 r2 6 g8 7 s9 8 r5 9 rn = reduce using rule number n sm = shift to state m

LR(0) automaton example
reduce state shift state q6 E  T T q0 T q7 Z  E$ E  T E  E + T T  i T  (E) T  (E) E  T E  E + T T  i T  (E) read input “(“ ( q5 i T  i i Managed to reduce E E ( E ( i q1 q8 Z  E$ E  E+ T q3 E  E+T T  i T  (E) T  (E) E  E+T + + $ ) q2 q9 Z  E$ T  (E)  T q4 E  E + T

States and LR(0) Items E → E * B | E + B | B B → 0 | 1
The state will “remember” the potential derivation rules given the part that was already identified For example, if we have already identified E then the state will remember the two alternatives: (1) E → E * B, (2) E → E + B Actually, we will also remember where we are in each of them: (1) E → E ● * B, (2) E → E ● + B A derivation rule with a location marker is called LR(0) item. The state is actually a set of LR(0) items. E.g., q13 = { E → E ● * B , E → E ● + B}

GOTO/ACTION tables GOTO Table empty = error move ACTION Table State i
+ ( ) $ E T action q0 q5 q7 q1 q6 shift q3 q2 Z E$ q4 Shift E E+T T i E T q8 q9 T E

LR(0) parser tables Actions determined by topmost state
Two types of rows: Shift row – tells which state to GOTO for current token Reduce row – tells which rule to reduce (independent of current token) GOTO entries are blank

GOTO/ACTION tables GOTO Table empty = error move ACTION Table State i
+ ( ) $ E T action q0 q5 q7 q1 q6 shift q3 q2 Z E$ q4 Shift E E+T T i E T q8 q9 T E

GOTO/ACTION table Warning: numbers mean different things!
State i + ( ) $ E T q0 s5 s7 s1 s6 q1 s3 s2 q2 r1 q3 s4 q4 r3 q5 r4 q6 r2 q7 s8 q8 s9 q9 r5 action shift Z E$ Shift E E+T T i E T T E Z  E $ E  T E  E + T T  i T  ( E ) Warning: numbers mean different things! rn = reduce using rule number n sm = shift to state m

LR(0) Parsing of i + i Warning: numbers mean different things!
State i + ( ) $ E T q0 s5 s7 s1 s6 q1 s3 s2 q2 r1 q3 s4 q4 r3 q5 r4 q6 r2 q7 s8 q8 s9 q9 r5 Stack q0 q0 i q5 q0 T q6 q0 E q1 q0 E q1 + q3 q0 E q1 + q3 i q5 q0 E q1 + q3 T q4 q0 E q1 $ q2 q0 $ input i + i $ + i $ i $ $ Action shift reduce 4 reduce 2 reduce 3 reduce 1 accept Z  E $ E  T E  E + T T  i T  ( E ) Warning: numbers mean different things! rn = reduce using rule number n sm = shift to state m

Constructing an LR parsing table
Construct a (determinized) transition diagram from LR items If there are conflicts – stop Fill table entries from diagram

N  αβ LR item Already matched To be matched Input
Hypothesis about αβ being a possible handle, so far we’ve matched α, expecting to see β

Types of LR(0) items N  αβ Shift Item N  αβ Reduce Item

reduce state shift state q6 E  T T q0 T q7 Z  E$ E  T E  E + T T  i T  (E) T  (E) E  T E  E + T T  i T  (E) ( q5 i T  i i E ( E ( i q1 q8 Z  E$ E  E+ T q3 E  E+T T  i T  (E) T  (E) E  E+T + + $ ) q2 q9 Z  E$ T  (E)  T q4 E  E + T

Computing item sets Initial set
Z is in the start symbol -closure({ Z  α | Z  α is in the grammar } ) Next set from a set S and the next symbol X step(S,X) = { N  αXβ | N  αXβ in the item set S} nextSet(S,X) = -closure(step(S,X))

Operations for transition diagram construction
Initial = {S’  S$} For an item set I Closure(I) = Closure(I) ∪ {X  µ is in grammar| N  αXβ in I} Goto(I, X) = { N  αXβ | N  αXβ in I}

Initial example Initial = {S  E $} Grammar (1) S  E $ (2) E  T
(4) T  id (5) T  ( E ) Initial = {S  E $}

Closure example Initial = {S  E $}
Grammar (1) S  E $ (2) E  T (3) E  E + T (4) T  id (5) T  ( E ) Initial = {S  E $} Closure({S  E $}) = { S  E $ E  T E  E + T T  id T  ( E ) }

Goto example Initial = {S  E $}
Grammar (1) S  E $ (2) E  T (3) E  E + T (4) T  id (5) T  ( E ) Initial = {S  E $} Closure({S  E $}) = { S  E $ E  T E  E + T T  id T  ( E ) } Goto({S  E $ , E  E + T, T  id}, E) = {S  E $, E  E + T}

Constructing the transition diagram
Start with state 0 containing item Closure({S  E $}) Repeat until no new states are discovered For every state p containing item set Ip, and symbol N, compute state q containing item set Iq = Closure(goto(Ip, N))

reduce state shift state q6 E  T T q0 T q7 Z  E$ E  T E  E + T T  i T  (E) T  (E) E  T E  E + T T  i T  (E) ( q5 i T  i i E ( E ( i q1 q8 Z  E$ E  E+ T q3 E  E+T T  i T  (E) T  (E) E  E+T + + $ ) q2 q9 Z  E$ T  (E)  T q4 E  E + T

Automaton construction example
(1) S  E $ (2) E  T (3) E  E + T (4) T  id (5) T  ( E ) q0 S E$ Initialize

(1) S  E $ (2) E  T (3) E  E + T (4) T  id (5) T  ( E ) q0 S  E$ E  T E  E + T T  i T  (E) apply Closure

(1) S  E $ (2) E  T (3) E  E + T (4) T  id (5) T  ( E ) E  T q6 T q0 S  E$ E  T E  E + T T  i T  (E) T  (E) E  T E  E + T T  i T  (E) ( T  i q5 i S  E$ E  E+ T q1 E

(1) S  E $ (2) E  T (3) E  E + T (4) T  id (5) T  ( E ) q6 E  T q0 T q7 T S  E$ E  T E  E + T T  i T  (E) T  (E) E  T E  E + T T  i T  (E) non-terminal transition corresponds to goto action in parse table ( q5 i T  i i E ( E ( i q1 q8 q3 Z  E$ E  E+ T terminal transition corresponds to shift action in parse table E  E+T T  i T  (E) T  (E) E  E+T + + $ ) q2 q9 S  E$ T  (E)  T q4 E  E + T a single reduce item corresponds to reduce action

Are we done? Can make a transition diagram for any grammar
Can make a GOTO table for every grammar Cannot make a deterministic ACTION table for every grammar

LR(0) conflicts … T q0 Z  E$ E  T E  E + T T  i T  (E) ( …
T  i[E] ( … q5 i T  i T  i[E] E Shift/reduce conflict … Z  E $ E  T E  E + T T  i T  ( E ) T  i[E]

LR(0) conflicts … T q0 Z  E$ E  T E  E + T T  i T  (E) ( …
T  i[E] ( … q5 i T  i V  i E reduce/reduce conflict … Z  E $ E  T E  E + T T  i V  i T  ( E )

LR(0) conflicts Any grammar with an -rule cannot be LR(0)
Inherent shift/reduce conflict A   – reduce item P  αAβ – shift item A   can always be predicted from P  αAβ

Conflicts Can construct a diagram for every grammar but some may introduce conflicts shift-reduce conflict: an item set contains at least one shift item and one reduce item reduce-reduce conflict: an item set contains two reduce items

LR variants LR(0) – what we’ve seen so far SLR(0) LR(1) LALR(0)
Removes infeasible reduce actions via FOLLOW set reasoning LR(1) LR(0) with one lookahead token in items LALR(0) LR(1) with merging of states with same LR(0) component

LR (0) GOTO/ACTIONS tables
GOTO table is indexed by state and a grammar symbol from the stack ACTION Table GOTO Table State i + ( ) $ E T action q0 q5 q7 q1 q6 shift q3 q2 Z E$ q4 Shift E E+T T i E T q8 q9 T E ACTION table determined only by state, ignores input

SLR parsing A handle should not be reduced to a non-terminal N if the lookahead is a token that cannot follow N A reduce item N  α is applicable only when the lookahead is in FOLLOW(N) If b is not in FOLLOW(N) we proved there is no derivation S * βNb. Thus, it is safe to remove the reduce item from the conflicted state Differs from LR(0) only on the ACTION table Now a row in the parsing table may contain both shift actions and reduce actions and we need to consult the current token to decide which one to take

Lookahead token from the input
SLR action table Lookahead token from the input State i + ( ) [ ] $ shift 1 accept 2 3 4 E E+T 5 T i 6 E T 7 8 9 T (E) state action q0 shift q1 q2 q3 q4 E E+T q5 T i q6 E T q7 q8 q9 T E vs. SLR – use 1 token look-ahead LR(0) – no look-ahead … as before… T  i T  i[E]

LR(1) grammars In SLR: a reduce item N  α is applicable only when the lookahead is in FOLLOW(N) But FOLLOW(N) merges lookahead for all alternatives for N Insensitive to the context of a given production LR(1) keeps lookahead with each LR item Idea: a more refined notion of follows computed per item

LR(1) items LR(1) item is a pair Meaning Example LR(0) item
Lookahead token Meaning We matched the part left of the dot, looking to match the part on the right of the dot, followed by the lookahead token Example The production L  id yields the following LR(1) items LR(1) items [L → ● id, *] [L → ● id, =] [L → ● id, id] [L → ● id, $] [L → id ●, *] [L → id ●, =] [L → id ●, id] [L → id ●, $] (0) S’ → S (1) S → L = R (2) S → R (3) L → * R (4) L → id (5) R → L LR(0) items [L → ● id] [L → id ●]

LR(1) items LR(1) item is a pair Meaning Example
Lookahead token Meaning We matched the part left of the dot, looking to match the part on the right of the dot, followed by the lookahead token Example The production L  id yields the following LR(1) items Reduce only if the the expected lookhead matches the input [L → id ●, =] will be used only if the next input token is =

LALR(1) LR(1) tables have huge number of entries
Often don’t need such refined observation (and cost) Idea: find states with the same LR(0) component and merge their lookaheads component as long as there are no conflicts LALR(1) not as powerful as LR(1) in theory but works quite well in practice Merging may not introduce new shift-reduce conflicts, only reduce-reduce, which is unlikely in practice

Summary

LR is More Powerful than LL
Any LL(k) language is also in LR(k), i.e., LL(k) ⊂ LR(k). LR is more popular in automatic tools But less intuitive Also, the lookahead is counted differently in the two cases In an LL(k) derivation the algorithm sees the left-hand side of the rule + k input tokens and then must select the derivation rule In LR(k), the algorithm “sees” all right-hand side of the derivation rule + k input tokens and then reduces LR(0) sees the entire right-side, but no input token

Using tools to parse + create AST
terminal Integer NUMBER; terminal PLUS,MINUS,MULT,DIV; terminal LPAREN, RPAREN; terminal UMINUS; nonterminal Integer expr; precedence left PLUS, MINUS; precedence left DIV, MULT; Precedence left UMINUS; %% expr ::= expr:e1 PLUS expr:e2 {: RESULT = new Integer(e1.intValue() + e2.intValue()); :} | expr:e1 MINUS expr:e2 {: RESULT = new Integer(e1.intValue() - e2.intValue()); :} | expr:e1 MULT expr:e2 {: RESULT = new Integer(e1.intValue() * e2.intValue()); :} | expr:e1 DIV expr:e2 {: RESULT = new Integer(e1.intValue() / e2.intValue()); :} | MINUS expr:e1 %prec UMINUS {: RESULT = new Integer(0 - e1.intValue(); :} | LPAREN expr:e1 RPAREN {: RESULT = e1; :} | NUMBER:n {: RESULT = n; :}

Grammar Hierarchy Non-ambiguous CFG LR(1) LALR(1) LL(1) SLR(1) LR(0)

Earley Parsing Jay Earley, PhD

Earley Parsing Invented by Jay Earley [PhD. 1968]
Handles arbitrary context free grammars Can handle ambiguous grammars Complexity O(N3) when N = |input| Uses dynamic programming Compactly encodes ambiguity

Dynamic programming Break a problem P into subproblems P1…Pk
Solve P by combining solutions for P1…Pk Memoize (store) solutions to subproblems instead of re-computation Bellman-Ford shortest path algorithm Sol(x,y,i) = minimum of Sol(x,y,i-1) Sol(t,y,i-1) + weight(x,t) for edges (x,t)

Earley Parsing Dynamic programming implementation of a recursive descent parser S[N+1] Sequence of sets of “Earley states” N = |INPUT| Earley state (item) s is a sentential form + aux info S[i] All parse tree that can be produced (by a RDP) after reading the first i tokens S[i+1] built using S[0] … S[i]

Earley Parsing Parse arbitrary grammars in O(|input|3)
O(|input|2) for unambigous grammer Linear for most LR(k) langaues Dynamic programming implementation of a recursive descent parser S[N+1] Sequence of sets of “Earley states” N = |INPUT| Earley states is a sentential form + aux info S[i] All parse tree that can be produced (by an RDP) after reading the first i tokens S[i+1] built using S[0] … S[i]

Earley States s = < constituent, back >
constituent (dotted rule) for Aαβ A•αβ predicated constituents Aα•β in-progress constituents Aαβ• completed constituents back previous Early state in derivation

Earley Parser Input = x[1…N] S[0] = <E’ •E, 0>; S[1] = … S[N] = {} for i = N do until S[i] does not change do foreach s ∈ S[i] if s = <A…•a…, b> and a=x[i+1] then // scan S[i+1] = S[i+1] ∪ {<A…a•…, b> } if s = <A … •X …, b> and Xα then // predict S[i] = S[i] ∪ {<X•α, i > } if s = < A … • , b> and <X…•A…, k> ∈ S[b] then // complete S[i] = S[i] ∪{<X…A•…, k> }

Example if s = <A…•a…, b> and a=x[i+1] then // scan
S[i+1] = S[i+1] ∪ {<A…a•…, b> } if s = <A … •X …, b> and Xα then // predict S[i] = S[i] ∪ {<X•α, i > } if s = < A … • , b> and <X…•A…, k> ∈ S[b] then // complete S[i] = S[i] ∪{<X…A•…, k> }

Earley Parsing Jay Earley, PhD

Lecture 4: Syntax Analysis: Parsing Noam Rinetzky

Similar presentations

Presentation on theme: "Lecture 4: Syntax Analysis: Parsing Noam Rinetzky"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Lecture 4: Syntax Analysis: Parsing Noam Rinetzky

Similar presentations

Presentation on theme: "Lecture 4: Syntax Analysis: Parsing Noam Rinetzky"— Presentation transcript:

Similar presentations

About project

Feedback