Compilation 0368-3133 Lecture 3: Syntax Analysis Top Down parsing Bottom Up parsing Noam Rinetzky 1.


The Real Anatomy of a Compiler
Source text (txt) → Lexical Analysis → tokens → Syntax Analysis → AST → Semantic Analysis → annotated AST → Intermediate code generation → IR → Intermediate code optimization → IR → Code generation → symbolic instructions (SI) → Target code optimization → Machine code generation → machine instructions (MI) → Write executable output → executable (exe)
This lecture: Syntax Analysis.

Context free grammars (CFG) G = (V, T, P, S)
V – non-terminals (syntactic variables)
T – terminals (tokens)
P – derivation rules, each rule of the form N → (T ∪ V)* with N ∈ V
S – start symbol

Pushdown Automata (PDA) Nondeterministic PDAs define exactly the CFLs. Deterministic PDAs model parsers: most programming languages can be parsed by a deterministic PDA, which admits an efficient implementation.

CFG terminology
Symbols: terminals (tokens): ; := ( ) id num print; non-terminals: S E L
Start non-terminal: S (convention: the non-terminal appearing in the first derivation rule)
Grammar productions (rules), each of the form N → µ:
S → S ; S
S → id := E
E → id
E → num
E → E + E

CFG terminology Derivation - a sequence of replacements of non-terminals using the derivation rules Language - the set of strings of terminals derivable from the start symbol Sentential form - the result of a partial derivation –May contain non-terminals 6

Derivations Show that a sentence ω is in a grammar G: start with the start symbol; repeatedly replace one of the non-terminals by the right-hand side of a production; stop when the sentence contains only terminals. Given a sentential form αNβ and a rule N → µ: αNβ ⇒ αµβ. ω is in L(G) if S ⇒* ω.

Parse trees: Traces of Derivations Tree nodes are symbols, children ordered left-to-right. Each internal node is a non-terminal, and its children correspond to one of its productions N → µ1 … µk. The root is the start non-terminal; the leaves are tokens. Yield of a parse tree: the left-to-right walk over its leaves.

Parse Tree Derivation of x := z ; y := x + z:
S ⇒ S ; S ⇒ id := E ; S ⇒ id := id ; S ⇒ id := id ; id := E ⇒ id := id ; id := E + E ⇒ id := id ; id := E + id ⇒ id := id ; id := id + id
The corresponding parse tree has root S with children S, ;, S: the left S derives id := E with E ⇒ id, and the right S derives id := E with E ⇒ E + E ⇒ id + id.

Ambiguity x := y + z * w
S → S ; S
S → id := E | …
E → id | E + E | E * E | …
Two different parse trees derive the same statement: one parses the expression as id + (id * id), the other as (id + id) * id – the grammar is ambiguous.

Leftmost/rightmost Derivation Leftmost derivation – always expand the leftmost non-terminal. Rightmost derivation – always expand the rightmost non-terminal. Fixing the order allows us to describe a derivation by listing just the sequence of rules – we always know which non-terminal each rule is applied to.

Broad kinds of parsers Parsers for arbitrary grammars – Earley’s method, CYK method – usually not used in practice (though that might change). Top-down parsers – construct the parse tree in a top-down manner – find the leftmost derivation. Bottom-up parsers – construct the parse tree in a bottom-up manner – find the rightmost derivation in reverse order.

Top-Down Parsing: Predictive parsing Recursive descent LL(k) grammars 13

Predictive parsing Given a grammar G and a word w attempt to derive w using G Idea –Apply production to leftmost nonterminal –Pick production rule based on next input token General grammar –More than one option for choosing the next production based on a token Restricted grammars (LL) –Know exactly which single rule to apply –May require some lookahead to decide 14

Recursive descent parsing Define a function for every nonterminal. Every function works as follows: find an applicable production rule; a terminal is checked by matching it against the next input token; a nonterminal is handled by (recursively) calling its function. If there are several applicable productions for a nonterminal, use lookahead.

Recursive descent How do you pick the right A-production? Generally – try them all and use backtracking. In our case – use lookahead.

void A() {
  choose an A-production, A → X1 X2 … Xk;
  for (i = 1; i ≤ k; i++) {
    if (Xi is a nonterminal)
      call procedure Xi();
    else if (Xi == current)
      advance input;
    else
      report error;
  }
}
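As a sketch (not from the slides' own code), the backtracking scheme can be written in Python for the grammar A → aAb | c used in later examples: each alternative is tried in order, and a failed attempt returns None so the caller can fall back to the next production. The function names are illustrative.

```python
# Recursive descent with backtracking for the grammar A -> a A b | c.
# Each call tries the A-productions in order; on failure it returns
# None, which makes the caller back track and try the next alternative.

def parse_A(tokens, i=0):
    # Production A -> a A b
    if i < len(tokens) and tokens[i] == 'a':
        j = parse_A(tokens, i + 1)
        if j is not None and j < len(tokens) and tokens[j] == 'b':
            return j + 1
    # Production A -> c
    if i < len(tokens) and tokens[i] == 'c':
        return i + 1
    return None            # no production applies: backtrack

def accepts(word):
    """True iff the whole word derives from A."""
    return parse_A(word) == len(word)
```

Note that backtracking re-tries alternatives on failure; the LL(k) restriction discussed next removes the need for it.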

Problem 1: productions with common prefix
term → ID | indexed_elem
indexed_elem → ID [ expr ]
The function for indexed_elem will never be tried… What happens for input of the form ID [ expr ]?

Problem 2: null productions
S → A a b
A → a | ε

int S() { return A() && match(token('a')) && match(token('b')); }
int A() { return match(token('a')) || 1; }

What happens for input “ab”? What happens if you flip the order of the alternatives and try “aab”?

Problem 3: left recursion
E → E - term | term

int E() { return E() && match(token('-')) && term(); }

What happens with this procedure? E() calls itself forever without consuming any input – recursive descent parsers cannot handle left-recursive grammars.

Constructing an LL(1) Parser 20

FIRST sets
X → YY | ZZ | YZ | 1Y
Y → 4 | ε
Z → 2
L(Z) = {2}
L(Y) = {4, ε}
L(X) = {44, 4, ε, 22, 42, 2, 14, 1}

FIRST sets FIRST(X) = { t | X ⇒* tβ } ∪ { ε | X ⇒* ε } – all terminals that can appear first in some derivation from X, plus ε if ε can be derived from X. Example:
E → LIT | (E OP E) | not E
LIT → true | false
OP → and | or | xor
FIRST( LIT ) = { true, false }; FIRST( ( E OP E ) ) = { ‘(‘ }; FIRST( not E ) = { not }

FIRST sets No intersection between FIRST sets => can always pick a single rule If the FIRST sets intersect, may need longer lookahead –LL(k) = class of grammars in which production rule can be determined using a lookahead of k tokens –LL(1) is an important and useful class 24

Computing FIRST sets
FIRST(t) = { t } // t terminal
ε ∈ FIRST(X) if X → ε, or X → A1 … Ak and ε ∈ FIRST(Ai) for all i = 1…k
FIRST(α) − {ε} ⊆ FIRST(X) if X → A1 … Ak α and ε ∈ FIRST(Ai) for all i = 1…k

Computing FIRST sets Assume no null productions A → ε.
1. Initially, for all nonterminals A, set FIRST(A) = { t | A → tω for some ω }
2. Repeat the following until no changes occur: for each nonterminal A, for each production A → Bω, set FIRST(A) = FIRST(A) ∪ FIRST(B)
This is known as a fixed-point computation.

FIRST sets computation example
STMT → if EXPR then STMT | while EXPR do STMT | EXPR ;
EXPR → TERM -> id | zero? TERM | not EXPR | ++ id | -- id
TERM → id | constant

Initialization: FIRST(TERM) = { id, constant }; FIRST(EXPR) = { zero?, not, ++, -- }; FIRST(STMT) = { if, while }
Iteration 1: STMT → EXPR ; adds FIRST(EXPR), so FIRST(STMT) = { if, while, zero?, not, ++, -- }
Iteration 2: EXPR → TERM -> id adds FIRST(TERM), so FIRST(EXPR) = { zero?, not, ++, --, id, constant }
Iteration 3: FIRST(STMT) = { if, while, zero?, not, ++, --, id, constant } – no further changes: the fixed point is reached
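The fixed-point iteration above can be sketched in Python (in a general form that also handles ε-productions, which the simplified version on the earlier slide omits). The grammar encoding – a dict from nonterminal to productions, tuples of symbols – is an assumption of this sketch, not the slides’ notation.

```python
# Fixed-point computation of FIRST sets, generalized to handle
# ε-productions. A grammar maps each nonterminal to a list of
# productions (tuples of symbols); any symbol not in the dict is a
# terminal. EPS stands for ε.

EPS = 'ε'

def first_sets(grammar):
    first = {n: set() for n in grammar}
    changed = True
    while changed:                          # iterate until fixed point
        changed = False
        for n, prods in grammar.items():
            for prod in prods:
                nullable = True             # does prod derive ε so far?
                for sym in prod:
                    f = first[sym] if sym in first else {sym}
                    new = (f - {EPS}) - first[n]
                    if new:
                        first[n] |= new
                        changed = True
                    if EPS not in f:
                        nullable = False
                        break
                if nullable and EPS not in first[n]:
                    first[n].add(EPS)
                    changed = True
    return first

# The example grammar from this slide:
G = {
    'STMT': [('if', 'EXPR', 'then', 'STMT'),
             ('while', 'EXPR', 'do', 'STMT'),
             ('EXPR', ';')],
    'EXPR': [('TERM', '->', 'id'), ('zero?', 'TERM'), ('not', 'EXPR'),
             ('++', 'id'), ('--', 'id')],
    'TERM': [('id',), ('constant',)],
}
```

Running `first_sets(G)` reproduces the three iterations above in one loop that simply runs until nothing changes.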

FOLLOW sets What do we do with nullable (ε) productions? Example: A → B C D with B → ε and C → ε – use what comes afterwards to predict the right production. For every non-terminal A, FOLLOW(A) = the set of tokens that can immediately follow A. We can predict the alternative N → Ak for a non-terminal N when the lookahead token is in FIRST(Ak) ∪ (FOLLOW(N) if Ak is nullable).

FOLLOW sets: Constraints
$ ∈ FOLLOW(S)
FIRST(β) − {ε} ⊆ FOLLOW(X) for each production A → α X β
FOLLOW(A) ⊆ FOLLOW(X) for each production A → α X β with ε ∈ FIRST(β) (including β = ε)

Example: FOLLOW sets
E → T X
X → + E | ε
T → ( E ) | int Y
Y → * T | ε

Non-terminals: FOLLOW(E) = { ), $ }  FOLLOW(X) = { ), $ }  FOLLOW(T) = { +, ), $ }  FOLLOW(Y) = { +, ), $ }
Terminals: FOLLOW(+) = { (, int }  FOLLOW(() = { (, int }  FOLLOW(*) = { (, int }  FOLLOW()) = { +, ), $ }  FOLLOW(int) = { *, +, ), $ }

Prediction Table For each production A → α: T[A, t] = α if t ∈ FIRST(α); T[A, t] = α also if ε ∈ FIRST(α) and t ∈ FOLLOW(A) (t can also be $). If some entry T[A, t] is not well defined – two productions compete for it – then the grammar is not LL(1).
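Building the table and detecting conflicts can be sketched as follows, using the E / LIT / OP grammar from the FIRST-sets slide. The FIRST sets are written out by hand here (an assumption of the sketch; they could equally come from the fixed-point computation above), and `follow` would only be consulted for nullable bodies.

```python
# Building the prediction table T[A, t] and detecting LL(1) conflicts.

EPS = 'ε'

FIRST = {'E': {'true', 'false', '(', 'not'},
         'LIT': {'true', 'false'},
         'OP': {'and', 'or', 'xor'}}

def body_first(alpha):
    # FIRST of a production body; looking at the first symbol is
    # enough here because no body in this grammar is nullable.
    sym = alpha[0]
    return FIRST.get(sym, {sym})

def build_table(prods, follow):
    table, conflicts = {}, []
    for a, alpha in prods:
        f = body_first(alpha)
        terms = f - {EPS}
        if EPS in f:
            terms |= follow[a]          # predict on FOLLOW(A) as well
        for t in terms:
            if (a, t) in table:         # two rules claim T[A, t]:
                conflicts.append((a, t))    # the grammar is not LL(1)
            table[(a, t)] = alpha
    return table, conflicts

PRODS = [('E', ('LIT',)), ('E', ('(', 'E', 'OP', 'E', ')')),
         ('E', ('not', 'E')),
         ('LIT', ('true',)), ('LIT', ('false',)),
         ('OP', ('and',)), ('OP', ('or',)), ('OP', ('xor',))]
```

For this grammar `build_table(PRODS, follow={})` yields an empty conflict list – the grammar is LL(1) – and the 9 entries match the transition table shown later for the same grammar.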

LL(k) grammars A grammar is in the class LL(k) when it can be parsed via: top-down derivation; scanning the input from left to right (L); producing the leftmost derivation (L); with lookahead of k tokens (k). A language is said to be LL(k) when it has an LL(k) grammar.

LL(1) grammars A grammar is in the class LL(1) iff for every two productions A → α and A → β:
FIRST(α) ∩ FIRST(β) = {} // including ε
If ε ∈ FIRST(α) then FIRST(β) ∩ FOLLOW(A) = {}
If ε ∈ FIRST(β) then FIRST(α) ∩ FOLLOW(A) = {}

Problem: Non LL Grammars
S → A a b
A → a | ε

bool S() { return A() && match(token('a')) && match(token('b')); }
bool A() { return match(token('a')) || true; }

What happens for input “ab”? What happens if you flip the order of the alternatives and try “aab”?

Problem: Non LL Grammars
S → A a b
A → a | ε
FIRST(S) = { a }  FOLLOW(S) = { $ }
FIRST(A) = { a, ε }  FOLLOW(A) = { a }
FIRST/FOLLOW conflict

Back to problem 1
term → ID | indexed_elem
indexed_elem → ID [ expr ]
FIRST(term) = { ID }
FIRST(indexed_elem) = { ID }
FIRST/FIRST conflict

Solution: left factoring Rewrite the grammar to be in LL(1). Intuition: just like factoring x*y + x*z into x*(y+z).
term → ID | indexed_elem
indexed_elem → ID [ expr ]
becomes
term → ID after_ID
after_ID → [ expr ] | ε

S  if E then S else S | if E then S | T S  if E then S S’ | T S’  else S |  Left factoring – another example 42

Back to problem 2
S → A a b
A → a | ε
FIRST(S) = { a }  FOLLOW(S) = { $ }
FIRST(A) = { a, ε }  FOLLOW(A) = { a }
FIRST/FOLLOW conflict

Solution: substitution
S → A a b
A → a | ε
Substitute A in S: S → a a b | a b
Left factoring: S → a after_A ; after_A → a b | b

Back to problem 3
E → E - term | term
Left recursion cannot be handled with a bounded lookahead. What can we do?

Left recursion removal
G1: N → Nα | β   L(G1) = β, βα, βαα, βααα, …
G2: N → βN' ; N' → αN' | ε   L(G2) = the same language
For our 3rd example: E → E - term | term becomes E → term E' ; E' → - term E' | ε
Can be done algorithmically. Problem: the grammar may become mangled beyond recognition.
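The G1-to-G2 transformation for immediate left recursion can be sketched mechanically; the tuple-based grammar encoding and the `N'` naming convention are assumptions of this sketch.

```python
# Removing immediate left recursion: N -> Nα | β becomes
# N -> β N'  and  N' -> α N' | ε, with N' a fresh nonterminal.
# Productions are tuples of symbols; the empty tuple () stands for ε.

def remove_left_recursion(grammar):
    out = {}
    for n, prods in grammar.items():
        alphas = [p[1:] for p in prods if p and p[0] == n]   # N -> Nα
        betas = [p for p in prods if not p or p[0] != n]     # N -> β
        if not alphas:
            out[n] = list(prods)          # nothing to do
            continue
        n2 = n + "'"                      # fresh nonterminal N'
        out[n] = [beta + (n2,) for beta in betas]
        out[n2] = [alpha + (n2,) for alpha in alphas] + [()]
    return out
```

Applied to E → E - term | term this yields E → term E' and E' → - term E' | ε, exactly as above.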

LL(k) Parsers Recursive Descent –Manual construction –Uses recursion Wanted –A parser that can be generated automatically –Does not use recursion 47

LL(k) parsing via pushdown automata The pushdown automaton uses: a prediction stack; an input stream; a transition table – nonterminals × tokens → production alternative. The entry indexed by nonterminal N and token t contains the alternative of N that must be predicted when the current input starts with t.

LL(k) parsing via pushdown automata Two possible moves –Prediction When top of stack is nonterminal N, pop N, lookup table[N,t]. If table[N,t] is not empty, push table[N,t] on prediction stack, otherwise – syntax error –Match When top of prediction stack is a terminal T, must be equal to next input token t. If (t == T), pop T and consume t. If (t ≠ T) syntax error Parsing terminates when prediction stack is empty –If input is empty at that point, success. Otherwise, syntax error 49

Example transition table
(1) E → LIT (2) E → ( E OP E ) (3) E → not E (4) LIT → true (5) LIT → false (6) OP → and (7) OP → or (8) OP → xor

       (    )   not  true  false  and  or  xor  $
E      2        3    1     1
LIT                  4     5
OP                                 6    7   8

Rows: nonterminals; columns: input tokens; entries: which rule should be used.

Model of non-recursive predictive parser A driver program reads the next token from the input buffer (which ends with $), consults the parsing table, and maintains a stack of grammar symbols (also ending with $), emitting output as it predicts productions.

abc A A  aAbA  c A  aAb | c aacbb$ Input suffixStack contentMove aacbb$A$ predict(A,a) = A  aAb aacbb$aAb$match(a,a) acbb$Ab$ predict(A,a) = A  aAb acbb$aAbb$match(a,a) cbb$Abb$ predict(A,c) = A  c cbb$ match(c,c) bb$ match(b,b) b$ match(b,b) $$match($,$) – success Running parser example 52

Errors

Handling Syntax Errors Report and locate the error Diagnose the error Correct the error Recover from the error in order to discover more errors –without reporting too many “strange” errors 54

Error Diagnosis Line number –may be far from the actual error The current token The expected tokens Parser configuration 55

Error Recovery Becomes less important in interactive environments. Example heuristics: search for a semicolon and ignore the statement; try to “replace” tokens for common errors; refrain from reporting 3 subsequent errors. Globally optimal solutions: for every input w, find a valid program w’ with a “minimal distance” from w.

abc A A  aAbA  c A  aAb | c abcbb$ Input suffixStack contentMove abcbb$A$ predict(A,a) = A  aAb abcbb$aAb$match(a,a) bcbb$Ab$predict(A,b) = ERROR Illegal input example 57

Error handling in LL parsers
S → a c | b S
Prediction table: predict(S, a) = S → a c; predict(S, b) = S → b S
Input: c$

Input suffix   Stack   Move
c$             S$      predict(S, c) = ERROR
Now what? Predict S → b S anyway: “missing token b inserted in line XXX”

Error handling in LL parsers
S → a c | b S
Result: infinite loop

Input suffix   Stack   Move
bc$            S$      predict(S, b) = S → b S
bc$            bS$     match(b, b)
c$             S$      Looks familiar?
After inserting the missing b, the parser is back in the same configuration – the recovery loops forever.

Error handling and recovery x = a * (p+q * ( -b * (r-s); Where should we report the error? The valid prefix property 60

The Valid Prefix Property For every prefix of tokens t1, t2, …, ti that the parser identifies as legal, there exist tokens ti+1, ti+2, …, tn such that t1, t2, …, tn is a syntactically valid program. If every token is considered as a single character: for every prefix word u that the parser identifies as legal there exists w such that u·w is a valid program.

Recovery is tricky Heuristics for dropping tokens, skipping to semicolon, etc. 62

Building the Parse Tree 63

Adding semantic actions Can add an action to perform on each production rule Can build the parse tree –Every function returns an object of type Node –Every Node maintains a list of children –Function calls can add new children 64

Building the parse tree

Node E() {
  result = new Node();
  result.name = "E";
  if (current ∈ {TRUE, FALSE}) {        // E → LIT
    result.addChild(LIT());
  } else if (current == LPAREN) {       // E → ( E OP E )
    result.addChild(match(LPAREN));
    result.addChild(E());
    result.addChild(OP());
    result.addChild(E());
    result.addChild(match(RPAREN));
  } else if (current == NOT) {          // E → not E
    result.addChild(match(NOT));
    result.addChild(E());
  } else error;
  return result;
}

Parser for Fully Parenthesized Expressions

static int Parse_Expression(Expression **expr_p) {
  Expression *expr = *expr_p = new_expression();
  /* try to parse a digit */
  if (Token.class == DIGIT) {
    expr->type = 'D'; expr->value = Token.repr - '0';
    get_next_token();
    return 1;
  }
  /* try to parse a parenthesized expression */
  if (Token.class == '(') {
    expr->type = 'P'; get_next_token();
    if (!Parse_Expression(&expr->left)) Error("missing expression");
    if (!Parse_Operator(&expr->oper)) Error("missing operator");
    if (Token.class != ')') Error("missing )");
    get_next_token();
    return 1;
  }
  return 0;
}

Bottom-up parsing 67

Intuition: Bottom-Up Parsing Begin with the user's program Guess parse (sub)trees Check if root is the start symbol 68

Bottom-up parsing Unambiguous grammar:
E → E * T
E → T
T → T + F
T → F
F → id
F → num
F → ( E )
For input 1 + 2 * 3 the parser builds the tree from the leaves upward: each number is first reduced to F, the 1 + 2 part is then combined via T → T + F, and finally E → E * T joins it with the 3, finishing with E at the root.

Top-Down vs Bottom-Up (for A → aAb | c on input aacbb$)
Top-down (predict, then match/scan-complete): starting from the start symbol, the parser predicts A → aAb and matches tokens against the tree frontier, growing the tree from the root toward the leaves.
Bottom-up (shift, then reduce): the parser shifts tokens of aacbb$ onto a stack and reduces handles (first c to A, then aAb to A), growing the tree from the leaves toward the root.

Bottom-up parsing: LR(k) Grammars A grammar is in the class LR(k) when it can be parsed via: bottom-up derivation; scanning the input from left to right (L); producing the rightmost derivation in reverse (R); with lookahead of k tokens (k). A language is said to be LR(k) if it has an LR(k) grammar. The simplest case is LR(0), which we will discuss.

Terminology: Reductions & Handles The opposite of derivation is called reduction. Let A → α be a production rule. Derivation: βAµ ⇒ βαµ. Reduction: βαµ ⇒ βAµ. A handle is the reduced substring: α is the handle in βαµ.

Goal: Reduce the Input to the Start Symbol
E → E * B | E + B | B
B → 0 | 1
Example: 0 + 0 * 1 ⇒ B + 0 * 1 ⇒ E + 0 * 1 ⇒ E + B * 1 ⇒ E * 1 ⇒ E * B ⇒ E
Go over the input so far, and upon seeing a right-hand side of a rule, “invoke” the rule and replace the right-hand side with the left-hand side (reduce).

Use Shift & Reduce In each stage, we shift a symbol from the input to the stack, or reduce according to one of the rules. 77

Example: “0+0*1” with E → E * B | E + B | B and B → 0 | 1

Stack   Input    Action
        0+0*1$   shift
0       +0*1$    reduce B → 0
B       +0*1$    reduce E → B
E       +0*1$    shift
E+      0*1$     shift
E+0     *1$      reduce B → 0
E+B     *1$      reduce E → E + B
E       *1$      shift
E*      1$       shift
E*1     $        reduce B → 1
E*B     $        reduce E → E * B
E       $        accept
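A naive version of this loop can be sketched in Python: reduce whenever the top of the stack matches some right-hand side, otherwise shift. This greedy strategy happens to reproduce the trace above, but it is only an illustration – a real LR parser uses states and the ACTION/GOTO tables introduced next to decide when a matching substring is actually a handle.

```python
# A naive shift-reduce loop for E -> E*B | E+B | B and B -> 0 | 1:
# whenever the top of the stack matches a right-hand side we reduce
# (longer right-hand sides tried first), otherwise we shift a token.

RULES = [('E', ['E', '*', 'B']), ('E', ['E', '+', 'B']),
         ('E', ['B']), ('B', ['0']), ('B', ['1'])]

def shift_reduce(tokens):
    stack, trace = [], []
    tokens = list(tokens)
    while True:
        for lhs, rhs in RULES:                 # try to reduce first
            if stack[-len(rhs):] == rhs:
                stack[-len(rhs):] = [lhs]      # replace handle by lhs
                trace.append(('reduce', lhs, tuple(rhs)))
                break
        else:                                  # no handle on top:
            if not tokens:
                break                          # input consumed – stop
            stack.append(tokens.pop(0))        # shift
            trace.append(('shift', stack[-1]))
    return stack, trace
```

For “0+0*1” the loop ends with the stack reduced to the start symbol E.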

How does the parser know what to do? The parser reads a token stream – e.g., LP Num(23) Op(+) Num(7) RP Op(*) Id(x) for ( 23 + 7 ) * x – keeps a stack, and consults an action table and a goto table to decide, given the current state and the next token, whether to shift or reduce.

A state will keep the info gathered on handle(s) – a state in the “control” of the PDA. A table will tell the parser “what to do” based on the current state and the next token – the transition function of the PDA. A stack will record the “nesting level” – prefixes of handles, together with states. Each state is a set of LR(0) items.

LR item N → α ⋅ β: α has already been matched, β remains to be matched against the input. The item is a hypothesis about αβ being a possible handle: so far we’ve matched α, expecting to see β.

Example: LR(0) Items All items can be obtained by placing a dot at every position for every production.
Grammar: (1) S → E $ (2) E → T (3) E → E + T (4) T → id (5) T → ( E )
LR(0) items:
1: S → ⋅E$   2: S → E⋅$   3: S → E$⋅
4: E → ⋅T    5: E → T⋅
6: E → ⋅E+T  7: E → E⋅+T  8: E → E+⋅T  9: E → E+T⋅
10: T → ⋅id  11: T → id⋅
12: T → ⋅(E) 13: T → (⋅E) 14: T → (E⋅) 15: T → (E)⋅
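Enumerating the items is purely mechanical, as this Python sketch shows (the tuple-based item representation is an assumption of the sketch):

```python
# Enumerating all LR(0) items of the grammar: place a dot at every
# position of every production body. An item is represented as
# (lhs, matched-part, to-be-matched-part).

GRAMMAR = [('S', ('E', '$')), ('E', ('T',)), ('E', ('E', '+', 'T')),
           ('T', ('id',)), ('T', ('(', 'E', ')'))]

def lr0_items(grammar):
    out = []
    for lhs, rhs in grammar:
        for dot in range(len(rhs) + 1):
            out.append((lhs, rhs[:dot], rhs[dot:]))
    return out
```

A body of length n yields n+1 items, so this grammar has 3+2+4+2+4 = 15 items, matching the list above.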

N → α ⋅ β (dot before symbols still to be matched) is a shift item; N → αβ ⋅ (the whole right-hand side matched) is a reduce item.

States and LR(0) Items The state will “remember” the potential derivation rules given the part that was already identified For example, if we have already identified E then the state will remember the two alternatives: (1) E → E * B, (2) E → E + B Actually, we will also remember where we are in each of them: (1) E → E ● * B, (2) E → E ● + B A derivation rule with a location marker is called LR(0) item. The state is actually a set of LR(0) items. E.g., q 13 = { E → E ● * B, E → E ● + B} E → E * B | E + B | B B → 0 | 1 84

Intuition Gather input token by token until we find a right-hand side of a rule and then replace it with the non-terminal on the left hand side –Going over a token and remembering it in the stack is a shift Each shift moves to a state that remembers what we’ve seen so far –A reduce replaces a string in the stack with the non-terminal that derives it 85

Model of an LR parser The LR parsing program reads the input (terminals and non-terminals’ tokens, ending with $), maintains a stack of alternating states and symbols (e.g., 0 T 6 id 5), and consults the action and goto tables to produce output.

LR parser stack A sequence made of ⟨state, symbol⟩ pairs. For instance, a possible stack for the grammar S → E $ ; E → T ; E → E + T ; T → id ; T → ( E ) could be: 0 T 6 id 5 (the stack grows to the right).

Form of LR parsing table Rows: states. Columns: terminals (shift/reduce actions) and non-terminals (goto part). Entries: sn = shift and move to state n; rk = reduce by rule k; gm = goto state m; acc = accept; empty = error.

LR parser table example
(1) S → E $ (2) E → T (3) E → E + T (4) T → id (5) T → ( E )

         action                     goto
STATE   id    +     (     )    $    E    T
0       s5          s7              g1   g6
1             s3               s2
2                              acc
3       s5          s7                   g4
4       r3 (reduce on any token)
5       r4 (reduce on any token)
6       r2 (reduce on any token)
7       s5          s7              g8   g6
8             s3          s9
9       r5 (reduce on any token)

Shift move If action[q, a] = sn (q the current state, a the next input token): consume a from the input, then push a and the new state n onto the stack.

Reduce move If action[qn, a] = rk, with production (k) A → β and β = σ1 … σn: the top of the stack looks like q σ1 q1 … σn qn, i.e., 2·|β| entries above q. Pop these 2·|β| entries, then push A and the state qm = goto[q, A].

Accept move If action[q, a] = accept: parsing completed successfully.
Error move If action[q, a] = error (the entry is usually left empty): parsing discovered a syntactic error.

Example
Z → E $
E → T | E + T
T → i | ( E )

Example: parsing with LR items
Z → E $ ; E → T | E + T ; T → i | ( E )
Input: i + i $
Initial item set: Z → ⋅E$, E → ⋅T, E → ⋅E+T, T → ⋅i, T → ⋅(E)
Why do we need the additional LR items? Where do they come from? What do they mean?

ε-closure Given a set S of LR(0) items: if P → α⋅Nβ is in S, then for each rule N → µ in the grammar, S must also contain N → ⋅µ.
ε-closure({Z → ⋅E$}) = { Z → ⋅E$, E → ⋅T, E → ⋅E+T, T → ⋅i, T → ⋅(E) }

Example: parsing i + i $ with LR items
Z → E $ ; E → T | E + T ; T → i | ( E )
Items denote possible future handles: each item remembers the position from which we’re trying to reduce, and items are matched against the current token.
1. Start in the closure { Z → ⋅E$, E → ⋅T, E → ⋅E+T, T → ⋅i, T → ⋅(E) }.
2. Shift i: item T → i⋅ is a reduce item – reduce the i to T.
3. Item E → T⋅ is a reduce item – reduce T to E.
4. Now Z → E⋅$ and E → E⋅+T; the next token is +, so shift it, reaching { E → E+⋅T, T → ⋅i, T → ⋅(E) }.
5. Shift i: T → i⋅ – reduce to T, yielding the reduce item E → E+T⋅ – reduce E + T to E.
6. Back to Z → E⋅$; shift $, reaching the reduce item Z → E$⋅ – reduce to Z: success.

GOTO/ACTION tables

State   i    +    (    )    $    E    T    action
q0      q5        q7             q1   q6   shift
q1           q3             q2             shift
q2                                         reduce Z → E$
q3      q5        q7                  q4   shift
q4                                         reduce E → E+T
q5                                         reduce T → i
q6                                         reduce E → T
q7      q5        q7             q8   q6   shift
q8           q3        q9                  shift
q9                                         reduce T → (E)
An empty GOTO entry is an error move.

LR(0) parser tables Two types of rows: –Shift row – tells which state to GOTO for current token –Reduce row – tells which rule to reduce (independent of current token) GOTO entries are blank 111

LR parser data structures Input – the remainder of the text to be processed. Stack – a sequence of pairs N, qi, where N is a symbol (terminal or non-terminal) and qi is the state at which decisions are made. The initial stack contains only q0.

LR(0) pushdown automaton Two moves: shift and reduce. Shift move: remove the first token from the input, push it on the stack, compute the next state based on the GOTO table, and push the new state on the stack; if the new state is an error – report an error. Example: in state q0 with input i+i$, shifting i pushes i and q5 (GOTO[q0, i] = q5), leaving input +i$.

LR(0) pushdown automaton Reduce move, using a rule N → α: the symbols of α and their following states are removed from the stack; the new state is computed from the GOTO table using the state now at the top of the stack; N is pushed on the stack, and the new state is pushed on top of N. Example: with i q5 on top of the stack, reducing by T → i pops them and pushes T q6 (GOTO[q0, T] = q6).

GOTO/ACTION table
(1) Z → E $ (2) E → T (3) E → E + T (4) T → i (5) T → ( E )

State   i    +    (    )    $    E    T
q0      s5        s7             s1   s6
q1           s3             s2
q2      r1 (on any token)
q3      s5        s7                  s4
q4      r3 (on any token)
q5      r4 (on any token)
q6      r2 (on any token)
q7      s5        s7             s8   s6
q8           s3        s9
q9      r5 (on any token)

Warning: the numbers mean different things! rn = reduce using rule number n; sm = shift to state m.

Parsing id+id$ 116–126

(1) S → E $   (2) E → T   (3) E → E + T   (4) T → id   (5) T → ( E )

State | id | +  | (  | )  | $  | E  | T
0     | s5 |    | s7 |    |    | g1 | g6
1     |    | s3 |    |    | s2 |    |
2     | accept
3     | s5 |    | s7 |    |    |    | g4
4     | r3
5     | r4
6     | r2
7     | s5 |    | s7 |    |    | g8 | g6
8     |    | s3 |    | s9 |    |    |
9     | r5

Initialize with state 0, then on each step consult the table with the state on top of the stack and the next input token:

Stack           | Input     | Action
0               | id + id $ | s5
0 id 5          | + id $    | r4   (pop "id 5", push T, goto 6)
0 T 6           | + id $    | r2
0 E 1           | + id $    | s3
0 E 1 + 3       | id $      | s5
0 E 1 + 3 id 5  | $         | r4
0 E 1 + 3 T 4   | $         | r3
0 E 1           | $         | s2
0 E 1 $ 2       |           | accept (reduce by rule 1)
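The shift/reduce loop that drives this trace can be sketched in a few lines of Python. This is an illustrative reconstruction, not code from the slides: the dictionary names and `parse` are my own, but the table entries transcribe the parsing table for this grammar, and the run reproduces the action sequence of the trace.

```python
# Sketch of a table-driven LR parser for the grammar
# (1) S -> E $  (2) E -> T  (3) E -> E + T  (4) T -> id  (5) T -> ( E ).
# The slides' stack interleaves grammar symbols with states; only the
# states are actually needed to drive the parse, so we push just those.

ACTION = {  # (state, token) -> shift; (state, None) -> unconditional LR(0) action
    (0, 'id'): ('s', 5), (0, '('): ('s', 7),
    (1, '+'): ('s', 3), (1, '$'): ('s', 2),
    (2, None): ('acc',),
    (3, 'id'): ('s', 5), (3, '('): ('s', 7),
    (4, None): ('r', 3), (5, None): ('r', 4),
    (6, None): ('r', 2), (9, None): ('r', 5),
    (7, 'id'): ('s', 5), (7, '('): ('s', 7),
    (8, '+'): ('s', 3), (8, ')'): ('s', 9),
}
GOTO = {(0, 'E'): 1, (0, 'T'): 6, (3, 'T'): 4, (7, 'E'): 8, (7, 'T'): 6}
RULES = {2: ('E', 1), 3: ('E', 3), 4: ('T', 1), 5: ('T', 3)}  # rule -> (lhs, |rhs|)

def parse(tokens):
    """Run the LR parser; return the sequence of actions taken."""
    stack, trace, i = [0], [], 0
    while True:
        state = stack[-1]
        tok = tokens[i] if i < len(tokens) else '$'
        act = ACTION.get((state, None)) or ACTION.get((state, tok))
        if act is None:
            return trace + ['error']
        if act[0] == 'acc':
            return trace + ['accept']
        if act[0] == 's':                 # shift: consume token, push new state
            trace.append('s%d' % act[1])
            stack.append(act[1])
            i += 1
        else:                             # reduce: pop |rhs| states, then goto
            lhs, n = RULES[act[1]]
            trace.append('r%d' % act[1])
            del stack[len(stack) - n:]
            stack.append(GOTO[(stack[-1], lhs)])
```

Running `parse(['id', '+', 'id', '$'])` produces the action sequence s5, r4, r2, s3, s5, r4, r3, s2, accept, matching the trace above.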

Constructing an LR parsing table 127
– Construct a (determinized) transition diagram from LR items
– If there are conflicts – stop
– Fill table entries from the diagram

LR item 128

N → α • β
    α = already matched, β = still to be matched against the input

Hypothesis: αβ is a possible handle; so far we've matched α and expect to see β.

Types of LR(0) items 129

Shift item:  N → α • β   (dot before a symbol)
Reduce item: N → αβ •    (dot at the end)

LR(0) automaton example 130

q0: Z → •E$, E → •T, E → •E+T, T → •i, T → •(E)     (shift state)
q1: Z → E•$, E → E•+T                               (shift state)
q2: Z → E$•                                         (reduce state)
q3: E → E+•T, T → •i, T → •(E)                      (shift state)
q4: E → E+T•                                        (reduce state)
q5: T → i•                                          (reduce state)
q6: E → T•                                          (reduce state)
q7: T → (•E), E → •T, E → •E+T, T → •i, T → •(E)    (shift state)
q8: T → (E•), E → E•+T                              (shift state)
q9: T → (E)•                                        (reduce state)

Transitions:
q0 -E-> q1, q0 -T-> q6, q0 -i-> q5, q0 -(-> q7
q1 -$-> q2, q1 -+-> q3
q3 -T-> q4, q3 -i-> q5, q3 -(-> q7
q7 -E-> q8, q7 -T-> q6, q7 -i-> q5, q7 -(-> q7
q8 -)-> q9, q8 -+-> q3

Computing item sets 131

Initial set:
– Z is the start symbol
– ε-closure({ Z → •α | Z → α is in the grammar })

Next set from a set S and the next symbol X:
– step(S,X) = { N → αX•β | N → α•Xβ is in the item set S }
– nextSet(S,X) = ε-closure(step(S,X))

Operations for transition diagram construction 132

Initial = { S' → •S$ }
For an item set I:
– Closure(I) = I ∪ { X → •µ | X → µ is in the grammar, N → α•Xβ in I }  (repeat until a fixpoint)
– Goto(I, X) = { N → αX•β | N → α•Xβ in I }

Initial example 133

Initial = { S → • E $ }

Grammar:
(1) S → E $   (2) E → T   (3) E → E + T   (4) T → id   (5) T → ( E )

Closure example 134

Initial = { S → • E $ }
Closure({ S → • E $ }) = {
  S → • E $
  E → • T
  E → • E + T
  T → • id
  T → • ( E )
}

Grammar:
(1) S → E $   (2) E → T   (3) E → E + T   (4) T → id   (5) T → ( E )

Goto example 135

Closure({ S → • E $ }) = { S → • E $, E → • T, E → • E + T, T → • id, T → • ( E ) }

Goto({ S → • E $, E → • E + T, T → • id }, E) = { S → E • $, E → E • + T }

Grammar:
(1) S → E $   (2) E → T   (3) E → E + T   (4) T → id   (5) T → ( E )

Constructing the transition diagram 136
– Start with state 0 containing the item set Closure({S → • E $})
– Repeat until no new states are discovered:
  – For every state p containing item set Ip, and symbol N, compute the state q containing item set Iq = Closure(Goto(Ip, N))
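The Closure/Goto fixpoint above translates almost directly into code. A minimal sketch for the running grammar; items are triples (lhs, rhs, dot), and the function names (`closure`, `goto`, `collection`) are illustrative, not from any particular tool.

```python
# Sketch: LR(0) Closure/Goto and the canonical collection of item sets for
# (1) S -> E $  (2) E -> T  (3) E -> E + T  (4) T -> id  (5) T -> ( E ).
GRAMMAR = {
    'S': [('E', '$')],
    'E': [('T',), ('E', '+', 'T')],
    'T': [('id',), ('(', 'E', ')')],
}

def closure(items):
    """Add X -> .mu for every item with the dot before non-terminal X."""
    items = set(items)
    work = list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for prod in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], prod, 0)
                if item not in items:
                    items.add(item)
                    work.append(item)
    return frozenset(items)

def goto(items, x):
    """Move the dot over x in every item where x follows the dot."""
    return closure({(l, r, d + 1) for (l, r, d) in items if d < len(r) and r[d] == x})

def collection():
    """Discover all states reachable from Closure({S -> .E$})."""
    start = closure({('S', ('E', '$'), 0)})
    states, work = {start}, [start]
    while work:
        s = work.pop()
        for x in {r[d] for (_, r, d) in s if d < len(r)}:
            t = goto(s, x)
            if t and t not in states:
                states.add(t)
                work.append(t)
    return states
```

For this grammar `closure({('S', ('E', '$'), 0)})` contains the five items of slide 134, and `collection()` discovers exactly the ten states q0–q9 of the automaton.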

LR(0) automaton example 137

(The LR(0) automaton of slide 130, shown again: states q0–q9; reduce states q2, q4, q5, q6, q9; shift states q0, q1, q3, q7, q8.)

Automaton construction example 138

Initialize: q0 = { S → • E $ }

Grammar:
(1) S → E $   (2) E → T   (3) E → E + T   (4) T → id   (5) T → ( E )

Automaton construction example 139

Apply Closure: q0 = { S → • E $, E → • T, E → • E + T, T → • id, T → • ( E ) }

Grammar:
(1) S → E $   (2) E → T   (3) E → E + T   (4) T → id   (5) T → ( E )

Automaton construction example 140

From q0 = { S → • E $, E → • T, E → • E + T, T → • id, T → • ( E ) }:
  on E:  q1 = { S → E • $, E → E • + T }
  on T:  q6 = { E → T • }
  on id: q5 = { T → id • }
  on (:  q7 = { T → ( • E ), E → • T, E → • E + T, T → • id, T → • ( E ) }

Grammar:
(1) S → E $   (2) E → T   (3) E → E + T   (4) T → id   (5) T → ( E )

Automaton construction example 141

q0: S → •E$, E → •T, E → •E+T, T → •id, T → •(E)
q1: S → E•$, E → E•+T
q2: S → E$•
q3: E → E+•T, T → •id, T → •(E)
q4: E → E+T•
q5: T → id•
q6: E → T•
q7: T → (•E), E → •T, E → •E+T, T → •id, T → •(E)
q8: T → (E•), E → E•+T
q9: T → (E)•

Transitions:
q0 -E-> q1, q0 -T-> q6, q0 -id-> q5, q0 -(-> q7
q1 -$-> q2, q1 -+-> q3
q3 -T-> q4, q3 -id-> q5, q3 -(-> q7
q7 -E-> q8, q7 -T-> q6, q7 -id-> q5, q7 -(-> q7
q8 -)-> q9, q8 -+-> q3

A terminal transition corresponds to a shift action in the parse table; a non-terminal transition corresponds to a goto action; a single reduce item corresponds to a reduce action.

Are we done? 142
– Can make a transition diagram for any grammar
– Can make a GOTO table for every grammar
– Cannot make a deterministic ACTION table for every grammar

LR(0) conflicts 143

Grammar: Z → E $, E → T, E → E + T, T → i, T → ( E ), T → i [ E ]

q0: Z → • E$, E → • T, E → • E + T, T → • i, T → • (E), T → • i[E], …
q5 (after i):
  T → i •        (reduce item)
  T → i • [E]    (shift item)

Shift/reduce conflict.

LR(0) conflicts 144

Grammar: Z → E $, E → T, E → E + T, T → i, V → i, T → ( E )

q0: Z → • E$, E → • T, E → • E + T, T → • i, V → • i, T → • (E), …
q5 (after i):
  T → i •        (reduce item)
  V → i •        (reduce item)

Reduce/reduce conflict.

LR(0) conflicts 145
Any grammar with an ε-rule cannot be LR(0):
– A → ε gives the reduce item A → •
– P → α • A β is a shift item
– A → • can always be predicted from P → α • A β, so the two items always appear together – an inherent shift/reduce conflict

Conflicts 146
Can construct a diagram for every grammar, but some introduce conflicts:
– shift-reduce conflict: an item set contains at least one shift item and one reduce item
– reduce-reduce conflict: an item set contains two reduce items
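Given item sets as built above, these two checks are mechanical. A small sketch; the item representation (lhs, rhs, dot), the `NONTERMINALS` set, and `classify` are illustrative names matching the slides' examples, not part of any standard API.

```python
# Sketch: detecting LR(0) conflicts in an item set.
# An item is (lhs, rhs, dot); a shift item has its dot before a terminal,
# a reduce item has its dot at the end of the right-hand side.
NONTERMINALS = {'Z', 'E', 'T', 'V'}

def classify(item_set):
    reduces = [it for it in item_set if it[2] == len(it[1])]
    shifts = [it for it in item_set                 # dot before a terminal
              if it[2] < len(it[1]) and it[1][it[2]] not in NONTERMINALS]
    if len(reduces) > 1:
        return 'reduce/reduce conflict'
    if shifts and reduces:
        return 'shift/reduce conflict'
    return 'ok'

# Slide 143: T -> i .  together with  T -> i . [E]
q5_sr = {('T', ('i',), 1), ('T', ('i', '[', 'E', ']'), 1)}
# Slide 144: T -> i .  together with  V -> i .
q5_rr = {('T', ('i',), 1), ('V', ('i',), 1)}
```

`classify(q5_sr)` reports the shift/reduce conflict of slide 143 and `classify(q5_rr)` the reduce/reduce conflict of slide 144; a state whose only item is a lone reduce item comes back as 'ok'.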

LR variants 147
– LR(0) – what we've seen so far
– SLR(1) – removes infeasible reduce actions via FOLLOW-set reasoning
– LR(1) – LR(0) with one lookahead token in the items
– LALR(1) – LR(1) with merging of states that have the same LR(0) component

LR(0) GOTO/ACTION tables 148

State | i  | +  | (  | )  | $  | E  | T  | action
q0    | q5 |    | q7 |    |    | q1 | q6 | shift
q1    |    | q3 |    |    | q2 |    |    | shift
q2    |    |    |    |    |    |    |    | Z → E$ (accept)
q3    | q5 |    | q7 |    |    |    | q4 | shift
q4    |    |    |    |    |    |    |    | E → E+T
q5    |    |    |    |    |    |    |    | T → i
q6    |    |    |    |    |    |    |    | E → T
q7    | q5 |    | q7 |    |    | q8 | q6 | shift
q8    |    | q3 |    | q9 |    |    |    | shift
q9    |    |    |    |    |    |    |    | T → (E)

The ACTION table is determined only by the state and ignores the input; the GOTO table is indexed by the state and a grammar symbol from the stack.

SLR parsing 149
A handle should not be reduced to a non-terminal N if the lookahead is a token that cannot follow N:
– a reduce item N → α • is applicable only when the lookahead is in FOLLOW(N)
– if b is not in FOLLOW(N), we just proved there is no derivation S ⇒* βNb
– thus, it is safe to remove the reduce item from the conflicted state
Differs from LR(0) only in the ACTION table:
– a row in the parsing table may now contain both shift actions and reduce actions, and we need to consult the current token to decide which one to take

SLR action table 150

LR(0) – no lookahead:

state | action
q0    | shift
q1    | shift
q2    |
q3    | shift
q4    | E → E+T
q5    | T → i  vs.  T → i[E]   (conflict)
q6    | E → T
q7    | shift
q8    | shift
q9    | T → (E)

SLR – use a 1-token lookahead (the lookahead token comes from the input):

State | i     | +      | (     | )      | [     | ]      | $
0     | shift |        | shift |        |       |        |
1     |       |        |       |        |       |        | accept
2     |       |        |       |        |       |        |
3     | shift |        | shift |        |       |        |
4     |       | E→E+T  |       | E→E+T  |       | E→E+T  | E→E+T
5     |       | T→i    |       | T→i    | shift | T→i    | T→i
6     |       | E→T    |       | E→T    |       | E→T    | E→T
…     | (as before) …

In state 5 the LR(0) conflict between T → i • and T → i • [E] disappears: shift on [, reduce T → i only on tokens in FOLLOW(T).
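The SLR pruning rule is easy to express in code. A sketch under these assumptions: the FOLLOW sets are written out by hand for the bracket grammar of this example (Z → E$, E → T | E+T, T → i | (E) | i[E]) rather than computed, and `slr_actions` is an illustrative name.

```python
# Sketch: how SLR(1) prunes LR(0) reduce actions using FOLLOW sets.
# Items are (lhs, rhs, dot). FOLLOW(E) = FOLLOW(T) = {+, ), ], $} for
# the bracket grammar; note '[' is NOT in FOLLOW(T).
FOLLOW = {'E': {'+', ')', ']', '$'}, 'T': {'+', ')', ']', '$'}}

def slr_actions(item_set, lookahead):
    """Return the set of actions applicable in this state for one lookahead."""
    acts = set()
    for (lhs, rhs, dot) in item_set:
        if dot == len(rhs):                   # reduce item N -> alpha .
            if lookahead in FOLLOW[lhs]:      # only if lookahead in FOLLOW(N)
                acts.add(('reduce', lhs, rhs))
        elif rhs[dot] == lookahead:           # shift item with the dot before lookahead
            acts.add(('shift',))
    return acts

# The conflicted LR(0) state q5: { T -> i . ,  T -> i . [E] }
q5 = {('T', ('i',), 1), ('T', ('i', '[', 'E', ']'), 1)}
```

With lookahead '[' only the shift applies; with lookahead '+' only the reduce by T → i applies, so the state is conflict-free under SLR.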

LR(1) grammars 151
– In SLR: a reduce item N → α • is applicable only when the lookahead is in FOLLOW(N)
– But FOLLOW(N) merges the lookaheads of all alternatives for N
  – Insensitive to the context of a given production
– LR(1) keeps a lookahead with each LR item
– Idea: a more refined notion of "follows", computed per item

LR(1) items 152
An LR(1) item is a pair:
– an LR(0) item
– a lookahead token
Meaning: we matched the part left of the dot, and we look to match the part on the right of the dot, followed by the lookahead token.

Example – for the grammar
(0) S' → S
(1) S → L = R
(2) S → R
(3) L → * R
(4) L → id
(5) R → L

the production L → id yields the LR(0) items [L → ● id] and [L → id ●], and the LR(1) items
[L → ● id, *], [L → ● id, =], [L → ● id, id], [L → ● id, $],
[L → id ●, *], [L → id ●, =], [L → id ●, id], [L → id ●, $]

LALR(1) 153
– LR(1) tables have a huge number of entries
– Often we don't need such a refined observation (and its cost)
– Idea: find states with the same LR(0) component and merge their lookahead components, as long as there are no conflicts
– LALR(1) is not as powerful as LR(1) in theory, but works quite well in practice
  – Merging cannot introduce new shift-reduce conflicts, only reduce-reduce ones, which are unlikely in practice

Summary 154

LR is More Powerful than LL 155
Any LL(k) language is also LR(k), i.e., LL(k) ⊂ LR(k).
– LR is more popular in automatic tools, but less intuitive
– The lookahead is counted differently in the two cases:
  – In an LL(k) derivation the algorithm sees the left-hand side of the rule plus k input tokens, and must then select the derivation rule
  – In LR(k), the algorithm "sees" the entire right-hand side of the derivation rule plus k input tokens, and then reduces
  – LR(0) sees the entire right-hand side, but no input token

Grammar Hierarchy 156

Non-ambiguous CFG ⊃ LR(1) ⊃ LALR(1) ⊃ SLR(1) ⊃ LR(0)
LL(1) ⊂ LR(1)

Earley Parsing 157 Jay Earley, PhD

Earley Parsing 158
– Invented by Jay Earley [PhD, 1968]
– Handles arbitrary context-free grammars
  – Can handle ambiguous grammars
– Complexity O(N³), where N = |input|
– Uses dynamic programming
  – Compactly encodes ambiguity

Dynamic programming 159
Break a problem P into subproblems P1…Pk
– Solve P by combining solutions for P1…Pk
– Memoize (store) solutions to subproblems instead of re-computing them
Example: the Bellman-Ford shortest-path algorithm
– Sol(x,y,i) = minimum of Sol(x,y,i−1) and Sol(t,y,i−1) + weight(x,t), over edges (x,t)
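The Bellman-Ford recurrence can be sketched as a table that builds Sol(·, i) from Sol(·, i−1). The formulation below (single-source distances, using at most i edges per path) is an illustrative variant of the slide's notation; the function name and edge format are my own.

```python
# Sketch: Bellman-Ford as dynamic programming.
# dist after iteration i holds the length of a shortest path from src
# using at most i edges -- Sol(., i) computed from Sol(., i-1).
def bellman_ford(n_vertices, edges, src):
    """edges: list of (u, v, w) for a directed edge u -> v of weight w."""
    INF = float('inf')
    dist = [INF] * n_vertices
    dist[src] = 0
    for _ in range(n_vertices - 1):        # a shortest path has at most |V|-1 edges
        new = dist[:]                      # build Sol(., i) from Sol(., i-1)
        for u, v, w in edges:
            if dist[u] + w < new[v]:
                new[v] = dist[u] + w       # relax edge (u, v)
        dist = new
    return dist
```

On a small graph with edges 0→1 (4), 0→2 (1), 2→1 (2), 1→3 (1), the distances from vertex 0 come out as 0, 3, 1, 4: the memoized two-edge path 0→2→1 beats the direct edge.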

Earley Parsing 160
A dynamic-programming implementation of a recursive descent parser:
– S[N+1]: a sequence of sets of "Earley states", N = |INPUT|
– An Earley state (item) s is a sentential form + auxiliary info
– S[i]: all parse trees that can be produced (by an RDP) after reading the first i tokens
– S[i+1] is built using S[0] … S[i]

Earley Parsing 161
Parses arbitrary grammars in O(|input|³)
– O(|input|²) for unambiguous grammars
– Linear for most LR(k) languages (next lesson)
A dynamic-programming implementation of a recursive descent parser:
– S[N+1]: a sequence of sets of "Earley states", N = |INPUT|
– An Earley state is a sentential form + auxiliary info
– S[i]: all parse trees that can be produced (by an RDP) after reading the first i tokens
– S[i+1] is built using S[0] … S[i]

Earley States 162
s = ⟨constituent, back⟩
– constituent: a dotted rule for A → αβ
  – A → •αβ: predicted constituent
  – A → α•β: in-progress constituent
  – A → αβ•: completed constituent
– back: the index of the Earley state set in which this constituent's derivation started


Earley Parser 164

Input = x[1…N]
S[0] = { ⟨S' → •S, 0⟩ }; S[1] = … = S[N] = {}
for i = 0 … N do
  until S[i] does not change do
    foreach s ∈ S[i]
      if s = ⟨X → α•aβ, b⟩ and a = x[i+1] then        // scan
        S[i+1] = S[i+1] ∪ { ⟨X → αa•β, b⟩ }
      if s = ⟨X → α•Yβ, b⟩ and Y → γ then             // predict
        S[i] = S[i] ∪ { ⟨Y → •γ, i⟩ }
      if s = ⟨Y → γ•, b⟩ and ⟨X → α•Yβ, k⟩ ∈ S[b] then // complete
        S[i] = S[i] ∪ { ⟨X → αY•β, k⟩ }
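The pseudocode above becomes a short recognizer in Python. A sketch under these assumptions: the grammar has no ε-rules (ε-completion needs extra care), the toy grammar and the names `GRAMMAR`, `earley`, and the auxiliary `GAMMA` start item are mine, and states are tuples (lhs, rhs, dot, back).

```python
# Sketch: an Earley recognizer following the scan/predict/complete rules.
# The deliberately ambiguous toy grammar is  S -> E,  E -> E + E | n.
GRAMMAR = {
    'S': [('E',)],
    'E': [('E', '+', 'E'), ('n',)],
}

def earley(tokens, start='S'):
    n = len(tokens)
    S = [set() for _ in range(n + 1)]
    S[0].add(('GAMMA', (start,), 0, 0))          # auxiliary start item
    for i in range(n + 1):
        changed = True
        while changed:                           # "until S[i] does not change"
            changed = False
            for (lhs, rhs, dot, back) in list(S[i]):
                if dot < len(rhs) and rhs[dot] in GRAMMAR:     # predict
                    for prod in GRAMMAR[rhs[dot]]:
                        item = (rhs[dot], prod, 0, i)
                        if item not in S[i]:
                            S[i].add(item); changed = True
                elif dot < len(rhs):                           # scan
                    if i < n and tokens[i] == rhs[dot]:
                        S[i + 1].add((lhs, rhs, dot + 1, back))
                else:                                          # complete
                    for (l2, r2, d2, b2) in list(S[back]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            item = (l2, r2, d2 + 1, b2)
                            if item not in S[i]:
                                S[i].add(item); changed = True
    return ('GAMMA', (start,), 1, 0) in S[n]     # did the start item complete?
```

The ambiguity of n+n+n is handled for free: all derivations share the same state sets, which is the "compact encoding of ambiguity" mentioned earlier.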

Example 165
(A worked example applying the three rules to a concrete input; the rules again:)

if s = ⟨X → α•aβ, b⟩ and a = x[i+1] then         // scan
  S[i+1] = S[i+1] ∪ { ⟨X → αa•β, b⟩ }
if s = ⟨X → α•Yβ, b⟩ and Y → γ then              // predict
  S[i] = S[i] ∪ { ⟨Y → •γ, i⟩ }
if s = ⟨Y → γ•, b⟩ and ⟨X → α•Yβ, k⟩ ∈ S[b] then  // complete
  S[i] = S[i] ∪ { ⟨X → αY•β, k⟩ }
