COMP3190: Principle of Programming Languages Formal Language Syntax.

Motivation
The problem of parsing structured text is very common.
Consider the structure of addresses (using a grammar): := := |.
Describe and recognize addresses in arbitrary text.

Outline
- DFA & NFA
- Regular expression
- Regular languages
- Context-free languages & PDA
- Scanner
- Parser

Deterministic Finite Automata (DFA)
- Q: finite set of states
- Σ: finite set of “letters” (alphabet)
- δ: Q x Σ -> Q (transition function)
- q0: start state (in Q)
- F: set of accept states (subset of Q)
- Acceptance: the input is consumed with the automaton in a final state.

Example of DFA
[State diagram: states q1 and q2]
Transition table:
δ  | 0  | 1
q1 | q1 | q2
q2 | q1 | q2
Accepts all strings that end in 1
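A minimal Python sketch of how such a DFA runs, assuming the two-state machine above moves to q2 on every 1 and to q1 on every 0. The choice of q1 as start state and q2 as the only accept state is an assumption, since the diagram itself did not survive.

```python
# Transition function delta as a dictionary: (state, symbol) -> next state.
# Assumed from the slide: start state q1, accept state q2.
DELTA = {
    ("q1", "0"): "q1", ("q1", "1"): "q2",
    ("q2", "0"): "q1", ("q2", "1"): "q2",
}
START, ACCEPT = "q1", {"q2"}


def dfa_accepts(word):
    """Consume the whole word, then check whether we stopped in an accept state."""
    state = START
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state in ACCEPT
```

Because the machine is deterministic, the run is a single pass over the input with one table lookup per symbol.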

Another Example of a DFA
[State diagram: states S, q1, q2, r1, r2 with transitions labelled a and b]
Accepts all strings that start and end with “a” OR start and end with “b”

Non-deterministic Finite Automata (NFA)
Transition function is different:
- δ: Q x Σ_ε -> P(Q)
- P(Q) is the powerset of Q (set of all subsets)
- Σ_ε is the union of Σ and the special symbol ε (denoting empty)
A string is accepted if there is at least one path leading to an accept state with the input consumed.

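Example of an NFA
[State diagram: states q1, q2, q3, q4; edge labels in order: 0,1 (loop on q1); 1 (q1 to q2); 0,ε (q2 to q3); 1 (q3 to q4); 0,1 (loop on q4)]
Transition table δ (columns 0, 1, ε), surviving entries: q1: {q1}, {q1, q2}; q2: {q3}; q3: {q4}; q4: {q4}
What strings does this NFA accept?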

Regular Expressions
R is a regular expression if R is:
- “a” for some a in Σ
- ε (the empty string)
- ∅ (the empty language)
- the union of two regular expressions
- the concatenation of two regular expressions
- R1* (Kleene closure: zero or more repetitions of R1)

Regular Expression Notation
- a: an ordinary letter
- ε: the empty string
- M | N: choosing from M or N
- MN: concatenation of M and N
- M*: zero or more times (Kleene star)
- M+: one or more times
- M?: zero or one occurrence
- [a-zA-Z]: character set alternation (choice)
- . (period): stands for any single character except newline

Examples of Regular Expressions
- {0, 1}* 0: all strings that end in 0
- {0, 1} 0*: strings that start with 1 or 0, followed by zero or more 0s
- {0, 1}*: all strings
- {0^n 1^n, n >= 0}: not a regular expression!

Converting a Regular Expression to an NFA
[Figure: NFA fragments, joined by ε-transitions, for the cases ε, a, M|N, MN, and M*]

Regular expression -> NFA
Language: strings of 0s and 1s in which the number of 0s is even
Regular expression: (1*01*0)*1*
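One way to convince yourself that (1*01*0)*1* really describes "an even number of 0s" is to compare it with a direct count on every short string. The sketch below does that with Python's `re` module; the helper names are illustrative only.

```python
import re
from itertools import product

EVEN_ZEROS = re.compile(r"(1*01*0)*1*")


def matches(word):
    """True if the whole word is described by the regular expression."""
    return EVEN_ZEROS.fullmatch(word) is not None


def has_even_zeros(word):
    """Reference definition of the language: count the 0s directly."""
    return word.count("0") % 2 == 0


def agree_up_to(max_len):
    """Compare both definitions on every binary string up to max_len symbols."""
    return all(
        matches("".join(w)) == has_even_zeros("".join(w))
        for n in range(max_len + 1)
        for w in product("01", repeat=n)
    )
```

Each pass through the starred group consumes exactly two 0s, which is why the total number of 0s stays even.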

Converting an NFA to a DFA
- For a set of states S, closure(S) is the set of states that can be reached from S without consuming any input.
- For a set of states S, DFAedge(S, c) is the set of states that can be reached from S by consuming input symbol c.
- Each set of NFA states corresponds to one DFA state (hence at most 2^n states).
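The subset construction can be sketched in Python as follows. The NFA used here is an assumption: it is one plausible reading of the four-state example shown earlier (loops on q1 and q4, q1 to q2 on 1, q2 to q3 on 0 or ε, q3 to q4 on 1), not a confirmed copy of the slide. `closure` and `dfa_edge` follow the two definitions above, with `dfa_edge` also taking the closure of the result.

```python
from collections import deque

EPS = ""  # label used for epsilon moves
# Hypothetical NFA: (state, symbol) -> set of next states.
NFA = {
    ("q1", "0"): {"q1"}, ("q1", "1"): {"q1", "q2"},
    ("q2", "0"): {"q3"}, ("q2", EPS): {"q3"},
    ("q3", "1"): {"q4"},
    ("q4", "0"): {"q4"}, ("q4", "1"): {"q4"},
}
NFA_START, NFA_ACCEPT, ALPHABET = "q1", {"q4"}, "01"


def closure(states):
    """States reachable from `states` without consuming any input."""
    result, todo = set(states), list(states)
    while todo:
        for nxt in NFA.get((todo.pop(), EPS), ()):
            if nxt not in result:
                result.add(nxt)
                todo.append(nxt)
    return frozenset(result)


def dfa_edge(states, symbol):
    """States reachable from `states` by consuming `symbol`, then epsilon moves."""
    moved = set()
    for state in states:
        moved |= NFA.get((state, symbol), set())
    return closure(moved)


def subset_construction():
    """Build the DFA: each DFA state is a frozenset of NFA states."""
    start = closure({NFA_START})
    table, todo = {}, deque([start])
    while todo:
        current = todo.popleft()
        if current in table:
            continue
        table[current] = {c: dfa_edge(current, c) for c in ALPHABET}
        todo.extend(table[current].values())
    return start, table


def dfa_accepts(word):
    """Run the constructed DFA; a subset accepts if it holds an NFA accept state."""
    state, table = subset_construction()
    for symbol in word:
        state = table[state][symbol]
    return bool(state & NFA_ACCEPT)
```

Only subsets that are actually reachable get built, so the DFA is usually far smaller than the 2^n worst case.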

NFA -> DFA Initial classes: {A, B, E}, {C, D} No class requires partitioning! Hence a two-state DFA is obtained.

Obtaining the minimal equivalent DFA
- Initially two equivalence classes: final and nonfinal states.
  - Search for an equivalence class C and an input letter a such that, with a as input, the states in C make transitions to states in k > 1 different equivalence classes.
- Partition C into k classes accordingly.
- Repeat until unable to find a class to partition.

Example (cont.)

Regular Grammar
- Later definitions build on earlier ones
- Nothing defined in terms of itself (no recursion)
Regular grammar for numeric literals in Pascal:
digit -> 0 | 1 | 2 | ... | 8 | 9
unsigned_integer -> digit digit*
unsigned_number -> unsigned_integer ((. unsigned_integer) | ε) ((e (+ | - | ε) unsigned_integer) | ε)
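Because no rule refers to itself, each rule can be substituted into the next to give one regular expression. The Python sketch below does this by string substitution; the constant names and `is_unsigned_number` are illustrative, not part of the slides.

```python
import re

# Each definition builds only on earlier ones, mirroring the grammar above.
DIGIT = r"[0-9]"
UNSIGNED_INTEGER = rf"{DIGIT}{DIGIT}*"
UNSIGNED_NUMBER = (
    rf"{UNSIGNED_INTEGER}"
    rf"(\.{UNSIGNED_INTEGER})?"       # optional fraction: (. unsigned_integer) | ε
    rf"(e[+-]?{UNSIGNED_INTEGER})?"   # optional exponent: e (+ | - | ε) unsigned_integer
)


def is_unsigned_number(text):
    """True if the whole text is an unsigned_number literal."""
    return re.fullmatch(UNSIGNED_NUMBER, text) is not None
```

For example, `is_unsigned_number("6.02e+23")` holds, while `"1."` and `".5"` are rejected because both the integer part and the digits after the period are mandatory.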

Languages and Automata in Programming Languages
- Regular languages
  - Recognized (accepted) by finite automata
  - Useful for tokenizing program text (lexical analysis)
- Context-free languages
  - Recognized (accepted) by pushdown automata
  - Useful for parsing the syntax of a program

Important Theorems
- A language is regular if and only if a regular expression describes it.
- A language is regular if and only if a finite automaton recognizes it.
- DFAs and NFAs are equally powerful.

Context-free Grammars
Context-free grammars are defined by substitution rules:
S -> PVP
P -> N
P -> AP
A -> big | green
N -> cheese | Jim
V -> ate
Example sentences:
- Big Jim ate green cheese
- green Jim ate green cheese
- Jim ate cheese
- Cheese ate Jim

Context-free Grammars
- Context-free grammars are used to formally describe the syntax of programming languages.
- Every syntactically correct program is derived using the context-free grammar of the language.
- Parsing a program involves tracing such a derivation, given the context-free grammar and the program.

Context-free Grammars
A context-free grammar consists of:
- V: a finite set of variables
- Σ: a finite set of terminals
- R: a finite set of rules of the form variable -> {variable, terminal}*
- S: the start variable

Pushdown Automata (PDA)
A pushdown automaton consists of:
- Q: a set of states
- Σ: input alphabet (of terminals)
- Γ: stack alphabet
- δ: a set of transition rules, Q x Σ_ε x Γ_ε -> P(Q x Γ_ε)
  (currentState, inputSymbol, headOfStack -> newState, pushSymbolOnStack)
- q0: the start state
- F: the set of accept states (subset of Q)
Deterministic: at most one move is possible from any configuration

How does a PDA accept?
- By final state:
  - Consume all the input while
  - Reaching a final state
- By empty stack:
  - Consume all the input while
  - Having an empty stack
  - Set of final states is irrelevant

Example of a PDA
[State diagram: states q1, q2, q3, q4 with transition labels ε, ε -> $; 0, ε -> 0; 1, 0 -> ε; ε, $ -> ε]
Notation: a, b -> c: when the PDA reads “a” from input, it replaces “b” at the top of the stack with “c”.
What does this PDA accept?
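To experiment with the question, the Python sketch below simulates a PDA by breadth-first search over configurations (state, input position, stack). The transition set is an assumption: it is the usual four-state layout that fits the labels above (push $ first, push each 0, pop a 0 for each 1, finally pop $), with q1 and q4 taken as accept states. Under that reading the machine accepts strings of the form 0^n 1^n.

```python
from collections import deque

# (state, input symbol or "", stack top or "") -> list of (next state, pushed symbol or "").
# Assumed transitions and accept states; the empty string stands for epsilon.
DELTA = {
    ("q1", "", ""): [("q2", "$")],
    ("q2", "0", ""): [("q2", "0")],
    ("q2", "1", "0"): [("q3", "")],
    ("q3", "1", "0"): [("q3", "")],
    ("q3", "", "$"): [("q4", "")],
}
START, ACCEPT = "q1", {"q1", "q4"}


def pda_accepts(word):
    """Accept by final state: some run consumes all input and ends in ACCEPT."""
    seen = set()
    queue = deque([(START, 0, "")])  # the stack is a string whose last character is the top
    while queue:
        config = queue.popleft()
        if config in seen:
            continue
        seen.add(config)
        state, pos, stack = config
        if pos == len(word) and state in ACCEPT:
            return True
        for (src, read, pop), moves in DELTA.items():
            if src != state:
                continue
            if read and (pos >= len(word) or word[pos] != read):
                continue
            if pop and not stack.endswith(pop):
                continue
            rest = stack[:-1] if pop else stack
            for dst, push in moves:
                queue.append((dst, pos + len(read), rest + push))
    return False
```

The search explores every non-deterministic choice, so the same function also works for PDAs that have to guess, provided the machine has no ε-loop that pushes forever.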

Important Theorems
- A language is context-free iff a pushdown automaton recognizes it
- Non-deterministic PDAs are more powerful than deterministic ones

Example of Context-free Language That Requires a Non-deterministic PDA
{w w^R | w belongs to {0, 1}*}, where w^R is w written backwards
Idea: non-deterministically guess the middle of the input string

The Solution
[State diagram: states q1, q2, q3, q4 with transition labels ε, ε -> $; 0, ε -> 0; 1, ε -> 1; ε, ε -> ε; 1, 1 -> ε; 0, 0 -> ε; ε, $ -> ε]

Derivations and Parse Trees
Nested constructs require recursion, i.e. context-free grammars.
CFG for arithmetic expressions:
expression -> identifier | number | - expression | ( expression ) | expression operator expression
operator -> + | - | * | /

Parse Tree for Slope*x + Intercept Is this the only parse tree for this expression and grammar?

A Better Expression Grammar
1. expression -> term | expression add_op term
2. term -> factor | term mult_op factor
3. factor -> identifier | number | - factor | ( expression )
4. add_op -> + | -
5. mult_op -> * | /
A good grammar reflects the internal structure of programs. This grammar is unambiguous and captures (HOW?):
- operator precedence (*, / bind tighter than +, -)
- associativity (operators group left to right)

And Better Parse Trees *

Syntax-directed Compilation
- Parser calls scanner to obtain tokens.
- Assembles tokens into parse tree.
- Passes tree to later phases of compilation.
- Scanner: deterministic finite automaton.
- Parser: pushdown automaton.
- Scanners and parsers can be generated automatically from regular expressions and CFGs (e.g., lex/yacc).

Scanning
- Accept the longest possible token in each invocation of the scanner.
- Implementation:
  - Capture finite automata.
    - Case (switch) statements.
    - Table and driver.

Scanner for Pascal

Scanner for Pascal(case Statements)

Scanner (Table&driver)

Scanner Generators
- Start with a regular expression.
- Construct an NFA from it.
- Use a set-of-subsets construction to obtain an equivalent DFA.
- Construct the minimal equivalent DFA.

Outline
- DFA & NFA
- Regular expression
- Regular languages
- Context-free languages & PDA
- Scanner
- Parser
  - Top-down parsing
  - Bottom-up parsing
  - Comparison

Parsing approaches
- Parsing an arbitrary context-free grammar costs O(n^3) in general.
- Need classes of grammars that can be parsed in linear time:
  - Top-down, also called predictive, recursive-descent, or LL parsing (Left-to-right scan, Left-most derivation)
  - Bottom-up, also called shift-reduce or LR parsing (Left-to-right scan, Right-most derivation)

A Simple Grammar for a Comma-separated List of Identifiers
id_list -> id id_list_tail
id_list_tail -> , id id_list_tail
id_list_tail -> ;
String to be parsed: A, B, C;

Top-down/bottom-up Parsing

Top-down Parsing
- Predicts a derivation
- Matches non-terminal against token observed in input

LL(1) Grammar
- A grammar for which a top-down deterministic parser can be produced with one token of look-ahead.
- LL(1) grammar:
  - For a given non-terminal, the lookahead symbol uniquely determines the production to apply
  - Top-down parsing = predictive parsing
  - Driven by a predictive parsing table: non-terminals x terminals -> productions

From Last Time: Parsing with Table
Grammar:
S -> ES'
S' -> ε | +S
E -> num | (S)
Parsing table:
   | num    | +     | (      | )    | $
S  | -> ES' |       | -> ES' |      |
S' |        | -> +S |        | -> ε | -> ε
E  | -> num |       | -> (S) |      |
Input: (1+2+(3+4))+5
Partly-derived string | Lookahead
ES'                   | (
(S)S'                 | 1
(ES')S'               | 1
(1S')S'               | +
(1+ES')S'             | 2
(1+2S')S'             | +

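How to Construct Parsing Tables?
Needed: an algorithm for automatically generating a predictive parse table from a grammar
Grammar:
S -> ES'
S' -> ε | +S
E -> number | (S)
Target table (how is it derived?):
   | num    | +     | (      | )    | $
S  | -> ES' |       | -> ES' |      |
S' |        | -> +S |        | -> ε | -> ε
E  | -> num |       | -> (S) |      |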

Constructing Parse Tables
- Can construct a predictive parser if:
  - For every non-terminal, every lookahead symbol can be handled by at most 1 production
- FIRST(γ) for an arbitrary string of terminals and non-terminals γ is:
  - Set of symbols that might begin the fully expanded version of γ
- FOLLOW(X) for a non-terminal X is:
  - Set of symbols that might follow the derivation of X in the input stream

Parse Table Entries
- Consider a production X -> γ
- Add -> γ to the X row for each symbol in FIRST(γ)
- If γ can derive ε (γ is nullable), add -> γ for each symbol in FOLLOW(X)
- Grammar is LL(1) if there are no conflicting entries
Grammar:
S -> ES'
S' -> ε | +S
E -> number | (S)
   | num    | +     | (      | )    | $
S  | -> ES' |       | -> ES' |      |
S' |        | -> +S |        | -> ε | -> ε
E  | -> num |       | -> (S) |      |

Computing Nullable
- X is nullable if it can derive the empty string:
  - If it derives ε directly (X -> ε)
  - If it has a production X -> YZ... where all RHS symbols (Y, Z) are nullable
- Algorithm: assume all non-terminals are non-nullable, apply rules repeatedly until no change
Grammar:
S -> ES'
S' -> ε | +S
E -> number | (S)
Only S' is nullable

Computing FIRST
Determining FIRST(X):
1. If X is a terminal, then add X to FIRST(X).
2. If X -> ε, then add ε to FIRST(X).
3. If X is a nonterminal and X -> Y1 Y2 ... Yk, then a is in FIRST(X) if a is in FIRST(Yi) and ε is in FIRST(Yj) for j = 1...i-1 (i.e., it is possible to have an empty prefix Y1 ... Yi-1).
4. If ε is in FIRST(Y1 Y2 ... Yk), then ε is in FIRST(X).

FIRST Example
Grammar:
S -> ES'
S' -> ε | +S
E -> number | (S)
Apply rule 1: FIRST(num) = {num}, FIRST(+) = {+}, etc.
Apply rule 2: FIRST(S') = {ε}
Apply rule 3: FIRST(S) = FIRST(E) = {}; FIRST(S') = FIRST('+') + {ε} = {ε, +}; FIRST(E) = FIRST(num) + FIRST('(') = {num, ( }
Rule 3 again: FIRST(S) = FIRST(E) = {num, ( }; FIRST(S') = {ε, +}; FIRST(E) = {num, ( }
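The FIRST rules can be run as a fixed-point iteration: keep applying them to every production until no set grows. The Python sketch below does this for the example grammar; the dictionary encoding (an empty list for an ε-production) and the function name are choices made for this illustration.

```python
EPS = "ε"
# Non-terminal -> list of right-hand sides; [] encodes an ε-production.
GRAMMAR = {
    "S": [["E", "S'"]],
    "S'": [[], ["+", "S"]],
    "E": [["num"], ["(", "S", ")"]],
}


def compute_first(grammar):
    """Repeat rules 2-4 over every production until no FIRST set changes."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                before = len(first[head])
                for symbol in body:
                    if symbol not in grammar:      # rule 1: FIRST(terminal) = {terminal}
                        first[head].add(symbol)
                        break
                    first[head] |= first[symbol] - {EPS}
                    if EPS not in first[symbol]:   # prefix is no longer nullable
                        break
                else:
                    first[head].add(EPS)           # every symbol nullable, or empty body
                changed |= len(first[head]) != before
    return first


FIRST = compute_first(GRAMMAR)
NULLABLE = {nt for nt, symbols in FIRST.items() if EPS in symbols}
```

The nullable set falls out for free: a non-terminal is nullable exactly when ε ends up in its FIRST set.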

Computing FOLLOW
Determining FOLLOW(X):
1. If S is the start symbol, then $ is in FOLLOW(S).
2. If A -> αBβ, then add everything in FIRST(β) except ε to FOLLOW(B).
3. If A -> αB, or A -> αBβ and ε is in FIRST(β), then add FOLLOW(A) to FOLLOW(B).

FOLLOW Example
Grammar:
S -> ES'
S' -> ε | +S
E -> number | (S)
FIRST(S) = {num, ( }; FIRST(S') = {ε, +}; FIRST(E) = {num, ( }
Apply rule 1: FOL(S) = {$}
Apply rule 2:
- S -> ES': FOL(E) += {FIRST(S') - ε} = {+}
- S' -> ε | +S: nothing to add
- E -> num | (S): FOL(S) += {FIRST(')') - ε} = {$, )}
Apply rule 3:
- S -> ES': FOL(E) += FOL(S) = {+, $, )} (because S' is nullable); FOL(S') += FOL(S) = {$, )}
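The three FOLLOW rules can be iterated the same way. This Python sketch takes the FIRST sets from the example as given constants and repeats rules 2 and 3 until nothing changes; `first_of_string` and `compute_follow` are names invented for the illustration.

```python
EPS, END = "ε", "$"
GRAMMAR = {
    "S": [["E", "S'"]],
    "S'": [[], ["+", "S"]],
    "E": [["num"], ["(", "S", ")"]],
}
# FIRST sets of the non-terminals, copied from the example above.
FIRST = {"S": {"num", "("}, "S'": {EPS, "+"}, "E": {"num", "("}}


def first_of_string(symbols):
    """FIRST of a sequence of symbols; a terminal's FIRST set is itself."""
    result = set()
    for symbol in symbols:
        symbol_first = FIRST.get(symbol, {symbol})
        result |= symbol_first - {EPS}
        if EPS not in symbol_first:
            return result
    result.add(EPS)  # the whole sequence can vanish (also true for the empty sequence)
    return result


def compute_follow(grammar, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)                            # rule 1
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, symbol in enumerate(body):
                    if symbol not in grammar:
                        continue                      # FOLLOW is only defined for non-terminals
                    before = len(follow[symbol])
                    rest = first_of_string(body[i + 1:])
                    follow[symbol] |= rest - {EPS}    # rule 2
                    if EPS in rest:
                        follow[symbol] |= follow[head]  # rule 3
                    changed |= len(follow[symbol]) != before
    return follow


FOLLOW = compute_follow(GRAMMAR, "S")
```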

Putting it all Together
FOLLOW(S) = { $, ) }; FOLLOW(S') = { $, ) }; FOLLOW(E) = { +, ), $ }
FIRST(S) = {num, ( }; FIRST(S') = {ε, +}; FIRST(E) = {num, ( }
- Consider a production X -> γ
- Add -> γ to the X row for each symbol in FIRST(γ)
- If γ can derive ε (γ is nullable), add -> γ for each symbol in FOLLOW(X)
Grammar:
S -> ES'
S' -> ε | +S
E -> number | (S)
   | num    | +     | (      | )    | $
S  | -> ES' |       | -> ES' |      |
S' |        | -> +S |        | -> ε | -> ε
E  | -> num |       | -> (S) |      |
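Filling the table from FIRST and FOLLOW, and then driving a parser from it, can be sketched in Python as below. The FIRST and FOLLOW sets are entered as constants taken from the slide; `build_table` and `ll1_accepts` are hypothetical helper names, and the parser only answers yes or no rather than building a tree.

```python
EPS, END = "ε", "$"
GRAMMAR = {
    "S": [["E", "S'"]],
    "S'": [[], ["+", "S"]],
    "E": [["num"], ["(", "S", ")"]],
}
FIRST = {"S": {"num", "("}, "S'": {EPS, "+"}, "E": {"num", "("}}
FOLLOW = {"S": {END, ")"}, "S'": {END, ")"}, "E": {"+", ")", END}}


def first_of_string(symbols):
    """FIRST of a sequence of symbols; a terminal's FIRST set is itself."""
    result = set()
    for symbol in symbols:
        symbol_first = FIRST.get(symbol, {symbol})
        result |= symbol_first - {EPS}
        if EPS not in symbol_first:
            return result
    result.add(EPS)
    return result


def build_table():
    """(non-terminal, lookahead) -> right-hand side; a clash means not LL(1)."""
    table = {}
    for head, bodies in GRAMMAR.items():
        for body in bodies:
            first = first_of_string(body)
            lookaheads = first - {EPS}
            if EPS in first:
                lookaheads |= FOLLOW[head]
            for terminal in lookaheads:
                if (head, terminal) in table:
                    raise ValueError(f"conflict at ({head}, {terminal}): not LL(1)")
                table[(head, terminal)] = body
    return table


def ll1_accepts(tokens):
    """Table-driven predictive parse with an explicit stack."""
    table = build_table()
    tokens = list(tokens) + [END]
    stack, pos = [END, "S"], 0
    while stack:
        top = stack.pop()
        if top in GRAMMAR:
            body = table.get((top, tokens[pos]))
            if body is None:
                return False              # empty table cell: syntax error
            stack.extend(reversed(body))  # leftmost symbol ends up on top
        elif top == tokens[pos]:
            pos += 1                      # terminal matches the lookahead
        else:
            return False
    return pos == len(tokens)
```

`build_table()` produces seven entries for this grammar, matching the seven filled cells of the table.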

Ambiguous Grammars
Construction of a predictive parse table for an ambiguous grammar results in conflicts in the table (i.e., 2 or more productions to apply in the same cell).
S -> S + S | S * S | num
FIRST(S+S) = FIRST(S*S) = FIRST(num) = { num }

Class Problem
E -> E + T | T
T -> T * F | F
F -> (E) | num | ε
1. Compute FIRST and FOLLOW sets for this grammar
2. Compute parse table entries

Top-Down Parsing Up to This Point
- Now we know:
  - How to build a parsing table for an LL(1) grammar (i.e., FIRST/FOLLOW)
  - How to construct a recursive-descent parser from the parsing table
  - Call tree = parse tree
- Open question: can we generate the AST?

Creating the Abstract Syntax Tree
Some class definitions to assist with AST construction:

class Expr {}
class Add extends Expr {
    Expr left, right;
    Add(Expr L, Expr R) { left = L; right = R; }
}
class Num extends Expr {
    int value;
    Num(int v) { value = v; }
}

Class hierarchy: Expr is the base class of Num and Add.

Creating the AST
Input: (1 + 2 + (3 + 4)) + 5
[Figure: parse tree for the input, with nodes S, E, +, ( and ) above the leaves 1, 2, 3, 4, 5]
- We got the parse tree from the call tree
- Just add code to each parsing routine to create the appropriate nodes
- Works because the parse tree and call tree are the same shape, and the AST is just a compressed form of the parse tree

AST Creation: parse_E
Grammar: S -> ES'; S' -> ε | +S; E -> number | (S)
Remember, token is the lookahead token.

Expr parse_E() {
    switch (token) {
        case num: {  // E -> number
            Expr result = new Num(token.value);
            token = input.read();
            return result;
        }
        case '(': {  // E -> (S)
            token = input.read();
            Expr result = parse_S();
            if (token != ')') ParseError();
            token = input.read();
            return result;
        }
        default: ParseError();
    }
}

AST Creation: parse_S
Grammar: S -> ES'; S' -> ε | +S; E -> number | (S)

Expr parse_S() {
    switch (token) {
        case num:
        case '(': {  // S -> ES'
            Expr left = parse_E();
            Expr right = parse_Sprime();  // parses S'; null when S' -> ε
            if (right == null) return left;
            else return new Add(left, right);
        }
        default: ParseError();
    }
}
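For readers who want to run the idea, here is a self-contained Python sketch of the same recursive-descent scheme. It is not the slides' code: the token representation (Python ints for numbers, one-character strings for punctuation), the `Parser` class, and the name `parse_S_tail` for the S' routine are all assumptions made for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Num:
    value: int


@dataclass
class Add:
    left: object
    right: object


class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # "$" marks end of input
        self.pos = 0

    @property
    def token(self):
        """The current lookahead token."""
        return self.tokens[self.pos]

    def advance(self):
        self.pos += 1

    def parse_E(self):
        if isinstance(self.token, int):     # E -> num
            node = Num(self.token)
            self.advance()
            return node
        if self.token == "(":               # E -> ( S )
            self.advance()
            node = self.parse_S()
            if self.token != ")":
                raise SyntaxError("expected ')'")
            self.advance()
            return node
        raise SyntaxError(f"unexpected token {self.token!r}")

    def parse_S_tail(self):
        if self.token == "+":               # S' -> + S
            self.advance()
            return self.parse_S()
        if self.token in (")", "$"):        # S' -> ε, chosen on FOLLOW(S')
            return None
        raise SyntaxError(f"unexpected token {self.token!r}")

    def parse_S(self):                      # S -> E S'
        left = self.parse_E()
        right = self.parse_S_tail()
        return left if right is None else Add(left, right)


def parse(tokens):
    parser = Parser(tokens)
    tree = parser.parse_S()
    if parser.token != "$":
        raise SyntaxError("trailing input")
    return tree
```

With this grammar, `parse([1, "+", 2, "+", 3])` yields `Add(Num(1), Add(Num(2), Num(3)))`: the right-recursive rule S' -> +S makes the tree lean to the right.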

Grammars
- Have been using a grammar for the language “sums with parentheses”: (1+2+(3+4))+5
- Started with a simple, right-associative grammar:
  - S -> E + S | E
  - E -> num | (S)
- Transformed it to an LL(1) grammar by left factoring:
  - S -> ES'
  - S' -> ε | +S
  - E -> num | (S)
- What if we start with a left-associative grammar?
  - S -> S + E | E
  - E -> num | (S)

Reminder: Left vs Right Associativity
Right recursion: right associative
S -> E + S
S -> E
E -> num
Left recursion: left associative
S -> S + E
S -> E
E -> num
Consider a simpler string on a simpler grammar: “ ”

Left Recursion
Grammar: S -> S + E; S -> E; E -> num
Input: “ ”
[Table with columns derived string, lookahead, read/unread; surviving entries: S, S+E, S+E+E, S+E+E+E, E+E+E+E, E+E+E, E+E, E, $]
Is this right? If not, what’s the problem?

Left-Recursive Grammars
- Left-recursive grammars don’t work with top-down parsers: we don’t know when to stop the recursion
- Left-recursive grammars are NOT LL(1)!
  - S -> Sα
  - S -> β
- In the parse table:
  - Both productions will appear in the predictive table at row S in all the columns corresponding to FIRST(β)

Eliminate Left Recursion
- Replace
  - X -> Xα1 | ... | Xαm
  - X -> β1 | ... | βn
- With
  - X -> β1X' | ... | βnX'
  - X' -> α1X' | ... | αmX' | ε
- See complete algorithm in Dragon book
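The rewrite rule can be applied mechanically. The Python sketch below handles only immediate left recursion for a single non-terminal (the full algorithm in the Dragon book also covers indirect recursion); the function name and the list-of-symbols encoding are assumptions for the illustration.

```python
def eliminate_left_recursion(head, bodies):
    """Rewrite X -> X a1 | ... | X am | b1 | ... | bn without immediate left recursion.

    `bodies` is a list of right-hand sides, each a list of symbols; [] encodes ε.
    Returns a dict mapping X (and the new X') to their new right-hand sides.
    """
    new_head = head + "'"
    alphas = [body[1:] for body in bodies if body and body[0] == head]
    betas = [body for body in bodies if not body or body[0] != head]
    if not alphas:
        return {head: bodies}           # nothing to do
    return {
        head: [beta + [new_head] for beta in betas],
        new_head: [alpha + [new_head] for alpha in alphas] + [[]],
    }


# S -> S + E | E   becomes   S -> E S'   and   S' -> + E S' | ε
RESULT = eliminate_left_recursion("S", [["S", "+", "E"], ["E"]])
```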

Class Problem
Transform the following grammar to eliminate left recursion:
E -> E + T | T
T -> T * F | F
F -> (E) | num

Creating an LL(1) Grammar
- Start with a left-recursive grammar
    S -> S + E
    S -> E
  and apply the left-recursion elimination algorithm
    S -> ES'
    S' -> +ES' | ε
- Start with a right-recursive grammar
    S -> E + S
    S -> E
  and apply left factoring to eliminate common prefixes
    S -> ES'
    S' -> +S | ε

Top-Down Parsing Summary
Language grammar
-> (left-recursion elimination, left factoring) -> LL(1) grammar
-> (FIRST, FOLLOW) -> predictive parsing table
-> recursive-descent parser
-> parser with AST generation

New Topic: Bottom-Up Parsing
- A more powerful parsing technology
- LR grammars: more expressive than LL
  - Construct right-most derivation of program
  - Handle left-recursive grammars; the grammars of virtually all programming languages are left-recursive
  - Easier to express syntax
- Shift-reduce parsers
  - Parsers for LR grammars
  - Automatic parser generators (yacc, bison)

Bottom-Up Parsing (2)
- Right-most derivation, backward
  - Start with the tokens
  - End with the start symbol
  - Match substring on RHS of production, replace by LHS
Grammar: S -> S + E | E; E -> num | (S)
(1+2+(3+4))+5
← (E+2+(3+4))+5
← (S+2+(3+4))+5
← (S+E+(3+4))+5
← (S+(3+4))+5
← (S+(E+4))+5
← (S+(S+4))+5
← (S+(S+E))+5
← (S+(S))+5
← (S+E)+5
← (S)+5
← E+5
← S+E
← S

Shift-Reduce Parsing
- Parsing actions: a sequence of shift and reduce operations
- Parser state: a stack of terminals and non-terminals (grows to the right)
- Current derivation step = stack + input
Derivation steps (each is the stack followed by the unconsumed input; the slide's split between the two is not preserved):
(1+2+(3+4))+5
(E+2+(3+4))+5
(S+2+(3+4))+5
(S+E+(3+4))+5
...

Shift-Reduce Actions
- Parsing is a sequence of shifts and reduces
- Shift: move look-ahead token to stack
- Reduce: replace symbols γ from top of stack with non-terminal symbol X corresponding to the production X -> γ (e.g., pop γ, push X)
Shift example:
stack | input         | action
(     | 1+2+(3+4))+5  | shift 1
(1    | +2+(3+4))+5   |
Reduce example:
stack | input      | action
(S+E  | +(3+4))+5  | reduce S -> S+E
(S    | +(3+4))+5  |

Shift-Reduce Parsing
Grammar: S -> S + E | E; E -> num | (S)
derivation (= stack + input stream) | action
(1+2+(3+4))+5 | shift
(1+2+(3+4))+5 | reduce E -> num
(E+2+(3+4))+5 | reduce S -> E
(S+2+(3+4))+5 | shift
(S+2+(3+4))+5 | reduce E -> num
(S+E+(3+4))+5 | reduce S -> S+E
(S+(3+4))+5 | shift
(S+(3+4))+5 | reduce E -> num
...

Potential Problems
- How do we know which action to take: whether to shift or reduce, and which production to apply?
- Issues:
  - Sometimes can reduce but should not
  - Sometimes can reduce in different ways

Action Selection Problem
- Given stack σ and look-ahead symbol b, should the parser:
  - Shift b onto the stack, making it σb?
  - Reduce X -> γ, assuming that the stack has the form σ = αγ, making it αX?
- If the stack has the form αγ, should we apply reduction X -> γ (or shift) depending on the stack prefix α?
  - α is different for different possible reductions, since the γ’s have different lengths

LR Parsing Engine
- Basic mechanism:
  - Use a set of parser states
  - Use a stack with alternating symbols and states, e.g., 1 ( 6 S (the numbers are states)
  - Use a parsing table to determine what action to apply (shift/reduce) and the next state
- The parser actions can be precisely determined from the table

LR Parsing Table
- Layout: one row per state; the action table has one column per terminal (next action and next state), the goto table has one column per non-terminal (next state)
- Algorithm: look at the entry for current state S and input terminal C
  - If Table[S,C] = s(S') then shift: push(C), push(S')
  - If Table[S,C] = X -> γ then reduce: pop(2*|γ|), S' = top(), push(X), push(Table[S',X])

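LR Parsing Table Example
Columns ( ) id , $ are input terminals; columns S and L are non-terminals.
State | (        | )        | id       | ,        | $        | S  | L
1     | s3       |          | s2       |          |          | g4 |
2     | S -> id  | S -> id  | S -> id  | S -> id  | S -> id  |    |
3     | s3       |          | s2       |          |          | g7 | g5
4     |          |          |          |          | accept   |    |
5     |          | s6       |          | s8       |          |    |
6     | S -> (L) | S -> (L) | S -> (L) | S -> (L) | S -> (L) |    |
7     | L -> S   | L -> S   | L -> S   | L -> S   | L -> S   |    |
8     | s3       |          | s2       |          |          | g9 |
9     | L -> L,S | L -> L,S | L -> L,S | L -> L,S | L -> L,S |    |
We want to derive this in an algorithmic fashion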

Parsing Example ((a),b)
Grammar: S -> (L) | id; L -> S | L,S
derivation | stack       | input    | action
((a),b)    | 1           | ((a),b)  | shift, goto 3
((a),b)    | 1(3         | (a),b)   | shift, goto 3
((a),b)    | 1(3(3       | a),b)    | shift, goto 2
((a),b)    | 1(3(3a2     | ),b)     | reduce S -> id
((S),b)    | 1(3(3S7     | ),b)     | reduce L -> S
((L),b)    | 1(3(3L5     | ),b)     | shift, goto 6
((L),b)    | 1(3(3L5)6   | ,b)      | reduce S -> (L)
(S,b)      | 1(3S7       | ,b)      | reduce L -> S
(L,b)      | 1(3L5       | ,b)      | shift, goto 8
(L,b)      | 1(3L5,8     | b)       | shift, goto 2
(L,b)      | 1(3L5,8b2   | )        | reduce S -> id
(L,S)      | 1(3L5,8S9   | )        | reduce L -> L,S
(L)        | 1(3L5       | )        | shift, goto 6
(L)        | 1(3L5)6     | $        | reduce S -> (L)
S          | 1S4         | $        | done
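The table-driven engine behind this trace fits in a few lines of Python. The sketch below hard-codes the table for S -> (L) | id, L -> S | L,S as three dictionaries, which is a layout chosen for the illustration. Because the table is LR(0), a reduce state reduces regardless of the lookahead, so reductions are keyed by state alone.

```python
# Shift and accept entries: (state, terminal) -> action.
ACTION = {
    (1, "("): ("shift", 3), (1, "id"): ("shift", 2),
    (3, "("): ("shift", 3), (3, "id"): ("shift", 2),
    (4, "$"): ("accept",),
    (5, ")"): ("shift", 6), (5, ","): ("shift", 8),
    (8, "("): ("shift", 3), (8, "id"): ("shift", 2),
}
# Reduce states: state -> (left-hand side, length of right-hand side).
REDUCE = {2: ("S", 1), 6: ("S", 3), 7: ("L", 1), 9: ("L", 3)}
# Goto entries: (state, non-terminal) -> next state.
GOTO = {(1, "S"): 4, (3, "S"): 7, (3, "L"): 5, (8, "S"): 9}


def lr_parse(tokens):
    """Return (accepted, trace) for a list of tokens such as ["(", "id", ")"]."""
    tokens = list(tokens) + ["$"]
    stack, pos, trace = [1], 0, []       # the stack alternates states and symbols
    while True:
        state = stack[-1]
        if state in REDUCE:
            lhs, size = REDUCE[state]
            del stack[-2 * size:]        # pop |rhs| symbols and their states
            stack += [lhs, GOTO[(stack[-1], lhs)]]
            trace.append(f"reduce {lhs}")
            continue
        action = ACTION.get((state, tokens[pos]))
        if action is None:
            return False, trace          # empty table cell: syntax error
        if action[0] == "accept":
            return True, trace
        stack += [tokens[pos], action[1]]
        pos += 1
        trace.append(f"shift {action[1]}")
```

Calling `lr_parse(["(", "(", "id", ")", ",", "id", ")"])` replays the shifts and reductions for ((a),b), with every identifier represented by the token `id`.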

LR(k) Grammars
- LR(k) = Left-to-right scanning, Right-most derivation, k lookahead tokens
- Main cases:
  - LR(0), LR(1)
  - Some variations: SLR and LALR(1)
- Parsers for LR(0) grammars:
  - Determine the actions without any lookahead
  - Will help us understand shift-reduce parsing

Building LR(0) Parsing Tables
- To build the parsing table:
  - Define states of the parser
  - Build a DFA to describe transitions between states
  - Use the DFA to build the parsing table
- Each LR(0) state is a set of LR(0) items
  - An LR(0) item: X -> α . β, where X -> αβ is a production in the grammar
  - The LR(0) items keep track of the progress on all of the possible upcoming productions
  - The item X -> α . β abstracts the fact that the parser already matched the string α at the top of the stack

Example LR(0) State
- An LR(0) item is a production from the language with a separator “.” somewhere in the RHS of the production
- Sub-string before “.” is already on the stack (beginnings of possible γ’s to be reduced)
- Sub-string after “.”: what we might see next
Example state, with two items:
E -> num .
E -> ( . S )

Class Problem
For the production E -> num | (S), two items are:
E -> num .
E -> ( . S )
Are there any others? If so, what are they? If not, why?

LR(0) Grammar
- Nested lists
  - S -> (L) | id
  - L -> S | L,S
- Examples
  - (a,b,c)
  - ((a,b), (c,d), (e,f))
  - (a, (b,c,d), ((f,g)))
[Figure: parse tree for (a, (b,c), d)]

Start State and Closure
- Start state
  - Augment grammar with production: S' -> S $
  - Start state of DFA has empty stack: S' -> . S $
- Closure of a parser state:
  - Start with Closure(S) = S
  - Then for each item in S of the form X -> α . Y β, add items for all the productions Y -> γ to the closure of S: Y -> . γ
  - Repeat until no new items are added

Closure Example
Grammar: S -> (L) | id; L -> S | L,S
DFA start state: S' -> . S $
Closure:
S' -> . S $
S -> . (L)
S -> . id
- Set of possible productions to be reduced next
- Added items have the “.” located at the beginning: no symbols for these items on the stack yet

The Goto Operation
- Goto operation = describes transitions between parser states, which are sets of items
- Algorithm: for a state I and a symbol Y
  - If the item [X -> α . Y β] is in I, then
  - Goto(I, Y) = Closure( [X -> α Y . β] )
Example: for the state S = { S' -> . S $, S -> . (L), S -> . id }, Goto(S, ‘(‘) = Closure( { S -> ( . L) } )
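Closure and Goto translate almost directly into set operations. In the Python sketch below an LR(0) item is a triple (head, body, dot position); that encoding and the function names are assumptions for the illustration. It uses the nested-list grammar with the augmented production S' -> S $.

```python
# Augmented nested-list grammar; right-hand sides are tuples of symbols.
GRAMMAR = {
    "S'": [("S", "$")],
    "S": [("(", "L", ")"), ("id",)],
    "L": [("S",), ("L", ",", "S")],
}


def closure(items):
    """Add Y -> . γ for every item with the dot in front of a non-terminal Y."""
    result, todo = set(items), list(items)
    while todo:
        head, body, dot = todo.pop()
        if dot < len(body) and body[dot] in GRAMMAR:
            for production in GRAMMAR[body[dot]]:
                item = (body[dot], production, 0)
                if item not in result:
                    result.add(item)
                    todo.append(item)
    return frozenset(result)


def goto(items, symbol):
    """Move the dot over `symbol` in every item that allows it, then close."""
    moved = {
        (head, body, dot + 1)
        for head, body, dot in items
        if dot < len(body) and body[dot] == symbol
    }
    return closure(moved)


START = closure({("S'", ("S", "$"), 0)})
```

Here `goto(START, "(")` contains five items: S -> ( . L) plus the four items added by closure.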

Class Problem
Grammar:
E' -> E
E -> E + T | T
T -> T * F | F
F -> (E) | id
1. If I = { [E' -> . E] }, then Closure(I) = ??
2. If I = { [E' -> E . ], [E -> E . + T] }, then Goto(I, +) = ??

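Applying Reduce Actions
Grammar: S -> (L) | id; L -> S | L,S
[DFA fragment]
- Start state: S' -> . S $; S -> . (L); S -> . id
- On id: S -> id .
- On (: S -> ( . L); L -> . S; L -> . L, S; S -> . (L); S -> . id (this state loops to itself on "(")
- From that state, on L: S -> (L . ); L -> L . , S
- From that state, on S: L -> S .
States causing reductions are those where the dot has reached the end!
Pop RHS off stack, replace with LHS X (X -> γ), then rerun DFA (e.g., (x))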

Reductions
- On reducing X -> γ with stack αγ:
  - Pop γ off the stack, revealing prefix α and state
  - Take a single step in the DFA from the top state
  - Push X onto the stack with the new DFA state
- Example:
derivation | stack             | input | action
((a),b)    | 1 ( 3 ( 3         | a),b) | shift, goto 2
((a),b)    | 1 ( 3 ( 3 a 2     | ),b)  | reduce S -> id
((S),b)    | 1 ( 3 ( 3 S 7     | ),b)  | reduce L -> S

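Full DFA
Grammar: S -> (L) | id; L -> S | L,S
States (numbered to match the parsing table):
1: S' -> . S $; S -> . (L); S -> . id
2: S -> id .
3: S -> ( . L); L -> . S; L -> . L, S; S -> . (L); S -> . id
4: S' -> S . $ (final state)
5: S -> (L . ); L -> L . , S
6: S -> (L) .
7: L -> S .
8: L -> L, . S; S -> . (L); S -> . id
9: L -> L, S .
Transitions: 1 on ( to 3; 1 on id to 2; 1 on S to 4; 3 on ( to 3; 3 on id to 2; 3 on S to 7; 3 on L to 5; 5 on ) to 6; 5 on , to 8; 8 on ( to 3; 8 on id to 2; 8 on S to 9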

Building the Parsing Table
- States in the table = states in the DFA
- For transition S -> S' on terminal C:
  - Table[S,C] += Shift(S')
- For transition S -> S' on non-terminal N:
  - Table[S,N] += Goto(S')
- If S is a reduction state X -> γ then:
  - Table[S,*] += Reduce(X -> γ)

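Computed LR Parsing Table
Columns ( ) id , $ are input terminals; columns S and L are non-terminals. Entries sN are shifts, gN are gotos, and productions are reductions.
State | (        | )        | id       | ,        | $        | S  | L
1     | s3       |          | s2       |          |          | g4 |
2     | S -> id  | S -> id  | S -> id  | S -> id  | S -> id  |    |
3     | s3       |          | s2       |          |          | g7 | g5
4     |          |          |          |          | accept   |    |
5     |          | s6       |          | s8       |          |    |
6     | S -> (L) | S -> (L) | S -> (L) | S -> (L) | S -> (L) |    |
7     | L -> S   | L -> S   | L -> S   | L -> S   | L -> S   |    |
8     | s3       |          | s2       |          |          | g9 |
9     | L -> L,S | L -> L,S | L -> L,S | L -> L,S | L -> L,S |    |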

LR(0) Summary
- LR(0) parsing recipe:
  - Start with an LR(0) grammar
  - Compute LR(0) states and build the DFA:
    - Use the closure operation to compute states
    - Use the goto operation to compute transitions
  - Build the LR(0) parsing table from the DFA
- This can be done automatically

Class Problem
Generate the DFA for the following grammar:
S -> E + S | E
E -> num

LR(0) Limitations
- An LR(0) machine only works if states with reduce actions have a single reduce action
  - Always reduce regardless of lookahead
- With a more complex grammar, construction gives states with shift/reduce or reduce/reduce conflicts
- Need to use lookahead to choose
[Figure: three example states labelled OK, shift/reduce, and reduce/reduce, built from the items L -> L, S .   S -> S . , L   L -> S, L .   L -> S .]

A Non-LR(0) Grammar
- Grammar for addition of numbers:
  - S -> S + E | E
  - E -> num
- Left-associative version is LR(0)
- Right-associative version is not LR(0), as you saw with the previous class problem:
  - S -> E + S | E
  - E -> num

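LR(0) Parsing Table
Grammar: S -> E + S | E; E -> num (augmented with S' -> S $)
DFA states (numbered as in the table):
1: S' -> . S $; S -> . E + S; S -> . E; E -> . num
2: S -> E . + S; S -> E .
3: S -> E + . S; S -> . E + S; S -> . E; E -> . num
4: E -> num .
5: S -> E + S .
6: S' -> S . $
7: S' -> S $ .
Transitions: 1 on E to 2; 1 on num to 4; 1 on S to 6; 2 on + to 3; 3 on E to 2; 3 on num to 4; 3 on S to 5; 6 on $ to 7
Table (first rows):
State | num    | +           | $      | E  | S
1     | s4     |             |        | g2 | g6
2     | S -> E | s3 / S -> E | S -> E |    |
Shift or reduce in state 2?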

Solve Conflict With Lookahead
- 3 popular techniques for employing lookahead of 1 symbol with bottom-up parsing:
  - SLR: Simple LR
  - LALR: LookAhead LR
  - LR(1)
- Each has a different means of utilizing the lookahead
  - Results in different processing capabilities

SLR Parsing
- SLR parsing = easy extension of LR(0)
  - For each reduction X -> γ, look at the next symbol C
  - Apply the reduction only if C is in FOLLOW(X)
- SLR parsing table eliminates some conflicts
  - Same as LR(0) table except reduction rows
  - Adds reductions X -> γ only in the columns of symbols in FOLLOW(X)
Grammar: S -> E + S | E; E -> num
Example: FOLLOW(S) = {$}
State | num | +  | $      | E  | S
1     | s4  |    |        | g2 | g6
2     |     | s3 | S -> E |    |

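SLR Parsing Table
- Reductions do not fill entire rows as before
- Otherwise, same as LR(0)
Grammar: S -> E + S | E; E -> num
State | num | +        | $        | E  | S
1     | s4  |          |          | g2 | g6
2     |     | s3       | S -> E   |    |
3     | s4  |          |          | g2 | g5
4     |     | E -> num | E -> num |    |
5     |     |          | S -> E+S |    |
6     |     |          | s7       |    |
7     |     |          | accept   |    |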

Class Problem
Consider:
S -> L = R
S -> R
L -> *R
L -> ident
R -> L
Think of L as l-value, R as r-value, and * as a pointer dereference.
When you create the states in the SLR(1) DFA, 2 of the states are the following:
- S -> L . = R; R -> L .
- S -> R .
Do you have any shift/reduce conflicts? (Not as easy as it looks)

LR(1) Parsing
• Get as much as possible out of 1 lookahead symbol in the parsing table
• LR(1) grammar = recognizable by a shift/reduce parser with 1 symbol of lookahead
• LR(1) parsing uses concepts similar to LR(0)
  » Parser state = set of items
  » LR(1) item = LR(0) item + lookahead symbol possibly following the production
    LR(0) item: S → . S + E
    LR(1) item: S → . S + E, +
  » Lookahead only has impact upon REDUCE operations: apply the reduction when lookahead = next input

LR(1) States
• LR(1) state = set of LR(1) items
• LR(1) item = (X → α . β, y)
  » Meaning: α already matched at top of the stack, next expect to see β y
• Shorthand notation
  » (X → α . β, {x1, ..., xn})
  » means: (X → α . β, x1), ..., (X → α . β, xn)
• Need to extend closure and goto operations
Example items from the slide figure: (S → S . + E, {+, $}) and (S → S + . E, num)

LR(1) Closure
• LR(1) closure operation:
  » Start with Closure(S) = S
  » For each item in the closure of the form (X → α . Y β, z), and for each production Y → γ, add the following item to the closure of S: (Y → . γ, FIRST(β z))
  » Repeat until nothing changes
• Similar to LR(0) closure, but also keeps track of the lookahead symbol
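The closure rule translates almost directly into a worklist loop. The Python sketch below is only an illustration: items are encoded as hypothetical tuples `(lhs, rhs, dot, lookahead)`, the grammar is the sum grammar with S' → S, and `first_of` assumes there are no ε-productions.

```python
# Sum grammar; the end marker $ is carried as the lookahead of the start item.
GRAMMAR = {
    "S'": [("S",)],
    "S": [("E", "+", "S"), ("E",)],
    "E": [("num",)],
}

def first_of(symbols):
    """FIRST of a non-empty symbol string (no epsilon productions assumed)."""
    result, seen, todo = set(), set(), [symbols[0]]
    while todo:
        sym = todo.pop()
        if sym not in GRAMMAR:
            result.add(sym)                      # terminal
        elif sym not in seen:
            seen.add(sym)
            todo.extend(rhs[0] for rhs in GRAMMAR[sym])
    return result

def closure(items):
    """LR(1) closure of a set of (lhs, rhs, dot, lookahead) items."""
    result, work = set(items), list(items)
    while work:
        lhs, rhs, dot, z = work.pop()
        if dot == len(rhs) or rhs[dot] not in GRAMMAR:
            continue                             # dot at end or before a terminal
        y = rhs[dot]
        for la in first_of(rhs[dot + 1:] + (z,)):    # FIRST(beta z)
            for gamma in GRAMMAR[y]:
                item = (y, gamma, 0, la)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return result

start = closure({("S'", ("S",), 0, "$")})
for item in sorted(start):
    print(item)
```

Applied to the start item (S' → . S, $), the loop produces the two S items with lookahead $ and the item E → . num with lookaheads + and $.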

LR(1) Start State
• Initial state: start with (S’ → . S, $), then apply the closure operation
• Example: sum grammar
Grammar: S’ → S $    S → E + S | E    E → num
  Closure({(S’ → . S, $)}) =
    S’ → . S, $
    S → . E + S, $
    S → . E, $
    E → . num, {+, $}

LR(1) Goto Operation
• LR(1) goto operation = describes transitions between LR(1) states
• Algorithm: for a state I and a symbol Y (as before)
  » Collect every item [X → α . Y β, z] in I
  » Goto(I, Y) = Closure of the set of items [X → α Y . β, z]
Grammar: S’ → S $    S → E + S | E    E → num
Example:
  S1 = { S → E . + S, $    S → E ., $ }
  S2 = Goto(S1, ‘+’) = Closure({S → E + . S, $})
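Goto is a one-line filter on top of closure. The sketch below repeats a compact closure so that it runs on its own. The tuple encoding of items and the hand-computed `FIRST` table are assumptions made for this example, not part of the slides.

```python
GRAMMAR = {"S'": [("S",)], "S": [("E", "+", "S"), ("E",)], "E": [("num",)]}
FIRST = {"S": {"num"}, "E": {"num"}}   # precomputed by hand for this grammar

def first_of(symbols):
    """FIRST of a non-empty symbol string (no epsilon productions assumed)."""
    head = symbols[0]
    return FIRST.get(head, {head})

def closure(items):
    """LR(1) closure of a set of (lhs, rhs, dot, lookahead) items."""
    result, work = set(items), list(items)
    while work:
        lhs, rhs, dot, z = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for la in first_of(rhs[dot + 1:] + (z,)):
                for gamma in GRAMMAR[rhs[dot]]:
                    item = (rhs[dot], gamma, 0, la)
                    if item not in result:
                        result.add(item)
                        work.append(item)
    return frozenset(result)

def goto(state, symbol):
    """Move the dot over `symbol` in every item that allows it, then close."""
    moved = {(lhs, rhs, dot + 1, la)
             for lhs, rhs, dot, la in state
             if dot < len(rhs) and rhs[dot] == symbol}
    return closure(moved)

S1 = frozenset({("S", ("E", "+", "S"), 1, "$"), ("S", ("E",), 1, "$")})
S2 = goto(S1, "+")
print(sorted(S2))
print(sorted(goto(S2, "num")))
```

For this grammar, `goto(S2, "E")` leads back to `S1`, which is the loop visible in the DFA.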

Class Problem
Grammar: S’ → S $    S → E + S | E    E → num
1. Compute: Closure(I = {S → E + . S, $})
2. Compute: Goto(I, num)
3. Compute: Goto(I, E)

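LR(1) DFA Construction
Grammar: S’ → S $    S → E + S | E    E → num
States:
  1: S’ → . S, $    S → . E + S, $    S → . E, $    E → . num, {+, $}
  2: S → E . + S, $    S → E ., $
  3: S → E + . S, $    S → . E + S, $    S → . E, $    E → . num, {+, $}
  4: E → num ., {+, $}
  5: S → E + S ., $
  6: S’ → S ., $
Transitions: 1 on E to 2, 1 on num to 4, 1 on S to 6, 2 on + to 3, 3 on E to 2, 3 on num to 4, 3 on S to 5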

LR(1) Reductions
Grammar: S’ → S $    S → E + S | E    E → num
• Reductions correspond to LR(1) items of the form (X → γ ., y)
• In the LR(1) DFA for this grammar, the reduce items are (E → num ., {+, $}), (S → E ., $) and (S → E + S ., $)

LR(1) Parsing Table Construction
• Same as construction of LR(0), except for reductions
• For a transition S → S’ on terminal x:
  » Table[S, x] += Shift(S’)
• For a transition S → S’ on non-terminal N:
  » Table[S, N] += Goto(S’)
• If state I contains the item (X → γ ., y), then:
  » Table[I, y] += Reduce(X → γ)

LR(1) Parsing Table Example
Grammar: S’ → S $    S → E + S | E    E → num
States involved:
  1: S’ → . S, $    S → . E + S, $    S → . E, $    E → . num, {+, $}
  2: S → E . + S, $    S → E ., $
  3: S → E + . S, $    S → . E + S, $    S → . E, $    E → . num, {+, $}
Transitions: 1 on E to 2, 2 on + to 3
Fragment of the parsing table:
  state | +  | $     | E
  1     |    |       | g2
  2     | s3 | S → E |

Class Problem
Compute the LR(1) DFA for the following grammar:
  E → E + T | T
  T → T F | F
  F → F * | a | b

LALR(1) Grammars
• Problem with LR(1): too many states
• LALR(1) parsing (aka LookAhead LR)
  » Constructs the LR(1) DFA and then merges any 2 LR(1) states whose items are identical except for the lookahead
  » Results in smaller parser tables
  » Theoretically less powerful than LR(1)
• LALR(1) grammar = a grammar whose LALR(1) parsing table has no conflicts
Example: { S → id ., +    S → E ., $ }  +  { S → id ., $    S → E ., + }  =  ??
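The merge step, and the way it can introduce a conflict, can be sketched in a few lines of Python. The item encoding `(lhs, rhs, dot, lookahead)` and the function names are hypothetical. The two states are the ones from the example on this slide.

```python
from collections import defaultdict

def core(state):
    """The LR(0) core: the items with their lookaheads stripped."""
    return frozenset((lhs, rhs, dot) for lhs, rhs, dot, _ in state)

def merge_by_core(states):
    """Merge LR(1) states that share a core, taking the union of their items."""
    merged = defaultdict(set)
    for state in states:
        merged[core(state)] |= set(state)
    return list(merged.values())

def reduce_reduce_conflicts(state):
    """Map each lookahead to the completed productions competing on it."""
    by_lookahead = defaultdict(set)
    for lhs, rhs, dot, la in state:
        if dot == len(rhs):                      # dot at the end: a reduce item
            by_lookahead[la].add((lhs, rhs))
    return {la: prods for la, prods in by_lookahead.items() if len(prods) > 1}

A = {("S", ("id",), 1, "+"), ("S", ("E",), 1, "$")}
B = {("S", ("id",), 1, "$"), ("S", ("E",), 1, "+")}

merged = merge_by_core([A, B])
print(reduce_reduce_conflicts(A))          # {} : no conflict before merging
print(sorted(reduce_reduce_conflicts(merged[0])))
```

Each LR(1) state is conflict-free on its own, but the merged state has both reductions on + and on $. This is a reduce/reduce conflict that exists only in the LALR(1) table.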

LALR Parsers
• LALR(1)
  » Generally the same number of states as SLR (far fewer than LR(1))
  » But with nearly the lookahead capability of LR(1) (much better than SLR)
  » Example: Pascal programming language
    In SLR, several hundred states
    In LR(1), several thousand states

Automate the Parsing Process
• Can automate:
  » The construction of LR parsing tables
  » The construction of shift-reduce parsers based on these parsing tables
• LALR(1) parser generators
  » yacc, bison
  » Not much difference compared to LR(1) in practice
  » Smaller parsing tables than LR(1)
  » Augment the LALR(1) grammar specification with declarations of precedence and associativity
  » Output: LALR(1) parser program

Associativity
• Unambiguous grammar: S → S + E | E    E → num
• Ambiguous grammar: E → E + E    E → num
• What happens if we run the ambiguous grammar through LALR construction?
  » One state contains both (E → E + E ., +) and (E → E . + E, {+, $})
  » Shift/reduce conflict on +
    shift: 1 + (2 + 3)
    reduce: (1 + 2) + 3

Associativity (2)
• If an operator is left associative
  » Assign a slightly higher value to its precedence if it is on the parse stack than if it is in the input stream
  » Since stack precedence is higher, reduce will take priority (which is correct for left associative)
• If an operator is right associative
  » Assign a slightly higher value if it is in the input stream
  » Since input stream precedence is higher, shift will take priority (which is correct for right associative)

Precedence
• Unambiguous grammar: E → E + E | T    T → T x T | num | (E)
• Ambiguous grammar: E → E + E | E x E | num | (E)
• What happens if we run the ambiguous grammar through LALR construction? Shift/reduce conflicts result:
  » { E → E . + E, ...    E → E x E ., + }
  » { E → E + E ., x    E → E . x E, ... }
• Precedence: attach precedence indicators to terminals
• Shift/reduce conflict resolved by:
  1. If the precedence of the input token is greater than that of the last terminal on the parse stack, favor shift over reduce
  2. If the precedence of the input token is less than or equal to that of the last terminal on the parse stack, favor reduce over shift
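The two resolution rules, combined with the associativity rule from the previous slide, fit in one small function. This Python sketch is illustrative. The `PRECEDENCE` and `ASSOC` tables and the function name `resolve` are assumptions for the example and are not taken from any parser generator.

```python
# Hypothetical declarations: x binds tighter than +, both left associative.
PRECEDENCE = {"+": 1, "x": 2}
ASSOC = {"+": "left", "x": "left"}

def resolve(stack_op, input_op):
    """Decide a shift/reduce conflict between the last terminal on the
    parse stack (stack_op) and the incoming token (input_op)."""
    if PRECEDENCE[input_op] > PRECEDENCE[stack_op]:
        return "shift"       # incoming operator binds tighter
    if PRECEDENCE[input_op] < PRECEDENCE[stack_op]:
        return "reduce"      # operator on the stack binds tighter
    # Equal precedence: associativity breaks the tie.
    return "reduce" if ASSOC[stack_op] == "left" else "shift"

print(resolve("+", "x"))   # shift:  1 + 2 x 3 groups as 1 + (2 x 3)
print(resolve("x", "+"))   # reduce: 1 x 2 + 3 groups as (1 x 2) + 3
print(resolve("+", "+"))   # reduce: 1 + 2 + 3 groups as (1 + 2) + 3
```

Rule 2 on the slide ("less than or equal favors reduce") is the special case where every operator is left associative.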

Abstract Syntax Tree (AST) - Review
• Derivation = sequence of applied productions
  » S ⇒ E+S ⇒ 1+S ⇒ 1+E ⇒ 1+2
• Parse tree = graph representation of a derivation
  » Doesn’t capture the order of applying the productions
• AST discards unnecessary information from the parse tree
[Figure: a parse tree and the corresponding AST]

Implicit AST Construction
• LL/LR parsing techniques implicitly build the AST
• The parse tree is captured in the derivation
  » LL parsing: AST represented by applied productions
  » LR parsing: AST represented by applied reductions
• We want to explicitly construct the AST during the parsing phase

AST Construction - LL
LL parsing: extend the procedures for non-terminals
Grammar: S → E S’    S’ → ε | + S    E → num | ( S )

Recognizer only:
  void parse_S() {
    switch (token) {
      case num: case '(':
        parse_E(); parse_S’(); return;
      default: ParseError();
    }
  }

With AST construction:
  Expr parse_S() {
    switch (token) {
      case num: case '(':
        Expr left = parse_E();
        Expr right = parse_S’();
        if (right == NULL) return left;
        else return new Add(left, right);
      default: ParseError();
    }
  }
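For readers who want something runnable, the same idea can be sketched in Python. Everything here is an assumption made for illustration: the `Num`/`Add` node classes, the `Parser` class, and the token encoding (integers for num, strings for `+`, `(`, `)`, with `$` appended as an end marker).

```python
from dataclasses import dataclass

@dataclass
class Num:
    value: int

@dataclass
class Add:
    left: object
    right: object

class Parser:
    """Recursive descent for S -> E S',  S' -> epsilon | + S,  E -> num | ( S )."""

    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]   # "$" marks end of input
        self.pos = 0

    @property
    def token(self):
        return self.tokens[self.pos]

    def parse(self):
        tree = self.parse_S()
        if self.token != "$":
            raise SyntaxError(f"unexpected {self.token!r}")
        return tree

    def parse_S(self):
        if isinstance(self.token, int) or self.token == "(":
            left = self.parse_E()
            right = self.parse_S_prime()
            return left if right is None else Add(left, right)
        raise SyntaxError(f"unexpected {self.token!r}")

    def parse_S_prime(self):
        if self.token == "+":
            self.pos += 1
            return self.parse_S()
        if self.token in (")", "$"):
            return None                      # S' -> epsilon
        raise SyntaxError(f"unexpected {self.token!r}")

    def parse_E(self):
        if isinstance(self.token, int):
            node = Num(self.token)
            self.pos += 1
            return node
        if self.token == "(":
            self.pos += 1
            node = self.parse_S()
            if self.token != ")":
                raise SyntaxError("expected ')'")
            self.pos += 1
            return node
        raise SyntaxError(f"unexpected {self.token!r}")

print(Parser([1, "+", 2, "+", 3]).parse())
```

Because the grammar is right-recursive, `1 + 2 + 3` yields the right-nested tree `Add(Num(1), Add(Num(2), Num(3)))`. Parentheses are needed to obtain a left-nested one.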

AST Construction - LR
• We again need to add code for explicit AST construction
• AST construction mechanism
  » Store parts of the tree on the stack
  » For each nonterminal symbol X on the stack, also store the sub-tree rooted at X on the stack
  » Whenever the parser performs a reduce operation for a production X → β, create an AST node for X

AST Construction for LR - Example
Grammar: S → E + S | E    E → num | ( S )
input string: “ ”
• Before the reduction S → E + S: the stack entries carry the sub-trees Add(Num(1), Num(2)) and Num(3)
• After the reduction S → E + S: the single stack entry S carries Add(Add(Num(1), Num(2)), Num(3))

Problems
• Unstructured code: mixing parsing code with AST construction code
• Automatic parser generators
  » The generated parser needs to contain AST construction code
  » How to construct a customized AST data structure using an automatic parser generator?
• May want to perform other actions concurrently with the parsing phase
  » E.g., semantic checks
  » This can reduce the number of compiler passes

Syntax-Directed Definition
• Solution: syntax-directed definition
  » Extends each grammar production with an associated semantic action (code):
    S → E + S {action}
  » The parser generator adds these actions into the generated parser
  » Each action is executed when the corresponding production is reduced

Semantic Actions
• Actions = C code (for bison/yacc)
• The actions access the parser stack
  » Parser generators extend the stack of symbols with entries for user-defined structures (e.g., parse trees)
• The action code should be able to refer to the grammar symbols in the productions
  » Need to refer to multiple occurrences of the same non-terminal symbol and distinguish RHS vs LHS occurrences
    E → E + E
  » Use dollar variables in yacc/bison ($$, $1, $2, etc.)
    expr ::= expr PLUS expr   { $$ = $1 + $3; }

Building the AST
• Use semantic actions to build the AST
• AST is built bottom-up along with parsing
    expr ::= NUM              { $$ = new Num($1.val); }
    expr ::= expr PLUS expr   { $$ = new Add($1, $3); }
    expr ::= expr MULT expr   { $$ = new Mul($1, $3); }
    expr ::= LPAR expr RPAR   { $$ = $2; }
• Recall: user-defined type for objects on the stack (%union)

Outline
• DFA & NFA
• Regular expression
• Regular languages
• Context-free languages & PDA
• Scanner
• Parser
  » Top-down parsing
  » Bottom-up parsing
  » Comparison

LL/LR Grammar Summary
• LL parsing tables
  » Non-terminals x terminals → productions
  » Computed using FIRST/FOLLOW
• LR parsing tables
  » LR states x terminals → {shift/reduce}
  » LR states x non-terminals → goto
  » Computed using closure/goto operations on LR states
• A grammar is:
  » LL(1) if its LL(1) parsing table has no conflicts
  » Same for LR(0), SLR, LALR(1), LR(1)

Top-Down Parsing
Grammar: S → S + E | E    E → num | ( S )
Left-most derivation:
  S ⇒ S+E ⇒ E+E ⇒ (S)+E ⇒ (S+E)+E ⇒ (S+E+E)+E ⇒ (E+E+E)+E ⇒ (1+E+E)+E ⇒ (1+2+E)+E ...
• In a left-most derivation, the entire tree above a token (e.g., 2) has been expanded by the time the token is encountered
[Figure: parse tree for (1+2+(3+4))+5]

Top-Down vs Bottom-Up
[Figure: the portion of the parse tree known to a top-down parser and to a bottom-up parser, each over scanned and unscanned input]
• Bottom-up: don’t need to figure out as much of the parse tree for a given amount of input
  » More time to decide what rules to apply

Terminology: LL vs LR
• LL(k)
  » Left-to-right scan of input
  » Left-most derivation
  » k symbol lookahead
  » [Top-down or predictive] parsing or LL parser
  » Performs pre-order traversal of parse tree
• LR(k)
  » Left-to-right scan of input
  » Right-most derivation (constructed in reverse)
  » k symbol lookahead
  » [Bottom-up or shift-reduce] parsing or LR parser
  » Performs post-order traversal of parse tree

Classification of Grammars
[Figure: containment diagram of the grammar classes LR(0), SLR, LALR(1), LR(1) and LL(1); not to scale]
• LR(k) ⊆ LR(k+1)
• LL(k) ⊆ LL(k+1)
• LL(k) ⊆ LR(k)
• LR(0) ⊆ SLR
• LALR(1) ⊆ LR(1)

Bottom-Up Parsing
Grammar: S → S + E | E    E → num | ( S )
Sequence of reductions (right-most derivation in reverse):
  (1+2+(3+4))+5 ← (E+2+(3+4))+5 ← (S+2+(3+4))+5 ← (S+E+(3+4))+5 ...
• Advantage of bottom-up parsing: can postpone the selection of productions until more of the input is scanned
[Figure: parse tree for (1+2+(3+4))+5]