Fall Compiler Principles Lecture 4: Parsing part 3

Slides:



Advertisements
Similar presentations
Compiler construction in4020 – lecture 4 Koen Langendoen Delft University of Technology The Netherlands.
Advertisements

Compilation (Semester A, 2013/14) Lecture 6a: Syntax (Bottom–up parsing) Noam Rinetzky 1 Slides credit: Roman Manevich, Mooly Sagiv, Eran Yahav.
Compiler Principles Fall Compiler Principles Lecture 4: Parsing part 3 Roman Manevich Ben-Gurion University.
Bottom up Parsing Bottom up parsing trys to transform the input string into the start symbol. Moves through a sequence of sentential forms (sequence of.
Chapter 3 Syntax Analysis
Abstract Syntax Mooly Sagiv html:// 1.
1 JavaCUP JavaCUP (Construct Useful Parser) is a parser generator Produce a parser written in java, itself is also written in Java; There are many parser.
Mooly Sagiv and Roman Manevich School of Computer Science
Cse321, Programming Languages and Compilers 1 6/12/2015 Lecture #10, Feb. 14, 2007 Modified sets of item construction Rules for building LR parse tables.
6/12/2015Prof. Hilfinger CS164 Lecture 111 Bottom-Up Parsing Lecture (From slides by G. Necula & R. Bodik)
1 Chapter 5: Bottom-Up Parsing (Shift-Reduce). 2 - attempts to construct a parse tree for an input string beginning at the leaves (the bottom) and working.
Bottom-Up Syntax Analysis Mooly Sagiv Textbook:Modern Compiler Design Chapter (modified)
Bottom-Up Syntax Analysis Mooly Sagiv html:// Textbook:Modern Compiler Design Chapter
Context-Free Grammars Lecture 7
Compiler Construction Parsing Rina Zviel-Girshin and Ohad Shacham School of Computer Science Tel-Aviv University.
Lecture #8, Feb. 7, 2007 Shift-reduce parsing,
Bottom-Up Syntax Analysis Mooly Sagiv & Greta Yorsh Textbook:Modern Compiler Design Chapter (modified)
Syntax Analysis Mooly Sagiv Textbook:Modern Compiler Design Chapter 2.2 (Partial)
Bottom-Up Syntax Analysis Mooly Sagiv & Greta Yorsh Textbook:Modern Compiler Design Chapter (modified)
Table-driven parsing Parsing performed by a finite state machine. Parsing algorithm is language-independent. FSM driven by table (s) generated automatically.
Compiler Construction Parsing II Ran Shaham and Ohad Shacham School of Computer Science Tel-Aviv University.
Compiler Construction Parsing I Ran Shaham and Ohad Shacham School of Computer Science Tel-Aviv University.
Syntax and Semantics Structure of programming languages.
LR Parsing Compiler Baojian Hua
Automated Parser Generation (via CUP)CUP 1. High-level structure JFlexjavac Lexer spec Lexical analyzer text tokens.java CUPjavac Parser spec.javaParser.
Syntactic Analysis Natawut Nupairoj, Ph.D. Department of Computer Engineering Chulalongkorn University.
Syntax and Semantics Structure of programming languages.
Syntax Analysis Mooly Sagiv Textbook:Modern Compiler Design Chapter 2.2 (Partial) 1.
Chapter 5: Bottom-Up Parsing (Shift-Reduce)
Compilation /15a Lecture 5 1 * Syntax Analysis: Bottom-Up parsing Noam Rinetzky.
CS 614: Theory and Construction of Compilers Lecture 9 Fall 2002 Department of Computer Science University of Alabama Joel Jones.
Compiler Principles Fall Compiler Principles Lecture 6: Parsing part 5 Roman Manevich Ben-Gurion University.
CS 614: Theory and Construction of Compilers Lecture 7 Fall 2003 Department of Computer Science University of Alabama Joel Jones.
Compiler Principles Winter Compiler Principles Syntax Analysis (Parsing) – Part 3 Mayer Goldberg and Roman Manevich Ben-Gurion University.
Mooly Sagiv and Roman Manevich School of Computer Science
Compiler Principles Fall Compiler Principles Lecture 5: Parsing part 4 Roman Manevich Ben-Gurion University.
4. Bottom-up Parsing Chih-Hung Wang
CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 6: LR grammars and automatic parser generators.
Lecture 5: LR Parsing CS 540 George Mason University.
Compilers: Bottom-up/6 1 Compiler Structures Objective – –describe bottom-up (LR) parsing using shift- reduce and parse tables – –explain how LR.
Bottom-up parsing. Bottom-up parsing builds a parse tree from the leaves (terminals) to the start symbol int E T * TE+ T (4) (2) (3) (5) (1) int*+ E 
Syntax and Semantics Structure of programming languages.
Compiler Principles Fall Compiler Principles Lecture 5: Parsing part 4 Roman Manevich Ben-Gurion University.
Announcements/Reading
CUP: An LALR Parser Generator for Java
A Simple Syntax-Directed Translator
Programming Languages Translator
CS510 Compiler Lecture 4.
50/50 rule You need to get 50% from tests, AND
Compiler Baojian Hua LR Parsing Compiler Baojian Hua
Unit-3 Bottom-Up-Parsing.
Textbook:Modern Compiler Design
Table-driven parsing Parsing performed by a finite state machine.
Java CUP.
Fall Compiler Principles Lecture 4: Parsing part 3
Bottom-Up Syntax Analysis
Fall Compiler Principles Lecture 4: Parsing part 3
CPSC 388 – Compiler Design and Construction
Top-Down Parsing CS 671 January 29, 2008.
Introduction CI612 Compiler Design CI612 Compiler Design.
CPSC 388 – Compiler Design and Construction
Parsing #2 Leonidas Fegaras.
Lecture 4: Syntax Analysis: Botom-Up Parsing Noam Rinetzky
Lecture 7: Introduction to Parsing (Syntax Analysis)
Bottom Up Parsing.
Subject: Language Processor
Parsing #2 Leonidas Fegaras.
5. Bottom-Up Parsing Chih-Hung Wang
Kanat Bolazar February 16, 2010
Lec00-outline May 18, 2019 Compiler Design CS416 Compiler Design.
Presentation transcript:

Fall 2017-2018 Compiler Principles Lecture 4: Parsing part 3 Roman Manevich Ben-Gurion University of the Negev

Tentative syllabus mid-term exam Front End Intermediate Representation Scanning Top-down Parsing (LL) Bottom-up Parsing (LR) Intermediate Representation Operational Semantics Lowering Optimizations Dataflow Analysis Loop Optimizations Code Generation Register Allocation Instruction Selection mid-term exam

Previously LR(0) parsing Running the parser Constructing transition diagram Constructing parser table Detecting conflicts

Agenda SLR LR(1) LALR(1) Automatic LR parser generation Handling ambiguities

SLR parsing

SRL (Simple LR) parsing Observation: a handle should not be reduced to non-terminal N if the next token cannot follow N A reduce item N  α is applicable only when the next token b is in FOLLOW(N) If b is not in FOLLOW(N) we just proved there is no terminating derivation S * βNb and thus it is safe to remove the reduce item from the conflicted state SLR differs from LR(0) only on ACTION table Now a row in the parsing table may contain both shift actions and reduce actions and we need to consult the current token to decide which one to take

Exercise: What is FOLLOW(T)? E  T E  E + T T  id T  ( E ) T  id[E]

Lookahead token from the input SLR action table Lookahead token from the input State id + ( ) [ ] $ shift 1 2 accept 3 4 EE+T 5 r5 r5, s6 Tid 6 ET 7 8 9 T(E) state action q0 shift q1 q2 q3 q4 EE+T q5 Tid q6 ET q7 q8 q9 TE vs. SLR – use 1 token look-ahead LR(0) – no look-ahead [ is not in FOLLOW(T) … as before… T  id T  id[E]

Beyond SLR parsing

Going beyond SLR Some common language constructs introduce conflicts even for SLR (0) S’ → S (1) S → L = R (2) S → R (3) L → * R (4) L → id (5) R → L

L = R id * L → id  id id * L L R q3 R S → R  q0 S S’ →  S S →  L = R S →  R L →  * R L →  id R →  L S q1 S’ → S q9 S → L = R  q2 L S → L  = R R → L  = R q6 S → L =  R R →  L L →  * R L →  id q5 id * L → id  * id id q4 L → *  R R →  L L →  * R L →  id * L q8 L R → L  q7 L → * R  R

shift/reduce conflict (0) S’ → S (1) S → L = R (2) S → R (3) L → * R (4) L → id (5) R → L q2 S → L  = R R → L  = q6 S → L =  R R →  L L →  * R L →  id S → L  = R vs. R → L  FOLLOW(R) contains = S → L = R → * R = R SLR cannot resolve conflict

Inputs requiring shift/reduce (0) S’ → S (1) S → L = R (2) S → R (3) L → * R (4) L → id (5) R → L q2 S → L  = R R → L  = q6 S → L =  R R →  L L →  * R L →  id For the input id the rightmost derivation S’ → S → R → L → id requires reducing in q2 For the input id = id S’ → S → L = R → L = L → L = id → id = id requires shifting

LR(1) grammars In SLR: a reduce item N  α is applicable only when the lookahead is in FOLLOW(N) But for a given context (state) are all tokens in FOLLOW(N) indeed possible? Not always We can compute a context-sensitive (i.e., specific to a given state) subset of FOLLOW(N) and use it to remove even more conflicts LR(1) keeps lookahead with each LR item Idea: a more refined notion of FOLLOW computed per item

N  αβ, t LR(1) item Already matched To be matched Input Hypothesis about αβ being a possible handle: so far we’ve matched α, expecting to see β and after reducing N we expect to see the token t

LR(1) items LR(1) item is a pair Meaning Example LR(0) item Lookahead token Meaning We matched the part left of the dot, looking to match the part on the right of the dot, followed by the lookahead token Example The production L  id yields the following LR(1) items LR(1) items [L → ● id, *] [L → ● id, =] [L → ● id, id] [L → ● id, $] [L → id ●, *] [L → id ●, =] [L → id ●, id] [L → id ●, $] (0) S’ → S (1) S → L = R (2) S → R (3) L → * R (4) L → id (5) R → L LR(0) items [L → ● id] [L → id ●]

Computing Closure for LR(1) For every [A → α ● Bβ , c] in S for every production B→δ and every token b in the grammar such that b  FIRST(βc) Add [B → ● δ , b] to S

q3 q0 (S → R ∙ , $) (S’ → ∙ S , $) (S → ∙ L = R , $) (S → ∙ R , $) (L → ∙ * R , = ) (L → ∙ id , = ) (R → ∙ L , $ ) (L → ∙ id , $ ) (L → ∙ * R , $ ) q9 q11 R (S → L = R ∙ , $) (L → id ∙ , $) q1 S (S’ → S ∙ , $) R id id L q2 q12 (S → L ∙ = R , $) (R → L ∙ , $) (R → L ∙ , $) = id * * q6 (S → L = ∙ R , $) (R → ∙ L , $) (L → ∙ * R , $) (L → ∙ id , $) L q10 q4 id q5 (L → * ∙ R , =) (R → ∙ L , =) (L → ∙ * R , =) (L → ∙ id , =) (L → * ∙ R , $) (R → ∙ L , $) (L → ∙ * R , $) (L → ∙ id , $) (L → id ∙ , $) (L → id ∙ , =) (L → * ∙ R , $) (R → ∙ L , $) (L → ∙ * R , $) (L → ∙ id , $) * L * q8 R q7 (R → L ∙ , =) (R → L ∙ , $) R q13 (L → * R ∙ , =) (L → * R ∙ , $) (L → * R ∙ , $)

Back to the conflict Is there a conflict now? q2 (S → L ∙ = R , $) (R → L ∙ , $) = q6 (S → L = ∙ R , $) (R → ∙ L , $) (L → ∙ * R , $) (L → ∙ id , $) Is there a conflict now?

Question What is SLR(1)?

LALR(1)

LALR(1) LR(1) tables have huge number of entries Often don’t need such refined observation (and cost) Idea: find states with the same LR(0) component and merge their lookaheads component as long as there are no conflicts LALR(1) not as powerful as LR(1) in theory but works quite well in practice Merging may not introduce new shift-reduce conflicts, only reduce-reduce, which is unlikely in practice

q3 q0 (S → R ∙ , $) (S’ → ∙ S , $) (S → ∙ L = R , $) (S → ∙ R , $) (L → ∙ * R , = ) (L → ∙ id , = ) (R → ∙ L , $ ) (L → ∙ id , $ ) (L → ∙ * R , $ ) q9 q11 R (S → L = R ∙ , $) (L → id ∙ , $) q1 S (S’ → S ∙ , $) R id id L q2 q12 (S → L ∙ = R , $) (R → L ∙ , $) (R → L ∙ , $) = id * * q6 (S → L = ∙ R , $) (R → ∙ L , $) (L → ∙ * R , $) (L → ∙ id , $) L q10 q4 id q5 (L → * ∙ R , =) (R → ∙ L , =) (L → ∙ * R , =) (L → ∙ id , =) (L → * ∙ R , $) (R → ∙ L , $) (L → ∙ * R , $) (L → ∙ id , $) (L → id ∙ , $) (L → id ∙ , =) (L → * ∙ R , $) (R → ∙ L , $) (L → ∙ * R , $) (L → ∙ id , $) * L * q8 R q7 (R → L ∙ , =) (R → L ∙ , $) R q13 (L → * R ∙ , =) (L → * R ∙ , $) (L → * R ∙ , $)

q3 q0 (S → R ∙ , $) (S’ → ∙ S , $) (S → ∙ L = R , $) (S → ∙ R , $) (L → ∙ * R , = ) (L → ∙ id , = ) (R → ∙ L , $ ) (L → ∙ id , $ ) (L → ∙ * R , $ ) q9 q11 R (S → L = R ∙ , $) (L → id ∙ , $) q1 S (S’ → S ∙ , $) R id id L q2 q12 (S → L ∙ = R , $) (R → L ∙ , $) (R → L ∙ , $) = id * * q6 (S → L = ∙ R , $) (R → ∙ L , $) (L → ∙ * R , $) (L → ∙ id , $) L q10 q4 id q5 (L → * ∙ R , =) (R → ∙ L , =) (L → ∙ * R , =) (L → ∙ id , =) (L → * ∙ R , $) (R → ∙ L , $) (L → ∙ * R , $) (L → ∙ id , $) (L → id ∙ , $) (L → id ∙ , =) (L → * ∙ R , $) (R → ∙ L , $) (L → ∙ * R , $) (L → ∙ id , $) * L * q8 R q7 (R → L ∙ , =) (R → L ∙ , $) R q13 (L → * R ∙ , =) (L → * R ∙ , $) (L → * R ∙ , $)

q3 q0 (S → R ∙ , $) (S’ → ∙ S , $) (S → ∙ L = R , $) (S → ∙ R , $) (L → ∙ * R , = ) (L → ∙ id , = ) (R → ∙ L , $ ) (L → ∙ id , $ ) (L → ∙ * R , $ ) q9 R (S → L = R ∙ , $) q1 S (S’ → S ∙ , $) R L q2 (S → L ∙ = R , $) (R → L ∙ , $) = id * * q6 (S → L = ∙ R , $) (R → ∙ L , $) (L → ∙ * R , $) (L → ∙ id , $) q4 id q5 (L → * ∙ R , =) (R → ∙ L , =) (L → ∙ * R , =) (L → ∙ id , =) (L → * ∙ R , $) (R → ∙ L , $) (L → ∙ * R , $) (L → ∙ id , $) (L → id ∙ , $) (L → id ∙ , =) * id L L q8 R q7 (R → L ∙ , =) (R → L ∙ , $) (L → * R ∙ , =) (L → * R ∙ , $)

Left/Right- recursion At home: create a simple grammar with left-recursion and one with right-recursion Construct corresponding LR(0) parser Any conflicts? Run on simple input and observe behavior Attempt to generalize observation for long inputs

Example: non-LR(1) grammar (1) S  Y b c $ (2) S  Z b d $ (3) Y  a (4) Z  a S  ∙ Y b c, $ S  ∙ Z b d, $ Y  ∙ a, b Z  ∙ a, b a Y  a ∙, b Z  a ∙, b reduce-reduce conflict on lookahead ‘b’

Automated parser generation (via CUP)

High-level structure Lexer.java LANG.lex (Token.java) text Lexer spec .java Lexical analyzer JFlex javac LANG.lex Lexer.java tokens (Token.java) Parser spec .java Parser CUP javac Parser.java sym.java LANG.cup AST

Expression calculator expr  expr + expr | expr - expr | expr * expr | expr / expr | - expr | ( expr ) | number Goals of expression calculator parser: Is 2+3+4+5 a valid expression? What is the meaning (value) of this expression?

Syntax analysis with CUP CUP – parser generator Generates an LALR(1) Parser Input: spec file Output: a syntax analyzer Can dump automaton and table tokens Parser spec .java Parser CUP javac AST

CUP spec file Package and import specifications User code components Symbol (terminal and non-terminal) lists Terminals go to sym.java Types of AST nodes Precedence declarations The grammar Semantic actions to construct AST

Parsing ambiguous grammars

Expression Calculator – 1st Attempt terminal Integer NUMBER; terminal PLUS, MINUS, MULT, DIV; terminal LPAREN, RPAREN; non terminal Integer expr; expr ::= expr PLUS expr | expr MINUS expr | expr MULT expr | expr DIV expr | MINUS expr | LPAREN expr RPAREN | NUMBER ; Symbol type explained later

Ambiguities a b c * + a b c + * a + b * c a b c + a b c + a + b + c

Ambiguities as conflicts for LR(1) * + a b c + * a + b * c a b c + a b c + a + b + c

Expression Calculator – 2nd Attempt terminal Integer NUMBER; terminal PLUS,MINUS,MULT,DIV; terminal LPAREN, RPAREN; terminal UMINUS; non terminal Integer expr; precedence left PLUS, MINUS; precedence left DIV, MULT; precedence left UMINUS; expr ::= expr PLUS expr | expr MINUS expr | expr MULT expr | expr DIV expr | MINUS expr %prec UMINUS | LPAREN expr RPAREN | NUMBER ; Increasing precedence Contextual precedence

Parsing ambiguous grammars using precedence declarations Each terminal assigned with precedence By default all terminals have lowest precedence User can assign his own precedence CUP assigns each production a precedence Precedence of rightmost terminal in production or user-specified contextual precedence On shift/reduce conflict resolve ambiguity by comparing precedence of terminal and production and decides whether to shift or reduce In case of equal precedences left/right help resolve conflicts left means reduce right means shift More information on precedence declarations in CUP’s manual

Resolving ambiguity (associativity) precedence left PLUS a b c + a b c + a + b + c

Resolving ambiguity (op. precedence) precedence left PLUS precedence left MULT a b c * + a b c + * a + b * c

Resolving ambiguity (contextual) precedence left MULT MINUS expr %prec UMINUS * - - * b a a b - a * b

Resolving ambiguity terminal Integer NUMBER; terminal PLUS,MINUS,MULT,DIV; terminal LPAREN, RPAREN; terminal UMINUS; precedence left PLUS, MINUS; precedence left DIV, MULT; precedence left UMINUS; expr ::= expr PLUS expr | expr MINUS expr | expr MULT expr | expr DIV expr | MINUS expr %prec UMINUS | LPAREN expr RPAREN | NUMBER ; UMINUS never returned by scanner (used only to define precedence) Rule has precedence of UMINUS

More CUP directives precedence nonassoc NEQ start non-terminal Non-associative operators: < > == != etc. 1<2<3 identified as an error (semantic error?) start non-terminal Specifies start non-terminal other than first non-terminal Can change to test parts of grammar Getting internal representation Command line options: -dump_grammar -dump_states -dump_tables -dump

Scanner integration Parser gets terminals from the scanner import java_cup.runtime.*; %% %cup %eofval{ return new Symbol(sym.EOF); %eofval} NUMBER=[0-9]+ <YYINITIAL>”+” { return new Symbol(sym.PLUS); } <YYINITIAL>”-” { return new Symbol(sym.MINUS); } <YYINITIAL>”*” { return new Symbol(sym.MULT); } <YYINITIAL>”/” { return new Symbol(sym.DIV); } <YYINITIAL>”(” { return new Symbol(sym.LPAREN); } <YYINITIAL>”)” { return new Symbol(sym.RPAREN); } <YYINITIAL>{NUMBER} { return new Symbol(sym.NUMBER, new Integer(yytext())); } <YYINITIAL>\n { } <YYINITIAL>. { } Generated from token declarations in .cup file Parser gets terminals from the scanner

Recap Package and import specifications and user code components Symbol (terminal and non-terminal) lists Define building-blocks of the grammar Precedence declarations May help resolve conflicts The grammar May introduce conflicts that have to be resolved

Abstract syntax tree construction

Assigning meaning So far, only validation expr ::= expr PLUS expr | expr MINUS expr | expr MULT expr | expr DIV expr | MINUS expr %prec UMINUS | LPAREN expr RPAREN | NUMBER ; So far, only validation Add Java code implementing semantic actions

Assigning meaning Symbol labels used to name variables non terminal Integer expr; expr ::= expr:e1 PLUS expr:e2 {: RESULT = new Integer(e1.intValue() + e2.intValue()); :} | expr:e1 MINUS expr:e2 {: RESULT = new Integer(e1.intValue() - e2.intValue()); :} | expr:e1 MULT expr:e2 {: RESULT = new Integer(e1.intValue() * e2.intValue()); :} | expr:e1 DIV expr:e2 {: RESULT = new Integer(e1.intValue() / e2.intValue()); :} | MINUS expr:e1 {: RESULT = new Integer(0 - e1.intValue(); :} %prec UMINUS | LPAREN expr:e1 RPAREN {: RESULT = e1; :} | NUMBER:n {: RESULT = n; :} ; Symbol labels used to name variables RESULT names the left-hand side symbol

Abstract Syntax Trees More useful representation of syntax tree Less clutter Actual level of detail depends on your design Basis for semantic analysis Later annotated with various information Type information Computed values Technically – a class hierarchy of abstract syntax tree nodes

Parse tree vs. AST expr + expr + expr expr expr 1 + ( 2 ) + ( 3 ) 1 2

AST hierarchy example expr int_const plus minus times divide

AST construction AST Nodes constructed during parsing Bottom-up parser Stored in push-down stack Bottom-up parser Grammar rules annotated with actions for AST construction When node is constructed all children available (already constructed) Node (RESULT) pushed on stack

AST construction expr ::= expr:e1 PLUS expr:e2 1 + (2) + (3) expr ::= expr:e1 PLUS expr:e2 {: RESULT = new plus(e1,e2); :} | LPAREN expr:e RPAREN {: RESULT = e; :} | INT_CONST:i {: RESULT = new int_const(…, i); :} expr + (2) + (3) expr + (expr) + (3) expr + (3) expr + (expr) expr plus e1 e2 expr plus e1 e2 expr expr expr expr int_const val = 1 int_const val = 2 int_const val = 3 1 + ( 2 ) + ( 3 )

Executed when e is shifted Example of lists terminal Integer NUMBER; terminal PLUS,MINUS,MULT,DIV,LPAREN,RPAREN,SEMI; terminal UMINUS; non terminal Integer expr; non terminal expr_list, expr_part; precedence left PLUS, MINUS; precedence left DIV, MULT; precedence left UMINUS; expr_list ::= expr_list expr_part | expr_part ; expr_part ::= expr:e {: System.out.println("= " + e); :} SEMI expr ::= expr PLUS expr | expr MINUS expr | expr MULT expr | expr DIV expr | MINUS expr %prec UMINUS | LPAREN expr RPAREN | NUMBER Executed when e is shifted

Grammar Hierarchy Non-ambiguous CFG LR(1) LALR(1) LL(1) SLR(1) LR(0)

Next lecture: IR and Operational Semantics