Fall Compiler Principles Lecture 2: LL parsing


Fall 2016-2017 Compiler Principles Lecture 2: LL parsing Roman Manevich Ben-Gurion University of the Negev

Books
- Compilers: Principles, Techniques, and Tools. Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman
- Modern Compiler Implementation in Java. Andrew W. Appel
- Modern Compiler Design. D. Grune, H. Bal, C. Jacobs, K. Langendoen
- Advanced Compiler Design and Implementation. Steven Muchnick

Tentative syllabus
- Front End: Scanning, Top-down Parsing (LL), Bottom-up Parsing (LR)
- Intermediate Representation: Operational Semantics, Lowering
- Optimizations: Dataflow Analysis, Loop Optimizations
- Code Generation: Register Allocation, Energy Optimization, Instruction Selection
(plus a mid-term exam)

Parsing background
- Context-free grammars: terminals, nonterminals, start nonterminal, productions (rules)
- Context-free languages
- Derivations (leftmost, rightmost)
- Derivation tree (also called parse tree)
- Ambiguous grammars

Agenda
- Understand the role of syntax analysis
- Parsing strategies
- LL parsing: building a prediction table via FIRST/FOLLOW/NULLABLE sets
- Pushdown automata algorithm
- Handling conflicts

Role of syntax analysis
Pipeline: High-level Language (Scheme) → Lexical Analysis → Syntax Analysis (Parsing) → AST + Symbol Table etc. → Intermediate Representation (IR) → Code Generation → Executable Code
- Recover structure from the stream of tokens: parse tree / abstract syntax tree
- Error reporting (recovery)
- Other possible tasks: syntax-directed translation (one-pass compilers), creating the symbol table, creating a pretty-printed version of the program, e.g., the Auto Formatting function in an IDE

From tokens to abstract syntax trees
Program text: 59 + (1257 * xPosition)
Lexical Analyzer (regular expressions, finite automata): produces a valid token stream, or reports a lexical error
Grammar: E → id | num | E + E | E * E | ( E )
Parser (context-free grammars, push-down automata): produces an abstract syntax tree, or reports a syntax error

Marking “end-of-file” Sometimes it is useful to transform a grammar G with start nonterminal S into a grammar G’ with a new start nonterminal S’ and a new production rule S’ → S $. $ is not part of the set of tokens; it is a special End-Of-File (EOF) token. To parse α with G’ we change it into α $. This simplifies parsing grammars with null productions, and also simplifies parsing LR grammars.

Another convention We will assume that all productions have been consecutively numbered:
(1) S → E $
(2) E → T
(3) E → E + T
(4) T → id
(5) T → ( E )

Parsing strategies

Broad kinds of parsers
- Parsers for arbitrary grammars: Cocke-Younger-Kasami [‘65] method, O(n^3); Earley’s method (implemented by NLTK), O(n^3) but lower for restricted classes. Not commonly used by compilers.
- Parsers for restricted classes of grammars: Top-Down (with/without backtracking) and Bottom-Up.

Top-down parsing Constructs the parse tree in a top-down manner by finding a leftmost derivation. Predictive: for every nonterminal and k tokens, predict the next production – LL(k). Challenge: beginning with the start symbol, try to guess the productions to apply to end up at the user's program.

Predictive parsing

Exercise: show leftmost derivation
(1) E → LIT  (2) E → ( E OP E )  (3) E → not E  (4) LIT → true  (5) LIT → false  (6) OP → and  (7) OP → or  (8) OP → xor
Leftmost derivation of "not ( not true or false )":
E ⇒ not E ⇒ not ( E OP E ) ⇒ not ( not E OP E ) ⇒ not ( not LIT OP E ) ⇒ not ( not true OP E ) ⇒ not ( not true or E ) ⇒ not ( not true or LIT ) ⇒ not ( not true or false )
How did we decide which production of E to take?

Predictive parsing Given a grammar G, attempt to derive a word ω. Idea: scan the input from left to right, apply a production to the leftmost nonterminal, and pick the production rule based on the next input token. Problem: there may be more than one candidate production for the next token. Solution: restrict grammars to LL(1), where the parser correctly predicts which production to apply. If the grammar is not LL(1), the parser construction algorithm will detect it.

LL(1) parsing via pushdown automata
Components: an input stream (e.g., a + b $), a stack of symbols holding the current sentential form (e.g., X Y Z $), a parsing program, and a prediction table mapping (nonterminal, token) pairs to productions. Output: a derivation tree, or an error.

LL(1) parsing algorithm
Set stack = S$ and repeat:
- Prediction: when the top of the stack is a nonterminal N, pop N and look up Table[N,t] for the next input token t. If Table[N,t] is not empty, push Table[N,t] onto the stack; else return syntax error.
- Match: when the top of the stack is a terminal t, if t = next input token, pop t and increment the input index; else return syntax error.
- End: when the stack is empty, if the input is empty return success; else return syntax error.
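The steps above can be sketched as a short table-driven parser. This is a minimal Python sketch of mine (not from the slides); it hard-codes the prediction table for the grammar S → aSb | c used in the running example below, and encodes nonterminals as uppercase strings.

```python
# A minimal table-driven LL(1) parser, following the algorithm above.
# Grammar (used in the running example below): S -> a S b | c
# Nonterminals are uppercase strings; everything else is a terminal.

TABLE = {
    ('S', 'a'): ['a', 'S', 'b'],   # predict S -> a S b
    ('S', 'c'): ['c'],             # predict S -> c
}

def parse(tokens):
    """Return True iff tokens (a list ending in '$') is derivable from S."""
    stack = ['S', '$']             # front of the list is the top of the stack
    i = 0
    while stack:
        top = stack.pop(0)
        t = tokens[i]
        if top.isupper():          # nonterminal: predict
            rhs = TABLE.get((top, t))
            if rhs is None:
                return False       # no table entry: syntax error
            stack = rhs + stack    # push the production's right-hand side
        else:                      # terminal: match
            if top != t:
                return False       # mismatch: syntax error
            i += 1
    return i == len(tokens)        # success iff all input was consumed

print(parse(list('aacbb$')))  # True  (the legal-input example)
print(parse(list('abcbb$')))  # False (the illegal-input example)
```

The two calls at the end reproduce the legal and illegal traces shown on the next slides.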

Example prediction table
(1) E → LIT  (2) E → ( E OP E )  (3) E → not E  (4) LIT → true  (5) LIT → false  (6) OP → and  (7) OP → or  (8) OP → xor
'(' ∈ FIRST( ( E OP E ) ), so Table[E, '('] = 2. Table entries determine which production to take:

         (    )    not  true  false  and  or   xor  $
  E      2         3    1     1
  LIT                   4     5
  OP                                 6    7    8

Running parser example
Input: aacbb$   Grammar: S → aSb | c   Table: Table[S,a] = S → aSb, Table[S,c] = S → c

  Input suffix | Stack content | Move
  aacbb$       | S$            | predict(S,a) = S → aSb
  aacbb$       | aSb$          | match(a,a)
  acbb$        | Sb$           | predict(S,a) = S → aSb
  acbb$        | aSbb$         | match(a,a)
  cbb$         | Sbb$          | predict(S,c) = S → c
  cbb$         | cbb$          | match(c,c)
  bb$          | bb$           | match(b,b)
  b$           | b$            | match(b,b)
  $            | $             | match($,$) – success

Illegal input example
Input: abcbb$   Grammar: S → aSb | c   Table: Table[S,a] = S → aSb, Table[S,c] = S → c

  Input suffix | Stack content | Move
  abcbb$       | S$            | predict(S,a) = S → aSb
  abcbb$       | aSb$          | match(a,a)
  bcbb$        | Sb$           | predict(S,b) = ERROR

Building the prediction table Let G be a grammar. Compute FIRST/NULLABLE/FOLLOW and check for conflicts. No conflicts => G is an LL(1) grammar. Conflicts exist => G is not an LL(1) grammar; attempt to transform G into an equivalent LL(1) grammar G’.

First sets

FIRST sets Definition: for a nonterminal A, FIRST(A) is the set of terminals that can start a sentence derived from A. Formally: FIRST(A) = { t | A ⇒* tω }. Definition: for a sentential form α, FIRST(α) is the set of terminals that can start a sentence derived from α. Formally: FIRST(α) = { t | α ⇒* tω }.

FIRST sets example
E → LIT | ( E OP E ) | not E
LIT → true | false
OP → and | or | xor
FIRST(E) = …? FIRST(LIT) = …? FIRST(OP) = …?

FIRST sets example
E → LIT | ( E OP E ) | not E
LIT → true | false
OP → and | or | xor
FIRST(E) = FIRST(LIT) ∪ FIRST( ( E OP E ) ) ∪ FIRST(not E)
FIRST(LIT) = { true, false }
FIRST(OP) = { and, or, xor }
A set of recursive equations – how do we solve them?

Computing FIRST sets Assume no null productions (A → ε). Initially, for all nonterminals A, set FIRST(A) = { t | A → tω for some ω }. Repeat the following until no changes occur: for each nonterminal A, for each production A → α1 | … | αk, set FIRST(A) := FIRST(α1) ∪ … ∪ FIRST(αk). This is known as a fixed-point algorithm. We will see such iterative methods later in the course and learn to reason about them.
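As a concrete illustration, here is a minimal Python sketch of this fixed-point computation (my sketch, not from the slides), run on the Boolean-expression grammar from the earlier example:

```python
# Fixed-point computation of FIRST sets for a grammar without epsilon
# productions, as in the algorithm above. Dict keys are the nonterminals;
# every other symbol on a right-hand side is treated as a terminal.

GRAMMAR = {   # the Boolean-expression grammar from the earlier example
    'E':   [['LIT'], ['(', 'E', 'OP', 'E', ')'], ['not', 'E']],
    'LIT': [['true'], ['false']],
    'OP':  [['and'], ['or'], ['xor']],
}

def first_sets(grammar):
    first = {a: set() for a in grammar}
    changed = True
    while changed:                  # iterate until nothing changes (fixed point)
        changed = False
        for a, productions in grammar.items():
            for rhs in productions:
                x = rhs[0]          # no epsilon: FIRST of the first symbol
                new = first[x] if x in grammar else {x}
                if not new <= first[a]:
                    first[a] |= new
                    changed = True
    return first

fs = first_sets(GRAMMAR)
print(sorted(fs['E']))   # ['(', 'false', 'not', 'true']
```

Each pass only ever adds terminals to the sets, which is why the loop must terminate.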

Exercise: compute FIRST
STMT → if EXPR then STMT | while EXPR do STMT | EXPR ;
EXPR → TERM -> id | zero? TERM | not EXPR | ++ id | -- id
TERM → id | constant
Constraints:
FIRST(STMT) = FIRST(if) ∪ FIRST(while) ∪ FIRST(EXPR)
FIRST(EXPR) = FIRST(TERM) ∪ FIRST(zero?) ∪ FIRST(not) ∪ FIRST(++) ∪ FIRST(--)
FIRST(TERM) = FIRST(id) ∪ FIRST(constant)

Exercise: compute FIRST
STMT → if EXPR then STMT | while EXPR do STMT | EXPR ;
EXPR → TERM -> id | zero? TERM | not EXPR | ++ id | -- id
TERM → id | constant
Simplified constraints:
FIRST(STMT) = {if, while} ∪ FIRST(EXPR)
FIRST(EXPR) = {zero?, not, ++, --} ∪ FIRST(TERM)
FIRST(TERM) = {id, constant}

1. Initialization
FIRST(STMT) = {if, while}
FIRST(EXPR) = {zero?, not, ++, --}
FIRST(TERM) = {id, constant}

2. Iterate 1
FIRST(EXPR) gains FIRST(TERM):
FIRST(EXPR) = {zero?, not, ++, --, id, constant}

2. Iterate 2
FIRST(STMT) gains FIRST(EXPR):
FIRST(STMT) = {if, while, zero?, not, ++, --, id, constant}

2. Iterate 3 – fixed-point
No more changes; the sets above are the final FIRST sets.

Reasoning about the algorithm Assume no null productions (A → ε). Initially, for all nonterminals A, set FIRST(A) = { t | A → tω for some ω }. Repeat the following until no changes occur: for each nonterminal A, for each production A → α1 | … | αk, set FIRST(A) := FIRST(α1) ∪ … ∪ FIRST(αk). Is the algorithm correct? Does it terminate? (complexity)

Reasoning about the algorithm Termination: each FIRST set only grows, and is bounded by the finite set of terminals, so the iteration must reach a fixed point. Correctness: at the fixed point every equation holds, and by induction on the length of derivations FIRST(A) contains exactly the terminals that can start a sentence derived from A.

LL(1) Parsing of grammars without epsilon productions

Using FIRST sets Assume G has no epsilon productions, and that for every nonterminal X and every pair of productions X → α and X → β we have FIRST(α) ∩ FIRST(β) = {}. No intersection between FIRST sets => we can always pick a single rule.

Using FIRST sets In our Boolean expressions example: FIRST( LIT ) = { true, false }, FIRST( ( E OP E ) ) = { ‘(‘ }, FIRST( not E ) = { not }. If the FIRST sets intersect, we may need a longer lookahead. LL(k) = the class of grammars in which the production rule can be determined using a lookahead of k tokens. LL(1) is an important and useful class. What if there are epsilon productions?

Extending LL(1) Parsing for epsilon productions

FIRST, FOLLOW, NULLABLE sets For each nonterminal X: FIRST(X) = set of terminals that can start a sentence derived from X, i.e., FIRST(X) = { t | X ⇒* tω }. NULLABLE(X) holds if X ⇒* ε. FOLLOW(X) = set of terminals that can follow X in some derivation, i.e., FOLLOW(X) = { t | S ⇒* α X t β }.

Computing the NULLABLE set
Lemma: NULLABLE(α1 … αk) = NULLABLE(α1) ∧ … ∧ NULLABLE(αk)
Initially NULLABLE(X) = false for every nonterminal X.
For each nonterminal X: if there exists a production X → ε, then NULLABLE(X) = true.
Repeat: for each production Y → α1 … αk, if NULLABLE(α1 … αk) then NULLABLE(Y) = true; until NULLABLE stabilizes.
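This fixed-point iteration can also be sketched in a few lines of Python (my sketch, not from the slides), using the grammar of the exercise on the next slide and an empty right-hand side to stand for ε:

```python
# Fixed-point computation of NULLABLE, following the algorithm above.
# An epsilon production is written as an empty right-hand side [].

GRAMMAR = {          # the exercise grammar on the next slide
    'S': [['A', 'a', 'b']],
    'A': [['a'], []],
    'B': [['A', 'B'], ['C']],
    'C': [['b'], []],
}

def nullable_set(grammar):
    nullable = set()
    changed = True
    while changed:
        changed = False
        for a, productions in grammar.items():
            if a in nullable:
                continue
            for rhs in productions:
                # NULLABLE(a1 ... ak) = NULLABLE(a1) and ... and NULLABLE(ak);
                # an empty right-hand side (epsilon) is trivially nullable
                if all(x in nullable for x in rhs):
                    nullable.add(a)
                    changed = True
                    break
    return nullable

print(sorted(nullable_set(GRAMMAR)))  # ['A', 'B', 'C']
```

Note that B becomes nullable only on the second pass, after C has been marked nullable – exactly the stabilization the algorithm waits for.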

Exercise: compute NULLABLE
S → A a b
A → a | ε
B → A B | C
C → b | ε
NULLABLE(S) = NULLABLE(A) ∧ NULLABLE(a) ∧ NULLABLE(b)
NULLABLE(A) = NULLABLE(a) ∨ NULLABLE(ε)
NULLABLE(B) = (NULLABLE(A) ∧ NULLABLE(B)) ∨ NULLABLE(C)
NULLABLE(C) = NULLABLE(b) ∨ NULLABLE(ε)

FIRST with epsilon productions How do we compute FIRST(α1 … αk) when epsilon productions are allowed? FIRST(α1 … αk) = ?

FIRST with epsilon productions How do we compute FIRST(α1 … αk) when epsilon productions are allowed? FIRST(α1 … αk) = if not NULLABLE(α1) then FIRST(α1), else FIRST(α1) ∪ FIRST(α2 … αk).

Exercise: compute FIRST
S → A c b
A → a | ε
NULLABLE(S) = NULLABLE(A) ∧ NULLABLE(c) ∧ NULLABLE(b)
NULLABLE(A) = NULLABLE(a) ∨ NULLABLE(ε)
FIRST(S) = FIRST(A) ∪ FIRST(cb)   (since NULLABLE(A))
FIRST(A) = FIRST(a) ∪ FIRST(ε)
Solution: FIRST(S) = FIRST(A) ∪ {c},  FIRST(A) = {a}

FOLLOW sets (p. 189)
if X → α Y β then FOLLOW(Y) ⊇ ?
if NULLABLE(β) or β = ε then FOLLOW(Y) ⊇ ?

FOLLOW sets (p. 189)
if X → α Y β then FOLLOW(Y) ⊇ FIRST(β)
if NULLABLE(β) or β = ε then FOLLOW(Y) ⊇ ?

FOLLOW sets (p. 189)
if X → α Y β then FOLLOW(Y) ⊇ FIRST(β)
if NULLABLE(β) or β = ε then FOLLOW(Y) ⊇ FOLLOW(X)

FOLLOW sets (p. 189)
if X → α Y β then FOLLOW(Y) ⊇ FIRST(β)
if NULLABLE(β) or β = ε then FOLLOW(Y) ⊇ FOLLOW(X)
This allows predicting epsilon productions: X → ε when the lookahead token is in FOLLOW(X).
S → A c b
A → a | ε
What should we predict for input “cb”? What should we predict for input “acb”?
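The two FOLLOW rules also yield a fixed-point computation. The Python sketch below is mine (not from the slides); it hard-codes the NULLABLE and FIRST sets for the grammar S → A c b, A → a | ε so it can focus on the FOLLOW rules themselves:

```python
# Fixed-point computation of FOLLOW sets using the two rules above.
# NULLABLE and FIRST are assumed precomputed; here they are hard-coded
# for the grammar S -> A c b, A -> a | eps ([] stands for epsilon).

GRAMMAR = {
    'S': [['A', 'c', 'b']],
    'A': [['a'], []],
}
NULLABLE = {'A'}
FIRST = {'S': {'a', 'c'}, 'A': {'a'}}

def first_of(seq):
    """FIRST of a sentential form, using the epsilon-aware rule."""
    out = set()
    for x in seq:
        out |= FIRST[x] if x in GRAMMAR else {x}
        if x not in NULLABLE:          # terminals are never nullable
            return out, False
    return out, True                   # the whole sequence is nullable

def follow_sets(grammar):
    follow = {a: set() for a in grammar}
    changed = True
    while changed:
        changed = False
        for x, productions in grammar.items():
            for rhs in productions:
                for i, y in enumerate(rhs):
                    if y not in grammar:
                        continue       # only nonterminals have FOLLOW sets
                    beta_first, beta_nullable = first_of(rhs[i + 1:])
                    new = beta_first | (follow[x] if beta_nullable else set())
                    if not new <= follow[y]:
                        follow[y] |= new
                        changed = True
    return follow

print(follow_sets(GRAMMAR))  # FOLLOW(A) = {'c'}, FOLLOW(S) = {}
```

With FOLLOW(A) = {c}, the parser predicts A → ε for input “cb” (the lookahead c is in FOLLOW(A)) and A → a for input “acb”.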

LL(1) conflicts

Conflicts
FIRST-FIRST conflict: X → α and X → β with FIRST(α) ∩ FIRST(β) ≠ {}
FIRST-FOLLOW conflict: NULLABLE(X) and FIRST(X) ∩ FOLLOW(X) ≠ {}

LL(1) grammars A grammar is in the class LL(1) when its LL(1) prediction table contains no conflicts. A language is said to be LL(1) when it has an LL(1) grammar.

LL(k) grammars

LL(k) grammars Generalizes LL(1) for k lookahead tokens Need to generalize FIRST and FOLLOW for k lookahead tokens

Agenda LL(k) via pushdown automata Predicting productions via FIRST/FOLLOW/NULLABLE sets Handling conflicts

Handling conflicts

Problem 1: FIRST-FIRST conflict
term → ID | indexed_elem
indexed_elem → ID [ expr ]
FIRST(term) = { ID }
FIRST(indexed_elem) = { ID }
How can we transform the grammar into an equivalent grammar that does not have this conflict?

Solution: left factoring Rewrite the grammar to be in LL(1).
Old: term → ID | indexed_elem ; indexed_elem → ID [ expr ]
New: term → ID after_ID ; after_ID → [ expr ] | ε
The new grammar is more complex – it has an epsilon production. Intuition: just like factoring in algebra: x*y + x*z into x*(y+z).

Exercise: apply left factoring
S → if E then S else S | if E then S | T

Exercise: apply left factoring
S → if E then S else S | if E then S | T
Solution:
S → if E then S S’ | T
S’ → else S | ε

Problem 2: FIRST-FOLLOW conflict
S → A a b
A → a | ε
FIRST(S) = { a }   FOLLOW(S) = { }
FIRST(A) = { a }   FOLLOW(A) = { a }
How can we transform the grammar into an equivalent grammar that does not have this conflict?

Solution: substitution
S → A a b
A → a | ε
Substitute A in S: S → a a b | a b

Solution: substitution
S → A a b
A → a | ε
Substitute A in S: S → a a b | a b
Left factoring: S → a after_A ; after_A → a b | b

Problem 3: FIRST-FIRST conflict
E → E - term | term
Left recursion cannot be handled with a bounded lookahead.
How can we transform the grammar into an equivalent grammar that does not have this conflict?

Solution: left recursion removal (p. 130)
G1: N → Nα | β
G2: N → βN’ ; N’ → αN’ | ε
L(G1) = β, βα, βαα, βααα, … ; L(G2) = the same
Can be done algorithmically. Problem 1: the grammar becomes mangled beyond recognition. Problem 2: the grammar may not be LL(1).
For our 3rd example: E → E - term | term becomes E → term TE ; TE → - term TE | ε
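The rewrite from G1 to G2 can indeed be done algorithmically. Here is a small Python sketch of mine (not from the slides) that handles only immediate left recursion for a single nonterminal, which is enough for the E example:

```python
# A sketch of immediate left-recursion removal for one nonterminal:
#   N -> N a1 | ... | N am | b1 | ... | bn
# becomes
#   N -> b1 N' | ... | bn N'      N' -> a1 N' | ... | am N' | eps

def remove_left_recursion(n, productions, fresh):
    """productions: list of right-hand sides (lists); fresh: name for N'."""
    recursive = [rhs[1:] for rhs in productions if rhs[:1] == [n]]   # the a_i
    other = [rhs for rhs in productions if rhs[:1] != [n]]           # the b_j
    new_n = [rhs + [fresh] for rhs in other]
    new_fresh = [alpha + [fresh] for alpha in recursive] + [[]]      # [] is eps
    return {n: new_n, fresh: new_fresh}

# E -> E - term | term   becomes   E -> term TE ; TE -> - term TE | eps
print(remove_left_recursion('E', [['E', '-', 'term'], ['term']], 'TE'))
```

The printed grammar matches the transformed grammar on the slide, with TE playing the role of N’.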

Recap Given a grammar:
- Compute for each nonterminal: NULLABLE; FIRST using NULLABLE; FOLLOW using FIRST and NULLABLE
- Compute FIRST for each sentential form appearing on the right-hand side of a production
- Check for conflicts; if they exist, attempt to remove them by rewriting the grammar
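Putting the recap together, here is a Python sketch of mine (not from the slides) that builds the prediction table from precomputed NULLABLE/FIRST/FOLLOW sets and reports conflicts; the sets are hard-coded for the small grammar S → A c b, A → a | ε from the FIRST-FOLLOW discussion:

```python
# Building the LL(1) prediction table from precomputed NULLABLE/FIRST/FOLLOW
# sets, and reporting conflicts. The sets below are hard-coded for the
# grammar S -> A c b, A -> a | eps ([] stands for epsilon).

GRAMMAR = {'S': [['A', 'c', 'b']], 'A': [['a'], []]}
NULLABLE = {'A'}
FIRST = {'S': {'a', 'c'}, 'A': {'a'}}
FOLLOW = {'S': set(), 'A': {'c'}}

def first_of(seq):
    """FIRST of a sentential form, using the epsilon-aware rule."""
    out = set()
    for x in seq:
        out |= FIRST[x] if x in GRAMMAR else {x}
        if x not in NULLABLE:            # terminals are never nullable
            return out, False
    return out, True                     # the whole sequence is nullable

def build_table(grammar):
    table = {}
    for n, productions in grammar.items():
        for rhs in productions:
            f, nullable = first_of(rhs)
            # predict rhs on FIRST(rhs), plus FOLLOW(n) if rhs is nullable
            for t in f | (FOLLOW[n] if nullable else set()):
                if (n, t) in table:      # FIRST-FIRST or FIRST-FOLLOW conflict
                    raise ValueError(f'LL(1) conflict at ({n}, {t})')
                table[(n, t)] = rhs
    return table

print(sorted(build_table(GRAMMAR).items()))
```

For this grammar no exception is raised, so the grammar is LL(1); in particular A → ε is predicted on the lookahead c, since c ∈ FOLLOW(A).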

The bigger picture Compilers include different kinds of program analyses, each further constraining the set of legal programs:
- Lexical constraints: the program consists of legal tokens
- Syntax constraints: the program is included in a given context-free language
- Semantic constraints: the program is included in a given attribute grammar (type checking, legal inheritance graph, variables initialized before used)
- “Logical” constraints (Verifying Compiler grand challenge): memory safety – null dereference, array-out-of-bounds access, data races – and functional correctness (the program meets its specification)

Next lecture: bottom-up parsing