Parsing. Compiler. Baojian Hua.

Presentation transcript:

Parsing Compiler Baojian Hua

Front End
[Figure: source code -> lexical analyzer -> tokens -> parser -> abstract syntax tree -> semantic analyzer -> IR]

Parsing
- The parser translates the source program into abstract syntax trees.
- Token sequence: comes from the lexer.
- Abstract syntax trees:
  - check the validity of programs
  - build the compiler's internal data structures for programs
- The parser must take the program syntax into account.

Conceptually
[Figure: the parser takes the token sequence and the language syntax as inputs and produces an abstract syntax tree.]

Syntax: Context-free Grammar
- Context-free grammars are (often) given as BNF (Backus-Naur Form) expressions.
  - Read Dragon sec 2.2.
- More powerful than regular expressions (RE) in theory.
- Good for defining language syntax.

Context-free Grammar (CFG)
A CFG consists of 4 components:
- a set of terminals (tokens): T
- a set of nonterminals: N
- a set of production rules: P
    s -> t1 t2 … tn, with s ∈ N and t1, …, tn ∈ (T ∪ N)
- a unique start nonterminal: S

Example
// Recall the min-ML language in "code3"
// (simplified)
N = {decs, dec, exp}
T = {SEMICOLON, VAL, ID, ASSIGN, NUM}
S = decs
decs -> dec SEMICOLON decs
      | ε
dec  -> VAL ID ASSIGN exp
exp  -> ID
      | NUM

Derivation
A derivation:
- starts with the unique start nonterminal S
- repeatedly replaces a nonterminal s in the current string with the body of one of the production rules of s
- stops when the string consists of terminals only
The final string consists of terminals only and is called a sentence (a program).

Example
decs -> dec SEMICOLON decs | ε
dec  -> VAL ID ASSIGN exp
exp  -> ID | NUM
Input to derive: val x = 5; val y = x;
decs => … (a choice)

Example
decs -> dec SEMICOLON decs | ε
dec  -> VAL ID ASSIGN exp
exp  -> ID | NUM
Input to derive: val x = 5; val y = x;
decs => dec SEMICOLON decs
     => VAL ID ASSIGN exp SEMICOLON decs
     => VAL ID ASSIGN NUM SEMICOLON decs
     => VAL ID ASSIGN NUM SEMICOLON dec SEMICOLON decs
     => …
     => VAL ID ASSIGN NUM SEMICOLON VAL ID ASSIGN ID SEMICOLON decs
     => VAL ID ASSIGN NUM SEMICOLON VAL ID ASSIGN ID SEMICOLON    (using decs -> ε)

Another Way to Derive the Same Program
decs -> dec SEMICOLON decs | ε
dec  -> VAL ID ASSIGN exp
exp  -> ID | NUM
Input to derive: val x = 5; val y = x;
decs => dec SEMICOLON decs
     => dec SEMICOLON dec SEMICOLON decs
     => …

Derivation
- For the same string, there may exist many derivations, for example:
  - the left-most derivation
  - the right-most derivation
- Parsing is the problem of taking a string of terminals and figuring out whether it can be derived from a CFG.
  - This also gives error detection.
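A left-most derivation can be replayed mechanically: at each step, rewrite the left-most nonterminal with a chosen alternative. The Python sketch below assumes the dictionary encoding of the decs grammar (an empty list for the empty alternative); leftmost_derivation and CHOICES are illustrative names, and the choice list was worked out by hand for "val x = 5; val y = x;":

```python
GRAMMAR = {
    "decs": [["dec", "SEMICOLON", "decs"], []],
    "dec": [["VAL", "ID", "ASSIGN", "exp"]],
    "exp": [["ID"], ["NUM"]],
}


def leftmost_derivation(grammar, start, choices):
    """Apply choices[i] (an alternative index) to the left-most
    nonterminal at step i; return every sentential form."""
    form = [start]
    steps = [list(form)]
    for choice in choices:
        # Position of the left-most nonterminal in the current form.
        idx = next(i for i, sym in enumerate(form) if sym in grammar)
        form[idx:idx + 1] = grammar[form[idx]][choice]
        steps.append(list(form))
    return steps


# decs, dec, exp -> NUM, decs, dec, exp -> ID, decs -> empty
CHOICES = [0, 0, 1, 0, 0, 0, 1]

for step in leftmost_derivation(GRAMMAR, "decs", CHOICES):
    print(" ".join(step) or "(empty)")
```

A right-most derivation would pick the last nonterminal instead of the first; both replay the same set of rule applications in a different order.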

Parse Trees
- A derivation can also be represented as a tree.
  - Useful for understanding ASTs (discussed later).
- Idea:
  - each internal node is labeled with a nonterminal
  - each leaf node is labeled with a terminal
  - each use of a rule in a derivation explains how to generate the children of a node in the parse tree

Example
decs -> dec SEMICOLON decs | ε
dec  -> VAL ID ASSIGN exp
exp  -> ID | NUM
Input to derive: val x = 5; val y = x;
Parse tree:
decs
  dec
    VAL
    ID
    =
    exp
      5
  SEMI
  decs
    dec      (similar case)
    SEMI
    decs

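Different Derivations, Same Tree
Input to derive: val x = 5; val y = x;
Derivation 1:
decs => dec SEMICOLON decs
     => VAL ID ASSIGN exp SEMICOLON decs
     => …
Derivation 2:
decs => dec SEMICOLON decs
     => dec SEMICOLON dec SEMICOLON decs
     => …
[Figure: both derivations yield one parse tree, with root decs and children dec, SEMI, decs; the first dec expands to VAL ID = exp, exp expands to 5, and the inner decs is a similar case.]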

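Parse Tree has Meanings: post-order traversal
Input to derive: val x = 5; val y = x;
Derivation 1:
decs => dec SEMICOLON decs
     => VAL ID ASSIGN exp SEMICOLON decs
     => …
Derivation 2:
decs => dec SEMICOLON decs
     => dec SEMICOLON dec SEMICOLON decs
     => …
[Figure: the parse tree for the input, with root decs and children dec, SEMI, decs; the first dec expands to VAL ID = exp, exp expands to 5, and the inner decs is a similar case.]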
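A parse tree and its post-order traversal can be sketched in a few lines of Python. The tuple encoding (label, children) and the names TREE, post_order and frontier are assumptions made for this illustration; the tree covers only the single declaration "val x = 5;" to keep it small:

```python
NONTERMINALS = {"decs", "dec", "exp"}


def leaf(label):
    """A terminal node: a label with no children."""
    return (label, [])


# Parse tree for the token sequence VAL ID ASSIGN NUM SEMICOLON.
TREE = ("decs", [
    ("dec", [leaf("VAL"), leaf("ID"), leaf("ASSIGN"),
             ("exp", [leaf("NUM")])]),
    leaf("SEMICOLON"),
    ("decs", []),  # decs -> empty
])


def post_order(node):
    """Visit all children left to right, then the node itself."""
    label, children = node
    visited = []
    for child in children:
        visited.extend(post_order(child))
    visited.append(label)
    return visited


def frontier(node):
    """Terminals at the leaves, left to right: the parsed sentence."""
    label, children = node
    if label in NONTERMINALS:
        return [tok for child in children for tok in frontier(child)]
    return [label]


print(post_order(TREE))
print(frontier(TREE))
```

Post-order is the natural order for attaching meaning: every child has been processed before its parent is visited.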

Ambiguous Grammars
A grammar is ambiguous if the same sequence of tokens can give rise to two or more different parse trees.

Example
exp -> num
     | id
     | exp + exp
     | exp * exp
Input to derive: 3+4*5
Derivation 1:
exp => exp + exp
    => 3 + exp
    => 3 + exp * exp
    => 3 + 4 * exp
    => 3 + 4 * 5
Derivation 2:
exp => exp * exp
    => exp + exp * exp
    => 3 + exp * exp
    => 3 + 4 * exp
    => 3 + 4 * 5

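Example
exp -> num
     | id
     | exp + exp
     | exp * exp
Derivation 1: exp => exp + exp => 3 + exp => 3 + exp * exp => 3 + 4 * exp => 3 + 4 * 5
Derivation 2: exp => exp * exp => exp + exp * exp => 3 + exp * exp => 3 + 4 * exp => 3 + 4 * 5
[Figure: the two parse trees. Derivation 1 has + at the root and groups the input as 3 + (4 * 5); derivation 2 has * at the root and groups it as (3 + 4) * 5.]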

Ambiguous Grammars
- Problem: compilers use parse trees to interpret the meaning of parsed programs.
  - Different parse trees have different meanings.
  - eg: 4 + (5 * 6) is not (4 + 5) * 6
  - Languages with ambiguous grammars are DISASTROUS; the meaning of programs isn't well-defined! You can't tell what your program might do!
- Solution: rewrite the grammar into an equivalent, unambiguous form.
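To see why the choice of tree matters, the two parse trees for 3+4*5 can be evaluated directly. This is a small Python sketch; the tuple encoding (operator, left, right) and the names TREE_A, TREE_B and evaluate are assumptions for the illustration:

```python
def evaluate(tree):
    """Evaluate an expression tree: an int leaf or (op, left, right)."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    lhs, rhs = evaluate(left), evaluate(right)
    return lhs + rhs if op == "+" else lhs * rhs


TREE_A = ("+", 3, ("*", 4, 5))  # 3 + (4 * 5)
TREE_B = ("*", ("+", 3, 4), 5)  # (3 + 4) * 5

print(evaluate(TREE_A), evaluate(TREE_B))  # prints "23 35"
```

The same token sequence yields two different values, so a compiler built on the ambiguous grammar has no single answer to give.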

Eliminating ambiguity
- In programming language syntax, ambiguity often arises from missing operator precedence or associativity.
  - * has higher precedence than +
  - both + and * are left-associative
  - Why or why not?
- Rewrite the grammar to take this into account.

Example
Left grammar:
exp -> num
     | id
     | exp + exp
     | exp * exp
Right grammar:
exp    -> exp + term
        | term
term   -> term * factor
        | factor
factor -> num
        | id
Q: is the right grammar ambiguous? Why or why not?

Parser
- A program that checks whether a program is derivable from a given grammar.
  - Expensive in general.
  - Must be fast: it has to compile a kernel of 2000k lines, and stay fast even for small application code.
- Theorists have developed specialized kinds of grammars that can be parsed efficiently:
  - LL(k) and LR(k)

Predictive parsing
- A.K.A.: recursive descent parsing, top-down parsing
  - simple to code by hand
  - efficient
  - can parse a large set of grammars
- Key idea:
  - one (recursive) function for each nonterminal
  - one clause for each production rule (right-hand side)

Example
decs -> dec SEMICOLON decs | ε
dec  -> VAL ID ASSIGN exp
exp  -> ID | NUM

(* step #1: represent tokens *)
datatype token = Val | Id of string | Num of int
               | Assign | Semicolon | Eof

(* step #2: connect with the lexer;
   getToken and error are assumed to be defined elsewhere *)
val current = ref (getToken ())
fun advance () = current := getToken ()
fun eat (t : token) =
    if !current = t
    then advance ()
    else error ("want ", t, " but got ", !current)

decs -> dec SEMICOLON decs | ε
dec  -> VAL ID ASSIGN exp
exp  -> ID | NUM

(* step #1: represent tokens *)
datatype token = Val | Id of string | Num of int
               | Assign | Semi | Eof

(* step #2: connect with the lexer *)
val current = ref (getToken ())
fun advance () = current := getToken ()
fun eat (t : token) = …

(* step #3: build the parser: one function per nonterminal,
   one case clause per production *)
fun parseDecs () =
    case !current of
        Val => (parseDec (); eat Semi; parseDecs ())
      | Eof => ()
      | _   => error "want VAL or EOF"
and parseDec () = …
and parseExp () = …
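The three steps can be tried end to end in a few lines. This is a Python sketch rather than the ML of the slides; it assumes tokens are plain strings such as "VAL" and that the token list is already available, and the names Parser, ParseError and accepts are illustrative. It only recognizes input and builds no tree:

```python
class ParseError(Exception):
    """Raised when the current token does not fit any production."""


class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["EOF"]
        self.pos = 0

    @property
    def current(self):
        return self.tokens[self.pos]

    def eat(self, expected):
        if self.current != expected:
            raise ParseError(f"want {expected} but got {self.current}")
        self.pos += 1

    def parse_decs(self):
        # decs -> dec SEMICOLON decs | empty
        if self.current == "VAL":
            self.parse_dec()
            self.eat("SEMICOLON")
            self.parse_decs()
        elif self.current != "EOF":
            raise ParseError(f"want VAL or EOF but got {self.current}")

    def parse_dec(self):
        # dec -> VAL ID ASSIGN exp
        self.eat("VAL")
        self.eat("ID")
        self.eat("ASSIGN")
        self.parse_exp()

    def parse_exp(self):
        # exp -> ID | NUM
        if self.current in ("ID", "NUM"):
            self.eat(self.current)
        else:
            raise ParseError(f"want ID or NUM but got {self.current}")


def accepts(tokens):
    """Return True if the token list is derivable from decs."""
    parser = Parser(tokens)
    try:
        parser.parse_decs()
        parser.eat("EOF")
        return True
    except ParseError:
        return False


# val x = 5; val y = x;
print(accepts("VAL ID ASSIGN NUM SEMICOLON VAL ID ASSIGN ID SEMICOLON".split()))
```

Each method inspects only the current token before committing to a production, which is what makes the parser predictive.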

Moral
- The key point in predictive parsing is to determine which production rule to use (which recursive function to call).
  - We must know the "start" symbols of each rule.
  - The "start" symbols of different rules must not overlap.
  - ex: exp -> NUM | ID
- This motivates the idea of First and Follow sets.

Moral
S -> w1 | w2 | … | wn
- The current nonterminal is S, and the current input token is t.
  - If wk starts with t, then choose wk, or
  - if wk derives the empty string and the string following S starts with t, then choose wk.
- The First symbol sets of the wi (1 <= i <= n) must not overlap, to avoid backtracking.

Nullable, First and Follow sets
To use predictive parsing, we must compute:
- Nullable: the nonterminals that can derive the empty string
- First(ω): the set of terminals that can begin any string derivable from ω
- Follow(X): the set of terminals that can immediately follow any string derivable from nonterminal X
Read Dragon sec and Tiger sec 3.2
Fixpoint algorithms

Nullable, First and Follow sets
- Which of the symbols X, Y and Z can derive the empty string?
- What terminals may the strings derived from X, Y and Z begin with?
- What terminals may follow X, Y and Z?
Z -> d
   | X Y Z
Y -> c
   | ε
X -> Y
   | a

Nullable
X can derive the empty string iff:
- base case: X -> ε
- inductive case: X -> Y1 … Yn, where Y1, …, Yn are n nonterminals that may all derive the empty string

Computing Nullable
Nullable <- {};
while (Nullable still changes)
  for (each production X -> α)
    switch (α)
      case ε:
        Nullable ∪= {X};
        break;
      case Y1 … Yn:
        if (Y1 ∈ Nullable && … && Yn ∈ Nullable)
          Nullable ∪= {X};
        break;
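The fixpoint loop translates almost line for line into Python. This is a minimal sketch using the X, Y, Z grammar from these slides, encoded as a dictionary with an empty list for the empty production; compute_nullable is an illustrative name:

```python
GRAMMAR = {
    "Z": [["d"], ["X", "Y", "Z"]],
    "Y": [["c"], []],
    "X": [["Y"], ["a"]],
}


def compute_nullable(grammar):
    """Iterate until no new nullable nonterminal is found."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, alternatives in grammar.items():
            if head in nullable:
                continue
            # all() over an empty body is True, which covers the base
            # case; terminals are never in the set, so a body that
            # contains a terminal can never qualify.
            if any(all(sym in nullable for sym in alt)
                   for alt in alternatives):
                nullable.add(head)
                changed = True
    return nullable


print(sorted(compute_nullable(GRAMMAR)))  # prints "['X', 'Y']"
```

The loop terminates because the set only grows and is bounded by the number of nonterminals.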

Example: Nullables
Z -> d | X Y Z
Y -> c | ε
X -> Y | a
Round      0     1     2
Nullable   {}

Example: Nullables
Z -> d | X Y Z
Y -> c | ε
X -> Y | a
Round      0     1        2
Nullable   {}    {Y, X}

First(X)
Set of terminals that X begins with: X => a …
Rules:
- base case: X -> a
    First(X) ∪= {a}
- inductive case: X -> Y1 Y2 … Yn
    First(X) ∪= First(Y1)
    if Y1 ∈ Nullable, First(X) ∪= First(Y2)
    if Y1, Y2 ∈ Nullable, First(X) ∪= First(Y3)
    …

Computing First
// Suppose Nullable has been computed
First(X) <- {};   // for each X
while (First still changes)
  for (each production X -> α)
    switch (α)
      case a:
        First(X) ∪= {a};
        break;
      case Y1 … Yn:
        First(X) ∪= First(Y1);
        if (Y1 ∉ Nullable)
          break;
        First(X) ∪= First(Y2);
        …;   // similar to the above
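A Python sketch of the same fixpoint, assuming the X, Y, Z grammar of these slides and taking Nullable = {X, Y} as already computed. The name compute_first is illustrative, and a terminal is treated as having itself as its First set:

```python
GRAMMAR = {
    "Z": [["d"], ["X", "Y", "Z"]],
    "Y": [["c"], []],
    "X": [["Y"], ["a"]],
}
NULLABLE = {"X", "Y"}  # taken from the Nullable computation


def compute_first(grammar, nullable):
    """Grow First(X) for every nonterminal X until nothing changes."""
    first = {nonterminal: set() for nonterminal in grammar}
    changed = True
    while changed:
        changed = False
        for head, alternatives in grammar.items():
            for alt in alternatives:
                for sym in alt:
                    # First of a terminal is the terminal itself.
                    addition = first[sym] if sym in grammar else {sym}
                    if not addition <= first[head]:
                        first[head] |= addition
                        changed = True
                    # Look at the next symbol only if this one can vanish.
                    if sym not in nullable:
                        break
    return first


for nonterminal, terminals in compute_first(GRAMMAR, NULLABLE).items():
    print(nonterminal, sorted(terminals))
```

The loop over the body symbols handles both cases of the pseudocode at once: a leading terminal contributes itself and stops the scan, while a nullable nonterminal lets the scan continue.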

Example: First
Z -> d | X Y Z
Y -> c | ε
X -> Y | a
Nullable = {X, Y}
Round      0     1     2     3
First(Z)   {}
First(Y)   {}
First(X)   {}

Example: First
Z -> d | X Y Z
Y -> c | ε
X -> Y | a
Nullable = {X, Y}
Round      0     1         2     3
First(Z)   {}    {d}
First(Y)   {}    {c}
First(X)   {}    {c, a}

Example: First
Z -> d | X Y Z
Y -> c | ε
X -> Y | a
Nullable = {X, Y}
Round      0     1         2            3
First(Z)   {}    {d}       {d, c, a}
First(Y)   {}    {c}
First(X)   {}    {c, a}

Parsing with First
Production       First of the right-hand side
Z -> d           {d}
   | X Y Z       {a, c, d}
Y -> c           {c}
   | ε           {}
X -> Y           {c}
   | a           {a}
First(Z) = {d, c, a}
First(Y) = {c}
First(X) = {c, a}
Nullable = {X, Y}
Now consider this string: d
Suppose we choose the production Z -> X Y Z.
But we get stuck at X -> Y | a: neither alternative can accept d! Why?

Follow(X)
Set of terminals that may follow X: S => … X a …
Rules:
- base case: Follow(X) = {}
- inductive case: Y -> ω1 X ω2
    Follow(X) ∪= First(ω2)
    if ω2 is Nullable, Follow(X) ∪= Follow(Y)

Computing Follow(X)
Follow(X) <- {};   // for each X
while (Follow still changes) {
  for (each production Y -> ω1 X ω2) {
    Follow(X) ∪= First(ω2);
    if (ω2 is Nullable)
      Follow(X) ∪= Follow(Y);
  }
}
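A Python sketch of the Follow fixpoint for the X, Y, Z grammar, taking Nullable and First as already computed. One assumption goes beyond the pseudocode: the start symbol Z is seeded with the end marker $, which is what the worked example on these slides shows for Follow(Z). The names first_of_string and compute_follow are illustrative:

```python
GRAMMAR = {
    "Z": [["d"], ["X", "Y", "Z"]],
    "Y": [["c"], []],
    "X": [["Y"], ["a"]],
}
NULLABLE = {"X", "Y"}
FIRST = {"Z": {"d", "c", "a"}, "Y": {"c"}, "X": {"c", "a"}}


def first_of_string(symbols):
    """Return (First set, is_nullable) for a sequence of symbols."""
    result = set()
    for sym in symbols:
        result |= FIRST[sym] if sym in GRAMMAR else {sym}
        if sym not in NULLABLE:
            return result, False
    return result, True


def compute_follow(start):
    follow = {nonterminal: set() for nonterminal in GRAMMAR}
    follow[start].add("$")  # assumption: end marker follows the start symbol
    changed = True
    while changed:
        changed = False
        for head, alternatives in GRAMMAR.items():
            for alt in alternatives:
                for i, sym in enumerate(alt):
                    if sym not in GRAMMAR:
                        continue  # Follow is defined for nonterminals only
                    addition, rest_nullable = first_of_string(alt[i + 1:])
                    if rest_nullable:
                        addition = addition | follow[head]
                    if not addition <= follow[sym]:
                        follow[sym] |= addition
                        changed = True
    return follow


for nonterminal, terminals in compute_follow("Z").items():
    print(nonterminal, sorted(terminals))
```

Every occurrence of a nonterminal in a production body is visited, so a production with several nonterminals contributes to several Follow sets in one pass.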

Example: Follow
Z -> d | X Y Z
Y -> c | ε
X -> Y | a
Nullable = {X, Y}
First(Z) = {d, c, a}
First(Y) = {c}
First(X) = {c, a}
Round       0     1     2     3
Follow(Z)   {}
Follow(Y)   {}
Follow(X)   {}

Example: Follow
Z -> d | X Y Z
Y -> c | ε
X -> Y | a
Nullable = {X, Y}
First(Z) = {d, c, a}
First(Y) = {c}
First(X) = {c, a}
Round       0     1            2     3
Follow(Z)   {}    {$}
Follow(Y)   {}    {d, c, a}
Follow(X)   {}    {d, c, a}

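Predictive Parsing Table
- With Nullable, First(), and Follow(), we can make a parsing table P(N, T).
  - Each entry contains a set of productions.
[Table: rows are the nonterminals N1, N2, N3, …; columns are the terminals t1, t2, t3, t4, …, $ (EOF); entries are productions such as ri, rk, rj.]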

Predictive Parsing Table
For each rule X -> ω:
- for each a ∈ First(ω), add X -> ω to P(X, a)
- if ω is nullable, add X -> ω to P(X, b) for each b ∈ Follow(X)
- all other entries are "error"
[Table: rows are the nonterminals N1, N2, N3, …; columns are the terminals t1, t2, t3, t4, …, $ (EOF); entries are productions such as r1, rk, ri.]

Example: Predictive Parsing Table
Z -> d | X Y Z
Y -> c | ε
X -> Y | a
Nullable = {X, Y}
     First        Follow
X    {c, a}       {c, d, a}
Y    {c}          {c, d, a}
Z    {d, c, a}    {$}
Parsing table:
     a                 c                 d
Z    Z -> X Y Z        Z -> X Y Z        Z -> d, Z -> X Y Z
Y    Y -> ε            Y -> c, Y -> ε    Y -> ε
X    X -> Y, X -> a    X -> Y            X -> Y

LL(1)
- A context-free grammar is called LL(1) if it can be parsed this way:
  - Left-to-right parsing
  - Leftmost derivation
  - 1 token of lookahead
- This means that in the predictive parsing table, there is at most one production in every entry.
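The table construction and the LL(1) test can be sketched together in Python. The sketch assumes the X, Y, Z grammar with its Nullable, First and Follow sets as given on the slides, and adds a production to the Follow columns when its right-hand side is nullable; build_table and is_ll1 are illustrative names:

```python
GRAMMAR = {
    "Z": [["d"], ["X", "Y", "Z"]],
    "Y": [["c"], []],
    "X": [["Y"], ["a"]],
}
NULLABLE = {"X", "Y"}
FIRST = {"Z": {"d", "c", "a"}, "Y": {"c"}, "X": {"c", "a"}}
FOLLOW = {"Z": {"$"}, "Y": {"a", "c", "d"}, "X": {"a", "c", "d"}}


def first_of_string(symbols):
    """Return (First set, is_nullable) for a sequence of symbols."""
    result = set()
    for sym in symbols:
        result |= FIRST[sym] if sym in GRAMMAR else {sym}
        if sym not in NULLABLE:
            return result, False
    return result, True


def build_table():
    """Map (nonterminal, terminal) to the list of applicable bodies."""
    table = {}
    for head, alternatives in GRAMMAR.items():
        for alt in alternatives:
            lookaheads, body_nullable = first_of_string(alt)
            if body_nullable:
                lookaheads = lookaheads | FOLLOW[head]
            for terminal in lookaheads:
                table.setdefault((head, terminal), []).append(alt)
    return table  # missing keys are the "error" entries


def is_ll1(table):
    """LL(1) means at most one production in every entry."""
    return all(len(bodies) <= 1 for bodies in table.values())


TABLE = build_table()
print(TABLE[("Z", "d")])  # two productions compete for lookahead d
print(is_ll1(TABLE))      # prints "False"
```

The conflicting entry for (Z, d) is the same situation as getting stuck on the input d: one token of lookahead cannot decide between Z -> d and Z -> X Y Z.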

Speeding up Set Construction
- All these sets (Nullable, First, Follow) can be computed simultaneously.
  - See Tiger algorithm 3.13.
- Order the computation:
  - What's the optimal order in which to compute these sets?

Example: Speeding up Set Construction
Z -> d | X Y Z
Y -> c | ε
X -> Y | a
Nullable = {X, Y}
Round      0     1     2     3
First(Z)   {}
First(Y)   {}
First(X)   {}
Q1: What's a reasonable order here?
Q2: How to set this order?

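Directed Graph Model
Z -> d | X Y Z
Y -> c | ε
X -> Y | a
Nullable = {X, Y}
Q1: What's a reasonable order here?
Q2: How to set this order?
[Figure: directed graph over the nonterminals Z, X and Y, annotated with First(Y) = {c}, First(X) = {c, a}, First(Z) = {d, c, a}.]
Order: Y X Z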

Reverse Topological Sort
- Quasi-topologically sort the directed graph.
  - Quasi: a true topological sort of a general directed graph (one with cycles) is impossible.
  - Also known as reverse depth-first ordering.
- Reverse: information (First) flows from successors to predecessors.
- Refer to your favorite algorithm book.
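One plausible reading of this ordering is a depth-first search that emits each node after its successors, ignoring edges that lead back to nodes already visited. The Python sketch below assumes an edge A -> B whenever First(A) may depend on First(B); the DEPENDS_ON graph was derived by hand from the X, Y, Z grammar, and the function name is illustrative:

```python
# Z -> X Y Z contributes edges to X, Y and Z itself; X -> Y gives X -> Y.
DEPENDS_ON = {
    "Z": ["X", "Y", "Z"],
    "X": ["Y"],
    "Y": [],
}


def successors_first_order(graph, roots):
    """Depth-first search that lists every node after its successors.
    Back edges (cycles) are skipped, hence only a quasi-topological order."""
    visited = set()
    order = []

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for successor in graph[node]:
            visit(successor)
        order.append(node)

    for root in roots:
        visit(root)
    return order


print(successors_first_order(DEPENDS_ON, ["Z"]))  # prints "['Y', 'X', 'Z']"
```

Processing the nonterminals in this order lets each First set be computed after the sets it reads from, so the fixpoint loop needs fewer rounds.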

Problem
- LL(1) can only be used with grammars in which all the production rules for a nonterminal start with different terminals.
- Unfortunately, many grammars don't have this perfect property.

Example
Left grammar:
exp -> num
     | id
     | exp + exp
     | exp * exp
Right grammar:
exp    -> exp + term
        | term
term   -> term * factor
        | factor
factor -> num
        | id
Q: is the right grammar LL(1)? Why or why not?

Solutions
- Left-recursion elimination
- Left-factoring
- Read: Dragon sec 4.3.2, 4.3.3; Tiger sec 3.2
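Eliminating immediate left recursion follows a fixed pattern: A -> A α | β becomes A -> β A' and A' -> α A' | ε. The Python sketch below handles only this immediate case, not indirect left recursion; the function name and the primed naming scheme are assumptions for the illustration:

```python
def eliminate_immediate_left_recursion(head, alternatives):
    """Rewrite head -> head alpha | beta as
    head -> beta head' and head' -> alpha head' | empty."""
    new_head = head + "'"
    # Bodies of the form head alpha; keep only alpha.
    tails = [alt[1:] for alt in alternatives if alt and alt[0] == head]
    # Bodies that do not start with head (the beta alternatives).
    others = [alt for alt in alternatives if not alt or alt[0] != head]
    if not tails:
        return {head: alternatives}  # nothing to do
    return {
        head: [alt + [new_head] for alt in others],
        new_head: [tail + [new_head] for tail in tails] + [[]],
    }


print(eliminate_immediate_left_recursion(
    "exp", [["exp", "+", "term"], ["term"]]))
```

Applied to exp -> exp + term | term, the result is exp -> term exp' and exp' -> + term exp' | ε, where the empty list again stands for ε.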

Example
Grammar with left recursion eliminated:
exp    -> term exp'
exp'   -> + term exp'
        | ε
term   -> factor term'
term'  -> * factor term'
        | ε
factor -> num
        | id
Original grammar:
exp    -> exp + term
        | term
term   -> term * factor
        | factor
factor -> num
        | id
Q: is the right grammar LL(1)? Are those two grammars equivalent?
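The grammar with exp' and term' can be parsed by recursive descent with one token of lookahead. The Python sketch below makes several assumptions: tokens are Python ints for num and the strings "+" and "*", the id alternative of factor is left out, and the parser computes a value instead of building a tree. Each primed function takes the value parsed so far, which keeps + and * left-associative:

```python
def parse_expression(tokens):
    """Parse and evaluate a token list such as [3, "+", 4, "*", 5]."""
    tokens = list(tokens) + ["$"]
    pos = 0

    def peek():
        return tokens[pos]

    def advance():
        nonlocal pos
        pos += 1

    def exp():                # exp -> term exp'
        return exp_rest(term())

    def exp_rest(left):       # exp' -> + term exp' | empty
        if peek() == "+":
            advance()
            return exp_rest(left + term())
        return left

    def term():               # term -> factor term'
        return term_rest(factor())

    def term_rest(left):      # term' -> * factor term' | empty
        if peek() == "*":
            advance()
            return term_rest(left * factor())
        return left

    def factor():             # factor -> num (id omitted in this sketch)
        token = peek()
        if not isinstance(token, int):
            raise SyntaxError(f"want num but got {token!r}")
        advance()
        return token

    value = exp()
    if peek() != "$":
        raise SyntaxError(f"unexpected token {peek()!r}")
    return value


print(parse_expression([3, "+", 4, "*", 5]))  # prints "23"
```

The empty alternatives are taken whenever the lookahead is neither + nor *, which is where the Follow sets of exp' and term' come into play.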

LL(k)
- LL(1) can be further generalized to LL(k):
  - Left-to-right parsing
  - Leftmost derivation
  - k tokens of lookahead
- Q: table size? Other problems with this approach?

Summary
- Context-free grammars are a mathematical tool for specifying language syntax, among other things.
- Writing parsers for general grammars is hard and costly.
  - Hence the restricted classes LL(k) and LR(k).
- LL(1) grammars can be implemented efficiently.
  - Table-driven algorithms (again!)