Compiler Structures 4. Syntax Analysis Objectives


Compiler Structures 242-437, Semester 2, 2018-2019 4. Syntax Analysis Objectives: describe general syntax analysis, grammars, parse trees, FIRST and FOLLOW sets

Overview 1. What is a Syntax Analyzer? 2. What is a Grammar? 3. Parse Trees 4. Types of CFG Parsing 5. Syntax Analysis Sets

In this lecture Front End: Source Program -> Lexical Analyzer -> Syntax Analyzer -> Semantic Analyzer -> Int. Code Generator -> Intermediate Code. Back End: Code Optimizer -> Target Code Generator -> Target Lang. Prog.

1. What is a Syntax Analyzer? The Lexical Analyzer turns the source text if (a == 0) a = b; into the token stream: if ( a == 0 ) a = b ; The Syntax Analyzer builds a parse tree from the tokens: IF( EQ(a, 0), ASSIGN(a, b) )

Syntax Analyses that we do
- Identify the function of each word
- Recognize if a sentence is grammatically correct
Example sentence: I gave Jim the card
I: pronoun, noun phrase (subject)
gave: verb (action)
Jim: proper noun (indirect object)
the card: article + noun, noun phrase (object)
"gave Jim the card" is the verb phrase. The labels are grammar types / categories.

Languages
We use a natural language to communicate:
- its grammar rules are very complex
- the rules don’t cover important things
We use a formal language to define a programming language:
- its grammar rules are fairly simple
- the rules cover almost everything

2. What is a Grammar? A grammar is a notation for defining a language, and is made from 4 parts:
- the terminal symbols
- the syntactic categories (nonterminal symbols), e.g. statement, expression, noun, verb
- the grammar rules (productions), e.g. A => B1 B2 ... Bn
- the starting nonterminal: the top-most syntactic category for this grammar

We define a grammar G as a 4-tuple: G = (T, N, P, S) T = terminal symbols N = nonterminal symbols P = productions/rules S = starting nonterminal

2.1. Example 1 Consider the grammar:
T = {0, 1}
N = {S, R}
P = { S => 0
      S => 0 R
      R => 1 S }
S is the starting nonterminal. The right-hand sides of productions usually use a mix of terminals and nonterminals.

Is “01010” in the language? Start with an S rule:
Rule          String Generated
--            S
S => 0 R      0 R
R => 1 S      0 1 S
S => 0 R      0 1 0 R
R => 1 S      0 1 0 1 S
S => 0        0 1 0 1 0
No more rules can be applied since there are no more nonterminals left in the string. Yes, it is in the language.
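The derivation above can be replayed mechanically. The following is a minimal Python sketch, assuming the grammar is stored as a dictionary and that each step rewrites the leftmost nonterminal; the names PRODUCTIONS and derive are illustrative and do not come from the slides.

```python
# Example 1 grammar: nonterminal -> list of alternative bodies (lists of symbols).
PRODUCTIONS = {
    "S": [["0"], ["0", "R"]],
    "R": [["1", "S"]],
}

def derive(start, steps):
    """Apply each (head, body) rule to the leftmost nonterminal, in order."""
    sentential = [start]
    for head, body in steps:
        # Locate the leftmost nonterminal in the current string.
        pos = next(i for i, sym in enumerate(sentential) if sym in PRODUCTIONS)
        if sentential[pos] != head or body not in PRODUCTIONS[head]:
            raise ValueError(f"cannot apply {head} => {' '.join(body)}")
        sentential[pos:pos + 1] = body
    return sentential

# The five rule applications from the table above.
STEPS = [
    ("S", ["0", "R"]), ("R", ["1", "S"]),
    ("S", ["0", "R"]), ("R", ["1", "S"]),
    ("S", ["0"]),
]

print("".join(derive("S", STEPS)))
```

Applying a rule that is not in the grammar, or whose head is not the leftmost nonterminal, raises ValueError, so the sketch also checks that each step of a proposed derivation is legal.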

Example 2 Consider the grammar: T = {a, b, c, d, z} N = {S, R, U, V} P = { S => R U z | z R => a | b R U => d V U | c V => b | c } S is the starting nonterminal

The notation: X => Y | Z is shorthand for the two rules: X => Y and X => Z. Read ‘|’ as ‘or’.

Is “adbdbcz” in the language?
Rule            String Generated
--              S
S => R U z      R U z
R => a          a U z
U => d V U      a d V U z
V => b          a d b U z
U => d V U      a d b d V U z
V => b          a d b d b U z
U => c          a d b d b c z
Yes! This grammar has choices about how to rewrite the string.

Example 3: Sums e.g. 5 + 6 - 2 The grammar: T = {+, -, 0, 1, ..., 9} N = {L, D} P = { L => L + D | L - D | D D => 0 | 1 | 2 | ... | 9 } L is the starting nonterminal

Example 4: Brackets The grammar: T = { '(', ')' } N = {L} P = { L => '(' L ')' L L => ε } L is the starting nonterminal ε means 'nothing'

2.2. Derivations A sequence of the form: w0 ⇒ w1 ⇒ … ⇒ wn is a derivation of wn from w0 (written w0 ⇒* wn).
Example:
L ⇒ ( L ) L      rule L => ( L ) L
  ⇒ ( ) L        rule L => ε
  ⇒ ( )          rule L => ε
so L ⇒* ( ). The sentence ( ) is derived from L.

A longer derivation:
L ⇒ ( L ) L              rule L => ( L ) L
  ⇒ ( L ) ( L ) L        rule L => ( L ) L
  ⇒ ( L ) ( L )          rule L => ε
  ⇒ (( L ) L ) ( L )     rule L => ( L ) L
  ⇒ (( ) L ) ( L )       rule L => ε
  ⇒ (( ) L ) ( )         rule L => ε
  ⇒ (( )) ( )            rule L => ε
so L ⇒* (( )) ( )

2.3. Kinds of Grammars There are 4 main kinds of grammar, of increasing expressive power: regular (type 3) grammars context-free (type 2) grammars context-sensitive (type 1) grammars unrestricted (type 0) grammars They vary in the kinds of productions they allow.

Regular Grammars Every production is of the form: A => a | a B | ε where A, B are nonterminals and a is a terminal. For example: S => wT, T => xT, T => a. These are sometimes called right linear rules because if a nonterminal appears in the rule body, then it must appear last. Regular grammars are equivalent to regular expressions (REs).

Example
Integer => + UInt | - UInt | 0 Digits | 1 Digits | ... | 9 Digits
UInt => 0 Digits | 1 Digits | ... | 9 Digits
Digits => 0 Digits | 1 Digits | ... | 9 Digits | ε
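Since regular grammars are equivalent to REs, the Integer grammar can be written as a regular expression: an optional sign followed by at least one digit. The sketch below uses Python's re module; the pattern and the name is_integer are one assumed translation of the grammar, not something given in the slides.

```python
import re

# Assumed RE equivalent of the Integer grammar:
# an optional + or - sign, then one or more digits.
INTEGER_RE = re.compile(r"[+-]?[0-9]+")

def is_integer(text):
    """True if the whole string is generated by the Integer grammar."""
    return INTEGER_RE.fullmatch(text) is not None

for sample in ["+42", "-7", "123", "", "+", "4-2"]:
    print(repr(sample), is_integer(sample))
```

The first digit comes from the Integer or UInt rule and the remaining digits from the Digits rule, which is why the pattern requires at least one digit after the optional sign.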

Context-Free Grammars (CFGs) Every production is of the form: A => δ where A is a nonterminal and δ can be any number of nonterminals or terminals. For example: A => a, A => aBcd, B => ae. The Syntax Analyzer uses CFGs.

2.4. REs for Syntax Analysis? Why not use REs to describe the syntax of a programming language? they don’t have enough power Examples: nested blocks, if statements, balanced braces We need the ability to 'count', which can be implemented with CFGs but not REs.

3. Parse Trees A parse tree is a graphical way of showing how productions are used to generate a string. The syntax analyzer creates a parse tree to store information about the program being compiled.

Example The grammar: T = { a, b } N = { S } P = { S => S S | a S b | a b | b a } S is the starting nonterminal

Parse Tree for “aabbba”
The root of the tree is the start symbol S. At each step, expand a nonterminal leaf:
1. Expand the root using S => S S
2. Expand the left S using S => a S b
3. Expand the inner S using S => a b
4. Expand the right S using S => b a
The finished tree:
S
  S
    a
    S
      a
      b
    b
  S
    b
    a
Stop when there are no more nonterminals in leaf positions. Read off the string by reading the leaves left to right.
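Reading the leaves left to right is a simple tree walk. In this Python sketch the tree for “aabbba” is written as nested tuples; that representation and the names TREE and leaves are assumptions made for the example.

```python
# Interior nodes are (symbol, children); terminal leaves are plain strings.
TREE = ("S", [
    ("S", ["a", ("S", ["a", "b"]), "b"]),   # S => a S b, inner S => a b
    ("S", ["b", "a"]),                      # S => b a
])

def leaves(node):
    """Return the terminal leaves of a parse tree, left to right."""
    if isinstance(node, str):
        return [node]
    _symbol, children = node
    result = []
    for child in children:
        result.extend(leaves(child))
    return result

print("".join(leaves(TREE)))
```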

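3.1. Ambiguity Two (or more) parse trees for the same string.
Grammar: E => E + E, E => E – E, E => 0 | … | 9
The string 2 – 3 + 4 has two parse trees:
Tree 1: the root uses E => E + E; its left E expands to 2 – 3 and its right E to 4, i.e. (2 – 3) + 4
Tree 2: the root uses E => E – E; its left E expands to 2 and its right E to 3 + 4, i.e. 2 – (3 + 4)

The two derivations:
E ⇒ E + E ⇒ E – E + E ⇒* 2 – 3 + 4
E ⇒ E – E ⇒ 2 – E ⇒* 2 – 3 + 4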
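Ambiguity matters because the two parse trees for 2 – 3 + 4 give different answers when evaluated. A small Python sketch, assuming each tree is written as a nested (operator, left, right) tuple; the names evaluate, TREE_PLUS_ROOT and TREE_MINUS_ROOT are illustrative only.

```python
def evaluate(node):
    """Evaluate a tree whose leaves are ints and whose nodes are (op, left, right)."""
    if isinstance(node, int):
        return node
    op, left, right = node
    lhs, rhs = evaluate(left), evaluate(right)
    return lhs + rhs if op == "+" else lhs - rhs

TREE_PLUS_ROOT = ("+", ("-", 2, 3), 4)    # root E => E + E, i.e. (2 - 3) + 4
TREE_MINUS_ROOT = ("-", 2, ("+", 3, 4))   # root E => E - E, i.e. 2 - (3 + 4)

print(evaluate(TREE_PLUS_ROOT))    # 3
print(evaluate(TREE_MINUS_ROOT))   # -5
```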

Fixing Ambiguity An ambiguous grammar can sometimes be made unambiguous: E => E + T | E – T | T T => 0 | … | 9 We'll look at some techniques in chapter 5.
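With the grammar E => E + T | E – T | T there is only one tree for each sum, and it nests to the left. The Python sketch below builds that tree; it assumes single-digit operands and the ASCII characters + and -, and it replaces the left-recursive rule with a loop, which is a common hand-coding shortcut rather than a technique taken from these slides. The name parse_sum is made up.

```python
def parse_sum(text):
    """Parse single digits joined by + or - into a left-nested (op, left, right) tree."""
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def parse_T():                      # T => 0 | ... | 9
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] not in "0123456789":
            raise SyntaxError(f"digit expected at token {pos}")
        value = int(tokens[pos])
        pos += 1
        return value

    tree = parse_T()                    # E => T
    while pos < len(tokens) and tokens[pos] in "+-":
        op = tokens[pos]                # E => E + T | E - T, applied repeatedly
        pos += 1
        tree = (op, tree, parse_T())
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree

print(parse_sum("2 - 3 + 4"))   # ('+', ('-', 2, 3), 4)
```

Each pass through the loop wraps the tree built so far as the left operand, so 2 - 3 + 4 is always grouped as (2 - 3) + 4.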

4. Types of CFG Parsing
Top-down (chapter 5):
- recursive descent (predictive) parsing
- LL methods
Bottom-up (chapter 6):
- operator precedence parsing
- LR methods: SLR, canonical LR, LALR

4.1. A Statement Block Grammar The grammar: T = {begin, end, simplestmt, ;} N = {B, SS, S} P = { B => begin SS end SS => S ; SS | ε S => simplestmt | begin SS end } B is the starting nonterminal

Parse Tree for the input: begin simplestmt ; simplestmt ; end
Productions: B => begin SS end, SS => S ; SS, SS => ε, S => simplestmt, S => begin SS end
B
  begin
  SS
    S
      simplestmt
    ;
    SS
      S
        simplestmt
      ;
      SS
        ε
  end

4.2. Top Down (LL) Parsing
Input: begin simplestmt ; simplestmt ; end
Productions: B => begin SS end, SS => S ; SS, SS => ε, S => simplestmt, S => begin SS end
The tree is built from the start symbol B downwards, expanding nonterminals from left to right. The nodes are created in this order:
1. B, using B => begin SS end
2. SS, using SS => S ; SS
3. S, using S => simplestmt (the first simplestmt)
4. SS, using SS => S ; SS
5. S, using S => simplestmt (the second simplestmt)
6. SS, using SS => ε
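The top-down order can be reproduced with a recursive descent parser that has one function per nonterminal. This is a minimal Python sketch, assuming the input has already been split into a list of token strings; parse_block and its helper names are hypothetical.

```python
def parse_block(tokens):
    """Return (parse tree, order in which the nonterminals were expanded)."""
    pos = 0
    order = []

    def peek():
        return tokens[pos] if pos < len(tokens) else "$"

    def expect(token):
        nonlocal pos
        if peek() != token:
            raise SyntaxError(f"expected {token!r}, found {peek()!r}")
        pos += 1

    def parse_B():                                  # B => begin SS end
        order.append("B")
        expect("begin")
        body = parse_SS()
        expect("end")
        return ("B", ["begin", body, "end"])

    def parse_SS():
        order.append("SS")
        if peek() in ("simplestmt", "begin"):       # SS => S ; SS
            stmt = parse_S()
            expect(";")
            return ("SS", [stmt, ";", parse_SS()])
        return ("SS", ["ε"])                        # SS => ε

    def parse_S():
        order.append("S")
        if peek() == "simplestmt":                  # S => simplestmt
            expect("simplestmt")
            return ("S", ["simplestmt"])
        expect("begin")                             # S => begin SS end
        body = parse_SS()
        expect("end")
        return ("S", ["begin", body, "end"])

    tree = parse_B()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected {peek()!r} after the block")
    return tree, order

tree, order = parse_block("begin simplestmt ; simplestmt ; end".split())
print(order)   # ['B', 'SS', 'S', 'SS', 'S', 'SS']
```

parse_SS decides between its two rules by looking at the next token only: simplestmt or begin can start an S, anything else means SS => ε.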

4.3. Bottom-up (LR) Parsing
Input: begin simplestmt ; simplestmt ; end
Productions: B => begin SS end, SS => S ; SS, SS => ε, S => simplestmt, S => begin SS end
The tree is built from the leaves upwards, finishing at the start symbol B. The nodes are created in this order:
1. S, using S => simplestmt (the first simplestmt)
2. S, using S => simplestmt (the second simplestmt)
3. SS, using SS => ε
4. SS, using SS => S ; SS (the second statement)
5. SS, using SS => S ; SS (the first statement)
6. B, using B => begin SS end
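The bottom-up order can be imitated with a stack that shifts tokens and reduces whenever the top of the stack matches a rule body. The Python sketch below is only an illustration: the reduce decisions are hand-coded for this one grammar with a single token of lookahead, standing in for the LR parse table that a real LR parser would use. The name shift_reduce is made up.

```python
def shift_reduce(tokens):
    """Return the list of reductions made while parsing a statement block."""
    stack, reductions = [], []
    rest = list(tokens) + ["$"]          # "$" marks the end of the input
    while True:
        lookahead = rest[0]
        if stack[-1:] == ["simplestmt"]:
            stack[-1:] = ["S"]
            reductions.append("S => simplestmt")
        elif stack[-3:] == ["S", ";", "SS"]:
            stack[-3:] = ["SS"]
            reductions.append("SS => S ; SS")
        elif stack[-3:] == ["begin", "SS", "end"]:
            # The outermost block becomes B; a nested block is a statement S.
            if len(stack) == 3 and lookahead == "$":
                stack[-3:] = ["B"]
                reductions.append("B => begin SS end")
            else:
                stack[-3:] = ["S"]
                reductions.append("S => begin SS end")
        elif lookahead == "end" and stack[-1:] in (["begin"], [";"]):
            stack.append("SS")           # the statement list ends here
            reductions.append("SS => ε")
        elif lookahead != "$":
            stack.append(rest.pop(0))    # shift
        else:
            break
    if stack != ["B"]:
        raise SyntaxError(f"could not reduce input to B, stack is {stack}")
    return reductions

for step in shift_reduce("begin simplestmt ; simplestmt ; end".split()):
    print(step)
```

For the slide input the reductions come out as S, S, SS (from ε), SS, SS, B, which is the numbering 1 to 6 in the bottom-up tree.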

5. Syntax Analysis Sets Syntax analyzers for top-down (LL) and bottom-up (LR) parsing utilize two types of sets: FIRST sets FOLLOW sets These sets are generated from the programming language CFG.

5.1. The FIRST Sets FIRST( <non-terminal> ) = set of all terminals that start productions for that non-terminal. Example: S => ping, S => begin S end, so FIRST(S) = { ping, begin }

More Mathematically A is a non-terminal. FIRST(A) = { c | A =>* c w , c is a terminal } ∪ { ε } if A =>* ε. w is the rest of the terminals and nonterminals after 'c'.

Building FIRST Sets For each non-terminal A, FIRST(A) = FIRST_SEQ(α) ∪ FIRST_SEQ(β) ∪ ... for all productions A => α, A => β, ... α, β are the bodies of the productions.

FIRST_SEQ()
FIRST_SEQ(ε) = { ε }
FIRST_SEQ(c w) = { c }, if c is a terminal
FIRST_SEQ(A w) = FIRST(A), if ε ∉ FIRST(A)
FIRST_SEQ(A w) = (FIRST(A) – {ε}) ∪ FIRST_SEQ(w), if ε ∈ FIRST(A)
w is a sequence of terminals and non-terminals, and possibly empty.
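The FIRST_SEQ rules translate almost directly into code. The sketch below is one possible Python rendering, not the course's own implementation: it assumes a grammar is a dictionary from each nonterminal to a list of bodies, writes an ε body as an empty list, and repeats the calculation until no FIRST set grows, so the order in which nonterminals are visited does not matter. The names EPS, first_seq and first_sets are illustrative.

```python
EPS = "ε"

def first_seq(symbols, first):
    """FIRST_SEQ of a sequence of symbols, using the FIRST sets found so far."""
    result = set()
    for sym in symbols:
        if sym not in first:               # terminal: FIRST_SEQ(c w) = { c }
            result.add(sym)
            return result
        result |= first[sym] - {EPS}       # nonterminal: take FIRST(A) without ε
        if EPS not in first[sym]:
            return result                  # A cannot vanish, so w is never reached
    result.add(EPS)                        # every symbol can vanish, or the body is empty
    return result

def first_sets(grammar):
    """Compute FIRST for every nonterminal by iterating until nothing changes."""
    first = {nonterminal: set() for nonterminal in grammar}
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                new = first_seq(body, first)
                if not new <= first[head]:
                    first[head] |= new
                    changed = True
    return first

# The grammar from section 5.1: S => ping | begin S end
GRAMMAR = {"S": [["ping"], ["begin", "S", "end"]]}
print(sorted(first_sets(GRAMMAR)["S"]))   # ['begin', 'ping']
```

Iterating to a fixed point means the sketch also copes with left-recursive rules such as S => S T S, where a direct recursive calculation of FIRST(S) would call itself forever.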

FIRST() Example 1
S => a S e
S => B
B => b B e
B => C
C => c C e
C => d
(e is a terminal in this grammar.)
Start with FIRST(C) since its rules only start with terminals: FIRST(C) = {c,d}
Do FIRST(B) now that we know FIRST(C): FIRST(B) = {b,c,d}
Do FIRST(S) now that we know FIRST(B): FIRST(S) = {a,b,c,d}

FIRST() Example 2
P => i | c | n T S
Q => P | a S | b S c S T
R => b | ε
S => c | R n | ε
T => R S q
Start with P and R since their rules only start with terminals or ε: FIRST(P) = {i,c,n}, FIRST(R) = {b,ε}
Do FIRST(Q) now that we know FIRST(P): FIRST(Q) = {i,c,n,a,b}
Do FIRST(S) now that we know FIRST(R): FIRST(S) = {c,b,n,ε}
Note: S ⇒ R n ⇒ n because R ⇒* ε
Do FIRST(T) now that we know FIRST(R) and FIRST(S): FIRST(T) = {b,c,n,q}
Note: T ⇒ R S q ⇒ S q ⇒ q because both R and S ⇒* ε

FIRST() Example 3
S => a S e | S T S
T => R S e | Q
R => r S r | ε
Q => S T | ε
FIRST(S) = {a}
FIRST(T) = {r, a, ε}
FIRST(R) = {r, ε}
FIRST(Q) = {a, ε}
Order: 1) R, S 2) Q 3) T

5.2. The FOLLOW Sets FOLLOW( <non-terminal> ) = set of all the terminals that follow <non-terminal> in productions the set includes $ if nothing follows <non-terminal>

Example: S => bing A bong | ping A pong | zing A, A => ha. FOLLOW(A) = { bong, pong, $ }

More Mathematically A is a non-terminal. FOLLOW(A) = { c in terminals | S =>+ . . . A c . . . } ∪ { $ } if S =>+ . . . A. Each '. . .' is a sequence of terminals and non-terminals; =>+ is any number of => expansions.

Building FOLLOW() Sets To make the FOLLOW(A) set, apply rules 1-4:
1. for all productions (B => . . . A β) add FIRST_SEQ(β) – {ε}
2. for all (B => . . . A β) and ε ∈ FIRST_SEQ(β) add FOLLOW(B)
3. for all (B => . . . A) add FOLLOW(B)
4. if A is the start symbol then add { $ }
β is a sequence of terminals and non-terminals.
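Rules 1-4 can be applied repeatedly until no FOLLOW set grows. The Python sketch below is one way to do that, under the same assumptions as before: a grammar is a dictionary from nonterminals to lists of bodies, and an ε body is an empty list. All function names here (first_seq, grow_until_stable, first_sets, follow_sets) are made up for the example.

```python
EPS, END = "ε", "$"

def first_seq(symbols, first):
    """FIRST_SEQ of a symbol sequence; an empty sequence gives { ε }."""
    result = set()
    for sym in symbols:
        if sym not in first:                   # terminal
            return result | {sym}
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return result
    return result | {EPS}

def grow_until_stable(sets, step):
    """Call step() repeatedly until no set in `sets` gains an element."""
    while True:
        before = sum(len(s) for s in sets.values())
        step()
        if sum(len(s) for s in sets.values()) == before:
            return sets

def first_sets(grammar):
    first = {nonterminal: set() for nonterminal in grammar}
    def step():
        for head, bodies in grammar.items():
            for body in bodies:
                first[head] |= first_seq(body, first)
    return grow_until_stable(first, step)

def follow_sets(grammar, start):
    first = first_sets(grammar)
    follow = {nonterminal: set() for nonterminal in grammar}
    follow[start].add(END)                                 # rule 4
    def step():
        for head, bodies in grammar.items():               # production B => ... A β
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in grammar:
                        continue                           # only nonterminals have FOLLOW
                    beta = first_seq(body[i + 1:], first)
                    follow[sym] |= beta - {EPS}            # rule 1
                    if EPS in beta:                        # rules 2 and 3
                        follow[sym] |= follow[head]
    return grow_until_stable(follow, step)

# S => bing A bong | ping A pong | zing A,  A => ha
GRAMMAR = {
    "S": [["bing", "A", "bong"], ["ping", "A", "pong"], ["zing", "A"]],
    "A": [["ha"]],
}
print(sorted(follow_sets(GRAMMAR, "S")["A"]))
```

Rule 3 needs no separate case here: when A is the last symbol of a body, β is empty, FIRST_SEQ(β) is { ε }, and the rule 2 branch adds FOLLOW(B).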

Small Examples What is in FOLLOW(A) for the productions: B => A C, C => s? FOLLOW(A) gets FIRST_SEQ(C) == FIRST(C) == { s }; this uses rule 1.

What is in FOLLOW(A) for the productions: C => B r, B => t A? FOLLOW(A) gets FOLLOW(B) == { r }; this uses rule 3.

FOLLOW() Example 1
S => a S e | B
B => b B C f | C
C => c C g | d | ε
S is the start symbol; e is a terminal.
FIRST(C) = {c,d,ε}
FIRST(B) = {b,c,d,ε}
FIRST(S) = {a,b,c,d,ε}
FOLLOW(S) = {$, e}
FOLLOW(B) = (FIRST_SEQ(C f) – {ε}) ∪ FOLLOW(S) = {c, d, f, $, e}
FOLLOW(C) = {f, g} ∪ FOLLOW(B) = {f, g, c, d, $, e}

FOLLOW() Example 2
S => ( A ) | ε
A => T E
E => & T E | ε
T => ( A ) | a | b | c
FIRST(T) = {(, a, b, c}
FIRST(E) = {&, ε}
FIRST(A) = {(, a, b, c}
FIRST(S) = {(, ε}
FOLLOW(S) = { $ }
FOLLOW(A) = { ) }
FOLLOW(E) = FOLLOW(A) ∪ FOLLOW(E) = { ) }
FOLLOW(T) = (FIRST_SEQ(E) – {ε}) ∪ FOLLOW(A) ∪ FOLLOW(E) = {&, )}

FOLLOW() Example 3
S => T E1
E1 => + T E1 | ε
T => F T1
T1 => * F T1 | ε
F => ( S ) | id
FIRST(F) = FIRST(T) = FIRST(S) = {(, id}
FIRST(T1) = {*, ε}
FIRST(E1) = {+, ε}
FOLLOW(S) = {$, )}
FOLLOW(E1) = FOLLOW(S) ∪ FOLLOW(E1) = {$, )}
FOLLOW(T) = (FIRST(E1) – {ε}) ∪ FOLLOW(S) ∪ FOLLOW(E1) = {+, $, )}
FOLLOW(T1) = FOLLOW(T) = {+, $, )}
FOLLOW(F) = (FIRST(T1) – {ε}) ∪ FOLLOW(T) ∪ FOLLOW(T1) = {*, +, $, )}

FOLLOW() Example 4
S => A B C | A D
A => a | a A
B => b | c | ε
C => D a C
D => b b | c c
FIRST(D) = FIRST(C) = {b,c}
FIRST(B) = {b,c,ε}
FIRST(A) = FIRST(S) = {a}
FOLLOW(S) = {$}
FOLLOW(D) = {a,$}
FOLLOW(A) = {b,c}
FOLLOW(B) = {b,c}
FOLLOW(C) = {$}