Syntax Analysis CSE 340 – Principles of Programming Languages Fall 2015 Adam Doupé Arizona State University


Syntax Analysis
The goal of syntax analysis is to transform the sequence of tokens from the lexer into something useful.
However, we need a way to specify and check whether a sequence of tokens is valid:
– NUM PLUS NUM
– DECIMAL DOT NUM
– ID DOT ID
– DOT DOT DOT NUM ID DOT ID

Using Regular Expressions
PROGRAM = STATEMENT*
STATEMENT = EXPRESSION | IF_STMT | WHILE_STMT | …
OP = + | - | * | /
EXPRESSION = (NUM | ID | DECIMAL) OP (NUM | ID | DECIMAL)
foo - bar

Using Regular Expressions
Regular expressions are not sufficient to capture all programming constructs.
– We will not go into the details in this class, but the reason is that regular languages (the set of all languages that can be described by regular expressions) cannot express some properties that we care about.
How would we write a regular expression R that matches balanced parentheses?
– L(R) = { ε, (), (()), ((())), … }
– Regular expressions (as we have defined them in this class) have no concept of counting (to ensure the parentheses are balanced), therefore it is impossible to create R.

Context-Free Grammars
Syntax for context-free grammars:
– Each row is called a production
    A non-terminal on the left
    Right arrow
    Non-terminals and terminals on the right
– Non-terminals will start with an upper-case letter in our examples; terminals will be lowercase and are tokens
– S will typically be the starting non-terminal
Example for matching parentheses:
S → ε
S → ( S )
We can also write this more succinctly by combining production rules with the same starting non-terminal:
S → ( S ) | ε

CFG Example
S → ( S ) | ε
Derivations of the CFG:
S ⇒ ε
S ⇒ ( S ) ⇒ ( ε ) ⇒ ()
S ⇒ ( S ) ⇒ ( ( S ) ) ⇒ ( ( ε ) ) ⇒ (())
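The derivations above can be reproduced mechanically. The following is a minimal Python sketch, not part of the original slides; the function name `derive` and the string representation of sentential forms are assumptions made for illustration. It applies S → ( S ) a chosen number of times and then finishes with S → ε.

```python
def derive(depth):
    """Return the sentential forms of a derivation with `depth` nested pairs.

    Applies S -> ( S ) `depth` times, then S -> epsilon once.
    """
    current = "S"
    forms = [current]
    for _ in range(depth):
        current = current.replace("S", "( S )")
        forms.append(current)
    # Final step: erase the remaining S (the epsilon production) and
    # drop the spaces so the result reads as a plain string of tokens.
    forms.append("".join(current.replace("S", "").split()))
    return forms


for n in range(3):
    # The empty string is shown as "epsilon" so that it stays visible.
    print(" => ".join(form if form else "epsilon" for form in derive(n)))
```

Each call yields one derivation; increasing `depth` enumerates the language { ε, (), (()), … } one string at a time.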

CFG Example
Exp → Exp + Exp
Exp → Exp * Exp
Exp → NUM
Exp ⇒ Exp * Exp ⇒ Exp * 3 ⇒ Exp + Exp * 3 ⇒ Exp + 2 * 3 ⇒ 1 + 2 * 3

Leftmost Derivation
Always expand the leftmost nonterminal.
Exp → Exp + Exp
Exp → Exp * Exp
Exp → NUM
Is this a leftmost derivation?
Exp ⇒ Exp * Exp ⇒ Exp * 3 ⇒ Exp + Exp * 3 ⇒ Exp + 2 * 3 ⇒ 1 + 2 * 3
No: its second step expands the rightmost Exp. A leftmost derivation of the same string is:
Exp ⇒ Exp * Exp ⇒ Exp + Exp * Exp ⇒ 1 + Exp * Exp ⇒ 1 + 2 * Exp ⇒ 1 + 2 * 3

Rightmost Derivation
Always expand the rightmost nonterminal.
Exp → Exp + Exp
Exp → Exp * Exp
Exp → NUM
Exp ⇒ Exp * Exp ⇒ Exp * 3 ⇒ Exp + Exp * 3 ⇒ Exp + 2 * 3 ⇒ 1 + 2 * 3
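The difference between leftmost and rightmost derivations is only which nonterminal gets replaced at each step. Below is a small Python sketch of that idea, under a few assumptions: sentential forms are lists of symbols, `Exp` is the only nonterminal, and the literals 1, 2, 3 stand in for NUM tokens. The `derive` helper is hypothetical and does not come from the slides.

```python
NONTERMINAL = "Exp"


def derive(choices, leftmost=True):
    """Replace the leftmost (or rightmost) Exp with each right-hand side in turn.

    Returns every sentential form, starting from the single symbol Exp.
    """
    form = [NONTERMINAL]
    steps = [" ".join(form)]
    for rhs in choices:
        positions = [i for i, sym in enumerate(form) if sym == NONTERMINAL]
        i = positions[0] if leftmost else positions[-1]
        form[i:i + 1] = rhs.split()      # substitute the chosen production
        steps.append(" ".join(form))
    return steps


left = derive(["Exp * Exp", "Exp + Exp", "1", "2", "3"], leftmost=True)
right = derive(["Exp * Exp", "3", "Exp + Exp", "2", "1"], leftmost=False)
print(" => ".join(left))
print(" => ".join(right))
```

Both runs end in `1 + 2 * 3`; only the order in which the nonterminals are expanded differs.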

Parse Tree
We can also represent derivations using a parse tree.
– May sound familiar
[Figure: compiler pipeline: Source (bytes) → Lexer → Tokens → Parser → Parse Tree]

Parse Tree
Exp ⇒ Exp * Exp ⇒ Exp * 3 ⇒ Exp + Exp * 3 ⇒ Exp + 2 * 3 ⇒ 1 + 2 * 3
[Figure: parse tree for this derivation; the root Exp expands to Exp * Exp, and its left child expands to Exp + Exp]

Parsing
Derivations and parse trees show how to generate strings that are in the language described by the grammar.
However, we need to go the other way and turn a sequence of tokens into a parse tree.
Parsing is the process of determining the derivation or parse tree from a sequence of tokens.
Two major parsing problems:
– Ambiguous grammars
– Efficient parsing

Ambiguous Grammars
Exp → Exp + Exp
Exp → Exp * Exp
Exp → NUM
How to parse 1 + 2 * 3?
Exp ⇒ Exp * Exp ⇒ Exp + Exp * Exp ⇒ 1 + Exp * Exp ⇒ 1 + 2 * Exp ⇒ 1 + 2 * 3
Exp ⇒ Exp + Exp ⇒ 1 + Exp ⇒ 1 + Exp * Exp ⇒ 1 + 2 * Exp ⇒ 1 + 2 * 3

Ambiguous Grammars
1 + 2 * 3
[Figure: the two parse trees for 1 + 2 * 3, one whose root expands to Exp * Exp and one whose root expands to Exp + Exp]

Ambiguous Grammars
A grammar is ambiguous if there exist two different leftmost derivations, or two different rightmost derivations, or two different parse trees for some string in the language of the grammar.
Is English ambiguous?
– I saw a man on a hill with a telescope.
Ambiguity is not desirable in a programming language.
– Unlike in English, we don't want the compiler to read your mind and try to infer what you meant
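One way to see the ambiguity concretely is to count the parse trees a grammar admits for a given token sequence. The sketch below is an illustration in Python rather than anything from the slides; `count_parse_trees` is a hypothetical helper and it assumes NUM tokens are written as digit strings. It counts the trees for Exp → Exp + Exp | Exp * Exp | NUM by trying every operator as the root of a span.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees of `tokens` under Exp -> Exp + Exp | Exp * Exp | NUM."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of ways tokens[lo:hi] can be derived from Exp.
        if hi - lo == 1:
            return 1 if tokens[lo].isdigit() else 0      # Exp -> NUM
        total = 0
        for k in range(lo + 1, hi - 1):
            if tokens[k] in ("+", "*"):                  # Exp -> Exp op Exp
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(tokens))


print(count_parse_trees("1 + 2 * 3".split()))  # prints 2
```

Any count above 1 means the string has more than one parse tree, which is exactly the definition of ambiguity given above.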

Parsing Approaches
There are various ways to turn strings into a parse tree:
– Bottom-up parsing, where you start from the terminals and work your way up
– Top-down parsing, where you start from the starting non-terminal and work your way down
In this class, we will focus exclusively on top-down parsing.

Top-Down Parsing
S → A | B | C
A → a
B → Bb | b
C → Cc | ε

parse_S()
{
    t_type = getToken()
    if (t_type == a) {
        ungetToken()
        parse_A()
        check_eof()
    }
    else if (t_type == b) {
        ungetToken()
        parse_B()
        check_eof()
    }
    else if (t_type == c) {
        ungetToken()
        parse_C()
        check_eof()
    }
    else if (t_type == eof) {
        // do EOF stuff
    }
    else {
        syntax_error()
    }
}

Predictive Recursive Descent Parsers
Predictive recursive descent parsers are efficient top-down parsers.
– Efficient because they only look at the next token, with no backtracking or guessing
To determine whether a grammar allows a predictive recursive descent parser, we need to define the following functions:
FIRST(α), where α is a sequence of grammar symbols (non-terminals, terminals, and ε)
– FIRST(α) returns the set of terminals and ε that begin strings derived from α
FOLLOW(A), where A is a non-terminal
– FOLLOW(A) returns the set of terminals and $ (end of file) that can appear immediately after the non-terminal A

FIRST() Example
S → A | B | C
A → a
B → Bb | b
C → Cc | ε

FIRST(S) = { a, b, c, ε }
FIRST(A) = { a }
FIRST(B) = { b }
FIRST(C) = { ε, c }

Calculating FIRST(α)
First, start out with empty FIRST() sets for all non-terminals in the grammar.
Then, apply the following rules until the FIRST() sets do not change:
1. FIRST(x) = { x } if x is a terminal
2. FIRST(ε) = { ε }
3. If A → Bα is a production rule, then add FIRST(B) – { ε } to FIRST(A)
4. If A → B0 B1 B2 … Bi Bi+1 … Bk and ε ∈ FIRST(B0) and ε ∈ FIRST(B1) and ε ∈ FIRST(B2) and … and ε ∈ FIRST(Bi), then add FIRST(Bi+1) – { ε } to FIRST(A)
5. If A → B0 B1 B2 … Bk and ε ∈ FIRST(B0) and ε ∈ FIRST(B1) and ε ∈ FIRST(B2) and … and ε ∈ FIRST(Bk), then add ε to FIRST(A)

Calculating FIRST Sets
S → ABCD
A → CD | aA
B → b
C → cC | ε
D → dD | ε

          INITIAL   updates (left to right)
FIRST(S)  {}        { a }       { a, c, d, b }
FIRST(A)  {}        { a }       { a, c, d, ε }
FIRST(B)  {}        { b }
FIRST(C)  {}        { c, ε }
FIRST(D)  {}        { d, ε }
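The fixed-point calculation of FIRST sets for the grammar S → ABCD, A → CD | aA, B → b, C → cC | ε, D → dD | ε can be sketched in Python as follows. This is an illustrative sketch rather than the course's reference implementation: the dictionary encoding of the grammar, the `EPS` marker, and the function names are all assumptions. An empty right-hand side `[]` stands for the ε-production.

```python
EPS = "ε"

# Each nonterminal maps to a list of right-hand sides (lists of symbols).
GRAMMAR = {
    "S": [["A", "B", "C", "D"]],
    "A": [["C", "D"], ["a", "A"]],
    "B": [["b"]],
    "C": [["c", "C"], []],      # [] is the ε-production
    "D": [["d", "D"], []],
}


def first_of_sequence(symbols, first):
    """FIRST of a sequence of symbols, given the current FIRST sets."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in first else {sym}   # rule 1: terminal
        result |= sym_first - {EPS}                          # rules 3 and 4
        if EPS not in sym_first:
            return result
    result.add(EPS)                                          # rules 2 and 5
    return result


def compute_first(grammar):
    first = {nt: set() for nt in grammar}    # start with empty sets
    changed = True
    while changed:                           # repeat until no set grows
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                new = first_of_sequence(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


FIRST = compute_first(GRAMMAR)
print(sorted(FIRST["S"]))  # ['a', 'b', 'c', 'd']
```

The loop stops once a full pass over the productions adds nothing, which matches "apply the rules until the FIRST() sets do not change".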

FOLLOW() Example
FOLLOW(A), where A is a non-terminal, returns the set of terminals and $ (end of file) that can appear immediately after the non-terminal A.
S → A | B | C
A → a
B → Bb | b
C → Cc | ε

FOLLOW(S) = { $ }
FOLLOW(A) = { $ }
FOLLOW(B) = { b, $ }
FOLLOW(C) = { c, $ }

Calculating FOLLOW(A)
First, calculate the FIRST sets. Then, initialize empty FOLLOW sets for all non-terminals in the grammar.
Finally, apply the following rules until the FOLLOW sets do not change:
1. If S is the starting symbol of the grammar, then add $ to FOLLOW(S)
2. If B → αA, then add FOLLOW(B) to FOLLOW(A)
3. If B → αA C0 C1 C2 … Ck and ε ∈ FIRST(C0) and ε ∈ FIRST(C1) and ε ∈ FIRST(C2) and … and ε ∈ FIRST(Ck), then add FOLLOW(B) to FOLLOW(A)
4. If B → αA C0 C1 C2 … Ck, then add FIRST(C0) – { ε } to FOLLOW(A)
5. If B → αA C0 C1 C2 … Ci Ci+1 … Ck and ε ∈ FIRST(C0) and ε ∈ FIRST(C1) and ε ∈ FIRST(C2) and … and ε ∈ FIRST(Ci), then add FIRST(Ci+1) – { ε } to FOLLOW(A)

Calculating FOLLOW Sets
S → ABCD
A → CD | aA
B → b
C → cC | ε
D → dD | ε

FIRST(S) = { a, c, d, b }
FIRST(A) = { a, c, d, ε }
FIRST(B) = { b }
FIRST(C) = { c, ε }
FIRST(D) = { d, ε }

           INITIAL   FINAL
FOLLOW(S)  {}        { $ }
FOLLOW(A)  {}        { b }
FOLLOW(B)  {}        { $, c, d }
FOLLOW(C)  {}        { $, d, b }
FOLLOW(D)  {}        { $, b }
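The FOLLOW rules can be applied mechanically in the same fixed-point style. The Python sketch below is an illustration under stated assumptions: the grammar encoding, the `EPS` and `EOF` markers, and the function names are hypothetical, and the FIRST sets are copied from the slide rather than recomputed. It works through the grammar S → ABCD, A → CD | aA, B → b, C → cC | ε, D → dD | ε.

```python
EPS, EOF = "ε", "$"

GRAMMAR = {
    "S": [["A", "B", "C", "D"]],
    "A": [["C", "D"], ["a", "A"]],
    "B": [["b"]],
    "C": [["c", "C"], []],      # [] is the ε-production
    "D": [["d", "D"], []],
}

# FIRST sets as listed on the slide.
FIRST = {
    "S": {"a", "b", "c", "d"},
    "A": {"a", "c", "d", EPS},
    "B": {"b"},
    "C": {"c", EPS},
    "D": {"d", EPS},
}


def first_of_sequence(symbols):
    """FIRST of a symbol sequence; symbols without a FIRST entry are terminals."""
    result = set()
    for sym in symbols:
        sym_first = FIRST.get(sym, {sym})
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)             # empty or fully nullable sequence
    return result


def compute_follow(grammar, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add(EOF)                          # rule 1
    changed = True
    while changed:
        changed = False
        for lhs, productions in grammar.items():
            for rhs in productions:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:          # terminals have no FOLLOW set
                        continue
                    rest = first_of_sequence(rhs[i + 1:])
                    new = rest - {EPS}              # rules 4 and 5
                    if EPS in rest:                 # rules 2 and 3
                        new |= follow[lhs]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


FOLLOW = compute_follow(GRAMMAR, "S")
for nt in GRAMMAR:
    print(nt, sorted(FOLLOW[nt]))
```

Treating "everything after A is nullable" as a single check on FIRST of the remaining symbols covers rules 2 and 3 together, and subtracting ε from that same set covers rules 4 and 5.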

Predictive Recursive Descent Parsers
At each parsing step, there is only one grammar rule that can be chosen, and there is no need for backtracking.
The conditions for a predictive parser are both of the following:
– If A → α and A → β, then FIRST(α) ∩ FIRST(β) = ∅
– If ε ∈ FIRST(A), then FIRST(A) ∩ FOLLOW(A) = ∅
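Both conditions can be checked mechanically once the FIRST and FOLLOW sets are known. The following Python sketch is illustrative only; `find_conflicts`, the grammar encoding, and the tuple format of the reported conflicts are assumptions. It uses the FIRST and FOLLOW sets given on the slides for the two example grammars.

```python
EPS = "ε"


def first_of_sequence(symbols, first):
    """FIRST of a symbol sequence; symbols missing from `first` are terminals."""
    result = set()
    for sym in symbols:
        sym_first = first.get(sym, {sym})
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)          # every symbol was nullable, or the sequence is empty
    return result


def find_conflicts(grammar, first, follow):
    """Return the violated conditions; an empty list means a predictive parser exists."""
    conflicts = []
    for nt, productions in grammar.items():
        # Condition 1: alternatives of one nonterminal have disjoint FIRST sets.
        for i in range(len(productions)):
            for j in range(i + 1, len(productions)):
                overlap = (first_of_sequence(productions[i], first)
                           & first_of_sequence(productions[j], first))
                if overlap:
                    conflicts.append((nt, "FIRST/FIRST", overlap))
        # Condition 2: a nullable nonterminal must not begin with what can follow it.
        if EPS in first[nt] and first[nt] & follow[nt]:
            conflicts.append((nt, "FIRST/FOLLOW", first[nt] & follow[nt]))
    return conflicts


# S -> ABCD, A -> CD | aA, B -> b, C -> cC | ε, D -> dD | ε
G1 = {"S": [["A", "B", "C", "D"]], "A": [["C", "D"], ["a", "A"]],
      "B": [["b"]], "C": [["c", "C"], []], "D": [["d", "D"], []]}
FIRST1 = {"S": {"a", "b", "c", "d"}, "A": {"a", "c", "d", EPS},
          "B": {"b"}, "C": {"c", EPS}, "D": {"d", EPS}}
FOLLOW1 = {"S": {"$"}, "A": {"b"}, "B": {"$", "c", "d"},
           "C": {"$", "d", "b"}, "D": {"$", "b"}}

# S -> A | B | C, A -> a, B -> Bb | b, C -> Cc | ε
G2 = {"S": [["A"], ["B"], ["C"]], "A": [["a"]],
      "B": [["B", "b"], ["b"]], "C": [["C", "c"], []]}
FIRST2 = {"S": {"a", "b", "c", EPS}, "A": {"a"}, "B": {"b"}, "C": {"c", EPS}}
FOLLOW2 = {"S": {"$"}, "A": {"$"}, "B": {"b", "$"}, "C": {"c", "$"}}

print(find_conflicts(G1, FIRST1, FOLLOW1))
print(find_conflicts(G2, FIRST2, FOLLOW2))
```

Under these sets the first grammar reports no conflicts, while the left-recursive grammar fails both conditions: the two alternatives of B both start with b, and C is nullable while c is in both FIRST(C) and FOLLOW(C).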

Creating a Predictive Recursive Descent Parser
1. Create a CFG
2. Calculate the FIRST and FOLLOW sets
3. Prove that the CFG allows a predictive recursive descent parser
4. Write the predictive recursive descent parser using the FIRST and FOLLOW sets

Email Addresses
How to parse/validate email addresses?
– …@domain.tld
Turns out, it is not so simple:
– "cse…
– test…
In fact, a company called Mailgun, which provides email services as an API, released an open-source tool to validate email addresses, based on their experience with real-world email.
– How did they implement their parser?
– A recursive descent parser

Email Address CFG
Tokens: quoted-string, atom, dot-atom, whitespace
Address → Name-addr-rfc | Name-addr-lax | Addr-spec
Name-addr-rfc → Display-name-rfc Angle-addr-rfc | Angle-addr-rfc
Display-name-rfc → Word Display-name-rfc-list | whitespace Word Display-name-rfc-list
Display-name-rfc-list → whitespace Word Display-name-rfc-list | ε
Angle-addr-rfc → < Addr-spec > | whitespace < Addr-spec > | whitespace < Addr-spec > whitespace | < Addr-spec > whitespace
Name-addr-lax → Display-name-lax Angle-addr-lax | Angle-addr-lax
Display-name-lax → whitespace Word Display-name-lax-list whitespace | Word Display-name-lax-list whitespace
Display-name-lax-list → whitespace Word Display-name-lax-list | ε
Angle-addr-lax → Addr-spec | Addr-spec whitespace
Addr-spec → Local-part @ Domain | whitespace Local-part @ Domain | whitespace Local-part @ Domain whitespace | Local-part @ Domain whitespace
Local-part → dot-atom | quoted-string
Domain → dot-atom
Word → atom | quoted-string
CFG taken from …

Simplified Email Address CFG
Tokens: quoted-string (q-s), atom, dot-atom (d-a), quoted-string-at (q-s-a), dot-atom-at (d-a-a)
Address → Name-addr | Addr-spec
Name-addr → Display-name Angle-addr | Angle-addr
Display-name → Word Display-name-list
Display-name-list → Word Display-name-list | ε
Angle-addr → < Addr-spec >
Addr-spec → d-a-a Domain | q-s-a Domain
Domain → d-a
Word → atom | q-s

Address → Name-addr | Addr-spec
Name-addr → Display-name Angle-addr | Angle-addr
Display-name → Word Display-name-list
Display-name-list → Word Display-name-list | ε
Angle-addr → < Addr-spec >
Addr-spec → d-a-a Domain | q-s-a Domain
Domain → d-a
Word → atom | q-s

FIRST              INITIAL   updates (left to right)
Address            {}        { d-a-a, q-s-a }    { d-a-a, q-s-a, < }    { d-a-a, q-s-a, <, atom, q-s }
Name-addr          {}        { < }               { <, atom, q-s }
Display-name       {}        { atom, q-s }
Display-name-list  {}        { ε, atom, q-s }
Angle-addr         {}        { < }
Addr-spec          {}        { d-a-a, q-s-a }
Domain             {}        { d-a }
Word               {}        { atom, q-s }

Address → Name-addr | Addr-spec
Name-addr → Display-name Angle-addr | Angle-addr
Display-name → Word Display-name-list
Display-name-list → Word Display-name-list | ε
Angle-addr → < Addr-spec >
Addr-spec → d-a-a Domain | q-s-a Domain
Domain → d-a
Word → atom | q-s

FOLLOW             INITIAL   FINAL
Address            {}        { $ }
Name-addr          {}        { $ }
Display-name       {}        { < }
Display-name-list  {}        { < }
Angle-addr         {}        { $ }
Addr-spec          {}        { $, > }
Domain             {}        { $, > }
Word               {}        { atom, q-s, < }

FIRST(Address) = { d-a-a, q-s-a, <, atom, q-s }
FIRST(Name-addr) = { <, atom, q-s }
FIRST(Display-name) = { atom, q-s }
FIRST(Display-name-list) = { ε, atom, q-s }
FIRST(Angle-addr) = { < }
FIRST(Addr-spec) = { d-a-a, q-s-a }
FIRST(Domain) = { d-a }
FIRST(Word) = { atom, q-s }

Address → Name-addr | Addr-spec
Name-addr → Display-name Angle-addr | Angle-addr
Display-name → Word Display-name-list
Display-name-list → Word Display-name-list | ε
Angle-addr → < Addr-spec >
Addr-spec → d-a-a Domain | q-s-a Domain
Domain → d-a
Word → atom | q-s

Intersections to check (each must be empty for a predictive parser):
FIRST(Name-addr) ∩ FIRST(Addr-spec)
FIRST(Display-name Angle-addr) ∩ FIRST(Angle-addr)
FIRST(Word Display-name-list) ∩ FIRST(ε)
FIRST(d-a-a Domain) ∩ FIRST(q-s-a Domain)
FIRST(atom) ∩ FIRST(q-s)
FIRST(Display-name-list) ∩ FOLLOW(Display-name-list)

FIRST(Address) = { d-a-a, q-s-a, <, atom, q-s }
FIRST(Name-addr) = { <, atom, q-s }
FIRST(Display-name) = { atom, q-s }
FIRST(Display-name-list) = { ε, atom, q-s }
FIRST(Angle-addr) = { < }
FIRST(Addr-spec) = { d-a-a, q-s-a }
FIRST(Domain) = { d-a }
FIRST(Word) = { atom, q-s }

FOLLOW(Address) = { $ }
FOLLOW(Name-addr) = { $ }
FOLLOW(Display-name) = { < }
FOLLOW(Display-name-list) = { < }
FOLLOW(Angle-addr) = { $ }
FOLLOW(Addr-spec) = { $, > }
FOLLOW(Domain) = { $, > }
FOLLOW(Word) = { atom, q-s, < }

parse_Address()
{
    t_type = getToken();
    // Check FIRST(Name-addr)
    if (t_type == < || t_type == atom || t_type == q-s) {
        ungetToken();
        parse_Name-addr();
        printf("Address -> Name-addr");
    }
    // Check FIRST(Addr-spec)
    else if (t_type == d-a-a || t_type == q-s-a) {
        ungetToken();
        parse_Addr-spec();
        printf("Address -> Addr-spec");
    }
    else {
        syntax_error();
    }
}

Address → Name-addr | Addr-spec
FIRST(Address) = { d-a-a, q-s-a, <, atom, q-s }
FIRST(Name-addr) = { <, atom, q-s }
FIRST(Addr-spec) = { d-a-a, q-s-a }
FOLLOW(Address) = { $ }
FOLLOW(Name-addr) = { $ }
FOLLOW(Addr-spec) = { $, > }

parse_Name-addr()
{
    t_type = getToken();
    // Check FIRST(Display-name Angle-addr)
    if (t_type == atom || t_type == q-s) {
        ungetToken();
        parse_Display-name();
        parse_Angle-addr();
        printf("Name-addr -> Display-name Angle-addr");
    }
    // Check FIRST(Angle-addr)
    else if (t_type == <) {
        ungetToken();
        parse_Angle-addr();
        printf("Name-addr -> Angle-addr");
    }
    else {
        syntax_error();
    }
}

Name-addr → Display-name Angle-addr | Angle-addr
FIRST(Name-addr) = { <, atom, q-s }
FIRST(Display-name) = { atom, q-s }
FIRST(Angle-addr) = { < }
FOLLOW(Name-addr) = { $ }
FOLLOW(Display-name) = { < }
FOLLOW(Angle-addr) = { $ }

parse_Display-name()
{
    t_type = getToken();
    // Check FIRST(Word Display-name-list)
    if (t_type == atom || t_type == q-s) {
        ungetToken();
        parse_Word();
        parse_Display-name-list();
        printf("Display-name -> Word Display-name-list");
    }
    else {
        syntax_error();
    }
}

Display-name → Word Display-name-list
FIRST(Display-name) = { atom, q-s }
FIRST(Display-name-list) = { ε, atom, q-s }
FIRST(Word) = { atom, q-s }
FOLLOW(Display-name) = { < }
FOLLOW(Display-name-list) = { < }
FOLLOW(Word) = { atom, q-s, < }

parse_Display-name-list()
{
    t_type = getToken();
    // Check FIRST(Word Display-name-list)
    if (t_type == atom || t_type == q-s) {
        ungetToken();
        parse_Word();
        parse_Display-name-list();
        printf("Display-name-list -> Word Display-name-list");
    }
    // Check FOLLOW(Display-name-list)
    else if (t_type == <) {
        ungetToken();
        printf("Display-name-list -> ε");
    }
    else {
        syntax_error();
    }
}

Display-name-list → Word Display-name-list | ε
FIRST(Display-name-list) = { ε, atom, q-s }
FIRST(Word) = { atom, q-s }
FOLLOW(Display-name-list) = { < }
FOLLOW(Word) = { atom, q-s, < }

parse_Angle-addr()
{
    t_type = getToken();
    // Check FIRST(< Addr-spec >)
    if (t_type == <) {
        // ungetToken()?
        parse_Addr-spec();
        t_type = getToken();
        if (t_type != >) {
            syntax_error();
        }
        printf("Angle-addr -> < Addr-spec >");
    }
    else {
        syntax_error();
    }
}

Angle-addr → < Addr-spec >
FIRST(Angle-addr) = { < }
FIRST(Addr-spec) = { d-a-a, q-s-a }
FOLLOW(Angle-addr) = { $ }
FOLLOW(Addr-spec) = { $, > }

parse_Addr-spec()
{
    t_type = getToken();
    // Check FIRST(d-a-a Domain)
    if (t_type == d-a-a) {
        // ungetToken()?
        parse_Domain();
        printf("Addr-spec -> d-a-a Domain");
    }
    // Check FIRST(q-s-a Domain)
    else if (t_type == q-s-a) {
        parse_Domain();
        printf("Addr-spec -> q-s-a Domain");
    }
    else {
        syntax_error();
    }
}

Addr-spec → d-a-a Domain | q-s-a Domain
FIRST(Addr-spec) = { d-a-a, q-s-a }
FIRST(Domain) = { d-a }
FOLLOW(Addr-spec) = { $, > }
FOLLOW(Domain) = { $, > }

parse_Domain()
{
    t_type = getToken();
    // Check FIRST(d-a)
    if (t_type == d-a) {
        printf("Domain -> d-a");
    }
    else {
        syntax_error();
    }
}

Domain → d-a
FIRST(Domain) = { d-a }
FOLLOW(Domain) = { $, > }

parse_Word()
{
    t_type = getToken();
    // Check FIRST(atom)
    if (t_type == atom) {
        printf("Word -> atom");
    }
    // Check FIRST(q-s)
    else if (t_type == q-s) {
        printf("Word -> q-s");
    }
    else {
        syntax_error();
    }
}

Word → atom | q-s
FIRST(Word) = { atom, q-s }
FOLLOW(Word) = { atom, q-s, < }

Predictive Recursive Descent Parsers
For every non-terminal A in the grammar, create a function called parse_A.
For each production rule A → α (where α is a sequence of terminals and non-terminals), if getToken() ∈ FIRST(α), then choose the production rule A → α.
– For every terminal and non-terminal a in α: if a is a non-terminal, call parse_a; if a is a terminal, check that getToken() == a
– If ε ∈ FIRST(α), then check that getToken() ∈ FOLLOW(A), then choose the production A → ε
If getToken() ∉ FIRST(A), then syntax_error(); unless ε ∈ FIRST(A), in which case getToken() ∉ FOLLOW(A) is a syntax_error().
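The recipe above can be sketched as a runnable parser for the simplified address grammar. This Python version is an illustration, not the code from the slides: the class name `AddressParser`, the `ParseError` exception, and the `parses` helper are assumptions, tokens are represented simply as strings of token types, and a one-token `peek()` replaces the getToken()/ungetToken() pair.

```python
EOF = "$"


class ParseError(Exception):
    pass


class AddressParser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + [EOF]
        self.pos = 0
        self.rules = []      # productions applied, recorded where the slides printf

    def peek(self):
        return self.tokens[self.pos]

    def expect(self, token_type):
        if self.peek() != token_type:
            raise ParseError(f"expected {token_type}, got {self.peek()}")
        self.pos += 1

    def parse_address(self):
        if self.peek() in ("<", "atom", "q-s"):          # FIRST(Name-addr)
            self.parse_name_addr()
            self.rules.append("Address -> Name-addr")
        elif self.peek() in ("d-a-a", "q-s-a"):          # FIRST(Addr-spec)
            self.parse_addr_spec()
            self.rules.append("Address -> Addr-spec")
        else:
            raise ParseError(f"unexpected {self.peek()}")
        self.expect(EOF)                                 # FOLLOW(Address) = { $ }

    def parse_name_addr(self):
        if self.peek() in ("atom", "q-s"):               # FIRST(Display-name Angle-addr)
            self.parse_display_name()
            self.parse_angle_addr()
            self.rules.append("Name-addr -> Display-name Angle-addr")
        else:                                            # must start with <
            self.parse_angle_addr()
            self.rules.append("Name-addr -> Angle-addr")

    def parse_display_name(self):
        self.parse_word()
        self.parse_display_name_list()
        self.rules.append("Display-name -> Word Display-name-list")

    def parse_display_name_list(self):
        if self.peek() in ("atom", "q-s"):               # FIRST(Word Display-name-list)
            self.parse_word()
            self.parse_display_name_list()
            self.rules.append("Display-name-list -> Word Display-name-list")
        elif self.peek() == "<":                         # FOLLOW(Display-name-list)
            self.rules.append("Display-name-list -> epsilon")
        else:
            raise ParseError(f"unexpected {self.peek()}")

    def parse_angle_addr(self):
        self.expect("<")
        self.parse_addr_spec()
        self.expect(">")
        self.rules.append("Angle-addr -> < Addr-spec >")

    def parse_addr_spec(self):
        local = self.peek()
        if local not in ("d-a-a", "q-s-a"):              # FIRST(Addr-spec)
            raise ParseError(f"unexpected {local}")
        self.pos += 1
        self.expect("d-a")                               # Domain -> d-a
        self.rules.append("Domain -> d-a")
        self.rules.append(f"Addr-spec -> {local} Domain")

    def parse_word(self):
        word = self.peek()
        if word not in ("atom", "q-s"):                  # FIRST(Word)
            raise ParseError(f"unexpected {word}")
        self.pos += 1
        self.rules.append(f"Word -> {word}")


def parses(tokens):
    """Return True if the token sequence is a valid Address."""
    try:
        AddressParser(tokens).parse_address()
        return True
    except ParseError:
        return False


parser = AddressParser(["atom", "q-s", "<", "d-a-a", "d-a", ">"])
parser.parse_address()
print("\n".join(parser.rules))
```

Peeking at the next token without consuming it plays the same role as calling getToken() followed by ungetToken(); every decision still depends only on the next token and the FIRST and FOLLOW sets, so no backtracking is needed.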