1 CMPUT 680 - Compiler Design and Optimization1 CMPUT 680 - Winter 2006 Topic 2: Parsing and Lexical Analysis José Nelson Amaral http://www.cs.ualberta.ca/~amaral/courses/680

2 CMPUT 680 - Compiler Design and Optimization2 Reading List Appel, Chapters 2, 3, 4, and 5; Aho, Sethi, Ullman, Chapters 2, 3, 4, and 5.

3 CMPUT 680 - Compiler Design and Optimization3 Some Important Basic Definitions lexical: of or relating to the morphemes of a language. morpheme: a meaningful linguistic unit that cannot be divided into smaller meaningful parts. lexical analysis: the task concerned with breaking an input into its smallest meaningful units, called tokens.

4 CMPUT 680 - Compiler Design and Optimization4 Some Important Basic Definitions syntax: the way in which words are put together to form phrases, clauses, or sentences. The rules governing the formation of statements in a programming language. syntax analysis: the task concerned with fitting a sequence of tokens into a specified syntax. parsing: To break a sentence down into its component parts of speech with an explanation of the form, function, and syntactical relationship of each part.

5 CMPUT 680 - Compiler Design and Optimization5 Some Important Basic Definitions parsing = lexical analysis + syntax analysis semantic analysis: the task concerned with calculating the program’s meaning.

6 CMPUT 680 - Compiler Design and Optimization6 Regular Expressions
Symbol a: a regular expression formed by a.
Alternation M | N: a regular expression formed by M or N.
Concatenation M N: a regular expression formed by M followed by N.
Epsilon ε: the empty string.
Repetition M*: a regular expression formed by zero or more repetitions of M.

7 CMPUT 680 - Compiler Design and Optimization7 Building a Recognizer for a Language General approach: 1. Build a deterministic finite automaton (DFA) from regular expression E 2. Execute the DFA to determine whether an input string belongs to L(E) Note: The DFA construction is done automatically by a tool such as lex.

8 CMPUT 680 - Compiler Design and Optimization8 Finite Automata A nondeterministic finite automaton A = (S, Σ, s0, F, move) consists of: 1. A set of states S. 2. A set of input symbols Σ (the input symbol alphabet). 3. A state s0 distinguished as the start state. 4. A set of states F distinguished as the accepting states. 5. A transition function move that maps state-symbol pairs into sets of states. In a Deterministic Finite Automaton (DFA), the function move maps each state-symbol pair into a unique state.

9 CMPUT 680 - Compiler Design and Optimization9 Finite Automata A Deterministic Finite Automaton (DFA) and a Nondeterministic Finite Automaton (NFA), each with states 0 through 3 and transitions on a and b (state diagrams omitted). What languages are accepted by these automata? b*abb and (a|b)*abb, respectively. (Aho, Sethi, Ullman, p. 114)
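
To make the recognizer idea concrete, here is a minimal C sketch, not from the slides, that hard-codes the transition table of a DFA for (a|b)*abb (the second language above) and runs it over an input string; the table layout and the function names are assumptions of this example.

    #include <stdio.h>

    /* States 0..3; state 3 is accepting.  Column 0 is 'a', column 1 is 'b'. */
    static const int move_table[4][2] = {
        {1, 0},   /* state 0 */
        {1, 2},   /* state 1 */
        {1, 3},   /* state 2 */
        {1, 0}    /* state 3 */
    };

    /* Returns 1 if the whole string s belongs to L((a|b)*abb). */
    static int accepts(const char *s)
    {
        int state = 0;
        for (; *s; s++) {
            if (*s != 'a' && *s != 'b')
                return 0;                      /* symbol outside the alphabet */
            state = move_table[state][*s == 'b'];
        }
        return state == 3;
    }

    int main(void)
    {
        printf("%d\n", accepts("ababb"));      /* prints 1 */
        printf("%d\n", accepts("abab"));       /* prints 0 */
        return 0;
    }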

10 CMPUT 680 - Compiler Design and Optimization10 Another NFA An NFA whose start state has ε-transitions into two branches, one that loops on a and one that loops on b (state diagram omitted). An ε-transition is taken without consuming any character from the input. What does the NFA above accept? aa*|bb* (Aho, Sethi, Ullman, p. 116)

11 CMPUT 680 - Compiler Design and Optimization11 Constructing NFA How do we define an NFA that accepts a regular expression? It is very simple. Remember that a regular expression is formed by the use of alternation, concatenation, and repetition. Thus all we need to do is to know how to build the NFA for a single symbol, and how to compose NFAs.

12 CMPUT 680 - Compiler Design and Optimization12 Composing NFAs with Alternation The NFA for a symbol a is: a start state i with an a-transition to an accepting state f. Given two NFAs N(s) and N(t), the NFA N(s|t) is: a new start state i with ε-transitions to the start states of N(s) and N(t), and ε-transitions from their accepting states to a new accepting state f. (Aho, Sethi, Ullman, p. 122)

13 CMPUT 680 - Compiler Design and Optimization13 Composing NFAs with Concatenation Given two NFAs N(s) and N(t), the NFA N(st) is: N(s) followed by N(t), with the accepting state of N(s) merged into the start state of N(t). (Aho, Sethi, Ullman, p. 123)

14 CMPUT 680 - Compiler Design and Optimization14 Composing NFAs with Repetition The NFA for N(s*) is: a new start state i and a new accepting state f, with ε-transitions from i to the start of N(s) and to f, and ε-transitions from the accepting state of N(s) back to its own start and to f. (Aho, Sethi, Ullman, p. 123)

15 CMPUT 680 - Compiler Design and Optimization15 Properties of the NFA Following these construction rules, we obtain an NFA N(r) with these properties: N(r) has at most twice as many states as the number of symbols and operators in r; N(r) has exactly one starting and one accepting state; each state of N(r) has at most one outgoing transition on a symbol of the alphabet Σ or at most two outgoing ε-transitions. (Aho, Sethi, Ullman, p. 124)
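
One way to turn the three composition rules just presented into code is sketched below in C. This is not from the slides; the representation (Frag, nfa_sym, and so on) is an assumption of this example. Each constructor preserves the properties listed above: one start state, one accepting state, and at most one symbol transition or two ε-transitions per state.

    #include <stdlib.h>

    #define EPS 0                       /* marker for an epsilon-transition      */

    /* One NFA state: either one transition on a symbol, or up to two
     * epsilon-transitions (out2 is used only for epsilon splits). */
    typedef struct State State;
    struct State { int symbol; State *out1, *out2; };

    /* A fragment built by the rules: exactly one start and one accepting state;
     * the accepting state has no outgoing transitions yet. */
    typedef struct { State *start, *accept; } Frag;

    static State *new_state(int symbol, State *out1, State *out2)
    {
        State *s = malloc(sizeof *s);
        s->symbol = symbol; s->out1 = out1; s->out2 = out2;
        return s;
    }

    /* N(a): start --a--> accept */
    Frag nfa_sym(int a)
    {
        State *accept = new_state(EPS, NULL, NULL);
        return (Frag){ new_state(a, accept, NULL), accept };
    }

    /* N(st): the accepting state of N(s) is wired into the start of N(t). */
    Frag nfa_cat(Frag s, Frag t)
    {
        s.accept->out1 = t.start;
        return (Frag){ s.start, t.accept };
    }

    /* N(s|t): a new start state with epsilon-edges into both fragments,
     * and a new common accepting state. */
    Frag nfa_alt(Frag s, Frag t)
    {
        State *accept = new_state(EPS, NULL, NULL);
        s.accept->out1 = accept;
        t.accept->out1 = accept;
        return (Frag){ new_state(EPS, s.start, t.start), accept };
    }

    /* N(s*): epsilon-edges allow skipping s entirely or repeating it. */
    Frag nfa_star(Frag s)
    {
        State *accept = new_state(EPS, NULL, NULL);
        State *start  = new_state(EPS, s.start, accept);
        s.accept->out1 = s.start;
        s.accept->out2 = accept;
        return (Frag){ start, accept };
    }

For example, nfa_alt(nfa_cat(nfa_sym('a'), nfa_sym('b')), nfa_sym('c')) builds an NFA for ab|c.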

16 CMPUT 680 - Compiler Design and Optimization16 How to Parse a Regular Expression? Given a regular expression, how do we generate an automaton to recognize tokens? Create an NFA and convert it to a DFA. Using the three simple rules presented, it is easy to generate an NFA to recognize a regular expression. Given a DFA, we can generate an automaton that recognizes the longest substring of an input that is a valid token.

17 CMPUT 680 - Compiler Design and Optimization17 Regular expression notation: An Example (Appel, p. 20)
a          An ordinary character stands for itself.
ε          The empty string.
""         Another way to write the empty string.
M | N      Alternation, choosing from M or N.
M N        Concatenation, an M followed by an N.
M*         Repetition (zero or more times).
M+         Repetition (one or more times).
M?         Optional, zero or one occurrence of M.
[a-zA-Z]   Character set alternation.
.          Stands for any single character except newline.
"a.+*"     Quotation, a string in quotes stands for itself literally.

18 CMPUT 680 - Compiler Design and Optimization18 Regular expressions for some tokens (Appel, p. 20)
if                                     {return IF;}
[a-z][a-z0-9]*                         {return ID;}
[0-9]+                                 {return NUM;}
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)    {return REAL;}
("--"[a-z]*"\n")|(" "|"\n"|"\t")+      {/* do nothing */}
.                                      {error();}
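
Roughly how these rules would appear in a lex specification. This is a sketch, not the course's scanner; the token-code values and the error() helper are assumptions of this example.

    %{
    /* Token codes; in a real compiler these would come from the parser (y.tab.h). */
    enum { IF = 258, ID, NUM, REAL };
    void error(void);                     /* assumed to be defined elsewhere */
    %}
    %%
    if                                    { return IF; }
    [a-z][a-z0-9]*                        { return ID; }
    [0-9]+                                { return NUM; }
    ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)   { return REAL; }
    ("--"[a-z]*"\n")|(" "|"\n"|"\t")+     { /* do nothing */ }
    .                                     { error(); }
    %%

Because lex prefers the longest match and, on ties, the earliest rule, the input "if" is returned as IF rather than ID, while "if8" is returned as ID.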

19 CMPUT 680 - Compiler Design and Optimization19 Building Finite Automata for Lexical Tokens (Appel, p. 21) if {return IF;} The NFA for the symbol i is: state 1 with an i-transition to state 2. The NFA for the symbol f is: state 1 with an f-transition to state 2. The NFA for the regular expression if is: state 1, an i-transition to state 2, and an f-transition to the accepting state 3 (IF).

20 CMPUT 680 - Compiler Design and Optimization20 Building Finite Automata for Lexical Tokens (Appel, p. 21) [a-z][a-z0-9]* {return ID;} The NFA is: state 1, an a-z transition to the accepting state 2 (ID), which loops on a-z and 0-9.

21 CMPUT 680 - Compiler Design and Optimization21 Building Finite Automata for Lexical Tokens (Appel, p. 21) [0-9]+ {return NUM;} The NFA is: state 1, a 0-9 transition to the accepting state 2 (NUM), which loops on 0-9.

22 CMPUT 680 - Compiler Design and Optimization22 Building Finite Automata for Lexical Tokens (Appel, p. 21) ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+) {return REAL;} The NFA has states 1 through 5 (diagram omitted): one branch reads one or more digits, a ".", and then zero or more digits; the other reads zero or more digits, a ".", and then one or more digits; the accepting state is REAL.

23 CMPUT 680 - Compiler Design and Optimization23 Building Finite Automata for Lexical Tokens (Appel, p. 21) ("--"[a-z]*"\n")|(" "|"\n"|"\t")+ {/* do nothing */} The NFA has states 1 through 5 (diagram omitted): one branch matches a comment that starts with "--", continues with letters a-z, and ends with a newline; the other matches one or more blanks, newlines, or tabs.

24 CMPUT 680 - Compiler Design and Optimization24 Building Finite Automata for Lexical Tokens (Appel, p. 21) The separate NFAs for IF, ID, NUM, REAL, white space, and error (any character but \n) are drawn side by side, ready to be combined into a single automaton (diagrams omitted).

25 CMPUT 680 - Compiler Design and Optimization25 Conversion of NFA into DFA (Appel, p. 27) The token NFAs are combined into a single NFA with states 1 through 15: state 1 has ε-transitions into the branches for IF (i then f), ID (a-z, then looping on a-z and 0-9), NUM (0-9, looping on 0-9), and error (any character); diagram omitted. What states can be reached from state 1 without consuming a character?

26 CMPUT 680 - Compiler Design and Optimization26 Conversion of NFA into DFA What states can be reached from state 1 without consuming a character? {1,4,9,14} form the ε-closure of state 1. (Appel, p. 27)

27 CMPUT 680 - Compiler Design and Optimization27 Conversion of NFA into DFA What are all the state closures in this NFA? closure(1) = {1,4,9,14} closure(5) = {5,6,8} closure(8) = {6,8} closure(7) = {7,8} closure(10) = {10,11,13} closure(13) = {11,13} closure(12) = {12,13} (Appel, p. 27)

28 CMPUT 680 - Compiler Design and Optimization28 Conversion of NFA into DFA Given a set of NFA states T, ε-closure(T) is the set of states that are reachable through ε-transitions from any state s ∈ T. Given a set of NFA states T, move(T, a) is the set of states that are reachable on input a from any state s ∈ T. (Aho, Sethi, Ullman, p. 118)
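
The ε-closure is a simple graph reachability over the ε-edges; below is a minimal C sketch, not from the slides, over a fixed-size adjacency matrix (the representation is an assumption of this example). move(T, a) would be computed the same way over the a-labelled edges, but without the transitive step.

    #include <string.h>

    #define MAX_STATES 32

    /* Sets closure[t] = 1 for every state reachable from a state in T through
     * epsilon-transitions alone (states in T are included).
     * eps[s][t] is 1 if there is an epsilon-transition from s to t. */
    void eps_closure(int n, const int eps[MAX_STATES][MAX_STATES],
                     const int T[MAX_STATES], int closure[MAX_STATES])
    {
        int stack[MAX_STATES], top = 0;

        memcpy(closure, T, MAX_STATES * sizeof(int));
        for (int s = 0; s < n; s++)
            if (T[s]) stack[top++] = s;          /* start from every state in T */

        while (top > 0) {
            int s = stack[--top];
            for (int t = 0; t < n; t++)
                if (eps[s][t] && !closure[t]) {  /* new state reached on epsilon */
                    closure[t] = 1;
                    stack[top++] = t;
                }
        }
    }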

29 CMPUT 680 - Compiler Design and Optimization29 Problem Statement for Conversion of NFA into DFA Given an NFA, find the DFA with the minimum number of states that has the same behavior as the NFA for all inputs. If the initial state in the NFA is s0, then the set of states in the DFA, Dstates, is initialized with a state representing ε-closure(s0). (Aho, Sethi, Ullman, p. 118)

30 CMPUT 680 - Compiler Design and Optimization30 Conversion of NFA into DFA (Appel, p. 27) Dstates = {1-4-9-14} Now we need to compute: move(1-4-9-14, a-h) = ?

31 CMPUT 680 - Compiler Design and Optimization31 Conversion of NFA into DFA (Appel, p. 27) Dstates = {1-4-9-14} move(1-4-9-14, a-h) = {5,15} ε-closure({5,15}) = ?

32 CMPUT 680 - Compiler Design and Optimization32 Conversion of NFA into DFA (Appel, p. 27) Dstates = {1-4-9-14} move(1-4-9-14, a-h) = {5,15} ε-closure({5,15}) = {5,6,8,15} So on a-h the DFA goes from 1-4-9-14 to a new state 5-6-8-15.

33 CMPUT 680 - Compiler Design and Optimization33 Conversion of NFA into DFA (Appel, p. 27) Dstates = {1-4-9-14} move(1-4-9-14, i) = ?

34 CMPUT 680 - Compiler Design and Optimization34 Conversion of NFA into DFA (Appel, p. 27) Dstates = {1-4-9-14} move(1-4-9-14, i) = {2,5,15} ε-closure({2,5,15}) = ?

35 CMPUT 680 - Compiler Design and Optimization35 Conversion of NFA into DFA (Appel, p. 27) Dstates = {1-4-9-14} move(1-4-9-14, i) = {2,5,15} ε-closure({2,5,15}) = {2,5,6,8,15} So on i the DFA goes from 1-4-9-14 to a new state 2-5-6-8-15.

36 CMPUT 680 - Compiler Design and Optimization36 Conversion of NFA into DFA (Appel, p. 27) Dstates = {1-4-9-14} move(1-4-9-14, j-z) = ?

37 CMPUT 680 - Compiler Design and Optimization37 Conversion of NFA into DFA (Appel, p. 27) Dstates = {1-4-9-14} move(1-4-9-14, j-z) = {5,15} ε-closure({5,15}) = ?

38 CMPUT 680 - Compiler Design and Optimization38 Conversion of NFA into DFA (Appel, p. 27) Dstates = {1-4-9-14} move(1-4-9-14, j-z) = {5,15} ε-closure({5,15}) = {5,6,8,15} So on j-z the DFA also goes to 5-6-8-15.

39 CMPUT 680 - Compiler Design and Optimization39 Conversion of NFA into DFA (Appel, p. 27) Dstates = {1-4-9-14} move(1-4-9-14, 0-9) = {10,15} ε-closure({10,15}) = {10,11,13,15} So on 0-9 the DFA goes to a new state 10-11-13-15.

40 CMPUT 680 - Compiler Design and Optimization40 Conversion of NFA into DFA (Appel, p. 27) Dstates = {1-4-9-14} move(1-4-9-14, other) = {15} ε-closure({15}) = {15} So on any other character the DFA goes to state 15.

41 CMPUT 680 - Compiler Design and Optimization41 Conversion of NFA into DFA (Appel, p. 27) Dstates = {1-4-9-14} The analysis for 1-4-9-14 is complete. We mark it and pick another state in the DFA to analyse.

42 CMPUT 680 - Compiler Design and Optimization42 The corresponding DFA (Appel, p. 29) From the start state 1-4-9-14: i goes to 2-5-6-8-15 (ID), a-h and j-z go to 5-6-8-15 (ID), 0-9 goes to 10-11-13-15 (NUM), and any other character goes to 15 (error). From 2-5-6-8-15: f goes to 3-6-7-8 (IF), while a-e, g-z, and 0-9 go to 6-7-8 (ID). From 5-6-8-15 and 6-7-8: a-z and 0-9 lead to 6-7-8 (ID). From 10-11-13-15 and 11-12-13: 0-9 leads to 11-12-13 (NUM). (Diagram omitted; see p. 118 of Aho, Sethi, Ullman and p. 29 of Appel.)
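
A scanner built from a DFA like this one runs the automaton until it gets stuck, remembering the last accepting state it passed, so the longest valid token is returned (and a single character is consumed as an error when nothing matches). The C sketch below is not from the slides; dfa_move(), accepting[], and the choice of 1 as the start state are assumptions of this example.

    enum { NONE = 0 };                       /* "no token" marker                  */

    extern int dfa_move(int state, int c);   /* transition table; -1 = no move     */
    extern int accepting[];                  /* token code per state, NONE if none */

    /* Returns the next token starting at *pos and advances *pos past it. */
    int next_token(const char *input, int *pos)
    {
        int state = 1;                       /* the DFA's start state              */
        int last_token = NONE;               /* token of the last accepting state  */
        int last_end = *pos;                 /* input position just after it       */

        for (int i = *pos; input[i]; i++) {
            state = dfa_move(state, (unsigned char)input[i]);
            if (state < 0)
                break;                       /* dead state: stop scanning          */
            if (accepting[state] != NONE) {  /* remember the longest match so far  */
                last_token = accepting[state];
                last_end = i + 1;
            }
        }
        *pos = (last_token == NONE) ? *pos + 1 : last_end;
        return last_token;
    }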

43 CMPUT 680 - Compiler Design and Optimization43 Lexical Analyzer and Parser The lexical analyzer reads characters from the source program ("get next char") and, each time the syntax analyzer asks for the next token ("get next token"), returns that token; both components consult the symbol table, which contains a record for each identifier (diagram omitted). token: smallest meaningful sequence of characters of interest in the source program. (Aho, Sethi, Ullman, p. 160)

44 CMPUT 680 - Compiler Design and Optimization44 Definition of Context-Free Grammars A context-free grammar G = (T, N, S, P) consists of: 1. T, a set of terminals (scanner tokens). 2. N, a set of nonterminals (syntactic variables generated by productions). 3. S, a designated start nonterminal. 4. P, a set of productions. Each production has the form A ::= α, where A is a nonterminal and α is a sentential form, i.e., a string of zero or more grammar symbols (terminals/nonterminals).

45 CMPUT 680 - Compiler Design and Optimization45 Syntax Analysis Syntax Analysis Problem Statement: To find a derivation sequence in a grammar G for the input token stream (or say that none exists).

46 CMPUT 680 - Compiler Design and Optimization46 Parse trees A parse tree is a graphical representation of a derivation sequence of a sentential form. Tree nodes represent symbols of the grammar (nonterminals or terminals) and tree edges represent derivation steps.

47 CMPUT 680 - Compiler Design and Optimization47 Derivation Given the following grammar: E → E + E | E * E | ( E ) | - E | id Is the string -(id + id) a sentence in this grammar? Yes, because there is the following derivation: E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(id + id) where ⇒ reads "derives in one step". (Aho, Sethi, Ullman, p. 168)

48 CMPUT 680 - Compiler Design and Optimization48 Derivation E → E + E | E * E | ( E ) | - E | id Let's examine this derivation: E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(id + id) At each step the parse tree grows from the root: E, then - E, then - ( E ), then - ( E + E ), and finally - ( id + id ) (tree diagrams omitted). This is a top-down derivation because we start building the parse tree at the top. (Aho, Sethi, Ullman, p. 170)

49 CMPUT 680 - Compiler Design and Optimization49 Another Derivation Example E → E + E | E * E | ( E ) | - E | id Find a derivation for the expression: id + id * id. Two different parse trees can be built: one that applies E ⇒ E + E first, and one that applies E ⇒ E * E first (tree diagrams omitted). Which derivation tree is correct? (Aho, Sethi, Ullman, p. 171)

50 CMPUT 680 - Compiler Design and Optimization50 Another Derivation Example E → E + E | E * E | ( E ) | - E | id Find a derivation for the expression: id + id * id. According to the grammar, both are correct. A grammar that produces more than one parse tree for some input sentence is said to be an ambiguous grammar. (Aho, Sethi, Ullman, p. 171)

51 CMPUT 680 - Compiler Design and Optimization51 Left Recursion Consider the grammar: E → E + T | T T → T * F | F F → ( E ) | id A top-down parser might loop forever when parsing an expression using this grammar: E ⇒ E + T ⇒ E + T + T ⇒ E + T + T + T ⇒ ... (Aho, Sethi, Ullman, p. 176)

52 CMPUT 680 - Compiler Design and Optimization52 Left Recursion Consider the grammar: E → E + T | T T → T * F | F F → ( E ) | id A grammar that has at least one production of the form A → Aα is a left-recursive grammar. Top-down parsers do not work with left-recursive grammars. Left recursion can often be eliminated by rewriting the grammar. (Aho, Sethi, Ullman, p. 176)

53 CMPUT 680 - Compiler Design and Optimization53 Left Recursion This left-recursive grammar: E → E + T | T T → T * F | F F → ( E ) | id can be rewritten to eliminate the immediate left recursion: E → TE' E' → +TE' | ε T → FT' T' → *FT' | ε F → ( E ) | id (Aho, Sethi, Ullman, p. 176)
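
The rewritten grammar can be parsed by recursive descent, with one C function per nonterminal and the ε-productions becoming the "do nothing" branches. This sketch is not from the slides; the token codes and the tok/advance()/error() interface are assumptions of this example.

    extern int  tok;                       /* current lookahead token           */
    extern void advance(void);             /* reads the next token into tok     */
    extern void error(const char *msg);
    enum { PLUS = '+', TIMES = '*', LPAREN = '(', RPAREN = ')', ID = 256 };

    void E(void); void Eprime(void); void T(void); void Tprime(void); void F(void);

    void E(void)      { T(); Eprime(); }
    void Eprime(void) { if (tok == PLUS)  { advance(); T(); Eprime(); } /* else epsilon */ }
    void T(void)      { F(); Tprime(); }
    void Tprime(void) { if (tok == TIMES) { advance(); F(); Tprime(); } /* else epsilon */ }

    void F(void)
    {
        if (tok == LPAREN) {
            advance(); E();
            if (tok == RPAREN) advance(); else error("expected )");
        } else if (tok == ID) {
            advance();
        } else {
            error("expected ( or id");
        }
    }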

54 CMPUT 680 - Compiler Design and Optimization54 Predictive Parsing Consider the grammar: stmt → if expr then stmt else stmt | while expr do stmt | begin stmt_list end A parser for this grammar can be written with the following simple structure: switch (gettoken()) { case IF: ... break; case WHILE: ... break; case BEGIN: ... break; default: reject the input; } Based only on the first token, the parser knows which rule to use to derive a statement. Therefore this is called a predictive parser. (Aho, Sethi, Ullman, p. 183)

55 CMPUT 680 - Compiler Design and Optimization55 Left Factoring The following grammar: stmt → if expr then stmt else stmt | if expr then stmt cannot be parsed by a predictive parser that looks one element ahead. But the grammar can be rewritten: stmt → if expr then stmt stmt' stmt' → else stmt | ε where ε is the empty string. Rewriting a grammar to eliminate multiple productions starting with the same token is called left factoring. (Aho, Sethi, Ullman, p. 178)

56 CMPUT 680 - Compiler Design and Optimization56 A Predictive Parser Grammar: E → TE' E' → +TE' | ε T → FT' T' → *FT' | ε F → ( E ) | id Parsing Table: (table omitted from the transcript; see Aho, Sethi, Ullman, p. 188)

57 CMPUT 680 - Compiler Design and Optimization57 A Predictive Parser STACK: E $ INPUT: id + id * id $ The table entry M[E, id] is E → TE', so the parser pops E and pushes T E'. OUTPUT: E → TE' (parsing table omitted; Aho, Sethi, Ullman, p. 186)

58 CMPUT 680 - Compiler Design and Optimization58 A Predictive Parser STACK: T E' $ INPUT: id + id * id $ The table entry M[T, id] is T → FT', so T is replaced by F T'. OUTPUT: T → FT' (Aho, Sethi, Ullman, p. 186)

59 CMPUT 680 - Compiler Design and Optimization59 A Predictive Parser STACK: F T' E' $ INPUT: id + id * id $ The table entry M[F, id] is F → id, so F is replaced by id. OUTPUT: F → id (Aho, Sethi, Ullman, p. 188)

60 CMPUT 680 - Compiler Design and Optimization60 A Predictive Parser STACK: id T' E' $ INPUT: id + id * id $ Action when Top(Stack) = input ≠ $: pop the stack and advance the input. (Aho, Sethi, Ullman, p. 188)

61 CMPUT 680 - Compiler Design and Optimization61 A Predictive Parser STACK: T' E' $ INPUT: + id * id $ (Aho, Sethi, Ullman, p. 188)

62 CMPUT 680 - Compiler Design and Optimization62 A Predictive Parser The predictive parser proceeds in this fashion, emitting the following productions: E' → +TE' T → FT' F → id T' → *FT' F → id T' → ε E' → ε When Top(Stack) = input = $, the parser halts and accepts the input string. (Aho, Sethi, Ullman, p. 188)
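
The driver the slides are stepping through can be written as a short table-driven loop. The C sketch below is not from the slides; the stack helpers, table(), next_token(), and emit() are assumptions of this example.

    typedef int Symbol;                                    /* terminals and nonterminals     */
    typedef struct { int len; Symbol rhs[8]; } Production; /* len < 0 marks an empty entry   */

    /* All of these are assumed to be provided elsewhere. */
    extern Production table(Symbol X, Symbol a);           /* the parsing table M[X, a]      */
    extern int    is_terminal(Symbol x);
    extern Symbol next_token(void);
    extern Symbol pop(void);
    extern void   push(Symbol x);
    extern int    stack_empty(void);
    extern void   emit(Production p);                      /* report the production used     */
    extern void   error(const char *msg);

    #define END '$'                                        /* bottom-of-stack / end marker   */

    void predictive_parse(Symbol start_symbol)
    {
        push(END);
        push(start_symbol);
        Symbol a = next_token();
        while (!stack_empty()) {
            Symbol X = pop();
            if (X == END) {
                if (a == END) return;                      /* Top(Stack) = input = $: accept */
                error("input left over"); return;
            }
            if (is_terminal(X)) {
                if (X == a) a = next_token();              /* match: pop and advance         */
                else { error("unexpected token"); return; }
            } else {
                Production p = table(X, a);
                if (p.len < 0) { error("no table entry"); return; }
                emit(p);
                for (int i = p.len - 1; i >= 0; i--)
                    push(p.rhs[i]);                        /* push right-hand side in reverse */
            }
        }
    }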

63 CMPUT 680 - Compiler Design and Optimization63 LL(k) Parser This parser parses from left to right and does a leftmost derivation. It looks 1 symbol ahead to choose its next action. Therefore, it is known as an LL(1) parser. An LL(k) parser looks k symbols ahead to decide its action.

64 CMPUT 680 - Compiler Design and Optimization64 The Parsing Table Given this grammar: E → TE' E' → +TE' | ε T → FT' T' → *FT' | ε F → ( E ) | id PARSING TABLE: (table omitted) How is this parsing table built?

65 CMPUT 680 - Compiler Design and Optimization65 FIRST and FOLLOW We need to build a FIRST set and a FOLLOW set for each symbol in the grammar. FIRST(α) is the set of terminal symbols that can begin any string derived from α. FOLLOW(X) is the set of terminal symbols that can follow X: t ∈ FOLLOW(X) if and only if there exists a derivation containing Xt. The elements of FIRST and FOLLOW are terminal symbols. (Aho, Sethi, Ullman, p. 189)

66 CMPUT 680 - Compiler Design and Optimization66 Rules to Create FIRST GRAMMAR: E → TE' E' → +TE' | ε T → FT' T' → *FT' | ε F → ( E ) | id FIRST rules: 1. If X is a terminal, FIRST(X) = {X}. 2. If X → ε, then ε ∈ FIRST(X). 3. If X → Y1 Y2 ... Yk, Y1 ... Yi-1 ⇒* ε, and a ∈ FIRST(Yi), then a ∈ FIRST(X). SETS: FIRST(id) = {id} FIRST(*) = {*} FIRST(+) = {+} FIRST(() = {(} FIRST()) = {)} FIRST(F) = {(, id} FIRST(T) = FIRST(F) = {(, id} FIRST(E) = FIRST(T) = {(, id} FIRST(E') = {+, ε} FIRST(T') = {*, ε} (Aho, Sethi, Ullman, p. 189)

67 CMPUT 680 - Compiler Design and Optimization67 Rules to Create FOLLOW GRAMMAR: E → TE' E' → +TE' | ε T → FT' T' → *FT' | ε F → ( E ) | id FOLLOW rules (A and B are nonterminals, α and β are strings of grammar symbols): 1. If S is the start symbol, then $ ∈ FOLLOW(S). 2. If A → αBβ, a ∈ FIRST(β), and a ≠ ε, then a ∈ FOLLOW(B). 3. If A → αB and a ∈ FOLLOW(A), then a ∈ FOLLOW(B). 3a. If A → αBβ, β ⇒* ε, and a ∈ FOLLOW(A), then a ∈ FOLLOW(B). By rule 1, FOLLOW(E) = {$}; the production F → ( E ) and rule 2 add ), giving FOLLOW(E) = {), $}; rules 3 and 3a applied to E → TE' give FOLLOW(E') = {), $} and FOLLOW(T) = {), $}. (Aho, Sethi, Ullman, p. 189)

68 CMPUT 680 - Compiler Design and Optimization68 Rules to Create FOLLOW By rule 2 (T is followed by E', and + ∈ FIRST(E')), FOLLOW(T) grows from {), $} to {+, ), $}. The sets so far: FOLLOW(E) = {), $} FOLLOW(E') = {), $} FOLLOW(T) = {+, ), $}. (Aho, Sethi, Ullman, p. 189)

69 CMPUT 680 - Compiler Design and Optimization69 Rules to Create FOLLOW By rule 3 (T' ends the production T → FT'), FOLLOW(T') = FOLLOW(T) = {+, ), $}. (Aho, Sethi, Ullman, p. 189)

70 CMPUT 680 - Compiler Design and Optimization70 Rules to Create FOLLOW By rule 3a (in T → FT', F is followed by T', which can derive ε), FOLLOW(F) = FOLLOW(T) = {+, ), $}. (Aho, Sethi, Ullman, p. 189)

71 CMPUT 680 - Compiler Design and Optimization71 Rules to Create FOLLOW By rule 2 (F is followed by T', and * ∈ FIRST(T')), FOLLOW(F) grows to {+, *, ), $}. The final sets: FOLLOW(E) = {), $} FOLLOW(E') = {), $} FOLLOW(T) = {+, ), $} FOLLOW(T') = {+, ), $} FOLLOW(F) = {+, *, ), $}. (Aho, Sethi, Ullman, p. 189)

72 CMPUT 680 - Compiler Design and Optimization72 Rules to Build Parsing Table (slides 72 through 79 repeat this material while the parsing table, omitted from the transcript, is filled in entry by entry) GRAMMAR: E → TE' E' → +TE' | ε T → FT' T' → *FT' | ε F → ( E ) | id FIRST SETS: FIRST(E) = FIRST(T) = FIRST(F) = {(, id} FIRST(E') = {+, ε} FIRST(T') = {*, ε} FOLLOW SETS: FOLLOW(E) = FOLLOW(E') = {), $} FOLLOW(T) = FOLLOW(T') = {+, ), $} FOLLOW(F) = {+, *, ), $} Table-building rules, applied to each production A → α: 1. For each terminal a ∈ FIRST(α), add A → α to M[A, a]. 2. If ε ∈ FIRST(α), add A → α to M[A, b] for each terminal b ∈ FOLLOW(A). 3. If ε ∈ FIRST(α) and $ ∈ FOLLOW(A), add A → α to M[A, $]. (Aho, Sethi, Ullman, p. 190)
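
These three rules translate directly into code once FIRST and FOLLOW are available. The C sketch below is not from the book; the grammar representation and the helper functions (first_of, in_follow, add_entry) are assumptions of this example.

    typedef int Symbol;
    typedef struct { Symbol lhs; int len; Symbol rhs[8]; } Production;

    #define EPSILON (-1)
    #define DOLLAR  '$'

    /* Assumed to be provided elsewhere. */
    extern Production grammar[];                 /* all productions A -> alpha          */
    extern int        num_productions;
    extern Symbol     terminals[];               /* all terminals, including '$'        */
    extern int        num_terminals;
    extern int  first_of(const Symbol *alpha, int len, Symbol a);  /* a in FIRST(alpha)? */
    extern int  in_follow(Symbol A, Symbol a);                     /* a in FOLLOW(A)?    */
    extern void add_entry(Symbol A, Symbol a, int production);     /* M[A, a] = prod     */

    void build_table(void)
    {
        for (int p = 0; p < num_productions; p++) {
            const Production *prod = &grammar[p];

            /* Rule 1: for each terminal a in FIRST(alpha), M[A, a] = A -> alpha. */
            for (int t = 0; t < num_terminals; t++) {
                Symbol a = terminals[t];
                if (a != DOLLAR && first_of(prod->rhs, prod->len, a))
                    add_entry(prod->lhs, a, p);
            }

            /* Rules 2 and 3: alpha can derive the empty string. */
            if (first_of(prod->rhs, prod->len, EPSILON)) {
                for (int t = 0; t < num_terminals; t++) {
                    Symbol b = terminals[t];
                    if (in_follow(prod->lhs, b))         /* b may be '$' (rule 3) */
                        add_entry(prod->lhs, b, p);
                }
            }
        }
    }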

80 CMPUT 680 - Compiler Design and Optimization80 Bottom-Up and Top-Down Parsers Top-down parsers: starts constructing the parse tree at the top (root) of the tree and move down towards the leaves. Easy to implement by hand, but work with restricted grammars. example: predictive parsers Bottom-up parsers: build the nodes on the bottom of the parse tree first. Suitable for automatic parser generation, handle a larger class of grammars. examples: shift-reduce parser (or LR(k) parsers) (Aho,Sethi,Ullman, pp. 195)

81 CMPUT 680 - Compiler Design and Optimization81 Bottom-Up Parser A bottom-up parser, or a shift-reduce parser, begins at the leaves and works up to the top of the tree. The reduction steps trace a rightmost derivation in reverse. Consider the grammar: S → aABe A → Abc | b B → d We want to parse the input string abbcde. (Aho, Sethi, Ullman, p. 195)

82 CMPUT 680 - Compiler Design and Optimization82 Bottom-Up Parser Example INPUT: a b b c d e $ OUTPUT: (empty) Productions: S → aABe A → Abc A → b B → d (Aho, Sethi, Ullman, p. 195)

83 CMPUT 680 - Compiler Design and Optimization83 Bottom-Up Parser Example INPUT: a b b c d e $ The first b is reduced to A using A → b. (Aho, Sethi, Ullman, p. 195)

84 CMPUT 680 - Compiler Design and Optimization84 Bottom-Up Parser Example INPUT: a A b c d e $ OUTPUT: the subtree for A → b. (Aho, Sethi, Ullman, p. 195)

85 CMPUT 680 - Compiler Design and Optimization85 Bottom-Up Parser Example INPUT: a A b c d e $ We are not reducing the second b in this example. A parser would reduce, get stuck, and then backtrack! (Aho, Sethi, Ullman, p. 195)

86 CMPUT 680 - Compiler Design and Optimization86 Bottom-Up Parser Example INPUT: a A b c d e $ The substring A b c is reduced to A using A → Abc. (Aho, Sethi, Ullman, p. 195)

87 CMPUT 680 - Compiler Design and Optimization87 Bottom-Up Parser Example INPUT: a A d e $ OUTPUT: the subtree for A → Abc built on top of A → b. (Aho, Sethi, Ullman, p. 195)

88 CMPUT 680 - Compiler Design and Optimization88 Bottom-Up Parser Example INPUT: a A d e $ The d is reduced to B using B → d. (Aho, Sethi, Ullman, p. 195)

89 CMPUT 680 - Compiler Design and Optimization89 Bottom-Up Parser Example INPUT: a A B e $ (Aho, Sethi, Ullman, p. 195)

90 CMPUT 680 - Compiler Design and Optimization90 Bottom-Up Parser Example INPUT: a A B e $ The string a A B e is reduced to S using S → aABe. (Aho, Sethi, Ullman, p. 195)

91 CMPUT 680 - Compiler Design and Optimization91 Bottom-Up Parser Example INPUT: S The parse is complete. This parser is known as an LR parser because it scans the input from Left to right, and it constructs a Rightmost derivation in reverse order. (Aho, Sethi, Ullman, p. 195)

92 CMPUT 680 - Compiler Design and Optimization92 Bottom-Up Parser Example Scanning the productions to match handles in the input string, together with backtracking, makes the method used in the previous example very inefficient. Can we do better?

93 CMPUT 680 - Compiler Design and Optimization93 LR Parser Example The LR parsing program reads the input, keeps a stack of states, and consults an action table and a goto table to produce its output (diagram omitted). (Aho, Sethi, Ullman, p. 217)

94 CMPUT 680 - Compiler Design and Optimization94 LR Parser Example The following grammar: (1) E → E + T (2) E → T (3) T → T * F (4) T → F (5) F → ( E ) (6) F → id can be parsed with this action and goto table. (table omitted; Aho, Sethi, Ullman, p. 219)
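
All LR parsers share the same driver loop over the action and goto tables. The C sketch below is not from the slides; the table accessors, the production helpers, and emit() are assumptions of this example.

    typedef enum { SHIFT, REDUCE, ACCEPT, ERROR } Kind;
    typedef struct { Kind kind; int arg; } Action;          /* arg: a state or a production */

    /* Assumed to be provided elsewhere. */
    extern Action action(int state, int token);              /* the action table            */
    extern int    go_to(int state, int nonterminal);         /* the goto table              */
    extern int    rhs_length(int production);
    extern int    lhs(int production);
    extern int    next_token(void);
    extern void   emit(int production);                      /* output the reduction used   */

    void lr_parse(void)
    {
        int stack[256], top = 0;
        stack[top] = 0;                                       /* start in state 0            */
        int a = next_token();
        for (;;) {
            Action act = action(stack[top], a);
            switch (act.kind) {
            case SHIFT:
                stack[++top] = act.arg;                       /* push the new state          */
                a = next_token();
                break;
            case REDUCE:
                top -= rhs_length(act.arg);                   /* pop |right-hand side| states */
                top++;
                stack[top] = go_to(stack[top - 1], lhs(act.arg));
                emit(act.arg);
                break;
            case ACCEPT:
                return;
            default:                                          /* ERROR */
                return;
            }
        }
    }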

95 CMPUT 680 - Compiler Design and Optimization95 LR Parser Example (slides 95 through 116 step through this parse one move at a time; the action/goto tables and the stack pictures are omitted from the transcript) GRAMMAR: (1) E → E + T (2) E → T (3) T → T * F (4) T → F (5) F → ( E ) (6) F → id INPUT: id * id + id $ The parser's moves are:
STACK 0                  INPUT id * id + id $   shift 5
STACK 0 id 5             INPUT * id + id $      reduce by F → id
STACK 0 F 3              INPUT * id + id $      reduce by T → F
STACK 0 T 2              INPUT * id + id $      shift 7
STACK 0 T 2 * 7          INPUT id + id $        shift 5
STACK 0 T 2 * 7 id 5     INPUT + id $           reduce by F → id
STACK 0 T 2 * 7 F 10     INPUT + id $           reduce by T → T * F
STACK 0 T 2              INPUT + id $           reduce by E → T
STACK 0 E 1              INPUT + id $           shift 6
STACK 0 E 1 + 6          INPUT id $             shift 5
STACK 0 E 1 + 6 id 5     INPUT $                reduce by F → id
STACK 0 E 1 + 6 F 3      INPUT $                reduce by T → F
STACK 0 E 1 + 6 T 9      INPUT $                reduce by E → E + T
STACK 0 E 1              INPUT $                accept
The output is the parse tree for id * id + id, built bottom-up from these reductions. (Aho, Sethi, Ullman, p. 220)

117 CMPUT 680 - Compiler Design and Optimization117 Constructing Parsing Tables All LR parsers use the same parsing program that we demonstrated in the previous slides. What differentiates the LR parsers are the action and the goto tables: Simple LR (SLR): succeeds for the fewest grammars, but is the easiest to implement (see Aho, Sethi, Ullman, pp. 221-230). Canonical LR: succeeds for the most grammars, but is the hardest to implement; it splits states when necessary to prevent reductions that would get the parser stuck (see Aho, Sethi, Ullman, pp. 230-236). Lookahead LR (LALR): succeeds for most common syntactic constructions used in programming languages, and produces LR tables much smaller than canonical LR (see Aho, Sethi, Ullman, pp. 236-247). (Aho, Sethi, Ullman, p. 221)

118 CMPUT 680 - Compiler Design and Optimization118 Using Lex A Lex source program lex.l is run through the Lex compiler to produce lex.yy.c; lex.yy.c is run through the C compiler to produce a.out; a.out then reads an input stream and produces a sequence of tokens (diagram omitted). (Aho, Sethi, Ullman, p. 258)
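
On a Unix system the pipeline looks roughly like this (a sketch; the library flag differs between lex and flex installations):

    lex lex.l              # produces lex.yy.c from the lex source program
    cc lex.yy.c -ll        # compile; -ll links the lex library (with flex, use -lfl)
    ./a.out < input.txt    # read the input stream and emit the sequence of tokens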

119 CMPUT 680 - Compiler Design and Optimization119 Parsing Action Conflicts If the grammar specified is ambiguous, yacc will report parsing action conflicts. These conflicts can be reduce/reduce conflicts or shift/reduce conflicts. Yacc has rules to resolve such conflicts automatically (see Aho, Sethi, Ullman, pp. 262-264), but the resulting parser might not have the behavior intended by the grammar writer. Whenever you see a conflict report, rerun yacc with the -v flag, examine the y.output file, and re-write your grammar to eliminate the conflicts. (Aho, Sethi, Ullman, p. 262)
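
For example (the file names here are placeholders):

    yacc -v grammar.y        # -v writes the generated automaton and conflict details to y.output
    grep conflict y.output   # locate the states involved in the conflicts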

120 CMPUT 680 - Compiler Design and Optimization120 Three-Address Statements A popular form of intermediate code used in optimizing compilers is three-address statements (or variations, such as quadruples). Source statement: x = a + b * c + d Three-address statements with temporaries t1 and t2: t1 = b * c t2 = a + t1 x = t2 + d (Aho, Sethi, Ullman, p. 466)
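
One common in-memory form for three-address code is an array of quadruples; the C sketch below shows the statement above in that form. It is an illustration only; the field and operator names are assumptions of this example.

    /* A quadruple: an operator plus the names of up to two operands and a result
     * (program variables or compiler-generated temporaries). */
    typedef enum { OP_ADD, OP_MUL } Op;
    typedef struct {
        Op op;
        const char *arg1, *arg2, *result;
    } Quad;

    /* x = a + b * c + d as a sequence of quadruples: */
    static const Quad code[] = {
        { OP_MUL, "b",  "c",  "t1" },   /* t1 = b * c  */
        { OP_ADD, "a",  "t1", "t2" },   /* t2 = a + t1 */
        { OP_ADD, "t2", "d",  "x"  },   /* x  = t2 + d */
    };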

121 CMPUT 680 - Compiler Design and Optimization121 Intermediate Code Generation Reading List: Aho, Sethi, Ullman: Sections 8.1 to 8.3 and Section 8.7

122 CMPUT 680 - Compiler Design and Optimization122 Front End of a Compiler The front end is the lexical analyzer (scanner), the syntax analyzer (parser), and the semantic analyzer, which together produce an abstract syntax tree with attributes (and error messages); the intermediate-code generator then turns that tree into non-optimized intermediate code (diagram omitted).

123 CMPUT 680 - Compiler Design and Optimization123 Component-Based Approach to Building Compilers A source program in Language-1 or Language-2 goes through its own front end, producing non-optimized intermediate code; a shared intermediate-code optimizer produces optimized intermediate code, which separate code generators then translate into Target-1 or Target-2 machine code (diagram omitted).

124 CMPUT 680 - Compiler Design and Optimization124 Advantages of Using an Intermediate Language 1. Retargeting - Build a compiler for a new machine by attaching a new code generator to an existing front-end. 2. Optimization - reuse intermediate code optimizers in compilers for different languages and different machines. Note: the terms “intermediate code”, “intermediate language”, and “intermediate representation” are all used interchangeably.

125 The Phases of a Compiler
position := initial + rate * 60
lexical analyzer: id1 := id2 + id3 * 60
syntax analyzer: (syntax tree for id1 := id2 + id3 * 60)
semantic analyzer: (syntax tree with the conversion inttoreal applied to 60)
intermediate code generator:
  temp1 := inttoreal(60)
  temp2 := id3 * temp1
  temp3 := id2 + temp2
  id1 := temp3
code optimizer:
  temp1 := id3 * 60.0
  id1 := id2 + temp1
code generator:
  MOVF id3, R2
  MULF #60.0, R2
  MOVF id2, R1
  ADDF R2, R1
  MOVF R1, id1

