1 CMPSC 160 Translation of Programming Languages Fall 2002 slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon Lecture-Module #5 Introduction to Parsing
2 First Phase: Lexical Analysis (Scanning)
Scanner
Maps a stream of characters into tokens
–Basic unit of syntax
Characters that form a word are its lexeme
Its syntactic category is called its token
Scanner discards white space and comments
Scanner works as a subroutine of the parser
[Figure: source code → Scanner → tokens → Parser → IR; the parser asks the scanner for the next token, and both report errors]
3 Lexical Analysis
Specify tokens using regular expressions
Translate regular expressions to finite automata
Use finite automata to generate tables or code for the scanner
[Figure: specifications (regular expressions) → Scanner Generator → tables or code; the generated scanner turns source code into tokens]
4 Automating Scanner Construction
To build a scanner:
1 Write down the RE that specifies the tokens
2 Translate the RE to an NFA
3 Build the DFA that simulates the NFA
4 Minimize the DFA
5 Turn it into code or a table
Scanner generators
–Lex, Flex, JLex work along these lines
–Algorithms are well-known and well-understood
–Interface to parser is important
5 Automating Scanner Construction
RE → NFA (Thompson’s construction)
–Build an NFA for each term
–Combine them with ε-moves
NFA → DFA (subset construction)
–Build the simulation
DFA → Minimal DFA
–Hopcroft’s algorithm
DFA → RE
–All pairs, all paths problem
–Union together paths from s0 to a final state
[Figure: The Cycle of Constructions: RE → NFA → DFA → minimal DFA, and DFA → RE]
6 Scanner Generators: JLex, Lex, FLex
Layout of a specification file:
user code
%%
JLex directives
%%
regular expression rules
–User code is copied directly to the output file
–Directives hold macro (regular) definitions (e.g., digits = [0-9]+) and state names
–Each rule has an optional state list, a regular expression, and an action
States can be mixed with regular expressions
For each regular expression we can define a set of states where it is valid (JLex, Flex)
Typical format of regular expression rules:
regular_expression { actions }
7 JLex, FLex, Lex
Regular expression rules:
r_1 { action_1 }
r_2 { action_2 }
...
r_n { action_n }
Actions are Java code for JLex, C code for FLex and Lex
[Figure: a new start state s0 joined to the automata A_r_1, A_r_2, ..., A_r_n built for the individual regular expressions, with new final states and an error state]
For faster scanning, convert this NFA to a DFA and minimize the states
Rules used by scanner generators
1) Continue scanning the input until reaching an error state
2) Accept the longest prefix that matches a regular expression and execute the corresponding action
3) If two patterns match the longest prefix, the action specified earlier is executed
4) After a match, go back to the end of the accepted prefix in the input and start scanning for the next token
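The four scanning rules above can be sketched in a few lines of Python. This is only an illustration, not how JLex or Flex are implemented: the rule list `RULES`, the token names, and the function `scan` are assumptions made up for this example, and Python's `re` module stands in for the generated DFA. The patterns are deliberately simple so that each `match` call returns the longest match for its own pattern.

```python
import re

# Hypothetical rules in priority order: (token name, regular expression).
RULES = [
    ("IF", r"if"),
    ("ID", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("NUM", r"[0-9]+"),
    ("OP", r"[-+*/]"),
    ("WS", r"[ \t\n]+"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def scan(text):
    """Return (token, lexeme) pairs: longest match wins, earliest rule wins ties."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in COMPILED:
            m = regex.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the rule listed earlier is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # white space is discarded
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len  # resume scanning right after the accepted prefix
    return tokens
```

With these rules, `scan("if ifx 42")` classifies `if` as `IF` (a tie with `ID`, resolved by rule order) and `ifx` as `ID` (the longer match wins).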
8 Limits of Regular Languages
Advantages of regular expressions
–Simple & powerful notation for specifying patterns
–Automatic construction of fast recognizers
–Many kinds of syntax can be specified with REs
If REs are so useful, why not use them for everything?
Example: an expression grammar
Id → [a-zA-Z] ([a-zA-Z] | [0-9])*
Num → [0-9]+
Term → Id | Num
Op → “+” | “-” | “*” | “/”
Expr → ( Term Op )* Term
9 Limits of Regular Languages
If we add balanced parentheses to the expression grammar, we cannot represent it using regular expressions:
Id → [a-zA-Z] ([a-zA-Z] | [0-9])*
Num → [0-9]+
Term → Id | Num
Op → “+” | “-” | “*” | “/”
Expr → Term | Expr Op Expr | “(” Expr “)”
A DFA of size n cannot recognize balanced parentheses with nesting depth greater than n
Not all languages are regular: RLs ⊂ CFLs ⊂ CSLs
Solution: use a more powerful formalism, context-free grammars
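To see why a fixed number of states is not enough, consider a minimal sketch of a balanced-parentheses check. The function name `balanced` is an assumption for this example. The recognizer needs a counter `depth` that can grow without bound, which is exactly the information a DFA with n states cannot store once the nesting depth exceeds n.

```python
def balanced(s):
    """Return True if the parentheses in s are balanced; other characters are ignored."""
    depth = 0  # current nesting depth: unbounded, unlike a DFA's finite state
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ")" with no matching "("
                return False
    return depth == 0
```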
10 The Front End: Parser
Parser
Input: a sequence of tokens representing the source program
Output: a parse tree (in practice an abstract syntax tree)
While generating the parse tree, the parser checks the stream of tokens for grammatical correctness
–Checks the context-free syntax
Parser builds an IR representation of the code
–Generates an abstract syntax tree
Guides checking at deeper levels than syntax
[Figure: source code → Scanner → tokens → Parser → IR → Type Checker → IR; the parser asks the scanner for the next token, and the phases report errors]
11 The Study of Parsing Need a mathematical model of syntax — a grammar G –Context-free grammars Need an algorithm for testing membership in L(G) –Parsing algorithms Parsing is the process of discovering a derivation for some sentence from the rules of the grammar –Equivalently, it is the process of discovering a parse tree Natural language analogy –Lexical rules correspond to rules that define the valid words –Grammar rules correspond to rules that define valid sentences
12 An Example Grammar
1 Start → Expr
2 Expr → Expr Op Expr
3      | num
4      | id
5 Op → +
6    | -
7    | *
8    | /
Start symbol: S = Start
Nonterminal symbols: N = { Start, Expr, Op }
Terminal symbols: T = { num, id, +, -, *, / }
Productions: P = { 1, 2, 3, 4, 5, 6, 7, 8 } (shown above)
13 Specifying Syntax with a Grammar
Context-free syntax is specified with a context-free grammar
Formally, a grammar is a four-tuple, G = (S, N, T, P)
T is a set of terminal symbols
–These correspond to tokens returned by the scanner
–For the parser, tokens are indivisible units of syntax
N is a set of nonterminal symbols
–These are syntactic variables that can be substituted during a derivation
–Variables that denote sets of substrings occurring in the language
S is the start symbol: S ∈ N
–All the strings in L(G) are derived from the start symbol
P is a set of productions or rewrite rules: P : N → (N ∪ T)*
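One possible way to write the four-tuple down as data, shown here for the example expression grammar. The class name `Grammar`, its field names, and the method `is_well_formed` are assumptions chosen for this sketch; the check simply restates the conditions S ∈ N and P : N → (N ∪ T)*, plus the usual requirement that N and T are disjoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    start: str                # S
    nonterminals: frozenset   # N
    terminals: frozenset      # T
    productions: tuple        # P, as (lhs, rhs) pairs; rhs is a tuple of symbols

    def is_well_formed(self):
        """Check S in N, N and T disjoint, and each production in N x (N u T)*."""
        symbols = self.nonterminals | self.terminals
        return (
            self.start in self.nonterminals
            and not (self.nonterminals & self.terminals)
            and all(
                lhs in self.nonterminals and all(s in symbols for s in rhs)
                for lhs, rhs in self.productions
            )
        )


# The example expression grammar with its eight productions.
EXPR_GRAMMAR = Grammar(
    start="Start",
    nonterminals=frozenset({"Start", "Expr", "Op"}),
    terminals=frozenset({"num", "id", "+", "-", "*", "/"}),
    productions=(
        ("Start", ("Expr",)),
        ("Expr", ("Expr", "Op", "Expr")),
        ("Expr", ("num",)),
        ("Expr", ("id",)),
        ("Op", ("+",)),
        ("Op", ("-",)),
        ("Op", ("*",)),
        ("Op", ("/",)),
    ),
)
```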
14 Production Rules
Restrictions on the production rules determine the expressive power
Regular grammars: productions are either left-linear or right-linear
–Right-linear: productions are of the form A → wB or A → w, where A, B are nonterminals and w is a string of terminals
–Left-linear: productions are of the form A → Bw or A → w, where A, B are nonterminals and w is a string of terminals
–Regular grammars generate regular sets
–One can automatically construct a regular grammar from an NFA that accepts the same language (and vice versa)
Context-free grammars: productions are of the form A → α, where A is a nonterminal symbol and α is a string of terminal and nonterminal symbols
Context-sensitive grammars: productions are of the form α → β, where α and β are arbitrary strings of terminal and nonterminal symbols with α ≠ ε and |α| ≤ |β|
Unrestricted grammars: productions are of the form α → β, where α and β are arbitrary strings of terminal and nonterminal symbols with α ≠ ε
–Unrestricted grammars are as powerful as Turing machines
15 An NFA can be translated to a Regular Grammar
For each state i of the NFA, create a nonterminal symbol A_i
If state i has a transition to state j on symbol a, introduce the production A_i → a A_j
If state i goes to state j on ε, introduce the production A_i → A_j
If i is an accepting state, introduce A_i → ε
If i is the start state, make A_i the start symbol of the grammar
[Figure: NFA with states S0 to S4: S0 goes to S1 on ε; S1 loops on a and on b; S1 goes to S2 on a, S2 to S3 on b, S3 to S4 on b; S4 is accepting]
1 A_0 → A_1
2 A_1 → a A_1
3     | b A_1
4     | a A_2
5 A_2 → b A_3
6 A_3 → b A_4
7 A_4 → ε
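The translation rules can be sketched directly as code. The function name `nfa_to_grammar`, the transition-list encoding, and the use of the empty string for ε are assumptions made for this example; nonterminals are named `A0`, `A1`, ... after the states.

```python
EPS = ""  # assumption: the empty string stands for an epsilon move


def nfa_to_grammar(transitions, start, accepting):
    """Build a right-linear grammar from an NFA.

    transitions is a list of (state, symbol, next_state) triples.
    Returns (start_symbol, productions); each production is (lhs, rhs),
    where rhs is a tuple of symbols and () denotes epsilon.
    """
    productions = []
    for i, a, j in transitions:
        # A_i -> A_j for an epsilon move, A_i -> a A_j otherwise
        rhs = (f"A{j}",) if a == EPS else (a, f"A{j}")
        productions.append((f"A{i}", rhs))
    for i in sorted(accepting):
        productions.append((f"A{i}", ()))  # A_i -> epsilon
    return f"A{start}", productions


# The NFA from the slide, which accepts strings over {a, b} ending in abb.
NFA = [(0, EPS, 1), (1, "a", 1), (1, "b", 1), (1, "a", 2), (2, "b", 3), (3, "b", 4)]
START_SYMBOL, PRODUCTIONS = nfa_to_grammar(NFA, start=0, accepting={4})
```

For this NFA the function yields the same seven productions as the slide, in the same order.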
16 Derivations
An example grammar
1 S → Expr
2 Expr → Expr Op Expr
3      | num
4      | id
5 Op → +
6    | -
7    | *
8    | /
An example derivation for x - 2 * y
Rule  Sentential Form
 -    S
 1    Expr
 2    Expr Op Expr
 4    <id,x> Op Expr
 6    <id,x> - Expr
 2    <id,x> - Expr Op Expr
 3    <id,x> - <num,2> Op Expr
 7    <id,x> - <num,2> * Expr
 4    <id,x> - <num,2> * <id,y>
Such a sequence of rewrites is called a derivation
The process of discovering a derivation is called parsing
We denote this as: S ⇒* id - num * id
A ⇒ B means A derives B after applying one production
A ⇒* B means A derives B after applying zero or more productions
17 Sentences and Sentential Forms
Given a grammar G with a start symbol S
A string of terminal symbols that can be derived from S by applying the productions is called a sentence of the grammar
–These strings are the members of the set L(G), the language defined by the grammar
A string of terminal and nonterminal symbols that can be derived from S by applying the productions of the grammar is called a sentential form of the grammar
–Each step of a derivation forms a sentential form
–Sentences are sentential forms with no nonterminal symbols
18 Derivations At each step, we make two choices 1.Choose a non-terminal to replace 2.Choose a production to apply Different choices lead to different derivations Two types of derivation are of interest Leftmost derivation — replace leftmost non-terminal at each step Rightmost derivation — replace rightmost non-terminal at each step These are the two systematic derivations (the first choice is fixed) The example on the earlier slide was a leftmost derivation Of course, there is a rightmost derivation (next slide)
19 Two Derivations for x - 2 * y
Leftmost derivation
Rule  Sentential Form
 -    S
 1    Expr
 2    Expr Op Expr
 4    <id,x> Op Expr
 6    <id,x> - Expr
 2    <id,x> - Expr Op Expr
 3    <id,x> - <num,2> Op Expr
 7    <id,x> - <num,2> * Expr
 4    <id,x> - <num,2> * <id,y>
Rightmost derivation
Rule  Sentential Form
 -    S
 1    Expr
 2    Expr Op Expr
 4    Expr Op <id,y>
 7    Expr * <id,y>
 2    Expr Op Expr * <id,y>
 3    Expr Op <num,2> * <id,y>
 6    Expr - <num,2> * <id,y>
 4    <id,x> - <num,2> * <id,y>
In both cases, S ⇒* id - num * id
Note that these two derivations produce different parse trees
The parse trees imply different evaluation orders!
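The two derivations can be replayed mechanically. The sketch below assumes the rule numbering of the example grammar (1 to 8) and works on token categories (`id`, `num`) rather than lexemes; the names `RULES`, `NONTERMINALS`, and `derive` are hypothetical. At each step the only freedom left is which production to apply, because the nonterminal to replace is fixed as the leftmost or rightmost one.

```python
RULES = {
    1: ("S", ["Expr"]),
    2: ("Expr", ["Expr", "Op", "Expr"]),
    3: ("Expr", ["num"]),
    4: ("Expr", ["id"]),
    5: ("Op", ["+"]),
    6: ("Op", ["-"]),
    7: ("Op", ["*"]),
    8: ("Op", ["/"]),
}
NONTERMINALS = {"S", "Expr", "Op"}


def derive(rule_numbers, leftmost=True):
    """Apply the numbered rules in order and return every sentential form."""
    form = ["S"]
    forms = [" ".join(form)]
    for n in rule_numbers:
        lhs, rhs = RULES[n]
        positions = [i for i, sym in enumerate(form) if sym in NONTERMINALS]
        if not positions:
            raise ValueError("no nonterminal left to rewrite")
        i = positions[0] if leftmost else positions[-1]
        if form[i] != lhs:
            raise ValueError(f"rule {n} does not apply to {form[i]}")
        form[i:i + 1] = rhs  # rewrite the chosen nonterminal in place
        forms.append(" ".join(form))
    return forms


LEFTMOST = derive([1, 2, 4, 6, 2, 3, 7, 4], leftmost=True)
RIGHTMOST = derive([1, 2, 4, 7, 2, 3, 6, 4], leftmost=False)
```

Both lists end in the sentence `id - num * id`, but the intermediate sentential forms differ.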
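20 Derivations and Parse Trees
Leftmost derivation
Rule  Sentential Form
 -    S
 1    Expr
 2    Expr Op Expr
 4    <id,x> Op Expr
 6    <id,x> - Expr
 2    <id,x> - Expr Op Expr
 3    <id,x> - <num,2> Op Expr
 7    <id,x> - <num,2> * Expr
 4    <id,x> - <num,2> * <id,y>
Parse tree: S( Expr( Expr(<id,x>) Op(-) Expr( Expr(<num,2>) Op(*) Expr(<id,y>) ) ) )
This evaluates as x - ( 2 * y )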
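21 Derivations and Parse Trees
Rightmost derivation
Rule  Sentential Form
 -    S
 1    Expr
 2    Expr Op Expr
 4    Expr Op <id,y>
 7    Expr * <id,y>
 2    Expr Op Expr * <id,y>
 3    Expr Op <num,2> * <id,y>
 6    Expr - <num,2> * <id,y>
 4    <id,x> - <num,2> * <id,y>
Parse tree: S( Expr( Expr( Expr(<id,x>) Op(-) Expr(<num,2>) ) Op(*) Expr(<id,y>) ) )
This evaluates as ( x - 2 ) * y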
22 Another Rightmost Derivation
Another rightmost derivation
Rule  Sentential Form
 -    S
 1    Expr
 2    Expr Op Expr
 2    Expr Op Expr Op Expr
 4    Expr Op Expr Op <id,y>
 7    Expr Op Expr * <id,y>
 3    Expr Op <num,2> * <id,y>
 6    Expr - <num,2> * <id,y>
 4    <id,x> - <num,2> * <id,y>
Parse tree: S( Expr( Expr(<id,x>) Op(-) Expr( Expr(<num,2>) Op(*) Expr(<id,y>) ) ) )
This evaluates as x - ( 2 * y )
This parse tree is different from the parse tree for the previous rightmost derivation, but it is the same as the parse tree for the earlier leftmost derivation
23 Derivation and Parse Trees
A parse tree does not show the order in which the productions were applied; it ignores variations in that order
Each parse tree has a corresponding unique leftmost derivation
Each parse tree has a corresponding unique rightmost derivation
24 Parse Trees and Precedence These two parse trees point out a problem with the expression grammar: It has no notion of precedence (implied order of evaluation between different operators) To add precedence Create a non-terminal for each level of precedence Isolate the corresponding part of the grammar Force parser to recognize high precedence subexpressions first For algebraic expressions Multiplication and division, first Subtraction and addition, next
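25 Another Problem: Parse Trees and Associativity
[Figure: two parse trees for an expression with two subtractions. In one tree the left subtraction is grouped first and the result is 1; in the other the right subtraction is grouped first and the result is 5.]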
26 Precedence and Associativity
Adding the standard algebraic precedence and using left recursion produces:
1 S → Expr
2 Expr → Expr + Term
3      | Expr - Term
4      | Term
5 Term → Term * Factor
6      | Term / Factor
7      | Factor
8 Factor → num
9        | id
This grammar is slightly larger
–Takes more rewriting to reach some of the terminal symbols
–Encodes expected precedence
–Enforces left-associativity
–Produces same parse tree under leftmost & rightmost derivations
Let’s see how it parses our example
27 Precedence
The leftmost derivation
Rule  Sentential Form
 -    S
 1    Expr
 3    Expr - Term
 4    Term - Term
 7    Factor - Term
 9    <id,x> - Term
 5    <id,x> - Term * Factor
 7    <id,x> - Factor * Factor
 8    <id,x> - <num,2> * Factor
 9    <id,x> - <num,2> * <id,y>
Its parse tree: S( Expr( Expr( Term( Factor(<id,x>) ) ) - Term( Term( Factor(<num,2>) ) * Factor(<id,y>) ) ) )
This produces x - ( 2 * y ), along with an appropriate parse tree.
Both the leftmost and rightmost derivations give the same parse tree and the same evaluation order, because the grammar directly encodes the desired precedence.
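28 Associativity
The rightmost derivation
Rule  Sentential Form
 -    S
 1    Expr
 3    Expr - Term
 7    Expr - Factor
 8    Expr - num
 3    Expr - Term - num
 7    Expr - Factor - num
 8    Expr - num - num
 4    Term - num - num
 7    Factor - num - num
 8    num - num - num
Its parse tree: S( Expr( Expr( Expr( Term( Factor(num) ) ) - Term( Factor(num) ) ) - Term( Factor(num) ) ) )
This produces ( … ) - 2, with the first subtraction grouped inside the parentheses, along with an appropriate parse tree.
Both the leftmost and rightmost derivations give the same parse tree and the same evaluation order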
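The effect of the left-recursive grammar can be checked with a small sketch that follows the productions literally. Because Expr → Expr - Term puts everything before the last top-level + or - into the inner Expr, the code splits the token list at the last such operator, and does the same with * and / for Term. The function names, the tuple encoding of trees, and the test expression 5 - 2 - 2 are assumptions chosen for illustration; this is not a parsing algorithm from the lecture.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def parse_expr(tokens):
    """Expr -> Expr + Term | Expr - Term | Term"""
    for i in range(len(tokens) - 1, -1, -1):  # find the LAST + or -
        if tokens[i] in ("+", "-"):
            return (tokens[i], parse_expr(tokens[:i]), parse_term(tokens[i + 1:]))
    return parse_term(tokens)


def parse_term(tokens):
    """Term -> Term * Factor | Term / Factor | Factor"""
    for i in range(len(tokens) - 1, -1, -1):  # find the LAST * or /
        if tokens[i] in ("*", "/"):
            return (tokens[i], parse_term(tokens[:i]), parse_factor(tokens[i + 1:]))
    return parse_factor(tokens)


def parse_factor(tokens):
    """Factor -> num | id"""
    if len(tokens) != 1 or tokens[0] in OPS:
        raise SyntaxError(f"expected a single num or id, got {tokens}")
    return tokens[0]


def evaluate(tree, env):
    """Evaluate a tree; identifiers are looked up in env, other leaves are integers."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, env), evaluate(right, env))
    return env[tree] if tree in env else int(tree)
```

For example, `parse_expr("x - 2 * y".split())` gives `("-", "x", ("*", "2", "y"))`, so the multiplication is grouped first, and `5 - 2 - 2` is grouped as `(5 - 2) - 2`.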
29 Ambiguous Grammars
What was the problem with the original grammar?
1 S → Expr
2 Expr → Expr Op Expr
3      | num
4      | id
5 Op → +
6    | -
7    | *
8    | /
First rightmost derivation
Rule  Sentential Form
 -    S
 1    Expr
 2    Expr Op Expr
 4    Expr Op <id,y>
 7    Expr * <id,y>
 2    Expr Op Expr * <id,y>
 3    Expr Op <num,2> * <id,y>
 6    Expr - <num,2> * <id,y>
 4    <id,x> - <num,2> * <id,y>
Second rightmost derivation (different choices)
Rule  Sentential Form
 -    S
 1    Expr
 2    Expr Op Expr
 2    Expr Op Expr Op Expr
 4    Expr Op Expr Op <id,y>
 7    Expr Op Expr * <id,y>
 3    Expr Op <num,2> * <id,y>
 6    Expr - <num,2> * <id,y>
 4    <id,x> - <num,2> * <id,y>
This grammar allows multiple rightmost derivations for x - 2 * y
Equivalently, this grammar generates multiple parse trees for x - 2 * y
The grammar is ambiguous
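The ambiguity can be made concrete by counting parse trees. The sketch below assumes the grammar Expr → Expr Op Expr | num | id over token categories; the function name `count_trees` is hypothetical. A span of tokens derives from Expr either as a single operand or by choosing one operator in the span as the Op of rule 2, so the number of trees is the sum over all such choices.

```python
from functools import lru_cache

OPERATORS = {"+", "-", "*", "/"}


def count_trees(tokens):
    """Count the parse trees that derive tokens from Expr in the ambiguous grammar."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def expr(i, j):
        # Number of ways Expr derives tokens[i:j].
        if j - i == 1:
            return 1 if tokens[i] in ("num", "id") else 0
        total = 0
        for k in range(i + 1, j - 1):  # k is the Op chosen by Expr -> Expr Op Expr
            if tokens[k] in OPERATORS:
                total += expr(i, k) * expr(k + 1, j)
        return total

    return expr(0, len(tokens))
```

`count_trees("id - num * id".split())` is 2, matching the two parse trees shown for x - 2 * y. Any count above 1 for some sentence shows that the grammar is ambiguous.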
30 Ambiguous Grammars
If a grammar has more than one leftmost derivation for some sentence (or sentential form), then the grammar is ambiguous
If a grammar has more than one rightmost derivation for some sentence (or sentential form), then the grammar is ambiguous
If a grammar produces more than one parse tree for some sentence (or sentential form), then it is ambiguous
Classic example: the dangling-else problem
1 Stmt → if Expr then Stmt
2      | if Expr then Stmt else Stmt
       | … other stmts …
31 Ambiguity
The following sentential form has two parse trees:
if Expr1 then if Expr2 then Stmt1 else Stmt2
–Production 2, then production 1: the else belongs to the outer if
 [ if Expr1 then [ if Expr2 then Stmt1 ] else Stmt2 ]
–Production 1, then production 2: the else belongs to the inner if
 [ if Expr1 then [ if Expr2 then Stmt1 else Stmt2 ] ]
32 Ambiguity
Removing the ambiguity
Must rewrite the grammar to avoid generating the problem
Match each else to the innermost unmatched if (common sense rule)
1 Stmt → Matched
2      | Unmatched
3 Matched → if Expr then Matched else Matched
4         | … other kinds of stmts …
5 Unmatched → if Expr then Stmt
6           | if Expr then Matched else Unmatched
With this grammar, the example has only one parse tree
33 Ambiguity
if Expr1 then if Expr2 then Stmt1 else Stmt2
Rule  Sentential Form
 -    Stmt
 2    Unmatched
 5    if Expr then Stmt
 ?    if Expr1 then Stmt
 1    if Expr1 then Matched
 3    if Expr1 then if Expr then Matched else Matched
 ?    if Expr1 then if Expr2 then Matched else Matched
 4    if Expr1 then if Expr2 then Stmt1 else Matched
 4    if Expr1 then if Expr2 then Stmt1 else Stmt2
This binds the else to the inner if
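The same binding can be obtained by a parser that lets the innermost open if claim a following else. The sketch below is a greedy recursive-descent routine, which is a different technique from the Matched/Unmatched grammar but yields the same tree for this example. The function name `parse_stmt`, the tuple encoding of trees, and the treatment of every token other than if/then/else as an atomic expression or statement are assumptions; the input is assumed to be complete (no truncated statements).

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement starting at tokens[pos]; return (tree, next_pos)."""
    if tokens[pos] != "if":
        return tokens[pos], pos + 1  # an atomic statement such as S1
    cond = tokens[pos + 1]
    if tokens[pos + 2] != "then":
        raise SyntaxError("expected 'then'")
    then_branch, pos = parse_stmt(tokens, pos + 3)
    # The innermost 'if' is the one currently being parsed, so it takes the else.
    if pos < len(tokens) and tokens[pos] == "else":
        else_branch, pos = parse_stmt(tokens, pos + 1)
        return ("if", cond, then_branch, else_branch), pos
    return ("if", cond, then_branch), pos


TREE, END = parse_stmt("if E1 then if E2 then S1 else S2".split())
```

Here `TREE` is `("if", "E1", ("if", "E2", "S1", "S2"))`: the outer if has no else branch, and the inner if owns the else.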
34 Ambiguity
Theoretical results:
It is undecidable whether an arbitrary CFG is ambiguous
There exist CFLs for which every CFG is ambiguous. These are called inherently ambiguous CFLs.
–Example: { 0^i 1^j 2^k | i = j or j = k }
35 Ambiguity Ambiguity usually refers to confusion in the CFG Overloading can create deeper ambiguity a = f(17) In many Algol-like languages, f could be either a function or a subscripted variable Disambiguating this one requires context Need values of declarations Really an issue of type, not context-free syntax Requires an extra-grammatical solution (not in CFG ) Must handle these with a different mechanism –Step outside grammar rather than use a more complex grammar
36 Ambiguity Ambiguity can arise from two distinct sources Confusion in the context-free syntax Confusion that requires context to resolve Resolving ambiguity To remove context-free ambiguity, rewrite the grammar To handle context-sensitive ambiguity takes cooperation –Knowledge of declarations, types, … –Accept a superset of the input language then check it with other means (type checking, context-sensitive analysis) –This is a language design problem Sometimes, the compiler writer accepts an ambiguous grammar –Parsing algorithms can be hacked so that they “do the right thing”