1 CMPSC 160 Translation of Programming Languages Fall 2002 slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon Lecture-Module #5 Introduction.


1 CMPSC 160 Translation of Programming Languages Fall 2002 slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon Lecture-Module #5 Introduction to Parsing

2 First Phase: Lexical Analysis (Scanning)
Scanner
- Maps a stream of characters into tokens
  – Basic unit of syntax
- The characters that form a word are its lexeme
- Its syntactic category is called its token
- Scanner discards white space and comments
- Scanner works as a subroutine of the parser
[Figure: source code → Scanner → Parser → IR. The parser asks the scanner to "get next token" and receives a token; errors are reported.]

3 Lexical Analysis
- Specify tokens using regular expressions
- Translate the regular expressions to finite automata
- Use the finite automata to generate tables or code for the scanner
[Figure: specifications (regular expressions) → Scanner Generator → tables or code → Scanner; source code → Scanner → tokens.]

4 Automating Scanner Construction
To build a scanner:
1. Write down the RE that specifies the tokens
2. Translate the RE to an NFA
3. Build the DFA that simulates the NFA
4. Minimize the DFA
5. Turn it into code or a table
Scanner generators
- Lex, Flex, and JLex work along these lines
- Algorithms are well known and well understood
- Interface to the parser is important

5 Automating Scanner Construction
RE → NFA (Thompson's construction)
- Build an NFA for each term
- Combine them with ε-moves
NFA → DFA (subset construction)
- Build the simulation
DFA → Minimal DFA
- Hopcroft's algorithm
DFA → RE
- All pairs, all paths problem
- Union together paths from s0 to a final state
[Figure: The Cycle of Constructions: RE → NFA → DFA → minimal DFA, with DFA → RE closing the cycle.]
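The NFA → DFA step can be sketched in a few lines. This is a minimal illustration of the subset construction, not the code any scanner generator actually uses. The NFA below (for (a|b)*abb), the dictionary encoding, and the names `NFA`, `eps_closure`, `subset_construction`, and `accepts` are all assumptions made for the example.

```python
from collections import deque

EPS = ""  # label chosen here to mark epsilon-moves

# Assumed example NFA for (a|b)*abb: state -> {symbol: set of next states}
NFA = {
    0: {EPS: {1}},
    1: {"a": {1, 2}, "b": {1}},
    2: {"b": {3}},
    3: {"b": {4}},
    4: {},
}
START, ACCEPTING = 0, {4}

def eps_closure(states):
    """All NFA states reachable from `states` using only epsilon-moves."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA[s].get(EPS, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction(alphabet):
    """Build a DFA whose states are sets of NFA states."""
    start = eps_closure({START})
    dfa, work = {}, deque([start])
    while work:
        current = work.popleft()
        if current in dfa:
            continue
        dfa[current] = {}
        for sym in alphabet:
            moved = set()
            for s in current:
                moved |= NFA[s].get(sym, set())
            if moved:
                target = eps_closure(moved)
                dfa[current][sym] = target
                work.append(target)
    return start, dfa

def accepts(word, start, dfa):
    """Run the DFA; a missing transition plays the role of the error state."""
    state = start
    for ch in word:
        if ch not in dfa[state]:
            return False
        state = dfa[state][ch]
    return bool(state & ACCEPTING)
```

The resulting DFA is not yet minimal; a separate minimization pass (for example Hopcroft's algorithm, as named on the slide) would merge equivalent states.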

6 Scanner Generators: JLex, Lex, FLex
Layout of a JLex specification (sections are separated by %%):
  user code                   directly copied to the output file
  %%
  JLex directives             macro (regular) definitions (e.g., digits = [0-9]+) and state names
  %%
  regular expression rules    each rule: optional state list, regular expression, action
- States can be mixed with regular expressions
- For each regular expression we can define a set of states where it is valid (JLex, Flex)
- Typical format of regular expression rules: regular_expression { actions }

7 JLex, FLex, Lex
Regular expression rules:
  r_1 { action_1 }
  r_2 { action_2 }
  ...
  r_n { action_n }
The actions are Java code for JLex, C code for FLex and Lex.
[Figure: a new start state s0 has an ε-move to each automaton A r_1, A r_2, ..., A r_n, where A r_i is the automaton for regular expression r_i. The figure also marks the new final states and an error state.]
Rules used by scanner generators
1) Continue scanning the input until reaching an error state
2) Accept the longest prefix that matches a regular expression and execute the corresponding action
3) If two patterns match the longest prefix, the action that is specified earlier is executed
4) After a match, go back to the end of the accepted prefix in the input and start scanning for the next token
For faster scanning, convert this NFA to a DFA and minimize the states
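The longest-match and earliest-rule policies can be sketched with Python's `re` module. This is an illustration of the matching rules only, not how Lex, Flex, or JLex are implemented (they compile the rules into a single DFA). The rule list and the names `RULES` and `scan` are assumptions for the example.

```python
import re

# Hypothetical rule list, in priority order: (token name, regular expression).
RULES = [
    ("IF", r"if"),
    ("ID", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("NUM", r"[0-9]+"),
    ("OP", r"[-+*/]"),
    ("WS", r"[ \t\n]+"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]

def scan(text):
    """Return (token, lexeme) pairs: longest match wins, earliest rule on ties."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, regex in COMPILED:
            m = regex.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the rule listed earlier is kept.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # white space is discarded
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end  # resume right after the accepted prefix
    return tokens
```

With these rules, `scan("if ifx")` classifies `if` as the keyword (tie, earlier rule) and `ifx` as an identifier (longer match).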

8 Limits of Regular Languages
Advantages of regular expressions
- Simple and powerful notation for specifying patterns
- Automatic construction of fast recognizers
- Many kinds of syntax can be specified with REs
If REs are so useful, why not use them for everything?
Example: an expression grammar
  Id   → [a-zA-Z] ([a-zA-Z] | [0-9])*
  Num  → [0-9]+
  Term → Id | Num
  Op   → "+" | "-" | "*" | "/"
  Expr → ( Term Op )* Term

9 Limits of Regular Languages
If we add balanced parentheses to the expression grammar, we cannot represent it using regular expressions:
- A DFA of size n cannot recognize balanced parentheses with nesting depth greater than n
- Not all languages are regular: RLs ⊂ CFLs ⊂ CSLs
Solution: use a more powerful formalism, context-free grammars
  Id   → [a-zA-Z] ([a-zA-Z] | [0-9])*
  Num  → [0-9]+
  Term → Id | Num
  Op   → "+" | "-" | "*" | "/"
  Expr → Term | Expr Op Expr | "(" Expr ")"
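The need for unbounded memory can be seen in a small recognizer for balanced parentheses. The sketch below keeps a counter that can grow without limit, which is exactly what a finite automaton with a fixed number of states cannot do. The function name `balanced` is chosen for the example.

```python
def balanced(text):
    """Return True if the parentheses in `text` are balanced."""
    depth = 0  # unbounded counter: the memory a DFA does not have
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ")" with no matching "("
                return False
    return depth == 0
```

A DFA would need a distinct state for every possible value of `depth`, so any fixed number of states fails once the nesting is deep enough.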

10 The Front End: Parser
Parser
- Input: a sequence of tokens representing the source program
- Output: a parse tree (in practice, an abstract syntax tree)
- While generating the parse tree, the parser checks the stream of tokens for grammatical correctness
  – Checks the context-free syntax
- Parser builds an IR representation of the code
  – Generates an abstract syntax tree
- Guides checking at deeper levels than syntax
[Figure: source code → Scanner → Parser → IR → Type Checker → IR. The parser asks the scanner to "get next token" and receives a token; errors are reported.]

11 The Study of Parsing
- Need a mathematical model of syntax: a grammar G
  – Context-free grammars
- Need an algorithm for testing membership in L(G)
  – Parsing algorithms
- Parsing is the process of discovering a derivation for some sentence from the rules of the grammar
  – Equivalently, it is the process of discovering a parse tree
- Natural language analogy
  – Lexical rules correspond to rules that define the valid words
  – Grammar rules correspond to rules that define valid sentences

12 An Example Grammar
  1 Start → Expr
  2 Expr  → Expr Op Expr
  3       | num
  4       | id
  5 Op    → +
  6       | -
  7       | *
  8       | /
Start symbol: S = Start
Nonterminal symbols: N = { Start, Expr, Op }
Terminal symbols: T = { num, id, +, -, *, / }
Productions: P = { 1, 2, 3, 4, 5, 6, 7, 8 } (shown above)

13 Specifying Syntax with a Grammar
Context-free syntax is specified with a context-free grammar
Formally, a grammar is a four-tuple, G = (S, N, T, P)
- T is a set of terminal symbols
  – These correspond to tokens returned by the scanner
  – For the parser, tokens are indivisible units of syntax
- N is a set of nonterminal symbols
  – These are syntactic variables that can be substituted during a derivation
  – Variables that denote sets of substrings occurring in the language
- S is the start symbol: S ∈ N
  – All the strings in L(G) are derived from the start symbol
- P is a set of productions or rewrite rules: P : N → (N ∪ T)*
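As a concrete illustration, the four-tuple can be written down directly as data. The sketch below encodes the example expression grammar. The class name `Grammar`, its field names, and the `is_well_formed` check are assumptions made for this example, not part of the slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grammar:
    start: str               # S
    nonterminals: frozenset  # N
    terminals: frozenset     # T
    productions: tuple       # P, as (lhs, rhs) pairs; rhs is a tuple of symbols

    def is_well_formed(self):
        """Check S in N, N and T disjoint, and every production in N -> (N u T)*."""
        symbols = self.nonterminals | self.terminals
        return (
            self.start in self.nonterminals
            and not (self.nonterminals & self.terminals)
            and all(
                lhs in self.nonterminals and all(s in symbols for s in rhs)
                for lhs, rhs in self.productions
            )
        )

# The example expression grammar, productions 1 to 8 in order.
EXPR = Grammar(
    start="Start",
    nonterminals=frozenset({"Start", "Expr", "Op"}),
    terminals=frozenset({"num", "id", "+", "-", "*", "/"}),
    productions=(
        ("Start", ("Expr",)),
        ("Expr", ("Expr", "Op", "Expr")),
        ("Expr", ("num",)),
        ("Expr", ("id",)),
        ("Op", ("+",)),
        ("Op", ("-",)),
        ("Op", ("*",)),
        ("Op", ("/",)),
    ),
)
```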

14 Production Rules
Restrictions on the production rules determine the expressive power
Regular grammars: productions are either left-linear or right-linear
  – Right-linear: productions are of the form A → wB or A → w, where A, B are nonterminals and w is a string of terminals
  – Left-linear: productions are of the form A → Bw or A → w, where A, B are nonterminals and w is a string of terminals
  – Regular grammars generate exactly the regular sets
  – One can automatically construct a regular grammar from an NFA that accepts the same language (and vice versa)
Context-free grammars: productions are of the form A → α, where A is a nonterminal symbol and α is a string of terminal and nonterminal symbols
Context-sensitive grammars: productions are of the form α → β, where α and β are arbitrary strings of terminal and nonterminal symbols with α ≠ ε and |α| ≤ |β|
Unrestricted grammars: productions are of the form α → β, where α and β are arbitrary strings of terminal and nonterminal symbols with α ≠ ε
  – Unrestricted grammars are as powerful as Turing machines

15 An NFA can be translated to a Regular Grammar
- For each state i of the NFA, create a nonterminal symbol A_i
- If state i has a transition to state j on symbol a, introduce the production A_i → a A_j
- If state i goes to state j on ε, introduce the production A_i → A_j
- If i is an accepting state, introduce A_i → ε
- If i is the start state, make A_i the start symbol of the grammar
[Figure: NFA with states S0 to S4. S0 goes to S1 on ε; S1 loops to itself on a and on b; S1 goes to S2 on a; S2 goes to S3 on b; S3 goes to S4 on b; S4 is accepting.]
  1 A_0 → A_1
  2 A_1 → a A_1
  3     | b A_1
  4     | a A_2
  5 A_2 → b A_3
  6 A_3 → b A_4
  7 A_4 → ε
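The translation rules above are mechanical enough to sketch as code. The encoding is an assumption for this example: transitions are (i, symbol, j) triples with `None` standing for ε, nonterminals are named `A0`, `A1`, and so on, and an empty right-hand side stands for ε. The function name `nfa_to_grammar` is also made up here.

```python
def nfa_to_grammar(transitions, start, accepting):
    """Translate an NFA into a right-linear grammar.

    transitions: list of (i, symbol, j); symbol None marks an epsilon-move.
    Returns (start symbol, list of (lhs, rhs) productions).
    """
    productions = []
    for i, symbol, j in transitions:
        if symbol is None:
            productions.append((f"A{i}", (f"A{j}",)))          # A_i -> A_j
        else:
            productions.append((f"A{i}", (symbol, f"A{j}")))   # A_i -> a A_j
    for i in sorted(accepting):
        productions.append((f"A{i}", ()))                      # A_i -> epsilon
    return f"A{start}", productions

# The NFA from the slide, which accepts (a|b)*abb.
SLIDE_NFA = [
    (0, None, 1),
    (1, "a", 1),
    (1, "b", 1),
    (1, "a", 2),
    (2, "b", 3),
    (3, "b", 4),
]
START_SYMBOL, PRODUCTIONS = nfa_to_grammar(SLIDE_NFA, start=0, accepting={4})
```

Applied to the slide's NFA, this yields the same seven productions listed on the slide, in the same order.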

16 Derivations
An example grammar
  1 S    → Expr
  2 Expr → Expr Op Expr
  3      | num
  4      | id
  5 Op   → +
  6      | -
  7      | *
  8      | /
An example derivation for x - 2 * y
  Rule  Sentential Form
  -     S
  1     Expr
  2     Expr Op Expr
  4     id Op Expr
  6     id - Expr
  2     id - Expr Op Expr
  3     id - num Op Expr
  7     id - num * Expr
  4     id - num * id
- Such a sequence of rewrites is called a derivation
- The process of discovering a derivation is called parsing
- We denote this as: S ⇒* id - num * id
- A ⇒ B means A derives B after applying one production
- A ⇒* B means A derives B after applying zero or more productions

17 Sentences and Sentential Forms
Given a grammar G with a start symbol S
- A string of terminal symbols that can be derived from S by applying the productions is called a sentence of the grammar
  – These strings are the members of the set L(G), the language defined by the grammar
- A string of terminal and nonterminal symbols that can be derived from S by applying the productions of the grammar is called a sentential form of the grammar
  – Each step of a derivation forms a sentential form
  – Sentences are sentential forms with no nonterminal symbols

18 Derivations
At each step, we make two choices
1. Choose a nonterminal to replace
2. Choose a production to apply
Different choices lead to different derivations
Two types of derivation are of interest
- Leftmost derivation: replace the leftmost nonterminal at each step
- Rightmost derivation: replace the rightmost nonterminal at each step
These are the two systematic derivations (the first choice is fixed)
The example on the earlier slide was a leftmost derivation
Of course, there is a rightmost derivation (next slide)

19 Two Derivations for x - 2 * y
Leftmost derivation
  Rule  Sentential Form
  -     S
  1     Expr
  2     Expr Op Expr
  4     id Op Expr
  6     id - Expr
  2     id - Expr Op Expr
  3     id - num Op Expr
  7     id - num * Expr
  4     id - num * id
Rightmost derivation
  Rule  Sentential Form
  -     S
  1     Expr
  2     Expr Op Expr
  4     Expr Op id
  7     Expr * id
  2     Expr Op Expr * id
  3     Expr Op num * id
  6     Expr - num * id
  4     id - num * id
In both cases, S ⇒* id - num * id
Note that these two derivations produce different parse trees
The parse trees imply different evaluation orders!
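The two derivations can be replayed mechanically from their rule columns. The sketch below applies a list of rule numbers to the sentential form, always rewriting the leftmost (or rightmost) nonterminal. The names `RULES`, `NONTERMINALS`, and `derive` are chosen for this example.

```python
# The example grammar, keyed by rule number: (left-hand side, right-hand side).
RULES = {
    1: ("S", ["Expr"]),
    2: ("Expr", ["Expr", "Op", "Expr"]),
    3: ("Expr", ["num"]),
    4: ("Expr", ["id"]),
    5: ("Op", ["+"]),
    6: ("Op", ["-"]),
    7: ("Op", ["*"]),
    8: ("Op", ["/"]),
}
NONTERMINALS = {"S", "Expr", "Op"}

def derive(rule_numbers, leftmost=True):
    """Return every sentential form produced by applying the rules in order."""
    form = ["S"]
    steps = ["S"]
    for number in rule_numbers:
        lhs, rhs = RULES[number]
        positions = [i for i, sym in enumerate(form) if sym in NONTERMINALS]
        if not positions:
            raise ValueError("no nonterminal left to rewrite")
        i = positions[0] if leftmost else positions[-1]
        if form[i] != lhs:
            raise ValueError(f"rule {number} does not apply to {form[i]}")
        form[i:i + 1] = rhs  # replace the chosen nonterminal in place
        steps.append(" ".join(form))
    return steps

LEFTMOST = derive([1, 2, 4, 6, 2, 3, 7, 4])
RIGHTMOST = derive([1, 2, 4, 7, 2, 3, 6, 4], leftmost=False)
```

Both lists end in the sentence `id - num * id`, but the intermediate sentential forms differ.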

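20 Derivations and Parse Trees
Leftmost derivation
  Rule  Sentential Form
  -     S
  1     Expr
  2     Expr Op Expr
  4     id Op Expr
  6     id - Expr
  2     id - Expr Op Expr
  3     id - num Op Expr
  7     id - num * Expr
  4     id - num * id
[Figure: parse tree. S has the child Expr, which expands to Expr Op Expr with Op = -. The left Expr is x; the right Expr expands to Expr Op Expr for 2 * y.]
This evaluates as x - ( 2 * y )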

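21 Derivations and Parse Trees
Rightmost derivation
  Rule  Sentential Form
  -     S
  1     Expr
  2     Expr Op Expr
  4     Expr Op id
  7     Expr * id
  2     Expr Op Expr * id
  3     Expr Op num * id
  6     Expr - num * id
  4     id - num * id
[Figure: parse tree. S has the child Expr, which expands to Expr Op Expr with Op = *. The left Expr expands to Expr Op Expr for x - 2; the right Expr is y.]
This evaluates as ( x - 2 ) * y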

22 Another Rightmost Derivation
  Rule  Sentential Form
  -     S
  1     Expr
  2     Expr Op Expr
  2     Expr Op Expr Op Expr
  4     Expr Op Expr Op id
  7     Expr Op Expr * id
  3     Expr Op num * id
  6     Expr - num * id
  4     id - num * id
[Figure: parse tree. S has the child Expr, which expands to Expr Op Expr with Op = -. The left Expr is x; the right Expr expands to Expr Op Expr for 2 * y.]
This evaluates as x - ( 2 * y )
This parse tree is different from the parse tree for the previous rightmost derivation, but it is the same as the parse tree for the earlier leftmost derivation

23 Derivation and Parse Trees
- A parse tree does not show the order in which the productions were applied; it ignores the variations in the order
- Each parse tree has a corresponding unique leftmost derivation
- Each parse tree has a corresponding unique rightmost derivation

24 Parse Trees and Precedence
These two parse trees point out a problem with the expression grammar:
- It has no notion of precedence (implied order of evaluation between different operators)
To add precedence
- Create a nonterminal for each level of precedence
- Isolate the corresponding part of the grammar
- Force the parser to recognize high-precedence subexpressions first
For algebraic expressions
- Multiplication and division, first
- Subtraction and addition, next

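25 Another Problem: Parse Trees and Associativity
[Figure: two parse trees for an expression with two subtractions. In one tree the first subtraction is grouped (left-associative) and the result is 1; in the other the second subtraction is grouped (right-associative) and the result is 5.]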
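The effect of the two groupings is easy to check numerically. The operands did not survive in this transcript, so the sketch below assumes the expression 5 - 2 - 2, which is consistent with the two results shown (1 and 5). The function names are made up for the example.

```python
from functools import reduce

def eval_left(operands):
    """Left-associative subtraction: ((a - b) - c)."""
    return reduce(lambda acc, x: acc - x, operands)

def eval_right(operands):
    """Right-associative subtraction: (a - (b - c))."""
    result = operands[-1]
    for x in reversed(operands[:-1]):
        result = x - result
    return result

OPERANDS = [5, 2, 2]  # assumed operands, chosen to match the slide's results
LEFT_RESULT = eval_left(OPERANDS)    # (5 - 2) - 2
RIGHT_RESULT = eval_right(OPERANDS)  # 5 - (2 - 2)
```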

26 Precedence and Associativity
Adding the standard algebraic precedence and using left recursion produces:
  1 S      → Expr
  2 Expr   → Expr + Term
  3        | Expr - Term
  4        | Term
  5 Term   → Term * Factor
  6        | Term / Factor
  7        | Factor
  8 Factor → num
  9        | id
This grammar is slightly larger
- Takes more rewriting to reach some of the terminal symbols
- Encodes expected precedence
- Enforces left-associativity
- Produces the same parse tree under leftmost and rightmost derivations
Let's see how it parses our example
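One way to see what this grammar encodes is a small hand-written parser with one function per nonterminal. This is a sketch under assumptions: a top-down parser cannot use the left-recursive rules directly, so each left recursion is replaced by a loop that builds the same left-leaning tree. The tuple tree format and the names `parse`, `expr`, `term`, and `factor` are choices made for the example.

```python
import re

OPERATORS = ("+", "-", "*", "/")

def parse(text):
    """Parse an expression into nested (op, left, right) tuples."""
    # Characters that match no token are silently skipped in this sketch.
    tokens = re.findall(r"\d+|[a-zA-Z]\w*|[-+*/]", text)
    pos = 0

    def factor():  # Factor -> num | id
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] in OPERATORS:
            raise SyntaxError("expected num or id")
        tok = tokens[pos]
        pos += 1
        return tok

    def term():  # Term -> Term * Factor | Term / Factor | Factor
        nonlocal pos
        node = factor()
        while pos < len(tokens) and tokens[pos] in ("*", "/"):
            op = tokens[pos]
            pos += 1
            node = (op, node, factor())  # left operand is the tree so far
        return node

    def expr():  # Expr -> Expr + Term | Expr - Term | Term
        nonlocal pos
        node = term()
        while pos < len(tokens) and tokens[pos] in ("+", "-"):
            op = tokens[pos]
            pos += 1
            node = (op, node, term())
        return node

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree
```

For `x - 2 * y` the multiplication ends up below the subtraction, and for a chain of subtractions the tree leans left, which matches the precedence and associativity the grammar is meant to enforce.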

27 Precedence
The leftmost derivation
  Rule  Sentential Form
  -     S
  1     Expr
  3     Expr - Term
  4     Term - Term
  7     Factor - Term
  9     id - Term
  5     id - Term * Factor
  7     id - Factor * Factor
  8     id - num * Factor
  9     id - num * id
[Figure: its parse tree. The root Expr expands to Expr - Term; the left Expr reaches x through Term and Factor, and the right Term expands to Term * Factor for 2 * y.]
This produces x - ( 2 * y ), along with an appropriate parse tree.
Both the leftmost and rightmost derivations give the same parse tree and the same evaluation order, because the grammar directly encodes the desired precedence.

28 Associativity
The rightmost derivation
  Rule  Sentential Form
  -     S
  1     Expr
  3     Expr - Term
  7     Expr - Factor
  8     Expr - num
  3     Expr - Term - num
  7     Expr - Factor - num
  8     Expr - num - num
  4     Term - num - num
  7     Factor - num - num
  8     num - num - num
[Figure: its parse tree. The root Expr expands to Expr - Term; the left Expr expands again to Expr - Term, so the first subtraction is grouped.]
This produces ( … ) - 2, along with an appropriate parse tree.
Both the leftmost and rightmost derivations give the same parse tree and the same evaluation order

29 Ambiguous Grammars
  1 S    → Expr
  2 Expr → Expr Op Expr
  3      | num
  4      | id
  5 Op   → +
  6      | -
  7      | *
  8      | /
What was the problem with the original grammar?
- This grammar allows multiple rightmost derivations for x - 2 * y
- Equivalently, this grammar generates multiple parse trees for x - 2 * y
- The grammar is ambiguous
First rightmost derivation
  Rule  Sentential Form
  -     S
  1     Expr
  2     Expr Op Expr
  4     Expr Op id
  7     Expr * id
  2     Expr Op Expr * id
  3     Expr Op num * id
  6     Expr - num * id
  4     id - num * id
Second rightmost derivation
  Rule  Sentential Form
  -     S
  1     Expr
  2     Expr Op Expr
  2     Expr Op Expr Op Expr
  4     Expr Op Expr Op id
  7     Expr Op Expr * id
  3     Expr Op num * id
  6     Expr - num * id
  4     id - num * id
The two derivations make different choices at the step after the first use of rule 2 (rule 4 versus rule 2)

30 Ambiguous Grammars
- If a grammar has more than one leftmost derivation for some sentence (or sentential form), then the grammar is ambiguous
- If a grammar has more than one rightmost derivation for some sentence (or sentential form), then the grammar is ambiguous
- If a grammar produces more than one parse tree for some sentence (or sentential form), then it is ambiguous
Classic example: the dangling-else problem
  1 Stmt → if Expr then Stmt
  2      | if Expr then Stmt else Stmt
         | … other stmts …
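For the earlier expression grammar (Expr → Expr Op Expr | num | id), ambiguity can be demonstrated by counting parse trees. The sketch below is specific to that grammar and is not a general ambiguity test; it counts how many ways a token sequence can be split at an operator into two sub-expressions. The name `count_parse_trees` is made up for the example.

```python
from functools import lru_cache

OPERANDS = ("num", "id")
OPERATORS = ("+", "-", "*", "/")

def count_parse_trees(tokens):
    """Count parse trees for `tokens` under Expr -> Expr Op Expr | num | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of ways to derive tokens[i:j] from Expr.
        if j - i == 1:
            return 1 if tokens[i] in OPERANDS else 0
        total = 0
        for k in range(i + 1, j - 1):  # k is the position of the top-level Op
            if tokens[k] in OPERATORS:
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))

TREES_FOR_EXAMPLE = count_parse_trees(["id", "-", "num", "*", "id"])
```

Any sentence with a count above 1 is a witness that the grammar is ambiguous; for `id - num * id` the count is 2, matching the two parse trees seen earlier.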

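31 Ambiguity
The following sentential form has two parse trees:
  if Expr1 then if Expr2 then Stmt1 else Stmt2
[Figure: two parse trees.
 Production 2, then production 1: the outer Stmt is "if Expr1 then Stmt else Stmt2" and the inner Stmt is "if Expr2 then Stmt1", so the else belongs to the outer if.
 Production 1, then production 2: the outer Stmt is "if Expr1 then Stmt" and the inner Stmt is "if Expr2 then Stmt1 else Stmt2", so the else belongs to the inner if.]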

32 Ambiguity
Removing the ambiguity
- Must rewrite the grammar to avoid generating the problem
- Match each else to the innermost unmatched if (common sense rule)
  1 Stmt      → Matched
  2           | Unmatched
  3 Matched   → if Expr then Matched else Matched
  4           | … other kinds of stmts …
  5 Unmatched → if Expr then Stmt
  6           | if Expr then Matched else Unmatched
With this grammar, the example has only one parse tree

33 Ambiguity
  if Expr1 then if Expr2 then Stmt1 else Stmt2
  Rule  Sentential Form
  -     Stmt
  2     Unmatched
  5     if Expr then Stmt
  ?     if Expr1 then Stmt
  1     if Expr1 then Matched
  3     if Expr1 then if Expr then Matched else Matched
  ?     if Expr1 then if Expr2 then Matched else Matched
  4     if Expr1 then if Expr2 then Stmt1 else Matched
  4     if Expr1 then if Expr2 then Stmt1 else Stmt2
This binds the else to the inner if
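The same binding can be obtained in a hand-written parser. Note that this is a different technique from the Matched/Unmatched grammar: the sketch below is a recursive-descent routine that greedily attaches an else to the most recent if still waiting for one. Expressions and simple statements are treated as single tokens, and the tuple tree format and the name `parse_stmt` are assumptions for the example.

```python
def parse_stmt(tokens, pos=0):
    """Return (tree, next position); an else binds to the nearest unmatched if."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] == "if":
        if pos + 2 >= len(tokens) or tokens[pos + 2] != "then":
            raise SyntaxError("expected 'then'")
        cond = tokens[pos + 1]
        then_branch, pos = parse_stmt(tokens, pos + 3)
        # Greedy choice: take the else here, at the innermost open if.
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = parse_stmt(tokens, pos + 1)
            return ("if", cond, then_branch, else_branch), pos
        return ("if", cond, then_branch), pos
    return tokens[pos], pos + 1  # any other token is a simple statement

TOKENS = "if E1 then if E2 then S1 else S2".split()
TREE, END = parse_stmt(TOKENS)
```

The resulting tree nests the complete if-then-else inside the outer if-then, which is the parse the rewritten grammar allows.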

34 Ambiguity
Theoretical results:
- It is undecidable whether an arbitrary CFG is ambiguous
- There exist CFLs for which every CFG is ambiguous. These are called inherently ambiguous CFLs.
  – Example: { 0^i 1^j 2^k | i = j or j = k }

35 Ambiguity
Ambiguity usually refers to confusion in the CFG
Overloading can create deeper ambiguity
  a = f(17)
In many Algol-like languages, f could be either a function or a subscripted variable
Disambiguating this one requires context
- Need values of declarations
- Really an issue of type, not context-free syntax
- Requires an extra-grammatical solution (not in the CFG)
- Must handle these with a different mechanism
  – Step outside the grammar rather than use a more complex grammar
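A minimal sketch of such an extra-grammatical mechanism is a lookup in the declarations collected so far. The symbol-table format (a dictionary from names to kinds) and the function name `classify_reference` are assumptions made for this example; a real compiler would consult its own symbol table during context-sensitive analysis.

```python
def classify_reference(name, symbols):
    """Decide what `name(...)` means using declarations, not the grammar."""
    kind = symbols.get(name)
    if kind == "function":
        return "function call"
    if kind == "array":
        return "array subscript"
    raise LookupError(f"{name!r} is not declared as a function or an array")

# The same source text, a = f(17), under two hypothetical sets of declarations.
AS_FUNCTION = classify_reference("f", {"f": "function", "a": "scalar"})
AS_ARRAY = classify_reference("f", {"f": "array", "a": "scalar"})
```

The parser accepts both programs with the same context-free rule; only the declarations tell the two meanings apart.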

36 Ambiguity
Ambiguity can arise from two distinct sources
- Confusion in the context-free syntax
- Confusion that requires context to resolve
Resolving ambiguity
- To remove context-free ambiguity, rewrite the grammar
- To handle context-sensitive ambiguity takes cooperation
  – Knowledge of declarations, types, …
  – Accept a superset of the input language, then check it with other means (type checking, context-sensitive analysis)
  – This is a language design problem
Sometimes, the compiler writer accepts an ambiguous grammar
  – Parsing algorithms can be hacked so that they "do the right thing"