CS 363 – Chapter 2 The first 2 phases of compilation Scanning Parsing

Presentation transcript:

CS 363 – Chapter 2
The first 2 phases of compilation
Scanning: need to define tokens; making a scanner
Parsing: need to define the language; making a parser

Defining tokens
Kinds of tokens: keywords, identifiers; operators, punctuation symbols; constants, string literals.
It is not practical to enumerate all possible tokens, so we use a “regular expression” as shorthand notation.

Regular expression
Used to define what tokens look like.
A reg.expr. is either a single character or the empty string, or it is built from other reg.expr. by concatenating or by using the “|” or “*” operators.
Stuff enclosed in [ ] is optional.
Ex. Define a number:
    number → [ – ] digit (digit)*
    digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
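The definition of number carries over almost symbol for symbol to a regular-expression library. A minimal sketch using Python's re module, assuming the minus sign is typed as the ASCII hyphen; the names NUMBER and is_number are made up for this illustration.

```python
import re

# number -> [ - ] digit (digit)*    with    digit -> 0 | 1 | ... | 9
# Assumption: the minus sign is the ASCII hyphen "-".
NUMBER = re.compile(r"-?[0-9][0-9]*")

def is_number(text):
    """Return True if the whole string matches the 'number' definition."""
    return NUMBER.fullmatch(text) is not None

print(is_number("42"))    # True
print(is_number("-7"))    # True
print(is_number("3.14"))  # False: decimals are not covered by this definition
print(is_number("-"))     # False: at least one digit is required
```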

Let’s expand our definition of “number”
    number → [ – ] digit (digit)*
    digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
How would we handle decimals? Scientific notation? Octal and hexadecimal integers?

Let’s write a regular expression for identifier.
Identifiers start with a letter. Each subsequent character is a letter, digit, or underscore.
    letter → a | b | c | d | … | A | B | C | … | Z
    digit → 0 | 1 | 2 | … | 9
    id → ________________

Ready to scan
Once you know the tokens, you can write the scanner. It is a straightforward task. (Skip section 2.2.1 on theory.)
Read the input one character at a time. Create a token as you finish reading it.
Ex. “int dog= 4;” has 5 tokens (what kinds?)
“dog” is a token, but not “do” or “dog=”: the scanner keeps reading until the next character can no longer extend the current token.
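As a rough illustration of the scanning step, here is a small sketch in Python. It leans on regular-expression matching instead of a hand-written character loop, and the token kinds and keyword list are assumptions chosen only for this example.

```python
import re

# Token kinds are assumptions for illustration. Order matters: keywords are
# tried before identifiers, and each pattern matches as far as it can.
TOKEN_SPEC = [
    ("keyword", r"\b(?:int|double)\b"),
    ("id", r"[A-Za-z][A-Za-z0-9_]*"),
    ("number", r"[0-9]+"),
    ("operator", r"="),
    ("punctuation", r";"),
    ("skip", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC))

def scan(source):
    """Split source into (kind, text) tokens, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

for token in scan("int dog= 4;"):
    print(token)
# ('keyword', 'int')
# ('id', 'dog')
# ('operator', '=')
# ('number', '4')
# ('punctuation', ';')
```

The identifier pattern stops at “=”, which is why “dog” comes out as one token and “dog=” does not.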

Scanner lookahead
In some languages the scanner needs to look ahead to see when a token ends.
In Pascal, “3.” could be followed by digits to continue the number, or by another “.” to create the “..” token.
In Fortran, spaces are ignored (BAD): “do 5 i…” could be an identifier or a do loop depending on context.
Incidentally, Fortran limits identifiers to 6 characters (another bad rule).

Defining a language
A PL is the set of valid programs (strings) in that language.
We need a recursive definition: the set is theoretically infinite, and programs are hierarchical/nested.
Reg.expr. can’t handle nested definitions.
Usually we use a grammar to define a complex set of strings. The general type of grammar is a CFG (context-free grammar), usually written in BNF notation.

Grammars
Assume that “number” and “id” are already defined.
Ex. Define an expression:
    expr → number | id | expr op expr
    op → + | – | * | /
Do you see the base case & recursion? What strings are generated by this definition?
Can use a tree – a parse tree!

Let’s find derivation of 2 + x * 3
    expr → number | id | expr op expr
    op → + | – | * | /
1. Begin with start symbol.
    expr → ?
    expr → expr op expr
2. Now, try to resolve each nonterminal on right side.
    expr → 2 op expr
    expr → 2 + expr
    expr → 2 + expr op expr
    expr → 2 + x * 3
3. We can summarize these steps with a parse tree.

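Problem: Ambiguity!
The same string has two different derivations, and so two different parse trees.
Derivation 1 (the top-level operator is +, so the tree groups as 2 + (x * 3)):
    expr → expr op expr
    expr → 2 op expr
    expr → 2 + x * 3
Derivation 2 (the top-level operator is *, so the tree groups as (2 + x) * 3):
    expr → expr op expr
    expr → expr op 3
    expr → expr * 3
    expr → expr op expr * 3
    expr → 2 + x * 3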
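One way to see the ambiguity concretely is to count parse trees. The sketch below is only an illustration: the function name is invented, and it assumes every token that is not one of the four operators counts as a number or id.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count parse trees of tokens under expr -> number | id | expr op expr."""
    tokens = tuple(tokens)
    ops = {"+", "-", "*", "/"}

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of ways tokens[lo:hi] can be derived from expr.
        if hi - lo == 1:
            # Assumption: any single non-operator token is a number or id.
            return 0 if tokens[lo] in ops else 1
        total = 0
        # Try every operator position as the top-level "op" of expr op expr.
        for k in range(lo + 1, hi - 1):
            if tokens[k] in ops:
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(tokens))

print(count_parse_trees(["2", "+", "x", "*", "3"]))  # 2
```

A count above 1 for any string means the grammar is ambiguous.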

Better grammar
To make long story short… (p. 52)
Order of operations requires levels of expr:
An expression is one or more terms (+, –).
A term is one or more factors (*, /).
    expr → term | expr add_op term
    term → factor | term mult_op factor
    factor → id | number
    add_op → + | –
    mult_op → * | /

Practice
Let’s write a grammar for a var declaration. Ex. double x, y, avg;
Assume type and id are already defined. Consider the base case first.
Will this work…
    decl → type vars ;
    vars → id | id , vars
Incidentally, the grammar given on p. 70 allows for no variables to be declared. How?

    decl → type vars ;
    vars → id | id , vars
Let’s derive the string “double x, y, avg;”
    decl → double vars ;
    vars → id , vars
    vars → x , vars
    vars → x , id , vars
    vars → x , y , vars
    vars → x , y , avg
    decl → double x , y , avg ;

Summary
Defining a language has 2 steps:
1. “tokens” or lexical elements (the alphabet)
2. grammar of the language
Rest of chapter: two kinds of parsing, top-down (LL) and bottom-up (LR).
Purpose of parsing is to enforce the grammar: recognize whether the input is a legal program.

Parsing (section 2.3)
Top-down: recursive-descent technique; table-driven
Bottom-up: using parse tables
Parsing is the 2nd stage of compilation. Probably most crucial.
BTW, do you prefer “nonterminal” or “variable” to describe left side of grammar rule?

Grammars & parsing
Grammar: defines the language. We can generate (derive) a program.
Parser: sees if a program obeys the grammar. We want to recognize a program.
Parsing algorithms:
General: the CYK algorithm (1965) runs in O(n^3).
Realistic language: the parser can run in O(n), with a top-down technique or a bottom-up technique.

Two approaches
Top-down: construct the parse tree from the root down. Predict which grammar rule to use.
Bottom-up: from tokens (leaves), match them to nonterminals in the grammar. Read tokens until you recognize which nonterminal.
See illustration, p. 71

Example (p.71)
    id_list → id id_list_tail
    id_list_tail → , id id_list_tail | ;
Top down: we know we must start with an id token. Check to see if next token is , or ;
    id_list → A id_list_tail
    id_list → A, B id_list_tail
    id_list → A, B, C id_list_tail
    id_list → A, B, C ;
Bottom up: push tokens until we see ; Then, work backwards from the stack:
    id_list_tail → ;
    id_list_tail → , C ;
    id_list_tail → , B , C ;
    id_list → A , B, C ;

Recursive descent
See handout example.
Each variable we define in the grammar gets its own function.
The function consists of choices: we need to see what this variable can start with.
Match the expected token, or else syntax error!
Why is it called “recursive descent”? Handout!
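The handout itself is not reproduced in this transcript, so the following is a stand-in sketch rather than the course's example. It applies the "one function per variable" idea to the id_list grammar from p. 71; the class name, the "$" end marker, and the rule that any Python-style identifier counts as an id token are assumptions.

```python
class Parser:
    """Recursive-descent recognizer for:
         id_list      -> id id_list_tail
         id_list_tail -> , id id_list_tail | ;
    """

    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # "$" marks end of input (assumption)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def match(self, expected):
        # Match the expected token, or else syntax error.
        tok = self.peek()
        ok = tok.isidentifier() if expected == "id" else tok == expected
        if not ok:
            raise SyntaxError(f"expected {expected}, found {tok!r}")
        self.pos += 1

    def id_list(self):
        self.match("id")
        self.id_list_tail()

    def id_list_tail(self):
        # Choose a production by looking at what the variable can start with.
        if self.peek() == ",":
            self.match(",")
            self.match("id")
            self.id_list_tail()  # the grammar is recursive, so the function is too
        elif self.peek() == ";":
            self.match(";")
        else:
            raise SyntaxError(f"expected , or ; but found {self.peek()!r}")

def parse_id_list(tokens):
    parser = Parser(tokens)
    parser.id_list()
    parser.match("$")
    return True

print(parse_id_list(["A", ",", "B", ",", "C", ";"]))  # True
```

The functions call each other in the same shape as the grammar rules, descending from the start symbol toward the tokens.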

Limitations
Parsing techniques like recursive descent won’t work for all grammars.
Left recursion
    Ex. term → factor | term mult_op factor
Ambiguous starting symbol (both choices begin with the same token)
    Ex. decl → type vars
        vars → id | id , vars
See what’s wrong in these situations? (more on this coming up)

Table-driven top-down parsing (sect. 2.3.3)
Alternative to recursive descent.
Constructing the table is based on first, follow, and predict sets.
Making sure the grammar is LL. If not, need to change the grammar.

Where are we…
5 phases of compilation: scanning, parsing (current topic), semantic analysis, code generation, optimization.
Parsing techniques: top-down (recursive descent; table driven, current topic) and bottom-up.
The last 3 compilation phases are covered in later chapters (4, 15, 16 respectively).

LL parse table (p. 84)
Non-recursive. Uses a “parse stack” to keep track of what it still expects to see in the input.
The table contains the guide for how to parse.
Rows correspond to the nonterminal on top of the stack.
Columns correspond to the current input token (terminal).
Each entry in the table is a production number.

Procedure
Put the start symbol on the parse stack.
While (parse stack != eof):
    If top of stack is a nonterminal: look up the production number in the parse table, then replace the nonterminal on the stack with the right side of that production.
    If top of stack is a terminal: match it with the input & consume it.

Create parse table
Using a parse table is faster than writing a recursive descent parser, but we first need to create the parse table.
We rely on definitions p. 84: First (nonterminal), Follow (grammar symbol), Predict (production).
Definitions first & follow are also used in bottom-up parsing.

Set definitions
First (nonterminal): what tokens can the nonterminal start with? Possibly nothing (if A → ε).
Follow (grammar symbol): what tokens can come after this symbol? Possibly nothing (end of input).
Predict (production): what can the right side start with? If the right side can be empty, what can follow the left side?
For first & follow, the answer could be “nothing”: the empty string for first, end of input for follow.

Example
    1. id_list → id id_list_tail
    2. id_list_tail → , id id_list_tail
    3. id_list_tail → ;
First(id_list) = id
First(id_list_tail) = , ;
Follow(id_list) = eof
Follow(id_list_tail) = eof
Predict(1) = id
Predict(2) = ,
Predict(3) = ;

Example con’d
    1. id_list → id id_list_tail
    2. id_list_tail → , id id_list_tail
    3. id_list_tail → ;

    Top of stack    id    ,    ;
    id_list         1     -    -
    id_list_tail    -     2    3

Let’s parse A,B;

    Parse stack            Input     Action
    id_list                A,B;      use 1
    id id_list_tail        A,B;      consume A
    id_list_tail           ,B;       use 2
    , id id_list_tail      ,B;       consume ,
    id id_list_tail        B;        consume B
    id_list_tail           ;         use 3
    ;                      ;         consume ;
    <empty>                <empty>   done

Let’s try the type grammar from previous lesson.
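The table-driven procedure can be sketched as a short loop over the parse stack. This is an illustration only, assuming the three productions are numbered in the order given and that every token other than “,” and “;” is an id; the names PRODUCTIONS, TABLE and ll_parse are invented here.

```python
# Productions, numbered as in the example:
#   1: id_list      -> id id_list_tail
#   2: id_list_tail -> , id id_list_tail
#   3: id_list_tail -> ;
PRODUCTIONS = {
    1: ["id", "id_list_tail"],
    2: [",", "id", "id_list_tail"],
    3: [";"],
}
# Parse table: (nonterminal on top of stack, current token kind) -> production
TABLE = {
    ("id_list", "id"): 1,
    ("id_list_tail", ","): 2,
    ("id_list_tail", ";"): 3,
}
NONTERMINALS = {"id_list", "id_list_tail"}

def kind(token):
    # Assumption: anything that is not "," or ";" is an identifier.
    return token if token in {",", ";"} else "id"

def ll_parse(tokens, start="id_list"):
    """Return the list of actions taken, or raise SyntaxError."""
    stack = [start]  # the top of the stack is the end of the list
    pos = 0
    actions = []
    while stack:
        top = stack.pop()
        look = kind(tokens[pos]) if pos < len(tokens) else "eof"
        if top in NONTERMINALS:
            rule = TABLE.get((top, look))
            if rule is None:
                raise SyntaxError(f"no table entry for ({top}, {look})")
            # Push the right side in reverse so its first symbol ends up on top.
            stack.extend(reversed(PRODUCTIONS[rule]))
            actions.append(f"use {rule}")
        else:
            if top != look:
                raise SyntaxError(f"expected {top}, found {look}")
            actions.append(f"consume {tokens[pos]}")
            pos += 1
    if pos != len(tokens):
        raise SyntaxError("input left over after the stack emptied")
    return actions

print(ll_parse(["A", ",", "B", ";"]))
# ['use 1', 'consume A', 'use 2', 'consume ,', 'consume B', 'use 3', 'consume ;']
```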

LL obstacles
Left recursion: there are algorithms to eliminate it, but ugly.
Common prefixes:
    stmt → id := expr | id (args)
convert to
    stmt → id stmt_tail
    stmt_tail → := expr | (args)
ε in grammar can lead to ambiguity…
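For the simplest case, immediate left recursion, the usual textbook rewrite turns A → A α | β into A → β A_tail and A_tail → α A_tail | ε. The sketch below shows only that case and is not the course's algorithm: the function name and the "_tail" suffix are invented, an empty list stands for ε, and it assumes the grammar has no rule A → A and no existing symbol named A_tail.

```python
def remove_immediate_left_recursion(nonterminal, alternatives):
    """Rewrite  A -> A alpha | beta  as  A -> beta A_tail,  A_tail -> alpha A_tail | ε.

    alternatives is a list of right sides, each a list of symbols; [] means ε.
    Returns a dict mapping each nonterminal to its new list of right sides.
    """
    # Right sides that start with A itself, with the leading A stripped off.
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nonterminal]
    # All other right sides (the base cases).
    others = [alt for alt in alternatives if not alt or alt[0] != nonterminal]
    if not recursive:
        return {nonterminal: alternatives}
    tail = nonterminal + "_tail"  # hypothetical name for the new nonterminal
    return {
        nonterminal: [beta + [tail] for beta in others],
        tail: [alpha + [tail] for alpha in recursive] + [[]],
    }

# term -> factor | term mult_op factor
print(remove_immediate_left_recursion(
    "term", [["factor"], ["term", "mult_op", "factor"]]))
# {'term': [['factor', 'term_tail']],
#  'term_tail': [['mult_op', 'factor', 'term_tail'], []]}
```

The new grammar generates the same strings but gives a recursive-descent or LL parser a token to consume before it recurses.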

Dangling else (p.81)
    stmt → “if” cond then else
    then → “then” stmt
    else → “else” stmt | ε
Ex. if c1 then if c2 then s1 else s2
If we draw the parse tree, we get two nodes called else. Which else is ε, and which is “else s2”? Ambiguous!
Can resolve by choosing the rule that comes first. So the else matches the closest “then”.

Bottom-up parsing (sect 2.3.4)
Also uses a parse table and stack.
Need to compute first & follow.
Concept of “state” while reading input.

First( )
To calculate first(A), look at A’s rules.
If you see A → c…, add c to first(A).
If you see A → B…, add first(B) to first(A). If B can yield ε, continue to the next symbol in the rule until you reach a symbol that can represent a terminal.
If A can yield ε, add ε to first(A).
Note: don’t put $ in first( ).

Follow( )
What should be included in follow(A)?
If A is the start symbol, add $.
If you see Q → …Ac…, add c.
If you see Q → …AB…, add first(B).
If you see Q → …A, add follow(Q).
If you see Q → …ABC, and B yields ε, add first(C).
If you see Q → …AB, and B yields ε, add follow(Q).
Note: don’t put ε in follow( ).

Example
Try this grammar:
    S → AB
    A → ε | 1A2
    B → ε | 3B
First(B) = ε, 3
First(A) = ε, 1
First(S) = ε, 1, 3 (note in this case A → ε)
Follow(S) = $ since S is the start symbol
Follow(A) = 2, 3, $: we need first(B), and since B → ε, we need $
Follow(B) = $
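The first and follow rules can be run mechanically by repeating them until nothing changes. The sketch below is one way to do that, not the course's worksheet method: the function names and the grammar encoding (a dict of right sides, with an empty list for ε) are assumptions.

```python
EPS = "ε"
END = "$"

def first_of_sequence(symbols, first, nonterminals):
    """First set of a sequence of grammar symbols (ε if all of them can vanish)."""
    result = set()
    for sym in symbols:
        if sym not in nonterminals:  # a terminal: the sequence starts with it
            result.add(sym)
            return result
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return result
    result.add(EPS)
    return result

def first_and_follow(grammar, start):
    """grammar maps each nonterminal to a list of right sides; [] means ε."""
    nts = set(grammar)
    first = {a: set() for a in nts}
    follow = {a: set() for a in nts}
    follow[start].add(END)  # the start symbol is followed by end of input
    changed = True
    while changed:  # repeat the rules until no set grows
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                new_first = first_of_sequence(rhs, first, nts)
                if not new_first <= first[a]:
                    first[a] |= new_first
                    changed = True
                for i, sym in enumerate(rhs):
                    if sym not in nts:
                        continue
                    rest = first_of_sequence(rhs[i + 1:], first, nts)
                    new_follow = rest - {EPS}  # never put ε in follow
                    if EPS in rest:  # everything after sym can vanish
                        new_follow |= follow[a]
                    if not new_follow <= follow[sym]:
                        follow[sym] |= new_follow
                        changed = True
    return first, follow

GRAMMAR = {
    "S": [["A", "B"]],
    "A": [[], ["1", "A", "2"]],
    "B": [[], ["3", "B"]],
}
first, follow = first_and_follow(GRAMMAR, "S")
for name in "SAB":
    print(name, sorted(first[name]), sorted(follow[name]))
```

For this grammar the computed sets agree with the ones listed in the example.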

Try this one
Let’s try the language (1*2(3+4)*(56)*)*

    Rules              First    Follow
    S → ε | SABC
    A → 2 | 1A
    B → ε | 3B | 4B
    C → ε | 56C

answer
Let’s try the language (1*2(3+4)*(56)*)*

    Rules              First      Follow
    S → ε | SABC       ε, 1, 2    1, 2, $
    A → 2 | 1A         1, 2       3, 4, 5, $, 1, 2
    B → ε | 3B | 4B    ε, 3, 4    5, $, 1, 2
    C → ε | 56C        ε, 5       $, 1, 2

Bottom-up parsing
Learning objectives
Running a parse machine: “goto” (or shift) actions; reduce actions (backtrack to earlier state); maintain a stack of visited states.
Creating a parse machine: find the states (sets of items); find transitions between states, including reduce. If there are many states, write a table instead of drawing.

State
Let’s say we have a rule for reading two token parens: P → ( )
There are 3 possible states:
    P → • ( )
    P → ( • )
    P → ( ) •
The dot • is the cursor, and a grammar rule containing a cursor is called an item. A state may contain more than one item.

Parse stack
Top-down: began with the start symbol. Replace left sides with right sides. Consume token if it matched the input. Parsing stopped when the stack was empty.
Bottom-up: the stack keeps track of states where we’ve been. Two types of operations:
    goto next state & advance • by 1 input symbol
    reduce: pop states & replace input
Parsing ends when we reach the “happy” state.

Simple example
Consider this grammar:
    S → AB
    A → aaa
    B → bb
At any point in time, think about where we could be while parsing the string “aaabb”.
When we arrive at aaa•bb (cursor after the third a), we can reduce the “aaa” to A. When we arrive at Abb•, we can reduce the “bb” to B. Knowing that we’ve just read AB, we can reduce this to S.
See handouts for details.
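The shift and reduce steps for this small grammar can be imitated with a plain stack. The sketch below reduces greedily whenever a right side sits on top of the stack; that shortcut only works because this particular grammar never offers two choices at once. A real bottom-up parser uses states and a table instead. The names RULES and shift_reduce are made up for the illustration.

```python
RULES = [
    ("S", ["A", "B"]),
    ("A", ["a", "a", "a"]),
    ("B", ["b", "b"]),
]

def shift_reduce(text):
    """Return (accepted, trace) for the grammar S -> AB, A -> aaa, B -> bb."""
    stack, trace = [], []
    for ch in text:
        stack.append(ch)  # shift one input symbol
        trace.append(f"shift {ch}")
        reduced = True
        while reduced:  # reduce while some right side is on top of the stack
            reduced = False
            for lhs, rhs in RULES:
                if stack[-len(rhs):] == rhs:
                    del stack[-len(rhs):]
                    stack.append(lhs)
                    trace.append(f"reduce {' '.join(rhs)} to {lhs}")
                    reduced = True
                    break
    return stack == ["S"], trace

accepted, trace = shift_reduce("aaabb")
print(accepted)  # True
for step in trace:
    print(step)
# shift a, shift a, shift a, reduce a a a to A,
# shift b, shift b, reduce b b to B, reduce A B to S
```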

Sets of items
We’re creating states. We start with a grammar.
First step is to augment it with the rule S’ → S.
The first state I0 will contain S’ → • S
Important rule: any time you write • before a variable, you must “expand” that variable. So, we add items from the rules of S to I0.
Example: { 0^n 1^(n+1) }
    S → 1 | 0S1
We add the new start rule S’ → S.
State 0 has these 3 items (S has been expanded):
    I0: S’ → • S
        S → • 1
        S → • 0S1

continued
Next, determine transitions out of state 0.
    δ(0, S) = 1
    δ(0, 1) = 2
    δ(0, 0) = 3
I’ve written destinations along the right side.
Now we’re ready for state 1. Move the cursor to the right to become S’ → S •
    I0: S’ → • S        1
        S → • 1         2
        S → • 0S1       3
    I1: S’ → S •

continued
Any time an item ends with •, this represents a reduce, not a goto.
Now, we’re ready for state 2. The item S → • 1 moves its cursor to the right: S → 1 •
This also becomes a reduce.
    I0: S’ → • S        1
        S → • 1         2
        S → • 0S1       3
    I1: S’ → S •        r
    I2: S → 1 •         r

continued
Next is state 3. From S → • 0S1, move the cursor. Notice that now the • is in front of a variable, so we need to expand.
Once we’ve written the items, fill in the transitions. Create a new state only if needed.
    δ(3, S) = 4 (a new state)
    δ(3, 1) = 2 (as before)
    δ(3, 0) = 3 (as before)
    I0: S’ → • S        1
        S → • 1         2
        S → • 0S1       3
    I1: S’ → S •        r
    I2: S → 1 •         r
    I3: S → 0 • S1      4
        S → • 1         2
        S → • 0S1       3

continued
Next is state 4. From item S → 0 • S1, move the cursor. Determine the transition.
    δ(4, 1) = 5
Notice we need a new state since we’ve never seen “0S • 1” before.
    I0: S’ → • S        1
        S → • 1         2
        S → • 0S1       3
    I1: S’ → S •        r
    I2: S → 1 •         r
    I3: S → 0 • S1      4
        S → • 1         2
        S → • 0S1       3
    I4: S → 0S • 1      5

Last state!
Our last state is #5. Since the cursor is at the end of the item, our transition is a reduce.
Now, we are done finding states and transitions!
One question remains, concerning the reduce transitions: on what input should we reduce?
    I0: S’ → • S        1
        S → • 1         2
        S → • 0S1       3
    I1: S’ → S •        r
    I2: S → 1 •         r
    I3: S → 0 • S1      4
        S → • 1         2
        S → • 0S1       3
    I4: S → 0S • 1      5
    I5: S → 0S1 •       r
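The state construction walked through on these slides (expand after the dot, then move the cursor over each symbol) can be sketched in a few lines. This is an illustration under assumptions: items are encoded as (rule number, dot position) pairs, and the names closure, goto and build_states are chosen here, not taken from the course handout.

```python
# Augmented grammar: rule 0 is the added start rule S' -> S.
GRAMMAR = [("S'", ("S",)), ("S", ("1",)), ("S", ("0", "S", "1"))]
NONTERMINALS = {"S'", "S"}

def closure(items):
    """Expand every variable that appears right after the dot."""
    result = set(items)
    work = list(items)
    while work:
        rule, dot = work.pop()
        rhs = GRAMMAR[rule][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for i, (lhs, _) in enumerate(GRAMMAR):
                if lhs == rhs[dot] and (i, 0) not in result:
                    result.add((i, 0))
                    work.append((i, 0))
    return frozenset(result)

def goto(state, symbol):
    """Move the cursor over symbol in every item that allows it, then expand."""
    moved = {(r, d + 1) for r, d in state
             if d < len(GRAMMAR[r][1]) and GRAMMAR[r][1][d] == symbol}
    return closure(moved)

def build_states():
    states = [closure({(0, 0)})]  # I0 starts from S' -> . S
    transitions = {}
    for index, state in enumerate(states):  # the list grows while we iterate
        for rule, dot in sorted(state):
            rhs = GRAMMAR[rule][1]
            if dot == len(rhs):
                continue  # dot at the end: a reduce item, no goto
            symbol = rhs[dot]
            if (index, symbol) in transitions:
                continue
            target = goto(state, symbol)
            if target not in states:  # create a new state only if needed
                states.append(target)
            transitions[(index, symbol)] = states.index(target)
    return states, transitions

states, transitions = build_states()
print(len(states))  # 6
for (src, symbol), dst in sorted(transitions.items()):
    print(f"δ({src}, {symbol}) = {dst}")
```

With items visited in grammar order, the states come out numbered 0 through 5 with the same transitions as in the walkthrough.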

When to reduce
If you are at the end of an item such as S → 1 •, there is no symbol after the • telling us what input to wait for.
The next symbol should be whatever “follows” the variable we are reducing. In this case, what follows S. We need to look at the original grammar to find out.
For example, if you were reducing A, and you saw a rule S → A1B, you would say that 1 follows A.
Since S is the start symbol, $ (end of input) follows S.
So for each grammar variable, ask: what follows? For more info, see parser worksheet.
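Putting the states and the follow sets together gives a parse table that a small driver can run. The table below is hand-built for S → 1 | 0S1 as a sketch: it assumes follow(S) = {1, $} (1 from the rule S → 0S1, $ because S is the start symbol), and the names ACTION, GOTO and lr_parse are invented for the illustration.

```python
# Rule number -> (left side, length of right side)
RULES = {1: ("S", 1), 2: ("S", 3)}  # 1: S -> 1    2: S -> 0S1

# Shift entries come from the transitions on tokens; reduce entries are
# placed on the tokens in follow(S) = {1, $} (assumption stated above).
ACTION = {
    (0, "0"): ("shift", 3), (0, "1"): ("shift", 2),
    (1, "$"): ("accept", None),
    (2, "1"): ("reduce", 1), (2, "$"): ("reduce", 1),
    (3, "0"): ("shift", 3), (3, "1"): ("shift", 2),
    (4, "1"): ("shift", 5),
    (5, "1"): ("reduce", 2), (5, "$"): ("reduce", 2),
}
GOTO = {(0, "S"): 1, (3, "S"): 4}  # transitions on the variable S

def lr_parse(text):
    """Return True if text is in { 0^n 1^(n+1) }, using the table above."""
    tokens = list(text) + ["$"]
    stack = [0]  # stack of visited states; begin in state 0
    pos = 0
    while True:
        entry = ACTION.get((stack[-1], tokens[pos]))
        if entry is None:
            return False  # no table entry: syntax error
        action, arg = entry
        if action == "shift":
            stack.append(arg)
            pos += 1
        elif action == "reduce":
            lhs, length = RULES[arg]
            del stack[-length:]  # pop one state per right-side symbol
            stack.append(GOTO[(stack[-1], lhs)])
        else:
            return True  # accept

print(lr_parse("011"))   # True
print(lr_parse("0011"))  # False: two 0s need three 1s
```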

Examples
Let’s run a bottom-up parse table for a couple of grammars.
Reading parentheses:
    P → ( )
Reading a binary number:
    S → 0A0
    A → 1 | 1A

Wrap-up chapter 2 on parsing
Finish bottom-up example
Precedence & associativity
C language grammar
Homework
Actually, precedence & associativity are covered a bit at the start of chapter 6.

Bottom-up parsing
Convert the grammar to sets of items. This will give us our states. Augment the grammar if the start symbol is ever on the right.
Compute first() and follow(). We need follow so we know on what input to reduce. For S → AB, follow(A) needs first(B).
Fill in the parse table.
Ready to parse! Begin in state 0.
Finish example from previous lesson. Do more if needed.

Precedence
We want to define operations in a grammar, but need to make the right choices!
Ex. 2 + 3 * 4: * is performed first, while + is a better separator.
Need nested structure to define levels of precedence, but which way?
Choice 1:
    expr → term | expr + term
    term → token | term * token
Choice 2:
    expr → term | expr * term
    term → token | term + token
Draw parse trees to see which is correct.
Now, we’re getting away from parsing for the moment to look at a general grammar issue. (This could also dispel the possibility that it doesn’t matter which way you do it!)

Associativity
Notice that our expression grammar is recursive. Should it be left or right recursive?
Let’s look at minus. (3 – 2) – 1 != 3 – (2 – 1)
Which way should it be?
    expr → token | expr – token
    expr → token | token – expr
… …

Moral
Precedence: operators of lower precedence should be defined earlier, or at a higher level, in the grammar (because they separate well).
Associativity: left associativity = left recursive; right associativity = right recursive.
Most operators are left (LR) associative. Can you think of some that are right associative?
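Both morals can be checked with a tiny evaluator. The sketch below is an assumption-laden illustration: it handles only non-negative integers and the four operators typed in ASCII, and it replaces the left-recursive rules with loops, which keeps the left-to-right grouping.

```python
import re

def evaluate(text):
    """Evaluate an expression using
         expr -> term | expr add_op term
         term -> factor | term mult_op factor
    where a factor is a non-negative integer (assumption for this sketch)."""
    tokens = re.findall(r"\d+|[-+*/]", text)
    pos = 0

    def factor():
        nonlocal pos
        value = int(tokens[pos])
        pos += 1
        return value

    def term():
        nonlocal pos
        value = factor()
        # Lower level of the grammar: * and / bind tighter than + and -.
        while pos < len(tokens) and tokens[pos] in "*/":
            op = tokens[pos]
            pos += 1
            right = factor()
            value = value * right if op == "*" else value / right
        return value

    def expr():
        nonlocal pos
        value = term()
        # The loop folds operands left to right: left associativity.
        while pos < len(tokens) and tokens[pos] in "+-":
            op = tokens[pos]
            pos += 1
            right = term()
            value = value + right if op == "+" else value - right
        return value

    result = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return result

print(evaluate("2 + 3 * 4"))  # 14: * is performed first
print(evaluate("3 - 2 - 1"))  # 0: grouped as (3 - 2) - 1
```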

C language grammar
Concise, compared to other PLs!
We see what a PL consists of:
Declarations of functions, data structures, variables
Function syntax
Data types
Kinds of statements
Expressions (many different operators)
Future chapters will go deeper on data types, etc.
Finally, talk about homework.