CS 363 – Chapter 2 The first 2 phases of compilation Scanning Parsing

Scanning: Need to define tokens. Making a scanner.
Parsing: Need to define the language. Making a parser.

Defining tokens Kinds of tokens
Keywords, identifiers Operators, punctuation symbols Constants, string literals Not practical to enumerate all possible tokens Use “regular expression” as shorthand notation

Regular expression Used to define what tokens look like A reg.expr. is
A single character or empty string Built from other reg.expr. by concatenating, using “|” or “*” operators Stuff enclosed in [ ] is optional Ex. Define a number:
number → [ – ] digit (digit)*
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
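As a rough illustration, the “number” definition above can be tried out with Python's `re` module. This is only a sketch: the name `is_number` is made up for the example, the ASCII hyphen stands in for the slide's minus sign, and the slide's “[ ] is optional” becomes `?` in Python's regex syntax.

```python
import re

# Slide notation: number -> [ - ] digit (digit)*
# The optional part written with [ ] on the slide becomes "-?" here.
NUMBER = re.compile(r"-?[0-9][0-9]*")

def is_number(text):
    """Return True if the whole string matches the slide's 'number' rule."""
    return NUMBER.fullmatch(text) is not None

print(is_number("42"))    # True
print(is_number("-7"))    # True
print(is_number("3.14"))  # False: decimals are not covered yet
print(is_number("-"))     # False: at least one digit is required
```

Extending the pattern to decimals or scientific notation is left open, as on the next slide.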

Let’s expand our definition of “number”
number → [ – ] digit (digit)*
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
How would we handle Decimals? Scientific notation? Octal and hexadecimal integers?

Let’s write regular expression for identifier.
Identifiers start with a letter. Each subsequent character is letter, digit, or underscore.
letter → a | b | c | d | … | A | B | C | … | Z
digit → 0 | 1 | 2 | … | 9
id → ________________

Ready to scan Once you know the tokens, can write scanner
Straightforward task (Skip section on theory) Read input one character at a time Create token as you finish reading it Ex. “int dog= 4;” Has 5 tokens (what kinds?) “dog” is a token, but not “do” or “dog=”
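A minimal scanner sketch for the example above, assuming a small made-up set of token kinds (`keyword`, `id`, `number`, `op`, `punct`) and a hypothetical keyword list. It leans on Python's `re` module to do the character-by-character matching, and each pattern is greedy, which is why “dog” comes out as one token rather than “do”.

```python
import re

# Hypothetical token classes for illustration; a real language has many more.
KEYWORDS = {"int", "double", "if", "else"}
TOKEN_SPEC = [
    ("number", r"[0-9]+"),
    ("id",     r"[A-Za-z][A-Za-z0-9_]*"),
    ("op",     r"[=+\-*/]"),
    ("punct",  r"[;,()]"),
    ("skip",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def scan(source):
    """Return a list of (kind, text) tokens, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind == "skip":
            continue                      # whitespace separates tokens but is not one
        if kind == "id" and text in KEYWORDS:
            kind = "keyword"              # keywords look like identifiers
        tokens.append((kind, text))
    return tokens

for token in scan("int dog= 4;"):
    print(token)
```

Running it on “int dog= 4;” yields five tokens, matching the count on the slide.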

Scanner lookahead In some languages need to look ahead to see when a token ends In Pascal, “3.” could be followed by digits to continue number another “.” to create “..” token In Fortran, spaces ignored - BAD “do 5 i…” could be identifier or a do loop depending on context Incidentally, Fortran limits identifiers to 6 characters (another bad rule)

Defining a language A PL is a set of valid programs (strings) in that language We need recursive definition: Theoretically infinite set Programs are hierarchical/nested Reg.expr. can’t handle nested definitions Usually we use a grammar to define a complex set of strings. The usual kind of grammar is a context-free grammar (CFG), commonly written in BNF notation.

Grammars Assume that “number” and “id” already defined
Ex. Define an expression:
expr → number | id | expr op expr
op → + | – | * | /
Do you see the base case & recursion? What strings are generated by this definition? Can use a tree – a parse tree!

Let’s find derivation of 2 + x * 3
expr → number | id | expr op expr
op → + | – | * | /
1. Begin with start symbol.
expr → ?
expr → expr op expr
2. Now, try to resolve each nonterminal on right side.
expr → 2 op expr
expr → 2 + expr
expr → 2 + expr op expr
expr → 2 + x * 3
3. We can summarize these steps with parse tree.

Problem: Ambiguity! The string 2 + x * 3 has two different derivations.
Derivation 1:
expr → expr op expr
expr → 2 op expr
expr → 2 + x * 3
Derivation 2:
expr → expr op expr
expr → expr op 3
expr → expr * 3
expr → expr op expr * 3
expr → 2 + x * 3
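Why the ambiguity matters can be seen by evaluating the two parse trees. The sketch below encodes each tree as a nested tuple (an encoding chosen only for this example) and picks x = 5 arbitrarily. The two trees give different answers.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def evaluate(tree, env):
    """Evaluate a parse tree: a number, a variable name, or (left, op, right)."""
    if isinstance(tree, tuple):
        left, op, right = tree
        return OPS[op](evaluate(left, env), evaluate(right, env))
    if isinstance(tree, str):
        return env[tree]          # look up an identifier
    return tree                   # a numeric constant

# The two parse trees the ambiguous grammar allows for 2 + x * 3.
tree_plus_at_root = (2, "+", ("x", "*", 3))    # first derivation: top-level op is +
tree_times_at_root = ((2, "+", "x"), "*", 3)   # second derivation: top-level op is *

env = {"x": 5}  # arbitrary value, chosen only for illustration
print(evaluate(tree_plus_at_root, env))   # 17
print(evaluate(tree_times_at_root, env))  # 21
```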

Better grammar To make long story short… (p. 52)
Order of operations requires levels of expr: An expression is one or more terms (+, –) A term is one or more factors (*, /)
expr → term | expr add_op term
term → factor | term mult_op factor
factor → id | number
add_op → + | –
mult_op → * | /

Practice Let’s write a grammar for a var declaration.
Ex. double x, y, avg; Assume type and id already defined. Consider base case first. Will this work…
decl → type vars ;
vars → id | id , vars
Incidentally, the grammar given on p. 70 allows for no variables to be declared. How?

decl → type vars ;
vars → id | id , vars
Let’s derive the string “double x, y, avg;”
decl → double vars ;
vars → id , vars
vars → x , vars
vars → x , id , vars
vars → x , y , vars
vars → x , y , avg
decl → double x , y , avg ;

Summary Defining a language has 2 steps
“tokens” or lexical elements, alphabet grammar of the language Rest of chapter: two kinds of parsing Top-down (LL) Bottom-up (LR) Purpose of parsing is to enforce grammar: recognize whether input is legal program

Parsing (section 2.3)
Top-down: Recursive-descent technique. Table-driven.
Bottom-up: Using parse tables.
Parsing is the 2nd stage of compilation. Probably most crucial. BTW, do you prefer “nonterminal” or “variable” to describe left side of grammar rule?

Grammars & parsing Grammar: defines language
We can generate (derive) a program Parser: see if a program obeys grammar We want to recognize a program Parsing algorithms: the general CYK algorithm (1965) runs in O(n^3); for a realistic language, a parser can run in O(n) Top-down technique Bottom-up technique

Two approaches
Top-down: Construct parse tree from root down. Predict which grammar rule to use.
Bottom-up: From tokens (leaves), match them to nonterminals in grammar. Read tokens until you recognize which nonterminal.
See illustration, p. 71

Example (p.71)
id_list → id id_list_tail
id_list_tail → , id id_list_tail | ;
Top down: We know we must start with an id token. Check to see if next token is , or ;
id_list → A id_list_tail
id_list → A, B id_list_tail
id_list → A, B, C id_list_tail
id_list → A, B, C ;
Bottom up: Push tokens until we see ; Then, work backwards from the stack:
id_list_tail → ;
id_list_tail → , C ;
id_list_tail → , B , C ;
id_list → A , B , C ;

Recursive descent See handout example
Each variable we define in the grammar gets its own function Function consists of choices Need to see what this variable can start with Match expected token, or else syntax error! Why called “recursive descent”? Handout!
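The handout itself is not reproduced here, so the following is only a sketch of the idea, using the id_list grammar from the earlier example (id_list → id id_list_tail, id_list_tail → , id id_list_tail | ;). The function names and the convention that any token other than “,” or “;” counts as an id are assumptions made for this illustration.

```python
def parse_id_list(tokens):
    """Recursive-descent recognizer for
         id_list      -> id id_list_tail
         id_list_tail -> , id id_list_tail | ;
    Tokens are plain strings; anything other than "," or ";" counts as an id
    (a simplifying assumption for this sketch)."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def match_id():
        nonlocal pos
        if peek() is None or peek() in {",", ";"}:
            raise SyntaxError(f"expected id at token {pos}")
        pos += 1

    def match(expected):
        nonlocal pos
        if peek() != expected:
            raise SyntaxError(f"expected {expected!r} at token {pos}")
        pos += 1

    def id_list():                 # one function per grammar variable
        match_id()
        id_list_tail()

    def id_list_tail():            # choose a rule by looking at the next token
        if peek() == ",":
            match(",")
            match_id()
            id_list_tail()         # the recursion in "recursive descent"
        elif peek() == ";":
            match(";")
        else:
            raise SyntaxError(f"expected ',' or ';' at token {pos}")

    id_list()
    if pos != len(tokens):
        raise SyntaxError("extra input after ';'")
    return True

print(parse_id_list(["A", ",", "B", ",", "C", ";"]))  # True
```

Each function calls the functions for the variables on the right side of its rule, so the call stack descends through the parse tree from the root.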

Limitations Parsing techniques like recursive descent won’t work for all grammars
Left recursion Ex. term → factor | term mult_op factor
Ambiguous starting symbol Ex.
decl → type vars
vars → id | id , vars
See what’s wrong in these situations? (more on this coming up)

Table-driven top-down parsing (sect. 2.3.3)
Alternative to recursive-descent Constructing the table Based on: first, follow, predict sets Making sure the grammar is LL If not, need to change grammar

Where are we… 5 phases of compilation
Scanning
Parsing (we are here)
Semantic analysis
Code generation
Optimization
Parsing techniques: Top-down (Recursive descent; Table driven, we are here). Bottom-up.
The last 3 compilation phases are covered in later chapters (4, 15, 16 respectively).

LL parse table (p. 84) Non-recursive
Uses “parse stack” to maintain input Table contains guide for how to parse Rows correspond to nonterminal on top of stack Columns correspond to current input token (terminal) Entry in table is a production number

Procedure Put the start symbol on the parse stack
While the parse stack is not empty (eof not yet matched): If top of stack is nonterminal: Look up production number from parse table, then replace that nonterminal on the stack with the right side of the rule. If top of stack is terminal: match it with the input & consume it.

Create parse table Using a parse table is faster than writing a recursive descent parser, but we first need to create the parse table. We rely on definitions p. 84: First (nonterminal) Follow (grammar symbol) Predict (production) Definitions first & follow are also used in bottom-up parsing.

Set definitions
First (nonterminal): what tokens can the nonterminal start with? possibly nothing (if A → ε)
Follow (grammar symbol): what tokens can come after this symbol? possibly nothing (end of input)
Predict (production): What can right side start with? If right side can be empty, what can follow left side?
For first & follow, the answer could be “nothing” – i.e. the empty string.

Example
id_list → id id_list_tail
id_list_tail → , id id_list_tail | ;
First(id_list) = id
First(id_list_tail) = , ;
Follow(id_list) = eof
Follow(id_list_tail) = eof
Predict(production 1) = id
Predict(production 2) = ,
Predict(production 3) = ;

Example cont’d
id_list → id id_list_tail               (production 1)
id_list_tail → , id id_list_tail        (production 2)
id_list_tail → ;                        (production 3)
Parse table (entries are production numbers):
Top of stack      id    ,     ;
id_list           1     -     -
id_list_tail      -     2     3
Let’s parse A,B;
Parse stack             Input     Action
id_list                 A,B;      use 1
id id_list_tail         A,B;      consume A
id_list_tail            ,B;       use 2
, id id_list_tail       ,B;       consume ,
id id_list_tail         B;        consume id
id_list_tail            ;         use 3
;                       ;         consume ;
<empty>                 <empty>
Let’s try the type grammar from previous lesson.
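The table-driven procedure can be sketched in a few lines of Python for this grammar. The dictionary layout, the name `ll_parse`, and the use of token kinds ("id", ",", ";") as input are choices made for this example, not something the slides prescribe.

```python
# Productions, numbered as in the parse table: number -> (left side, right side).
PRODUCTIONS = {
    1: ("id_list", ["id", "id_list_tail"]),
    2: ("id_list_tail", [",", "id", "id_list_tail"]),
    3: ("id_list_tail", [";"]),
}
# Parse table: (nonterminal on top of stack, current input token) -> production number.
TABLE = {
    ("id_list", "id"): 1,
    ("id_list_tail", ","): 2,
    ("id_list_tail", ";"): 3,
}
NONTERMINALS = {"id_list", "id_list_tail"}

def ll_parse(tokens):
    """Parse a list of token kinds; return the production numbers used, in order."""
    stack = ["id_list"]            # start symbol; the top of the stack is the list's end
    pos = 0
    used = []
    while stack:
        top = stack.pop()
        current = tokens[pos] if pos < len(tokens) else "eof"
        if top in NONTERMINALS:
            number = TABLE.get((top, current))
            if number is None:
                raise SyntaxError(f"no rule for {top} on input {current!r}")
            used.append(number)
            # Push the right side reversed so its first symbol ends up on top.
            stack.extend(reversed(PRODUCTIONS[number][1]))
        else:
            if top != current:
                raise SyntaxError(f"expected {top!r} but saw {current!r}")
            pos += 1               # terminal matched: consume it
    if pos != len(tokens):
        raise SyntaxError("input left over after parse stack emptied")
    return used

print(ll_parse(["id", ",", "id", ";"]))  # [1, 2, 3], the same rules as the trace for A,B;
```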

LL obstacles
Left recursion: There are algorithms to eliminate it, but ugly.
Common prefixes:
Stmt → id := expr | id (args)
convert to
Stmt → id Stmt_tail
Stmt_tail → := expr | (args)
ε in grammar can lead to ambiguity…

Dangling else (p.81)
stmt → “if” cond then else
then → “then” stmt
else → “else” stmt | ε
Ex. if c1 then if c2 then s1 else s2
If we draw parse tree, we get two nodes called else. Which else is ε, and which is “else s2” – ambiguous! Can resolve by choosing the rule that comes first. So the else matches the closest “then”.

Bottom-up parsing (sect 2.3.4)
Also uses parse table and stack Need to compute first & follow Concept of “state” while reading input.

First( ) To calculate first(A), look at A’s rules.
If you see A → c…, add c to first(A) If you see A → B…, add first(B) to first(A). If B can yield ε, continue to next symbol in rule until you reach a symbol that can represent a terminal. If A can yield ε, add ε to first(A). Note: don’t put $ in first( ).

Follow( ) What should be included in follow(A) ?
If A is start symbol, add $. If you see Q → …Ac…, add c. If you see Q → …AB…, add first(B). If you see Q → …A, add follow(Q). If you see Q → …ABC, and B yields ε, add first (C). If you see Q → …AB, and B yields ε, add follow(Q). Note: don’t put ε in follow( ).

Example Try this grammar:
S → AB
A → ε | 1A2
B → ε | 3B
First(B) = ε, 3
First(A) = ε, 1
First(S) = ε, 1, 3 (note in this case A → ε)
Follow(S) = $ since S is start symbol
Follow(A) = 2, 3, $ (we need first(B); since B → ε, we also need $)
Follow(B) = $
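The first( ) and follow( ) rules can be applied mechanically by repeating them until nothing changes. Below is one possible sketch, assuming a grammar stored as a dictionary from each variable to a list of right sides (an empty list stands for ε). The function names are made up for the example.

```python
EPS = "ε"
END = "$"

def first_of_sequence(symbols, first):
    """First set of a sequence of grammar symbols (terminals or variables)."""
    result = set()
    for symbol in symbols:
        symbol_first = first.get(symbol, {symbol})   # a terminal starts with itself
        result |= symbol_first - {EPS}
        if EPS not in symbol_first:
            return result
    result.add(EPS)            # every symbol can vanish (or the sequence is empty)
    return result

def compute_first(grammar):
    first = {variable: set() for variable in grammar}
    changed = True
    while changed:             # repeat until no set grows
        changed = False
        for variable, rules in grammar.items():
            for rhs in rules:
                new = first_of_sequence(rhs, first)
                if not new <= first[variable]:
                    first[variable] |= new
                    changed = True
    return first

def compute_follow(grammar, first, start):
    follow = {variable: set() for variable in grammar}
    follow[start].add(END)     # $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for lhs, rules in grammar.items():
            for rhs in rules:
                for i, symbol in enumerate(rhs):
                    if symbol not in grammar:
                        continue                   # only variables get follow sets
                    rest = first_of_sequence(rhs[i + 1:], first)
                    new = rest - {EPS}
                    if EPS in rest:                # everything after can vanish
                        new |= follow[lhs]
                    if not new <= follow[symbol]:
                        follow[symbol] |= new
                        changed = True
    return follow

GRAMMAR = {
    "S": [["A", "B"]],
    "A": [[], ["1", "A", "2"]],
    "B": [[], ["3", "B"]],
}
FIRST = compute_first(GRAMMAR)
FOLLOW = compute_follow(GRAMMAR, FIRST, "S")
for variable in GRAMMAR:
    print(variable, sorted(FIRST[variable]), sorted(FOLLOW[variable]))
```

The sets it computes for this grammar agree with the ones listed on the slide.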

Try this one Let’s try the language (1*2(3+4)*(56)*)*
Fill in First and Follow for each rule:
S → ε | SABC
A → 2 | 1A
B → ε | 3B | 4B
C → ε | 56C

answer Let’s try the language (1*2(3+4)*(56)*)*
Rules                  First        Follow
S → ε | SABC           ε, 1, 2      1, 2, $
A → 2 | 1A             1, 2         3, 4, 5, $, 1, 2
B → ε | 3B | 4B        ε, 3, 4      5, $, 1, 2
C → ε | 56C            ε, 5         $, 1, 2

Bottom-up parsing Learning objectives Running a parse machine
“Goto” (or shift) actions Reduce actions: backtrack to earlier state Maintain stack of visited states Creating a parse machine Find the states: sets of items Find transitions between states, including reduce. If many states, write table instead of drawing

State Let’s say we have a rule for reading two token parens: P  ( )
There are 3 possible states:
P → • ( )
P → ( • )
P → ( ) •
The dot • is the cursor, and a grammar rule containing a cursor is called an item. A state may contain more than one item.

Parse stack
Top-down: Began with the start symbol. Replace left sides with right sides. Consume token if matched input. Parsing stopped when stack empty.
Bottom-up: Stack keeps track of states where we’ve been. Two types of operations: goto next state & advance • by 1 input symbol; reduce: pop states & replace input. Parsing ends when we reach “happy” state.

Simple example Consider this grammar:
S → AB
A → aaa
B → bb
At any point in time, think about where we could be while parsing the string “aaabb”. When we arrive at aaa • bb, we can reduce the “aaa” to A. When we arrive at Abb •, we can reduce the “bb” to B. Knowing that we’ve just read AB, we can reduce this to S. See handouts for details.

Sets of items We’re creating states.
We start with a grammar. First step is to augment it with the rule S’ → S. The first state I0 will contain S’ → • S. Important rule: Any time you write • before a variable, you must “expand” that variable. So, we add items from the rules of S to I0.
Example: { 0^n 1^(n+1) }
S → 1 | 0S1
We add new start rule S’ → S
State 0 has these 3 items (expand S):
I0: S’ → • S
    S → • 1
    S → • 0S1

continued Next, determine transitions out of state 0.
δ(0, S) = 1
δ(0, 1) = 2
δ(0, 0) = 3
I’ve written destinations along the right side. Now we’re ready for state 1. Move cursor to right to become S’ → S •
State 0 has these 3 items:
I0: S’ → • S       1
    S → • 1        2
    S → • 0S1      3
I1: S’ → S •

continued Any time an item ends with •, this represents a reduce, not a goto. Now, we’re ready for state 2. The item S → • 1 moves its cursor to the right: S → 1 • This also becomes a reduce.
I0: S’ → • S       1
    S → • 1        2
    S → • 0S1      3
I1: S’ → S •       r
I2: S → 1 •        r

continued Next is state 3. From S → • 0S1, move cursor.
Notice that now the • is in front of a variable, so we need to expand. Once we’ve written the items, fill in the transitions. Create new state only if needed.
δ(3, S) = 4 (a new state)
δ(3, 1) = 2 (as before)
δ(3, 0) = 3 (as before)
I0: S’ → • S       1
    S → • 1        2
    S → • 0S1      3
I1: S’ → S •       r
I2: S → 1 •        r
I3: S → 0 • S1     4

continued Next is state 4. From item S → 0 • S1, move cursor.
Determine transition. δ(4, 1) = 5 Notice we need new state since we’ve never seen “0 S • 1” before.
I0: S’ → • S       1
    S → • 1        2
    S → • 0S1      3
I1: S’ → S •       r
I2: S → 1 •        r
I3: S → 0 • S1     4
I4: S → 0S • 1     5

Last state! Our last state is #5.
Since the cursor is at the end of the item, our transition is a reduce. Now, we are done finding states and transitions! One question remains, concerning the reduce transitions: On what input should we reduce?
I0: S’ → • S       1
    S → • 1        2
    S → • 0S1      3
I1: S’ → S •       r
I2: S → 1 •        r
I3: S → 0 • S1     4
I4: S → 0S • 1     5
I5: S → 0S1 •      r
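The state-finding steps can also be automated. The sketch below is one way to do it for the grammar S → 1 | 0S1, assuming an item is stored as a pair (rule index, cursor position). The helper names `closure`, `goto` and `build_states` are conventional but not taken from the slides.

```python
GRAMMAR = [
    ("S'", ("S",)),            # rule 0: the augmented start rule S' -> S
    ("S", ("1",)),             # rule 1
    ("S", ("0", "S", "1")),    # rule 2
]
VARIABLES = {"S'", "S"}

def closure(items):
    """Expand every variable that appears right after the cursor."""
    result = set(items)
    work = list(items)
    while work:
        rule, dot = work.pop()
        rhs = GRAMMAR[rule][1]
        if dot < len(rhs) and rhs[dot] in VARIABLES:
            for index, (lhs, _) in enumerate(GRAMMAR):
                if lhs == rhs[dot] and (index, 0) not in result:
                    result.add((index, 0))
                    work.append((index, 0))
    return frozenset(result)

def goto(state, symbol):
    """Move the cursor past `symbol` in every item where that is possible."""
    moved = {(rule, dot + 1) for rule, dot in state
             if dot < len(GRAMMAR[rule][1]) and GRAMMAR[rule][1][dot] == symbol}
    return closure(moved)

def build_states():
    states = [closure({(0, 0)})]           # I0 starts from S' -> . S
    transitions = {}
    number = 0
    while number < len(states):            # states found later are processed too
        state = states[number]
        symbols = [GRAMMAR[r][1][d] for r, d in sorted(state) if d < len(GRAMMAR[r][1])]
        for symbol in dict.fromkeys(symbols):      # each symbol once, in order
            target = goto(state, symbol)
            if target not in states:
                states.append(target)      # create a new state only if needed
            transitions[(number, symbol)] = states.index(target)
        number += 1
    return states, transitions

STATES, TRANSITIONS = build_states()
print(len(STATES))
for (state, symbol), target in sorted(TRANSITIONS.items()):
    print(f"δ({state}, {symbol}) = {target}")
```

For this grammar the sketch finds six states, and its transitions use the same numbering as the δ values worked out by hand.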

When to reduce If you are at the end of an item such as S → 1 •, there is no symbol after the • telling us what input to wait for. The next symbol should be whatever “follows” the variable we are reducing. In this case, what follows S. We need to look at the original grammar to find out. For example, if you were reducing A, and you saw a rule S → A1B, you would say that 1 follows A. Since S is start symbol, $ (end of input) follows S. For more info, see parser worksheet. For each grammar variable, what follows?
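Putting states I0 to I5 together with follow(S) = {1, $} gives a small parse table that can be run directly. The table below is written out by hand from those states. Its layout, the rule numbering, and the name `parse` are assumptions of this sketch rather than the worksheet's format.

```python
# Rules used in reductions: number -> (left side, length of right side).
RULES = {1: ("S", 1), 2: ("S", 3)}      # 1: S -> 1      2: S -> 0S1

# ACTION[(state, input)]; reductions are entered on follow(S) = {1, $}.
ACTION = {
    (0, "1"): ("shift", 2), (0, "0"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "1"): ("reduce", 1), (2, "$"): ("reduce", 1),
    (3, "1"): ("shift", 2), (3, "0"): ("shift", 3),
    (4, "1"): ("shift", 5),
    (5, "1"): ("reduce", 2), (5, "$"): ("reduce", 2),
}
# GOTO[(state, variable)]: where to go after a reduction.
GOTO = {(0, "S"): 1, (3, "S"): 4}

def parse(text):
    """Return True if text is in { 0^n 1^(n+1) }, using the table above."""
    tokens = list(text) + ["$"]
    stack = [0]                          # stack of visited states; begin in state 0
    pos = 0
    while True:
        entry = ACTION.get((stack[-1], tokens[pos]))
        if entry is None:
            return False                 # blank table entry: syntax error
        kind, arg = entry
        if kind == "shift":
            stack.append(arg)
            pos += 1
        elif kind == "reduce":
            lhs, length = RULES[arg]
            del stack[-length:]          # pop one state per right-side symbol
            stack.append(GOTO[(stack[-1], lhs)])
        else:
            return True                  # accept

print(parse("1"))      # True
print(parse("011"))    # True
print(parse("01"))     # False
```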

Examples Let’s run a bottom-up parse table for a couple grammars.
Reading parentheses
P → ( )
Reading a binary number
S → 0A0
A → 1 | 1A

Wrap-up chapter 2 on parsing Finish bottom-up example
Precedence & associativity C language grammar Homework Actually, precedence & associativity are covered a bit at the start of chapter 6.

Bottom-up parsing Convert grammar to sets of items
This will give us our states Augment grammar if start symbol ever on right Compute first() and follow() We need follow so we know on what input to reduce For S → AB, follow(A) needs first(B) Fill in parse table Ready to parse! Begin in state 0. Finish example from previous lesson. Do more if needed.

Precedence We want to define operations in a grammar, but need to make right choices! Ex. in an expression such as 2 + 3 * 4, * is performed first, while + is a better separator. Need nested structure to define levels of precedence, but which way?
Option 1:
expr → term | expr + term
term → token | term * token
Option 2:
expr → term | expr * term
term → token | term + token
Draw parse trees to see which is correct. Now, we’re getting away from parsing for the moment to look at a general grammar issue. (also could dispel the possibility that it doesn’t matter which way you do it!)

Associativity Notice that our expression grammar is recursive. Should it be left or right recursive? Let’s look at minus. (3 – 2) – 1 != 3 – (2 – 1) Which way should it be?
Left recursive: expr → token | expr – token
Right recursive: expr → token | token – expr
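A quick way to see the difference is to evaluate 3 – 2 – 1 under both groupings. In this sketch the two helper functions (names invented here) mimic the shape of the parse tree each grammar would build.

```python
from functools import reduce

def eval_left(operands):
    """Grouping produced by the left-recursive rule: ((a - b) - c) ..."""
    return reduce(lambda acc, x: acc - x, operands)

def eval_right(operands):
    """Grouping produced by the right-recursive rule: a - (b - (c ...))"""
    if len(operands) == 1:
        return operands[0]
    return operands[0] - eval_right(operands[1:])

print(eval_left([3, 2, 1]))   # 0, i.e. (3 - 2) - 1
print(eval_right([3, 2, 1]))  # 2, i.e. 3 - (2 - 1)
```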

Moral
Precedence: Operators of lower precedence should be defined earlier or higher level in grammar (because they separate well)
Associativity: Left associativity = left recursive. Right associativity = right recursive. Most operators are left (left-to-right) associative. Can you think of some that are right associative?

C language grammar Concise, compared to other PL !
We see what PL consists of Declarations of functions, data structures, variables Function syntax Data types Kinds of statements Expressions (many diff operators) Future chapters will go deeper on data types, etc. Finally, talk about homework.
