Syntax Analysis: Bottom-Up Parsing


Compilation 0368-3133, Lecture 5: Syntax Analysis (Bottom-Up Parsing) and Semantic Analysis. Noam Rinetzky

Model of an LR parser: the input (a stream of terminals ending in $), a stack holding grammar symbols (terminals and non-terminals) interleaved with states, and a driver that consults an ACTION table and a GOTO table to decide each move and produce output.

LR(0) Parsing

N  αβ LR item Already matched To be matched Input Hypothesis about αβ being a possible handle: so far we’ve matched α, expecting to see β

LR(0) shift/reduce items: N → α ● β with the dot before a remaining symbol is a shift item; N → αβ ● with the dot at the end is a reduce item.

GOTO/ACTION tables: the ACTION table maps each state to an action (shift, or reduce by a specific rule, e.g. Z → E$, E → E+T, T → i, E → T, T → (E)); the GOTO table maps a state and a grammar symbol (terminal or non-terminal) to the next state. The full table for the running example appears below.

LR(0) parser tables. Two types of rows:
Shift row: tells which state to GOTO for the current token.
Reduce row: tells which rule to reduce by (independent of the current token); its GOTO entries are blank.

LR parser data structures:
Input: the remainder of the text to be processed.
Stack: a sequence of pairs N, qi, where N is a symbol (terminal or non-terminal) and qi is a state at which decisions are made. The initial stack contains only q0.
For example, with input suffix "+ i $" the stack might hold q0 i q5 (the stack grows to the right).

LR(0) pushdown automaton: two moves, shift and reduce.
Shift move:
Remove the first token from the input and push it on the stack.
Compute the next state based on the GOTO table and push the new state on the stack.
If the new state is error, report an error.
Example: in state q0 with input "i + i $", shifting i consults GOTO(q0, i) = q5; the stack becomes q0 i q5 and the remaining input is "+ i $".

LR(0) pushdown automaton: reduce move, using a rule N → α.
The symbols of α and their following states are removed from the stack.
The new state is computed from the GOTO table, using the state now at the top of the stack and N.
N is pushed on the stack, and the new state is pushed on top of N.
Example: reducing by T → i with stack q0 i q5 and input "+ i $" pops i q5, consults GOTO(q0, T) = q6, and leaves the stack q0 T q6.

GOTO/ACTION table for the grammar
(1) Z → E $   (2) E → T   (3) E → E + T   (4) T → i   (5) T → ( E )

State   i    +    (    )    $    E    T
q0      s5        s7             s1   s6
q1           s3             s2
q2      r1
q3      s5        s7                  s4
q4      r3
q5      r4
q6      r2
q7      s5        s7             s8   s6
q8           s3        s9
q9      r5

rn = reduce using rule number n; sm = shift to state m. Warning: the numbers mean different things! In a shift entry the number names a state; in a reduce entry it names a rule.
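The shift and reduce moves above can be combined into a small table-driven loop. The following is a minimal illustrative driver (not the course's reference implementation); the GOTO entries and reduce rules are transcribed from the table above, with the token `i` standing for `id`.

```python
# Minimal LR(0) driver for the grammar:
#   (1) Z -> E $   (2) E -> T   (3) E -> E + T   (4) T -> i   (5) T -> ( E )
# GOTO is shared by terminals and non-terminals, as in the table above.
GOTO = {
    (0, 'i'): 5, (0, '('): 7, (0, 'E'): 1, (0, 'T'): 6,
    (1, '+'): 3, (1, '$'): 2,
    (3, 'i'): 5, (3, '('): 7, (3, 'T'): 4,
    (7, 'i'): 5, (7, '('): 7, (7, 'E'): 8, (7, 'T'): 6,
    (8, '+'): 3, (8, ')'): 9,
}
# Reduce states: (lhs, length of rhs); the rule fires regardless of the token.
REDUCE = {4: ('E', 3), 5: ('T', 1), 6: ('E', 1), 9: ('T', 3)}
ACCEPT = 2  # the state containing Z -> E $ .

def parse(tokens):
    """Return True iff tokens (ending with '$') derive from Z."""
    stack = [('', 0)]          # pairs (symbol, state); initially just q0
    pos = 0
    while True:
        state = stack[-1][1]
        if state == ACCEPT:
            return True
        if state in REDUCE:    # reduce move: pop rhs, push lhs, consult GOTO
            lhs, n = REDUCE[state]
            del stack[-n:]
            stack.append((lhs, GOTO[(stack[-1][1], lhs)]))
        else:                  # shift move: consume a token, push its state
            if pos >= len(tokens) or (state, tokens[pos]) not in GOTO:
                return False   # blank (error) entry: input not in the language
            stack.append((tokens[pos], GOTO[(state, tokens[pos])]))
            pos += 1
```

Running parse(list('i+i$')) reproduces exactly the shift/reduce sequence traced on the following slides.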

Parsing id+id$

Grammar: (1) S → E $  (2) E → T  (3) E → E + T  (4) T → id  (5) T → ( E )

goto/action table (rn = reduce using rule number n; sm = shift to state m; gk = goto state k):

State   id   +    (    )    $    E    T
0       s5        s7             g1   g6
1            s3             s2
2       acc
3       s5        s7                  g4
4       r3
5       r4
6       r2
7       s5        s7             g8   g6
8            s3        s9
9       r5

Initialize with state 0; the stack grows to the right.

Stack              Input        Action
0                  id + id $    s5
0 id 5             + id $       r4   (pop id 5; push T; GOTO(0, T) = 6)
0 T 6              + id $       r2   (pop T 6; push E; GOTO(0, E) = 1)
0 E 1              + id $       s3
0 E 1 + 3          id $         s5
0 E 1 + 3 id 5     $            r4
0 E 1 + 3 T 4      $            r3   (pop T 4, + 3, E 1; push E; GOTO(0, E) = 1)
0 E 1              $            s2   (shift $; state 2 accepts by rule 1)

States and LR(0) items. Grammar: E → E * B | E + B | B;  B → 0 | 1.
The state "remembers" the potential derivation rules, given the part that was already identified. For example, if we have already identified E, then the state remembers the two alternatives: (1) E → E * B, (2) E → E + B. Actually, we also remember where we are in each of them: (1) E → E ● * B, (2) E → E ● + B.
A derivation rule with a location marker is called an LR(0) item. A state is actually a set of LR(0) items, e.g., q13 = { E → E ● * B, E → E ● + B }.

LR(0) parser tables. Actions are determined by the topmost state. Two types of rows:
Shift row: tells which state to GOTO for the current token.
Reduce row: tells which rule to reduce by (independent of the current token); its GOTO entries are blank.

GOTO/ACTION table with an explicit action column, for the rules Z → E $, E → T, E → E + T, T → i, T → ( E ):

State   i    +    (    )    $    E    T    action
q0      s5        s7             s1   s6   shift
q1           s3             s2             shift
q2                                         Z → E$
q3      s5        s7                  s4   shift
q4                                         E → E+T
q5                                         T → i
q6                                         E → T
q7      s5        s7             s8   s6   shift
q8           s3        s9                  shift
q9                                         T → (E)

rn = reduce using rule number n; sm = shift to state m. Warning: the numbers mean different things!

Constructing an LR parsing table:
Construct a (determinized) transition diagram from the LR items.
If there are conflicts, stop.
Fill the table entries from the diagram.

N  αβ LR item Already matched To be matched Input Hypothesis about αβ being a possible handle, so far we’ve matched α, expecting to see β

LR(0) automaton example. Item sets (shift states contain shift items; reduce states contain a single reduce item):

q0 = { Z → ●E$, E → ●T, E → ●E+T, T → ●i, T → ●(E) }
q1 = { Z → E●$, E → E●+T }
q2 = { Z → E$● }
q3 = { E → E+●T, T → ●i, T → ●(E) }
q4 = { E → E+T● }
q5 = { T → i● }
q6 = { E → T● }
q7 = { T → (●E), E → ●T, E → ●E+T, T → ●i, T → ●(E) }
q8 = { T → (E●), E → E●+T }
q9 = { T → (E)● }

Transitions: q0 --T--> q6, q0 --i--> q5, q0 --(--> q7, q0 --E--> q1; q1 --+--> q3, q1 --$--> q2; q3 --T--> q4, q3 --i--> q5, q3 --(--> q7; q7 --E--> q8, q7 --T--> q6, q7 --i--> q5, q7 --(--> q7; q8 --+--> q3, q8 --)--> q9.
Reduce states: q2, q4, q5, q6, q9.

Computing item sets.
Initial set: ε-closure({ Z → ●α | Z → α is in the grammar }), where Z is the start symbol.
Next set from a set S and the next symbol X:
step(S, X) = { N → αX●β | N → α●Xβ is in the item set S }
nextSet(S, X) = ε-closure(step(S, X))

Operations for transition diagram construction:
Initial = { S' → ●S$ }
For an item set I:
Closure(I) is the smallest set containing I such that if N → α●Xβ is in Closure(I) and X → µ is in the grammar, then X → ●µ is in Closure(I).
Goto(I, X) = { N → αX●β | N → α●Xβ is in I }

Initial example. Grammar: (1) S → E $  (2) E → T  (3) E → E + T  (4) T → id  (5) T → ( E ).
Initial = { S → ●E$ }

Closure example. Grammar: (1) Z → E $  (2) E → T  (3) E → E + T  (4) T → id  (5) T → ( E ).
Initial = { Z → ●E$ }
Closure({ Z → ●E$ }) = { Z → ●E$, E → ●T, E → ●E+T, T → ●id, T → ●(E) }

Goto example. Grammar: (1) Z → E $  (2) E → T  (3) E → E + T  (4) T → id  (5) T → ( E ).
Closure({ Z → ●E$ }) = { Z → ●E$, E → ●T, E → ●E+T, T → ●id, T → ●(E) }
Goto({ Z → ●E$, E → ●E+T, T → ●id }, E) = { Z → E●$, E → E●+T }

Constructing the transition diagram:
Start with state 0 containing the item set Closure({ S → ●E$ }).
Repeat until no new states are discovered: for every state p containing item set Ip and symbol N, compute the state q containing item set Iq = Closure(Goto(Ip, N)).
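This worklist construction can be sketched directly. The following is an illustrative implementation (the function names are mine) that computes the LR(0) item sets for the running example grammar; items are encoded as (rule index, dot position) pairs.

```python
# LR(0) items are pairs (rule_index, dot_position).
GRAMMAR = [('Z', ('E', '$')),      # rule 0
           ('E', ('T',)),          # rule 1
           ('E', ('E', '+', 'T')), # rule 2
           ('T', ('i',)),          # rule 3
           ('T', ('(', 'E', ')'))] # rule 4
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def closure(items):
    """Add N -> .mu for every nonterminal N appearing right after a dot."""
    items = set(items)
    work = list(items)
    while work:
        rule, dot = work.pop()
        rhs = GRAMMAR[rule][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for j, (lhs, _) in enumerate(GRAMMAR):
                if lhs == rhs[dot] and (j, 0) not in items:
                    items.add((j, 0))
                    work.append((j, 0))
    return frozenset(items)

def goto_(state, sym):
    """Advance the dot over sym in every item that allows it."""
    moved = {(r, d + 1) for r, d in state
             if d < len(GRAMMAR[r][1]) and GRAMMAR[r][1][d] == sym}
    return closure(moved) if moved else None

def build_states():
    """Worklist construction of the determinized transition diagram."""
    symbols = {s for _, rhs in GRAMMAR for s in rhs}
    start = closure({(0, 0)})
    states, work = {start}, [start]
    while work:
        state = work.pop()
        for sym in symbols:
            nxt = goto_(state, sym)
            if nxt and nxt not in states:
                states.add(nxt)
                work.append(nxt)
    return states
```

For this grammar, build_states() yields exactly ten item sets, matching states q0 through q9 of the automaton shown earlier.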


Automaton construction example. Grammar: (1) Z → E $  (2) E → T  (3) E → E + T  (4) T → id  (5) T → ( E ).
Initialize: q0 = { Z → ●E$ }.

Automaton construction example, continued. Apply Closure:
q0 = { Z → ●E$, E → ●T, E → ●E+T, T → ●i, T → ●(E) }

Automaton construction example, continued. One expansion step from q0:
q0 --T--> q6 = { E → T● }
q0 --i--> q5 = { T → i● }
q0 --(--> q7 = { T → (●E), E → ●T, E → ●E+T, T → ●i, T → ●(E) }
q0 --E--> q1 = { Z → E●$, E → E●+T }

Automaton construction example, completed. The construction continues until no new states appear, yielding states q0 through q9 with the transitions shown earlier. Reading off the parse table:
A non-terminal transition corresponds to a goto action in the parse table.
A terminal transition corresponds to a shift action in the parse table.
A state consisting of a single reduce item corresponds to a reduce action.

Are we done? We can make a transition diagram for any grammar, and a GOTO table for every grammar, but we cannot make a deterministic ACTION table for every grammar.

LR(0) conflicts: shift/reduce. Extend the grammar with an array-access rule T → i[E]:
Z → E $,  E → T,  E → E + T,  T → i,  T → ( E ),  T → i[E]
Now state q5 contains both T → i● (a reduce item) and T → i●[E] (a shift item): a shift/reduce conflict.

LR(0) conflicts: reduce/reduce. Instead add a rule V → i:
Z → E $,  E → T,  E → E + T,  T → i,  V → i,  T → ( E )
Now state q5 contains both T → i● and V → i●: a reduce/reduce conflict.

LR(0) conflicts: any grammar with an ε-rule cannot be LR(0). There is an inherent shift/reduce conflict: A → ● (from A → ε) is a reduce item, P → α●Aβ is a shift item, and A → ● is always predicted (added by closure) from P → α●Aβ, so both appear in the same state.

Conflicts. We can construct a diagram for every grammar, but some grammars introduce conflicts:
shift/reduce conflict: an item set contains at least one shift item and one reduce item.
reduce/reduce conflict: an item set contains two reduce items.

LR variants:
LR(0): what we've seen so far.
SLR: removes infeasible reduce actions via FOLLOW-set reasoning.
LR(1): LR(0) with one lookahead token in the items.
LALR(1): LR(1) with merging of states that have the same LR(0) component.

LR(0) GOTO/ACTION tables. The GOTO table is indexed by a state and a grammar symbol from the stack; the ACTION table is determined only by the state, and ignores the input (see the GOTO/ACTION table above).

SLR parsing. A handle should not be reduced to a non-terminal N if the lookahead is a token that cannot follow N: a reduce item N → α● is applicable only when the lookahead is in FOLLOW(N). If b is not in FOLLOW(N), then there is no derivation S ⇒* βNb…, so it is safe to remove the reduce action for lookahead b from the conflicted state. SLR differs from LR(0) only in the ACTION table: a row may now contain both shift actions and reduce actions, and we consult the current token to decide which one to take.
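The FOLLOW sets this rule relies on can be computed with a standard fixpoint. A sketch for the running grammar follows; it assumes the grammar has no ε-productions (true here), which keeps FIRST of a sequence equal to FIRST of its first symbol.

```python
GRAMMAR = [('Z', ('E', '$')), ('E', ('T',)), ('E', ('E', '+', 'T')),
           ('T', ('i',)), ('T', ('(', 'E', ')'))]
NTS = {lhs for lhs, _ in GRAMMAR}

def first_sets():
    """FIRST(N) for every nonterminal (no epsilon-productions assumed)."""
    first = {n: set() for n in NTS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            f = first[rhs[0]] if rhs[0] in NTS else {rhs[0]}
            if not f <= first[lhs]:
                first[lhs] |= f
                changed = True
    return first

def follow_sets():
    """FOLLOW(N): tokens that may appear right after N in some derivation."""
    first = first_sets()
    follow = {n: set() for n in NTS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            for i, sym in enumerate(rhs):
                if sym not in NTS:
                    continue
                if i + 1 < len(rhs):       # N is followed by rhs[i+1]
                    nxt = rhs[i + 1]
                    f = first[nxt] if nxt in NTS else {nxt}
                else:                      # N at the end: inherit FOLLOW(lhs)
                    f = follow[lhs]
                if not f <= follow[sym]:
                    follow[sym] |= f
                    changed = True
    return follow
```

Here FOLLOW(E) = FOLLOW(T) = { +, ), $ }, so the reduce items E → T● and T → i● apply only on those lookaheads.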

SLR action table. The action is now selected by the current state together with a one-token lookahead from the input, whereas LR(0) uses no lookahead. In the conflicted state containing both T → i● and T → i●[E], the SLR table shifts on lookahead [ and reduces by T → i only on the tokens in FOLLOW(T); in the other states the LR(0) actions (shift, accept, E → E+T, T → i, E → T, T → (E)) are unchanged.

LR(1) grammars. In SLR, a reduce item N → α● is applicable only when the lookahead is in FOLLOW(N). But FOLLOW(N) merges the lookaheads of all alternatives for N, everywhere N occurs: it is insensitive to the context of a given production. LR(1) keeps a lookahead with each LR item; the idea is a more refined notion of "follows", computed per item.

LR(1) items. An LR(1) item is a pair: an LR(0) item and a lookahead token. Meaning: we matched the part left of the dot, and we look to match the part right of the dot, followed by the lookahead token.
Example grammar: (0) S' → S  (1) S → L = R  (2) S → R  (3) L → * R  (4) L → id  (5) R → L.
The production L → id yields the LR(0) items [L → ● id] and [L → id ●], and the LR(1) items
[L → ● id, *], [L → ● id, =], [L → ● id, id], [L → ● id, $],
[L → id ●, *], [L → id ●, =], [L → id ●, id], [L → id ●, $].

LR(1) items: reduce only if the expected lookahead matches the input. For example, [L → id ●, =] will be used only if the next input token is =.

LALR(1). LR(1) tables have a huge number of entries, and we often don't need such a refined observation (or its cost). Idea: find states with the same LR(0) component and merge their lookahead components, as long as no conflicts arise. LALR(1) is not as powerful as LR(1) in theory, but works quite well in practice. Merging cannot introduce new shift/reduce conflicts, only reduce/reduce ones, which are unlikely in practice.

Summary

LR is more powerful than LL. Any LL(k) language is also LR(k), i.e., LL(k) ⊂ LR(k). LR is more popular in automatic tools, but less intuitive. The lookahead is also counted differently in the two cases: in an LL(k) derivation the algorithm sees the left-hand side of the rule plus k input tokens and must then select the derivation rule, whereas in LR(k) the algorithm "sees" the entire right-hand side of the rule plus k input tokens and then reduces. LR(0) sees the entire right-hand side, but no input token.

Using tools to parse and create an AST (a CUP grammar; the action code is Java):

terminal Integer NUMBER;
terminal PLUS, MINUS, MULT, DIV;
terminal LPAREN, RPAREN;
terminal UMINUS;
nonterminal Integer expr;
precedence left PLUS, MINUS;
precedence left DIV, MULT;
precedence left UMINUS;
%%
expr ::= expr:e1 PLUS expr:e2
         {: RESULT = new Integer(e1.intValue() + e2.intValue()); :}
       | expr:e1 MINUS expr:e2
         {: RESULT = new Integer(e1.intValue() - e2.intValue()); :}
       | expr:e1 MULT expr:e2
         {: RESULT = new Integer(e1.intValue() * e2.intValue()); :}
       | expr:e1 DIV expr:e2
         {: RESULT = new Integer(e1.intValue() / e2.intValue()); :}
       | MINUS expr:e1 %prec UMINUS
         {: RESULT = new Integer(0 - e1.intValue()); :}
       | LPAREN expr:e1 RPAREN
         {: RESULT = e1; :}
       | NUMBER:n
         {: RESULT = n; :}
       ;

Grammar hierarchy: LR(0) ⊂ SLR(1) ⊂ LALR(1) ⊂ LR(1) ⊂ non-ambiguous CFGs, with LL(1) ⊂ LR(1).

Compilation: Semantic Analysis. Noam Rinetzky

You are here: the compiler pipeline. Processing text input (done ✓): Source text (txt) → Lexical Analysis (characters → tokens) → Syntax Analysis / Parsing (tokens → AST) → Semantic Analysis (AST → annotated AST). Back End: Intermediate code generation (IR) → Intermediate code optimization (IR) → Code generation (symbolic instructions) → Target code optimization (SI) → Machine code generation (MI) → write executable output (exe).

Oops. Can the parser help with these?
  int x; xx = 0;            (xx is undeclared)
  int x, y; z = x + 1;      (z is undeclared)
  main() { break; }         (break outside a loop)
  int x; print(x)           (x used before initialization)

0 or 1 – this is the question. Can the parser help?
  int x = 0;
  p() { print(x) }
  q() { int x = 1; p() }

We want help: semantic (context) analysis to the rescue. Syntax Analysis produces the AST, which Semantic Analysis then processes.


Abstract Syntax Tree. The AST is a simplification of the parse tree. It can be built by traversing the parse tree (e.g., using visitors), or built directly during parsing by adding an action to perform on each production rule, similarly to the way a parse tree is constructed.

Abstract Syntax Tree. The AST is the interface between the parser and the rest of the compiler: a separation of concerns that keeps the design reusable, modular and extensible. The AST is defined by a context-free grammar, and that grammar can be ambiguous (E → E + E). Is this a problem? We also keep syntactic information in the AST. Why?

What we want. Source: Potato potato; Carrot carrot; x = tomato + potato + carrot
Lexical analyzer: … <id,tomato>, <PLUS>, <id,potato>, <PLUS>, <id,carrot>, EOF
Parser: an AST of nested AddExpr nodes (left/right children) over LocationExpr leaves (id=tomato, id=potato, id=carrot), plus a symbol table mapping each identifier to its kind and type properties (x: var, type unknown; potato: Potato; carrot: Carrot).
Desired error reports: 'tomato' is undefined; 'potato' used before initialized; cannot add Potato and Carrot.

Context Analysis. Check properties of the contexts in which constructs occur:
Properties that cannot be formulated via a CFG: type checking; declare before use; identifying the same word "w" re-appearing, as in wbw; initialization; …
Properties that are hard to formulate via a CFG: "break" only appears inside a loop.
Both are handled by processing the AST.

Context Analysis has two parts.
Identification: gather information about each named item in the program, e.g., what is the declaration for each usage.
Context checking: e.g., type checking, such as requiring the condition in an if-statement to be a Boolean.

Identification
  month : integer RANGE [1..12];
  month := 1;
  while (month <= 12) {
    print(month_name[month]);
    month := month + 1;
  }

Identification, continued. For the same example: what about forward references? And what about languages that don't require declarations?

Symbol table. A table containing information about the identifiers in the program, with a single entry for each named item. For the month example:

name         pos   type …
month        1     RANGE[1..12]
month_name   …     …

Not so fast…
  struct one_int {
    int i;           /* a struct field named i */
  } i;               /* a struct variable named i */
  main() {
    i.i = 42;        /* assignment to the "i" field of struct "i" */
    int t = i.i;     /* reading the "i" field of struct "i" */
    printf("%d", t);
    {
      int i = 73;    /* an int variable named "i" */
      printf("%d", i);
    }
  }

Scopes. Typically stack-structured:
Scope entry: push a new empty scope element.
Scope exit: pop the scope element and discard its content.
Identifier declaration: the identifier is created inside the top scope.
Identifier lookup: search for the identifier top-down in the scope stack.
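The stack discipline above can be sketched as a small class. This is an illustrative sketch, not the course's implementation:

```python
class ScopedSymbolTable:
    """Stack-structured scopes: push on entry, pop on exit,
    declare into the top scope, look up top-down."""
    def __init__(self):
        self.scopes = [{}]                   # outermost scope at the bottom

    def enter_scope(self):
        self.scopes.append({})               # scope entry: push empty element

    def exit_scope(self):
        self.scopes.pop()                    # scope exit: discard its content

    def declare(self, name, info):
        if name in self.scopes[-1]:          # duplicate in the same scope
            raise KeyError(f"duplicate declaration of {name!r}")
        self.scopes[-1][name] = info

    def lookup(self, name):
        for scope in reversed(self.scopes):  # innermost declaration wins
            if name in scope:
                return scope[name]
        return None                          # undeclared identifier
```

With this table the struct example above behaves as expected: an inner int i shadows the outer variable i until its block is exited.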

Scope-structured symbol table. The scope stack is a stack of linked lists, one list per open scope, e.g. (top to bottom):
  level 3: "so" → "long"
  level 2: "and" → "thanks" → "x"
  level 1: "x" → "all"
  level 0: "the" → "fish" → "thanks"
(from a program like { int the=1; int fish=2; int thanks=3; … int x = 42; int all = 73; … }).

Scope and symbol table. A naive mapping from (scope × identifier) to properties makes lookup expensive. A better solution: a hash table over identifiers.

Hash-table based symbol table. Each identifier hashes to a chain of declaration records, one per scope that declares it, most recent scope first; each record holds the name, its properties, and the declaration. E.g., "x" → (decl at level 2) → (decl at level 1); "thanks" → (decl at level 2) → (decl at level 0); "so" → (decl at level 3).

Scope info with a hash table: the scope stack now holds just pointers to the corresponding records in the symbol table:
  level 3: Id.info("so"), Id.info("long")
  level 2: Id.info("and"), Id.info("thanks"), Id.info("x")
  level 1: Id.info("x"), Id.info("all")
  level 0: Id.info("the"), Id.info("fish"), Id.info("thanks")


Semantic checks.
Scope rules: use the symbol table to check that identifiers are defined before they are used, that there is no multiple definition of the same identifier, etc.
Type checking: check that the types in the program are consistent. How? Why?

Types. What is a type? Simplest answer: a set of values plus the allowed operations; integers, real numbers, booleans, …
Why do we care?
Code generation: in $1 := $1 + $2, the type determines which add instruction to emit.
Safety: guarantee that certain errors cannot occur at runtime.
Abstraction: hide implementation details.
Documentation.
Optimization.

Type System (textbook definition) “A type system is a tractable syntactic method for proving the absence of certain program behaviors by classifying phrases according to the kinds of values they compute” -- Types and Programming Languages by Benjamin C. Pierce

Type system. A type system of a programming language is a way to define which programs "behave well": good programs are well-typed; bad programs are not well typed.
Type checking: static typing does most checking at compile time; dynamic typing does most checking at runtime.
Type inference: automatically infer types for a program (or show that no valid typing exists).

Static typing vs. dynamic typing. Static type checking is conservative: any program determined to be well-typed is free from certain kinds of errors, but programs that cannot be statically determined to be well typed may be rejected. Dynamic type checking may accept more programs as valid (it has runtime information), but errors are not caught at compile time, and checking has a runtime cost.

Type checking. Type rules specify which types can be combined with each operator, what may be assigned to a variable, and how the formal and actual parameters of a method call must relate.
Examples: string + string = string, as in "drive" + "drink"; int + string is an ERROR, as in 42 + "the answer".

Type checking rules specify, for each operator, the types of the operands and the type of the result.
Basic types are the building blocks of the type system, e.g., int, boolean, (sometimes) string.
Type expressions build on them: array types, function types, record types / classes.

Typing rules. "If E1 has type int and E2 has type int, then E1 + E2 has type int" is written:

  E1 : int    E2 : int
  --------------------
     E1 + E2 : int

More typing rules (examples):

  true : boolean        false : boolean
  int-literal : int     string-literal : string

  E1 : int   E2 : int   op ∈ { +, -, /, *, % }
  --------------------------------------------
               E1 op E2 : int

  E1 : int   E2 : int   rop ∈ { <=, <, >, >= }
  --------------------------------------------
             E1 rop E2 : boolean

  E1 : T   E2 : T   rop ∈ { ==, != }
  ----------------------------------
        E1 rop E2 : boolean

And even more typing rules:

  E1 : boolean   E2 : boolean   lop ∈ { &&, || }
  ----------------------------------------------
             E1 lop E2 : boolean

  E1 : int        E1 : boolean
  -----------     --------------
  - E1 : int      ! E1 : boolean

  E1 : T[]            E1 : T[]   E2 : int     E1 : int
  ---------------     ------------------     ----------------
  E1.length : int         E1[E2] : T         new T[E1] : T[]

Type checking. Traverse the AST and assign types to AST nodes, using the typing rules to compute each node's type. Alternative: type-check during parsing; this is more complicated, but naturally also more efficient.
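A bottom-up AST traversal with the rules above can be sketched as follows. AST nodes are encoded here as plain tuples (my own encoding, not the course's AST classes):

```python
INT, BOOL = 'int', 'boolean'
ARITH = {'+', '-', '*', '/', '%'}
REL   = {'<', '<=', '>', '>='}
EQ    = {'==', '!='}
LOGIC = {'&&', '||'}

def typecheck(node):
    """Return the type of an expression AST, or raise TypeError.
    Nodes: ('int', n) | ('bool', b) | ('unop', op, e) | ('binop', op, e1, e2)."""
    kind = node[0]
    if kind == 'int':
        return INT                             # int-literal : int
    if kind == 'bool':
        return BOOL                            # true / false : boolean
    if kind == 'unop':
        op, t = node[1], typecheck(node[2])
        if op == '-' and t == INT:  return INT
        if op == '!' and t == BOOL: return BOOL
    if kind == 'binop':
        op, t1, t2 = node[1], typecheck(node[2]), typecheck(node[3])
        if op in ARITH and t1 == t2 == INT:  return INT
        if op in REL   and t1 == t2 == INT:  return BOOL
        if op in EQ    and t1 == t2:         return BOOL
        if op in LOGIC and t1 == t2 == BOOL: return BOOL
    raise TypeError(f"ill-typed node: {node!r}")

# 45 > 32 && !false, as on the next slide: the whole tree checks to boolean.
expr = ('binop', '&&',
        ('binop', '>', ('int', 45), ('int', 32)),
        ('unop', '!', ('bool', False)))
```

Each node's type is computed from its children's types, so the traversal is a single post-order pass.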

Example: type-checking 45 > 32 && !false, bottom-up over the AST.
The literals: 45 : int and 32 : int (int-literal : int), and false : boolean.
By the relational rule (rop ∈ { <=, <, >, >= }), the BinopExpr with op=GT gives 45 > 32 : boolean.
By the negation rule, the UnopExpr with op=NEG gives !false : boolean.
By the logical rule (lop ∈ { &&, || }), the BinopExpr with op=AND gives the whole expression type boolean.

Type declarations. So far, we ignored the fact that types can also be declared:
  TYPE Int_Array = ARRAY [Integer 1..42] OF Integer;   (explicitly)
  Var a : ARRAY [Integer 1..42] OF Real;               (anonymously)

Type declarations: an anonymous type gets an internal name, so
  Var a : ARRAY [Integer 1..42] OF Real;
becomes
  TYPE #type01_in_line_73 = ARRAY [Integer 1..42] OF Real;
  Var a : #type01_in_line_73;

Forward references:
  TYPE Ptr_List_Entry = POINTER TO List_Entry;
  TYPE List_Entry = RECORD
    Element : Integer;
    Next : Ptr_List_Entry;
  END RECORD;
Forward references must be resolved: a forward reference is added to the symbol table marked as such, and updated later when the type declaration is met. At the end of the scope, we must check that all forward references have been resolved. A check for circularity must also be added.
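This bookkeeping can be sketched with a tiny class (an illustrative sketch; the names are mine):

```python
class TypeScope:
    """Tracks type declarations in one scope; forward references are
    recorded as unresolved and must be resolved by the end of the scope."""
    def __init__(self):
        self.types = {}                      # name -> definition, or None

    def forward(self, name):
        self.types.setdefault(name, None)    # referenced before declared

    def define(self, name, definition):
        self.types[name] = definition        # resolves a forward reference

    def end_of_scope(self):
        unresolved = [n for n, d in self.types.items() if d is None]
        if unresolved:                       # the end-of-scope check
            raise NameError(f"unresolved forward references: {unresolved}")
```

POINTER TO List_Entry first records List_Entry as a forward reference; the later RECORD declaration resolves it. A circularity check on the resolved definitions would still be needed, as the slide notes.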

Type table. All the types in a compilation unit are collected in a type table. For each type, its entry contains: the type constructor (basic, record, array, pointer, …); size and alignment requirements, to be used later in code generation; and the types of its components (if applicable), e.g., the types of record fields.

Type equivalence: name equivalence.
  Type t1 = ARRAY[Integer] OF Integer;
  Type t2 = ARRAY[Integer] OF Integer;
t1 is not name-equivalent to t2.
  Type t3 = ARRAY[Integer] OF Integer;
  Type t4 = t3;
t3 is name-equivalent to t4.

Type equivalence: structural equivalence.
  Type t5 = RECORD c: Integer; p: POINTER TO t5; END RECORD;
  Type t6 = RECORD c: Integer; p: POINTER TO t6; END RECORD;
  Type t7 = RECORD c: Integer;
                   p: POINTER TO RECORD c: Integer; p: POINTER TO t5; END RECORD;
            END RECORD;
t5, t6 and t7 are all structurally equivalent.
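Checking structural equivalence over possibly cyclic types needs care: the checker must assume the pair currently being compared is equivalent while it examines the components, or it would recurse forever on t5's self-referential pointer. A sketch, with types encoded as dicts (my own encoding):

```python
def struct_equiv(a, b, assumed=None):
    """Structural equivalence; 'assumed' holds pairs currently under
    comparison, so cyclic types (like t5/t6 above) terminate."""
    if assumed is None:
        assumed = set()
    key = (id(a), id(b))
    if key in assumed:
        return True                       # coinductive assumption for cycles
    if a['kind'] != b['kind']:
        return False
    assumed.add(key)
    if a['kind'] == 'int':
        return True
    if a['kind'] == 'pointer':
        return struct_equiv(a['to'], b['to'], assumed)
    if a['kind'] == 'record':
        return (len(a['fields']) == len(b['fields']) and
                all(na == nb and struct_equiv(ta, tb, assumed)
                    for (na, ta), (nb, tb) in zip(a['fields'], b['fields'])))
    return False

# t5 and t6 from the slide: records containing a pointer back to themselves.
INT = {'kind': 'int'}
t5 = {'kind': 'record'}
t5['fields'] = [('c', INT), ('p', {'kind': 'pointer', 'to': t5})]
t6 = {'kind': 'record'}
t6['fields'] = [('c', INT), ('p', {'kind': 'pointer', 'to': t6})]
```

Name equivalence, by contrast, needs no recursion at all: two type names are equivalent only if they resolve to the same declaration.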

In practice Almost all modern languages use name equivalence

Coercions If we expect a value of type T1 at some point in the program, and find a value of type T2, is that acceptable? float x = 3.141; int y = x;

l-values and r-values. In dst := src, what are dst and src? dst is a memory location where the value should be stored; a "location" on the left of an assignment is called an l-value. src is a value; a "value" on the right of an assignment is called an r-value.

l-values and r-values (example): x := y + 1. Before: x at address 0x42 holds 73, y at address 0x47 holds 16. After: x holds 17, and y still holds 16. x is used as an l-value (a location); y is used as an r-value (its value is read).

l-values and r-values: what happens when found and expected differ?

              expected
  found       l-value   r-value
  l-value     -         deref
  r-value     error     -

An l-value found where an r-value is expected is dereferenced; an r-value found where an l-value is expected is an error.

So far: static correctness checking consists of identification and type checking. Identification matches applied occurrences of an identifier to its defining occurrence; the symbol table maintains this information. Type checking checks which type combinations are legal. Each node in the AST of an expression represents either an l-value (a location) or an r-value (a value).

How does this magic happen? We apparently need to go over the AST; how does this relate to the clean formalism of the parser?

Syntax-directed translation:
Semantic attributes: attributes attached to grammar symbols.
Semantic actions: how to update the attributes.
Together these form attribute grammars.

Attribute grammars
Attributes: every grammar symbol has attached attributes. Example: Expr.type
Semantic actions: every production rule can define how to assign values to attributes. Example:
Expr → Expr1 + Term
Expr.type = Expr1.type when (Expr1.type == Term.type); Error otherwise

Indexed symbols
Add indexes to distinguish repeated grammar symbols. This does not affect the grammar; the indexes are used only in semantic actions.
Expr → Expr + Term becomes Expr → Expr1 + Term

Example: float x,y,z

Production    Semantic Rule
D → T L       L.in = T.type
T → int       T.type = integer
T → float     T.type = float
L → L1, id    L1.in = L.in; addType(id.entry, L.in)
L → id        addType(id.entry, L.in)

(In the parse tree, T derives float; the in attribute carries float down the list of identifiers id1, id2, id3, and addType records each with type float.)

Attribute Evaluation
- Build the AST
- Fill attributes of terminals with values derived from their representation
- Execute the evaluation rules of the nodes to assign values until no new values can be assigned
In the right order, such that:
- No attribute value is used before it is available
- Each attribute gets a value only once

Dependencies
A semantic equation a = f(b1,…,bm) requires computation of b1,…,bm before the value of a can be determined.
The value of a depends on b1,…,bm; we write a → bi for each i.

Cycles
A cycle in the dependence graph means we may not be able to compute attribute values.
Example: E.s = T.i and T.i = E.s + 1. Each attribute depends on the other, so neither can ever be evaluated.
(AST with nodes E and T; the dependence graph has a cycle between E.s and T.i.)

Attribute Evaluation
- Build the AST
- Build the dependency graph
- Compute an evaluation order using topological ordering
- Execute the evaluation rules in that order
Works as long as there are no cycles.

Building the Dependency Graph
All semantic equations take the form
attr1 = func1(attr1.1, attr1.2, …)
attr2 = func2(attr2.1, attr2.2, …)
Actions with side effects use a dummy attribute.
Build a directed dependency graph G:
- For every attribute a of a node n in the AST, create a node n.a
- For every node n in the AST with a semantic action of the form b = f(c1, c2, …, ck), add edges of the form (ci, b)
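The graph construction and the topological evaluation can be sketched together. Equations are given as `attr: (function, [operands])` and attributes of terminals are supplied up front; this encoding, and the attribute names in the example, are assumptions of the sketch:

```python
from collections import deque

def evaluate(eqs, leaves):
    """Evaluate semantic equations in topological order (Kahn's algorithm)."""
    values = dict(leaves)                       # attributes of terminals
    # in-degree of each attribute: operands not yet available
    indeg = {a: sum(1 for b in ops if b not in values)
             for a, (_, ops) in eqs.items()}
    dependents = {}                             # the edges (ci, b) of the graph
    for a, (_, ops) in eqs.items():
        for b in ops:
            if b not in values:
                dependents.setdefault(b, []).append(a)
    ready = deque(a for a, d in indeg.items() if d == 0)
    while ready:
        a = ready.popleft()
        f, ops = eqs[a]
        values[a] = f(*[values[b] for b in ops])   # run the semantic rule
        for c in dependents.get(a, []):
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    if len(values) != len(leaves) + len(eqs):      # some attribute never became ready
        raise ValueError("cycle in the dependence graph")
    return values

# The float-declaration example: T.type flows into each L.in down the list.
rules = {
    "L1.in": (lambda t: t, ["T.type"]),
    "L2.in": (lambda t: t, ["L1.in"]),
}
```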

Example

Production    Semantic Rule
D → T L       L.in = T.type
T → int       T.type = integer
T → float     T.type = float
L → L1, id    L1.in = L.in; addType(id.entry, L.in)
L → id        addType(id.entry, L.in)

Convention: add dummy variables for side effects.

Production    Semantic Rule
D → T L       L.in = T.type
T → int       T.type = integer
T → float     T.type = float
L → L1, id    L1.in = L.in; L.dmy = addType(id.entry, L.in)
L → id        L.dmy = addType(id.entry, L.in)

Example: float x,y,z

Prod.         Semantic Rule
D → T L       L.in = T.type
T → int       T.type = integer
T → float     T.type = float
L → L1, id    L1.in = L.in; addType(id.entry, L.in)
L → id        addType(id.entry, L.in)

(Dependency graph over the AST: T.type feeds the in attribute of each L node; each L node also carries a dmy attribute for the addType side effect, fed by its in attribute and the entry attribute of id1, id2, id3.)

Topological Ordering
For a graph G = (V, E) with |V| = k: an ordering of the nodes v1, v2, …, vk such that for every edge (vi, vj) ∈ E, i < j.
Example topological orderings for the graph shown: 1 4 3 2 5, and 4 1 3 5 2.

Example: float x,y,z
(Dependency graph for the declaration, annotated with one evaluation order, 1..10: the float type flows from T.type into the in attribute of each L node, and each addType dummy fires once its in attribute and the corresponding entry attribute are available.)

But what about cycles?
For a given attribute grammar it is hard to detect whether it has cyclic dependencies: the test has exponential cost.
Solution: special classes of attribute grammars. Our "usual trick": sacrifice generality for predictable performance.

Inherited vs. Synthesized Attributes
Synthesized attributes: computed from the children of a node.
Inherited attributes: computed from the parent and siblings of a node.
Attributes of tokens are technically considered synthesized attributes.

Example: float x,y,z
Same grammar as before: T.type is synthesized (computed from the token below T), while L.in is inherited (passed down from D through the list of identifiers id1, id2, id3).

S-attributed Grammars
- A special class of attribute grammars that uses only synthesized attributes (hence S-attributed); no inherited attributes.
- Can be computed by any bottom-up parser during parsing.
- Attributes can be stored on the parsing stack; a reduce operation computes the (synthesized) attribute of the parent from the attributes of its children.

S-attributed Grammar: example

Production    Semantic Rule
S → E ;       print(E.val)
E → E1 + T    E.val = E1.val + T.val
E → T         E.val = T.val
T → T1 * F    T.val = T1.val * F.val
T → F         T.val = F.val
F → ( E )     F.val = E.val
F → digit     F.val = digit.lexval

example
Parse tree for 7 * 4 + 3 ;. The lexval attributes 7, 4, 3 of the digits propagate up as F.val and T.val; the reduce by T → T1 * F computes 28 (= 7 * 4), and the reduce by E → E1 + T computes the root value 31 (= 28 + 3).
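The evaluation above can be mimicked with an explicit value stack, the way a bottom-up parser carries synthesized attributes: each shift pushes a token's value and each reduce by a binary rule pops the children's attributes and pushes the parent's. Unit reductions (E → T, T → F, F → digit) leave the stack unchanged and are omitted. The driver simply replays the reduction sequence for 7 * 4 + 3; a real parser would derive that sequence from its LR tables.

```python
def run(steps):
    stack = []                         # the parser's value stack
    for op, arg in steps:
        if op == "shift":
            stack.append(arg)          # digit.lexval or an operator token
        elif op == "reduce2":          # E -> E1 + T  or  T -> T1 * F
            rhs = stack.pop()
            o = stack.pop()
            lhs = stack.pop()
            stack.append(lhs * rhs if o == "*" else lhs + rhs)
    return stack[0]

# Shift/reduce order for 7 * 4 + 3, as a bottom-up parser would perform it:
steps = [("shift", 7), ("shift", "*"), ("shift", 4), ("reduce2", None),
         ("shift", "+"), ("shift", 3), ("reduce2", None)]
print(run(steps))  # 31
```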

L-attributed grammars
An attribute grammar is L-attributed when every attribute in a production A → X1…Xn is
- a synthesized attribute, or
- an inherited attribute of Xj, 1 <= j <= n, that depends only on
  - attributes of X1…Xj-1, to the left of Xj, or
  - inherited attributes of A

Example: typesetting
Each box is built from smaller boxes, from which it gets its height and depth, and to which it sets the point size.
- point size (ps): size of the letters in a box; subscript text has the smaller point size 0.7 * ps
- height (ht): distance from the top of the box to the baseline
- depth (dp): distance from the baseline to the bottom of the box

Example: typesetting

Production      Semantic Rules
S → B           B.ps = 10
B → B1 B2       B1.ps = B.ps; B2.ps = B.ps; B.ht = max(B1.ht, B2.ht); B.dp = max(B1.dp, B2.dp)
B → B1 sub B2   B1.ps = B.ps; B2.ps = 0.7 * B.ps; B.ht = max(B1.ht, B2.ht - 0.25 * B.ps); B.dp = max(B1.dp, B2.dp - 0.25 * B.ps)
B → text        B.ht = getHt(B.ps, text.lexval); B.dp = getDp(B.ps, text.lexval)

Computing the attributes from left to right during a DFS traversal:

procedure dfvisit(n: node);
begin
  for each child m of n, from left to right
    evaluate inherited attributes of m;
    dfvisit(m)
  end;
  evaluate synthesized attributes of n
end
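The procedure translates almost directly to Python. The Box class below is a toy node for the typesetting grammar: ps is inherited (set by the parent before the recursive visit) and ht is synthesized (computed after all children are done); the hook names inherit/synthesize are assumptions of this sketch.

```python
def dfvisit(n):
    for m in n.children:      # left to right
        n.inherit(m)          # evaluate inherited attributes of m
        dfvisit(m)
    n.synthesize()            # evaluate synthesized attributes of n

class Box:
    def __init__(self, *children, ht=0):
        self.children = list(children)
        self.ps = None        # inherited: point size
        self.ht = ht          # synthesized: height (given for leaf boxes)

    def inherit(self, m):
        m.ps = self.ps        # B1.ps = B.ps

    def synthesize(self):
        if self.children:     # B.ht = max over children; leaves keep their own
            self.ht = max(c.ht for c in self.children)

root = Box(Box(ht=5), Box(ht=8))
root.ps = 10                  # S -> B: B.ps = 10
dfvisit(root)
print(root.ht)                # 8
```

Note how inherited attributes are evaluated on the way down and synthesized ones on the way up, which is exactly why L-attributed grammars fit a single left-to-right pass.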

Summary
- Contextual analysis can move information between nodes in the AST, even when they are not "local"
- Attribute grammars: attach attributes and semantic actions to the grammar
- Attribute evaluation: build the dependency graph, sort topologically, evaluate
- Special classes with a pre-determined evaluation order: S-attributed, L-attributed

The End Front-end