1 241-437 Compilers: syntax/4
Compiler Structures
241-437, Semester 1, 2011-2012
4. Syntax Analysis
Objective
– describe general syntax analysis, grammars, parse trees, FIRST and FOLLOW sets

2 Overview
1. What is a Syntax Analyzer?
2. What is a Grammar?
3. Parse Trees
4. Types of CFG Parsing
5. Syntax Analysis Sets

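3 In this lecture
[Figure: the compiler phases. Source Program → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Int. Code Generator → Intermediate Code → Code Optimizer → Target Code Generator → Target Lang. Prog. The analyzers are labelled Front End; the Code Optimizer and Target Code Generator are labelled Back End. This lecture covers the Syntax Analyzer.]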

4 1. What is a Syntax Analyzer?
[Figure: the source text "if (a == 0) a = b;" goes through the Lexical Analyzer, and the Syntax Analyzer then builds a parse tree from the tokens: an IF node with two children, EQ (over a and 0) and ASSIGN (over a and b).]

5 Syntax Analyses that we do
- Identify the function of each word
- Recognize if a sentence is grammatically correct
[Figure: the sentence "I gave Jim the card" labelled with its grammar types / categories. I = pronoun (subject); gave = verb (action); Jim = proper noun (indirect object); the = article and card = noun, together a noun phrase (object); "gave Jim the card" is the verb phrase.]

6 Languages
We use a natural language to communicate
– its grammar rules are very complex
– the rules don't cover important things
We use a formal language to define a programming language
– its grammar rules are fairly simple
– the rules cover almost everything

7 2. What is a Grammar?
A grammar is a notation for defining a language, and is made from 4 parts:
– the terminal symbols
– the syntactic categories (nonterminal symbols), e.g. statement, expression, noun, verb
– the grammar rules (productions), e.g. A => B1 B2 ... Bn
– the starting nonterminal: the top-most syntactic category for this grammar

8 We define a grammar G as a 4-tuple: G = (T, N, P, S)
– T = terminal symbols
– N = nonterminal symbols
– P = productions/rules
– S = starting nonterminal

9 2.1. Example 1
Consider the grammar:
T = {0, 1}
N = {S, R}
P = { S => 0
      S => 0 R
      R => 1 S }
S is the starting nonterminal.
The right-hand sides of productions usually use a mix of terminals and nonterminals.
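One possible way to hold the 4-tuple G = (T, N, P, S) in a program is sketched below in Python. The `Grammar` class, its field names and the `is_well_formed` check are illustrative assumptions, not part of the lecture; the data is the grammar of Example 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    """Hypothetical container for G = (T, N, P, S)."""
    terminals: frozenset
    nonterminals: frozenset
    productions: tuple  # (head, body) pairs; body is a tuple of symbols
    start: str

    def is_well_formed(self):
        # T and N must be disjoint, S must be in N, and every production
        # must have a nonterminal head and a body built from known symbols.
        symbols = self.terminals | self.nonterminals
        return (
            not (self.terminals & self.nonterminals)
            and self.start in self.nonterminals
            and all(
                head in self.nonterminals and all(s in symbols for s in body)
                for head, body in self.productions
            )
        )


# Example 1: S => 0 | 0 R,  R => 1 S
G1 = Grammar(
    terminals=frozenset({"0", "1"}),
    nonterminals=frozenset({"S", "R"}),
    productions=(("S", ("0",)), ("S", ("0", "R")), ("R", ("1", "S"))),
    start="S",
)
print(G1.is_well_formed())
```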

10 Is "01010" in the language?
Start with an S rule:
Rule        String Generated
--          S
S => 0 R    0 R
R => 1 S    0 1 S
S => 0 R    0 1 0 R
R => 1 S    0 1 0 1 S
S => 0      0 1 0 1 0
No more rules can be applied since there are no more nonterminals left in the string.
Yes, it is in the language.
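The derivation of "01010" can be replayed mechanically. The sketch below is only an illustration: `apply_rule` is a hypothetical helper that rewrites the leftmost occurrence of a nonterminal, and the list `steps` is the rule sequence from the table.

```python
def apply_rule(form, head, body):
    """Replace the leftmost occurrence of nonterminal `head` in `form` by `body`."""
    i = form.index(head)  # raises ValueError if the rule cannot be applied
    return form[:i] + body + form[i + 1:]


# The rules used in the derivation, in order.
steps = [
    ("S", ["0", "R"]),
    ("R", ["1", "S"]),
    ("S", ["0", "R"]),
    ("R", ["1", "S"]),
    ("S", ["0"]),
]

form = ["S"]
for head, body in steps:
    form = apply_rule(form, head, body)
    print(f"{head} => {' '.join(body):<6} {' '.join(form)}")

result = "".join(form)
print(result == "01010")
```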

11 Example 2
Consider the grammar:
T = {a, b, c, d, z}
N = {S, R, U, V}
P = { S => R U z | z
      R => a | b R
      U => d V U | c
      V => b | c }
S is the starting nonterminal.

12 The notation:
X => Y | Z
is shorthand for the two rules:
X => Y
X => Z
Read '|' as 'or'.

13 Is "adbdbcz" in the language?
Rule          String Generated
--            S
S => R U z    R U z
R => a        a U z
U => d V U    a d V U z
V => b        a d b U z
U => d V U    a d b d V U z
V => b        a d b d b U z
U => c        a d b d b c z
Yes! This grammar has choices about how to rewrite the string.
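Because this grammar offers choices, a program has to search for a derivation that works. The following is a minimal brute-force sketch, not an efficient parser: it tries leftmost derivations breadth-first. The names `GRAMMAR` and `derives` are assumptions for illustration, and the length-based pruning is valid only because this grammar has no ε productions.

```python
from collections import deque

# Example 2: each nonterminal maps to its alternative bodies.
GRAMMAR = {
    "S": [["R", "U", "z"], ["z"]],
    "R": [["a"], ["b", "R"]],
    "U": [["d", "V", "U"], ["c"]],
    "V": [["b"], ["c"]],
}


def derives(target, start="S"):
    """Return True if `target` can be derived from `start` (leftmost search)."""
    target = list(target)
    queue = deque([(start,)])
    seen = {(start,)}
    while queue:
        form = queue.popleft()
        # Position of the leftmost nonterminal, or None if there is none.
        idx = next((i for i, sym in enumerate(form) if sym in GRAMMAR), None)
        if idx is None:
            if list(form) == target:
                return True
            continue
        # Prune: too long, or the terminals generated so far do not match.
        if len(form) > len(target) or list(form[:idx]) != target[:idx]:
            continue
        for body in GRAMMAR[form[idx]]:
            new_form = form[:idx] + tuple(body) + form[idx + 1:]
            if new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return False


print(derives("adbdbcz"))
```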

14 Example 3: Sums
The grammar:
T = {+, -, 0, 1, 2, 3, ..., 9}
N = {L, D}
P = { L => L + D | L - D | D
      D => 0 | 1 | 2 | ... | 9 }
L is the starting nonterminal.
e.g. 5 + 6 - 2

15 Example 4: Brackets
The grammar:
T = { '(', ')' }
N = {L}
P = { L => '(' L ')' L
      L => ε }
L is the starting nonterminal.
ε means 'nothing'.
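The brackets grammar maps directly onto a small recursive recognizer. This is a sketch under the assumption that seeing '(' selects L => '(' L ')' L and anything else selects L => ε; the function names are hypothetical.

```python
def parse_L(s, i=0):
    """Match one L starting at index i and return the index just after it."""
    if i < len(s) and s[i] == "(":
        # L => '(' L ')' L
        j = parse_L(s, i + 1)
        if j < len(s) and s[j] == ")":
            return parse_L(s, j + 1)
        raise SyntaxError(f"')' expected at position {j}")
    # L => ε : match nothing
    return i


def balanced(s):
    """True if the whole string s is in the language of the brackets grammar."""
    try:
        return parse_L(s) == len(s)
    except SyntaxError:
        return False


for text in ["", "()", "(())()", "(()", ")("]:
    print(repr(text), balanced(text))
```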

16 2.2. Derivations
A sequence of the form:
w0 ⇒ w1 ⇒ … ⇒ wn
is a derivation of wn from w0 (written w0 ⇒* wn).
Example:
L
⇒ ( L ) L      rule L => ( L ) L
⇒ ( ) L        rule L => ε
⇒ ( )          rule L => ε
so L ⇒* ( )
This means that the sentence ( ) can be derived from L.

17 L
⇒ ( L ) L              rule L => ( L ) L
⇒ ( L ) ( L ) L        rule L => ( L ) L
⇒ ( L ) ( L )          rule L => ε
⇒ ( ( L ) L ) ( L )    rule L => ( L ) L
⇒ ( ( ) L ) ( L )      rule L => ε
⇒ ( ( ) L ) ( )        rule L => ε
⇒ ( ( ) ) ( )          rule L => ε
so L ⇒* ( ( ) ) ( )

18 2.3. Kinds of Grammars
There are 4 main kinds of grammar, of increasing expressive power:
– regular (type 3) grammars
– context-free (type 2) grammars
– context-sensitive (type 1) grammars
– unrestricted (type 0) grammars
They vary in the kinds of productions they allow.

19 Regular Grammars
Every production is of the form:
A => a | a B | ε
– A, B are nonterminals, a is a terminal
These are sometimes called right linear rules because if a nonterminal appears in the rule body, then it must appear last.
Regular grammars are equivalent to REs.
Example:
S => w T
T => x T
T => a
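To illustrate the equivalence with REs: assuming S is the start symbol of the three example rules, the grammar generates a w, then any number of x's, then an a, which is the RE `wx*a`. The sketch below compares Python's `re` module against a hand-written recognizer that follows the productions; `matches_grammar` and the state name "END" are illustrative assumptions.

```python
import re

# RE believed equivalent to: S => w T, T => x T, T => a
PATTERN = re.compile(r"wx*a")


def matches_grammar(s):
    """Follow the right-linear rules one character at a time."""
    state = "S"  # the nonterminal still to be expanded
    for ch in s:
        if state == "S" and ch == "w":    # S => w T
            state = "T"
        elif state == "T" and ch == "x":  # T => x T
            state = "T"
        elif state == "T" and ch == "a":  # T => a (no nonterminal left)
            state = "END"
        else:
            return False
    return state == "END"


for text in ["wa", "wxxa", "w", "wax"]:
    print(text, bool(PATTERN.fullmatch(text)), matches_grammar(text))
```

Each nonterminal acts like a state of a finite automaton, which is why right-linear rules and REs describe the same languages.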

20 Example
Integer => + UInt | - UInt | 0 Digits | 1 Digits | ... | 9 Digits
UInt => 0 Digits | 1 Digits | ... | 9 Digits
Digits => 0 Digits | 1 Digits | ... | 9 Digits | ε

21 Context-Free Grammars (CFGs)
Every production is of the form:
A => α
– A is a nonterminal, α is any sequence of nonterminals and terminals
The Syntax Analyzer uses CFGs.
Example:
A => a
A => a B c d
B => a e

22 2.4. REs for Syntax Analysis?
Why not use REs to describe the syntax of a programming language?
– they don't have enough power
Examples:
– nested blocks, if statements, balanced braces
We need the ability to 'count', which can be implemented with CFGs but not REs.

23 3. Parse Trees
A parse tree is a graphical way of showing how productions are used to generate a string.
The syntax analyzer creates a parse tree to store information about the program being compiled.

24 Example
The grammar:
T = { a, b }
N = { S }
P = { S => S S | a S b | a b | b a }
S is the starting nonterminal.

25 Parse Tree for "aabbba"
The root of the tree is the start symbol S.
Expand using S => S S: the root gets two children, S and S.
Expand the left S using S => a S b.
[Figure: the tree after each step; a circle marks the symbol to expand next.]

26 Expand the inner S using S => a b.
Expand the right S using S => b a.
[Figure: the tree after each step.]

27 Stop when there are no more nonterminals in leaf positions.
Read off the string by reading the leaves left to right: a a b b b a.

28 3.1. Ambiguity
Two (or more) parse trees for the same string.
E => E + E
E => E - E
E => 0 | … | 9
[Figure: two parse trees for 2 - 3 + 4. One has E + E at the root, grouping the string as (2 - 3) + 4; the other has E - E at the root, grouping it as 2 - (3 + 4).]

29 The two derivations:
E ⇒ E + E              E ⇒ E - E
  ⇒ E - E + E            ⇒ 2 - E
  ⇒ 2 - E + E            ⇒ 2 - E + E
  ⇒ 2 - 3 + E            ⇒ 2 - 3 + E
  ⇒ 2 - 3 + 4            ⇒ 2 - 3 + 4
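Ambiguity matters because the two trees mean different things. As a small illustration (the tuple encoding and the `evaluate` helper are assumptions, not part of the lecture), the two parse trees for 2 - 3 + 4 evaluate to different numbers:

```python
def evaluate(tree):
    """Evaluate a tree given as an int leaf or an (operator, left, right) tuple."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    lhs, rhs = evaluate(left), evaluate(right)
    return lhs + rhs if op == "+" else lhs - rhs


# First derivation: E => E + E at the root, i.e. (2 - 3) + 4
tree_a = ("+", ("-", 2, 3), 4)
# Second derivation: E => E - E at the root, i.e. 2 - (3 + 4)
tree_b = ("-", 2, ("+", 3, 4))

print(evaluate(tree_a))  # 3
print(evaluate(tree_b))  # -5
```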

30 Fixing Ambiguity
An ambiguous grammar can sometimes be made unambiguous:
E => E + T | E - T | T
T => 0 | … | 9
We'll look at some techniques in chapter 5.
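With the unambiguous grammar, only the left-grouped tree is possible. The sketch below is one way to see this, under stated assumptions: tokens are separated by spaces, '-' is the ASCII minus, and the left-recursive rule E => E + T | E - T is realised as a loop (a common implementation substitute, not the technique the lecture defers to chapter 5). `parse_expr` is a hypothetical name.

```python
def parse_expr(text):
    """Parse e.g. "2 - 3 + 4" into a nested (operator, left, right) tuple."""
    tokens = text.split()

    def parse_T(i):
        # T => 0 | ... | 9
        if i >= len(tokens) or tokens[i] not in list("0123456789"):
            raise SyntaxError(f"digit expected at token {i}")
        return int(tokens[i])

    tree = parse_T(0)  # E => T
    i = 1
    while i < len(tokens):
        # Each pass applies E => E + T or E => E - T; the tree built so far
        # becomes the left operand, so operators group to the left.
        op = tokens[i]
        if op not in ("+", "-"):
            raise SyntaxError(f"'+' or '-' expected at token {i}")
        tree = (op, tree, parse_T(i + 1))
        i += 2
    return tree


print(parse_expr("2 - 3 + 4"))
```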

31 4. Types of CFG Parsing
Top-down (chapter 5)
– recursive descent (predictive) parsing
– LL methods
Bottom-up (chapter 6)
– operator precedence parsing
– LR methods
– SLR, canonical LR, LALR

32 4.1. A Statement Block Grammar
The grammar:
T = {begin, end, simplestmt, ;}
N = {B, SS, S}
P = { B => begin SS end
      SS => S ; SS | ε
      S => simplestmt | begin SS end }
B is the starting nonterminal.

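33 Parse Tree
Input: begin simplestmt ; simplestmt ; end
B => begin SS end
SS => S ; SS
SS => ε
S => simplestmt
S => begin SS end
[Figure: the parse tree for the input. The root B covers begin SS end; the outer SS expands to S ; SS, the inner SS expands to S ; SS again, and the last SS is ε; each S is a simplestmt.]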

34-39 4.2. Top Down (LL) Parsing
Input: begin simplestmt ; simplestmt ; end
B => begin SS end
SS => S ; SS
SS => ε
S => simplestmt
S => begin SS end
[Figure sequence over six slides: the parse tree is grown from the root B down towards the input tokens, one production at a time. On the last slide the nonterminal nodes are numbered 1 to 6 in the order they were added, starting with B and ending with the SS that becomes ε.]
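Top-down parsing of the statement block grammar can be sketched as a recursive descent parser with one method per nonterminal. This is an illustrative sketch, not the lecture's code: the `Parser` class, the space-separated token format, the tuple-shaped tree and the "$" end marker are all assumptions.

```python
class Parser:
    """Recursive descent parser for B, SS and S of the statement block grammar."""

    def __init__(self, tokens):
        self.tokens = tokens + ["$"]  # "$" marks the end of the input
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def expect(self, token):
        if self.peek() != token:
            raise SyntaxError(f"expected {token!r}, got {self.peek()!r}")
        self.pos += 1

    def parse_B(self):
        # B => begin SS end
        self.expect("begin")
        body = self.parse_SS()
        self.expect("end")
        return ("B", "begin", body, "end")

    def parse_SS(self):
        # SS => S ; SS  is chosen when the next token can start an S
        if self.peek() in ("simplestmt", "begin"):
            stmt = self.parse_S()
            self.expect(";")
            return ("SS", stmt, ";", self.parse_SS())
        return ("SS", "ε")  # SS => ε

    def parse_S(self):
        if self.peek() == "simplestmt":
            self.expect("simplestmt")  # S => simplestmt
            return ("S", "simplestmt")
        self.expect("begin")  # S => begin SS end
        body = self.parse_SS()
        self.expect("end")
        return ("S", "begin", body, "end")


def parse(text):
    parser = Parser(text.split())
    tree = parser.parse_B()
    parser.expect("$")  # the whole input must be consumed
    return tree


print(parse("begin simplestmt ; simplestmt ; end"))
```

The calls happen in the same root-first order as the tree is built in the figure: `parse_B`, then `parse_SS`, `parse_S`, and so on.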

40-45 4.3. Bottom-up (LR) Parsing
Input: begin simplestmt ; simplestmt ; end
B => begin SS end
SS => S ; SS
SS => ε
S => simplestmt
S => begin SS end
[Figure sequence over six slides: the parse tree is grown from the input tokens up towards the root, one production at a time. On the last slide the nonterminal nodes are numbered 1 to 6 in the order they were added, starting with the first S and ending with the root B.]

46 5. Syntax Analysis Sets
Syntax analyzers for top-down (LL) and bottom-up (LR) parsing utilize two types of sets:
– FIRST sets
– FOLLOW sets
These sets are generated from the programming language CFG.

47 5.1. The FIRST Sets
FIRST(A) = set of all terminals that start productions for the non-terminal A
Example:
S => ping
S => begin S end
FIRST(S) = { ping, begin }

48 More Mathematically
A is a non-terminal.
FIRST(A) = { c | A =>* c β, c is a terminal } ∪ { ε } if A =>* ε
– β is the rest of the terminals and nonterminals after 'c'

49 Building FIRST Sets
For each non-terminal A,
FIRST(A) = FIRST_SEQ(α) ∪ FIRST_SEQ(β) ∪ ...
for all productions A => α, A => β, ...
– α, β are the bodies of the productions

50 FIRST_SEQ()
FIRST_SEQ(ε) = { ε }
FIRST_SEQ(c β) = { c }, if c is a terminal
FIRST_SEQ(A β) = FIRST(A), if ε ∉ FIRST(A)
FIRST_SEQ(A β) = (FIRST(A) - {ε}) ∪ FIRST_SEQ(β), if ε ∈ FIRST(A)
– β is a sequence of terminals and non-terminals, and possibly empty
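The FIRST and FIRST_SEQ definitions can be turned into a small fixed-point computation: keep applying the rules until no set grows. The sketch below is one possible implementation, not the lecture's; the dictionary encoding of the grammar (an empty list stands for an ε body) and the function names are assumptions. The sample grammar is S => a S e | B, B => b B e | C, C => c C e | d.

```python
EPS = "ε"


def first_seq(seq, first, nonterminals):
    """FIRST_SEQ of a symbol sequence, using the FIRST table built so far."""
    result = set()
    for sym in seq:
        if sym not in nonterminals:
            result.add(sym)            # FIRST_SEQ(c β) = { c }
            return result
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:      # FIRST_SEQ(A β) = FIRST(A)
            return result
        # ε is in FIRST(A): carry on with the rest of the sequence
    result.add(EPS)                    # empty sequence, or every symbol can vanish
    return result


def first_sets(grammar):
    """grammar maps each nonterminal to a list of bodies (lists of symbols)."""
    first = {a: set() for a in grammar}
    changed = True
    while changed:                     # repeat until no FIRST set grows
        changed = False
        for a, bodies in grammar.items():
            for body in bodies:
                new = first_seq(body, first, grammar.keys())
                if not new <= first[a]:
                    first[a] |= new
                    changed = True
    return first


G = {
    "S": [["a", "S", "e"], ["B"]],
    "B": [["b", "B", "e"], ["C"]],
    "C": [["c", "C", "e"], ["d"]],
}
FIRST = first_sets(G)
for name in ("C", "B", "S"):
    print(name, sorted(FIRST[name]))
```

Iterating to a fixed point avoids having to pick a good order by hand (C, then B, then S); the loop simply repeats until the sets stop changing.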

51 FIRST() Example 1
S => a S e
S => B
B => b B e
B => C
C => c C e
C => d
FIRST(C) = {c,d}
FIRST(B) =
FIRST(S) =
Start with FIRST(C) since its rules only start with terminals.

52 FIRST(C) = {c,d}
FIRST(B) = {b,c,d}
FIRST(S) =
Do FIRST(B) now that we know FIRST(C).

53 FIRST(C) = {c,d}
FIRST(B) = {b,c,d}
FIRST(S) = {a,b,c,d}
Do FIRST(S) now that we know FIRST(B).

54 FIRST() Example 2
P => i | c | n T S
Q => P | a S | b S c S T
R => b | ε
S => c | R n | ε
T => R S q
FIRST(P) = {i,c,n}
FIRST(Q) =
FIRST(R) = {b,ε}
FIRST(S) =
FIRST(T) =
Start with P and R since their rules only start with terminals or ε.

55 FIRST(P) = {i,c,n}
FIRST(Q) = {i,c,n,a,b}
FIRST(R) = {b,ε}
FIRST(S) =
FIRST(T) =
Do FIRST(Q) now that we know FIRST(P).

56 FIRST(P) = {i,c,n}
FIRST(Q) = {i,c,n,a,b}
FIRST(R) = {b,ε}
FIRST(S) = {c,b,n,ε}
FIRST(T) =
Do FIRST(S) now that we know FIRST(R).
Note: S ⇒ R n ⇒ n because R ⇒* ε

57 FIRST(P) = {i,c,n}
FIRST(Q) = {i,c,n,a,b}
FIRST(R) = {b,ε}
FIRST(S) = {c,b,n,ε}
FIRST(T) = {b,c,n,q}
Do FIRST(T) now that we know FIRST(R) and FIRST(S).
Note: T ⇒ R S q ⇒ S q ⇒ q because both R and S ⇒* ε

58 FIRST() Example 3
S => a S e | S T S
T => R S e | Q
R => r S r | ε
Q => S T | ε
FIRST(S) = {a}
FIRST(T) = {r,a,ε}
FIRST(R) = {r,ε}
FIRST(Q) = {a,ε}
Order: 1) R, S 2) Q 3) T

59 5.2. The FOLLOW Sets
FOLLOW(A) =
– set of all the terminals that follow the non-terminal A in productions
– the set includes $ if nothing follows A

60 Example:
S => bing A bong | ping A pong | zing A
A => ha
FOLLOW(A) = { bong, pong, $ }

61 More Mathematically
A is a non-terminal.
FOLLOW(A) = { c in terminals | S =>+ ... A c ... } ∪ { $ } if S =>+ ... A
– '...' is a sequence of terminals and non-terminals
– =>+ is any number of => expansions

62 Building FOLLOW() Sets
To make the FOLLOW(A) set, apply rules 1-4:
1. for all productions (B => ... A β) add FIRST_SEQ(β) - {ε}
2. for all (B => ... A β) with ε ∈ FIRST_SEQ(β) add FOLLOW(B)
3. for all (B => ... A) add FOLLOW(B)
4. if A is the start symbol then add { $ }
– β is a sequence of terminals and non-terminals
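Rules 1-4 can also be applied repeatedly until no FOLLOW set grows. The following is a minimal sketch under the same assumptions as before (dictionary grammar encoding, an empty list for an ε body, hypothetical function names); it recomputes FIRST so that the block is self-contained. The sample grammar is S => T E1, E1 => + T E1 | ε, T => F T1, T1 => * F T1 | ε, F => ( S ) | id.

```python
EPS, END = "ε", "$"


def first_seq(seq, first, nonterminals):
    """FIRST_SEQ of a symbol sequence, using the given FIRST table."""
    result = set()
    for sym in seq:
        if sym not in nonterminals:
            result.add(sym)
            return result
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return result
    result.add(EPS)
    return result


def first_sets(grammar):
    first = {a: set() for a in grammar}
    changed = True
    while changed:
        changed = False
        for a, bodies in grammar.items():
            for body in bodies:
                new = first_seq(body, first, grammar)
                if not new <= first[a]:
                    first[a] |= new
                    changed = True
    return first


def follow_sets(grammar, start):
    first = first_sets(grammar)
    follow = {a: set() for a in grammar}
    follow[start].add(END)                      # rule 4
    changed = True
    while changed:                              # repeat until nothing grows
        changed = False
        for b, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in grammar:      # only nonterminals have FOLLOW
                        continue
                    rest = first_seq(body[i + 1:], first, grammar)
                    new = rest - {EPS}          # rule 1
                    if EPS in rest:             # rules 2 and 3 (β empty or nullable)
                        new |= follow[b]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


EXPR = {
    "S": [["T", "E1"]],
    "E1": [["+", "T", "E1"], []],
    "T": [["F", "T1"]],
    "T1": [["*", "F", "T1"], []],
    "F": [["(", "S", ")"], ["id"]],
}
FOLLOW = follow_sets(EXPR, "S")
for name in EXPR:
    print(name, sorted(FOLLOW[name]))
```

Rule 3 needs no separate case here: when A is the last symbol, the remaining sequence β is empty, and FIRST_SEQ of an empty sequence is { ε }.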

63 Small Examples
What is in FOLLOW(A) for the productions:
B => A C
C => s
FOLLOW(A) gets FIRST_SEQ(C) == FIRST(C) == { s }
– uses rule 1

64 What is in FOLLOW(A) for the productions:
C => B r
B => t A
FOLLOW(A) gets FOLLOW(B) == { r }
– uses rule 3

65 FOLLOW() Example 1
S => a S e | B
B => b B C f | C
C => c C g | d | ε
FIRST(C) = {c,d,ε}
FIRST(B) = {b,c,d,ε}
FIRST(S) = {a,b,c,d,ε}
FOLLOW(C) =
FOLLOW(B) =
FOLLOW(S) = {$,e}
S is the start symbol.

66 FOLLOW(C) = {f,g} ∪ FOLLOW(B)
FOLLOW(B) = (FIRST_SEQ(C f) - {ε}) ∪ FOLLOW(S) = {c,d,f,$,e}
FOLLOW(S) = {$,e}

67 FOLLOW(C) = {f,g,c,d,$,e}
FOLLOW(B) = {c,d,f,$,e}
FOLLOW(S) = {$,e}

68 FOLLOW() Example 2
S => ( A ) | ε
A => T E
E => & T E | ε
T => ( A ) | a | b | c
FIRST(T) = {(,a,b,c}
FIRST(E) = {&,ε}
FIRST(A) = {(,a,b,c}
FIRST(S) = {(,ε}
FOLLOW(S) = {$}
FOLLOW(A) = {)}
FOLLOW(E) =
FOLLOW(T) =

69 FOLLOW(S) = {$}
FOLLOW(A) = {)}
FOLLOW(E) = FOLLOW(A) ∪ FOLLOW(E) = {)}
FOLLOW(T) = (FIRST_SEQ(E) - {ε}) ∪ FOLLOW(A) ∪ FOLLOW(E) = {&,)}

70 FOLLOW() Example 3
S => T E1
E1 => + T E1 | ε
T => F T1
T1 => * F T1 | ε
F => ( S ) | id
FIRST(F) = FIRST(T) = FIRST(S) = {(,id}
FIRST(T1) = {*,ε}
FIRST(E1) = {+,ε}
FOLLOW(S) = {$,)}
FOLLOW(E1) =
FOLLOW(T) =
FOLLOW(T1) =
FOLLOW(F) =

71 FOLLOW(S) = {$,)}
FOLLOW(E1) = FOLLOW(S) ∪ FOLLOW(E1) = {$,)}
FOLLOW(T) = (FIRST(E1) - {ε}) ∪ FOLLOW(S) ∪ FOLLOW(E1) = {+,$,)}
FOLLOW(T1) = FOLLOW(T) = {+,$,)}
FOLLOW(F) = (FIRST(T1) - {ε}) ∪ FOLLOW(T) ∪ FOLLOW(T1) = {*,+,$,)}

72 FOLLOW() Example 4
S => A B C | A D
A => a | a A
B => b | c | ε
C => D a C
D => b b | c c
FIRST(D) = FIRST(C) = {b,c}
FIRST(B) = {b,c,ε}
FIRST(A) = FIRST(S) = {a}
FOLLOW(S) = {$}
FOLLOW(D) = {a,$}
FOLLOW(A) =
FOLLOW(B) =
FOLLOW(C) =

73 FOLLOW(S) = {$}
FOLLOW(D) = {a,$}
FOLLOW(A) = {b,c}
FOLLOW(B) = {b,c}
FOLLOW(C) = {$}

