Published by Adrian Burke. Modified over 8 years ago.
1
Compiler Design BMZ 1 Chapter 4: Syntax Analysis
2
Syntax Analysis
[Figure: phases of a compiler. Source Program → Lexical Analyser → Syntax Analyser → Semantic Analyser → Intermediate Code Generator → Code Optimiser → Code Generator → Target Program. The Symbol Table Manager and the Error Handler interact with every phase.]
3
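Where is Syntax Analysis?
Source text: if (b == 0) a = b;
Lexical Analysis or Scanner: produces the token stream if ( b == 0 ) a = b ;
Syntax Analysis or Parsing: produces the abstract syntax tree or parse tree
[Figure: tree with root "if" whose children are "==" (over b and 0) and "=" (over a and b).]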
4
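The Role of the Parser (p. 160)
[Figure: the Lexical Analyzer reads the source program and hands a token to the Parser on each "get next token" request. The Parser passes a syntax tree to the Semantic Analyzer, which produces the intermediate representation. The Lexical Analyzer and the Parser both consult the symbol table.]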
5
Declaration Section
User types: as in flex, these are in a section bracketed by "%{" and "%}"
Tokens: terminal symbols of the grammar
  – %token terminal1 terminal2 ...
    Values for tokens are assigned sequentially after all ASCII characters
  – or %token terminal1 val1 terminal2 val2 ...
Tip: use the '-d' option in bison to get foo.tab.h, which contains the token definitions and can be included in the flex file
6
Declaration continued
Start symbol
  – %start non-terminal
Associativity (left, right or none)
  – %left TK_PLUS
  – %right TK_EXPONENT
  – %nonassoc TK_LESSTHAN
Precedence
  – Order of the directives specifies precedence
  – %prec changes the precedence of a rule
7
Declaration continued
Attribute values: information associated with all terminal/non-terminal symbols, passed from the lexer
  – %union {
        int ival;
        char *name;
        double dval;
    }
  – Becomes YYSTYPE
Symbol attributes: types of non-terminals
  – %type <type> non_terminal
  – %type <ival> IntNumber
8
Values Used by yyparse()
Error function
  – yyerror(char *s);
Last token value
  – yylval of type YYSTYPE (%union decl)
Setting yylval in flex
  – [a-z] { yylval.ival = yytext[0] - 'a'; return TK_NAME; }
Then, yylval is available in bison
  – But in a strange way
9
Rules Section
Every name that appears without having been declared as a token is a non-terminal
Productions
  – non-terminal : first_production | second_production | ... ;
  – An ε production has the form non-terminal : ;
    Thus you can say, foo : production1 | /* nothing */ ;
  – Adding actions: non-terminal : RHS { action routine } ;
    The action is called before the LHS is pushed on the parse stack
10
Attribute Values (aka $ vars)
Each terminal/non-terminal has one
Denoted by $n, where n is its rank in the rule, starting from 1
  – $$ = LHS
  – $1 = first symbol of the RHS
  – $2 = second symbol, etc.
  – Note, semantic actions have values too!
    A : B {...} C {...} ;
    C's value is denoted by $3, because the action after B occupies position $2
11
Example
%union {
    int value;
    char *symbol;
}
%type <value> exp term factor
%type <symbol> ident
...
exp : exp '+' term { $$ = $1 + $3; } ;               /* Note, $1 and $3 are ints here */
factor : ident { $$ = lookup(symbolTable, $1); } ;   /* Note, $1 is a char* here */
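To make the $ notation concrete, here is a rough Python simulation of what happens to attribute values when a rule is reduced. It is only a sketch of the idea, not bison's actual implementation: the `reduce` helper, the `symbol_table` dictionary (a stand-in for lookup(symbolTable, ...)), and the sample values are all assumptions made for illustration, and the intermediate reductions from factor to term to exp are skipped.

```python
def reduce(value_stack, rhs_len, action):
    """Pop rhs_len values, run the action on them, and push the result ($$)."""
    split = len(value_stack) - rhs_len
    args = value_stack[split:]
    del value_stack[split:]
    # dollar[0] is unused so that dollar[n] lines up with bison's $n numbering.
    dollar = [None] + args
    value_stack.append(action(dollar))


# Hypothetical stand-in for lookup(symbolTable, name).
symbol_table = {"x": 40}

stack = ["x"]                                   # ident carries a name (char*)
reduce(stack, 1, lambda d: symbol_table[d[1]])  # factor : ident { $$ = lookup(...); }
stack += ["+", 2]                               # shift '+' and a term whose value is 2
reduce(stack, 3, lambda d: d[1] + d[3])         # exp : exp '+' term { $$ = $1 + $3; }
print(stack)
```

After both reductions the stack holds a single int, which mirrors how the %type declarations decide which %union member each $n refers to.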
12
Conflict
Bison reports the number of shift/reduce and reduce/reduce conflicts found
Shift/reduce conflicts
  – Occur when there are 2 possible parses for an input string: one parse completes a rule (reduce) and one does not (shift)
  – Example
    e : 'X' | e '+' e ;
    "X+X+X" has 2 possible parses: "(X+X)+X" or "X+(X+X)"
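The two parses of "X+X+X" can be enumerated mechanically. The sketch below is a brute-force enumeration in Python, not what bison does internally; the function name `parenthesizations` is an assumption for illustration. It tries every '+' as the top-level operator of the rule e : e '+' e.

```python
from functools import lru_cache


def parenthesizations(text):
    """Return every parse of `text` under e : 'X' | e '+' e, fully parenthesized."""
    tokens = tuple(text)

    @lru_cache(maxsize=None)
    def parses(lo, hi):
        # All ways to read tokens[lo:hi] as a single e.
        results = []
        if hi - lo == 1 and tokens[lo] == "X":
            results.append("X")
        for mid in range(lo + 1, hi - 1):
            if tokens[mid] == "+":
                for left in parses(lo, mid):
                    for right in parses(mid + 1, hi):
                        results.append(f"({left}+{right})")
        return tuple(results)

    return list(parses(0, len(tokens)))


for parse in parenthesizations("X+X+X"):
    print(parse)
```

Each additional "+X" adds more readings, which is why the grammar needs an associativity declaration such as %left to settle the shift/reduce choice.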
13
Conflict continued
Reduce/reduce conflict occurs when the same token could complete 2 different rules
  – Example
    prog : proga | progb ;
    proga : 'X' ;
    progb : 'X' ;
    "X" can either be a proga or a progb
  – Ambiguous grammar!
14
Parsing Analogy
Syntax analysis for natural languages
  – Recognize whether a sentence is grammatically correct
  – Identify the function of each word
"I gave him the book"
[Figure: parse tree of the sentence. sentence → subject (I), verb (gave), indirect object (him), object; object → noun phrase → article (the), noun (book).]
15
Overview
Goal: determine if the input token stream satisfies the syntax of the program
What do we need to do this?
  – An expressive way to describe the syntax
  – A mechanism that determines if the input token stream satisfies the syntax description
For lexical analysis
  – Regular expressions describe tokens
  – Finite automata = mechanisms to generate tokens from the input stream
16
Use Regular Expressions?
REs can expressively describe tokens
  – Easy to implement via DFAs
So just use them to describe the syntax of a programming language?
  – NO! They don't have enough power to express any non-trivial syntax
  – Example: nested constructs (blocks, expressions, statements)
  – Detect balanced braces: {{} {} {{} { }}} { {{{{ }}}}} ...
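One way to see what is missing: checking balanced braces needs a counter that can grow without bound, while a DFA has only a fixed number of states. The following Python sketch (the name `braces_balanced` is an assumption for illustration) tracks the nesting depth and ignores every character other than braces.

```python
def braces_balanced(text):
    """Return True if every '{' in text is closed by a later matching '}'."""
    depth = 0  # current nesting depth; can grow arbitrarily large
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # a '}' appeared with nothing left to close
                return False
    return depth == 0


print(braces_balanced("{{} {} {{} { }}}"))
print(braces_balanced("{{}"))
```

Any fixed bound on `depth` would turn this back into a finite automaton, and it would then reject sufficiently deep but valid inputs.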
17
Context-Free Grammars
Consist of 4 components:
  – Terminal symbols = token or ε
  – Nonterminal symbols = syntactic variables
  – Start symbol S = special non-terminal
  – Productions of the form LHS → RHS
    LHS = single non-terminal
    RHS = string of terminals and non-terminals
    Specify how non-terminals may be expanded
Example:
    S → a S a
    S → T
    T → b T b
    T → ε
Language generated by a grammar is the set of strings of terminals derived from the start symbol by repeatedly applying the productions
  – L(G) = language generated by grammar G
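To illustrate "repeatedly applying the productions", the sketch below enumerates the short sentences of the example grammar S → a S a | T, T → b T b | ε by breadth-first search over sentential forms. The dictionary encoding and the name `sentences` are assumptions for illustration. Pruning by terminal count is enough to terminate for this grammar, but it is not a general-purpose algorithm (a grammar such as A → A A could grow forms forever).

```python
from collections import deque

# Each nonterminal maps to its alternatives; [] encodes the ε production.
GRAMMAR = {
    "S": [["a", "S", "a"], ["T"]],
    "T": [["b", "T", "b"], []],
}


def sentences(grammar, start, max_len):
    """Return all sentences of at most max_len terminals, shortest first."""
    seen = {(start,)}
    queue = deque([(start,)])
    found = set()
    while queue:
        form = queue.popleft()
        # Index of the leftmost nonterminal, or None if the form is a sentence.
        idx = next((i for i, sym in enumerate(form) if sym in grammar), None)
        if idx is None:
            found.add("".join(form))
            continue
        for rhs in grammar[form[idx]]:
            new_form = form[:idx] + tuple(rhs) + form[idx + 1:]
            terminals = sum(1 for sym in new_form if sym not in grammar)
            if terminals <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sorted(found, key=lambda w: (len(w), w))


print(sentences(GRAMMAR, "S", 4))
```

The output suggests the shape of L(G): some number of a's, an even number of b's, then the same number of a's again.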
18
Context-Free Grammars continued
A set of terminals: basic symbols from which sentences are formed
A set of nonterminals: syntactic categories denoting sets of sentences
A set of productions: rules specifying how the terminals and nonterminals can be combined to form sentences
The start symbol: a distinguished nonterminal denoting the language
19
Notational Conventions
To avoid always having to state that "these are the terminals", "these are the nonterminals", and so on, we shall employ the following notational conventions with regard to grammars throughout the remainder of this subject
Terminals: id, +, -, *, /, (, )
Nonterminals: expr, op
Productions:
    expr → expr op expr
    expr → ( expr )
    expr → - expr
    expr → id
    op → + | - | * | /
The start symbol: expr
20
Notational Conventions continued
1. These symbols are terminals:
   i) Lower-case letters early in the alphabet such as a, b, c.
   ii) Operator symbols such as +, -, etc.
   iii) Punctuation symbols such as parentheses, comma, etc.
   iv) The digits 0, 1, ..., 9.
   v) Boldface strings such as id or if.
2. These symbols are nonterminals:
   i) Upper-case letters early in the alphabet such as A, B, C.
   ii) The letter S, when it appears, is usually the start symbol.
   iii) Lower-case italic names such as expr or stmt.
21
Notational Conventions continued
3. Upper-case letters late in the alphabet, such as X, Y, Z, represent grammar symbols, that is, either nonterminals or terminals.
4. Lower-case letters late in the alphabet, chiefly u, v, ..., z, represent strings of terminals.
5. Lower-case Greek letters, α, β, γ, …, represent strings of grammar symbols. Thus, a generic production could be written as A → α, indicating that there is a single nonterminal A on the left of the arrow (the left side of the production) and a string of grammar symbols α to the right of the arrow (the right side of the production).
22
Notational Conventions continued
6. If A → α1, A → α2, ..., A → αk are all productions with A on the left (we call them A-productions), we may write A → α1 | α2 | … | αk. We call α1, α2, ..., αk the alternatives for A.
7. The left side of the first production is the start symbol.
    E → E A E | ( E ) | - E | id
    A → + | - | * | /
23
Derivations
Grammar:
    E → E A E | ( E ) | - E | id
    A → + | - | * | /
A derivation step is an application of a production as a rewriting rule: E ⇒ - E
A sequence of derivation steps E ⇒ - E ⇒ - ( E ) ⇒ - ( id ) is called a derivation of "- ( id )" from E
The symbol ⇒* denotes: derives in zero or more steps
The symbol ⇒+ denotes: derives in one or more steps
1. α ⇒* α for any string α
2. If α ⇒* β and β ⇒ γ, then α ⇒* γ
Thus E ⇒* - ( id ) and E ⇒+ - ( id )
24
Example
Grammar for balanced-parentheses language
  – S → ( S ) S
  – S → ε
1 non-terminal: S
2 terminals: "(", ")"
Start symbol: S
2 productions
If the grammar accepts a string, there is a derivation of that string using the productions
  – "(())"
  – S ⇒* (S) ⇒ ((S) S) ⇒* (()), where each S that is not expanded further is erased by S → ε
Why is the final S required?
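A minimal recursive-descent recognizer for this grammar can be sketched in Python as follows. The name `accepts` and the decision rule (choose S → ( S ) S when the next character is "(", otherwise S → ε) are assumptions made for this sketch; the slides do not prescribe a parsing method at this point.

```python
def accepts(text):
    """Return True if text is derivable from S -> ( S ) S | ε."""
    pos = 0

    def parse_s():
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            pos += 1                      # match "("
            if not parse_s():             # inner S
                return False
            if pos >= len(text) or text[pos] != ")":
                return False
            pos += 1                      # match ")"
            return parse_s()              # trailing S
        return True                       # S -> ε consumes nothing

    return parse_s() and pos == len(text)


print(accepts("(())"))
print(accepts("()()"))
print(accepts("(()"))
```

The call marked "trailing S" is what lets the recognizer accept side-by-side groups such as "()()". With S → ( S ) | ε alone, only purely nested strings like "(())" would be derivable.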
25
More on CFGs
Shorthand notation: vertical bar for multiple productions
    S → a S a | T
    T → b T b | ε
CFGs are powerful enough to express the syntax of most programming languages
Derivation = successive application of productions starting from S
Acceptance = determine if there is a derivation for an input token stream
26
RE is a Subset of CFG
Can inductively build a grammar for each RE

    RE          Grammar
    ε           S → ε
    a           S → a
    R1 R2       S → S1 S2
    R1 | R2     S → S1 | S2
    R1*         S → S1 S | ε

Where
  G1 = grammar for R1, with start symbol S1
  G2 = grammar for R2, with start symbol S2
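The inductive construction can be written down almost line for line. The Python sketch below assumes a regular expression is already given as a nested tuple, such as ("star", ("alt", ("sym", "a"), ("sym", "b"))) for (a|b)*. That encoding, the function name `re_to_cfg`, and the generated nonterminal names S1, S2, ... are all assumptions for illustration.

```python
import itertools


def re_to_cfg(regex):
    """Build a grammar for a regex tree; return (start_symbol, productions)."""
    productions = {}
    counter = itertools.count(1)

    def build(node):
        name = f"S{next(counter)}"        # fresh start symbol for this sub-RE
        kind = node[0]
        if kind == "eps":                 # ε        : S -> ε
            productions[name] = [[]]
        elif kind == "sym":               # a        : S -> a
            productions[name] = [[node[1]]]
        elif kind == "cat":               # R1 R2    : S -> S1 S2
            left, right = build(node[1]), build(node[2])
            productions[name] = [[left, right]]
        elif kind == "alt":               # R1 | R2  : S -> S1 | S2
            left, right = build(node[1]), build(node[2])
            productions[name] = [[left], [right]]
        elif kind == "star":              # R1*      : S -> S1 S | ε
            inner = build(node[1])
            productions[name] = [[inner, name], []]
        else:
            raise ValueError(f"unknown regex node: {kind!r}")
        return name

    start = build(regex)
    return start, productions


start, prods = re_to_cfg(("star", ("alt", ("sym", "a"), ("sym", "b"))))
for lhs in sorted(prods):
    alternatives = [" ".join(rhs) or "ε" for rhs in prods[lhs]]
    print(lhs, "->", " | ".join(alternatives))
```

Each branch creates exactly the productions from the corresponding table row, so the resulting grammar has one nonterminal per node of the regex tree.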
27
Context-Free Languages
A context-free language L(G) is the language defined by a context-free grammar G
A string of terminals w is in L(G) if and only if S ⇒+ w; w is called a sentence of G
If S ⇒* α, where α may contain nonterminals, then we call α a sentential form of G
    E ⇒ - E ⇒ - ( E ) ⇒ - ( id )
G1 is equivalent to G2 if L(G1) = L(G2)
28
Parser
[Figure: a Parser takes a context-free grammar G and a token stream s (from the lexer). It answers Yes if s is in L(G) and No otherwise, and emits error messages.]
Syntax analyzers (parsers) = CFG acceptors which also output the corresponding derivation when the token stream is accepted
Various kinds: LL(k), LR(k), SLR, LALR
29
Left- & Right-most Derivations
Each derivation step needs to choose
  – a nonterminal to rewrite
  – a production to apply
A leftmost derivation always chooses the leftmost nonterminal to rewrite
    E ⇒lm - E ⇒lm - ( E ) ⇒lm - ( E + E ) ⇒lm - ( id + E ) ⇒lm - ( id + id )
A rightmost derivation always chooses the rightmost nonterminal to rewrite
    E ⇒rm - E ⇒rm - ( E ) ⇒rm - ( E + E ) ⇒rm - ( E + id ) ⇒rm - ( id + id )
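The two derivations of - ( id + id ) can be replayed with a few lines of Python. This is only a sketch: the `derive` helper and the list encoding of productions are assumptions for illustration, and the caller supplies the sequence of productions instead of the program choosing them.

```python
NONTERMINALS = {"E"}

# Productions used to derive - ( id + id ), in the order they are applied.
PRODUCTIONS = [
    ("E", ["-", "E"]),
    ("E", ["(", "E", ")"]),
    ("E", ["E", "+", "E"]),
    ("E", ["id"]),
    ("E", ["id"]),
]


def derive(start, productions, leftmost=True):
    """Apply each production to the leftmost (or rightmost) nonterminal."""
    form = [start]
    steps = [" ".join(form)]
    for lhs, rhs in productions:
        positions = [i for i, sym in enumerate(form) if sym in NONTERMINALS]
        if not positions:
            raise ValueError("no nonterminal left to rewrite")
        i = positions[0] if leftmost else positions[-1]
        if form[i] != lhs:
            raise ValueError(f"expected {lhs} at position {i}, found {form[i]}")
        form = form[:i] + list(rhs) + form[i + 1:]
        steps.append(" ".join(form))
    return steps


print(" => ".join(derive("E", PRODUCTIONS, leftmost=True)))
print(" => ".join(derive("E", PRODUCTIONS, leftmost=False)))
```

Both runs use the same productions and end in the same sentence. They differ only in the second-to-last sentential form, which is - ( id + E ) in one case and - ( E + id ) in the other.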
30
Parse Trees
* A parse tree is a graphical representation for a derivation that filters out the order of choosing nonterminals for rewriting
* Many derivations may correspond to the same parse tree, but every parse tree has associated with it a unique leftmost and a unique rightmost derivation
[Figure: parse tree for - ( id + id ).]
    E ⇒lm - E ⇒lm - ( E ) ⇒lm - ( E + E ) ⇒lm - ( id + E ) ⇒lm - ( id + id )
    E ⇒rm - E ⇒rm - ( E ) ⇒rm - ( E + E ) ⇒rm - ( E + id ) ⇒rm - ( id + id )
31
Parse Tree vs Abstract Syntax Tree
Grammar:
    S → E + S | E
    E → number | ( S )
Derive: (1 + 2 + (3 + 4)) + 5
[Figure: parse tree of (1 + 2 + (3 + 4)) + 5, with S and E as internal nodes, shown next to the corresponding AST, whose internal nodes are the four + operators and whose leaves are 1, 2, 3, 4, 5.]
Parse tree = tree representation of the derivation
  – Leaves of the tree are terminals
  – Internal nodes are non-terminals
  – No information about the order of the derivation steps
AST discards (abstracts) unneeded information: a more compact format
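As a rough illustration of the difference, the Python sketch below parses the example with a hand-written recursive-descent parser for S → E + S | E, E → number | ( S ) and returns an AST directly as nested tuples, so parentheses and the S/E nodes never appear in the result. The function names and the tuple encoding are assumptions for illustration. The regex-based tokenizer silently skips any character it does not recognize.

```python
import re


def parse(text):
    """Parse text with S -> E + S | E, E -> number | ( S ); return an AST."""
    tokens = re.findall(r"\d+|[()+]", text)
    pos = 0

    def parse_s():
        nonlocal pos
        left = parse_e()
        if pos < len(tokens) and tokens[pos] == "+":
            pos += 1
            return ("+", left, parse_s())   # S -> E + S
        return left                          # S -> E

    def parse_e():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        if tok == "(":                       # E -> ( S )
            pos += 1
            inner = parse_s()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            return inner                     # parentheses are dropped in the AST
        if tok.isdigit():                    # E -> number
            pos += 1
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = parse_s()
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return tree


def evaluate(node):
    """Evaluate an AST made of ints and ('+', left, right) tuples."""
    if isinstance(node, int):
        return node
    _, left, right = node
    return evaluate(left) + evaluate(right)


ast = parse("(1 + 2 + (3 + 4)) + 5")
print(ast)
print(evaluate(ast))
```

The AST has four "+" nodes and five leaves, matching the figure. Because the grammar is right-recursive, 1 + 2 + (3 + 4) groups as 1 + (2 + (3 + 4)).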
32
Derivation Order
Can choose to apply productions in any order: select a non-terminal and substitute the RHS of a production
Two standard orders: leftmost and rightmost
Leftmost derivation
  – In the string, find the leftmost non-terminal and apply a production to it
  – E + S ⇒ 1 + S
Rightmost derivation
  – Same, but find the rightmost non-terminal
  – E + S ⇒ E + E + S
33
Leftmost & Rightmost Derivation
Grammar:
    S → E + S | E
    E → number | ( S )
Leftmost derive: (1 + 2 + (3 + 4)) + 5
    S ⇒ E+S ⇒ (S)+S ⇒ (E+S)+S ⇒ (1+S)+S ⇒ (1+E+S)+S ⇒ (1+2+S)+S ⇒ (1+2+E)+S ⇒ (1+2+(S))+S ⇒ (1+2+(E+S))+S ⇒ (1+2+(3+S))+S ⇒ (1+2+(3+E))+S ⇒ (1+2+(3+4))+S ⇒ (1+2+(3+4))+E ⇒ (1+2+(3+4))+5
Rightmost derive: (1 + 2 + (3 + 4)) + 5
    S ⇒ E+S ⇒ E+E ⇒ E+5 ⇒ (S)+5 ⇒ (E+S)+5 ⇒ (E+E+S)+5 ⇒ (E+E+E)+5 ⇒ (E+E+(S))+5 ⇒ (E+E+(E+S))+5 ⇒ (E+E+(E+E))+5 ⇒ (E+E+(E+4))+5 ⇒ (E+E+(3+4))+5 ⇒ (E+2+(3+4))+5 ⇒ (1+2+(3+4))+5
Result: same parse tree, same productions chosen, but in a different order
34
Ambiguous Grammar
* A grammar is ambiguous if it produces more than one parse tree for some sentence
    E ⇒ E + E ⇒ id + E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
    E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
[Figure: the two parse trees for id + id * id, one with + at the root and * below it, the other with * at the root and + below it.]
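Ambiguity matters because the two trees can mean different things. The Python sketch below enumerates every parse tree of a token list under E → E + E | E * E | id and evaluates each one. The helper names and the sample values 2, 3, 4 standing in for the three id tokens are assumptions for illustration, and the token list is assumed to be well formed (operands and operators alternating).

```python
def all_trees(tokens):
    """Every parse tree of tokens under E -> E + E | E * E | id, as nested tuples."""
    if len(tokens) == 1:
        return [tokens[0]]                    # a single id
    trees = []
    for i in range(1, len(tokens) - 1, 2):    # operators sit at odd positions
        for left in all_trees(tokens[:i]):
            for right in all_trees(tokens[i + 1:]):
                trees.append((tokens[i], left, right))
    return trees


def evaluate(tree):
    """Evaluate a tree whose leaves are numbers and whose nodes are + or *."""
    if isinstance(tree, tuple):
        op, left, right = tree
        lhs, rhs = evaluate(left), evaluate(right)
        return lhs + rhs if op == "+" else lhs * rhs
    return tree


# id + id * id with the ids standing for 2, 3 and 4.
for tree in all_trees([2, "+", 3, "*", 4]):
    print(tree, "=", evaluate(tree))
```

One tree computes 2 + (3 * 4) and the other (2 + 3) * 4, so a compiler needs some rule for picking one of them.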
35
Resolving Ambiguity
* Use disambiguating rules to throw away undesirable parse trees
* Rewrite grammars by incorporating disambiguating rules into grammars
36
Example
The dangling-else grammar
    stmt → if expr then stmt
         | if expr then stmt else stmt
         | other
Sentence: if E1 then if E2 then S1 else S2
[Figure: the two parse trees for this sentence. In one, the else belongs to the inner "if E2 then S1"; in the other, it belongs to the outer "if E1 then ...".]
37
Disambiguating Rules
* Rule: match each else with the closest previous unmatched then
* Remove undesired state transitions in the pushdown automaton
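In a hand-written recursive-descent parser, the "closest unmatched then" rule falls out of consuming an else greedily as soon as the inner statement has been parsed. The Python sketch below assumes a pre-split token list in which conditions and non-if statements are single tokens such as E1 or S1. The function name and the tuple encoding of the tree are assumptions for illustration.

```python
def parse_stmt(tokens, pos=0):
    """Parse one stmt starting at tokens[pos]; return (tree, next_pos)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] != "if":
        return tokens[pos], pos + 1            # an "other" statement
    if pos + 2 >= len(tokens) or tokens[pos + 2] != "then":
        raise SyntaxError("expected 'then'")
    cond = tokens[pos + 1]
    then_branch, pos = parse_stmt(tokens, pos + 3)
    else_branch = None
    # Greedy step: an else seen here attaches to this, the nearest, if.
    if pos < len(tokens) and tokens[pos] == "else":
        else_branch, pos = parse_stmt(tokens, pos + 1)
    return ("if", cond, then_branch, else_branch), pos


tree, _ = parse_stmt("if E1 then if E2 then S1 else S2".split())
print(tree)
```

The else ends up inside the inner if, and the outer if is left without an else branch. This is the same choice a bison-generated parser makes when it resolves the shift/reduce conflict in favor of shift.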
38
Grammar Rewriting
    stmt → m_stmt | unm_stmt
    m_stmt → if expr then m_stmt else m_stmt
           | other
    unm_stmt → if expr then stmt
             | if expr then m_stmt else unm_stmt