1
3. Formal Grammars and Top-down Parsing
Compilers, Chih-Hung Wang
References
1. C. N. Fischer and R. J. LeBlanc. Crafting a Compiler with C. Pearson Education Inc., 2009.
2. D. Grune, H. Bal, C. Jacobs, and K. Langendoen. Modern Compiler Design. John Wiley & Sons, 2000.
3. Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 2nd ed., 2006.
2
Introduction Context-free Grammar
The syntax of programming language constructs can be described by a context-free grammar.
Important aspects:
A grammar imposes a structure on the linear sequence of tokens that makes up the program.
Using techniques from the field of formal languages, a parser can be constructed automatically from a grammar.
Grammars help programmers write syntactically correct programs and provide answers to detailed questions about the syntax.
3
The role of the parser
4
Context-Free Grammars
A context-free grammar (CFG) is a compact, finite representation of a language, defined by the following four components:
A finite terminal vocabulary Vt
A finite nonterminal vocabulary Vn
A start symbol S ∈ Vn
A finite set of productions P of the form A → X1 … Xm, where A ∈ Vn, Xi ∈ Vn ∪ Vt, 1 ≤ i ≤ m
Note that A → ε is a valid production.
5
A Simple Expression Grammar
E → Prefix ( E )
E → V Tail
Prefix → F
Prefix → ε
Tail → + E
Tail → ε
6
Leftmost Derivations
A sentential form produced via a leftmost derivation is called a left sentential form. The production sequence discovered by a large class of parsers (the top-down parsers) is a leftmost derivation. Hence, these parsers are said to produce a leftmost parse.
Example: F(V+V)
E ⇒lm Prefix(E) ⇒lm F(E) ⇒lm F(V Tail) ⇒lm F(V+E) ⇒lm F(V+V Tail) ⇒lm F(V+V)
7
Rightmost Derivations
As a bottom-up parser discovers the productions that derive a given token sequence, it traces a rightmost derivation, but the productions are applied in reverse order. The result is called a rightmost or canonical parse.
Example: F(V+V)
E ⇒rm Prefix(E) ⇒rm Prefix(V Tail) ⇒rm Prefix(V+E) ⇒rm Prefix(V+V Tail) ⇒rm Prefix(V+V) ⇒rm F(V+V)
8
Parse Tree
It is rooted by the start symbol S.
Each node is either a grammar symbol or ε.
9
Properties of CFGs
The grammar may include useless symbols.
The grammar may allow multiple, distinct derivations (parse trees) for some input string.
The grammar may generate strings that do not belong in the language, or it may exclude strings that are in the language.
10
Ambiguity (1)
Some grammars allow a derived string to have two or more different parse trees (and thus a nonunique structure).
Example: <expression> → <expression> - <expression> | ID
This grammar allows two different parse trees for ID - ID - ID.
11
Ambiguity (2) Two parse trees for ID - ID - ID
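The two parse trees matter because they group the subtraction differently, so they assign different values to the same token sequence. A minimal sketch in C (the function names are illustrative and not part of the slides) evaluates ID - ID - ID under each tree:

```c
/* Evaluate a - b - c according to the two parse trees of the ambiguous
   grammar <expression> -> <expression> - <expression> | ID.
   Both function names are hypothetical, chosen for this sketch only. */

/* Left-leaning tree: (a - b) - c, the conventional reading. */
int eval_left_tree(int a, int b, int c)
{
    return (a - b) - c;
}

/* Right-leaning tree: a - (b - c). */
int eval_right_tree(int a, int b, int c)
{
    return a - (b - c);
}
```

For the operands 10, 4 and 3 the left-leaning tree gives 3 and the right-leaning tree gives 9, which is why a compiler cannot accept both structures.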
12
Parsers and Recognizers
Two approaches
A parser is considered top-down if it generates a parse tree by starting at the root of the tree and expanding it by applying productions in a depth-first manner.
A bottom-up parser generates a parse tree by starting at the tree’s leaves and working toward its root.
13
Two approaches of Parser
Deterministic left-to-right top-down: the LL method
Deterministic left-to-right bottom-up: the LR method
Left-to-right: the sequence of tokens is processed from left to right.
Deterministic: no searching is involved; each token brings the parser one step closer to the goal of constructing the syntax tree.
14
Parsers (Top-down)
15
Parsers (bottom-Up)
16
Pre-order and post-order (1)
The top-down method constructs the syntax tree in pre-order.
The bottom-up method constructs the syntax tree in post-order.
17
Pre-order and post-order (2)
18
Principles of top-down parsing
The main task of a top-down parser is to choose the correct alternatives for known non-terminals
19
Principles of bottom-up parsing
The main task of a bottom-up parser is to repeatedly find the first node all of whose children have already been constructed.
20
Creating a top-down parser manually
Recursive descent parsing Simplest way but has its limitations
21
Recursive descent parsing program (1)
22
Recursive descent parsing program (2)
23
Drawbacks Three drawbacks
There is still some searching through the alternatives.
The method often fails to produce a correct parser.
Error handling leaves much to be desired.
24
Second problem (1)
Example 1: Index_element will never be tried, because the alternative IDENTIFIER is tried first and already matches the IDENTIFIER in IDENTIFIER ‘[’.
25
Second problems (2) Example 2 The recognizer will not recognize ab
26
Second problems (3) Example 3
Recursive descent parsers cannot handle left-recursive grammars
27
Creating a top-down parser automatically
The principles of constructing a top-down parser automatically derive from those of writing one by hand, by applying precomputation. Grammars which allow the construction of a top-down parser to be performed are called LL(1) grammars.
28
LL(1) parsing
FIRST set: the set of first tokens that can be produced by each alternative in the grammar.
We have to precompute the FIRST sets of all non-terminals.
The FIRST sets of the terminals are obvious.
Finding FIRST(α) is trivial when α starts with a terminal.
FIRST(N) is the union of the FIRST sets of its alternatives.
FIRST(α) = {a ∈ Σ | α ⇒* aβ}
29
Predictive recursive descent parser
The FIRST sets can be used in the construction of a predictive parser because it predicts the presence of a given alternative without trying to find out if it is there.
30
Closure algorithm for computing the FIRST set (1)
Data definitions
31
Closure algorithm for computing the FIRST set (2)
Initializations
32
Closure algorithm for computing the FIRST set (3)
Inference rules
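The closure algorithm for FIRST sets can be sketched in C as below. This is a minimal illustration under stated assumptions, not the code from the slides: the grammar is S → ABc, A → a | ε, B → b | ε, upper-case letters are non-terminals, lower-case letters are terminals, and each set is a bit mask with an extra bit EPS for ε. All names (`grammar`, `first_of_string`, `compute_first`) are hypothetical.

```c
#include <ctype.h>

#define EPS (1u << 26)                 /* set member meaning "derives the empty string" */
#define BIT(c) (1u << ((c) - 'a'))     /* set member for terminal c ('a'..'z') */

/* A production: left-hand side and right-hand side ("" is the empty alternative). */
struct production { char lhs; const char *rhs; };

static const struct production grammar[] = {
    {'S', "ABc"}, {'A', "a"}, {'A', ""}, {'B', "b"}, {'B', ""},
};
enum { NPROD = sizeof grammar / sizeof grammar[0] };

unsigned first[26];                    /* first[N - 'A'] for non-terminal N */

/* FIRST of a symbol string, using the current approximation of first[].
   Assumes the string contains only ASCII letters. */
unsigned first_of_string(const char *s)
{
    unsigned result = 0;
    for (; *s; s++) {
        unsigned f = islower((unsigned char)*s) ? BIT(*s) : first[*s - 'A'];
        result |= f & ~EPS;
        if (!(f & EPS))
            return result;             /* this symbol is not nullable: stop here */
    }
    return result | EPS;               /* every symbol was nullable */
}

/* Closure: keep applying the inference rule until no set changes. */
void compute_first(void)
{
    int changed = 1;
    for (int i = 0; i < 26; i++)
        first[i] = 0;
    while (changed) {
        changed = 0;
        for (int p = 0; p < NPROD; p++) {
            unsigned *set = &first[grammar[p].lhs - 'A'];
            unsigned updated = *set | first_of_string(grammar[p].rhs);
            if (updated != *set) {
                *set = updated;
                changed = 1;
            }
        }
    }
}
```

After `compute_first()` the sets are FIRST(A) = {a, ε}, FIRST(B) = {b, ε} and FIRST(S) = {a, b, c}. Bit masks keep the "did anything change" test to a single comparison, which is all a closure algorithm needs.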
33
FIRST sets example(1) Grammar
34
FIRST sets example(2) The initial FIRST sets
35
FIRST sets example(3) The final FIRST sets
36
Another Example of First Set
E → Prefix ( E )
E → V Tail
Prefix → F
Prefix → ε
Tail → + E
Tail → ε
37
Another Example of First Set (II)
S → aSe
S → B
B → bBe
B → C
C → cCe
C → d
38
Another Example of First Set (III)
S → ABc
A → a
A → ε
B → b
B → ε
39
Algorithms of Computing First(α)
First(α) = {a ∈ Vt | α ⇒* aβ} ∪ (if α ⇒* ε then {ε} else ∅)
Page 104 in the textbook
40
The predictive parser (1)
41
The predictive parser (2)
42
Practice
Find the FIRST sets of all alternatives of the following grammar.
E → T E’
E’ → + T E’ | ε
T → F T’
T’ → * F T’ | ε
F → ( E ) | id
43
Nullable alternatives
A complication arises with the case label for the empty alternative (e.g. in rest_expression). Since it does not itself start with any token, how can we decide whether it is the correct alternative?
44
FOLLOW sets
FOLLOW(N) is the set of tokens that can immediately follow a given non-terminal N.
LL(1) parser
‘LL’ because the parser works from Left to right, identifying the nodes in what is called Leftmost derivation order.
‘(1)’ because all choices are based on a one-token look-ahead.
FOLLOW(A) = {b ∈ Σ | S ⇒+ αAbβ}
45
Closure algorithm for computing the FOLLOW sets
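A FOLLOW computation needs the FIRST sets first, so the sketch below runs both closures. It is an illustration under assumptions, not the slides' code: the grammar is S → ABc, A → a | ε, B → b | ε, sets are bit masks, and the extra bit END (a name invented here) records that the end of the input may follow a non-terminal.

```c
#include <ctype.h>

#define EPS (1u << 26)                 /* in FIRST sets: derives the empty string */
#define END (1u << 27)                 /* in FOLLOW sets: end of input may follow */
#define BIT(c) (1u << ((c) - 'a'))

struct production { char lhs; const char *rhs; };
static const struct production grammar[] = {
    {'S', "ABc"}, {'A', "a"}, {'A', ""}, {'B', "b"}, {'B', ""},
};
enum { NPROD = sizeof grammar / sizeof grammar[0] };

unsigned first[26], follow[26];        /* indexed by non-terminal - 'A' */

/* FIRST of a string of ASCII letters, given the current first[] values. */
static unsigned first_of_string(const char *s)
{
    unsigned result = 0;
    for (; *s; s++) {
        unsigned f = islower((unsigned char)*s) ? BIT(*s) : first[*s - 'A'];
        result |= f & ~EPS;
        if (!(f & EPS))
            return result;
    }
    return result | EPS;
}

/* Add bits to a set; report whether the set grew. */
static int add(unsigned *set, unsigned bits)
{
    unsigned updated = *set | bits;
    int changed = updated != *set;
    *set = updated;
    return changed;
}

void compute_sets(void)
{
    int changed;
    for (int i = 0; i < 26; i++)
        first[i] = follow[i] = 0;
    do {                               /* FIRST closure */
        changed = 0;
        for (int p = 0; p < NPROD; p++)
            changed |= add(&first[grammar[p].lhs - 'A'],
                           first_of_string(grammar[p].rhs));
    } while (changed);

    follow['S' - 'A'] = END;           /* the start symbol can be followed by end of input */
    do {                               /* FOLLOW closure */
        changed = 0;
        for (int p = 0; p < NPROD; p++) {
            const char *rhs = grammar[p].rhs;
            for (int i = 0; rhs[i]; i++) {
                if (!isupper((unsigned char)rhs[i]))
                    continue;          /* FOLLOW is only defined for non-terminals */
                unsigned rest = first_of_string(rhs + i + 1);
                unsigned bits = rest & ~EPS;
                if (rest & EPS)        /* the rest can vanish: inherit FOLLOW(lhs) */
                    bits |= follow[grammar[p].lhs - 'A'];
                changed |= add(&follow[rhs[i] - 'A'], bits);
            }
        }
    } while (changed);
}
```

For this grammar the result is FOLLOW(S) = {end of input}, FOLLOW(A) = {b, c} and FOLLOW(B) = {c}.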
46
The first and follow sets
47
Recall the predictive parser
rest_expression → ‘+’ expression | ε
FIRST(rest_expr) = {‘+’, ε}
FOLLOW(rest_expr) = {EOF, ‘)’}
void rest_expression(void) {
    switch (Token.class) {
    case '+': token('+'); expression(); break;   /* ‘+’ is in FIRST(rest_expr) */
    case EOF: case ')': break;                   /* empty alternative: token is in FOLLOW(rest_expr) */
    default: error();
    }
}
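A self-contained version of this routine might look like the sketch below. It rests on assumptions that are not in the slides: the full grammar is taken to be input → expression EOF, expression → term rest_expression, term → IDENTIFIER | ‘(’ expression ‘)’; tokens are single characters, any letter counts as an IDENTIFIER, and the terminating '\0' plays the role of EOF. The helper names `match`, `parse`, `pos` and `ok` are hypothetical.

```c
#include <ctype.h>

static const char *pos;   /* current look-ahead token */
static int ok;            /* cleared on the first syntax error */

static void expression(void);

/* Consume the expected token or record an error. */
static void match(char expected)
{
    if (*pos != expected)
        ok = 0;
    else if (*pos != '\0')
        pos++;            /* never step past the end-of-input marker */
}

static void term(void)
{
    if (isalpha((unsigned char)*pos)) {
        pos++;                              /* IDENTIFIER */
    } else if (*pos == '(') {
        match('(');
        expression();
        match(')');
    } else {
        ok = 0;
    }
}

static void rest_expression(void)
{
    switch (*pos) {
    case '+':                               /* '+' is in FIRST(rest_expression) */
        match('+');
        expression();
        break;
    case '\0': case ')':                    /* FOLLOW(rest_expression): empty alternative */
        break;
    default:
        ok = 0;
    }
}

static void expression(void)
{
    term();
    rest_expression();
}

/* Returns 1 if text is a sentence of the grammar, 0 otherwise. */
int parse(const char *text)
{
    pos = text;
    ok = 1;
    expression();
    match('\0');                            /* input -> expression EOF */
    return ok;
}
```

Calling `parse("a+(b+c)")` accepts the input, while `parse("a+")` is rejected because no alternative of term starts with the end-of-input token. The routine never searches: each case label is chosen from the look-ahead token alone.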
48
Another Example of Follow Set
Follow(A) = {a ∈ Vt | S ⇒* …Aa…} ∪ (if S ⇒+ αA then {ε} else ∅)
S → ABc
A → a
A → ε
B → b
B → ε
49
Another Example of Follow Set (II)
S → aSe
S → B
B → bBe
B → C
C → cCe
C → d
50
Another Example of Follow Set (III)
S → ABc
A → a
A → ε
B → b
B → ε
51
Algorithm of Follow(A)
52
LL(1) conflicts Example The codes
53
LL(1) conflicts
FIRST/FIRST conflict
term → IDENTIFIER | IDENTIFIER ‘[’ expression ‘]’ | ‘(’ expression ‘)’
54
LL(1) conflicts
FIRST/FOLLOW conflict
rule          | FIRST set | FOLLOW set
S → A ‘a’ ‘b’ | {‘a’}     | {}
A → ‘a’ | ε   | {‘a’, ε}  | {‘a’}
55
LL(1) conflicts
Left recursion
expression → expression ‘-’ term | term
Look-ahead token
The LL(1) method predicts the alternative Ak for a non-terminal N when the look-ahead token is in FIRST(Ak) ∪ (if Ak is nullable then FOLLOW(N)).
LL(1) grammar
No FIRST/FIRST conflicts
No FIRST/FOLLOW conflicts
No multiple nullable alternatives: no non-terminal can have more than one nullable alternative.
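The LL(1) conditions can be checked mechanically once the FIRST and FOLLOW sets are known. The sketch below is a minimal illustration, assuming sets are stored as bit masks over the terminals 'a'..'z' with an extra EPS bit marking a nullable alternative; the names `predict` and `ll1_conflict` are invented for this example.

```c
#define EPS (1u << 26)                 /* marks a nullable alternative */
#define BIT(c) (1u << ((c) - 'a'))     /* set member for terminal c */

/* Tokens that select an alternative: its FIRST set, plus FOLLOW(N)
   when the alternative is nullable. */
unsigned predict(unsigned first_of_alt, unsigned follow_of_n)
{
    unsigned set = first_of_alt & ~EPS;
    if (first_of_alt & EPS)
        set |= follow_of_n;
    return set;
}

/* Returns 1 if two alternatives of N cannot be told apart with one
   look-ahead token, 0 if they can. */
int ll1_conflict(unsigned first_alt1, unsigned first_alt2, unsigned follow_of_n)
{
    if ((first_alt1 & EPS) && (first_alt2 & EPS))
        return 1;                      /* more than one nullable alternative */
    return (predict(first_alt1, follow_of_n) &
            predict(first_alt2, follow_of_n)) != 0;
}
```

For A → ‘a’ | ε with FOLLOW(A) = {‘a’}, `ll1_conflict(BIT('a'), EPS, BIT('a'))` reports a conflict, which is the FIRST/FOLLOW case; two alternatives that both start with the same token give the FIRST/FIRST case.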
56
Solve the LL(1) conflicts
Two options Use a stronger parser Make the grammar LL(1)
57
Making a grammar LL(1)
Manual labour: rewrite the grammar, then adjust the semantic actions.
Three rewrite methods: left factoring, substitution, left-recursion removal.
58
Left-factoring
term → IDENTIFIER | IDENTIFIER ‘[’ expression ‘]’
Factor out the common prefix:
term → IDENTIFIER after_identifier
after_identifier → ε | ‘[’ expression ‘]’
This requires ‘[’ ∉ FOLLOW(after_identifier).
59
Substitution
A → a | B c | ε
S → p A q
Replace the non-terminal by its alternatives:
S → p a q | p B c q | p q
Example
S → A ‘a’ ‘b’
A → ‘a’ | ε
becomes
S → ‘a’ ‘a’ ‘b’ | ‘a’ ‘b’
60
Left-recursion removal
Three types of left-recursion
Direct left-recursion: N → N α | …
Indirect left-recursion (chain structure): N → A …, A → B …, …, Z → N …
Hidden left-recursion: N → α N | … (α can produce ε)
61
Left-recursion removal
N → N α | β
replace by
N → β M
M → α M | ε
Example
expression → expression ‘-’ term | term
becomes
expression → term expression_tail_option
expression_tail_option → ‘-’ term expression_tail_option | ε
62
Practice: make the following grammar LL(1)
expression → expression ‘+’ term | expression ‘-’ term | term
term → term ‘*’ factor | term ‘/’ factor | factor
factor → ‘(’ expression ‘)’ | func-call | identifier | constant
func-call → identifier ‘(’ expr-list? ‘)’
expr-list → expression (‘,’ expression)*
63
Answers
Substitution
F → ‘(’ E ‘)’ | ID ‘(’ expr-list? ‘)’ | ID | constant
Left factoring
E → E ( ‘+’ | ‘-’ ) T | T
T → T ( ‘*’ | ‘/’ ) F | F
F → ‘(’ E ‘)’ | ID ( ‘(’ expr-list? ‘)’ )? | constant
Left recursion removal
E → T (( ‘+’ | ‘-’ ) T )*
T → F (( ‘*’ | ‘/’ ) F )*
64
Undoing the semantic effects of grammar transformations
While it is often possible to transform our grammar into a new grammar that is acceptable to a parser generator and that generates the same language, the new grammar usually assigns a different structure to strings in the language than our original grammar did. Fortunately, in many cases we are not really interested in the structure but rather in the semantics implied by it.
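One common way to recover the original semantics is to pass the value computed so far into the routine for the tail, so that each operator is still applied to everything on its left. The sketch below illustrates this for expression → term expression_tail_option, expression_tail_option → ‘-’ term expression_tail_option | ε. It assumes, purely to stay short, that every term is a single decimal digit and that the input is well formed; `evaluate` and `pos` are hypothetical names.

```c
static const char *pos;                /* current position in the input */

/* term: a single decimal digit (assumption of this sketch). */
static int term(void)
{
    return *pos++ - '0';
}

/* The value computed so far is passed down as 'left', so each '-' is
   applied to the whole expression on its left, exactly as the original
   left-recursive grammar would group it. */
static int expression_tail_option(int left)
{
    if (*pos == '-') {
        pos++;
        return expression_tail_option(left - term());
    }
    return left;                       /* empty alternative */
}

/* expression -> term expression_tail_option */
int evaluate(const char *text)
{
    pos = text;
    return expression_tail_option(term());
}
```

With this scheme `evaluate("9-4-3")` yields 2, i.e. (9 - 4) - 3, even though the transformed grammar by itself groups the operators to the right, which would give 9 - (4 - 3) = 8.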
65
Semantics Non-left-recursive equivalent
66
Automatic conflict resolution (1)
There are two ways in which LL parsers can be strengthened:
By increasing the look-ahead. Distinguishing alternatives not by their first token but by their first two tokens is called LL(2). Disadvantage: the parser code can get much bigger.
By allowing dynamic conflict resolvers. When a conflict arises during parsing, some conditions are evaluated to resolve it. The parser generator LLgen requires a conflict resolver to be placed on the first of two conflicting alternatives.
67
Automatic conflict resolution (2)
If-else statement in C
else_tail_option: both the FIRST set and the FOLLOW set contain the token ‘else’.
Conflict resolver
68
The LL(1) push-down automaton
Transition table for an LL(1) parser
69
Push-down automaton (PDA)
Types of moves
Prediction move: the top of the prediction stack is a non-terminal N. N is removed from the stack, the prediction table is looked up, and the alternative of N is pushed onto the prediction stack.
Match move: the top of the prediction stack is a terminal, which must match the look-ahead token; both are then removed.
Termination: parsing terminates when the prediction stack is exhausted.
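The three kinds of moves can be sketched as a small table-driven parser in C. The encoding is an assumption made for this example: I, E, T and R stand for the non-terminals input, expression, term and rest-expr, i stands for the IDENT token, $ for EOF, and ‘+’, ‘(’ and ‘)’ stand for themselves. The grammar is taken to be input → expression EOF, expression → term rest-expr, term → IDENT | ( expression ), rest-expr → + expression | ε. The function names are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Prediction table: the alternative to push for a non-terminal and a
   look-ahead token, or NULL for an empty (error) entry. */
static const char *predict(char nonterminal, char lookahead)
{
    switch (nonterminal) {
    case 'I': case 'E':
        if (lookahead == 'i' || lookahead == '(')
            return nonterminal == 'I' ? "E$" : "TR";
        return NULL;
    case 'T':
        if (lookahead == 'i') return "i";
        if (lookahead == '(') return "(E)";
        return NULL;
    case 'R':
        if (lookahead == '+') return "+E";
        if (lookahead == ')' || lookahead == '$') return "";  /* empty alternative */
        return NULL;
    default:
        return NULL;
    }
}

/* Returns 1 if the token string is accepted, 0 otherwise. */
int ll1_parse(const char *input)
{
    char stack[256];                   /* prediction stack, top at stack[top - 1] */
    size_t top = 0;
    stack[top++] = 'I';
    while (top > 0) {
        char symbol = stack[--top];
        if (isupper((unsigned char)symbol)) {      /* prediction move */
            const char *rhs = predict(symbol, *input);
            if (rhs == NULL)
                return 0;
            size_t n = strlen(rhs);
            if (top + n > sizeof stack)
                return 0;                          /* stack would overflow */
            while (n > 0)
                stack[top++] = rhs[--n];           /* push right to left */
        } else {                                   /* match move */
            if (*input != symbol)
                return 0;
            input++;
        }
    }
    return *input == '\0';             /* termination: stack exhausted, input consumed */
}
```

In this encoding the token sequence aap + ( noot + mies ) EOF becomes "i+(i+i)$", which `ll1_parse` accepts. Pushing the alternative right to left leaves its first symbol on top of the stack, so the symbols are processed in left-to-right order.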
70
Prediction move in an LL(1) PDA
71
Match move in an LL(1) PDA
72
Predictive parsing with an LL(1) PDA
73
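PDA example (1)
prediction stack: input
input: aap + ( noot + mies ) EOF
Transition table (rows: state on top of the stack; columns: look-ahead token)
state      | IDENT          | +            | (              | ) | EOF
input      | expression EOF |              | expression EOF |   |
expression | term rest-expr |              | term rest-expr |   |
term       | IDENT          |              | ( expression ) |   |
rest-expr  |                | + expression |                | ε | ε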
74
PDA example (2)
prediction stack: input
input: aap + ( noot + mies ) EOF
Replace the non-terminal by its transition entry.
75
PDA example (3)
prediction stack: expression EOF
input: aap + ( noot + mies ) EOF
76
PDA example (4)
prediction stack: expression EOF
input: aap + ( noot + mies ) EOF
Replace the non-terminal by its transition entry.
77
PDA example (5)
prediction stack: term rest-expr EOF
input: aap + ( noot + mies ) EOF
78
PDA example (6)
prediction stack: term rest-expr EOF
input: aap + ( noot + mies ) EOF
Replace the non-terminal by its transition entry.
79
PDA example (7) Please continue!! Example of parsing (i+i)+i
80
Example in Textbook: Micro (1)
81
Example in Textbook: Micro (2)
The First set
82
Example in Textbook: Micro (3)
The Follow set
83
Example in Textbook: Micro (4)
Calculation of Predict sets for Micro
84
Example in Textbook: Micro (5)
LL(1) Table
85
Obtaining LL(1) Grammars
Most LL(1) prediction conflicts can be grouped into two categories: common prefixes and left recursion.
86
Common Prefixes Factoring method
87
Algorithm of Factoring
88
Left Recursion
89
Algorithm of Eliminating Left Recursion
90
LLgen
LLgen is part of the Amsterdam Compiler Kit.
It takes an LL(1) grammar plus semantic actions in C and generates a recursive descent parser.
The non-terminals in the grammar can have parameters, and rules can have local variables, both again expressed in C.
LLgen features:
repetition operators
advanced error handling
parameter passing
control over semantic actions
dynamic conflict resolvers
91
LLgen
Start from an LR(1) grammar
Make the grammar LL(1)
Use repetition operators
%token DIGIT;
main : [line]+ ;
line : expr '\n' ;
expr : term [ '+' term ]* ;
term : factor [ '*' factor ]* ;
factor : '(' expr ')' | DIGIT ;
Add semantic actions
Attach parameters to grammar rules
Insert C code between the symbols
92
Minimal non-left-recursive grammar for expressions
93
LLgen code for a parser Grammar Semantics
94
LLgen code for a parser The code from previous page resides in a file called parser.g. LLgen converts the file to one called parser.c, which contains a recursive descent parser.
95
LLgen interface to lexical analyzer
96
LLgen interface to back-end
LLgen handles syntax errors by inserting missing tokens and deleting unexpected tokens LLmessage() is invoked to notify the lexical analyzer