3. Formal Grammars and Top-down Parsing Chih-Hung Wang
Compilers
References
1. C. N. Fischer and R. J. LeBlanc. Crafting a Compiler with C. Pearson Education Inc., 2009.
2. D. Grune, H. Bal, C. Jacobs, and K. Langendoen. Modern Compiler Design. John Wiley & Sons, 2000.
3. Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 2nd ed., 2006.

Introduction Context-free Grammar
The syntax of programming language constructs can be described by a context-free grammar.
Important aspects:
A grammar imposes a structure on the linear sequence of tokens that makes up the program.
Using techniques from the field of formal languages, a parser can be constructed automatically from a grammar.
Grammars help programmers write syntactically correct programs and answer detailed questions about the syntax.

The role of the parser

Context-Free Grammars
A context-free grammar (CFG) is a compact, finite representation of a language, defined by the following four components:
A finite terminal vocabulary Vt
A finite nonterminal vocabulary Vn
A start symbol S ∈ Vn
A finite set of productions P of the form A → X1 … Xm, where A ∈ Vn, Xi ∈ Vn ∪ Vt, 1 ≤ i ≤ m
Note that A → ε is a valid production.

A Simple Expression Grammar
E → Prefix ( E )
E → V Tail
Prefix → F
Prefix → ε
Tail → + E
Tail → ε

Leftmost Derivations
A sentential form produced via a leftmost derivation is called a left sentential form.
The production sequence discovered by a large class of parsers (the top-down parsers) is a leftmost derivation. Hence, these parsers are said to produce a leftmost parse.
Example: F(V+V)
E ⇒lm Prefix(E) ⇒lm F(E) ⇒lm F(V Tail) ⇒lm F(V+E) ⇒lm F(V+V Tail) ⇒lm F(V+V)

Rightmost Derivations
As a bottom-up parser discovers the productions that derive a given token sequence, it traces a rightmost derivation, but the productions are applied in reverse order.
This is called a rightmost or canonical parse.
Example: F(V+V)
E ⇒rm Prefix(E) ⇒rm Prefix(V Tail) ⇒rm Prefix(V+E) ⇒rm Prefix(V+V Tail) ⇒rm Prefix(V+V) ⇒rm F(V+V)

Parse Tree
It is rooted at the start symbol S.
Each node is either a grammar symbol or ε.

Properties of CFGs
The grammar may include useless symbols.
The grammar may allow multiple, distinct derivations (parse trees) for some input string.
The grammar may include strings that do not belong in the language, or the grammar may exclude strings that are in the language.

Ambiguity (1)
Some grammars allow a derived string to have two or more different parse trees (and thus a nonunique structure).
Example:
<expression> → <expression> - <expression> | ID
This grammar allows two different parse trees for ID - ID - ID.

Ambiguity (2) Two parse trees for ID - ID - ID
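The two parse trees correspond to two different groupings of the same token string. As an illustration only, assuming the three IDs hold the values 8, 3 and 2 (values chosen here, not taken from the slides), the two structures yield different results:

```latex
\begin{align*}
(\mathrm{ID} - \mathrm{ID}) - \mathrm{ID} &: \quad (8 - 3) - 2 = 3 \\
\mathrm{ID} - (\mathrm{ID} - \mathrm{ID}) &: \quad 8 - (3 - 2) = 7
\end{align*}
```

This is why an ambiguous grammar is a problem in practice: the structure chosen by the parser determines the meaning of the program.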

Parsers and Recognizers
Two approaches
A parser is considered top-down if it generates a parse tree by starting at the root of the tree and expanding it by applying productions in a depth-first manner.
Bottom-up parsers generate a parse tree by starting at the tree's leaves and working toward its root.

Two approaches of Parser
Deterministic left-to-right top-down: the LL method
Deterministic left-to-right bottom-up: the LR method
Left-to-right: the sequence of tokens is processed from left to right.
Deterministic: no searching is involved; each token brings the parser one step closer to the goal of constructing the syntax tree.

Parsers (Top-down)

Parsers (bottom-Up)

Pre-order and post-order (1)
The top-down method constructs the syntax tree in pre-order.
The bottom-up method constructs the syntax tree in post-order.

Pre-order and post-order (2)

Principles of top-down parsing
The main task of a top-down parser is to choose the correct alternatives for known non-terminals

Principles of bottom-up parsing
The main task of a bottom-up parser is to repeatedly find the first node all of whose children have already been constructed.

Creating a top-down parser manually
Recursive descent parsing: the simplest way, but it has its limitations.

Recursive descent parsing program (1)

Recursive descent parsing program (2)
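A minimal sketch of a recursive descent recognizer in C, assuming the small expression grammar expression → term rest_expression, rest_expression → '+' expression | ε, term → IDENTIFIER | '(' expression ')'. The single-character token encoding ('i' for IDENTIFIER, end of string for EOF) and the helper names token, require and recognize are assumptions made for this illustration; this is not the program shown on the slides.

```c
/* Hypothetical single-character tokens: 'i' stands for IDENTIFIER,
   '\0' (end of string) stands for EOF. */
static const char *input_pos;   /* next unread token */
static int syntax_error;        /* set when a required construct is missing */

static int expression(void);

/* Consume the current token if it equals tk. */
static int token(char tk) {
    if (*input_pos != tk) return 0;
    if (tk != '\0') input_pos++;    /* never move past the end marker */
    return 1;
}

/* A construct that must be present: record an error if it was not found. */
static int require(int found) {
    if (!found) syntax_error = 1;
    return found;
}

/* term -> IDENTIFIER | '(' expression ')' */
static int term(void) {
    if (token('i')) return 1;
    return token('(') && require(expression()) && require(token(')'));
}

/* rest_expression -> '+' expression | empty */
static int rest_expression(void) {
    if (token('+')) return require(expression());
    return 1;                       /* the empty alternative always succeeds */
}

/* expression -> term rest_expression */
static int expression(void) {
    return term() && rest_expression();
}

/* input -> expression EOF; returns 1 if text is a sentence of the grammar. */
int recognize(const char *text) {
    input_pos = text;
    syntax_error = 0;
    return expression() && require(token('\0')) && !syntax_error;
}
```

Each non-terminal becomes one routine, and each routine tries its alternatives in textual order. For example, recognize("i+(i+i)") accepts, while recognize("i+") rejects because the expression required after '+' is missing.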

Drawbacks
Three drawbacks:
There is still some searching through the alternatives.
The method often fails to produce a correct parser.
Error handling leaves much to be desired.

Second problem (1)
Example 1: Index_element will never be tried
IDENTIFIER '['

Second problem (2)
Example 2: The recognizer will not recognize ab

Second problem (3)
Example 3: Recursive descent parsers cannot handle left-recursive grammars.

Creating a top-down parser automatically
The principles of constructing a top-down parser automatically derive from those of writing one by hand, by applying precomputation.
Grammars for which such a deterministic top-down parser can be constructed are called LL(1) grammars.

LL(1) parsing
FIRST set
The sets of first tokens produced by all alternatives in the grammar.
We have to precompute the FIRST sets of all non-terminals.
The FIRST sets of the terminals are obvious.
Finding FIRST(α) is trivial when α starts with a terminal.
FIRST(N) is the union of the FIRST sets of its alternatives.
FIRST(α) = { a ∈ Σ | α ⇒* aβ }

Predictive recursive descent parser
The FIRST sets can be used in the construction of a predictive parser because it predicts the presence of a given alternative without trying to find out if it is there.

Closure algorithm for computing the FIRST set (1)
Data definitions

Initializations

Inference rules

FIRST sets example(1) Grammar

FIRST sets example(2) The initial FIRST sets

FIRST sets example(3) The final FIRST sets

Another Example of First Set
E → Prefix ( E )
E → V Tail
Prefix → F
Prefix → ε
Tail → + E
Tail → ε

Another Example of First Set (II)
S → aSe
S → B
B → bBe
B → C
C → cCe
C → d

Another Example of First Set (III)
S → ABc
A → a
A → ε
B → b
B → ε

Algorithms of Computing First(α)
First(α) = { a ∈ Vt | α ⇒* aβ } ∪ { if α ⇒* ε then {ε} else ∅ }
(Page 104 in the textbook)
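The definition can be turned into a closure (fixed-point) computation. The following is a minimal sketch in C, assuming the grammar S → ABc, A → a | ε, B → b | ε. The encoding (upper-case letters as non-terminals, other characters as terminals, an empty string for an ε alternative) and all names are assumptions made for this illustration, not the textbook's algorithm verbatim. Membership of ε is tracked in a separate nullable flag.

```c
#include <ctype.h>
#include <string.h>

/* Assumed encoding: upper-case letters are non-terminals, every other
   character is a terminal, and an empty right-hand side denotes epsilon. */
struct production { char lhs; const char *rhs; };

/* The grammar S -> A B c, A -> a | epsilon, B -> b | epsilon. */
static const struct production grammar[] = {
    {'S', "ABc"}, {'A', "a"}, {'A', ""}, {'B', "b"}, {'B', ""},
};
enum { N_PRODUCTIONS = sizeof grammar / sizeof grammar[0] };

static char first[26][128];  /* first[N - 'A'][t] != 0: t is in FIRST(N) */
static char nullable[26];    /* nullable[N - 'A'] != 0: N =>* epsilon    */

/* Closure algorithm: apply the inference rules until nothing changes. */
void compute_first(void) {
    int changed = 1;
    memset(first, 0, sizeof first);
    memset(nullable, 0, sizeof nullable);
    while (changed) {
        changed = 0;
        for (int p = 0; p < N_PRODUCTIONS; p++) {
            int lhs = grammar[p].lhs - 'A';
            const char *s = grammar[p].rhs;
            /* Walk the right-hand side for as long as its prefix can vanish. */
            for (; *s != '\0'; s++) {
                if (!isupper((unsigned char)*s)) {          /* terminal */
                    if (!first[lhs][(int)*s]) { first[lhs][(int)*s] = 1; changed = 1; }
                    break;
                }
                int n = *s - 'A';                            /* non-terminal */
                for (int t = 0; t < 128; t++)
                    if (first[n][t] && !first[lhs][t]) { first[lhs][t] = 1; changed = 1; }
                if (!nullable[n]) break;
            }
            /* Every symbol was nullable, or the alternative is empty. */
            if (*s == '\0' && !nullable[lhs]) { nullable[lhs] = 1; changed = 1; }
        }
    }
}

/* Report whether epsilon is in FIRST(N). */
int is_nullable(char nonterminal) { return nullable[nonterminal - 'A']; }

/* Write the terminals of FIRST(N) into out, in character order. */
void first_as_string(char nonterminal, char *out) {
    for (int t = 0; t < 128; t++)
        if (first[nonterminal - 'A'][t]) *out++ = (char)t;
    *out = '\0';
}
```

For this grammar the sketch yields FIRST(S) = {a, b, c}, FIRST(A) = {a, ε} and FIRST(B) = {b, ε}: since both A and B can vanish, the tokens b and c can also start a sentence derived from S.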

The predictive parser (1)

The predictive parser (2)

Practice
Find the FIRST sets of all alternatives of the following grammar.
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id

Nullable alternatives
A complication arises with the case label for the empty alternative (e.g. rest_expression). Since it does not itself start with any token, how can we decide whether it is the correct alternative?

FOLLOW sets
The FOLLOW set of a non-terminal N is the set of tokens that can immediately follow N.
FOLLOW(A) = { b ∈ Σ | S ⇒+ αAbβ }
LL(1) parser
'LL' because the parser works from Left to right, identifying the nodes in what is called Leftmost derivation order.
'(1)' because all choices are based on a one-token look-ahead.

Closure algorithm for computing the FOLLOW sets

The first and follow sets

Recall the predictive parser
rest_expression → '+' expression | ε
FIRST(rest_expr) = { '+', ε }
FOLLOW(rest_expr) = { EOF, ')' }

void rest_expression(void) {
    switch (Token.class) {
    case '+': token('+'); expression(); break;
    case EOF: case ')': break;   /* empty alternative: the look-ahead is in FOLLOW(rest_expr) */
    default: error();
    }
}

Another Example of Follow Set
Follow(A) = { a ∈ Vt | S ⇒* … A a … } ∪ { if S ⇒+ αA then {ε} else ∅ }
S → ABc
A → a
A → ε
B → b
B → ε

Another Example of Follow Set (II)
S → aSe
S → B
B → bBe
B → C
C → cCe
C → d

Another Example of Follow Set (III)
S → ABc
A → a
A → ε
B → b
B → ε
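A minimal sketch in C of how the FOLLOW sets of this grammar could be computed by a closure algorithm. It recomputes FIRST and nullability first, because FOLLOW depends on both. The encoding (upper-case letters as non-terminals, an empty string for ε) and all function names are assumptions made for this illustration; the end-of-input marker '$' is also an assumption and plays the role of EOF.

```c
#include <ctype.h>
#include <string.h>

/* Assumed encoding: upper-case letters are non-terminals, other characters
   are terminals, "" is an epsilon alternative, '$' marks end of input. */
struct production { char lhs; const char *rhs; };

static const struct production grammar[] = {
    {'S', "ABc"}, {'A', "a"}, {'A', ""}, {'B', "b"}, {'B', ""},
};
enum { N_PRODUCTIONS = sizeof grammar / sizeof grammar[0] };

static char first[26][128], follow[26][128], nullable[26];

/* Add every member of src to dst; report whether dst grew. */
static int add_all(char *dst, const char *src) {
    int grew = 0;
    for (int t = 0; t < 128; t++)
        if (src[t] && !dst[t]) { dst[t] = 1; grew = 1; }
    return grew;
}

/* Add one terminal to dst; report whether dst grew. */
static int add_one(char *dst, char t) {
    if (dst[(int)t]) return 0;
    dst[(int)t] = 1;
    return 1;
}

/* FIRST sets and nullability, computed to a fixed point. */
static void compute_first(void) {
    for (int changed = 1; changed; ) {
        changed = 0;
        for (int p = 0; p < N_PRODUCTIONS; p++) {
            int lhs = grammar[p].lhs - 'A';
            const char *s = grammar[p].rhs;
            for (; *s != '\0'; s++) {
                if (!isupper((unsigned char)*s)) { changed |= add_one(first[lhs], *s); break; }
                changed |= add_all(first[lhs], first[*s - 'A']);
                if (!nullable[*s - 'A']) break;
            }
            if (*s == '\0' && !nullable[lhs]) { nullable[lhs] = 1; changed = 1; }
        }
    }
}

/* FOLLOW(start) contains '$'. For each production A -> alpha N beta:
   FIRST(beta) without epsilon is added to FOLLOW(N), and if beta can
   vanish, FOLLOW(A) is added to FOLLOW(N) as well. */
void compute_follow(char start) {
    memset(first, 0, sizeof first);
    memset(follow, 0, sizeof follow);
    memset(nullable, 0, sizeof nullable);
    compute_first();
    follow[start - 'A']['$'] = 1;
    for (int changed = 1; changed; ) {
        changed = 0;
        for (int p = 0; p < N_PRODUCTIONS; p++) {
            for (const char *n = grammar[p].rhs; *n != '\0'; n++) {
                if (!isupper((unsigned char)*n)) continue;
                char *target = follow[*n - 'A'];
                const char *b = n + 1;              /* beta: what follows N */
                for (; *b != '\0'; b++) {
                    if (!isupper((unsigned char)*b)) { changed |= add_one(target, *b); break; }
                    changed |= add_all(target, first[*b - 'A']);
                    if (!nullable[*b - 'A']) break;
                }
                if (*b == '\0')                     /* beta vanished entirely */
                    changed |= add_all(target, follow[grammar[p].lhs - 'A']);
            }
        }
    }
}

/* Write the members of FOLLOW(N) into out, in character order. */
void follow_as_string(char nonterminal, char *out) {
    for (int t = 0; t < 128; t++)
        if (follow[nonterminal - 'A'][t]) *out++ = (char)t;
    *out = '\0';
}
```

After compute_follow('S'), the sketch gives FOLLOW(A) = {b, c} and FOLLOW(B) = {c}. The token c ends up in FOLLOW(A) because B, which stands between A and c in S → ABc, can produce ε.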

Algorithm of Follow(A)

LL(1) conflicts Example The codes

LL(1) conflicts
FIRST/FIRST conflict
term → IDENTIFIER
     | IDENTIFIER '[' expression ']'
     | '(' expression ')'

LL(1) conflicts
FIRST/FOLLOW conflict
                   FIRST set     FOLLOW set
S → A 'a' 'b'      { 'a' }       { }
A → 'a' | ε        { 'a', ε }    { 'a' }

LL(1) conflicts
Left recursion
expression → expression '-' term | term
Look-ahead token
The LL(1) method predicts the alternative Ak for a non-terminal N when the look-ahead token is in the set
FIRST(Ak) ∪ (if Ak is nullable then FOLLOW(N))
LL(1) grammar
No FIRST/FIRST conflicts
No FIRST/FOLLOW conflicts
No multiple nullable alternatives: no non-terminal can have more than one nullable alternative.

Solve the LL(1) conflicts
Two options:
Use a stronger parser
Make the grammar LL(1)

Making a grammar LL(1)
Manual labour:
rewrite the grammar
adjust the semantic actions
Three rewrite methods:
left factoring
substitution
left-recursion removal

Left-factoring
term → IDENTIFIER | IDENTIFIER '[' expression ']'
Factor out the common prefix:
term → IDENTIFIER after_identifier
after_identifier → ε | '[' expression ']'
'[' ∉ FOLLOW(after_identifier)
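A minimal sketch in C of a predictive parser for the left-factored rule. After factoring, a single look-ahead token is enough to choose between the two alternatives of after_identifier. The token encoding ('i' for IDENTIFIER), the helper names, and the simplification that the bracketed expression is just another term are assumptions made to keep the illustration short.

```c
/* Hypothetical single-character tokens: 'i' stands for IDENTIFIER.
   The expression inside the brackets is simplified to a single term. */
static const char *look;   /* look-ahead token */
static int ok;             /* cleared on the first syntax error */

/* Consume token t, or record an error if the look-ahead differs. */
static void match(char t) {
    if (*look == t) look++;
    else ok = 0;
}

static void term(void);

/* after_identifier -> epsilon | '[' expression ']' */
static void after_identifier(void) {
    if (*look == '[') {            /* '[' is in FIRST of the second alternative */
        match('[');
        term();
        match(']');
    }
    /* otherwise: the epsilon alternative, which consumes nothing */
}

/* term -> IDENTIFIER after_identifier */
static void term(void) {
    match('i');
    if (ok) after_identifier();
}

/* Returns 1 if text is a complete term, 0 otherwise. */
int parse_term(const char *text) {
    look = text;
    ok = 1;
    term();
    return ok && *look == '\0';
}
```

With the original, unfactored rule both alternatives start with IDENTIFIER, so term() could not decide which one to take. Here parse_term("i") and parse_term("i[i]") are both accepted by the same routine.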

Left-recursion removal
Three types of left-recursion:
Direct left-recursion: N → N α | …
Indirect left-recursion (chain structure): N → A …, A → B …, …, Z → N …
Hidden left-recursion: N → α N | … (α can produce ε)

Left-recursion removal
N → N α | β generates β α α α ...
Replace by:
N → β M
M → α M | ε
Example:
expression → expression '-' term | term
expression → term expression_tail_option
expression_tail_option → '-' term expression_tail_option | ε

Answers
Substitution:
F → '(' E ')' | ID '(' expr-list? ')' | ID | constant
Left factoring:
E → E ( '+' | '-' ) T | T
T → T ( '*' | '/' ) F | F
F → '(' E ')' | ID ( '(' expr-list? ')' )? | constant
Left-recursion removal:
E → T ( ( '+' | '-' ) T )*
T → F ( ( '*' | '/' ) F )*

Undoing the semantic effects of grammar transformations
While it is often possible to transform our grammar into a new grammar that is acceptable to a parser generator and that generates the same language, the new grammar usually assigns a different structure to strings in the language than our original grammar did.
Fortunately, in many cases we are not really interested in the structure but rather in the semantics implied by it.

Semantics Non-left-recursive equivalent
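A minimal sketch in C of one way to keep left-associative semantics with the non-left-recursive grammar expression → term expression_tail_option, expression_tail_option → '-' term expression_tail_option | ε. The value computed so far is passed down as a parameter. The single-digit terms and the function names are assumptions made for this illustration.

```c
#include <ctype.h>

static const char *look;   /* look-ahead position in the input */

/* term -> DIGIT; returns the digit's value (0 if no digit is present). */
static int term(void) {
    if (!isdigit((unsigned char)*look)) return 0;
    return *look++ - '0';
}

/* expression_tail_option -> '-' term expression_tail_option | epsilon
   The value of everything to the left is passed in as a parameter, so
   each subtraction is applied to the result accumulated so far. */
static int expression_tail_option(int left) {
    if (*look != '-') return left;      /* epsilon alternative */
    look++;                             /* match '-' */
    return expression_tail_option(left - term());
}

/* expression -> term expression_tail_option */
int evaluate(const char *text) {
    look = text;
    return expression_tail_option(term());
}
```

Although the grammar is right-recursive, evaluate("8-3-2") computes (8 - 3) - 2 = 3, the same value the original left-recursive grammar implies, rather than 8 - (3 - 2) = 7.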

Automatic conflict resolution (1)
There are two ways in which LL parsers can be strengthened:
By increasing the look-ahead
Distinguishing alternatives not by their first token but by their first two tokens is called LL(2).
Disadvantage: the parser code can get much bigger.
By allowing dynamic conflict resolvers
When a conflict arises during parsing, some conditions are evaluated to resolve it.
The parser generator LLgen requires a conflict resolver to be placed on the first of two conflicting alternatives.

Automatic conflict resolution (2)
If-else statement in C
else_tail_option: both the FIRST set and the FOLLOW set contain the token 'else'
Conflict resolver

The LL(1) push-down automaton
Transition table for an LL(1) parser

Push-down automaton (PDA)
Types of moves
Prediction move
The top of the prediction stack is a non-terminal N.
N is removed from the stack.
The prediction table is looked up.
The alternative of N is pushed onto the prediction stack.
Match move
The top of the prediction stack is a terminal.
Termination
Parsing terminates when the prediction stack is exhausted.

Prediction move in an LL(1) PDA

Match move in an LL(1) PDA

Predictive parsing with an LL(1) PDA
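A minimal sketch in C of a table-driven LL(1) push-down automaton with prediction and match moves. It assumes the grammar input → expression EOF, expression → term rest-expr, term → IDENT | '(' expression ')', rest-expr → '+' expression | ε. The one-character encoding of symbols and the function names predict and ll1_parse are assumptions made for this illustration.

```c
#include <string.h>

/* Assumed encoding: 'I' input, 'E' expression, 'T' term, 'R' rest-expr are
   non-terminals; 'i' (IDENT), '+', '(', ')' and '$' (EOF) are terminals. */

/* Transition table: the right-hand side predicted for non-terminal n on
   look-ahead token t, or NULL for an error entry. "" is the empty alternative. */
static const char *predict(char n, char t) {
    switch (n) {
    case 'I': return (t == 'i' || t == '(') ? "E$" : NULL;
    case 'E': return (t == 'i' || t == '(') ? "TR" : NULL;
    case 'T': return t == 'i' ? "i" : t == '(' ? "(E)" : NULL;
    case 'R': return t == '+' ? "+E" : (t == ')' || t == '$') ? "" : NULL;
    default:  return NULL;
    }
}

/* Returns 1 if tokens (terminated by '$') is accepted, 0 otherwise. */
int ll1_parse(const char *tokens) {
    char stack[256];
    int top = 0;
    stack[top++] = 'I';                          /* start symbol */
    while (top > 0) {
        char symbol = stack[--top];
        if (strchr("IETR", symbol) != NULL) {    /* prediction move */
            const char *rhs = predict(symbol, *tokens);
            if (rhs == NULL) return 0;           /* error entry in the table */
            size_t len = strlen(rhs);
            if (top + len > sizeof stack) return 0;     /* overflow guard */
            while (len > 0) stack[top++] = rhs[--len];  /* push in reverse */
        } else {                                 /* match move */
            if (*tokens != symbol) return 0;
            if (symbol != '$') tokens++;
        }
    }
    return 1;                                    /* prediction stack exhausted */
}
```

The right-hand side is pushed in reverse so that its first symbol ends up on top of the prediction stack. With this encoding, an input such as aap + ( noot + mies ) EOF corresponds to the string "i+(i+i)$", which ll1_parse accepts.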

PDA example (1)
prediction stack: input
input: aap + ( noot + mies ) EOF
Transition table (row: state at the top of the stack; column: look-ahead token)
state        IDENT            +              (                 )    EOF
input        expression EOF                  expression EOF
expression   term rest-expr                  term rest-expr
term         IDENT                           ( expression )
rest-expr                     + expression                     ε    ε

PDA example (2)
prediction stack: input
input: aap + ( noot + mies ) EOF
Replace the non-terminal by its transition entry.

PDA example (3)
prediction stack: expression EOF
input: aap + ( noot + mies ) EOF

PDA example (4)
prediction stack: expression EOF
input: aap + ( noot + mies ) EOF
Replace the non-terminal by its transition entry.

PDA example (5)
prediction stack: term rest-expr EOF
input: aap + ( noot + mies ) EOF

PDA example (6)
prediction stack: term rest-expr EOF
input: aap + ( noot + mies ) EOF
Replace the non-terminal by its transition entry.

PDA example (7)
Please continue!

Example of parsing (i+i)+i

Example in Textbook: Micro (1)

The First set

The Follow set

Calculation of Predict sets for Micro

LL(1) Table

Obtaining LL(1) Grammars
Most LL(1) prediction conflicts can be grouped into two categories: common prefix and left recursion

Common Prefixes Factoring method

Algorithm of Factoring

Left Recursion

Algorithm of Eliminating Left Recursion

LLgen
LLgen is part of the Amsterdam Compiler Kit.
It takes an LL(1) grammar plus semantic actions in C and generates a recursive descent parser.
The non-terminals in the grammar can have parameters, and rules can have local variables, both again expressed in C.
LLgen features:
repetition operators
advanced error handling
parameter passing
control over semantic actions
dynamic conflict resolvers

LLgen
start from an LR(1) grammar
make the grammar LL(1)
use repetition operators

%token DIGIT;
main : [line]+ ;
line : expr '\n' ;
expr : term [ '+' term ]* ;
term : factor [ '*' factor ]* ;
factor : '(' expr ')' | DIGIT ;

add semantic actions
attach parameters to grammar rules
insert C code between the symbols

Minimal non-left-recursive grammar for expressions

LLgen code for a parser Grammar Semantics

LLgen code for a parser
The code from the previous page resides in a file called parser.g. LLgen converts the file to one called parser.c, which contains a recursive descent parser.

LLgen interface to lexical analyzer

LLgen interface to back-end
LLgen handles syntax errors by inserting missing tokens and deleting unexpected tokens.
LLmessage() is invoked to notify the lexical analyzer.
