Chapter 2 (part) + Chapter 4: Syntax Analysis S. M. Farhad 1
Grammars Specify the syntax of a language Hierarchical structure Java if-else statement: if ( expr ) stmt else stmt A production rule for the if-else statement: stmt → if ( expr ) stmt else stmt Terminals and nonterminals 3
Context Free Grammars The notation to specify syntax: Context Free Grammar (CFG), Backus-Naur Form (BNF) A context-free grammar is used to analyze the syntax It is also used to translate programs 4
Components of Grammars A set of terminal symbols For example: tokens such as +, -, keywords A set of nonterminals Each nonterminal denotes a set of strings that helps define the language Nonterminals impose a hierarchical structure For example expr, stmt as follows: stmt → if ( expr ) stmt else stmt 5
Components of Grammars A set of productions The head or left side Consists of a nonterminal An arrow (→), read "can have the form" The body or right side A sequence of terminals and nonterminals Start symbol A special nonterminal symbol The productions for the start symbol are listed first 6
Example Arithmetic expressions built from +, –, * and parentheses E → E + E | E – E | E * E | ( E ) | int int → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 7
Derivations Beginning with the start symbol Each rewriting step replaces a nonterminal by the body of one of its productions Leftmost derivation: the leftmost nonterminal is always chosen LL parsing (scans input left to right, builds a leftmost derivation) Rightmost derivation: the rightmost nonterminal is always chosen LR parsing (scans input left to right, builds a rightmost derivation in reverse) 8
Left Most Derivation Given E → E + E | E – E | E * E | ( E ) | int String int * int + int E => E + E => E * E + E => int * E + E => int * int + E => int * int + int 9
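The leftmost derivation above can be replayed mechanically. This is a minimal sketch, assuming int is treated as a single token and that the caller supplies the index of the production to apply at each step; the function name leftmost_derivation and the choices list are illustrative only.

```python
# Assumption: "int" is a terminal token here, so E is the only nonterminal.
GRAMMAR = {
    "E": [("E", "+", "E"), ("E", "-", "E"), ("E", "*", "E"), ("(", "E", ")"), ("int",)],
}


def leftmost_derivation(grammar, start, choices):
    """Apply the chosen production to the leftmost nonterminal at every step."""
    form = [start]
    steps = [" ".join(form)]
    for choice in choices:
        # Find the position of the leftmost nonterminal in the sentential form.
        pos = next(i for i, sym in enumerate(form) if sym in grammar)
        # Replace it by the body of the selected production.
        form[pos:pos + 1] = grammar[form[pos]][choice]
        steps.append(" ".join(form))
    return steps


# Production indices: 0 is E + E, 2 is E * E, 4 is int.
steps = leftmost_derivation(GRAMMAR, "E", [0, 2, 4, 4, 4])
print(" => ".join(steps))
```

Choosing the last nonterminal instead of the first in the search for pos would give a rightmost derivation.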
Right Most Derivation String int * int + int E => E + E => E + int => E * E + int => E * int + int => int * int + int
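Parse Tree String int * int + int [Figure: parse tree with root E → E + E; the left E expands to E * E, and every remaining E derives int]
Ambiguity A grammar is ambiguous if it produces more than one parse tree for some sentence For string: int * int + int [Figure: two parse trees, one with root E → E + E whose left E expands to E * E, the other with root E → E * E whose right E expands to E + E]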
Reasons for Ambiguity Associativity and precedence are not expressed in the grammar +, -, *, / are left-associative *, / have higher precedence than +, - Use E and T for two levels of precedence Use F for basic units of expression
Non Ambiguous F → int | ( E ) T → T * F | T / F | F E → E + T | E – T | T String: int * (int + int)
Ambiguity: The Dangling Else Consider the grammar S → if E then S | if E then S else S | other This grammar is also ambiguous
Ambiguity: The Dangling Else The statement if E1 then if E2 then S1 else S2 has two parse trees [Figure: two parse trees, one attaching else S2 to the outer if E1, the other attaching it to the inner if E2] Typically we want the form in which else S2 attaches to the inner if E2
The Dangling Else: A Fix else matches the closest unmatched then We can describe this in the grammar S → MS /* all then are matched */ | US /* some then are unmatched */ MS → if E then MS else MS | other US → if E then S | if E then MS else US 17
The Dangling Else: The Parse Tree The statement if E1 then if E2 then S1 else S2 [Figure: parse tree S → US → if E1 then S, where the inner S → MS → if E2 then S1 else S2]
CFG vs RE Grammars are a more powerful notation than REs For RE: (a | b)*abb A0 → aA0 | bA0 | aA1 A1 → bA2 A2 → bA3 A3 → ε
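To see that the grammar and the regular expression describe the same strings, the productions can be run as a recognizer and compared with Python's re module. This is only a sketch: the RULES dictionary and the derives function are hypothetical names, and the recognizer relies on every body here being either empty or one terminal followed by one nonterminal.

```python
import re

# A0 → aA0 | bA0 | aA1,  A1 → bA2,  A2 → bA3,  A3 → ε  (the empty tuple is ε)
RULES = {
    "A0": [("a", "A0"), ("b", "A0"), ("a", "A1")],
    "A1": [("b", "A2")],
    "A2": [("b", "A3")],
    "A3": [()],
}


def derives(nonterminal, text):
    """Return True if `nonterminal` derives exactly `text`."""
    for body in RULES[nonterminal]:
        if not body:
            if text == "":
                return True
        elif text[:1] == body[0] and derives(body[1], text[1:]):
            # The terminal matched the next character; continue with the rest.
            return True
    return False


for sample in ["abb", "babb", "aabb", "ab", "abba"]:
    print(sample, derives("A0", sample), bool(re.fullmatch(r"(a|b)*abb", sample)))
```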
Why Use RE in Lexical Analysis Separating lexical from syntactic structure gives two manageable-sized components Lexical rules are simpler REs are more concise Construction of the lexical analyzer becomes easier and more efficient 20
RE vs CFG REs are most useful for identifiers, constants, keywords, and white space Grammars are most useful for describing nested structure Balanced parentheses, matching begin-end's, corresponding if-then-else Nested structure cannot be described by RE 21
Parsing Top down parsing: Starts at the root and proceeds towards the leaves Easier to understand and program manually Bottom up parsing: Starts at the leaves and proceeds towards the root More powerful, used by most parser generators 22
Recursive Descent Parsing Consider the grammar E → T + E | T T → int | int * T | ( E ) Token stream is: int * int Start with top-level non-terminal E Try the rules for E in order 23
Recursive Descent Parsing - Example Try E → T + E Then try a rule for T → ( E ) But ( does not match input token int Try T → int - Token matches. But + after T does not match input token * Try T → int * T This will match but + after T will be unmatched Has exhausted the choices for T Backtrack to choice for E 24
Recursive Descent Parsing - Example Token stream is: int * int Try E → T Follow same steps as before for T And succeed with T → int * T and T → int With the following parse tree [Figure: parse tree E → T, T → int * T, inner T → int] 25
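The try-and-backtrack procedure walked through above can be sketched with Python generators. This is an illustrative variant rather than the exact procedure on the slides: each function yields every input position where a symbol could end, so backtracking happens implicitly when the caller asks for the next candidate. The function names are hypothetical.

```python
GRAMMAR = {
    "E": [("T", "+", "E"), ("T",)],
    "T": [("int",), ("int", "*", "T"), ("(", "E", ")")],
}


def parse_symbol(symbol, tokens, pos):
    """Yield every position at which `symbol` can end when it starts at `pos`."""
    if symbol not in GRAMMAR:
        # Terminal: it must match the next input token.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    for body in GRAMMAR[symbol]:
        # Try the rules for this nonterminal in order.
        yield from parse_body(body, tokens, pos)


def parse_body(body, tokens, pos):
    """Yield every end position for a sequence of symbols started at `pos`."""
    if not body:
        yield pos
        return
    for mid in parse_symbol(body[0], tokens, pos):
        yield from parse_body(body[1:], tokens, mid)


def accepts(tokens):
    """Accept if some way of parsing E consumes the whole token stream."""
    return any(end == len(tokens) for end in parse_symbol("E", tokens, 0))


print(accepts(["int", "*", "int"]))
```

Note that this sketch would recurse forever on a left-recursive grammar, because a nonterminal would call itself at the same input position.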
When Recursive Descent Does Not Work Consider the left-recursive grammar S → S α | β The procedure for S calls itself without consuming any input symbol The parser gets into an infinite loop Recursive descent does not work in such cases 26
Elimination of Left Recursion Consider the left-recursive grammar S → S α | β S generates all strings starting with a β and followed by a number of α Can rewrite using right-recursion S → β S’ S’ → α S’ | ε 27
More Elimination of Left- Recursion In general S → S α1 | … | S αn | β1 | … | βm All strings derived from S start with one of β1,…,βm and continue with several instances of α1,…,αn Rewrite as S → β1 S’ | … | βm S’ S’ → α1 S’ | … | αn S’ | ε 28
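The general rewrite for immediate left recursion is mechanical enough to sketch in a few lines. The function name and the convention of naming the new nonterminal by appending an apostrophe are assumptions for illustration; the empty tuple stands for ε, and the sketch assumes no α is itself empty.

```python
def eliminate_immediate_left_recursion(head, bodies):
    """Rewrite head → head α1 | … | head αn | β1 | … | βm without left recursion."""
    new_head = head + "'"
    # Bodies that start with the head itself contribute an α (the rest of the body).
    alphas = [body[1:] for body in bodies if body and body[0] == head]
    # All other bodies are the β alternatives.
    betas = [body for body in bodies if not body or body[0] != head]
    if not alphas:
        return {head: list(bodies)}  # nothing to do
    return {
        head: [beta + (new_head,) for beta in betas],
        new_head: [alpha + (new_head,) for alpha in alphas] + [()],
    }


# E → E + T | E - T | T
result = eliminate_immediate_left_recursion("E", [("E", "+", "T"), ("E", "-", "T"), ("T",)])
for nonterminal, alternatives in result.items():
    print(nonterminal, "→", " | ".join(" ".join(body) or "ε" for body in alternatives))
```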
General Left Recursion The grammar S → A α | δ A → S β is also left-recursive because S =>+ S β α (in two derivation steps) This left-recursion can also be eliminated See book, Section 4.3 for general algorithm 29
Summary of Recursive Descent Simple and general parsing strategy Left-recursion must be eliminated first … but that can be done automatically Unpopular because of backtracking Thought to be too inefficient In practice, backtracking is eliminated by restricting the grammar 30
Predictive Parsers Like recursive-descent but parser can “predict” which production to use By looking at the next few tokens No backtracking Predictive parsers accept LL(k) grammars The first L means “left-to-right” scan of input The second L means “leftmost derivation” k means “predict based on k tokens of lookahead” In practice, LL(1) is used 31
LL(1) Languages In recursive descent, for each non-terminal and input token there may be a choice of productions LL(1) means that for each non-terminal and token there is at most one production Can be specified via 2D tables One dimension for current non-terminal to expand One dimension for next token A table entry contains one production 32
Predictive Parsing and Left Factoring Recall the grammar E → T + E | T T → int | int * T | ( E ) Hard to predict because For T two productions start with int For E it is not clear how to predict A grammar must be left-factored before use for predictive parsing 33
Left-Factoring Example Recall the grammar E → T + E | T T → int | int * T | ( E ) Factor out common prefixes of productions E → T X X → + E | ε T → ( E ) | int Y Y → * T | ε 34
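One level of left factoring can be sketched as follows. This is a simplified illustration, not the full textbook algorithm: it groups alternatives by their first symbol, factors out the longest common prefix of each group, and takes the names of the new nonterminals from a caller-supplied iterator (fresh_names is a hypothetical parameter). The empty tuple stands for ε.

```python
def common_prefix(bodies):
    """Longest sequence of symbols shared by the start of every body."""
    prefix = []
    for symbols in zip(*bodies):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return tuple(prefix)


def left_factor(head, bodies, fresh_names):
    """Factor common prefixes out of the alternatives for `head` (one level only)."""
    groups = {}
    for body in bodies:
        groups.setdefault(body[:1], []).append(body)  # group by first symbol
    result = {head: []}
    for group in groups.values():
        if len(group) == 1:
            result[head].append(group[0])  # no shared prefix, keep as is
            continue
        prefix = common_prefix(group)
        new_head = next(fresh_names)
        result[head].append(prefix + (new_head,))
        result[new_head] = [body[len(prefix):] for body in group]
    return result


print(left_factor("E", [("T", "+", "E"), ("T",)], iter(["X"])))
print(left_factor("T", [("int",), ("int", "*", "T"), ("(", "E", ")")], iter(["Y"])))
```

The alternatives come out in a different order than on the slide, but they describe the same productions.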
Left-Factoring Example Left-factored grammar E → T X X → + E | ε T → ( E ) | int Y Y → * T | ε Token stream is: int * int [Figure: parse tree E → T X, T → int Y, Y → * T, inner T → int Y, inner Y → ε, X → ε] 35
LL(1) Parsing Table Example Left-factored grammar E → T X X → + E | ε T → ( E ) | int Y Y → * T | ε LL(1) parsing table:
      int      *      +      (        )    $
E     T X                    T X
X                     + E             ε    ε
T     int Y                  ( E )
Y              * T    ε               ε    ε
LL(1) Parsing Table Example Consider the [E, int] entry “When current non-terminal is E and next input is int, use production E → T X” This production can generate an int in the first position Consider the [Y, +] entry “When current non-terminal is Y and current token is +, get rid of Y” Y can be followed by + only in a derivation in which Y → ε 37
LL(1) Parsing Tables - Errors Blank entries indicate error situations Consider the [E,*] entry “There is no way to derive a string starting with * from non-terminal E” 38
Using Parsing Tables Method similar to recursive descent, except For each non-terminal S We look at the next token a And choose the production shown at [S, a] We use a stack to keep track of pending nonterminals We reject when we encounter an error state We accept when we encounter end-of-input and the stack is empty 39
LL(1) Parsing Algorithm
initialize stack = <S $> and next
repeat
  case stack of
    <X, rest> : if T[X, *next] = Y1…Yn
                  then stack ← <Y1…Yn rest>;
                  else error();
    <t, rest> : if t == *next++
                  then stack ← <rest>;
                  else error();
until stack == < >
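The algorithm can be sketched in Python for the left-factored expression grammar. The TABLE dictionary below hard-codes the parsing table for E, X, T, Y used in these slides; the function name ll1_parse and the choice to return the list of productions used are assumptions for illustration.

```python
# Parsing table for E → T X, X → + E | ε, T → ( E ) | int Y, Y → * T | ε.
# The empty tuple stands for ε; missing keys are error entries.
TABLE = {
    ("E", "int"): ("T", "X"), ("E", "("): ("T", "X"),
    ("X", "+"): ("+", "E"), ("X", ")"): (), ("X", "$"): (),
    ("T", "int"): ("int", "Y"), ("T", "("): ("(", "E", ")"),
    ("Y", "*"): ("*", "T"), ("Y", "+"): (), ("Y", ")"): (), ("Y", "$"): (),
}
NONTERMINALS = {"E", "X", "T", "Y"}


def ll1_parse(tokens, start="E"):
    """Return the productions used, in order, or raise SyntaxError."""
    tokens = list(tokens) + ["$"]   # append the end-of-input marker
    stack = ["$", start]            # the top of the stack is the end of the list
    pos = 0
    used = []
    while stack:
        top = stack.pop()
        lookahead = tokens[pos]
        if top in NONTERMINALS:
            body = TABLE.get((top, lookahead))
            if body is None:
                raise SyntaxError(f"no table entry for [{top}, {lookahead}]")
            used.append((top, body))
            stack.extend(reversed(body))  # push the body so its first symbol is on top
        elif top == lookahead:
            pos += 1                      # terminal matched: advance the input
        else:
            raise SyntaxError(f"expected {top}, found {lookahead}")
    return used


for head, body in ll1_parse(["int", "*", "int"]):
    print(head, "→", " ".join(body) or "ε")
```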
LL(1) Parsing Example
Stack          Input          Action
E $            int * int $    T X
T X $          int * int $    int Y
int Y X $      int * int $    terminal
Y X $          * int $        * T
* T X $        * int $        terminal
T X $          int $          int Y
int Y X $      int $          terminal
Y X $          $              ε
X $            $              ε
$              $              ACCEPT
Constructing Parsing Tables LL(1) languages are those defined by a parsing table for the LL(1) algorithm No table entry can be multiply defined We want to generate parsing tables from CFG 42
Constructing Parsing Tables If A → α, where in the row of A do we place α? In the column of t where t can start a string derived from α: α =>* t β We say that t ∈ First(α) In the column of t if α =>* ε and t can follow an A: S =>* β A t δ We say t ∈ Follow(A) 43
Computing First Sets Definition: First(X) = { t | X =>* tα} ∪ {ε | X =>* ε} Algorithm sketch (see book for details): 1. For all terminals t do First(t) ← { t } 2. If X → A1 … Ak: everything in First(A1) except ε is in First(X); if A1 does not derive ε, stop; if A1 =>* ε then add First(A2), and so on 3. For each production X → ε, add ε to First(X) 44
First Sets - Example Recall the grammar E → T X X → + E | ε T → ( E ) | int Y Y → * T | ε First sets First( ( ) = { ( } First( T ) = {int, ( } First( ) ) = { ) } First( E ) = {int, ( } First(int) = {int } First( X ) = {+, ε } First( + ) = { + } First( Y ) = {*, ε } First( * ) = { * } 45
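The First computation can be sketched as a fixed-point iteration: keep applying the rules until no set grows. The names below are hypothetical, the empty tuple encodes an ε production, and First of a terminal is taken to be the terminal itself rather than stored in a table.

```python
EPS = "ε"
GRAMMAR = {
    "E": [("T", "X")],
    "X": [("+", "E"), ()],
    "T": [("(", "E", ")"), ("int", "Y")],
    "Y": [("*", "T"), ()],
}


def first_sets(grammar):
    """Compute First(X) for every nonterminal X by iterating to a fixed point."""
    first = {nonterminal: set() for nonterminal in grammar}

    def first_of(symbol):
        # First(t) = {t} for a terminal t.
        return first[symbol] if symbol in grammar else {symbol}

    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                before = len(first[head])
                for symbol in body:
                    first[head] |= first_of(symbol) - {EPS}
                    if EPS not in first_of(symbol):
                        break  # this symbol cannot vanish, so stop here
                else:
                    # Every symbol can derive ε (or the body is empty).
                    first[head].add(EPS)
                changed |= len(first[head]) != before
    return first


for nonterminal, symbols in first_sets(GRAMMAR).items():
    print(f"First({nonterminal}) =", sorted(symbols))
```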
Computing Follow Sets Definition: Follow(B) = { t | S =>* β B t δ } If S is the start symbol then $ ∈ Follow(S) If A → α B β then First(β) - {ε} ⊆ Follow(B) If A → α B, or A → α B β and ε ∈ First(β), then Follow(A) ⊆ Follow(B) 46
Follow Sets. Example Recall the grammar E → T X X → + E | ε T → ( E ) | int Y Y → * T | ε Follow sets Follow( + ) = {int, ( } Follow( E ) = {), $} Follow( ( ) = {int, ( } Follow( X ) = {), $} Follow( * ) = {int, ( } Follow( T ) = {+, ), $} Follow( ) ) = {+, ), $} Follow( Y ) = {+, ), $} Follow(int) = {*, +, ), $} 47
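The Follow rules can be iterated to a fixed point in the same way. To keep the sketch short, the First sets of the nonterminals are hard-coded from the First-set example for this grammar rather than recomputed, and only the Follow sets of nonterminals are produced (the slide also lists Follow for terminals). All names are hypothetical.

```python
EPS, END = "ε", "$"
GRAMMAR = {
    "E": [("T", "X")],
    "X": [("+", "E"), ()],
    "T": [("(", "E", ")"), ("int", "Y")],
    "Y": [("*", "T"), ()],
}
# Assumption: First sets copied from the example, not computed here.
FIRST = {"E": {"int", "("}, "X": {"+", EPS}, "T": {"int", "("}, "Y": {"*", EPS}}


def first_of_string(symbols):
    """First of a sequence of symbols; contains ε if the whole sequence can vanish."""
    result = set()
    for symbol in symbols:
        symbol_first = FIRST.get(symbol, {symbol})  # First(t) = {t} for a terminal
        result |= symbol_first - {EPS}
        if EPS not in symbol_first:
            return result
    result.add(EPS)
    return result


def follow_sets(grammar, start):
    follow = {nonterminal: set() for nonterminal in grammar}
    follow[start].add(END)  # $ ∈ Follow(S)
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, symbol in enumerate(body):
                    if symbol not in grammar:
                        continue  # only nonterminals get a Follow set here
                    rest = first_of_string(body[i + 1:])
                    new = rest - {EPS}            # First(β) - {ε} ⊆ Follow(B)
                    if EPS in rest:
                        new |= follow[head]       # Follow(A) ⊆ Follow(B)
                    if not new <= follow[symbol]:
                        follow[symbol] |= new
                        changed = True
    return follow


for nonterminal, symbols in follow_sets(GRAMMAR, "E").items():
    print(f"Follow({nonterminal}) =", sorted(symbols))
```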
Constructing LL(1) Parsing Tables Construct a parsing table T for CFG, G For each production A → α in G do: For each terminal t ∈ First(α) do T[A, t] = α If ε ∈ First(α), for each t ∈ Follow(A) do T[A, t] = α If ε ∈ First(α) and $ ∈ Follow(A) do T[A, $] = α 48
Constructing LL(1) Parsing Tables Grammar E → T X X → + E | ε T → ( E ) | int Y Y → * T | ε
First Sets First( T ) = {int, ( } First( E ) = {int, ( } First( X ) = {+, ε } First( Y ) = {*, ε }
Follow Sets Follow( X ) = {), $} Follow( E ) = {), $} Follow( T ) = {+, ), $} Follow( Y ) = {+, ), $}
      int      *      +      (        )    $
E     T X                    T X
X                     + E             ε    ε
T     int Y                  ( E )
Y              * T    ε               ε    ε
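The table construction can be sketched directly from the First and Follow sets listed for this grammar. The sets are hard-coded as given rather than computed, and the function build_table and its conflicts list are hypothetical names; a non-empty conflicts list would mean some entry is multiply defined, so the grammar is not LL(1).

```python
EPS = "ε"
GRAMMAR = {
    "E": [("T", "X")],
    "X": [("+", "E"), ()],
    "T": [("(", "E", ")"), ("int", "Y")],
    "Y": [("*", "T"), ()],
}
# Assumption: First and Follow sets copied from the example, not computed here.
FIRST = {"E": {"int", "("}, "X": {"+", EPS}, "T": {"int", "("}, "Y": {"*", EPS}}
FOLLOW = {"E": {")", "$"}, "X": {")", "$"}, "T": {"+", ")", "$"}, "Y": {"+", ")", "$"}}


def first_of_string(symbols):
    """First(α) for a production body α; contains ε if α can derive ε."""
    result = set()
    for symbol in symbols:
        symbol_first = FIRST.get(symbol, {symbol})  # First(t) = {t} for a terminal
        result |= symbol_first - {EPS}
        if EPS not in symbol_first:
            return result
    result.add(EPS)
    return result


def build_table(grammar):
    table, conflicts = {}, []
    for head, bodies in grammar.items():
        for body in bodies:
            first = first_of_string(body)
            lookaheads = first - {EPS}        # each terminal t ∈ First(α)
            if EPS in first:
                lookaheads |= FOLLOW[head]    # each t ∈ Follow(A), including $
            for token in lookaheads:
                if (head, token) in table:
                    conflicts.append((head, token))  # multiply defined entry
                table[(head, token)] = body
    return table, conflicts


table, conflicts = build_table(GRAMMAR)
for (head, token), body in sorted(table.items()):
    print(f"T[{head}, {token}] =", " ".join(body) or "ε")
print("conflicts:", conflicts)
```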
Predictive Parsing for Dangling Else Grammar Dangling else grammar S → i E t S | i E t S e S | a E → b Left factoring S → i E t S S’ | a S’ → e S | ε E → b 51
Predictive Parsing for Dangling Else Grammar S → i E t S S’ | a S’ → e S | ε E → b
      a        b        e                   i                 t    $
S     S → a                                 S → i E t S S’
S’                      S’ → e S, S’ → ε                           S’ → ε
E              E → b
First(S) = {i, a} First(E) = {b} First(S’) = {e, ε } Follow(S) = {e, $} Follow(S’) = {e, $} Follow(E) = {t}
The entry [S’, e] is multiply defined, so this grammar is not LL(1)
Error Handling in Syntax Analysis Goals Report the presence of errors clearly and accurately Recover from each error quickly To detect subsequent errors Add minimal overhead to the processing of correct programs 53
Error Recovery Strategies Panic-Mode Recovery Discards input symbols one at a time until a synchronizing token is found Synchronizing tokens are taken from the Follow set, keywords, etc. Phrase-Level Recovery Performs local correction on the remaining input Replace a comma by a semicolon, delete an extraneous semicolon For the empty cells of the parsing table, implement error-correcting routines 54
Error Recovery Strategies Error Productions Augment the grammar for erroneous inputs Global Correction Make as few changes as possible in processing an incorrect input string Read section
Error Recovery
        id          +             *              (           )          $
E       E → TE'                                  E → TE'     synch      synch
E'                  E' → +TE'                                E' → ε     E' → ε
T       T → FT'     synch                        T → FT'     synch      synch
T'                  T' → ε        T' → *FT'                  T' → ε     T' → ε
F       F → id      synch         synch          F → (E)     synch      synch
If table entry [A, a] is empty, input a is skipped If the entry is synch, the nonterminal on top of the stack is popped If a terminal on top of the stack does not match the input, the stack top is popped
Error Recovery: Panic Mode
Stack           Input             Remark
E $             ) id * + id $     error, skip )
E $             id * + id $       id is in FIRST(E)
TE' $           id * + id $
FT'E' $         id * + id $
id T'E' $       id * + id $
T'E' $          * + id $
* FT'E' $       * + id $
FT'E' $         + id $            error, M[F, +] = synch
T'E' $          + id $            F has been popped
E' $            + id $
+ TE' $         + id $
TE' $           id $
FT'E' $         id $
id T'E' $       id $
T'E' $          $
E' $            $
$               $
Bottom-up Parsing Bottom-up parsing is more general than top-down parsing Efficient, although difficult to do by hand Builds on ideas from top-down parsing Bottom-up is the preferred method in practice Reading: Section
Bottom-up Parsing Bottom-up parsers don’t need left factored grammars Hence we can revert to the “natural” grammar for our example: E → T + E | T T → int * T | int | (E) Consider the string: int * int + int 59
Bottom-up Parsing Bottom-up parsing reduces a string to the start symbol by inverting productions:
int * int + int     T → int
int * T + int       T → int * T
T + int             T → int
T + T               E → T
T + E               E → T + E
E
Observation Read the productions from the bottom-up parse in reverse (i.e., from bottom to top) This is a rightmost derivation!
int * int + int     T → int
int * T + int       T → int * T
T + int             T → int
T + T               E → T
T + E               E → T + E
E
Trivial Bottom-Up Parsing Algorithm
Let I = input string
repeat
  pick a non-empty substring β of I where X → β is a production
  if no such β, backtrack
  replace one β by X in I
until I = “S” (the start symbol) or all possibilities are exhausted
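The trivial algorithm can be sketched as an exhaustive search over reductions. This is an illustration under stated assumptions: the function name reduces_to_start is hypothetical, the grammar has no ε productions (so sentential forms never grow), and a set of already-explored forms is added so the same dead end is not searched twice.

```python
GRAMMAR = {
    "E": [("T", "+", "E"), ("T",)],
    "T": [("int", "*", "T"), ("int",), ("(", "E", ")")],
}


def reduces_to_start(form, start="E", seen=None):
    """Try every reduction of every substring, backtracking on dead ends."""
    seen = set() if seen is None else seen
    if form == (start,):
        return True
    if form in seen:
        return False  # this sentential form was already explored
    seen.add(form)
    for head, bodies in GRAMMAR.items():
        for body in bodies:
            n = len(body)
            for i in range(len(form) - n + 1):
                if form[i:i + n] == body:
                    # Replace this occurrence of β by X and continue the search.
                    reduced = form[:i] + (head,) + form[i + n:]
                    if reduces_to_start(reduced, start, seen):
                        return True
    return False  # all possibilities are exhausted


print(reduces_to_start(("int", "*", "int", "+", "int")))
print(reduces_to_start(("int", "+")))
```

The search is exponential in the worst case, which is why practical bottom-up parsers decide where to reduce without backtracking.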