Parsing Discrete Mathematics and Its Applications Baojian Hua
Derivations
A string is valid in a language if and only if there exists a derivation from the start symbol that produces it.
Begin with the start symbol and apply grammar rules until you produce the string.
Note that the final string (sentence) consists only of terminals.
Question
Given a formal grammar G and a sentence (program) p, is p derivable from G?
Or equivalently, is a given program p valid according to some language's syntax (say C)?
Example: Context-Free Grammar
S ::= x A | y B
A ::= u C | v C
B ::= t
C ::= w | z
// derivable? xum, xuwz, xwu, xuz
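One way to answer the "derivable?" questions by brute force is to run leftmost derivations from S and check whether any of them produces the candidate string. The sketch below assumes single-character symbols with uppercase letters as nonterminals; the names `Rule`, `derive`, and `derivable` are illustrative and not from the slides. It terminates only because this particular grammar is not recursive.

```c
#include <assert.h>
#include <string.h>

/* One production lhs ::= rhs. Uppercase letters are nonterminals. */
typedef struct { char lhs; const char *rhs; } Rule;

static const Rule rules[] = {
    {'S', "xA"}, {'S', "yB"},
    {'A', "uC"}, {'A', "vC"},
    {'B', "t"},
    {'C', "w"},  {'C', "z"},
};
enum { NRULES = sizeof rules / sizeof rules[0] };

/* Returns 1 if the sentential form `form` can derive `target`. */
static int derive(const char *form, const char *target) {
    /* Locate the leftmost nonterminal, if any. */
    size_t i = 0;
    while (form[i] != '\0' && !(form[i] >= 'A' && form[i] <= 'Z')) i++;

    /* The terminals before it can never change, so they must match. */
    if (strncmp(form, target, i) != 0) return 0;

    /* Only terminals left: the derivation is finished. */
    if (form[i] == '\0') return target[i] == '\0';

    /* Try every rule for the leftmost nonterminal. */
    char next[64];
    for (int r = 0; r < NRULES; r++) {
        if (rules[r].lhs != form[i]) continue;
        size_t need = i + strlen(rules[r].rhs) + strlen(form + i + 1) + 1;
        assert(need <= sizeof next);
        memcpy(next, form, i);
        strcpy(next + i, rules[r].rhs);
        strcat(next, form + i + 1);
        if (derive(next, target)) return 1;
    }
    return 0;
}

/* Is `target` derivable from the start symbol S? */
int derivable(const char *target) {
    assert(target != NULL);
    return derive("S", target);
}
```

Under these assumptions only `xuz` of the four candidates is derivable: `m` is not a terminal, `xuwz` has one symbol too many, and `xwu` has `w` where A requires `u` or `v`.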
Lexical Analyzer
The lexical analyzer translates the source program into a stream of lexical tokens.
Source program: a stream of (ASCII or Unicode) characters.
Lexical token: a compiler data structure that represents the occurrence of a terminal symbol.
A valid sentence consists only of allowable terminals.
Example: Context-Free Grammar
// all terminals T = {x, y, u, v, t, w, z}
// allowable strings T*
S ::= x A | y B
A ::= u C | v C
B ::= t
C ::= w | z
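To make the token idea concrete, here is a minimal lexer sketch for this grammar. It assumes each terminal is a single character; `TokenKind`, `Token`, `kindOf`, `lex`, and `inTStar` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One kind per terminal in T, plus end-of-input and an error kind. */
typedef enum {
    TOK_X, TOK_Y, TOK_U, TOK_V, TOK_T, TOK_W, TOK_Z,
    TOK_EOF, TOK_ERROR
} TokenKind;

/* A token records which terminal occurred and where. */
typedef struct { TokenKind kind; size_t pos; } Token;

/* Map one input character to its token kind. */
static TokenKind kindOf(char c) {
    switch (c) {
    case 'x':  return TOK_X;
    case 'y':  return TOK_Y;
    case 'u':  return TOK_U;
    case 'v':  return TOK_V;
    case 't':  return TOK_T;
    case 'w':  return TOK_W;
    case 'z':  return TOK_Z;
    case '\0': return TOK_EOF;
    default:   return TOK_ERROR;
    }
}

/* Translate src into tokens, stopping after TOK_EOF or TOK_ERROR
 * (or when `out` is full). Returns the number of tokens written. */
size_t lex(const char *src, Token *out, size_t cap) {
    assert(src != NULL && out != NULL && cap > 0);
    size_t n = 0;
    for (size_t i = 0; n < cap; i++) {
        TokenKind k = kindOf(src[i]);
        out[n].kind = k;
        out[n].pos = i;
        n++;
        if (k == TOK_EOF || k == TOK_ERROR) break;
    }
    return n;
}

/* A string is in T* exactly when every character is a terminal. */
int inTStar(const char *src) {
    assert(src != NULL);
    for (; *src != '\0'; src++)
        if (kindOf(*src) == TOK_ERROR) return 0;
    return 1;
}
```

Note that `xuwz` is in T* and so passes the lexer even though the grammar does not derive it; rejecting it is the parser's job.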
Predictive Parsing
Parsing: recognizing a string and doing something useful with it.
The most naïve approach to implementing a parser is recursive descent.
It is a form of top-down parsing.
Not as powerful as other methods, but easy enough to implement by hand.
Predictive Parsing
S ::= x A | y B
A ::= u C | v C
B ::= t
C ::= w | z
// Valid? xum, xuwz, xwu, xuz
A Predictive Parser in C (Sketch)
tokenTy token;   /* current lookahead token */

void parseS ()
{
  switch (token.kind) {
  case x:
    token = nextToken ();
    parseA ();
    break;
  case y:
    token = nextToken ();
    parseB ();
    break;
  default:
    error (…);
  }
}
// other functions are similar
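The sketch above can be filled in as a complete recognizer. What follows is one possible completion, not the course's reference code: it assumes single-character tokens read directly from a string, and a `failed` flag stands in for the elided error call. The names `input`, `pos`, `failed`, `peek`, `advance`, and `parse` are assumptions.

```c
#include <assert.h>
#include <stddef.h>

static const char *input;  /* token stream: one character per token */
static size_t pos;         /* index of the current lookahead token */
static int failed;         /* set on the first syntax error */

static char peek(void)    { return input[pos]; }
static void advance(void) { if (input[pos] != '\0') pos++; }
static void fail(void)    { failed = 1; }

/* C ::= w | z */
static void parseC(void) {
    switch (peek()) {
    case 'w': case 'z': advance(); break;
    default: fail();
    }
}

/* A ::= u C | v C */
static void parseA(void) {
    switch (peek()) {
    case 'u': case 'v': advance(); parseC(); break;
    default: fail();
    }
}

/* B ::= t */
static void parseB(void) {
    if (peek() == 't') advance(); else fail();
}

/* S ::= x A | y B */
static void parseS(void) {
    switch (peek()) {
    case 'x': advance(); parseA(); break;
    case 'y': advance(); parseB(); break;
    default: fail();
    }
}

/* Returns 1 if s is a sentence of the grammar, 0 otherwise. */
int parse(const char *s) {
    assert(s != NULL);
    input = s;
    pos = 0;
    failed = 0;
    parseS();
    /* Success also requires that all input was consumed. */
    return !failed && peek() == '\0';
}
```

Each function looks only at the current token to pick an alternative, which is what makes the parser predictive. The final end-of-input check is what rejects `xuwz`: `parseS` happily consumes `xuw` and leaves `z` behind.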
Output: Abstract Syntax Tree
Input: xuz
S
  x
  A
    u
    C
      z
A Predictive Parser Emitting AST in C (Sketch)
tokenTy token;   /* current lookahead token */

S parseS ()
{
  A a;   /* AST node types A and B are assumed to exist alongside S */
  B b;

  switch (token.kind) {
  case x:
    token = nextToken ();
    a = parseA ();
    return newS1 (x, a);
  case y:
    token = nextToken ();
    b = parseB ();
    return newS2 (y, b);
  default:
    error (…);
  }
}
// other functions are similar
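A runnable variant of the tree-building idea is sketched below. It departs from the slide in one respect, which is an assumption made for brevity: instead of one constructor per production (`newS1`, `newS2`), it uses a single generic `Node` type labelled with the grammar symbol. `Node`, `newNode`, `freeTree`, `leaf`, and `parse` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Generic tree node: `label` is a nonterminal or a terminal character. */
typedef struct Node {
    char label;
    struct Node *kid[2];   /* unused children are NULL */
} Node;

static Node *newNode(char label, Node *k0, Node *k1) {
    Node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->label = label;
    n->kid[0] = k0;
    n->kid[1] = k1;
    return n;
}

void freeTree(Node *n) {
    if (n == NULL) return;
    freeTree(n->kid[0]);
    freeTree(n->kid[1]);
    free(n);
}

static const char *cur;   /* current lookahead character */

/* Consume the current terminal and wrap it in a leaf node. */
static Node *leaf(void) { return newNode(*cur++, NULL, NULL); }

/* C ::= w | z */
static Node *parseC(void) {
    if (*cur == 'w' || *cur == 'z') return newNode('C', leaf(), NULL);
    return NULL;
}

/* A ::= u C | v C */
static Node *parseA(void) {
    if (*cur != 'u' && *cur != 'v') return NULL;
    Node *t = leaf();
    Node *c = parseC();
    if (c == NULL) { freeTree(t); return NULL; }
    return newNode('A', t, c);
}

/* B ::= t */
static Node *parseB(void) {
    if (*cur == 't') return newNode('B', leaf(), NULL);
    return NULL;
}

/* S ::= x A | y B */
static Node *parseS(void) {
    if (*cur != 'x' && *cur != 'y') return NULL;
    int isX = (*cur == 'x');
    Node *t = leaf();
    Node *sub = isX ? parseA() : parseB();
    if (sub == NULL) { freeTree(t); return NULL; }
    return newNode('S', t, sub);
}

/* Returns the tree for s, or NULL if s is not a sentence. */
Node *parse(const char *s) {
    assert(s != NULL);
    cur = s;
    Node *tree = parseS();
    if (tree != NULL && *cur != '\0') { freeTree(tree); tree = NULL; }
    return tree;
}
```

For the input `xuz` this builds S with children x and A, A with children u and C, and C with the single child z, matching the tree on the previous slide. The caller owns the result and releases it with `freeTree`.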
Predictive Parsing Difficulties
S ::= x A | x B
A ::= u C | v C
B ::= t
C ::= w | z
// derivable? xuz
// both alternatives of S start with x, so one token of lookahead cannot choose between them
Or Even Worse
1  E ::= id
2    | num
3    | E + E
4    | E * E
5    | ( E )

15*(3+4)
E
=> E * E          (by 4)
=> E * (E + E)    (by 5, then 3)
=> E * (E + 4)    (by 2)
=> E * (3 + 4)    (by 2)
=> 15 * (3 + 4)   (by 2)
Or Even Worse
15*(3+4)
rightmost derivation:
E => E * E => E * (E + E) => E * (E + 4) => E * (3 + 4) => 15 * (3 + 4)
leftmost derivation:
E => E * E => 15 * E => 15 * (E + E) => 15 * (3 + E) => 15 * (3 + 4)
Ambiguous grammars
A grammar is ambiguous if there is a sentence with more than one parse tree.
Two parse trees for 15 * 3 + 4:
Tree 1 (root E * E):
E
  E: 15
  *
  E
    E: 3
    +
    E: 4
Tree 2 (root E + E):
E
  E
    E: 15
    *
    E: 3
  +
  E: 4
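Ambiguity matters because different parse trees can mean different things. As an illustration, take the sentence 15 * 3 + 4 under the grammar with E ::= E + E | E * E: one tree has `*` at the root and the other has `+`. The sketch below builds both trees by hand and evaluates them; the `Expr` type, `eval`, and the tree names are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Expr {
    char op;                       /* '+', '*', or 'n' for a number leaf */
    int value;                     /* used only when op == 'n' */
    const struct Expr *lhs, *rhs;  /* used only for '+' and '*' */
} Expr;

/* Evaluate a tree bottom-up. */
int eval(const Expr *e) {
    assert(e != NULL);
    switch (e->op) {
    case 'n': return e->value;
    case '+': return eval(e->lhs) + eval(e->rhs);
    case '*': return eval(e->lhs) * eval(e->rhs);
    default:  assert(!"unknown operator"); return 0;
    }
}

/* Leaves shared by both trees. */
static const Expr n15 = {'n', 15, NULL, NULL};
static const Expr n3  = {'n', 3,  NULL, NULL};
static const Expr n4  = {'n', 4,  NULL, NULL};

/* Tree 1: root E * E, grouping the sentence as 15 * (3 + 4). */
static const Expr sum34 = {'+', 0, &n3, &n4};
const Expr treeMulRoot  = {'*', 0, &n15, &sum34};

/* Tree 2: root E + E, grouping the sentence as (15 * 3) + 4. */
static const Expr prod153 = {'*', 0, &n15, &n3};
const Expr treeAddRoot    = {'+', 0, &prod153, &n4};
```

Evaluating `treeMulRoot` gives 105 and evaluating `treeAddRoot` gives 49, so a compiler that picked either tree arbitrarily would assign two different meanings to the same program text.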
Eliminating ambiguity
In programming language syntax, ambiguity often arises from missing operator precedence or associativity.
Does * have higher precedence than +?
Are * and + left associative?
We can sometimes rewrite the grammar to disambiguate.
Beyond the scope of this course.
Unambiguous Grammar
Ambiguous:
E ::= id | num | E + E | E * E | ( E )
Unambiguous:
E ::= E + T | T
T ::= T * F | F
F ::= id | num | ( E )
Accepts the same language, but parses unambiguously.
Limitations of Predictive Parsing
The grammar must be rewritten to resolve ambiguity.
The resulting grammars and trees are ugly.
But … the code is easy to write by hand, and very good for error reporting.
Doing better
We can do better.
We can use a parsing algorithm that can handle all deterministic context-free languages (though not all context-free grammars).
Remember: a context-free language might have many different context-free grammars.
The Yacc Tool
[Diagram: semantic analyzer specification -> Yacc -> parser]
Originally developed for C, and now almost every mainstream language has its own Yacc-like tool: bison (C), ml-yacc (SML), Cup (Java), GPPG (C#), …
Whole Structure
[Diagram: source code -> lexical analyzer -> tokens -> parser -> abstract syntax tree -> other part -> Pentium]