Parsing G Programming Languages May 24, 2012 New York University Chanseok Oh
Chapter 2 Scanning Parsing
Overview
– Scanner (also called tokenizer, lexer, lexical analyzer)
    Input:  IF ( A >=.30 ) THEN { …
    Output: IF, LPAREN, IDENT(A), GTE, FPN(.30), RPAREN, THEN, …
    Tokens, lexemes
    DFA, NFA, regular expressions
    lex, flex, JLex
– Parser
    DPDA, deterministic context-free grammars
    Yacc, Bison
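The scanner step on this slide can be sketched with Python's standard `re` module. This is only an illustration: the token names IF, IDENT, GTE, FPN, and THEN follow the slide, the parenthesis tokens are spelled LPAREN and RPAREN here, and the LBRACE token, every regular expression, and the function name `tokenize` are assumptions that do not come from the slides.

```python
import re

# Token names mostly follow the slide; all patterns are illustrative assumptions.
TOKEN_SPEC = [
    ("IF",     r"IF\b"),
    ("THEN",   r"THEN\b"),
    ("FPN",    r"\d*\.\d+"),        # floating-point number such as .30
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("GTE",    r">="),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("LBRACE", r"\{"),              # hypothetical name; the slide elides this token
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Split text into (token, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("IF ( A >=.30 ) THEN {"))
```

Keywords are listed before IDENT so that they win when both patterns match. A generated scanner (lex, flex, JLex) would compile such patterns into a single DFA instead of trying alternatives in order.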
Table of Contents
– Practical parsers (linear time)
    LL (top-down, predictive)
    LR (bottom-up, shift-reduce)
– Related side topics
    Ambiguity, language and parser hierarchy
– Examples: Simple Calculator Language
A Language
– A set of strings (of given symbols)
    { finite, set, with, five, strings }
    { ab, aaba, abbaba, … }
    { 0^n 1^n }
    { a^i b^j | i < j }
    { void main() { int i = 0 }, … }
– Is an input string in the language?
    cf. Recursive, Turing-decidable languages
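For two of the example languages, the membership question "is an input string in the language?" can be answered by a few lines of code. The sketch below assumes n ≥ 0 and i ≥ 0, which the slide does not state, and the function names are made up for illustration.

```python
def in_0n1n(s):
    """True if s is 0^n 1^n for some n >= 0 (the bound n >= 0 is an assumption)."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "0" * n + "1" * n

def in_ai_bj(s):
    """True if s is a^i b^j with i < j (i >= 0 is an assumption)."""
    i = len(s) - len(s.lstrip("a"))   # number of leading a's
    rest = s[i:]
    return set(rest) <= {"b"} and i < len(rest)

print(in_0n1n("000111"), in_0n1n("0101"))
print(in_ai_bj("abb"), in_ai_bj("ab"))
```

Both languages are context-free but not regular, so no DFA can decide them; these checks count symbols, which is the role the stack plays in a PDA.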
Context-Free Languages (CFL)
– Languages that can be generated by CFGs
– Languages that can be recognized by PDAs
– Not all languages are context-free.
– CFG: suitable for most PLs.
    := PERIOD
– Deterministic CFL
Example
– Here is our CFG:
    S := id A
    A := , id A
    A := ;
– Input: sum, a1, ptr ;
Parse Tree
  S
  ├─ id (sum)
  └─ A
     ├─ ,
     ├─ id (a1)
     └─ A
        ├─ ,
        ├─ id (ptr)
        └─ A
           └─ ;
  Grammar:
    S := id A
    A := , id A
    A := ;
Ambiguous Grammars
– Is a given grammar ambiguous? Undecidable in general.
– No general procedure for converting to an unambiguous grammar
– Ambiguity can be allowed to some extent in deterministic parsing, e.g., by defining precedence or associativity.
    E := E + E
    E := E - E
    E := E * E
    E := E / E
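To see why the expression grammar above is ambiguous, one can enumerate every parse tree of a short input and evaluate each one. The sketch below is a brute-force illustration, not a parsing technique from the slides: it assumes the input is already a list alternating numbers and operators, adds a hypothetical `E -> number` base case, and leaves out `/` to avoid division by zero.

```python
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}

def all_parse_values(tokens):
    """Value of every parse tree of tokens under E -> E op E | number."""
    if len(tokens) == 1:
        return [tokens[0]]
    values = []
    for i in range(1, len(tokens), 2):        # any operator may be the root
        op = OPS[tokens[i]]
        for left in all_parse_values(tokens[:i]):
            for right in all_parse_values(tokens[i + 1:]):
                values.append(op(left, right))
    return values

print(all_parse_values([8, "-", 3, "-", 2]))   # one value per parse tree
```

For `8 - 3 - 2` there are two trees, one grouping as `8 - (3 - 2)` and one as `(8 - 3) - 2`. Declaring `-` left-associative selects the second tree; precedence declarations play the same role for inputs such as `1 + 2 * 3`.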
Parsers
– LL (Left-to-right, Left-most derivation)
    Top-down
    Predictive
    Simple and easy to understand
– LR (Left-to-right, Right-most derivation)
    Bottom-up
    Shift-reduce
    Most common in production-level parsers
    Variants: SLR (Simple), LALR (Look-ahead)
LL(k) Parser
– LL(k) parser
    Uses k look-ahead symbols
    Does not backtrack (deterministic)
– LL(1) is the most popular kind of LL parser.
– LL(k) languages
    Not all CFLs are LL(k) languages.
    [Diagram: CFL, LL(k)]
LL Parsing Example
– It is an LL grammar. The language is also LL.
    id_list := id id_list_tail
    id_list_tail := , id id_list_tail
    id_list_tail := ;
– Input to parse: sum, a1, ptr ;
    [Diagram: CFL, LL]
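Parse Tree
  id_list
  ├─ id (sum)
  └─ id_list_tail
     ├─ ,
     ├─ id (a1)
     └─ id_list_tail
        ├─ ,
        ├─ id (ptr)
        └─ id_list_tail
           └─ ;
  Grammar:
    id_list := id id_list_tail
    id_list_tail := , id id_list_tail
    id_list_tail := ;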
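A predictive (LL(1)) parse of this input can be sketched as a small recursive-descent parser, with one function per nonterminal. The nonterminal names `id_list` and `id_list_tail` are the ones the LR example uses for the same grammar. The token representation as `(kind, text)` pairs, the nested-tuple tree, and the helper names are assumptions made for this illustration.

```python
def parse_id_list(tokens):
    """Predictive parse of:  id_list      -> id id_list_tail
                             id_list_tail -> , id id_list_tail | ;
    tokens is a list of (kind, text) pairs with kinds "id", "," and ";".
    Returns the parse tree as nested tuples."""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else "$$"

    def match(kind):
        nonlocal pos
        if peek() != kind:
            raise SyntaxError(f"expected {kind!r}, found {peek()!r} at token {pos}")
        text = tokens[pos][1]
        pos += 1
        return text

    def id_list():
        return ("id_list", match("id"), id_list_tail())

    def id_list_tail():
        if peek() == ",":                  # one token of look-ahead picks the rule
            match(",")
            return ("id_list_tail", ",", match("id"), id_list_tail())
        match(";")
        return ("id_list_tail", ";")

    tree = id_list()
    if pos != len(tokens):
        raise SyntaxError("trailing input after id_list")
    return tree

tokens = [("id", "sum"), (",", ","), ("id", "a1"),
          (",", ","), ("id", "ptr"), (";", ";")]
print(parse_id_list(tokens))
```

The parser never backtracks: in `id_list_tail`, the single look-ahead token (`,` or `;`) determines which production to expand, which is what makes the grammar LL(1).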
LR Parser
– LR(k) parser
    Uses k look-ahead symbols.
    Usually k is 1, and the term "LR parser" often refers to this case.
– LR(k) languages
    Not all CFLs are LR(k) languages.
    [Diagram: CFL, LR]
Language Relationships
  [Diagram: Unambiguous languages, Ambiguous languages, LR(0), SLR, LALR, LR(1), LL(0), LL(1)]
LR Parsing Example
– With the same grammar. It is also an LR grammar, and the language is LR.
    id_list := id id_list_tail
    id_list_tail := , id id_list_tail
    id_list_tail := ;
– Input to parse (as before): sum, a1, ptr ;
    [Diagram: CFL, LR(1), LL]
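Parse Tree
  id_list
  ├─ id (sum)
  └─ id_list_tail
     ├─ ,
     ├─ id (a1)
     └─ id_list_tail
        ├─ ,
        ├─ id (ptr)
        └─ id_list_tail
           └─ ;
  Grammar:
    id_list := id id_list_tail
    id_list_tail := , id id_list_tail
    id_list_tail := ;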
Another LR Parsing Example
– Consider a modified grammar:
    id_list := id_list_prefix ;
    id_list_prefix := id_list_prefix , id
    id_list_prefix := id
– The grammar is not LL (though the language itself is both LR and LL).
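LR Parsing
– Input: sum , a1 , ptr ;
    id_list := id_list_prefix ;
    id_list_prefix := id_list_prefix , id
    id_list_prefix := id
  [Diagram: shift-reduce parse of sum , a1 , ptr ;]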
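The bottom-up parse of `sum , a1 , ptr ;` can be traced with a hand-written shift-reduce loop. This is a simplified sketch under stated assumptions: the nonterminal names `id_list` and `id_list_prefix` are assumed for the left-recursive grammar, and the loop reduces eagerly whenever a right-hand side sits on top of the stack. That shortcut happens to be correct for this tiny grammar, whereas a real LR parser consults a state table and the look-ahead token.

```python
def shift_reduce(tokens):
    """Trace a bottom-up parse of a list of token kinds; return the actions."""
    stack, trace = [], []
    for tok in tokens + ["$$"]:
        # Reduce while a right-hand side is on top of the stack.
        while True:
            if stack[-3:] == ["id_list_prefix", ",", "id"]:
                stack[-3:] = ["id_list_prefix"]
                trace.append("reduce id_list_prefix -> id_list_prefix , id")
            elif stack == ["id"]:
                stack[:] = ["id_list_prefix"]
                trace.append("reduce id_list_prefix -> id")
            elif stack == ["id_list_prefix", ";"]:
                stack[:] = ["id_list"]
                trace.append("reduce id_list -> id_list_prefix ;")
            else:
                break
        if tok != "$$":
            stack.append(tok)
            trace.append(f"shift {tok}")
    if stack != ["id_list"]:
        raise SyntaxError(f"cannot parse; stack is {stack}")
    return trace

for action in shift_reduce(["id", ",", "id", ",", "id", ";"]):
    print(action)
```

Because the grammar is left-recursive, each `, id` is reduced as soon as it has been shifted, so the stack never holds more than three symbols. With the right-recursive grammar, every token would have to be shifted before the first reduction could happen.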
Simple Calculator Language
    3 + ( 4 * 1 )
    total := 7
    read n
    write ( 10 - ( total + 1 ) / 3 * n )
Simple Arithmetic Expression
    E → E + E | E - E
    E → E * E | E / E
    E → id | number | ( E )
Simple Arithmetic Expression
– LL language, but not an LL grammar (yet an LR one)
    expr → term | expr add_op term
    term → factor | term mult_op factor
    factor → id | number | ( expr )
    add_op → + | -
    mult_op → * | /
– Two most common obstacles to “LL(1)-ness”
    Left recursion
    Common prefixes
      stmt → id := expr | id ( arg_list )
  stmt stmt stmt_list
Converting to LL Grammars
– The common prefix id in
    stmt → id := expr | id ( arg_list )
  is factored out:
    stmt → id stmt_list_tail
    stmt_list_tail → := expr | ( arg_list )
– stmt_list → stmt stmt_list | є
– Alternatively, you can employ conflict-resolution rules.
  stmt stmt stmt_list
Converted LL(1) Grammar
    expr → term term_tail
    term_tail → add_op term term_tail | є
    term → factor factor_tail
    factor_tail → mult_op factor factor_tail | є
    factor → ( expr ) | id | number
    add_op → + | -
    mult_op → * | /
– Not every CFG can be converted to an LL grammar. Why?
  [Diagram: CFL, LL]
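The removal of immediate left recursion behind this conversion can be sketched as a small grammar transformation. The function below is an illustration of the standard textbook rewrite, not code from the slides. It names the new nonterminal by appending `_tail` to the original name, so it produces `expr_tail` where the slide uses `term_tail`; an empty list stands for є.

```python
def remove_left_recursion(nt, alternatives):
    """Rewrite  A -> A a1 | ... | b1 | ...   as
                A      -> b1 A_tail | ...
                A_tail -> a1 A_tail | ... | є
    alternatives is a list of right-hand sides, each a list of symbols."""
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nt]
    others = [alt for alt in alternatives if not alt or alt[0] != nt]
    if not recursive:
        return {nt: alternatives}          # nothing to do
    tail = nt + "_tail"                    # hypothetical naming scheme
    return {
        nt: [alt + [tail] for alt in others],
        tail: [alt + [tail] for alt in recursive] + [[]],   # [] stands for є
    }

print(remove_left_recursion("expr", [["term"], ["expr", "add_op", "term"]]))
print(remove_left_recursion("term", [["factor"], ["term", "mult_op", "factor"]]))
```

The rewrite preserves the language but changes the shape of the parse tree: operands that grouped to the left in the original grammar now hang off a chain of tail nonterminals.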
LL(1) for Simple Calculator Language
    program → stmt_list $$
    stmt_list → stmt stmt_list | є
    stmt → id := expr | read id | write expr
    expr → term term_tail
    term_tail → add_op term term_tail | є
    term → factor factor_tail
    factor_tail → mult_op factor factor_tail | є
    factor → ( expr ) | id | number
    add_op → + | -
    mult_op → * | /
– Added three more production rules to the previous LL(1) grammar for expressions.
LL Parsing
– Input program
    read A
    read B
    sum := A + B
    write sum
    write sum / 2
Predict Sets
    program     → stmt_list $$                    {id, read, write, $$}
    stmt_list   → stmt stmt_list                  {id, read, write}
                | є                               {$$}
    stmt        → id := expr                      {id}
                | read id                         {read}
                | write expr                      {write}
    expr        → term term_tail                  {(, id, number}
    term_tail   → add_op term term_tail           {+, -}
                | є                               {), id, read, write, $$}
    term        → factor factor_tail              {(, id, number}
    factor_tail → mult_op factor factor_tail      {*, /}
                | є                               {+, -, ), id, read, write, $$}
    factor      → ( expr )                        {(}
                | id                              {id}
                | number                          {number}
    add_op      → +                               {+}
                | -                               {-}
    mult_op     → *                               {*}
                | /                               {/}
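The predict sets in this table can be recomputed mechanically from FIRST and FOLLOW sets. The sketch below uses the usual fixed-point iteration; the grammar encoding (a list of `(lhs, rhs)` pairs, with an empty list for є) and all function names are assumptions made for illustration.

```python
GRAMMAR = [
    ("program",     ["stmt_list", "$$"]),
    ("stmt_list",   ["stmt", "stmt_list"]),
    ("stmt_list",   []),
    ("stmt",        ["id", ":=", "expr"]),
    ("stmt",        ["read", "id"]),
    ("stmt",        ["write", "expr"]),
    ("expr",        ["term", "term_tail"]),
    ("term_tail",   ["add_op", "term", "term_tail"]),
    ("term_tail",   []),
    ("term",        ["factor", "factor_tail"]),
    ("factor_tail", ["mult_op", "factor", "factor_tail"]),
    ("factor_tail", []),
    ("factor",      ["(", "expr", ")"]),
    ("factor",      ["id"]),
    ("factor",      ["number"]),
    ("add_op",      ["+"]),
    ("add_op",      ["-"]),
    ("mult_op",     ["*"]),
    ("mult_op",     ["/"]),
]

def predict_sets(grammar):
    """Return a list of (lhs, rhs, PREDICT set), one entry per production."""
    nonterminals = {lhs for lhs, _ in grammar}
    first = {nt: set() for nt in nonterminals}
    follow = {nt: set() for nt in nonterminals}
    nullable = set()

    def first_of(symbols):
        """FIRST of a symbol string, and whether the string can derive є."""
        result = set()
        for sym in symbols:
            if sym not in nonterminals:
                result.add(sym)
                return result, False
            result |= first[sym]
            if sym not in nullable:
                return result, False
        return result, True

    changed = True
    while changed:                         # iterate until nothing grows
        changed = False
        for lhs, rhs in grammar:
            f, eps = first_of(rhs)
            if not f <= first[lhs] or (eps and lhs not in nullable):
                first[lhs] |= f
                if eps:
                    nullable.add(lhs)
                changed = True
            for i, sym in enumerate(rhs):
                if sym in nonterminals:
                    f, eps = first_of(rhs[i + 1:])
                    new = f | (follow[lhs] if eps else set())
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True

    result = []
    for lhs, rhs in grammar:
        f, eps = first_of(rhs)
        # PREDICT = FIRST(rhs), plus FOLLOW(lhs) when rhs can derive є
        result.append((lhs, rhs, f | (follow[lhs] if eps else set())))
    return result

for lhs, rhs, predict in predict_sets(GRAMMAR):
    print(lhs, "->", " ".join(rhs) or "є", sorted(predict))
```

A grammar is LL(1) when the predict sets of the alternatives for each nonterminal are pairwise disjoint, which can be checked directly on the returned list.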
Predict Sets
– Notice the pairwise disjoint sets: {id}, {read}, {write}
– You are to expand stmt.
– Look ahead 1 token (LL(1)).
    stmt → id := expr      {id}
         | read id         {read}
         | write expr      {write}
LL(1)
    program → stmt_list $$
    stmt_list → stmt stmt_list | є
    stmt → id := expr | read id | write expr
    expr → term term_tail
    term_tail → add_op term term_tail | є
    term → factor factor_tail
    factor_tail → mult_op factor factor_tail | є
    factor → ( expr ) | id | number
    add_op → + | -
    mult_op → * | /
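A table-driven LL(1) parse of the input program shown earlier (`read A read B sum := A + B write sum write sum / 2`) can be sketched as follows. The parse table is built from the predict sets listed earlier in the slides. The encoding of the rules, the token kinds, and the function name `ll1_parse` are assumptions for this illustration, and the input is assumed to be already scanned into token kinds.

```python
# Each alternative is (right-hand side, predict set as a space-separated string).
RULES = {
    "program":     [(["stmt_list", "$$"], "id read write $$")],
    "stmt_list":   [(["stmt", "stmt_list"], "id read write"), ([], "$$")],
    "stmt":        [(["id", ":=", "expr"], "id"), (["read", "id"], "read"),
                    (["write", "expr"], "write")],
    "expr":        [(["term", "term_tail"], "( id number")],
    "term_tail":   [(["add_op", "term", "term_tail"], "+ -"),
                    ([], ") id read write $$")],
    "term":        [(["factor", "factor_tail"], "( id number")],
    "factor_tail": [(["mult_op", "factor", "factor_tail"], "* /"),
                    ([], "+ - ) id read write $$")],
    "factor":      [(["(", "expr", ")"], "("), (["id"], "id"),
                    (["number"], "number")],
    "add_op":      [(["+"], "+"), (["-"], "-")],
    "mult_op":     [(["*"], "*"), (["/"], "/")],
}
# Parse table: (nonterminal, look-ahead token) -> right-hand side to expand.
TABLE = {(nt, tok): rhs
         for nt, alts in RULES.items()
         for rhs, predict in alts
         for tok in predict.split()}

def ll1_parse(tokens):
    """Parse a list of token kinds ending in "$$"; return productions applied."""
    stack, pos, applied = ["program"], 0, []
    while stack:
        top = stack.pop()
        look = tokens[pos] if pos < len(tokens) else None
        if top in RULES:                       # nonterminal: predict and expand
            rhs = TABLE.get((top, look))
            if rhs is None:
                raise SyntaxError(f"no rule for {top} on {look!r}")
            applied.append((top, rhs))
            stack.extend(reversed(rhs))
        elif top == look:                      # terminal: match and advance
            pos += 1
        else:
            raise SyntaxError(f"expected {top!r}, found {look!r}")
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return applied

program = ["read", "id", "read", "id", "id", ":=", "id", "+", "id",
           "write", "id", "write", "id", "/", "number", "$$"]
for lhs, rhs in ll1_parse(program):
    print(lhs, "->", " ".join(rhs) or "є")
```

The explicit stack holds what the parser still expects to see. Each step either expands the nonterminal on top using one token of look-ahead or matches a terminal against the input, so the parse runs in linear time without backtracking.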
Better grammar: LR(1)
– More intuitive than LL
    However, not exactly the same language (no empty string)
– Left recursion is advantageous.
    program → stmt_list $$
    stmt_list → stmt_list stmt | stmt
    stmt → id := expr | read id | write expr
    expr → term | expr add_op term
    term → factor | term mult_op factor
    factor → id | number | ( expr )
    add_op → + | -
    mult_op → * | /
LR Parsing
– With the same input program,
    read A
    read B
    sum := A + B
    write sum
    write sum / 2
State Transition Diagram
– State 0 (initial state)
    program → ● stmt_list $$
    stmt_list → ● stmt_list stmt
    stmt_list → ● stmt
    stmt → ● id := expr
    stmt → ● read id
    stmt → ● write expr
– State 1 (from State 0 on read)
    stmt → read ● id
– State 1’ (from State 1 on id)
    stmt → read id ●
    Reduce (shifting stmt from the viewpoint of State 0)
– State 0’ (from State 0 on stmt)
    stmt_list → stmt ●
    Reduce (shifting stmt_list)
– State 2 (from State 0 on stmt_list)
    program → stmt_list ● $$
    stmt_list → stmt_list ● stmt
    stmt → ● id := expr
    stmt → ● read id
    stmt → ● write expr
Shift/Reduce Conflicts
    expr → ● term
    factor → id ●
    …
Reduce/Reduce Conflicts
    expr → id ●
    factor → id ●
Resolving Conflicts
– LR(0)
    Any LR language has an LR(0) grammar (with $$).
    Not practical: prohibitively large and unintuitive
– SLR
    SLR grammar: no shift/reduce or reduce/reduce conflicts when using FOLLOW sets
    FOLLOW sets: also used in LL to generate PREDICT sets
– LALR(1)
    LALR(1) grammar (may not be SLR)
    Same states as SLR
    Improvement over SLR with local look-ahead
    LALR parsers are the most common parsers in practice.
– LR(1)
    LR(1) grammars (may not be LALR(1) or SLR)