Session 14 (DM62) / 15 (DM63) Recursive Descent Parsing
2 Learning Objectives
- Be able to explain the differences between regular and context-free languages (recursive rules).
- Be able to understand context-free grammars described in, for example, BNF.
- Be able to explain how context-free languages can be parsed using recursive descent (syntax trees).
- Be able to build a recursive-descent parser from a simple context-free grammar (BNF).
3 The Translation Process
- A compiler consists of a number of logical layers and components.
4 Parsing
- Parsing (syntax analysis) is the task of determining whether a program is syntactically correct or not.
- In doing this, the parser determines the syntactic structure of the program, usually in the form of a parse tree or syntax tree.
- This structure guides the rest of the translation process.
- The syntax is defined by the grammar rules of a context-free grammar.
- Grammar rules are defined in a manner similar to regular expressions. The major difference is that grammar rules are recursive. There is no * operation; repetition is expressed through recursion instead.
- There are two general categories of parsing algorithm:
  - Top-down parsing
  - Bottom-up parsing
5 Context-free Grammars
- A context-free grammar is a specification of the syntactic structure of a programming language.
- As a running example, we will use simple integer arithmetic expressions:
    exp -> exp op exp | ( exp ) | number
    op -> + | - | *
  where number is a token defined by a regular expression.
- The vertical bar | means choice.
- Concatenation is also used as a standard operation.
- Note the recursive nature of the definition of exp.
- Note also that the rules use tokens (defined by regular expressions) as symbols. That is, the rules are defined over an alphabet that contains tokens.
- We also need a symbol ε for the empty string of tokens.
6 Programming Language
- Context-free grammar rules determine a programming language: the set of legal strings of tokens.
- For example, (34-3)*42 corresponds to the legal string of seven tokens defined by exp:
    ( number - number ) * number
- On the other hand, the string (34-3*42 corresponds to the illegal string of six tokens:
    ( number - number * number
- Grammar rules are sometimes called productions because they "produce" the strings of the language.
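The token counts above can be checked with a small scanner. The following is a minimal sketch, not code from the slides: the class name TokenSketch and the method tokenize are hypothetical, and the token names simply follow the running grammar (number, parentheses, +, -, *).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the slides): maps an expression string to
// the token names used by the running grammar.
class TokenSketch {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // blanks separate tokens but are not tokens themselves
            } else if (Character.isDigit(c)) {
                // a maximal run of digits forms a single number token
                while (i < input.length() && Character.isDigit(input.charAt(i))) {
                    i++;
                }
                tokens.add("number");
            } else if ("()+-*".indexOf(c) >= 0) {
                tokens.add(String.valueOf(c));
                i++;
            } else {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("(34-3)*42"));       // seven tokens
        System.out.println(tokenize("(34-3*42").size()); // six tokens
    }
}
```

Note that the scanner accepts both strings. Deciding that the six-token string is illegal is the job of the parser, which checks the token string against the grammar rules.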
7 Backus-Naur Form (BNF)
- Grammar rules written in the notation used so far are said to be in Backus-Naur form (BNF).
- A BNF for Pascal will begin with grammar rules such as:
    program -> program_heading ; program_block.
    program_heading -> program...
    program_block -> statements …
    statements -> statements; statement | statement
    statement -> if_statement | assign_statement |..
    assign_statement -> identifier := exp;
- program is called the start symbol.
- program, program_heading, program_block, statements, statement and assign_statement are called nonterminals.
- The tokens program, identifier and := are examples of terminals.
8 Derivation
- A derivation is a sequence of replacements of structure names (nonterminals) by choices on the right-hand sides of grammar rules.
- As an example, we look at a derivation for the arithmetic expression (34 - 3) * 42.
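One possible derivation of the token string ( number - number ) * number is sketched below, using the running grammar exp -> exp op exp | ( exp ) | number and op -> + | - | *. Replacing the leftmost nonterminal at each step is a choice made for this sketch; other derivations of the same string exist.

```latex
\begin{align*}
\mathit{exp} &\Rightarrow \mathit{exp}\ \mathit{op}\ \mathit{exp} \\
  &\Rightarrow (\ \mathit{exp}\ )\ \mathit{op}\ \mathit{exp} \\
  &\Rightarrow (\ \mathit{exp}\ \mathit{op}\ \mathit{exp}\ )\ \mathit{op}\ \mathit{exp} \\
  &\Rightarrow (\ \mathbf{number}\ \mathit{op}\ \mathit{exp}\ )\ \mathit{op}\ \mathit{exp} \\
  &\Rightarrow (\ \mathbf{number}\ -\ \mathit{exp}\ )\ \mathit{op}\ \mathit{exp} \\
  &\Rightarrow (\ \mathbf{number}\ -\ \mathbf{number}\ )\ \mathit{op}\ \mathit{exp} \\
  &\Rightarrow (\ \mathbf{number}\ -\ \mathbf{number}\ )\ *\ \mathit{exp} \\
  &\Rightarrow (\ \mathbf{number}\ -\ \mathbf{number}\ )\ *\ \mathbf{number}
\end{align*}
```

Each step applies exactly one grammar rule, and the derivation ends when only terminals (tokens) remain.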
9 Parse Tree
- A parse tree corresponding to a derivation is a labeled tree in which:
  - the interior nodes are labeled by non-terminals,
  - the leaf nodes are labeled by terminals,
  - and the children of each interior node represent the replacement of the associated non-terminal in one step of the derivation.
10 Abstract Syntax Tree
- A parse tree contains more information than is absolutely necessary for a compiler to produce object code.
- Abstract syntax trees can be thought of as a tree representation of a shorthand notation called abstract syntax.
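As an illustration, an abstract syntax tree for (34-3)*42 needs only operator and number nodes; the parentheses and the nonterminals exp and op disappear. The following is a minimal sketch with hypothetical class names (AstNode, NumberNode, OpNode, AstSketch) that are not taken from the slides.

```java
// Hypothetical AST node types (not from the slides).
abstract class AstNode {
    abstract int eval();
}

// Leaf node: an integer literal.
class NumberNode extends AstNode {
    final int value;

    NumberNode(int value) { this.value = value; }

    @Override
    int eval() { return value; }

    @Override
    public String toString() { return Integer.toString(value); }
}

// Interior node: a binary operator with two subtrees.
class OpNode extends AstNode {
    final char op;
    final AstNode left;
    final AstNode right;

    OpNode(char op, AstNode left, AstNode right) {
        this.op = op;
        this.left = left;
        this.right = right;
    }

    @Override
    int eval() {
        int l = left.eval();
        int r = right.eval();
        switch (op) {
            case '+': return l + r;
            case '-': return l - r;
            case '*': return l * r;
            default: throw new IllegalStateException("Unknown operator: " + op);
        }
    }

    @Override
    public String toString() { return "(" + left + " " + op + " " + right + ")"; }
}

class AstSketch {
    // The tree for (34-3)*42: '*' at the root, '-' as its left child.
    static AstNode example() {
        return new OpNode('*',
                new OpNode('-', new NumberNode(34), new NumberNode(3)),
                new NumberNode(42));
    }

    public static void main(String[] args) {
        System.out.println(example());        // ((34 - 3) * 42)
        System.out.println(example().eval()); // 1302
    }
}
```

The grouping that the parentheses expressed in the source text is now carried by the shape of the tree, which is why the tree needs no parenthesis nodes.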
11 Ambiguous Grammars
- Consider the simple integer arithmetic grammar
    exp -> exp op exp | ( exp ) | number
    op -> + | - | *
  and consider the string 34-3*42. This string has two different parse trees.
- Exercise: Draw two different parse trees for the expression 34-3*42.
12 Ambiguous Grammars
- Consider the simple integer arithmetic grammar
    exp -> exp op exp | ( exp ) | number
    op -> + | - | *
  and consider the string 34-3*42. This string has two different parse trees: one groups it as (34-3)*42, which evaluates to 1302, and the other as 34-(3*42), which evaluates to -92.
- A grammar that generates a string with two distinct parse trees is called an ambiguous grammar. Ambiguity is a serious problem: which tree is the correct one?
13 Ambiguous Grammars
- Two basic methods are used to deal with ambiguities:
  - Disambiguating rule: State a rule that specifies, in each ambiguous case, which of the parse trees is the correct one. This resolves the ambiguity without changing the grammar, but the syntax is then no longer specified by the BNF rules alone.
  - Changing the grammar: Change the grammar into a different grammar that generates the same strings but yields only the correct parse tree. This will often complicate the grammar.
14 Ambiguous Grammars
- To remove the ambiguity in the integer arithmetic grammar, we could simply state a disambiguating rule that establishes the relative precedences of the three operations +, - and *, and that subtraction is considered to be left associative.
- To remove the ambiguity without a disambiguating rule (preferable) we must:
  - group the operators into groups of equal precedence
  - make subtraction (or all operators) left associative
- Applying both steps gives a grammar of the following form:
    exp -> exp addop term | term
    addop -> + | -
    term -> term mulop factor | factor
    mulop -> *
    factor -> ( exp ) | number
- Exercise: Draw the syntax tree for 34-3*42 using this grammar. Is there more than one? Is the operator precedence correct?
15 Extended Backus-Naur Form
- Repetitive and optional constructs are common in programming languages, and thus in BNF grammar rules. Therefore the BNF notation is sometimes extended to include:
- Repetition
    BNF (left recursive): A -> A a | b
    EBNF: A -> b { a }
- Optional
    BNF:
      statement -> if-stmt | other
      if-stmt -> if ( exp ) statement | if ( exp ) statement else statement
      exp -> 0 | 1
    EBNF:
      statement -> if-stmt | other
      if-stmt -> if ( exp ) statement [ else statement ]
      exp -> 0 | 1
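One reason the EBNF form is convenient is that the braces translate directly into a loop, whereas the left-recursive BNF rule cannot be coded as a method that calls itself first. The following is a minimal sketch for the rule A -> b { a }, where the terminals a and b are taken to be the characters 'a' and 'b'; the class name RepetitionSketch and the method matchesA are hypothetical.

```java
// Hypothetical recognizer (not from the slides) for the EBNF rule A -> b { a }.
class RepetitionSketch {
    // Returns true if s is one 'b' followed by zero or more 'a' characters.
    static boolean matchesA(String s) {
        int pos = 0;
        // The rule starts with exactly one b.
        if (pos >= s.length() || s.charAt(pos) != 'b') {
            return false;
        }
        pos++;
        // { a } means "zero or more a": a while loop.
        while (pos < s.length() && s.charAt(pos) == 'a') {
            pos++;
        }
        // The whole input must have been consumed.
        return pos == s.length();
    }

    public static void main(String[] args) {
        System.out.println(matchesA("baaa")); // true
        System.out.println(matchesA("ab"));   // false
    }
}
```

An optional part in square brackets would translate into a single if statement in the same way.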
16 Syntax diagram
- Graphical representations of BNF or EBNF rules are called syntax diagrams. They consist of:
  - oval boxes, which indicate terminals
  - rectangles, which indicate non-terminals
  - arrowed lines, which represent sequencing and choices
- As an example, consider the grammar rule
    factor -> ( exp ) | number
17 Exercises
- Draw the syntax diagram for:
    if-statement -> if ( exp ) statement | if ( exp ) statement else statement
    exp -> true | false
- Write down the derivation and syntax tree for the following expression: 3-(4+5*6)
18 Context-Free Grammar for TINY
- Exercise: Draw syntax diagrams that define this part of the TINY grammar:
19 Top-Down Parsing
- A top-down parsing algorithm parses an input string of tokens by tracing out the steps in a leftmost derivation.
- Top-down parsers come in two forms:
  - Predictive parsers: attempt to predict the next construction in the input string using one or more lookahead tokens.
  - Backtracking parsers: try different possibilities for a parse of the input, backing up when a choice fails (slow).
- There are two kinds of top-down parsers:
  - Recursive-descent parsing (suitable for handwritten parsers)
  - LL(1) parsing (seldom used in practice)
20 Recursive-Descent
- The idea of recursive-descent parsing is simple:
  - We view the grammar rule for a non-terminal A as a definition of a method that will recognize an A.
  - The right-hand side of the grammar rule specifies the code structure:
    - A choice corresponds to alternatives (if statements or a case statement).
    - Non-terminals correspond to calls to other methods.
- Recursive-descent parsing is important in connection with XML. XML parsers of the DOM type use recursive descent.
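The mapping from grammar rules to methods can be sketched for integer arithmetic expressions. The grammar below is an assumption made for this sketch, written in EBNF so that repetition becomes a loop: exp -> term { (+ | -) term }, term -> factor { * factor }, factor -> ( exp ) | number. The class name ExprParser and its methods are hypothetical, the input is assumed to contain no whitespace, and the parser evaluates the expression directly instead of building a tree.

```java
// Hypothetical recursive-descent evaluator (not from the slides).
// One method per non-terminal of the assumed EBNF grammar.
class ExprParser {
    private final String input;
    private int pos = 0;

    private ExprParser(String input) { this.input = input; }

    // Parses and evaluates a complete expression; rejects trailing input.
    static int evaluate(String text) {
        ExprParser parser = new ExprParser(text);
        int value = parser.exp();
        if (parser.pos != parser.input.length()) {
            throw parser.error("unexpected input");
        }
        return value;
    }

    // One character of lookahead; '\0' marks the end of the input.
    private char peek() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }

    private IllegalArgumentException error(String message) {
        return new IllegalArgumentException(message + " at position " + pos);
    }

    // exp -> term { (+ | -) term }   (the loop makes + and - left associative)
    private int exp() {
        int value = term();
        while (peek() == '+' || peek() == '-') {
            char op = input.charAt(pos++);
            int right = term();
            value = (op == '+') ? value + right : value - right;
        }
        return value;
    }

    // term -> factor { * factor }   (binds tighter than + and -)
    private int term() {
        int value = factor();
        while (peek() == '*') {
            pos++;
            value *= factor();
        }
        return value;
    }

    // factor -> ( exp ) | number   (the choice becomes an if statement)
    private int factor() {
        if (peek() == '(') {
            pos++;
            int value = exp();
            if (peek() != ')') {
                throw error("expected ')'");
            }
            pos++;
            return value;
        }
        if (!Character.isDigit(peek())) {
            throw error("expected number or '('");
        }
        int start = pos;
        while (Character.isDigit(peek())) {
            pos++;
        }
        return Integer.parseInt(input.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(evaluate("34-3*42"));   // -92
        System.out.println(evaluate("(34-3)*42")); // 1302
    }
}
```

Because term is called from exp, multiplication is grouped before addition and subtraction, so 34-3*42 is parsed as 34-(3*42). The illegal string (34-3*42 is rejected in factor, where the closing parenthesis is expected.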
21 Recursive-Descent: small example
- Identifiers described in BNF (usually one would use regular expressions):
    ::= |
    ::= a|b|…|z
    ::= 0|1|…|9
C# code
22 Recursive-Descent: small example, now in Java
Java code
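A recognizer for identifiers might look like the following minimal sketch. It is not the code from the slide. It assumes that an identifier is a letter followed by any number of letters or digits, written here in EBNF as identifier -> letter { letter | digit }, with letter -> a|b|…|z and digit -> 0|1|…|9. The class name IdentifierSketch and all method names are hypothetical.

```java
// Hypothetical recursive-descent recognizer for identifiers (not from the slides).
// Assumed grammar: identifier -> letter { letter | digit }
class IdentifierSketch {
    private final String input;
    private int pos = 0;

    private IdentifierSketch(String input) { this.input = input; }

    // True if the whole string is an identifier under the assumed grammar.
    static boolean isIdentifier(String text) {
        IdentifierSketch parser = new IdentifierSketch(text);
        return parser.identifier() && parser.pos == parser.input.length();
    }

    // identifier -> letter { letter | digit }
    private boolean identifier() {
        if (!letter()) {
            return false;
        }
        while (letter() || digit()) {
            // each successful call consumes one character
        }
        return true;
    }

    // letter -> a|b|…|z   (consumes the character on success)
    private boolean letter() {
        if (pos < input.length() && input.charAt(pos) >= 'a' && input.charAt(pos) <= 'z') {
            pos++;
            return true;
        }
        return false;
    }

    // digit -> 0|1|…|9   (consumes the character on success)
    private boolean digit() {
        if (pos < input.length() && input.charAt(pos) >= '0' && input.charAt(pos) <= '9') {
            pos++;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("x1")); // true
        System.out.println(isIdentifier("1x")); // false
    }
}
```

Each non-terminal has its own method, and the choice between letter and digit becomes the condition of the loop.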
23 Exercises
- Write a recursive-descent parser for the grammar that defines integers:
    ::= 0|1|2|3|4|5|6|7|8|9
    ::= +|-
    ::= |
    ::= |
- Look at the Java code for the small English grammar. Rewrite the code in C#.
24 Exercises - Extra
- Modify the grammar for integers so that decimals are accepted:
    ::= 0|1|2|3|4|5|6|7|8|9
    ::= +|-
    ::= |
    ::= |
- Write a recursive-descent parser for the grammar that defines decimals:
    ::=
    ::= |
    ::= 0|1|2|3|4|5|6|7|8|9
    ::= +|-
    ::= .