1 CS 410 Mastery in Programming Chapter 5 LL(1) Parsing Herbert G. Mayer, PSU CS status 7/17/2011.

2 Syllabus: Goal; Grammars Formally, Intuitively; BNF, EBNF; Grammar G1; Suitable Grammar; Uses of Grammar to Parse; Recursive Descent; Recursive Descent Parser For s; Sample e; Sample s

3 Goal The rules of a programming language L specify how to generate strings of text that are in L; other strings are not part of L. The number of strings in L (i.e. the size of the set L) is generally unbounded for typical programming languages. One way of expressing language rules is through a grammar G. Our goal is to become familiar with suitable grammars. Suitable means that certain rules are not allowed, such as left-recursion, circular rules, and lambda-producing rules (with one exception). The class of grammars we use is context-free; thus the more powerful class of grammars with context-sensitive rules is excluded. A side goal is to learn a particular notation for writing grammars, but that notation is simply a convenience, a handy way of writing. We'll focus on Backus Naur Form (BNF), AKA Backus Normal Form, from the early days of Algol-60.

4 Grammars Formally A grammar G for language L, named G(L), is a quintuple of { terminals, nonterminals, metasymbols, start symbol, productions } defining all strings in L; each string in L is named a program.
• Terminal: A final token in the language L; e.g. "hello"
• Nonterminal: A grammar symbol used as shorthand for a string of other symbols; must be defined at least once on the left-hand side of a production; it is convenient to group multiple alternatives via the metasymbol |
• Metasymbol: Symbol of the grammar itself, defining action or meaning; it is not part of the language L defined by G; it is grammar shorthand
• Start Symbol: The nonterminal that starts the process of generating (defining) strings in L; it doesn't have to be the first nonterminal defined in G, but it is convenient to list it first
• Production: Rule that defines a nonterminal; consists of the nonterminal being defined on the left-hand side, followed by the "produces" metasymbol, plus some string of symbols on the right-hand side that is not circular

5 Grammars, Some Terminology The empty string is referred to as lambda; the literature also calls it epsilon. We'll use lambda as a convenience in grammar writing. As a grammar tool lambda is superfluous, except if the language allows the empty program. In all other cases, rules that produce lambda can be replaced by other rules that do not use lambda, at the expense of a more complex grammar. The right-hand side of a suitable production (AKA alternative) eventually starts with a terminal; there can be several such terminals if several alternatives exist. The set of all distinct terminals that can start a right-hand side is called the first set.

6 Grammars Intuitively A grammar G is a set of rules to produce programs; programs are strings of characters in a programming language L. Each rule has a name on the left-hand side, the nonterminal, which generates at least one sequence of other symbols; those can be terminals or nonterminals, listed on the right-hand side. A terminal is a symbol expressing a value directly, like 500. It can also be some fixed symbol, like + or ( or END. A terminal symbol cannot produce other strings. A nonterminal is a name that can be used on the right-hand side of a production. It is defined on the left-hand side of at least one production, by a sequence of nonterminals or terminals. When there are multiple rules (AKA productions) for a nonterminal, we call these alternatives. One of the nonterminals is the start symbol. That is where the generating process starts; it is often written as the first rule, but must be clearly identified somehow.

7 Grammars Example for grammar G0:
s : s ( s )
  |
Discussion of G0:
• The only nonterminal symbol used in grammar G0 is s. Hence s must also be the start symbol
• There are 2 metasymbols, or 3 if we are picky:
  Metasymbol : means "left side produces the string on the right"
  Metasymbol | means "another alternative for s"
  End of all rules means it is the end of G0
• Nothing else to the right of | means: "this alternative generates the empty string", i.e. nothing, or lambda
• The first of the two alternatives in G0 is left-recursive
• There are 2 terminal symbols, ( and )
• We can debate whether the empty string lambda is also a terminal symbol. I do not count the empty string, since an infinite sequence of the same terminal symbol (of nothings) would be the same as a single occurrence; that is not suitable for language grammars

8 BNF, EBNF While authoring the report on the language Algol60 in the late 1950s, John Backus developed a convenient shorthand, ably supported by ideas from Peter Naur: Backus Normal Form, AKA Backus Naur Form.
Typical metasymbols in the Algol60 report: ::= | <> []
• [ .. ] encloses an optional phrase; allowed once or not at all
• < .. > defines the nonterminal enclosed; allows disambiguation between, say, the nonterminal <start> and the terminal symbol start
• ::= is the "produces" symbol; we'll use a simpler one
• | starts another alternative for a production
The notation found wide acceptance and was extended to allow multiple options by using the { .. } metasymbols:
• { .. } states that the .. part is included 0 or more times
• { .. }+ states that the .. part is included 1 or more times
• [ .. ] states that the .. part is optional, i.e. included once or not at all
Hence it is called EBNF, for Extended BNF.
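To make the two EBNF metasymbols concrete, here is a minimal sketch in C of how they tend to map onto control flow in a hand-written parser: { .. } becomes a while loop and [ .. ] becomes an if statement. The rule list : digit { ',' digit } [ ';' ] and the function name parse_list are hypothetical and chosen only for illustration; they do not appear in the slides.

```c
#include <ctype.h>

/* Hypothetical rule, for illustration only:
 *   list : digit { ',' digit } [ ';' ]
 * Returns 1 if the whole string matches the rule, else 0. */
int parse_list( const char *text )
{
    const char *p = text;

    if ( !isdigit( (unsigned char)*p ) ) {
        return 0;                        /* first digit is mandatory */
    }
    p++;
    while ( *p == ',' ) {                /* { ',' digit }: 0 or more times */
        p++;
        if ( !isdigit( (unsigned char)*p ) ) {
            return 0;
        }
        p++;
    }
    if ( *p == ';' ) {                   /* [ ';' ]: once or not at all */
        p++;
    }
    return *p == '\0';                   /* nothing may follow */
}
```

Under these assumptions parse_list( "1,2,3;" ) accepts, while parse_list( "1,,2" ) and parse_list( "1;;" ) are rejected.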

9 Grammar G1
Metasymbol : means "produces"
Metasymbol | means "the right-hand side also produces ...", i.e. it offers another alternative
Nonterminals: e and n
Terminals: + - * / ^ ( ) 0 1 2 3 4 5 6 7 8 9
Start symbol: e
Grammar G1:
e : e + n    -- addition
  | e - n    -- subtraction
  | e * n    -- multiplication
  | e / n    -- division
  | e ^ n    -- exponentiation, lots of left-recursion
  | ( e )    -- grouping
  | n        -- nonterminal for 10 terminals
n : 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

10 Grammar G2 Rewrite G1 to be suitable for RD parsing; introduce the metasymbols { } for repetition 0 or more times; see G2:
expression : term { plus_op term }
plus_op    : + | -
term       : factor { mult_op factor }
mult_op    : * | /
factor     : primary { ^ primary }
primary    : ( expression ) | number
number     : 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Note that the position of the semantic action effectively defines associativity; this is important for ^, which is right-associative. The other operators are usually left-associative, except in APL. We won't cover semantics in CS 410.

11 Strings in G1 80+76*6-4+22*(3+2)(((7)))((9)+8)*(((5-4)/2)/0) Discussion of operator precedence: In regular arithmetic, * and / bind more strongly than + and -; this is AKA precedence. Yet G1 alone cannot express that: the expression 2+3*4 means 14 in arithmetic, NOT 20. However, the parser discussed does not account for precedence. Precedence can be encoded in the grammar too, but that is not covered here, since we do not include a discussion of semantics, i.e. code generation.

12 Suitable Grammar Definition: Parsing means "analyzing a string for grammatical correctness, according to the rules of language L". Definition: A program written in language L is a string of terminal symbols; these symbols are strung together according to the grammar rules of L. Such a program can be empty only if there is a way for the start symbol to generate lambda. We parse program strings in a top-down fashion. Top-down means: we start with the topmost nonterminal, AKA the start symbol, regenerating the terminals from the input stream one symbol (i.e. terminal) at a time. Other methods exist that are not covered here; one of them is named bottom-up. When we see several alternatives during the parse that may have created the program so far, we look ahead one source symbol to determine the correct next alternative. Thus was coined the shorthand LL(1): Left-to-right reading of the source symbols, Leftmost derivation (the leftmost nonterminal is expanded first), 1 symbol of look-ahead.

13 Suitable Grammar A grammar G is suitable for LL(1) parsing if it adheres to certain restrictions, aside from being meaningful:
• No lambda productions: Except for the start symbol, no other nonterminal is allowed to generate the empty string; the reason is that a parser can always succeed in finding an empty string, so there is no real information in finding lambda. You learn the details in the compiler course CS 321/322
• No left-recursive rules: In the presence of left-recursive rules, the resulting parser we write would cause infinite regress, i.e. self-recursive calls until stack overflow. Details in the compiler course
• No circular productions: There cannot be productions of the type a: a ... (without intermediate productions), or a: b ... together with b: a ... (with some intermediate productions)
• No context-sensitive rules: Two or more nonterminals do not occur on the left side of a production; a b: some sequence is not permitted
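The left-recursion restriction can be illustrated with a small sketch in C. It assumes a cut-down, hypothetical rule e : e + n | n (a fragment in the spirit of G1) and rewrites it as e : n { + n }, so that the parsing function loops instead of calling itself before consuming any input. The names src, n, e and accepts are illustrative choices, not taken from the slides.

```c
static const char *src;          /* next unread character of the input */

/* n : 0 | 1 | ... | 9 */
static int n( void )
{
    if ( *src >= '0' && *src <= '9' ) {
        src++;
        return 1;
    }
    return 0;
}

/* A literal translation of the left-recursive rule  e : e '+' n | n
 * would start with a call to e() before reading anything:
 *
 *     static int e( void ) { e(); ... }     // infinite regress
 *
 * The rewritten rule  e : n { '+' n }  consumes a symbol first
 * and then loops over the repeated part. */
static int e( void )
{
    if ( !n() ) {
        return 0;
    }
    while ( *src == '+' ) {
        src++;                   /* accept the '+' */
        if ( !n() ) {
            return 0;
        }
    }
    return 1;
}

/* Returns 1 if the whole string is derivable from e, else 0. */
int accepts( const char *text )
{
    src = text;
    return e() && *src == '\0';
}
```

With this rewrite accepts( "1+2+3" ) succeeds and accepts( "1+" ) fails, and every call to e() reads at least one symbol before any further call can happen.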

14 Uses of Grammar to Parse
1.) Once we have a suitable grammar G, use G to mechanically (automatically) design a parser for language L(G). The method is named "Recursive Descent Parsing"; it is a common, old method, outlined below.
2.) Once we have a suitable grammar G, encode G directly as a data structure. Then write a simple loop that reads the source and traverses the data structure, driven by the incoming token stream, deciding at each point which production of G to use that would allow the current source symbol.
3.) If indeed a person can "mechanically implement a parser for all strings in L" given G, then a program can do so as well; Church Thesis. These programs exist and are called parser generators. Their inventors sometimes call them "Compiler Compilers"; sounds fancier. A widely used industrial-quality parser generator is YACC, so named after the tongue-in-cheek phrase: Yet Another Compiler Compiler. Available on Unix systems.
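Option 2.) can be sketched in C for the tiny grammar s : ( s ) s | lambda (the non-left-recursive form of the parentheses grammar). This is one possible encoding under stated assumptions, not the course's implementation: the grammar is stored as a lookup from (nonterminal, look-ahead) to a right-hand side, and a loop with an explicit stack replaces the recursive calls. The names production_for, table_parse and MAX_STACK are hypothetical.

```c
#include <stddef.h>
#include <string.h>

enum { MAX_STACK = 256 };        /* assumed upper bound on pending symbols */

/* The grammar as data: which right-hand side to use for a nonterminal,
 * given one symbol of look-ahead. "" stands for lambda. */
static const char *production_for( char nonterminal, char look_ahead )
{
    if ( nonterminal == 's' && look_ahead == '(' ) {
        return "(s)s";
    }
    return "";
}

/* Returns 1 if text is in the language, else 0. */
int table_parse( const char *text )
{
    char   stack[ MAX_STACK ];
    size_t top = 0;

    stack[ top++ ] = 's';                      /* start symbol */
    while ( top > 0 ) {
        char symbol = stack[ --top ];
        if ( symbol == 's' ) {                 /* nonterminal: expand it */
            const char *rhs = production_for( symbol, *text );
            size_t      len = strlen( rhs );
            if ( top + len > MAX_STACK ) {
                return 0;                      /* nesting too deep for sketch */
            }
            while ( len > 0 ) {                /* push right-hand side reversed */
                stack[ top++ ] = rhs[ --len ];
            }
        } else if ( symbol == *text ) {        /* terminal: must match input */
            text++;
        } else {
            return 0;
        }
    }
    return *text == '\0';                      /* no garbage after the program */
}
```

The parsing loop itself knows nothing about parentheses; changing the language means changing only production_for, which is the point of driving the parser from a data structure.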

15 Now for the MAIN idea:

16 Recursive Descent Goal: Describe an algorithm for mechanically producing a parser for language L(G) using grammar G. Preparation: Write a scanner, AKA lexical analyzer, scan() that reads the source program one character at a time and returns a token t for each string of characters constituting a whole token, AKA lexeme. Lambda is not one of the possible tokens. And then:
• For each nonterminal n defined in G, define a recursive function/procedure by that name n(); we'll skip some nonterminals
• For each nonterminal n used on the right-hand side in G, issue a call to n()
• For each terminal t that is required by any alternative in G, call must_be( t ) to verify that t was found, and scan() the next token after t
• When a production has multiple alternatives, use the mutually exclusive first-sets of each nonterminal and the next input token t (i.e. look-ahead 1) to determine which nonterminal n to call; if the first-set does not resolve this: not a suitable grammar!

17 Recursive Descent Parser For s
Grammar G0:
s : ( s ) s
  |
Sample strings in L(G0): () or ((())) or ()()() but not )(
scan(): For such simple tokens (AKA lexemes) consisting of the single characters '(' and ')', the scanner can be as simple as the C/C++ function getchar(). Generally, tokens are multi-character symbols.
Function must_be( t ) simply checks for the expected symbol t:
// assume global: char next_char, and function void scan()
void must_be( char expected )
{ // must_be
    if ( next_char != expected ) {
        printf( "Expect '%c', is '%c'.\n", expected, next_char );
    } //end if
    scan();
} //end must_be

18 Recursive Descent Parser For parens()
void scan( )
{ // scan
    next_char = getchar();             // read next input character
    if ( BLANK == next_char ) {        // skip ' '
        scan();
    }else{
        printf( "%c", next_char );     // echo the non-blank found
    } //end if
} //end scan

void parens()
{ // parens
    if ( next_char == OPEN ) {         // that is open parenthesis '('
        scan();
        parens();                      // recurse for nested ( (
        must_be( CLOSED );             // i.e. closed parenthesis ')'
        parens();                      // recurse for sequence ( ) ( )
    } //end if                         // no more OPEN found; return
} //end parens

int main()
{ // main
    scan();                            // get first ever token
    parens();                          // parse the language
    Assert( EOF, "Garbage found" );    // Assert() is assumed to be defined elsewhere
    return 0;
} //end main
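For experimenting without typing input at the keyboard, the same parser can be sketched so that it reads from a string and reports success as a return value. This is a variation under stated assumptions, not the slide code: the names input, error_count and in_g0 are hypothetical, scan() takes its characters from a string instead of getchar(), and errors are counted rather than printed.

```c
static const char *input;        /* rest of the source string */
static char        next_char;    /* one symbol of look-ahead */
static int         error_count;  /* number of mismatches seen */

/* Fetch the next non-blank character; stays at '\0' once the end is reached. */
static void scan( void )
{
    do {
        next_char = *input;
        if ( next_char != '\0' ) {
            input++;
        }
    } while ( next_char == ' ' );
}

/* Verify the expected terminal, then move on to the next token. */
static void must_be( char expected )
{
    if ( next_char != expected ) {
        error_count++;
    }
    scan();
}

/* s : ( s ) s | lambda */
static void parens( void )
{
    if ( next_char == '(' ) {
        scan();
        parens();                /* nested parentheses */
        must_be( ')' );
        parens();                /* sequence of parentheses */
    }
}

/* Returns 1 if text is in L(G0), else 0. */
int in_g0( const char *text )
{
    input = text;
    error_count = 0;
    scan();                      /* get the first token */
    parens();
    return error_count == 0 && next_char == '\0';
}
```

With this sketch the sample strings (), ((())) and ()()() are accepted and )( is rejected; the empty string is accepted as well, because the start symbol can generate lambda.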

19 Repeat of Grammar G2
expression : term { plus_op term }
plus_op    : + | -
term       : factor { mult_op factor }
mult_op    : * | /
factor     : primary { ^ primary }
primary    : ( expression ) | number
number     : 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

20 Parser For G2 expression, 1
// parser for grammar G2:
//
// expression : term { plus_op term }
// plus_op    : '+' | '-'
// term       : factor { mult_op factor }
// mult_op    : '*' | '/'
// factor     : primary { ^ primary }
// primary    : '(' expression ')'
//            | number
// number     : '0' | '1' | '2' ... '9'
//
#include <stdio.h>                     // printf(), getchar()

#define BLANK  ' '
#define EOL    '\n'
#define OPEN   '('
#define CLOSED ')'

char next_char = BLANK;                // globally used for "token"

// check that the current token is c; if so, scan the next one
#define ASSERT( c )                                                    \
    if ( next_char != c ) {                                            \
        printf( "Error, expected '%c', found '%c'\n", c, next_char );  \
    }else{                                                             \
        scan();                                                        \
    } //end if

void scan( )
{ // scan
    next_char = getchar();
    if ( BLANK == next_char ) {
        scan();
    }else{
        printf( "%c", next_char );     // echo non-blank found
    } //end if
} //end scan

void expression();                     // forward announcement!!

21 Parser For G2 expression, 2
// really just scans a digit
// but one is expected; if not found: error
void number()
{ // number
    if ( ( next_char >= '0' ) && ( next_char <= '9' ) ) {
        scan();
    }else{
        printf( "primary expression 0,1,2.. or '(' expected.\n" );
    } //end if
} //end number

// parse primary expression, either:
// ( ... ) or a number
void primary()
{ // primary
    if ( next_char == OPEN ) {
        scan();
        expression();
        ASSERT( CLOSED );
    }else{
        number();
    } //end if
} //end primary

22 Parser For G2 expression, 3
// parse highest priority operator ^
void factor()
{ // factor
    primary();
    while ( next_char == '^' ) {
        scan();
        primary();
    } //end while
} //end factor

// parse multiply operators; skip mult_op nonterminal
void term()
{ // term
    factor();
    while ( ( next_char == '*' ) || ( next_char == '/' ) ) {
        // note: abbreviation from "mult_op()"
        scan();
        factor();
    } //end while
} //end term

// parse adding operators + and -, skip plus_op nonterminal
void expression()
{ // expression
    term();
    while ( ( next_char == '+' ) || ( next_char == '-' ) ) {
        // note: abbreviation from "plus_op()"
        scan();
        term();
    } //end while
} //end expression

23 Parser For G2 expression, 4
// get first token
// then parse complete expression
// assert no more source after expression
//
int main()
{ // main
    scan();
    expression();
    ASSERT( EOL );
    return 0;
} //end main

24 Sample Input for expression e() ( ( ( 5 + 3* 3 ) / ( 5^6 ) - 2 ) ^ ( 2 ^ 6 ^ 7 ) )
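One way to try the sample input without a terminal is a string-based variant of the G2 parser, sketched below in C. It follows the structure of the slide code, but the names input, error_count and accepts_g2 are hypothetical, scan() reads from a string instead of getchar(), errors are counted instead of printed, and number() is folded into primary() to keep the sketch short.

```c
static const char *input;        /* rest of the source string */
static char        next_char;    /* one symbol of look-ahead */
static int         error_count;  /* number of syntax errors seen */

static void expression( void );  /* forward announcement */

/* Fetch the next non-blank character; stays at '\0' at the end. */
static void scan( void )
{
    do {
        next_char = *input;
        if ( next_char != '\0' ) {
            input++;
        }
    } while ( next_char == ' ' );
}

/* primary : '(' expression ')' | number */
static void primary( void )
{
    if ( next_char == '(' ) {
        scan();
        expression();
        if ( next_char == ')' ) {
            scan();
        } else {
            error_count++;       /* closing parenthesis missing */
        }
    } else if ( next_char >= '0' && next_char <= '9' ) {
        scan();                  /* number: a single digit */
    } else {
        error_count++;           /* digit or '(' expected */
    }
}

/* factor : primary { '^' primary } */
static void factor( void )
{
    primary();
    while ( next_char == '^' ) {
        scan();
        primary();
    }
}

/* term : factor { mult_op factor } */
static void term( void )
{
    factor();
    while ( next_char == '*' || next_char == '/' ) {
        scan();
        factor();
    }
}

/* expression : term { plus_op term } */
static void expression( void )
{
    term();
    while ( next_char == '+' || next_char == '-' ) {
        scan();
        term();
    }
}

/* Returns 1 if text is a complete expression of G2, else 0. */
int accepts_g2( const char *text )
{
    input = text;
    error_count = 0;
    scan();
    expression();
    return error_count == 0 && next_char == '\0';
}
```

Under these assumptions the sample input above is accepted, while inputs such as 2+*3 or (2+3 are rejected.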

25 A Parsing Variation We broke the general rule for Recursive Descent Parsing, namely defining a recursive function for each nonterminal symbol in G. For example, we coded the scanning of operators (such as + and -, or * and /) directly in-line, using a while loop to parse the repeated operators. In such cases the semantic actions can be associated with the operator just scanned, in a left-to-right fashion; i.e. the semantic actions are done left-associatively. An equally elegant way is to use an if statement and call the parsing function directly recursively. This easily allows right-associative semantic actions: the recursion parses multiple operators of the same precedence.
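The difference between the two styles becomes visible once a semantic action is attached. The following C sketch evaluates chains of single digits joined by '-': once with the action inside a while loop, once with the action after a recursive call. The rules e : n { - n } and e : n [ - e ] and the names p, digit, e_loop, e_rec, eval_left and eval_right are assumptions for illustration; the input is assumed to be well-formed, and no error handling is shown.

```c
static const char *p;            /* next unread character */

/* n : 0 | 1 | ... | 9; returns the digit's value (input assumed well-formed) */
static int digit( void )
{
    return *p++ - '0';
}

/* e : n { '-' n }   action inside the loop: left-associative */
static int e_loop( void )
{
    int value = digit();
    while ( *p == '-' ) {
        p++;
        value = value - digit();     /* applied as soon as the operand is read */
    }
    return value;
}

/* e : n [ '-' e ]   action after the recursive call: right-associative */
static int e_rec( void )
{
    int value = digit();
    if ( *p == '-' ) {
        p++;
        value = value - e_rec();     /* the rest of the chain is evaluated first */
    }
    return value;
}

int eval_left( const char *text )  { p = text; return e_loop(); }
int eval_right( const char *text ) { p = text; return e_rec(); }
```

For the input 8-3-2 the loop version computes (8-3)-2 = 3, while the recursive version computes 8-(3-2) = 7. The recursive shape is the one that suits a right-associative operator such as ^.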

26 Change Grammar G2 to G3
expression : term [ plus_op expression ]
plus_op    : + | -
term       : factor [ mult_op term ]
mult_op    : * | /
factor     : primary [ ^ factor ]
primary    : ( expression ) | number
number     : 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

27 Modified Parser For G3
// parse highest priority operator ^
void factor()
{ // factor
    primary();
    if ( '^' == next_char ) {
        scan();
        factor();        // <- parse repeated ^ operators
    } //end if
} //end factor

// parse multiply operators; skip mult_op nonterminal
void term()
{ // term
    factor();
    if ( ( next_char == '*' ) || ( next_char == '/' ) ) {
        // note: abbreviation from "mult_op()"
        scan();
        term();          // <- parse repeated * and / operators
    } //end if
} //end term

// parse adding operators + and -, skip plus_op nonterminal
void expression()
{ // expression
    term();
    if ( ( next_char == '+' ) || ( next_char == '-' ) ) {
        // note: abbreviation from "plus_op()"
        scan();
        expression();    // <- parse repeated + and - operators
    } //end if
} //end expression

28 Data Structure and Grammar To be handled in the compiler course. Possibly a future extension of CS 410/510.

29 Grammar G4 For Statement s()
s                : statement [ s ]
statement        : if_statement | assign_statement
if_statement     : IF_SYM expression THEN_SYM s [ ELSE_SYM s ] FI_SYM ';'
assign_statement : ident '=' expression ';'
-- expression: as discussed earlier
-- *_SYM: these are tokens returned by scan()

30 Parser For G4 Statements s(), Part 1
void s();                        // forward announcement

void assign_statement()
{ // assign_statement
    must_be( ident );
    must_be( assign_sym );
    expression();
    must_be( semi_sym );
} //end assign_statement

31 Parser For G4 Statements s(), Part 2
void if_statement()
{ // if_statement
    must_be( if_sym );
    expression();
    must_be( then_sym );
    s();
    if ( else_sym == token ) {
        scan();
        s();
    } //end if
    must_be( fi_sym );
    must_be( semi_sym );
} //end if_statement

32 Parser For G4 Statements s(), Part 3
void statement()
{ // statement
    if ( if_sym == token ) {
        if_statement();
    }else{
        assign_statement();
    } //end if
} //end statement

void s()
{ // s
    statement();
    // use first-set: more statements?
    if ( ( if_sym == token ) || ( ident == token ) ) {
        s();
    } //end if
} //end s
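The three parts can be tried out together with a small self-contained sketch in C. Several things here are assumptions made only for illustration: the scanner is replaced by a pre-built token array ending in a hypothetical end_sym, expression() is reduced to a single number_sym token as a stand-in for the expression parser shown earlier, errors are counted instead of printed, and the names token_t, source, error_count and parse_program are not from the slides.

```c
typedef enum {
    if_sym, then_sym, else_sym, fi_sym,
    ident, assign_sym, semi_sym,
    number_sym,                  /* stand-in for a whole expression */
    end_sym                      /* hypothetical end-of-source marker */
} token_t;

static const token_t *source;    /* rest of the token stream */
static token_t        token;     /* one token of look-ahead */
static int            error_count;

static void s( void );           /* forward announcement */

/* Deliver the next token; stays at end_sym once it is reached. */
static void scan( void )
{
    token = *source;
    if ( token != end_sym ) {
        source++;
    }
}

static void must_be( token_t expected )
{
    if ( token != expected ) {
        error_count++;
    }
    scan();
}

static void expression( void )   /* simplified: one number token */
{
    must_be( number_sym );
}

static void assign_statement( void )
{
    must_be( ident );
    must_be( assign_sym );
    expression();
    must_be( semi_sym );
}

static void if_statement( void )
{
    must_be( if_sym );
    expression();
    must_be( then_sym );
    s();
    if ( else_sym == token ) {
        scan();
        s();
    }
    must_be( fi_sym );
    must_be( semi_sym );
}

static void statement( void )
{
    if ( if_sym == token ) {
        if_statement();
    } else {
        assign_statement();
    }
}

static void s( void )
{
    statement();
    /* use first-set: more statements? */
    if ( if_sym == token || ident == token ) {
        s();
    }
}

/* Returns 1 if the token array (terminated by end_sym) parses, else 0. */
int parse_program( const token_t *tokens )
{
    source = tokens;
    error_count = 0;
    scan();
    s();
    return error_count == 0 && token == end_sym;
}
```

For example, the token sequence for "if 1 then a = 2; else b = 3; fi;" is accepted under these assumptions, and the same sequence with fi_sym removed is rejected.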

33 References
• Algol-60 Report: http://www.masswerk.at/algol60/report.htm
• John Backus: http://www-03.ibm.com/ibm/history/exhibits/builders/builders_backus.html
• BNF: http://cui.unige.ch/db-research/Enseignement/analyseinfo/AboutBNF.html
• ISO EBNF: http://www.cl.cam.ac.uk/~mgk25/iso-ebnf.html
• Left-Recursion elimination, see: Herbert G Mayer, "Programming Languages", © 1988 MacMillan Publishing Co., ISBN: 0-02-378295-1
• Church Thesis: http://plato.stanford.edu/entries/church-turing/
• YACC: http://dinosaur.compilertools.net/yacc/
