Parsing
Giuseppe Attardi, Università di Pisa
Parsing
Calculate the grammatical structure of a program, like diagramming sentences, where:
Tokens = “words”
Programs = “sentences”
For further information: Aho, Sethi, Ullman, “Compilers: Principles, Techniques, and Tools” (a.k.a. the “Dragon Book”)
Outline of coverage
Context-free grammars
Parsing
– Tabular parsing methods
– One-pass: top-down, bottom-up
Yacc
Parser: extracts grammatical structure of program
[Diagram: parse tree for cout << “hello, world\n” — a function-def for main with name, arguments and stmt-list; the stmt is an expression applying operator << to the variable cout and a string literal]
Context-free languages
Grammatical structure is defined by a context-free grammar:
statement → labeled-statement | expression-statement | compound-statement
labeled-statement → ident : statement | case constant-expression : statement
compound-statement → { declaration-list statement-list }
(symbols like ident are terminals; statement etc. are non-terminals)
“Context-free” = only one non-terminal in the left part of each production
Parse trees
Parse tree = tree labeled with grammar symbols, such that:
If a node is labeled A, and its children are labeled x1 … xn, then there is a production A → x1 … xn
“Parse tree from A” = root labeled with A
“Complete parse tree” = all leaves labeled with tokens
Parse trees and sentences
Frontier of tree = labels on leaves (in left-to-right order)
Frontier of a tree from S is a sentential form
Frontier of a complete tree from S is a sentence
[Diagram: tree from L whose frontier is a ; E]
Example
G: L → L ; E | E
E → a | b
Syntax trees from the start symbol (L), with sentential forms such as a, a;E and a;b;b as frontiers:
[Diagram: parse trees from L for these sentential forms]
Derivations
Alternate definition of sentence:
Given α, β in V*, say α ⇒ β is a derivation step if α = α′ A α″ and β = α′ γ α″, where A → γ is a production
β is a sentential form iff there exists a derivation (sequence of derivation steps) S ⇒ … ⇒ β (alternatively, we say that S ⇒* β)
The two definitions are equivalent, but note that there may be many derivations corresponding to each parse tree
Another example
H: L → E ; L | E
E → a | b
[Diagram: parse trees from L under grammar H for the same sentential forms]
Ambiguity
For some purposes, it is important to know whether a sentence can have more than one parse tree
A grammar is ambiguous if there is a sentence with more than one parse tree
Example: E → E + E | E * E | id
[Diagram: two distinct parse trees for the same sentence, one with + at the root, the other with * at the root]
Notes
if e then if b then d else f
{ int x; y = 0; }
A.b.c = d;
Id → s | s.id
E → E + T → E + T + T → T + T + T → id + T + T → id + T * id + T → id + id * id + T → id + id * id + id
Ambiguity
Ambiguity is a function of the grammar rather than the language
Certain ambiguous grammars may have equivalent unambiguous ones
Grammar Transformations
Grammars can be transformed without affecting the language generated
Three transformations are discussed next:
– Eliminating ambiguity
– Eliminating left recursion (i.e. productions of the form A → A α)
– Left factoring
Eliminating Ambiguity
Sometimes an ambiguous grammar can be rewritten to eliminate the ambiguity
For example, expressions involving additions and products can be written as follows:
E → E + T | T
T → T * id | id
The language generated by this grammar is the same as that generated by the grammar on the slide “Ambiguity”: both generate id (+ id | * id)*
However, this grammar is not ambiguous
Eliminating Ambiguity (Cont.)
One advantage of this grammar is that it represents the precedence between operators: in the parse tree, products appear nested within additions
[Diagram: parse tree for id + id * id, with the * subtree nested under the + node]
Eliminating Ambiguity (Cont.)
An example of ambiguity in a programming language is the dangling else
Consider
S → if E then S else S
  | if E then S
  | other
Eliminating Ambiguity (Cont.)
When there are two nested ifs and only one else…
if E then (if E then S) else S — the else attached to the outer if
if E then (if E then S else S) — the else attached to the inner if
[Diagram: the two parse trees, differing in which if the else attaches to]
Eliminating Ambiguity (Cont.)
In most languages (including C++ and Java), each else is assumed to belong to the nearest if that is not already matched by an else. This association is expressed in the following (unambiguous) grammar:
S → Matched
  | Unmatched
Matched → if E then Matched else Matched
  | other
Unmatched → if E then S
  | if E then Matched else Unmatched
Eliminating Ambiguity (Cont.)
Ambiguity is a property of the grammar
It is undecidable whether a context-free grammar is ambiguous
The proof is by reduction from Post’s correspondence problem
Although there is no general algorithm, it is possible to isolate certain constructs in productions which lead to ambiguous grammars
Eliminating Ambiguity (Cont.)
For example, a grammar containing the production A → A A | α would be ambiguous, because the string α α α has two parses:
[Diagram: two parse trees for α α α, grouping either the left pair or the right pair first]
This ambiguity disappears if we use the productions
A → A B | B and B → α
or the productions
A → B A | B and B → α
Eliminating Ambiguity (Cont.)
Examples of ambiguous productions:
A → A A
A → A α | α A
A CF language is inherently ambiguous if it has no unambiguous CFG
– An example of such a language is L = { a^i b^j c^m | i = j or j = m }, which can be generated by the grammar:
S → A B | D C
A → a A | ε
C → c C | ε
B → b B c | ε
D → a D b | ε
Elimination of Left Recursion
A grammar is left recursive if it has a nonterminal A and a derivation A ⇒+ A α for some string α
– Top-down parsing methods cannot handle left-recursive grammars, so a transformation to eliminate left recursion is needed
Immediate left recursion (productions of the form A → A α) can be easily eliminated:
1. Group the A-productions as A → A α1 | A α2 | … | A αm | β1 | β2 | … | βn, where no βi begins with A
2. Replace the A-productions by
A → β1 A′ | β2 A′ | … | βn A′
A′ → α1 A′ | α2 A′ | … | αm A′ | ε
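The two-step recipe above can be sketched as a tiny helper. This is an illustrative sketch, not from any library — the class and method names are made up. It takes the αi tails of the left-recursive A-productions and the βj right-hand sides of the remaining ones, and emits the transformed A and A′ productions as strings.

```java
import java.util.*;

// Sketch of immediate-left-recursion elimination.
// alphas: the tails α_i of A -> A α_i; betas: the right sides β_j of A -> β_j.
class LeftRecursion {
    static List<String> eliminate(String a, List<String> alphas, List<String> betas) {
        String aPrime = a + "'";                 // the fresh nonterminal A'
        List<String> out = new ArrayList<>();
        for (String beta : betas)
            out.add(a + " -> " + beta + " " + aPrime);        // A  -> β_j A'
        for (String alpha : alphas)
            out.add(aPrime + " -> " + alpha + " " + aPrime);  // A' -> α_i A'
        out.add(aPrime + " -> ε");                            // A' -> ε
        return out;
    }
}
```

Applied to E → E + T | T this yields E → T E′, E′ → + T E′ | ε, the usual non-left-recursive expression grammar.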
Elimination of Left Recursion (Cont.)
The previous transformation, however, does not eliminate left recursion involving two or more steps
For example, consider the grammar
S → A a | b
A → A c | S d | ε
S is left recursive because S ⇒ A a ⇒ S d a, but it is not immediately left recursive
Elimination of Left Recursion (Cont.)
Algorithm. Eliminate left recursion:
Arrange the nonterminals in some order A1, A2, …, An
for i = 1 to n {
  for j = 1 to i − 1 {
    replace each production of the form Ai → Aj γ
    by the productions Ai → δ1 γ | δ2 γ | … | δk γ,
    where Aj → δ1 | δ2 | … | δk are all the current Aj-productions
  }
  eliminate the immediate left recursion among the Ai-productions
}
Elimination of Left Recursion (Cont.)
To show that the previous algorithm actually works, notice that iteration i only changes productions with Ai on the left-hand side, and that m > i in all resulting productions of the form Ai → Am α
Induction proof:
– Clearly true for i = 1
– If it is true for all i < k, then when the outer loop is executed for i = k, the inner loop removes all productions Ai → Am α with m < i
– Finally, with the elimination of self recursion, m in the Ai → Am α productions is forced to be > i
At the end of the algorithm, all productions of the form Ai → Am α have m > i, and therefore left recursion is no longer possible
Left Factoring
Left factoring helps transform a grammar for predictive parsing
For example, if we have the two productions
S → if E then S else S
  | if E then S
on seeing the input token if, we cannot immediately tell which production to choose to expand S
In general, if we have A → α β1 | α β2 and the input begins with α, we do not know (without looking further) which production to use to expand A
Left Factoring (Cont.)
However, we may defer the decision by expanding A to α A′
Then, after seeing the input derived from α, we may expand A′ to β1 or to β2
Left-factored, the original productions become
A → α A′
A′ → β1 | β2
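One left-factoring step can be sketched in the same style as the left-recursion helper above. Again an illustrative sketch with made-up names: given the common prefix α and the distinct tails βj (an empty tail standing for ε), it emits A → α A′ and the A′-productions.

```java
import java.util.*;

// Sketch of one left-factoring step: A -> α β_1 | α β_2 | ...
// becomes A -> α A' and A' -> β_1 | β_2 | ...
class LeftFactor {
    static List<String> factor(String a, String alpha, List<String> betas) {
        String aPrime = a + "'";                 // the fresh nonterminal A'
        List<String> out = new ArrayList<>();
        out.add(a + " -> " + alpha + " " + aPrime);
        for (String beta : betas)                // empty tail means A' -> ε
            out.add(aPrime + " -> " + (beta.isEmpty() ? "ε" : beta));
        return out;
    }
}
```

On the dangling-else productions this defers the else decision exactly as described: S → if E then S S′, S′ → else S | ε.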
Non-Context-Free Language Constructs
Examples of non-context-free languages are:
– L1 = { w c w | w is of the form (a|b)* }
– L2 = { a^n b^m c^n d^m | n ≥ 1 and m ≥ 1 }
– L3 = { a^n b^n c^n | n ≥ 0 }
Languages similar to these that are context-free:
– L′1 = { w c w^R | w is of the form (a|b)* } (w^R stands for w reversed)
  This language is generated by the grammar S → a S a | b S b | c
– L′2 = { a^n b^m c^m d^n | n ≥ 1 and m ≥ 1 }
  This language is generated by the grammar S → a S d | a A d, A → b A c | bc
Non-Context-Free Language Constructs (Cont.)
L″2 = { a^n b^n c^m d^m | n ≥ 1 and m ≥ 1 }
is generated by the grammar
S → A B
A → a A b | ab
B → c B d | cd
L′3 = { a^n b^n | n ≥ 1 }
is generated by the grammar
S → a S b | ab
This language is not definable by any regular expression
Non-Context-Free Language Constructs (Cont.)
Suppose we could construct a DFSM D accepting L′3
D must have a finite number of states, say k
Consider the sequence of states s0, s1, s2, …, sk entered by D having read ε, a, aa, …, a^k
Since D only has k states, two of the states in the sequence have to be equal; say si = sj (i ≠ j)
From si, a sequence of i b’s leads to an accepting (final) state. Therefore, the same sequence of i b’s will also lead to an accepting state from sj. Therefore D would accept a^j b^i, which means that the language accepted by D is not identical to L′3. A contradiction.
Parsing
The parsing problem is: given a string of tokens w, find a parse tree whose frontier is w (equivalently, find a derivation of w)
A parser for a grammar G reads a list of tokens and finds a parse tree if they form a sentence (or reports an error otherwise)
Two classes of algorithms for parsing:
– Top-down
– Bottom-up
Parser generators
A parser generator is a program that reads a grammar and produces a parser
The best known parser generator is yacc; it produces bottom-up parsers
Most parser generators — including yacc — do not work for every CFG; they accept a restricted class of CFGs that can be parsed efficiently using the method employed by that parser generator
Top-down parsing
Starting from a parse tree containing just S, build the tree down toward the input; expand the left-most non-terminal
Algorithm: (next slide)
Top-down parsing (cont.)
Let input = a1 a2 … an
current sentential form (csf) = S
loop {
  suppose csf = a1 … ak A γ
  based on ak+1 …, choose a production A → β
  csf becomes a1 … ak β γ
}
Top-down parsing example
Grammar H: L → E ; L | E
E → a | b
Input: a;b
Steps (sentential form / remaining input):
L / a;b
E ; L / a;b
a ; L / a;b (a ; matched)
Top-down parsing example (cont.)
Steps continued (sentential form / remaining input):
a ; E / b
a ; b / — all input matched
LL(1) parsing
Efficient form of top-down parsing
Use only the first symbol of the remaining input (ak+1) to choose the next production. That is, employ a function M: N × Σ → P in the “choose production” step of the algorithm
When this is possible, the grammar is called LL(1)
LL(1) examples
Example 1:
H: L → E ; L | E
E → a | b
Given input a;b, the next symbol is a. Which L-production to use? Can’t tell: H is not LL(1)
LL(1) examples
Example 2:
Exp → Term Exp′
Exp′ → $ | + Exp
Term → id
(Use $ for the “end-of-input” symbol.)
The grammar is LL(1): Exp and Term have only one production; Exp′ has two productions, but only one is applicable at any time
Nonrecursive predictive parsing
Maintain a stack explicitly, rather than implicitly via recursive calls
Key problem during predictive parsing: determining the production to be applied for a non-terminal
Nonrecursive predictive parsing
Algorithm. Nonrecursive predictive parsing:
Set ip to point to the first symbol of w$
repeat
  let X be the top-of-stack symbol and a the symbol pointed to by ip
  if X is a terminal or $ then
    if X == a then
      pop X from the stack and advance ip
    else error()
  else // X is a nonterminal
    if M[X,a] == X → Y1 Y2 … Yk then
      pop X from the stack
      push Yk, Yk−1, …, Y1 onto the stack, with Y1 on top
      (push nothing if Y1 Y2 … Yk is ε)
      output the production X → Y1 Y2 … Yk
    else error()
until X == $
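The loop above can be sketched directly in Java. This is a minimal illustration, hard-wiring the table M for the small LL(1) grammar Exp → Term Exp′, Exp′ → + Exp | $, Term → id from the earlier example; the class name and the string encoding of symbols are this sketch's own choices.

```java
import java.util.*;

// Minimal sketch of the table-driven (nonrecursive) predictive parser.
// M maps (nonterminal, lookahead token) to the right-hand side to push.
class PredictiveParser {
    static final Map<String, Map<String, List<String>>> M = Map.of(
        "Exp",  Map.of("id", List.of("Term", "Exp'")),
        "Exp'", Map.of("+",  List.of("+", "Exp"),
                       "$",  List.of("$")),
        "Term", Map.of("id", List.of("id")));
    static final Set<String> NONTERMINALS = M.keySet();

    // Returns true iff the token list (ending in "$") is accepted.
    static boolean parse(List<String> tokens) {
        Deque<String> stack = new ArrayDeque<>();
        stack.push("Exp");                       // start symbol
        int ip = 0;
        while (!stack.isEmpty()) {
            if (ip >= tokens.size()) return false;   // ran out of input
            String x = stack.pop();
            String a = tokens.get(ip);
            if (!NONTERMINALS.contains(x)) {     // terminal (or $)
                if (!x.equals(a)) return false;  // mismatch: error
                ip++;                            // match: advance ip
            } else {
                List<String> rhs = M.get(x).get(a);
                if (rhs == null) return false;   // empty table entry: error
                for (int i = rhs.size() - 1; i >= 0; i--)
                    stack.push(rhs.get(i));      // push RHS, leftmost on top
            }
        }
        return ip == tokens.size();
    }
}
```

For instance, parse(["id", "+", "id", "$"]) accepts, while parse(["id", "+", "$"]) fails at the empty entry M[Exp, $].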
LL(1) grammars
No left recursion:
A → A α — if this production is chosen, the parse makes no progress
No common prefixes:
A → α β | α γ
Can fix by “left factoring”:
A → α A′
A′ → β | γ
LL(1) grammars (cont.)
No ambiguity
The precise definition requires that the production to choose be unique (the “choose” function M is very hard to calculate otherwise)
Top-down Parsing
[Diagram: start symbol at the root of the parse tree; input tokens E0 … En along the frontier. From left to right, “grow” the parse tree downwards]
Checking LL(1)-ness
For any sequence of grammar symbols α, define the set FIRST(α) to be
FIRST(α) = { a | α ⇒* a β for some β }
LL(1) definition
Define: Grammar G = (N, Σ, P, S) is LL(1) iff whenever there are two leftmost derivations (in which the leftmost non-terminal is always expanded first)
S ⇒* w A γ ⇒ w α γ ⇒* w t x
S ⇒* w A γ ⇒ w β γ ⇒* w t y
it follows that α = β
In other words, given
1. a string w A γ in V* and
2. t, the first terminal symbol to be derived from A γ
there is at most one production that can be applied to A to yield a derivation of any terminal string beginning with w t
FIRST sets can often be calculated by inspection
FIRST Sets
Exp → Term Exp′
Exp′ → $ | + Exp
Term → id
(Use $ for the “end-of-input” symbol)
FIRST($) = { $ }
FIRST(+ Exp) = { + }
FIRST($) ∩ FIRST(+ Exp) = {} ⇒ grammar is LL(1)
FIRST Sets
L → E ; L | E
E → a | b
FIRST(E ; L) = { a, b } = FIRST(E)
FIRST(E ; L) ∩ FIRST(E) ≠ {} ⇒ grammar is not LL(1)
Computing FIRST Sets
Algorithm. Compute FIRST(X) for all grammar symbols X:
forall X ∈ V do FIRST(X) = {}
forall X ∈ Σ (X is a terminal) do FIRST(X) = {X}
forall productions X → ε do FIRST(X) = FIRST(X) ∪ {ε}
repeat
  c: forall productions X → Y1 Y2 … Yk do
    forall i ∈ [1,k] do
      FIRST(X) = FIRST(X) ∪ (FIRST(Yi) − {ε})
      if ε ∉ FIRST(Yi) then continue c
    FIRST(X) = FIRST(X) ∪ {ε}
until no more terminals or ε are added to any FIRST set
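The fixed-point iteration above can be sketched as follows. An illustrative sketch only: the encoding of a production as a String[] whose first element is the left-hand side, and the "ε" marker, are this sketch's own conventions.

```java
import java.util.*;

// Sketch of the fixed-point FIRST computation.
// Each production is a String[]: p[0] is the LHS, p[1..] the RHS symbols.
class FirstSets {
    static Map<String, Set<String>> compute(
            Set<String> nonterminals, List<String[]> productions) {
        Map<String, Set<String>> first = new HashMap<>();
        // Seed: FIRST(a) = {a} for terminals, {} for nonterminals.
        for (String[] p : productions)
            for (String s : p)
                first.computeIfAbsent(s, k ->
                    nonterminals.contains(k) || k.equals("ε")
                        ? new HashSet<>() : new HashSet<>(Set.of(k)));
        first.computeIfAbsent("ε", k -> new HashSet<>()).add("ε");
        boolean changed = true;
        while (changed) {                        // iterate to a fixed point
            changed = false;
            for (String[] p : productions) {
                Set<String> f = first.get(p[0]);
                boolean allNullable = true;      // have all Y_i so far had ε?
                for (int i = 1; i < p.length && allNullable; i++) {
                    for (String t : first.get(p[i]))
                        if (!t.equals("ε")) changed |= f.add(t);
                    allNullable = first.get(p[i]).contains("ε");
                }
                if (allNullable) changed |= f.add("ε");
            }
        }
        return first;
    }
}
```

On the non-LL(1) grammar L → E ; L | E, E → a | b, this yields FIRST(L) = FIRST(E) = {a, b}, matching the hand calculation on the previous slide.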
FIRST Sets of Strings of Symbols
FIRST(X1 X2 … Xn) is the union of FIRST(X1) and all FIRST(Xi) such that ε ∈ FIRST(Xk) for k = 1, 2, …, i−1
FIRST(X1 X2 … Xn) contains ε iff ε ∈ FIRST(Xk) for k = 1, 2, …, n
FIRST Sets do not Suffice
Given the productions
A → T x
A → T y
T → w
T → ε
T → w should be applied when the next input token is w
T → ε should be applied whenever the next terminal is either x or y
FOLLOW Sets
For any nonterminal X, define the set FOLLOW(X) as
FOLLOW(X) = { a | S ⇒* α X a β }
Computing the FOLLOW Set
Algorithm. Compute FOLLOW(X) for all nonterminals X:
FOLLOW(S) = {$}
forall productions A → α B β do FOLLOW(B) = FOLLOW(B) ∪ (FIRST(β) − {ε})
repeat
  forall productions A → α B, or A → α B β with ε ∈ FIRST(β), do
    FOLLOW(B) = FOLLOW(B) ∪ FOLLOW(A)
until all FOLLOW sets remain the same
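This algorithm, too, admits a short fixed-point sketch in the same String[] production encoding used above (an illustrative convention, not a library API); the FIRST sets are assumed precomputed and passed in.

```java
import java.util.*;

// Sketch of the fixed-point FOLLOW computation. "ε" marks the empty
// string, "$" end-of-input; first must cover every RHS symbol.
class FollowSets {
    static Map<String, Set<String>> compute(
            String start, Set<String> nonterminals,
            List<String[]> productions, Map<String, Set<String>> first) {
        Map<String, Set<String>> follow = new HashMap<>();
        for (String n : nonterminals) follow.put(n, new HashSet<>());
        follow.get(start).add("$");              // FOLLOW(S) contains $
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String[] p : productions) {     // p[0] -> p[1] ... p[k]
                for (int i = 1; i < p.length; i++) {
                    if (!nonterminals.contains(p[i])) continue;
                    Set<String> fb = follow.get(p[i]);
                    boolean restNullable = true; // does p[i+1..k] derive ε?
                    for (int j = i + 1; j < p.length && restNullable; j++) {
                        for (String t : first.get(p[j]))
                            if (!t.equals("ε")) changed |= fb.add(t);
                        restNullable = first.get(p[j]).contains("ε");
                    }
                    if (restNullable)            // A -> α B β with β ⇒* ε
                        changed |= fb.addAll(follow.get(p[0]));
                }
            }
        }
        return follow;
    }
}
```

On Exp → Term Exp′, Exp′ → + Term Exp′ | ε, Term → id this gives FOLLOW(Exp) = FOLLOW(Exp′) = {$} and FOLLOW(Term) = {+, $}.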
Construction of a predictive parsing table
Algorithm. Construction of a predictive parsing table:
M[:,:] = {}
forall productions A → α do
  forall a ∈ FIRST(α) do
    M[A,a] = M[A,a] ∪ {A → α}
  if ε ∈ FIRST(α) then
    forall b ∈ FOLLOW(A) do
      M[A,b] = M[A,b] ∪ {A → α}
Make all empty entries of M be error
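Given FIRST of each right-hand side and the FOLLOW sets, the table construction is a direct transcription of the algorithm. A sketch with illustrative names: entries store the index of the chosen production, and an absent entry plays the role of error.

```java
import java.util.*;

// Sketch of predictive-table construction. firstOfRhs.get(i) is
// FIRST(α) for the i-th production A -> α; follow maps A to FOLLOW(A).
class ParseTable {
    static Map<String, Map<String, Integer>> build(
            List<String[]> productions, List<Set<String>> firstOfRhs,
            Map<String, Set<String>> follow) {
        Map<String, Map<String, Integer>> m = new HashMap<>();
        for (int i = 0; i < productions.size(); i++) {
            String a = productions.get(i)[0];
            Map<String, Integer> row = m.computeIfAbsent(a, k -> new HashMap<>());
            for (String t : firstOfRhs.get(i))
                if (!t.equals("ε")) row.put(t, i);  // M[A,t] for t in FIRST(α)
            if (firstOfRhs.get(i).contains("ε"))
                for (String b : follow.get(a))       // ε case: use FOLLOW(A)
                    row.put(b, i);
        }
        return m;                                    // missing entry = error
    }
}
```

For the expression grammar this places Exp′ → ε in M[Exp′, $] via FOLLOW(Exp′), exactly as the ε-branch of the algorithm prescribes.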
Another Definition of LL(1)
Define: Grammar G is LL(1) if for every A ∈ N with productions A → α1 | … | αn
FIRST(αi FOLLOW(A)) ∩ FIRST(αj FOLLOW(A)) = {} for all i ≠ j
Regular Languages
Definition. A regular grammar is one whose productions are all of the type:
– A → a B
– A → a
A regular expression is either:
– a
– R1 | R2
– R1 R2
– R*
Nondeterministic Finite State Automaton
[Diagram: NFA with states 0–3, start state 0, and transitions labeled a and b]
Regular Languages
Theorem. The classes of languages
– generated by a regular grammar
– expressed by a regular expression
– recognized by an NDFS automaton
– recognized by a DFS automaton
coincide.
Deterministic Finite Automaton
[Diagram: scanner DFA with states START, NUM, KEYWORD, OPERATOR; transitions on space/tab/newline, digit, letter, and the operators =, +, -, /, (, ). Legend: circle = state, double circle = accept state, arrow = transition, bold capital labels = state names, lower-case labels = transition characters]
Scanner code
state := start
loop
  if no input character buffered then read one, and add it to the accumulated token
  case state of
    start:
      case input_char of
        A..Z, a..z : state := id
        0..9       : state := num
        else ...
      end
    id:
      case input_char of
        A..Z, a..z : state := id
        0..9       : state := id
        else ...
      end
    num:
      case input_char of
        0..9 : ...
        ...
        else ...
      end
    ...
  end
end
Table-driven DFA
[Table: scanner transition table over states 0-start, 1-num, 2-id, 3-operator, 4-keyword; rows for white space, letter, digit, operator and $, with each entry giving the next state, exit, or error]
Language Classes
[Diagram: nested language classes — RL (DFA = NFA) ⊂ LL(1) ⊂ LR(1) ⊂ CFL (NPA) ⊂ CSL ⊂ L0]
Question
Are regular expressions, as provided by Perl or other languages, sufficient for parsing nested structures, e.g. XML files?
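The language hierarchy above suggests the answer is no: classic regular expressions cannot count arbitrary nesting depth, which is exactly the a^n b^n pattern shown to be non-regular earlier. What suffices instead is a stack (equivalently, a pushdown automaton) — here sketched, with illustrative names, as a depth counter for balanced parentheses:

```java
// Sketch: checking balanced nesting needs a counter/stack, not a regex.
class Nesting {
    static boolean balanced(String s) {
        int depth = 0;                      // stack of identical symbols = counter
        for (char c : s.toCharArray()) {
            if (c == '(') depth++;          // push
            else if (c == ')' && --depth < 0) return false; // pop underflow
        }
        return depth == 0;                  // every open paren was closed
    }
}
```

(Some regex engines offer recursive extensions beyond classic regular expressions, but for XML a real parser is the appropriate tool.)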
Recursive Descent Parser
stat → var = expr ;
expr → term [ + expr ]
term → factor [ * term ]
factor → ( expr ) | var | constant
var → identifier
Scanner
public class Scanner {
    private StreamTokenizer input;
    private Type lastToken;
    public enum Type { INVALID_CHAR, NO_TOKEN, PLUS,
        // etc. for remaining tokens, then:
        EOF };

    public Scanner(Reader r) {
        input = new StreamTokenizer(r);
        input.resetSyntax();
        input.eolIsSignificant(false);
        input.wordChars('a', 'z');
        input.wordChars('A', 'Z');
        input.ordinaryChar('+');
        input.ordinaryChar('*');
        input.ordinaryChar('=');
        input.ordinaryChar('(');
        input.ordinaryChar(')');
        input.whitespaceChars('\u0000', ' ');
    }
Scanner
    public Type nextToken() {
        Type token;
        try {
            switch (input.nextToken()) {
            case StreamTokenizer.TT_EOF:
                token = EOF; break;
            case StreamTokenizer.TT_WORD:
                if (input.sval.equalsIgnoreCase("false"))
                    token = FALSE;
                else if (input.sval.equalsIgnoreCase("true"))
                    token = TRUE;
                else
                    token = VARIABLE;
                break;
            case '+':
                token = PLUS; break;
            // etc.
            }
        } catch (IOException ex) { token = EOF; }
        return token;
    }
}
Parser
public class Parser {
    private LexicalAnalyzer lexer;
    private Type token;

    public Statement parse(Reader r) throws SyntaxException {
        lexer = new LexicalAnalyzer(r);
        nextToken(); // assigns token
        Statement stat = stat();
        expect(LexicalAnalyzer.EOF);
        return stat;
    }
Statement
    // stat ::= variable '=' expr ';'
    private Statement stat() throws SyntaxException {
        Expr var = variable();
        expect(LexicalAnalyzer.ASSIGN);
        Expr exp = expr();
        Statement stat = new Statement(var, exp);
        expect(LexicalAnalyzer.SEMICOLON);
        return stat;
    }
Expr
    // expr ::= term ['+' expr]
    private Expr expr() throws SyntaxException {
        Expr exp = term();
        while (token == LexicalAnalyzer.PLUS) {
            nextToken();
            exp = new Expr(exp, expr());
        }
        return exp;
    }
Term
    // term ::= factor ['*' term]
    private Expr term() throws SyntaxException {
        Expr exp = factor();
        // Rest of body: left as an exercise.
    }
Factor
    // factor ::= ( expr ) | var
    private Expr factor() throws SyntaxException {
        Expr exp = null;
        if (token == LexicalAnalyzer.LEFT_PAREN) {
            nextToken();
            exp = expr();
            expect(LexicalAnalyzer.RIGHT_PAREN);
        } else {
            exp = variable();
        }
        return exp;
    }
Variable
    // variable ::= identifier
    private Expr variable() throws SyntaxException {
        if (token == LexicalAnalyzer.ID) {
            Expr exp = new Variable(lexer.getString());
            nextToken();
            return exp;
        }
        // else: throw SyntaxException — left as an exercise
    }
Constant
    private Expr constantExpression() throws SyntaxException {
        Expr exp = null;
        // Handle the various cases for constant
        // expressions: left as an exercise.
        return exp;
    }
Utilities
    private void expect(Type t) throws SyntaxException {
        if (token != t) {
            // throw SyntaxException...
        }
        nextToken();
    }

    private void nextToken() {
        token = lexer.nextToken();
    }
}