Parsing Giuseppe Attardi Università di Pisa.


Parsing
To extract the grammatical structure of a sentence, where:
  sentence = program
  words = tokens
For further information: Aho, Sethi, Ullman, "Compilers: Principles, Techniques, and Tools" (a.k.a. the "Dragon Book")

Outline of coverage
Context-free grammars
Parsing
  Tabular parsing methods
  One pass: top-down, bottom-up
Yacc

Grammatical structure of program
[Figure: parse tree of a function main() whose body is the expression cout << "hello, world\n". Node labels: function-def, name, arguments, stmt-list, stmt, expression, operator, variable, string.]

Context-free languages
Grammatical structure is defined by a context-free grammar:
  statement → assignment ;
  statement → expression ;
  statement → compound-statement
  assignment → ident = expression
  compound-statement → { declaration-list statement-list }
Definition (Context-free). A grammar is context-free if the left-hand side of every production is a single non-terminal symbol.
Symbols: terminal (token), non-terminal

Context Free Grammar G = (V, Σ, P, S)
V is a finite set of non-terminal symbols
Σ is a finite set of terminal symbols
P is a finite set of productions, P ⊆ V × (V ∪ Σ)*
S is the start symbol

Parse trees
Parse tree: tree labeled with grammar symbols, such that if a node is labeled A and its children are labeled x1...xn, then there is a production A → x1...xn
"Parse tree from A" = root labeled with A
"Complete parse tree" = all leaves labeled with tokens

Parse trees and sentences
Frontier of tree = labels on leaves (in left-to-right order)
Frontier of tree from S is a sentential form
Definition. A sentence is the frontier of a complete tree from S.
[Figure: a parse tree with nodes L, E, a and ; whose leaves are marked as the "frontier"]

Example
G: L → L ; E | E
   E → a | b
Syntax trees from start symbol (L): [Figure: three syntax trees]
Sentential forms: a    a;E    a;b;b

Derivations
Alternate definition of sentential form:
Given α, β in (V ∪ Σ)*, say α ⇒ β is a derivation step if α = α′Aα″ and β = α′γα″, where A → γ is a production
α is a sentential form iff there exists a derivation (sequence of derivation steps) S ⇒ … ⇒ α (alternatively, we say that S ⇒* α)
The two definitions are equivalent, but note that there are many derivations corresponding to each parse tree

Another example
H: L → E ; L | E
   E → a | b
[Figure: syntax trees from the start symbol L, built with the productions L → E ; L, L → E and E → a]

Ambiguity
A sentence can have more than one parse tree
A grammar is ambiguous if there is a sentence with more than one parse tree
Example 1: E → E+E | E*E | id
[Figure: two different parse trees for the same sentence over id, + and *]

Notes
If e then if b then d else f
{ int x; y = 0; }
a.b.c = d;
Id -> s | s.id
E -> E + T -> E + T + T -> T + T + T -> id + T + T -> id + T * id + T -> id + id * id + T -> id + id * id + id

Ambiguity
Ambiguity is a feature of the grammar rather than the language
Certain ambiguous grammars may have equivalent unambiguous ones

Grammar Transformations
Grammars can be transformed without affecting the language generated
Three transformations are discussed next:
  Eliminating Ambiguity
  Eliminating Left Recursion (i.e. productions of the form A → A α)
  Left Factoring

Eliminating Ambiguity
Sometimes an ambiguous grammar can be rewritten to eliminate ambiguity
For example, the grammar of Example 1 can be written as follows:
  E → E + T | T
  T → T * id | id
The language generated by this grammar is the same as that generated by the previous grammar. Both generate id(+id|*id)*
However, this grammar is not ambiguous

Eliminating Ambiguity (Cont.)
One advantage of this grammar is that it represents the precedence between operators. In the parse tree, products appear nested within additions
[Figure: parse tree with nodes E, T, id, + and *, where the product is nested below the addition]

An example of ambiguity in a programming language is the dangling else
Consider S → if b then S else S | if b then S | a

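When there are two nested ifs and only one else, the sentence has two parse trees.
[Figure: two parse trees for "if b then if b then a else a", one attaching the else to the inner if and one to the outer if]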

In most languages (including C++ and Java), each else is assumed to belong to the nearest if that is not already matched by an else. This association is expressed in the following (unambiguous) grammar:
  S → Matched | Unmatched
  Matched → if b then Matched else Matched | a
  Unmatched → if b then S | if b then Matched else Unmatched

Ambiguity is a property of the grammar
It is undecidable whether a context free grammar is ambiguous
The proof is done by reduction from Post's correspondence problem
Although there is no general algorithm, it is possible to isolate certain constructs in productions which lead to ambiguous grammars

For example, a grammar containing the production A → AA | a would be ambiguous, because the string aaa has two parses:
[Figure: two parse trees for aaa, one grouping (aa)a and one grouping a(aa)]
This ambiguity disappears if we use the productions A → AB | B and B → a, or the productions A → BA | B and B → a.
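The claim that aaa has two parses can be checked mechanically. The following is a minimal sketch, assuming we only care about strings of the form a^n under A → AA | a; the class and method names are illustrative and do not come from the slides.

```java
class ParseTreeCounter {
    // Number of distinct parse trees of the string a^n under A -> A A | a.
    static long countTrees(int n) {
        if (n < 1) return 0;                 // the grammar derives no empty string
        long[] trees = new long[n + 1];
        trees[1] = 1;                        // A -> a
        for (int len = 2; len <= n; len++) {
            // A -> A A: split the string into two nonempty parts
            for (int left = 1; left < len; left++) {
                trees[len] += trees[left] * trees[len - left];
            }
        }
        return trees[n];
    }

    public static void main(String[] args) {
        System.out.println(countTrees(3));   // prints 2
    }
}
```

Any count greater than 1 witnesses the ambiguity. With A → AB | B and B → a the split point is forced, so the count would always be 1.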

Examples of ambiguous productions:
  A → AaA
  A → aA | Ab
  A → aA | aAbA
A CF language is inherently ambiguous if it has no unambiguous CFG
An example of such a language is L = {a^i b^j c^m | i = j or j = m}, which can be generated by the grammar:
  S → AB | DC
  A → aA | ε
  C → cC | ε
  B → bBc | ε
  D → aDb | ε

Elimination of Left Recursion
A grammar is left recursive if it has a nonterminal A and a derivation A ⇒+ Aα for some string α.
Top-down parsing methods cannot handle left-recursive grammars, so a transformation to eliminate left recursion is needed
Immediate left recursion (productions of the form A → A α) can be easily eliminated:
1. Group the A-productions as
     A → A α1 | A α2 | … | A αm | β1 | β2 | … | βn
   where no βi begins with A
2. Replace the A-productions by
     A → β1 A′ | β2 A′ | … | βn A′
     A′ → α1 A′ | α2 A′ | … | αm A′ | ε
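The two steps for immediate left recursion can be sketched in Java. This is only an illustration under stated assumptions: a production's right-hand side is a list of symbol names, the empty list stands for ε, and the new nonterminal is named by appending a prime. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LeftRecursion {
    // Rewrites A -> A alpha_i | beta_j into A -> beta_j A' and A' -> alpha_i A' | epsilon.
    static Map<String, List<List<String>>> eliminate(String a, List<List<String>> alts) {
        String aPrime = a + "'";
        List<List<String>> forA = new ArrayList<>();
        List<List<String>> forAPrime = new ArrayList<>();
        for (List<String> alt : alts) {
            if (!alt.isEmpty() && alt.get(0).equals(a)) {
                // A -> A alpha becomes A' -> alpha A'
                List<String> alpha = new ArrayList<>(alt.subList(1, alt.size()));
                alpha.add(aPrime);
                forAPrime.add(alpha);
            } else {
                // A -> beta becomes A -> beta A'
                List<String> beta = new ArrayList<>(alt);
                beta.add(aPrime);
                forA.add(beta);
            }
        }
        Map<String, List<List<String>>> result = new LinkedHashMap<>();
        if (forAPrime.isEmpty()) {           // no immediate left recursion: keep as is
            result.put(a, alts);
            return result;
        }
        forAPrime.add(new ArrayList<>());    // A' -> epsilon
        result.put(a, forA);
        result.put(aPrime, forAPrime);
        return result;
    }

    public static void main(String[] args) {
        // E -> E + T | T
        List<List<String>> alts = List.of(List.of("E", "+", "T"), List.of("T"));
        System.out.println(eliminate("E", alts));  // {E=[[T, E']], E'=[[+, T, E'], []]}
    }
}
```

The sketch handles only a single nonterminal; indirect left recursion needs the ordering-based algorithm.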

Elimination of Left Recursion (Cont.)
The previous transformation, however, does not eliminate left recursion involving two or more steps
For example, consider the grammar
  S → Aa | b
  A → Ac | Sd | ε
S is left-recursive because S ⇒ Aa ⇒ Sda, but it is not immediately left recursive

Algorithm. Left recursion elimination.
Arrange nonterminals in some order A1, A2, …, An
for i = 1 to n {
  for j = 1 to i - 1 {
    replace each production of the form Ai → Aj γ
    by Ai → δ1 γ | … | δk γ
    where Aj → δ1 | … | δk are the current productions for Aj
  }
  eliminate the immediate left recursion among the Ai-productions
}

Notice that iteration i only changes productions with Ai on the left-hand side, and afterwards only Aj with j > i can begin their right-hand sides
Correctness proof by induction:
  Clearly true for i = 1
  If true for all i < k, then when the outer loop is executed for i = k, the inner loop will remove all productions Ai → Aj α with j < i
  Finally, after the elimination of self recursion, m in any Ai → Am α production will be > i
At the end of the algorithm, all derivations of the form Ai ⇒+ Am α will have m > i and therefore left recursion will not be present

Left Factoring
Left factoring helps transform a grammar for predictive parsing
For example, if we have the two productions
  S → if b then S else S | if b then S
on seeing the input token if, we cannot immediately tell which production to choose to expand S
In general, if we have A → α β1 | α β2 and the input begins with a string derived from α, we do not know (without looking further) which production to use to expand A

Left Factoring (Cont.)
However, we may defer the decision by expanding A to α A′
Then after seeing the input derived from α, we may expand A′ to β1 or to β2
Left-factored, the original productions become
  A → α A′
  A′ → β1 | β2

Non-Context-Free Language Constructs
Examples of non-context-free languages are:
  L1 = {wcw | w is of the form (a|b)*}
  L2 = {a^n b^m c^n d^m | n ≥ 1 and m ≥ 1}
  L3 = {a^n b^n c^n | n ≥ 0}
Languages similar to these that are context free:
  L'1 = {w c w^R | w is of the form (a|b)*} (w^R stands for w reversed)
  This language is generated by the grammar S → aSa | bSb | c
  L'2 = {a^n b^m c^m d^n | n ≥ 1 and m ≥ 1}
  S → aSd | aAd
  A → bAc | bc

Non-Context-Free Language Constructs (Cont.)
L''2 = {a^n b^n c^m d^m | n ≥ 1 and m ≥ 1} is generated by the grammar
  S → AB
  A → aAb | ab
  B → cBd | cd
L'3 = {a^n b^n | n ≥ 1}
  S → aSb | ab
This language is not definable by any regular expression

CFG vs DFSA
L'4 = {a^n b^m | n > 0, m > 0}
[Figure: DFSA accepting L'4, with a start state and transitions labeled a and b]

Non-Context-Free Language Constructs (Cont.)
Suppose we could construct a DFSM D accepting L'3. D must have a finite number of states, say k.
Consider the sequence of states s0, s1, s2, …, sk entered by D having read ε, a, aa, …, a^k.
Since D only has k states, two of the states in the sequence have to be equal. Say, si = sj (i ≠ j).
From si, a sequence of i bs leads to an accepting (final) state. Therefore, the same sequence of i bs will also lead to an accepting state from sj.
Therefore D would accept a^j b^i, which means that the language accepted by D is not identical to L'3. A contradiction.

Parsing
The parsing problem is: given a string of tokens w, find a parse tree whose frontier is w. (Equivalently, find a derivation of w from S.)
A parser for a grammar G reads a list of tokens and finds a parse tree if they form a sentence (or reports an error otherwise)
Two classes of algorithms for parsing:
  Top-down
  Bottom-up

Parser generators
A parser generator is a program that reads a grammar and produces a parser
The best known parser generator is yacc. It produces bottom-up parsers
Most parser generators, including yacc, do not work for every CFG; they accept a restricted class of CFGs that can be parsed efficiently using the method employed by that parser generator

Top-down (predictive) parsing
Starting from parse tree containing just S, build tree down toward input. Expand left-most non-terminal.

Top-down parsing algorithm
Let input = a1a2...an
current sentential form (csf) = S
loop {
  suppose csf = a1…ak A α
  based on ak+1…, choose production A → β
  csf becomes a1…ak β α
}

Top-down parsing example
Grammar H: L → E ; L | E    E → a | b
Input: a;b
Sentential form    Input
L                  a;b
E;L                a;b
a;L                a;b
[Figure: the parse tree at each step, growing from L by L → E ; L and then E → a]

Top-down parsing example (cont.)
Sentential form    Input
a;E                a;b
a;b                a;b
[Figure: the parse tree at each step, completed by L → E and then E → b]

LL(1) parsing
Efficient form of top-down parsing
Use only the first symbol of the remaining input (ak+1) to choose the next production. That is, employ a function M: Σ × N → P in the "choose production" step of the algorithm.
When this is possible, the grammar is called LL(1)

LL(1) examples
Example 1:
H: L → E ; L | E
   E → a | b
Given input a;b, so next symbol is a.
Which production to use? Can't tell.
⇒ H not LL(1)

LL(1) examples
Example 2:
  Exp → Term Exp′
  Exp′ → $ | + Exp
  Term → id
(Use $ for "end-of-input" symbol.)
Grammar is LL(1): Exp and Term have only one production; Exp′ has two productions but only one is applicable at any time.

Nonrecursive predictive parsing
Maintain a stack explicitly, rather than implicitly via recursive calls Key problem during predictive parsing: determining the production to be applied for a non-terminal

Nonrecursive predictive parsing
Algorithm. Nonrecursive predictive parsing
Set ip to point to the first symbol of w$.
Push $ and then S onto the stack.
repeat
  Let X be the top of the stack symbol and a the symbol pointed to by ip
  if X is a terminal or $ then
    if X == a then
      pop X from the stack and advance ip
    else error()
  else // X is a nonterminal
    if M[X,a] == X → Y1 Y2 … Yk then
      pop X from the stack
      push Yk, Yk-1, …, Y1 onto the stack with Y1 on top
      (push nothing if Y1 Y2 … Yk is ε)
      output the production X → Y1 Y2 … Yk
    else error()
until X == $
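The stack-based loop can be sketched in Java as follows. To keep $ purely as the end-of-input and bottom-of-stack marker, the sketch assumes a variant of the Example 2 grammar with Exp′ → + Exp | ε, and a hand-written table M keyed by "nonterminal lookahead". These choices and all names are illustrative, not the slides' own code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PredictiveParser {
    // Hand-written table for: Exp -> Term Exp' ; Exp' -> + Exp | epsilon ; Term -> id
    static final Map<String, List<String>> M = Map.of(
        "Exp id", List.of("Term", "Exp'"),
        "Exp' +", List.of("+", "Exp"),
        "Exp' $", List.of(),                 // Exp' -> epsilon
        "Term id", List.of("id"));
    static final Set<String> NONTERMINALS = Set.of("Exp", "Exp'", "Term");

    // Returns true iff the token list is a sentence of the grammar.
    static boolean parse(List<String> tokens) {
        List<String> input = new ArrayList<>(tokens);
        input.add("$");                      // end-of-input marker
        Deque<String> stack = new ArrayDeque<>();
        stack.push("$");
        stack.push("Exp");                   // start symbol on top
        int ip = 0;
        while (true) {
            String x = stack.peek();
            String a = input.get(ip);
            if (!NONTERMINALS.contains(x)) { // x is a terminal or $
                if (!x.equals(a)) return false;
                if (x.equals("$")) return true;
                stack.pop();
                ip++;
            } else {
                List<String> rhs = M.get(x + " " + a);
                if (rhs == null) return false;          // empty table entry: error
                stack.pop();
                for (int i = rhs.size() - 1; i >= 0; i--) {
                    stack.push(rhs.get(i));             // leftmost symbol ends on top
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(parse(List.of("id", "+", "id")));  // prints true
        System.out.println(parse(List.of("id", "+")));        // prints false
    }
}
```

Instead of emitting the productions used, the sketch only reports acceptance; recording each `rhs` as it is expanded would give the leftmost derivation.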

LL(1) grammars
No left recursion
  A → Aα : If this production is chosen, parse makes no progress.
No common prefixes
  A → αβ | αγ
  Can fix by "left factoring":
  A → αA′
  A′ → β | γ

LL(1) grammars (cont.) No ambiguity
Precise definition requires that production to choose be unique (“choose” function M very hard to calculate otherwise)

LL(1) definition
Define: Grammar G = (N, Σ, P, S) is LL(1) iff whenever there are two left-most derivations
  S ⇒* wAγ ⇒ wαγ ⇒* wtx
  S ⇒* wAγ ⇒ wβγ ⇒* wty
it follows that α = β
Leftmost derivation: one where the leftmost non-terminal is always expanded first
In other words, given
  1. a string wAγ in (N ∪ Σ)* and
  2. t, the first terminal symbol to be derived from Aγ
there is at most one production that can be applied to A to yield a derivation of any terminal string beginning with wt

Checking LL(1)-ness
For any sequence of grammar symbols α, define set FIRST(α) ⊆ Σ to be
  FIRST(α) = { a | α ⇒* aβ for some β }
FIRST sets can often be calculated by inspection

FIRST Sets
  Exp → Term Exp′
  Exp′ → $ | + Exp
  Term → id
(Use $ for "end-of-input" symbol)
FIRST($) = {$}
FIRST(+ Exp) = {+}
FIRST($) ∩ FIRST(+ Exp) = {}
⇒ grammar is LL(1)

FIRST Sets
  L → E ; L | E
  E → a | b
FIRST(E ; L) = {a, b} = FIRST(E)
FIRST(E ; L) ∩ FIRST(E) ≠ {}
⇒ grammar not LL(1).

Computing FIRST Sets (no ε productions)
Algorithm. Compute FIRST(X) for all grammar symbols X
forall X ∈ V do FIRST(X) = {}
forall X ∈ Σ do FIRST(X) = {X}
repeat
  forall productions X → Y1 Y2 … Yk do
    FIRST(X) = FIRST(X) ∪ FIRST(Y1)
until no more elements are added to any FIRST set

Computing FIRST Sets (with ε productions)
Algorithm. Compute FIRST(X) for all grammar symbols X
forall X ∈ V do FIRST(X) = {}
forall X ∈ Σ do FIRST(X) = {X}
forall productions X → ε do FIRST(X) = FIRST(X) ∪ {ε}
repeat
  forall productions X → Y1 Y2 … Yk do
    for i = 1, k do
      FIRST(X) = FIRST(X) ∪ (FIRST(Yi) - {ε})
      if ε ∉ FIRST(Yi) then break
      else if i == k then FIRST(X) = FIRST(X) ∪ {ε}
until no more elements are added to any FIRST set
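The fixpoint computation with ε productions can be sketched in Java. The representation is an assumption made for illustration: a grammar maps each nonterminal to its list of right-hand sides, an empty right-hand side stands for ε, any symbol that is not a key is treated as a terminal, and the string "eps" marks ε inside FIRST sets. The example grammar is A → aTx | bTy, T → w | ε.

```java
import java.util.LinkedHashMap;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FirstSets {
    static final String EPS = "eps";         // marker for the empty string

    static Map<String, Set<String>> first(Map<String, List<List<String>>> g) {
        Map<String, Set<String>> first = new HashMap<>();
        for (String nt : g.keySet()) first.put(nt, new TreeSet<>());
        // FIRST(a) = {a} for every terminal a that occurs in a right-hand side
        for (List<List<String>> alts : g.values())
            for (List<String> rhs : alts)
                for (String sym : rhs)
                    if (!g.containsKey(sym)) first.put(sym, new TreeSet<>(List.of(sym)));
        boolean changed = true;
        while (changed) {                    // repeat until no FIRST set grows
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : g.entrySet()) {
                Set<String> fx = first.get(e.getKey());
                for (List<String> rhs : e.getValue()) {
                    boolean allNullable = true;          // true for an empty rhs
                    for (String y : rhs) {
                        Set<String> fy = new TreeSet<>(first.get(y));
                        boolean nullable = fy.remove(EPS);
                        changed |= fx.addAll(fy);        // FIRST(Yi) - {eps}
                        if (!nullable) { allNullable = false; break; }
                    }
                    if (allNullable) changed |= fx.add(EPS);
                }
            }
        }
        return first;
    }

    // Example grammar: A -> a T x | b T y ; T -> w | epsilon
    static Map<String, List<List<String>>> example() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("A", List.of(List.of("a", "T", "x"), List.of("b", "T", "y")));
        g.put("T", List.of(List.of("w"), List.of()));
        return g;
    }

    public static void main(String[] args) {
        System.out.println(first(example()).get("T"));   // prints [eps, w]
    }
}
```

Treating an empty right-hand side as "all symbols nullable" folds the separate initialization step for X → ε into the main loop.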

FIRST Sets of Strings of Symbols
FIRST(X1X2…Xn) is the union of FIRST(X1) and all FIRST(Xi) such that ε ∈ FIRST(Xk) for k = 1, 2, …, i-1 (leaving out ε itself)
FIRST(X1X2…Xn) contains ε iff ε ∈ FIRST(Xk) for k = 1, 2, …, n

FIRST Sets do not Suffice
Given the productions
  A → aT x
  A → bT y
  T → w
  T → ε
T → w should be applied when the next input token is w.
T → ε should be applied whenever the next terminal is either x or y

FOLLOW Sets
For any nonterminal X, define the set FOLLOW(X) ⊆ Σ as
  FOLLOW(X) = {a | S ⇒* αXaβ}

Computing the FOLLOW Set
Algorithm. Compute FOLLOW(X) for all nonterminals X
FOLLOW(S) = {$}
forall productions A → αBβ do
  FOLLOW(B) = FOLLOW(B) ∪ (FIRST(β) - {ε})
repeat
  forall productions A → αB or A → αBβ with ε ∈ FIRST(β) do
    FOLLOW(B) = FOLLOW(B) ∪ FOLLOW(A)
until all FOLLOW sets remain the same
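A Java sketch of the FOLLOW computation follows. It assumes the FIRST sets of individual symbols are already available (here written out by hand for the example grammar A → aTx | bTy, T → w | ε), uses "eps" as the ε marker, and merges the two phases of the algorithm into a single fixpoint loop, which yields the same sets. All names are illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FollowSets {
    static final String EPS = "eps";         // marker for the empty string

    // FIRST of a string of symbols, from the FIRST sets of the individual symbols.
    static Set<String> firstOf(List<String> syms, Map<String, Set<String>> first) {
        Set<String> out = new TreeSet<>();
        for (String s : syms) {
            for (String t : first.get(s)) if (!t.equals(EPS)) out.add(t);
            if (!first.get(s).contains(EPS)) return out;
        }
        out.add(EPS);                        // every symbol (or none) derives epsilon
        return out;
    }

    static Map<String, Set<String>> follow(Map<String, List<List<String>>> g, String start,
                                           Map<String, Set<String>> first) {
        Map<String, Set<String>> follow = new HashMap<>();
        for (String nt : g.keySet()) follow.put(nt, new TreeSet<>());
        follow.get(start).add("$");
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : g.entrySet())
                for (List<String> rhs : e.getValue())
                    for (int i = 0; i < rhs.size(); i++) {
                        String b = rhs.get(i);
                        if (!g.containsKey(b)) continue;       // terminals have no FOLLOW
                        Set<String> tail = firstOf(rhs.subList(i + 1, rhs.size()), first);
                        boolean nullableTail = tail.remove(EPS);
                        changed |= follow.get(b).addAll(tail); // FIRST(beta) - {eps}
                        if (nullableTail)                      // A -> alpha B, or beta =>* eps
                            changed |= follow.get(b).addAll(new ArrayList<>(follow.get(e.getKey())));
                    }
        }
        return follow;
    }

    // Example: A -> a T x | b T y ; T -> w | epsilon, with FIRST sets written out by hand.
    static Map<String, Set<String>> example() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("A", List.of(List.of("a", "T", "x"), List.of("b", "T", "y")));
        g.put("T", List.of(List.of("w"), List.of()));
        Map<String, Set<String>> first = Map.of(
            "a", Set.of("a"), "b", Set.of("b"), "x", Set.of("x"), "y", Set.of("y"),
            "w", Set.of("w"), "A", Set.of("a", "b"), "T", Set.of("w", EPS));
        return follow(g, "A", first);
    }

    public static void main(String[] args) {
        System.out.println(example().get("T"));   // prints [x, y]
    }
}
```

FOLLOW(T) = {x, y} is exactly the information needed to pick T → ε in the example.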

Another Definition of LL(1)
Define: Grammar G is LL(1) if for every A ∈ N with productions A → α1 | … | αn
  FIRST(αi FOLLOW(A)) ∩ FIRST(αj FOLLOW(A)) = {} for all i ≠ j

Top-down Parsing
[Figure: parse tree with the start symbol L at the root and children E0 … En, drawn above the input tokens <t0,t1,…,ti,...>, then above the remaining tokens <ti,...>]
From left to right, "grow" the parse tree downwards

Construction of a predictive parsing table
Algorithm. Construction of a predictive parsing table
M[A, a] = {} for all A, a
forall productions A → α do
  forall terminals a ∈ FIRST(α) do
    M[A, a] = M[A, a] ∪ {A → α}
  if ε ∈ FIRST(α) then
    forall b ∈ FOLLOW(A) do
      M[A, b] = M[A, b] ∪ {A → α}
Make all empty entries of M be error
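The table construction can be sketched in Java, with the FIRST and FOLLOW sets supplied as inputs. The example uses grammar H (L → E ; L | E, E → a | b) with sets worked out by hand; table entries are keyed by "nonterminal terminal", and an entry holding more than one production signals that the grammar is not LL(1). The representation and names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class LL1Table {
    static final String EPS = "eps";         // marker for the empty string

    // FIRST of a string of symbols, from the FIRST sets of the individual symbols.
    static Set<String> firstOf(List<String> syms, Map<String, Set<String>> first) {
        Set<String> out = new TreeSet<>();
        for (String s : syms) {
            for (String t : first.get(s)) if (!t.equals(EPS)) out.add(t);
            if (!first.get(s).contains(EPS)) return out;
        }
        out.add(EPS);
        return out;
    }

    // M maps "A a" to the right-hand sides of all productions placed in M[A, a].
    static Map<String, List<List<String>>> build(Map<String, List<List<String>>> g,
            Map<String, Set<String>> first, Map<String, Set<String>> follow) {
        Map<String, List<List<String>>> m = new TreeMap<>();
        for (Map.Entry<String, List<List<String>>> e : g.entrySet())
            for (List<String> rhs : e.getValue()) {
                Set<String> lookahead = firstOf(rhs, first);
                if (lookahead.remove(EPS)) lookahead.addAll(follow.get(e.getKey()));
                for (String a : lookahead)
                    m.computeIfAbsent(e.getKey() + " " + a, k -> new ArrayList<>()).add(rhs);
            }
        return m;                            // missing keys are the error entries
    }

    static boolean isLL1(Map<String, List<List<String>>> m) {
        return m.values().stream().allMatch(entry -> entry.size() == 1);
    }

    // Grammar H: L -> E ; L | E,  E -> a | b (FIRST and FOLLOW written out by hand).
    static Map<String, List<List<String>>> tableForH() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("L", List.of(List.of("E", ";", "L"), List.of("E")));
        g.put("E", List.of(List.of("a"), List.of("b")));
        Map<String, Set<String>> first = Map.of(
            "a", Set.of("a"), "b", Set.of("b"), ";", Set.of(";"),
            "E", Set.of("a", "b"), "L", Set.of("a", "b"));
        Map<String, Set<String>> follow = Map.of("L", Set.of("$"), "E", Set.of(";", "$"));
        return build(g, first, follow);
    }

    public static void main(String[] args) {
        System.out.println(isLL1(tableForH()));   // prints false
    }
}
```

For H, both L-productions land in M[L, a] and M[L, b], which matches the earlier observation that H is not LL(1).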

Recursive Descent Parsing

Recursive Descent Parser
stat → var = expr ;
expr → term [+ term]
term → factor [* factor]
factor → ( expr ) | var | constant | call( fun, expr )
var → identifier

Scanner
import java.io.*;
public class Scanner {
    private StreamTokenizer input;
    private Type lastToken;
    public enum Type { INVALID_CHAR, NO_TOKEN, VARIABLE, PLUS, FALSE, TRUE,
        // etc. for remaining tokens, then:
        EOF };
    public Scanner(Reader r) {
        input = new StreamTokenizer(r);
        input.resetSyntax();
        input.eolIsSignificant(false);
        input.wordChars('a', 'z');
        input.wordChars('A', 'Z');
        input.ordinaryChar('+');
        input.ordinaryChar('*');
        input.ordinaryChar('=');
        input.ordinaryChar('(');
        input.ordinaryChar(')');
    }

Scanner
    public Type nextToken() {
        try {
            switch (input.nextToken()) {
            case StreamTokenizer.TT_EOF:
                return Type.EOF;
            case StreamTokenizer.TT_WORD:
                if (input.sval.equals("false")) return Type.FALSE;
                else if (input.sval.equals("true")) return Type.TRUE;
                else return Type.VARIABLE;
            case '+':
                return Type.PLUS;
            // etc.
            }
        } catch (IOException ex) {
            return Type.EOF;
        }
        return Type.INVALID_CHAR; // no case matched
    }
}

Parser
public class Parser {
    private Scanner scanner;
    private Type token;
    public Statement parse(Reader r) throws SyntaxException {
        scanner = new Scanner(r);
        nextToken(); // assigns token
        Statement stat = stat();
        expect(scanner.EOF);
        return stat;
    }

Statement
// stat ::= variable '=' expr ';'
private Statement stat() throws … {
    Variable var = variable();
    expect(scanner.ASSIGN);
    Expr exp = expr();
    Statement stat = new Statement(var, exp);
    expect(scanner.SEMICOLON);
    return stat;
}

class Statement {
    Variable var;
    Expression expr;
    Statement(Variable v, Expression e) {
        this.var = v;
        this.expr = e;
    }
    …

interface Expression {
    int eval();
    String print();
}
class Variable implements Expression {
    String name;
    Variable(String name) { this.name = name; }
    public String print() { return name; }
}
class Expr implements Expression …
class Term implements Expression …
class Factor implements Expression …

class Term implements Expression {
    Expression first;
    Term rest;
    public String print() {
        String res = first.print();
        if (rest != null) res += " + " + rest.print();
        return res;
    }

    public int eval() {
        int res = first.eval();
        if (rest != null) res += rest.eval();
        return res;
    }
}

Expr
// expr ::= term ['+' term]
private Expr expr() throws … {
    Term exp = term();
    while (token == scanner.PLUS) {
        nextToken();
        exp = new Term(exp, term());
    }
    return exp;
}

Term
// term ::= factor ['*' term ]
private Term term() throws SyntaxException {
    Term exp = factor();
    // Rest of body: left as an exercise.
}

Factor
// factor ::= ( expr ) | var
private Factor factor() throws SyntaxException {
    Factor exp = null;
    if (token == scanner.LEFT_PAREN) {
        nextToken();
        exp = expr();
        expect(scanner.RIGHT_PAREN);
    } else {
        exp = variable();
    }
    return exp;
}

Variable
// variable ::= identifier
private Variable variable() throws SyntaxException {
    if (token == scanner.ID) {
        Variable exp = new Variable(input.getString());
        nextToken();
        return exp;
    }

Constant
private Expr constantExpression() throws SyntaxException {
    Expr exp = null;
    // Handle the various cases for constant
    // expressions: left as an exercise.
    return exp;
}

Utilities
private void expect(Type t) throws SyntaxException {
    if (token != t)
        throw SyntaxException...
    nextToken();
}
private void nextToken() {
    token = scanner.nextToken();
}

Exercise Write a RD parser for JSON

Object

Example
{
  "name": { "first": "John", "last": "Doe" },
  "age": 23,
  "children": ["Paul", "Mary"]
}

Recursive Descent in practice
RD parsers are widely used in practice, even for real programming languages
GCC uses a hand-written RD parser for both C and C++
Clang uses RD for parsing C
V8 uses RD for JavaScript
OCaml and Haskell use LALR(1) parser generators

Pascal Syntax Diagrams

Regular Languages

Regular Languages
Definition. A regular grammar is one whose productions are all of the type:
  A → aB
  A → a
A Regular Expression is either:
  a
  R1 | R2
  R1 R2
  R*

Nondeterministic Finite State Automaton
[Figure: NFA with a start arrow, states 1, 2 and 3, and transitions labeled a and b]

Regular Languages
Theorem. The following classes of languages coincide:
  Generated by a regular grammar
  Expressed by a regular expression
  Recognized by a NDFS automaton
  Recognized by a DFS automaton

Deterministic Finite Automaton
[Figure: DFA for a scanner with states START, NUM, KEYWORD and OPERATOR; transitions on space, tab, newline; digit; letter; $; and the operator characters =, +, -, /, (, )]
Legend: circle = state; double circle = accept state; arrow = transition; uppercase = state names; lower case labels = transition characters

Scanner code
state := start
loop
  if no input character buffered then read one, and add it to the accumulated token
  case state of
    start:
      case input_char of
        A..Z, a..z : state := id
        0..9 : state := num
        else ...
      end
    id:
      A..Z, a..z : state := id
    num:
      0..9: ...
    ...
  end;

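Table-driven DFA
[Table: DFA transition table. Columns are the states 0-start, 1-num, 2-id, 3-operator, 4-keyword; rows are the input classes white space, letter, digit, operator, $. Surviving entries: white space: exit; letter: 2, error; digit: 1; operator: 3; $: 4.]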
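A table-driven scanner along these lines might look as follows in Java. The transition table here is an assumption reconstructed from the state names (start, num, id, operator, keyword) and the character classes in the DFA slide; it is not the slides' actual table. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TableScanner {
    static final String[] NAMES = {"start", "num", "id", "operator", "keyword"};
    static final int ERR = -1;               // no transition: the current token ends

    // Character classes (table columns): 0 letter, 1 digit, 2 operator, 3 '$', 4 white space, 5 other
    static int charClass(char c) {
        if (Character.isLetter(c)) return 0;
        if (Character.isDigit(c)) return 1;
        if ("=+-/()".indexOf(c) >= 0) return 2;
        if (c == '$') return 3;
        if (Character.isWhitespace(c)) return 4;
        return 5;
    }

    // NEXT[state][class]; assumed transitions, see the lead-in.
    static final int[][] NEXT = {
        /* 0 start    */ {2, 1, 3, 4, 0, ERR},
        /* 1 num      */ {ERR, 1, ERR, ERR, ERR, ERR},
        /* 2 id       */ {2, ERR, ERR, ERR, ERR, ERR},
        /* 3 operator */ {ERR, ERR, ERR, ERR, ERR, ERR},
        /* 4 keyword  */ {4, ERR, ERR, ERR, ERR, ERR},
    };

    static List<String> scan(String s) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < s.length()) {
            int state = 0, begin = pos;
            while (pos < s.length()) {
                int next = NEXT[state][charClass(s.charAt(pos))];
                if (next == ERR) break;
                pos++;
                if (next == 0) begin = pos;  // still skipping white space
                state = next;
            }
            if (state == 0) {
                if (pos < s.length())
                    throw new IllegalArgumentException("unexpected character at " + pos);
                break;                       // only trailing white space was left
            }
            tokens.add(NAMES[state] + ":" + s.substring(begin, pos));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("x = 42 + $if"));
        // prints [id:x, operator:=, num:42, operator:+, keyword:$if]
    }
}
```

Changing the language recognized means editing only `NEXT` and `charClass`; the driver loop stays the same, which is the point of the table-driven style.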

Language Classes
[Figure: nested language classes, labeled L0, CSL, CFL [NPA], LR(1), LL(1), RL [DFA=NFA]]

Question Are regular expressions, as provided by Perl or other languages, sufficient for parsing nested structures, e.g. XML files?

Exercise
(E + 0) -> E
(E * 1) -> E
(E * 0) -> 0
(a + (b * 0)) => a
s/pattern/replacement/
s/( $.*$ \* 0)/0/

(((((((((((( )))))))))))
S -> (S) | ()
S -> (A
A -> S) | )
S -> | (S)
<ul> <li> <dt> <ul> ….. </li></ul>
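Nested parentheses are easy to recognize with a recursive procedure. The Java sketch below follows the left-factored form S → ( A, A → S ) | ) from the lines above: after consuming "(", one symbol of lookahead decides whether to recurse. The class and method names are illustrative.

```java
class Parens {
    private final String in;
    private int pos = 0;

    Parens(String in) { this.in = in; }

    // S -> ( S ) | ( )
    private boolean s() {
        if (pos >= in.length() || in.charAt(pos) != '(') return false;
        pos++;                                           // consume '('
        // one symbol of lookahead: another '(' means the alternative ( S )
        if (pos < in.length() && in.charAt(pos) == '(' && !s()) return false;
        if (pos >= in.length() || in.charAt(pos) != ')') return false;
        pos++;                                           // consume ')'
        return true;
    }

    // True iff the whole text is derived from S.
    static boolean accepts(String text) {
        Parens p = new Parens(text);
        return p.s() && p.pos == text.length();
    }

    public static void main(String[] args) {
        System.out.println(accepts("((()))"));   // prints true
        System.out.println(accepts("(()"));      // prints false
    }
}
```

The call stack plays the role of the unbounded counter that a finite automaton lacks.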
