Efficiency in Parsing Arbitrary Grammars

Parsing using CYK Algorithm

1) Transform any grammar to Chomsky Form, performing these steps in this order, to ensure:
1. terminals t occur alone on the right-hand side: X ::= t
2. no unproductive non-terminal symbols
3. no productions of arity more than two (see the binarization sketch after this slide)
4. no nullable symbols except for the start symbol
5. no single non-terminal productions X ::= Y
6. no non-terminals unreachable from the starting one
The result has only rules of the form X ::= Y Z and X ::= t.

Questions:
– What is the worst-case increase in grammar size in each step?
– Does any step break the property established by the previous ones?

2) Apply the CYK dynamic programming algorithm.
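As an illustration of step 3 above, here is a minimal Scala sketch of binarizing long productions; the Rule class and the fresh-name scheme are illustrative assumptions, not the lecture's code:

// One way to binarize: replace X ::= Y1 Y2 ... Ym (m > 2) by a chain of
// binary rules that introduces fresh intermediate non-terminals.
case class Rule(lhs: String, rhs: List[String])

def binarize(rules: List[Rule]): List[Rule] = {
  var fresh = 0
  def freshNT(): String = { fresh += 1; s"_W$fresh" }
  rules.flatMap {
    case r @ Rule(_, rhs) if rhs.length <= 2 => List(r)
    case Rule(lhs, rhs) =>
      // X ::= Y1 Y2 Y3 Y4  becomes  X ::= Y1 W1,  W1 ::= Y2 W2,  W2 ::= Y3 Y4
      val names = lhs :: List.fill(rhs.length - 2)(freshNT())
      names.zip(names.tail).zip(rhs.init).map {
        case ((a, b), y) => Rule(a, List(y, b))
      } :+ Rule(names.last, rhs.takeRight(2))
  }
}

Each rule of arity m contributes m-1 binary rules, so this particular step only grows the grammar linearly.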

A CYK for Any Grammar Would Do This

input: grammar G, non-terminals A1, ..., AK, tokens t1, ..., tL
word: w = w(0) w(1) … w(N-1)
notation: wp..q = w(p) w(p+1) … w(q-1)
output: P, the set of triples (A, i, j) implying A ⇒* wi..j, where A can be Ak, tk, or ε

P = {(w(i), i, i+1) | 0 ≤ i < N-1}
repeat {
  choose rule (A ::= B1 ... Bm) ∈ G
  if ((A, k0, km) ∉ P && for some k1, …, km-1:
      ((m = 0 && k0 = km) || (B1, k0, k1), (B2, k1, k2), ..., (Bm, km-1, km) ∈ P))
    P := P ∪ {(A, k0, km)}
} until no more insertions possible

Questions (for a given grammar): What is the maximal number of steps? How long does it take to check the step for one rule?

Observation

How many ways are there to split a string of length Q into m segments?
– the number of {0,1} words of length Q+m with m zeros
This is exponential in m, so the general algorithm is exponential.
For binary rules, m = 2, so the algorithm is efficient.
– this is why we use at most binary rules in CYK
– the transformation into Chomsky form is polynomial

CYK Parser for Chomsky Form

input: grammar G, non-terminals A1, ..., AK, tokens t1, ..., tL
word: w = w(0) w(1) … w(N-1)
notation: wp..q = w(p) w(p+1) … w(q-1)
output: P, the set of triples (A, i, j) implying A ⇒* wi..j, where A can be Ak, tk, or ε

P = {(A, i, i+1) | 0 ≤ i < N-1 && (A ::= w(i)) ∈ G}   // unary rules
repeat {
  choose rule (A ::= B1 B2) ∈ G
  if ((A, k0, k2) ∉ P && for some k1: (B1, k0, k1), (B2, k1, k2) ∈ P)
    P := P ∪ {(A, k0, k2)}
} until no more insertions possible
return (S, 0, N-1) ∈ P

Next: not just whether the word parses, but compute the trees!
Exercise: give a bound on the number of elements in P: K·(N+1)²/2 + L·N
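A compact Scala sketch of this recognizer, assuming the grammar is already in Chomsky form and is given as sets of unary and binary rules; the names cykRecognize, unary, binary are illustrative, not from the lecture:

// A CYK recognizer for a grammar already in Chomsky form.
// unary  rules A ::= t   are given as pairs   (A, t)
// binary rules A ::= B C are given as triples (A, B, C)
def cykRecognize(unary: Set[(String, String)],
                 binary: Set[(String, String, String)],
                 start: String,
                 w: Vector[String]): Boolean = {
  val n = w.length
  // P holds triples (A, i, j) meaning A =>* w(i) ... w(j-1)
  var P: Set[(String, Int, Int)] =
    (for (i <- 0 until n; (a, t) <- unary if t == w(i)) yield (a, i, i + 1)).toSet

  var changed = true
  while (changed) {            // fixpoint: stop when no new triple can be added
    changed = false
    for ((a, b, c) <- binary;
         (b1, i, k) <- P if b1 == b;
         (c1, k2, j) <- P if c1 == c && k2 == k;
         if !P((a, i, j))) {
      P += ((a, i, j))
      changed = true
    }
  }
  P((start, 0, n))
}

// Example: with E ::= a and E ::= E E,
//   cykRecognize(Set(("E", "a")), Set(("E", "E", "E")), "E", Vector("a", "a", "a"))
// returns true.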

Computing Parse Results: Semantic Actions

A CYK Algorithm Producing Results

Each rule (A ::= B1 ... Bm, f) ∈ G comes with a semantic action f : (R ∪ T)^m -> R, where R are results (e.g. trees) and T are tokens.
A useful parser returns a set of results (e.g. syntax trees): ((A, p, q), r) means A ⇒* wp..q and the result of parsing is r.

P = {((A, i, i+1), f(w(i))) | 0 ≤ i < N-1 && (A ::= w(i), f) ∈ G}   // unary rules
repeat {
  choose rule (A ::= B1 B2, f) ∈ G
  if ((A, k0, k2) ∉ P && for some k1: ((B1, k0, k1), r1), ((B2, k1, k2), r2) ∈ P)
    P := P ∪ {((A, k0, k2), f(r1, r2))}
} until no more insertions possible

A bound on the number of elements in P? 2^N: the number of results can square at each level.
Compute parse trees by using identity-like functions as semantic actions:
  (A ::= w(i),   x: R => x)
  (A ::= B1 B2,  (r1, r2): R×R => NodeA(r1, r2))
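A hedged Scala sketch of this result-producing variant, building parse trees as results; the class names (PTree, Leaf, Node) and the rule representation are assumptions for illustration, not the lecture's code:

// Sketch of the result-producing CYK: P stores ((A, i, j), result) and every
// rule carries a semantic action. All names here are illustrative.
sealed trait PTree
case class Leaf(t: String) extends PTree
case class Node(nt: String, l: PTree, r: PTree) extends PTree

case class UnaryRule(lhs: String, token: String, act: String => PTree)
case class BinaryRule(lhs: String, b: String, c: String,
                      act: (PTree, PTree) => PTree)

def cykParse(unary: Set[UnaryRule], binary: Set[BinaryRule],
             w: Vector[String]): Set[((String, Int, Int), PTree)] = {
  var P: Set[((String, Int, Int), PTree)] =
    (for (i <- w.indices; r <- unary if r.token == w(i))
       yield ((r.lhs, i, i + 1), r.act(w(i)))).toSet
  var changed = true
  while (changed) {
    changed = false
    for (r <- binary;
         ((b, i, k), rb) <- P if b == r.b;
         ((c, k2, j), rc) <- P if c == r.c && k2 == k) {
      val entry = ((r.lhs, i, j), r.act(rb, rc))
      if (!P(entry)) { P += entry; changed = true }
    }
  }
  P
}

// Tree-building actions in the style of this slide:
//   UnaryRule("E", "a", s => Leaf(s))
//   BinaryRule("E", "B1", "B2", (r1, r2) => Node("E", r1, r2))
// With such actions, P contains one entry per parse tree, as the a – b – c
// example on the next slide shows.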

Computing Abstract Trees for an Ambiguous Grammar

abstract class Tree
case class ID(s: String) extends Tree
case class Minus(e1: Tree, e2: Tree) extends Tree

Ambiguous grammar: E ::= E – E | Ident
type R = Tree

Chomsky normal form:    semantic actions:
E ::= E R               (e1, e2) => Minus(e1, e2)
R ::= M E               (_, e2) => e2
E ::= Ident             x => ID(x)
M ::= –                 _ => Nil

Input string: a – b – c

P:
((E,0,1), ID(a))  ((M,1,2), Nil)  ((E,2,3), ID(b))  ((M,3,4), Nil)  ((E,4,5), ID(c))
((R,1,3), ID(b))  ((R,3,5), ID(c))
((E,0,3), Minus(ID(a), ID(b)))  ((E,2,5), Minus(ID(b), ID(c)))
((R,1,5), Minus(ID(b), ID(c)))
((E,0,5), Minus(Minus(ID(a), ID(b)), ID(c)))
((E,0,5), Minus(ID(a), Minus(ID(b), ID(c))))

A CYK Algorithm with Constraints

Each rule (A ::= B1 ... Bm, f) ∈ G now has a partial-function semantic action f : (R ∪ T)^m -> Option[R], where R are results and T are tokens.
The parser returns a set of results (e.g. syntax trees): ((A, p, q), r) means A ⇒* wp..q and the result of parsing is r ∈ R.

P = {((A, i, i+1), f(w(i)).get) | 0 ≤ i < N-1 && (A ::= w(i), f) ∈ G}
repeat {
  choose rule (A ::= B1 B2, f) ∈ G
  if ((A, k0, k2) ∉ P && for some k1: ((B1, k0, k1), r1), ((B2, k1, k2), r2) ∈ P
      and f(r1, r2) != None)   // apply the rule only if f is defined
    P := P ∪ {((A, k0, k2), f(r1, r2).get)}
} until no more insertions possible
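The earlier sketch adapts directly to partial actions: an entry is added only when the action is defined. A minimal, generic variant (names again illustrative; R is a type parameter):

// A binary rule whose action may refuse to combine the two sub-results.
case class GuardedRule[R](lhs: String, b: String, c: String,
                          act: (R, R) => Option[R])

// Inside the fixpoint loop, instead of always inserting:
//   for (r <- rules;
//        ((b, i, k), rb) <- P if b == r.b;
//        ((c, k2, j), rc) <- P if c == r.c && k2 == k;
//        res <- r.act(rb, rc))          // None: rule not applicable here
//     add ((r.lhs, i, j), res) to P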

Resolving Ambiguity Using Semantic Actions

In Chomsky normal form:    semantic action:
E ::= E R                  mkMinus   (instead of (e1, e2) => Minus(e1, e2))
R ::= M E                  (_, e2) => e2
E ::= Ident                x => ID(x)
M ::= –                    _ => Nil

def mkMinus(e1: Tree, e2: Tree): Option[Tree] = (e1, e2) match {
  case (_, Minus(_, _)) => None   // refuse a right operand that is itself a Minus
  case _                => Some(Minus(e1, e2))
}

Input string: a – b – c

P:
((E,0,1), ID(a))  ((M,1,2), Nil)  ((E,2,3), ID(b))  ((M,3,4), Nil)  ((E,4,5), ID(c))
((R,1,3), ID(b))  ((R,3,5), ID(c))
((E,0,3), Minus(ID(a), ID(b)))  ((E,2,5), Minus(ID(b), ID(c)))
((R,1,5), Minus(ID(b), ID(c)))
((E,0,5), Minus(Minus(ID(a), ID(b)), ID(c)))
((E,0,5), Minus(ID(a), Minus(ID(b), ID(c))))   <- no longer added: mkMinus returns None here

Expressions with More Operators: All Trees

abstract class T
case class ID(s: String) extends T
case class BinaryOp(e1: T, op: OP, e2: T) extends T

Ambiguous grammar: E ::= E (–|^) E | (E) | Ident

Chomsky form:    semantic action f:                      type of f (can vary):
E ::= E R        (e1, (op, e2)) => BinaryOp(e1, op, e2)  (T, (OP, T)) => T
R ::= O E        (op, e2) => (op, e2)                    (OP, T) => (OP, T)
E ::= Ident      x => ID(x)                              Token => T
O ::= –          _ => MinusOp                            Token => OP
O ::= ^          _ => PowerOp                            Token => OP
E ::= P Q        (_, e) => e                             (Unit, T) => T
Q ::= E C        (e, _) => e                             (T, Unit) => T
P ::= (          _ => ()                                 Token => Unit
C ::= )          _ => ()                                 Token => Unit

Priorities

In addition to the tree, return the priority of the tree
– usually the priority is that of the top-level operator
– parenthesized expressions have high priority, as do other 'atomic' expressions (identifiers, literals)
Disallow combining trees if the priority of the operator being applied is higher than the priority of the results being combined.
Given x - y * z, with the priority of * higher than that of -:
– disallow combining x-y and z using *
– allow combining x and y*z using -

Priorities and Associativity

abstract class T
case class ID(s: String) extends T
case class BinaryOp(e1: T, op: OP, e2: T) extends T

Ambiguous grammar: E ::= E (–|^) E | (E) | Ident

where T' = (Tree, Int) pairs a tree with its priority

Chomsky form:    semantic action f:    type of f:
E ::= E R                              (T', (OP, T')) => Option[T']
R ::= O E
E ::= Ident      x => ID(x)
O ::= –          _ => MinusOp
O ::= ^          _ => PowerOp
E ::= P Q        (_, e) => e
Q ::= E C        (e, _) => e
P ::= (          _ => ()
C ::= )          _ => ()

Priorities and Associativity

Chomsky form:    semantic action f:    type of f:
E ::= E R        mkBinOp               (T', (OP, T')) => Option[T']

def mkBinOp((e1, p1): T', (op, (e2, p2)): (OP, T')): Option[T'] = {
  val p = priorityOf(op)
  if ((p < p1 || (p == p1 && isLeftAssoc(op))) &&
      (p < p2 || (p == p2 && isRightAssoc(op))))
    Some((BinaryOp(e1, op, e2), p))
  else
    None  // some other item in P will be combined instead
}

cf. the middle operator in: a*b+c*d, a+b*c*d, a–b–c–d, a^b^c^d

Parentheses get a priority MAX larger than that of all operators:
E ::= P Q    (_, (e, p)) => Some((e, MAX))
Q ::= E C    (e, _) => Some(e)

Efficiency of Dynamic Programming

Chomsky normal form:    semantic action:
E ::= E R               mkMinus
R ::= M E               (_, e2) => e2
E ::= Ident             x => ID(x)
M ::= –                 _ => Nil

Input string: a – b – c

Naïve dynamic programming: derive all tuples (X, i, j) in order of increasing j-i.
Instead: derive only the needed tuples, the first time we need them, starting from the top non-terminal.
Result: Earley's parsing algorithm (which also needs no normal form!)
Other efficient algorithms exist for LR(k) and LALR(k) grammars – but they do not handle all grammars.

P:
((E,0,1), ID(a))  ((M,1,2), Nil)  ((E,2,3), ID(b))  ((M,3,4), Nil)  ((E,4,5), ID(c))
((R,1,3), ID(b))  ((R,3,5), ID(c))
((E,0,3), Minus(ID(a), ID(b)))  ((E,2,5), Minus(ID(b), ID(c)))
((R,1,5), Minus(ID(b), ID(c)))
((E,0,5), Minus(Minus(ID(a), ID(b)), ID(c)))
((E,0,5), Minus(ID(a), Minus(ID(b), ID(c))))

Dotted Rules as Non-terminals

X ::= Y1 Y2 Y3
The Chomsky transformation is (a simplification of) this:
X ::= W123
W123 ::= W12 Y3
W12 ::= W1 Y2
W1 ::= Wε Y1
Wε ::= ε

Earley parser: dotted right-hand sides act as names of fresh non-terminals:
X ::= [Y1 Y2 Y3 .]
[Y1 Y2 Y3 .] ::= [Y1 Y2 . Y3] Y3
[Y1 Y2 . Y3] ::= [Y1 . Y2 Y3] Y2
[Y1 . Y2 Y3] ::= [. Y1 Y2 Y3] Y1
[. Y1 Y2 Y3] ::= ε

Earley Parser

– group the triples by their last element: S(q) = {(A, p) | (A, p, q) ∈ P}
– dotted rules effectively make productions at most binary

[Chart example; the slide's table layout is not reproducible here. It shows the Earley sets for an input such as ID – ID == ID EOF with the grammar
  e ::= ID | e – e | e == e
  S ::= e EOF
using the dotted items
  e ::= .ID ; ID. ; .e – e ; e. – e ; e –. e ; e – e. ; .e == e ; e. == e ; e ==. e ; e == e.
  S ::= . e EOF ; e. EOF ; e EOF. ]

Attribute Grammars

They extend context-free grammars by giving parameters to non-terminals and by adding rules to combine attributes.
Attributes can have any type, but often they are trees.
Example:
– context-free grammar rule: A ::= B C
– attribute grammar rules: A ::= B C { Plus($1, $2) }
  or, e.g., A ::= B:x C:y {: RESULT := new Plus(x.v, y.v) :}
Semantic actions indicate how to compute attributes.
Attributes are computed bottom-up, or in more general ways.

Parser Generators: Attribute Grammar -> Parser

1) Embedded: parser combinators (Scala, Haskell)
They are code in some (functional) language:

def ID : Parser = "x" | "y" | "z"
def expr : Parser = factor ~ (( "+" ~ factor | "-" ~ factor ) | epsilon)
def factor : Parser = term ~ (( "*" ~ term | "/" ~ term ) | epsilon)
def term : Parser = ( "(" ~ expr ~ ")" | ID | NUM )

The implementation in Scala uses overloading and implicits:
– an implicit conversion turns a string s into skip(s)
– ~ denotes concatenation
– | is often not really LL(1) but "try one by one"; the non-empty alternative must come first, then epsilon

2) Standalone tools: JavaCC, Yacc, ANTLR, CUP
– typically generate code in a conventional programming language (e.g. Java)
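For comparison, here is a hedged, runnable sketch in the same combinator style using the separate scala-parser-combinators library (RegexParsers); the grammar and semantic actions below are illustrative and compute integer values rather than trees:

import scala.util.parsing.combinator.RegexParsers

// A small arithmetic grammar written with Scala parser combinators.
object Arith extends RegexParsers {
  def num: Parser[Int]    = """\d+""".r ^^ (_.toInt)
  def factor: Parser[Int] = num | "(" ~> expr <~ ")"
  def term: Parser[Int]   = factor ~ rep(("*" | "/") ~ factor) ^^ {
    case f ~ ops => ops.foldLeft(f) {
      case (acc, "*" ~ x) => acc * x
      case (acc, _ ~ x)   => acc / x
    }
  }
  def expr: Parser[Int]   = term ~ rep(("+" | "-") ~ term) ^^ {
    case t ~ ops => ops.foldLeft(t) {
      case (acc, "+" ~ x) => acc + x
      case (acc, _ ~ x)   => acc - x
    }
  }
}

// Example use:
//   Arith.parseAll(Arith.expr, "1 + 2 * (3 - 1)")   // Success(5, ...)

Here left recursion is avoided by using rep and a left fold, which also makes - and / left-associative.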

Example in CUP – LALR(1) (not LL(1))

precedence left PLUS, MINUS;
precedence left TIMES, DIVIDE, MOD;   // priorities disambiguate
precedence left UMINUS;

expr ::= expr PLUS expr      // ambiguous grammar works here
       | expr MINUS expr
       | expr TIMES expr
       | expr DIVIDE expr
       | expr MOD expr
       | MINUS expr %prec UMINUS
       | LPAREN expr RPAREN
       | NUMBER
       ;

Adding Java Actions to CUP Rules

expr ::= expr:e1 PLUS expr:e2
           {: RESULT = new Integer(e1.intValue() + e2.intValue()); :}
       | expr:e1 MINUS expr:e2
           {: RESULT = new Integer(e1.intValue() - e2.intValue()); :}
       | expr:e1 TIMES expr:e2
           {: RESULT = new Integer(e1.intValue() * e2.intValue()); :}
       | expr:e1 DIVIDE expr:e2
           {: RESULT = new Integer(e1.intValue() / e2.intValue()); :}
       | expr:e1 MOD expr:e2
           {: RESULT = new Integer(e1.intValue() % e2.intValue()); :}
       | NUMBER:n
           {: RESULT = n; :}
       | MINUS expr:e
           {: RESULT = new Integer(0 - e.intValue()); :} %prec UMINUS
       | LPAREN expr:e RPAREN
           {: RESULT = e; :}
       ;

Which Algorithms Do Tools Implement?

Many tools use LL(1)
– easy to understand, similar to a hand-written parser
Even more tools use LALR(1)
– in practice more flexible than LL(1)
– can encode priorities without rewriting grammars
– can have annoying shift/reduce conflicts
– still does not handle general grammars
Today we should probably be using more parsers for general grammars, such as Earley's (an optimized CYK).