CYK Algorithm for Parsing General Context-Free Grammars

Slides:



Advertisements
Similar presentations
Simplifications of Context-Free Grammars
Advertisements

Grammar vs Recursive Descent Parser
CPSC Compiler Tutorial 4 Midterm Review. Deterministic Finite Automata (DFA) Q: finite set of states Σ: finite set of “letters” (input alphabet)
Closure Properties of CFL's
CYK Parser Von Carla und Cornelia Kempa. Overview Top-downBottom-up Non-directional methods Unger ParserCYK Parser.
Exercise 1: Balanced Parentheses Show that the following balanced parentheses grammar is ambiguous (by finding two parse trees for some input sequence)
Simplifying CFGs There are several ways in which context-free grammars can be simplified. One natural way is to eliminate useless symbols those that cannot.
About Grammars CS 130 Theory of Computation HMU Textbook: Sec 7.1, 6.3, 5.4.
Fall 2004COMP 3351 Simplifications of Context-Free Grammars.
Prof. Busch - LSU1 Simplifications of Context-Free Grammars.
How to find and remove unproductive rules in a grammar Roger L. Costello May 1, 2014 New! How to find and remove unreachable rules in a grammar.
FORMAL LANGUAGES, AUTOMATA, AND COMPUTABILITY
CS5371 Theory of Computation
Courtesy Costas Buch - RPI1 Simplifications of Context-Free Grammars.
Costas Buch - RPI1 Simplifications of Context-Free Grammars.
CS Master – Introduction to the Theory of Computation Jan Maluszynski - HT Lecture 4 Context-free grammars Jan Maluszynski, IDA, 2007
Parsing III (Eliminating left recursion, recursive descent parsing)
Costas Busch - RPI1 Context-Free Languages. Costas Busch - RPI2 Regular Languages.
1 Simplifications of Context-Free Grammars. 2 A Substitution Rule substitute B equivalent grammar.
1 Simplifications of Context-Free Grammars. 2 A Substitution Rule Substitute Equivalent grammar.
1 Compilers. 2 Compiler Program v = 5; if (v>5) x = 12 + v; while (x !=3) { x = x - 3; v = 10; } Add v,v,0 cmp v,5 jmplt ELSE THEN: add x, 12,v.
Parsing — Part II (Ambiguity, Top-down parsing, Left-recursion Removal)
Normal forms for Context-Free Grammars
How to Convert a Context-Free Grammar to Greibach Normal Form
January 15, 2014CS21 Lecture 61 CS21 Decidability and Tractability Lecture 6 January 16, 2015.
Cs466(Prasad)L8Norm1 Normal Forms Chomsky Normal Form Griebach Normal Form.
Fall 2004COMP 3351 Context-Free Languages. Fall 2004COMP 3352 Regular Languages.
Prof. Busch - LSU1 Context-Free Languages. Prof. Busch - LSU2 Regular Languages Context-Free Languages.
INHERENT LIMITATIONS OF COMPUTER PROGRAMS CSci 4011.
(2.1) Grammars  Definitions  Grammars  Backus-Naur Form  Derivation – terminology – trees  Grammars and ambiguity  Simple example  Grammar hierarchies.
Efficiency in Parsing Arbitrary Grammars. Parsing using CYK Algorithm 1) Transform any grammar to Chomsky Form, in this order, to ensure: 1.terminals.
1 CD5560 FABER Formal Languages, Automata and Models of Computation Lecture 7 Mälardalen University 2010.
The CYK Parsing Method Chiyo Hotani Tanya Petrova CL2 Parsing Course 28 November, 2007.
نظریه زبان ها و ماشین ها فصل دوم Context-Free Languages دانشگاه صنعتی شریف بهار 88.
CONVERTING TO CHOMSKY NORMAL FORM
Formal Languages Context free languages provide a convenient notation for recursive description of languages. The original goal of CFL was to formalize.
Context-free Grammars Example : S   Shortened notation : S  aSaS   | aSa | bSb S  bSb Which strings can be generated from S ? [Section 6.1]
Context-Free Grammars Normal Forms Chapter 11. Normal Forms A normal form F for a set C of data objects is a form, i.e., a set of syntactically valid.
Normal Forms for Context-Free Grammars Definition: A symbol X in V  T is useless in a CFG G=(V, T, P, S) if there does not exist a derivation of the form.
The Pumping Lemma for Context Free Grammars. Chomsky Normal Form Chomsky Normal Form (CNF) is a simple and useful form of a CFG Every rule of a CNF grammar.
Membership problem CYK Algorithm Project presentation CS 5800 Spring 2013 Professor : Dr. Elise de Doncker Presented by : Savitha parur venkitachalam.
CS 44 – Jan. 29 Expression grammars –Associativity √ –Precedence CFG for entire language (handout) CYK algorithm –General technique for testing for acceptance.
Chapter 6 Simplification of Context-free Grammars and Normal Forms These class notes are based on material from our textbook, An Introduction to Formal.
Syntax The Structure of a Language. Lexical Structure The structure of the tokens of a programming language The scanner takes a sequence of characters.
Compiler Construction 2011 CYK Algorithm for Parsing General Context-Free Grammars
Closure Properties Lemma: Let A 1 and A 2 be two CF languages, then the union A 1  A 2 is context free as well. Proof: Assume that the two grammars are.
Top-down Parsing. 2 Parsing Techniques Top-down parsers (LL(1), recursive descent) Start at the root of the parse tree and grow toward leaves Pick a production.
Unit-3 Parsing Theory (Syntax Analyzer) PREPARED BY: PROF. HARISH I RATHOD COMPUTER ENGINEERING DEPARTMENT GUJARAT POWER ENGINEERING & RESEARCH INSTITUTE.
Bc. Jozef Lang (xlangj01) Bc. Zoltán Zemko (xzemko01) Increasing power of LL(k) parsers.
11 CDT314 FABER Formal Languages, Automata and Models of Computation Lecture 7 School of Innovation, Design and Engineering Mälardalen University 2012.
CYK Algorithm for Parsing General Context-Free Grammars.
Computing if a token can follow first(B 1... B p ) = {a  | B 1...B p ...  aw } follow(X) = {a  | S ... ...Xa... } There exists a derivation from.
Context-Free Languages. Regular Languages Context-Free Languages.
Parsing using CYK Algorithm Transform grammar into Chomsky Form: 1.remove unproductive symbols 2.remove unreachable symbols 3.remove epsilons (no non-start.
Costas Busch - LSU1 Parsing. Costas Busch - LSU2 Compiler Program File v = 5; if (v>5) x = 12 + v; while (x !=3) { x = x - 3; v = 10; } Add v,v,5.
CSCI 4325 / 6339 Theory of Computation Zhixiang Chen Department of Computer Science University of Texas-Pan American.
Exercises on Chomsky Normal Form and CYK parsing
Lecture # 10 Grammar Problems. Problems with grammar Ambiguity Left Recursion Left Factoring Removal of Useless Symbols These can create problems for.
1 Context Free Grammars Xiaoyin Wang CS 5363 Spring 2016.
About Grammars Hopcroft, Motawi, Ullman, Chap 7.1, 6.3, 5.4.
Normal Forms for CFG’s Eliminating Useless Variables Removing Epsilon
Ambiguity Parsing algorithms
7. Properties of Context-Free Languages
Simplifications of Context-Free Grammars
Simplifications of Context-Free Grammars
Parsing Costas Busch - LSU.
Kanat Bolazar February 16, 2010
The Cocke-Kasami-Younger Algorithm
Normal forms and parsing
Context-Free Languages
Presentation transcript:

CYK Algorithm for Parsing General Context-Free Grammars

Why Parse General Grammars Can be difficult or impossible to make grammar unambiguous thus LL(k) and LR(k) methods cannot work, for such ambiguous grammars Some inputs are more complex than simple programming languages mathematical formulas: x = y /\ z ? (x=y) /\ z x = (y /\ z) natural language: I saw the man with the telescope. future programming languages

Ambiguity 1) 2) I saw the man with the telescope.

CYK Parsing Algorithm C: Y: K: John Cocke and Jacob T. Schwartz (1970). Programming languages and their compilers: Preliminary notes. Technical report, Courant Institute of Mathematical Sciences, New York University. Y: Daniel H. Younger (1967). Recognition and parsing of context-free languages in time n3. Information and Control 10(2): 189–208. K: T. Kasami (1965). An efficient recognition and syntax-analysis algorithm for context-free languages. Scientific report AFCRL-65-758, Air Force Cambridge Research Lab, Bedford, MA.

Two Steps in the Algorithm Transform grammar to normal form called Chomsky Normal Form (Noam Chomsky, mathematical linguist) Parse input using transformed grammar dynamic programming algorithm “a method for solving complex problems by breaking them down into simpler steps. It is applicable to problems exhibiting the properties of overlapping subproblems”

Balanced Parentheses Grammar Original grammar G S  “” | ( S ) | S S Modified grammar in Chomsky Normal Form: S  “” | S’ S’  N( NS) | N( N) | S’ S’ NS)  S’ N) N(  ( N)  ) Terminals: ( ) Nonterminals: S S’ NS) N) N(

Idea How We Obtained the Grammar S  ( S ) S’  N( NS) | N( N) N(  ( NS)  S’ N) N)  ) Because S can be empty but S’ cannot Chomsky Normal Form transformation can be done fully mechanically

Dynamic Programming to Parse Input Assume Chomsky Normal Form, 3 types of rules: S  “” | S’ (only for the start non-terminal) Nj  t (names for terminals) Ni  Nj Nk (just 2 non-terminals on RHS) Decomposing long input: find all ways to parse substrings of length 1,2,3,… Ni Nj Nk ( )

Parsing an Input S’  N( NS) | N( N) | S’ S’ NS)  S’ N) N(  ( N)  ) 7 ambiguity 6 5 4 3 2 1 N( N) ( ) 1 2 3 4 5 6 8 9

Parsing an Input S’  N( NS) | N( N) | S’ S’ NS)  S’ N) N(  ( N)  ) 7 ambiguity 6 5 4 3 2 1 N( N) ( )

key step of the algorithm: Algorithm Idea wpq – substring from p to q dpq – all non-terminals that could expand to wpq Initially dpp has Nw(p,p) key step of the algorithm: if X  Y Z is a rule, Y is in dp r , and Z is in d(r+1)q then put X into dpq (p <= r < q), in increasing value of (q-p) 7 6 5 4 3 2 1 N( N) ( ) 1 2 3 4 5 6 8 9

Algorithm INPUT: grammar G in Chomsky normal form word w to parse using G OUTPUT: true iff (w in L(G)) N = |w| var d : Array[N][N] for p = 1 to N { d(p)(p) = {X | G contains X->w(p)} for q in {p + 1 .. N} d(p)(q) = {} } for k = 2 to N // substring length for p = 0 to N-k // initial position for j = 1 to k-1 // length of first half val r = p+j-1; val q = p+k-1; for (X::=Y Z) in G if Y in d(p)(r) and Z in d(r+1)(q) d(p)(q) = d(p)(q) union {X} return S in d(0)(N-1) What is the running time as a function of grammar size and the size of input? O( ) ( )

Parsing another Input S’  N( NS) | N( N) | S’ S’ NS)  S’ N) N(  ( N)  ) 7 6 5 4 3 2 1 N( N) ( )

Number of Parse Trees Let w denote word ()()() it has two parse trees Give a lower bound on number of parse trees of the word wn (n is positive integer) w5 is the word ()()() ()()() ()()() ()()() ()()() CYK represents all parse trees compactly can re-run algorithm to extract first parse tree, or enumerate parse trees one by one

Conversion to Chomsky Normal Form (CNF) Steps: (not in the optimal order) remove unproductive symbols remove unreachable symbols remove epsilons (no non-start nullable symbols) remove single non-terminal productions (unit Productions): X::=Y reduce arity of every production to less than two make terminals occur alone on right-hand side

1) Unproductive non-terminals What is funny about this grammar: stmt ::= identifier := identifier | while (expr) stmt | if (expr) stmt else stmt expr ::= term + term | term – term term ::= factor * factor factor ::= ( expr ) There is no derivation of a sequence of tokens from expr In every step will have at least one expr, term, or factor If it cannot derive sequence of tokens we call it unproductive

1) Unproductive non-terminals Productive symbols are obtained using these two rules (what remains is unproductive) Terminals are productive If X::= s1 s2 … sn is a rule and each si is productive then X is productive Delete unproductive symbols. The language recognized by the grammar will not change stmt ::= identifier := identifier | while (expr) stmt | if (expr) stmt else stmt expr ::= term + term | term – term term ::= factor * factor factor ::= ( expr ) program ::= stmt | stmt program

2) Unreachable non-terminals What is funny about this grammar with start symbol ‘program’ program ::= stmt | stmt program stmt ::= assignment | whileStmt assignment ::= expr = expr ifStmt ::= if (expr) stmt else stmt whileStmt ::= while (expr) stmt expr ::= identifier No way to reach symbol ‘ifStmt’ from ‘program’ Can we formulate rules for reachable symbols ?

2) Unreachable non-terminals Reachable terminals are obtained using the following rules (the rest are unreachable) starting non-terminal is reachable (program) If X::= s1 s2 … sn is rule and Delete unreachable nonterminals and their productions X is reachable then every non-terminal in s1 s2 … sn is reachable

3) Removing Empty Strings Ensure only top-level symbol can be nullable program ::= stmtSeq stmtSeq ::= stmt | stmt ; stmtSeq stmt ::= “” | assignment | whileStmt | blockStmt blockStmt ::= { stmtSeq } assignment ::= expr = expr whileStmt ::= while (expr) stmt expr ::= identifier How to do it in this example?

3) Removing Empty Strings - Result program ::= “” | stmtSeq stmtSeq ::= stmt| stmt ; stmtSeq | | ; stmtSeq | stmt ; | ; stmt ::= assignment | whileStmt | blockStmt blockStmt ::= { stmtSeq } | { } assignment ::= expr = expr whileStmt ::= while (expr) stmt whileStmt ::= while (expr) expr ::= identifier

3) Removing Empty Strings - Algorithm Compute the set of nullable non-terminals If X::= 𝑠 1 ⋯ 𝑠 𝑛 is a production and 𝑠𝑖 is nullable then add new rule X::= 𝑠 1 ⋯ 𝑠 𝑖−1 𝑠 𝑖+1 ⋯ 𝑠 𝑛 | 𝑠 1 ⋯ 𝑠 𝑛 Remove all empty right-hand sides If starting symbol S was nullable, then introduce a new start symbol S’ instead, and add rule S’ ::= S | “”

3) Removing Empty Strings Since stmtSeq is nullable, the rule blockStmt ::= { stmtSeq } gives blockStmt ::= { stmtSeq } | { } Since stmtSeq and stmt are nullable, the rule stmtSeq ::= stmt | stmt ; stmtSeq gives stmtSeq ::= stmt | stmt ; stmtSeq | ; stmtSeq | stmt ; | ;

4) Eliminating unit productions Single production is of the form X ::=Y where X,Y are non-terminals program ::= stmtSeq stmtSeq ::= stmt | stmt ; stmtSeq stmt ::= assignment | whileStmt assignment ::= expr = expr whileStmt ::= while (expr) stmt

4) Eliminate unit productions - Result program ::= expr = expr | while (expr) stmt | stmt ; stmtSeq stmtSeq ::= expr = expr | while (expr) stmt | stmt ; stmtSeq stmt ::= expr = expr | while (expr) stmt assignment ::= expr = expr whileStmt ::= while (expr) stmt

4) Unit Production Elimination Algorithm If there is a unit production X ::=Y put an edge (X,Y) into graph If there is a path from X to Z in the graph, and there is rule Z ::= s1 s2 … sn then add rule X ::= s1 s2 … sn At the end, remove all unit productions. program ::= expr = expr | while (expr) stmt | stmt ; stmtSeq stmtSeq ::= expr = expr | while (expr) stmt | stmt ; stmtSeq stmt ::= expr = expr | while (expr) stmt

5) No more than 2 symbols on RHS stmt ::= while (expr) stmt becomes stmt ::= while stmt1 stmt1 ::= ( stmt2 stmt2 ::= expr stmt3 stmt3 ::= ) stmt

6) A non-terminal for each terminal stmt ::= while (expr) stmt becomes stmt ::= Nwhile stmt1 stmt1 ::= N( stmt2 stmt2 ::= expr stmt3 stmt3 ::= N) stmt Nwhile ::= while N( ::= ( N) ::= )

Order of steps in conversion to CNF remove unproductive symbols remove unreachable symbols Reduce arity of every production to <= 2 remove epsilons (no top-level nullable symbols) remove unit productions X::=Y make terminals occur alone on right-hand side Again remove unreachable symbols (if any) Again remove unproductive symbols (if any) What if we swap the steps 2 and 3 ? Potentially exponential blow-up in the # of productions What if we swap the steps 3 and 4 ? Epsilon removal can introduce unit productions

Parsing using CYK Algorithm Transform grammar into Chomsky Form: Have only rules X ::= Y Z, X ::= t, and possibly S ::= “” Apply CYK dynamic programming algorithm