Parsers for programming languages


CSC 3130: Automata theory and formal languages
Andrej Bogdanov
The Chinese University of Hong Kong, Fall 2009
http://www.cse.cuhk.edu.hk/~andrejb/csc3130

CFG of the Java programming language

    Identifier:
        IDENTIFIER
    QualifiedIdentifier:
        Identifier { . Identifier }
    Literal:
        IntegerLiteral
        FloatingPointLiteral
        CharacterLiteral
        StringLiteral
        BooleanLiteral
        NullLiteral
    Expression:
        Expression1 [AssignmentOperator Expression1]
    AssignmentOperator:
        =  +=  -=  *=  /=  &=  |=  …

from http://java.sun.com/docs/books/jls/second_edition/html/syntax.doc.html#52996

Parsing Java programs

Simple Java program: about 1000 symbols

    …
    class Point2d {
        /* The X and Y coordinates of the point--instance variables */
        private double x;
        private double y;
        private boolean debug;    // A trick to help with debugging

        public Point2d (double px, double py) {    // Constructor
            x = px;
            y = py;
            debug = false;    // turn off debugging
        }

        public Point2d () {    // Default constructor
            this (0.0, 0.0);    // Invokes 2 parameter Point2D constructor
            // Note that a this() invocation must be the BEGINNING of
            // statement body of constructor
        }

        public Point2d (Point2d pt) {    // Another constructor
            x = pt.getX();
            y = pt.getY();
    …

Parsing algorithms

How long would it take to parse this?
    exhaustive algorithm: about 10^80 years (longer than life of universe)
    CYK algorithm: about 1 week!

Can we parse faster?
    No! CYK is the fastest known general-purpose parsing algorithm

Another way of thinking

Scientist: Find an algorithm that can parse strings in any grammar
Engineer: Design your grammar so it has a very fast parsing algorithm

An example

Grammar:
    S → Tc (1)
    T → TA (2) | A (3)
    A → aTb (4) | ab (5)

input: abaabbc

    Stack   Input     Action
    ε       abaabbc   shift
    a       baabbc    shift
    ab      aabbc     reduce (5)
    A       aabbc     reduce (3)
    T       aabbc     shift
    Ta      abbc      shift
    Taa     bbc       shift
    Taab    bc        reduce (5)
    TaA     bc        reduce (3)
    TaT     bc        shift
    TaTb    c         reduce (4)
    TA      c         reduce (2)
    T       c         shift
    Tc      ε         reduce (1)
    S       ε

Parse tree (bracket form): S[ T[ T[ A[a b] ] A[ a T[ A[a b] ] b ] ] c ]
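The trace above can be reproduced with a few lines of Python. This is only a sketch: the function name `shift_reduce` and the "reduce the longest matching right-hand side, otherwise shift" rule are assumptions made for illustration. That rule happens to work for this grammar, but it is not the LR(0) decision procedure developed on the following slides.

```python
# Grammar from the slide; rule numbers match the slide's labels.
RULES = {
    1: ("S", "Tc"),
    2: ("T", "TA"),
    3: ("T", "A"),
    4: ("A", "aTb"),
    5: ("A", "ab"),
}

def shift_reduce(text):
    """Return the list of (stack, remaining input, action) steps.

    Heuristic (an assumption, not the LR(0) rule): reduce whenever some
    right-hand side is a suffix of the stack, preferring the longest one;
    otherwise shift the next input symbol.
    """
    stack, rest, trace = "", text, []
    while True:
        matches = [(len(rhs), n) for n, (_, rhs) in RULES.items()
                   if stack.endswith(rhs)]
        if matches:
            _, n = max(matches)              # longest right-hand side wins
            lhs, rhs = RULES[n]
            trace.append((stack, rest, f"reduce ({n})"))
            stack = stack[:len(stack) - len(rhs)] + lhs
        elif rest:
            trace.append((stack, rest, "shift"))
            stack, rest = stack + rest[0], rest[1:]
        else:
            trace.append((stack, rest, "accept" if stack == "S" else "reject"))
            return trace

for stack, rest, action in shift_reduce("abaabbc"):
    print(f"{stack or 'ε':6} {rest or 'ε':8} {action}")
```

The printed rows follow the Stack / Input / Action table, with a final "accept" row once the stack holds only S.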

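Items

Grammar:
    S → Tc (1)
    T → TA (2)
    T → A (3)
    A → aTb (4)
    A → ab (5)

An item is a production with a dot (•) marking a position in its right-hand side, e.g. S → •Tc

    Stack   Input     Action
    ε       abaabbc   shift
    a       baabbc    shift
    ab      aabbc     reduce (5)
    A       aabbc     reduce (3)
    T       aabbc
    Ta      abbc

Idea of parsing algorithm: Try to match complete items to top of stack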

Some terminology

Grammar:
    S → Tc (1)
    T → TA (2) | A (3)
    A → aTb (4) | ab (5)

input: abaabbc

    Stack   Input     Action
    ε       abaabbc   shift
    a       baabbc    shift
    ab      aabbc     reduce (5)
    A       aabbc     reduce (3)
    T       aabbc     shift
    Ta      abbc      shift
    Taa     bbc       shift
    Taab    bc        reduce (5)
    TaA     bc        reduce (3)
    TaT     bc        shift
    TaTb    c         reduce (4)
    TA      c         reduce (2)
    T       c         shift
    Tc      ε         reduce (1)
    S       ε

Annotations on the slide:
    handle
    valid items: a•Tb, a•b
    valid items: T•a, T•c, aT•b

Outline of LR(0) parsing algorithm

As the string is being read, it is pushed on a stack
Algorithm keeps track of all valid items
Algorithm can perform two actions:
    shift: no complete item is valid
    reduce: there is one valid item, and it is complete

Running the algorithm

Grammar: A → aAb | ab
Derivation: A ⇒ aAb ⇒ aabb

    Stack   Input   Valid items                              Action
    ε       aabb    A → •aAb, A → •ab                        S
    a       abb     A → a•Ab, A → a•b, A → •aAb, A → •ab     S
    aa      bb      A → a•Ab, A → a•b, A → •aAb, A → •ab     S
    aab     b       A → ab•                                  R
    aA      b       A → aA•b                                 S
    aAb     ε       A → aAb•                                 R
    A       ε

How to update valid items

Notation:
    a, b: terminals
    A, B: variables
    X, Y: mixed symbols
    α, β: mixed strings

Initial set of valid items:
    S → •α for every production S → α

Updating valid items on “shift b”:
    A → α•bβ is updated to A → αb•β
    A → α•Xβ disappears if X ≠ b

After these updates, for every valid item A → α•Cβ and production C → δ, we also add C → •δ as a valid item

How to update valid items

Updating valid items on “reduce β to B”:
    First, we backtrack to valid items before reduce
    Then, we apply same rules as for “shift B” (as if B were a terminal):
        A → α•Bβ is updated to A → αB•β
        A → α•Xβ disappears if X ≠ B
        C → •δ is added for every valid item A → α•Cβ and production C → δ
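The update rules can be sketched in Python for the grammar A → aAb | ab used in the running example. The representation is an assumption made for illustration: an item is a tuple `(lhs, rhs, dot)`, and the helper names `closure`, `initial_items` and `advance` are hypothetical. `advance` covers both "shift b" and the second half of "reduce β to B", since a reduce applies the same rule with the variable B in place of a terminal.

```python
GRAMMAR = {"A": ["aAb", "ab"]}   # A -> aAb | ab
START = "A"

def closure(items):
    """For every item A -> α•Cβ, add C -> •δ for each production C -> δ."""
    items = set(items)
    todo = list(items)
    while todo:
        lhs, rhs, dot = todo.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for prod in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], prod, 0)
                if item not in items:
                    items.add(item)
                    todo.append(item)
    return frozenset(items)

def initial_items():
    """S -> •α for every production S -> α of the start variable."""
    return closure({(START, rhs, 0) for rhs in GRAMMAR[START]})

def advance(items, symbol):
    """Move the dot over `symbol`; items expecting another symbol disappear."""
    moved = {(lhs, rhs, dot + 1) for lhs, rhs, dot in items
             if dot < len(rhs) and rhs[dot] == symbol}
    return closure(moved)

after_a = advance(initial_items(), "a")    # valid items for stack "a"
after_ab = advance(after_a, "b")           # valid items for stack "ab"
```

Here `after_a` holds the four items A → a•Ab, A → a•b, A → •aAb, A → •ab and `after_ab` holds only the complete item A → ab•, as in the "Running the algorithm" table.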

Valid item updates by NFA

States of NFA will be items (plus a start state q0)

For every item S → •α we have a transition
    q0 --ε--> S → •α
For every item A → α•Xβ we have a transition
    A → α•Xβ --X--> A → αX•β
For every item A → α•Cβ and production C → δ we have a transition
    A → α•Cβ --ε--> C → •δ

Example

Grammar: A → aAb | ab

[Figure: NFA on items. Recoverable path: q0 --ε--> A → •aAb --a--> A → a•Ab --A--> A → aA•b --b--> A → aAb•]

Convert NFA to DFA

    State   Valid items
    1       A → •aAb, A → •ab
    2       A → a•Ab, A → a•b, A → •aAb, A → •ab
    3       A → ab•
    4       A → aA•b
    5       A → aAb•

Transitions: 1 --a--> 2, 2 --a--> 2, 2 --b--> 3, 2 --A--> 4, 4 --b--> 5; every other transition goes to the die state

states correspond to sets of valid items
transitions are labeled by variables / terminals
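A sketch of how these DFA states could be computed directly, without building the NFA first: start from the initial item set and keep following every symbol until no new item set appears. The tuple representation `(lhs, rhs, dot)`, the helper names, and the symbol order in `SYMBOLS` (chosen so that the numbering comes out as 1 to 5 like on the slide) are assumptions for illustration.

```python
GRAMMAR = {"A": ["aAb", "ab"]}   # A -> aAb | ab
START = "A"
SYMBOLS = "abA"                  # terminals first, so numbering follows the slide

def closure(items):
    """Add C -> •δ for every item with the dot in front of variable C."""
    items = set(items)
    todo = list(items)
    while todo:
        lhs, rhs, dot = todo.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for prod in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], prod, 0)
                if item not in items:
                    items.add(item)
                    todo.append(item)
    return frozenset(items)

def build_dfa():
    """Return (state numbering, transition table) of the item-set DFA."""
    first = closure({(START, rhs, 0) for rhs in GRAMMAR[START]})
    number = {first: 1}
    edges = {}
    todo = [first]
    while todo:
        state = todo.pop(0)                  # breadth-first exploration
        for sym in SYMBOLS:
            moved = {(l, r, d + 1) for l, r, d in state
                     if d < len(r) and r[d] == sym}
            if not moved:
                continue                     # missing edge = die state
            nxt = closure(moved)
            if nxt not in number:
                number[nxt] = len(number) + 1
                todo.append(nxt)
            edges[(number[state], sym)] = number[nxt]
    return number, edges

number, edges = build_dfa()
for (src, sym), dst in sorted(edges.items()):
    print(f"{src} --{sym}--> {dst}")
```

Transitions that are absent from `edges` play the role of the die state.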

Shift states and reduce states

    State   Valid items                              Kind
    1       A → •aAb, A → •ab                        shift
    2       A → a•Ab, A → a•b, A → •aAb, A → •ab     shift
    3       A → ab•                                  reduce
    4       A → aA•b                                 shift
    5       A → aAb•                                 reduce

1, 2, 4 are shift states
3, 5 are reduce states

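Attempt at parsing with DFA

Grammar: A → aAb | ab
Derivation: A ⇒ aAb ⇒ aabb

    Stack   Input   DFA state   Valid items                              Action
    ε       aabb    1           A → •aAb, A → •ab                        S
    a       abb     2           A → a•Ab, A → a•b, A → •aAb, A → •ab     S
    aa      bb      2           A → a•Ab, A → a•b, A → •aAb, A → •ab     S
    aab     b       3           A → ab•                                  R
    aA      b       ?           A → aA•b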

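Remember the state in stack!

Grammar: A → aAb | ab
Derivation: A ⇒ aAb ⇒ aabb

    Stack     Input   DFA state   Valid items                              Action
    1         aabb    1           A → •aAb, A → •ab                        S
    1a2       abb     2           A → a•Ab, A → a•b, A → •aAb, A → •ab     S
    1a2a2     bb      2           A → a•Ab, A → a•b, A → •aAb, A → •ab     S
    1a2a2b3   b       3           A → ab•                                  R
    1a2A4     b       4           A → aA•b                                 S
    1a2A4b5   ε       5           A → aAb•                                 R
    1A        ε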
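A minimal sketch of this stack discipline, with the five-state DFA for A → aAb | ab typed in by hand from the earlier slide. The names `GOTO`, `REDUCE` and `lr0_parse`, and the choice to return the list of reductions (or `None` on rejection), are assumptions made for illustration.

```python
# DFA for A -> aAb | ab, copied from the slide (missing entries = die state).
GOTO = {(1, "a"): 2, (2, "a"): 2, (2, "b"): 3, (2, "A"): 4, (4, "b"): 5}
# Reduce states and the complete item that is valid in each of them.
REDUCE = {3: ("A", "ab"), 5: ("A", "aAb")}
START = "A"

def lr0_parse(text):
    """Return the reductions performed, or None if the input is rejected."""
    stack = [1]              # alternates state, symbol, state, ... like 1a2a2b3
    rest = text
    reductions = []
    while True:
        state = stack[-1]
        if state in REDUCE:
            lhs, rhs = REDUCE[state]
            del stack[-2 * len(rhs):]        # backtrack to the state before rhs
            reductions.append(f"{lhs} -> {rhs}")
            if stack == [1] and lhs == START:
                # Stack would now read "1A": accept only if input is used up.
                return reductions if not rest else None
            nxt = GOTO.get((stack[-1], lhs))
            if nxt is None:
                return None
            stack += [lhs, nxt]              # same as shifting the variable
        else:
            if not rest:
                return None                  # shift state but nothing to shift
            nxt = GOTO.get((state, rest[0]))
            if nxt is None:
                return None                  # die state
            stack += [rest[0], nxt]
            rest = rest[1:]

print(lr0_parse("aabb"))
```

Because each state is stored next to its symbol, a reduce can pop the right-hand side and read the earlier state off the top of the stack, which is exactly what the "?" in the previous attempt was missing. Reading the returned reductions in reverse gives the derivation A ⇒ aAb ⇒ aabb.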

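Reconstructing the parse tree

Grammar: A → aAb | ab
Derivation: A ⇒ aAb ⇒ aabb

    Stack   Input   DFA state   Valid items                              Action
    1       aabb    1           A → •aAb, A → •ab                        S
    12      abb     2           A → a•Ab, A → a•b, A → •aAb, A → •ab     S
    122     bb      2           A → a•Ab, A → a•b, A → •aAb, A → •ab     S
    1223    b       3           A → ab•                                  R
    124     b       4           A → aA•b                                 S
    1245    ε       5           A → aAb•                                 R

Parse tree (bracket form): A[ a A[a b] b ]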

LR(0) grammars and deterministic PDAs

The parsing procedure can be implemented by a deterministic pushdown automaton
A PDA is deterministic if in every state there is at most one possible transition for every input symbol and pop symbol, including ε
Example: PDA for w#w^R is deterministic, but PDA for ww^R is not
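To see why the PDA for w#w^R needs no guessing, here is a small sketch that simulates it over the alphabet {a, b}. The function name `accepts_w_hash_wr` and the two control states "push" and "pop" are hypothetical names chosen for illustration. The marker # tells the machine exactly when to switch from pushing to popping; for ww^R there is no such marker, so the midpoint would have to be guessed.

```python
def accepts_w_hash_wr(text):
    """Simulate a deterministic PDA for { w#w^R : w in {a, b}* }."""
    stack = []
    state = "push"                   # before the '#': push every symbol
    for ch in text:
        if state == "push":
            if ch == "#":
                state = "pop"        # the marker fixes the midpoint
            elif ch in "ab":
                stack.append(ch)
            else:
                return False         # symbol outside the alphabet
        else:
            # After the '#': each symbol must match the top of the stack.
            if not stack or stack.pop() != ch:
                return False
    return state == "pop" and not stack

print(accepts_w_hash_wr("ab#ba"), accepts_w_hash_wr("ab#ab"))
```

In every configuration at most one move is possible, which is the determinism condition stated above.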

LR(0) grammars and deterministic PDAs

Not every PDA can be made deterministic
Since PDAs are equivalent to CFGs, LR(0) parsing algorithm must fail for some CFG, e.g. for L = {ww^R : w ∈ {a, b}*}
Why does LR(0) parsing algorithm fail?

Example 1

L = {ww^R : w ∈ {a, b}*}
Grammar: A → aAa | bAb | ε

[Figure: NFA on items. Recoverable path: q0 --ε--> A → •aAa --a--> A → a•Aa]

Example 1

L = {ww^R : w ∈ {a, b}*}
Grammar: A → aAa | bAb | ε

shift-reduce conflict

[Figure: DFA fragment showing the items A → • and A → bAb•, with a transition labeled b]

Example 1

L = {ww^R : w ∈ {a, b}*}
Grammar: A → aAa | bAb | ε

input: abba

[Figure: DFA fragment with transitions labeled a, b and a]

shift or reduce?

When you can’t LR(0) parse

Algorithm can perform two actions:
    shift (S): no complete item is valid
    reduce (R): there is one valid item, and it is complete

What if:
    some valid items complete, some not: S / R conflict
    more than one valid complete item: R / R conflict
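The two conflict conditions can be checked mechanically by building every set of valid items and inspecting it. The sketch below does this for a grammar given as a dictionary. The names `closure` and `lr0_conflicts`, the tuple representation `(lhs, rhs, dot)` of items, and the use of the empty string for an ε-production are assumptions made for illustration.

```python
def closure(grammar, items):
    """Add C -> •δ for every item with the dot in front of variable C."""
    items = set(items)
    todo = list(items)
    while todo:
        lhs, rhs, dot = todo.pop()
        if dot < len(rhs) and rhs[dot] in grammar:
            for prod in grammar[rhs[dot]]:
                item = (rhs[dot], prod, 0)
                if item not in items:
                    items.add(item)
                    todo.append(item)
    return frozenset(items)

def lr0_conflicts(grammar, start):
    """Return the kinds of conflict ("S/R", "R/R") found in any item set."""
    first = closure(grammar, {(start, rhs, 0) for rhs in grammar[start]})
    seen, todo, found = {first}, [first], set()
    while todo:
        state = todo.pop()
        complete = [item for item in state if item[2] == len(item[1])]
        if complete and len(complete) < len(state):
            found.add("S/R")     # some valid items complete, some not
        if len(complete) > 1:
            found.add("R/R")     # more than one valid complete item
        for sym in {rhs[dot] for _, rhs, dot in state if dot < len(rhs)}:
            nxt = closure(grammar, {(l, r, d + 1) for l, r, d in state
                                    if d < len(r) and r[d] == sym})
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return found

palindromes = {"A": ["aAa", "bAb", ""]}    # A -> aAa | bAb | ε
marked = {"A": ["aAa", "bAb", "#"]}        # A -> aAa | bAb | #
print(lr0_conflicts(palindromes, "A"), lr0_conflicts(marked, "A"))
```

For the ww^R grammar the initial item set already contains the complete item A → • next to incomplete items, so an S/R conflict is reported; the w#w^R grammar yields no conflicts.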

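Example 2

L = {w#w^R : w ∈ {a, b}*}
Grammar: A → aAa | bAb | #

[Figure: NFA on items. ε-transitions lead from q0 to A → •aAa, A → •bAb and A → •#. Recoverable paths: A → •# --#--> A → #•, and A → •bAb --b--> A → b•Ab --A--> A → bA•b --b--> A → bAb•; the path for A → aAa is labeled a, A, a in the same way]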

Example 2

L = {w#w^R : w ∈ {a, b}*}
Grammar: A → aAa | bAb | #

[Figure: DFA on sets of valid items with states 1 to 6, built from the items A → •aAa, A → •bAb, A → •#, A → a•Aa, A → b•Ab, A → aA•a, A → bA•b; the reduce states are 4 (A → aAa•), 5 (A → bAb•) and 6 (A → #•)]

No S/R or R/R conflicts!

Example 2: parsing

    Stack   State   Action
    1       1       S
    12      2       S
    122     2       S
    1226    6       R
    1223    3       S
    12234   4       R
    123     3       S
    1235    5       R

Parse tree (bracket form): A[ b A[ a A[#] a ] b ]

Hierarchy of context-free grammars

[Figure: nested classes of grammars, from largest to smallest]
    context-free grammars: parse using CYK algorithm (slow)
        LR(∞) grammars
            …
            java, perl, python, …
                LR(1) grammars
                    LR(0) grammars: parse using LR(0) algorithm

… to be continued…