Parsers for programming languages

Slides:



Advertisements
Similar presentations
CSCI 3130: Formal Languages and Automata Theory Tutorial 5
Advertisements

Pushdown Automata Section 2.2 CSC 4170 Theory of Computation.
Pushdown Automata Consists of –Pushdown stack (can have terminals and nonterminals) –Finite state automaton control Can do one of three actions (based.
Pushdown Automata Part II: PDAs and CFG Chapter 12.
CS 310 – Fall 2006 Pacific University CS310 Pushdown Automata Sections: 2.2 page 109 October 11, 2006.
1 CSC 3130: Automata theory and formal languages Tutorial 4 KN Hung Office: SHB 1026 Department of Computer Science & Engineering.
CS 310 – Fall 2006 Pacific University CS310 Pushdown Automata Sections: 2.2 page 109 October 9, 2006.
CS Master – Introduction to the Theory of Computation Jan Maluszynski - HT Lecture 4 Context-free grammars Jan Maluszynski, IDA, 2007
1 Normal Forms for Context-free Grammars. 2 Chomsky Normal Form All productions have form: variable and terminal.
1 Normal Forms for Context-free Grammars. 2 Chomsky Normal Form All productions have form: variable and terminal.
Prof. Fateman CS 164 Lecture 91 Bottom-Up Parsing Lecture 9.
CS 3240: Languages and Computation Pushdown Automata & CF Grammars NOTE: THESE ARE ONLY PARTIAL SLIDES RELATED TO WEEKS 9 AND 10. PLEASE REFER TO THE TEXTBOOK.
CSCI 2670 Introduction to Theory of Computing September 21, 2005.
Pushdown Automata.
نظریه زبان ها و ماشین ها فصل دوم Context-Free Languages دانشگاه صنعتی شریف بهار 88.
CSCI 3130: Automata theory and formal languages Andrej Bogdanov The Chinese University of Hong Kong Ambiguity.
CS 461 – Oct. 12 Parsing Running a parse machine –“Goto” (or shift) actions –Reduce actions: backtrack to earlier state –Maintain stack of visited states.
CSCI 3130: Automata theory and formal languages Andrej Bogdanov The Chinese University of Hong Kong Pushdown.
CSC 3130: Automata theory and formal languages Andrej Bogdanov The Chinese University of Hong Kong Normal forms.
CSC 3130: Automata theory and formal languages Andrej Bogdanov The Chinese University of Hong Kong Pushdown.
CSCI 3130: Automata theory and formal languages Andrej Bogdanov The Chinese University of Hong Kong LR(0) grammars.
CSC 3130: Automata theory and formal languages Andrej Bogdanov The Chinese University of Hong Kong Pushdown.
Grammar Set of variables Set of terminal symbols Start variable Set of Production rules.
6. Pushdown Automata CIS Automata and Formal Languages – Pei Wang.
Nondeterminism The Chinese University of Hong Kong Fall 2011
Pushdown Automata.
Parsing Bottom Up CMPS 450 J. Moloney CMPS 450.
Programming Languages Translator
LR(k) grammars The Chinese University of Hong Kong Fall 2009
CSE 105 theory of computation
Table-driven parsing Parsing performed by a finite state machine.
FORMAL LANGUAGES AND AUTOMATA THEORY
CSE 105 theory of computation
CS314 – Section 5 Recitation 3
Pushdown Automata PDAs
PDAs Accept Context-Free Languages
Pushdown Automata PDAs
Bottom-Up Syntax Analysis
Pushdown Automata Reading: Chapter 6.
Intro to Theory of Computation
Context-Free Languages
LR(0) grammars The Chinese University of Hong Kong Fall 2010
REGULAR LANGUAGES AND REGULAR GRAMMARS
Pushdown automata and CFG ↔ PDA conversions
CSCI 3130: Formal languages and automata theory Tutorial 6
LR(1) grammars The Chinese University of Hong Kong Fall 2010
More on DFA minimization and DFA equivalence
Intro to Data Structures
Context-Free Grammars
فصل دوم Context-Free Languages
Pushdown automata a_introduction.htm.
LR Parsing. Parser Generators.
Parsers for programming languages
Chapter 2 Context-Free Language - 01
CSE 105 theory of computation
Chapter Fifteen: Stack Machine Applications
LR(1) grammars The Chinese University of Hong Kong Fall 2011
… NPDAs continued.
CSE 105 theory of computation
Pushdown automata The Chinese University of Hong Kong Fall 2011
LR(k) grammars The Chinese University of Hong Kong Fall 2008
Normal forms and parsing
Context Free Grammars-II
CSE 105 theory of computation
Normal Forms for Context-free Grammars
COP 4620 / 5625 Programming Language Translation / Compiler Writing Fall 2003 Lecture 7, 10/09/2003 Prof. Roy Levow.
CSE 105 theory of computation
CSE 105 theory of computation
Parsing CSCI 432 Computer Science Theory
Presentation transcript:

Parsers for programming languages The Chinese University of Hong Kong Fall 2008 CSC 3130: Automata theory and formal languages Parsers for programming languages Andrej Bogdanov http://www.cse.cuhk.edu.hk/~andrejb/csc3130

CFG of the java programming language Identifier: IDENTIFIER QualifiedIdentifier: Identifier { . Identifier } Literal: IntegerLiteral FloatingPointLiteral CharacterLiteral StringLiteral BooleanLiteral NullLiteral Expression: Expression1 [AssignmentOperator Expression1]] AssignmentOperator: = += -= *= /= &= |= … from http://java.sun.com/docs/books/jls /second_edition/html/syntax.doc.html#52996

Parsing java programs Simple java program: about 1000 symbols … class Point2d { /* The X and Y coordinates of the point--instance variables */ private double x; private double y; private boolean debug; // A trick to help with debugging public Point2d (double px, double py) { // Constructor x = px; y = py; debug = false; // turn off debugging } public Point2d () { // Default constructor this (0.0, 0.0); // Invokes 2 parameter Point2D constructor // Note that a this() invocation must be the BEGINNING of // statement body of constructor public Point2d (Point2d pt) { // Another consructor x = pt.getX(); y = pt.getY(); … Simple java program: about 1000 symbols

Parsing algorithms How long would it take to parse this? Can we parse faster? No! CYK is the fastest known general-purpose parsing algorithm exhaustive algorithm about 1080 years (longer than life of universe) CYK algorithm about 1 week!

Another way of thinking Scientist: Find an algorithm that can parse strings in any grammar Engineer: Design your grammar so it has a very fast parsing algorithm

An example input: abaabbc S  Tc(1) T  TA(2) | A(3) Stack Input Action S  Tc(1) T  TA(2) | A(3) A  aTb(4) | ab(5)  a ab A T Ta Taa Taab TaA TaT TaTb TA Tc S abaabbc baabbc aabbc abbc bbc bc c  shift reduce (5) reduce (3) reduce (4) reduce (2) reduce (1) input: abaabbc S T A T T A A a b a a b b c

Items S  Tc(1) T  TA(2) T  A(3) A  aTb(4) A  ab(5) S  •Tc Stack Input Action  a ab A T Ta • abaabbc baabbc aabbc abbc shift reduce (5) reduce (3) Idea of parsing algorithm: Try to match complete items to top of stack

Some terminology input: abaabbc S  Tc(1) T  TA(2) | A(3) Stack Input Action S  Tc(1) T  TA(2) | A(3) A  aTb(4) | ab(5)  a ab A T Ta Taa Taab TaA TaT TaTb TA Tc S abaabbc baabbc aabbc abbc bbc bc c  shift reduce (5) reduce (3) reduce (4) reduce (2) reduce (1) input: abaabbc handle valid items: a•Tb, a•b valid items: T•a, T•c, aT•b

Outline of LR(0) parsing algorithm As the string is being read, it is pushed on a stack Algorithm keeps track of all valid items Algorithm can perform two actions: no complete item is viable there is one valid item, and it is complete shift reduce

Running the algorithm A  aAb | ab A  aAb  aabb A Stack Input Valid Items  a aa aab aA aAb A aabb abb bb b  A  •aAb A  •ab A  a•Ab A  a•b A  ab• A  aA•b A  aAb• S R A  aAb | ab A  aAb  aabb

Running the algorithm A  aAb | ab A  aAb  aabb A Stack Input Valid Items  a aa aab aA aAb A aabb abb bb b  A  •aAb A  •ab A  a•Ab A  a•b A  ab• A  aA•b A  aAb• S R A  aAb | ab A  aAb  aabb

How to update viable items Initial set of valid items Updating valid items on “shift b” After these updates, for every valid item A  a•Cb and production C  •d, we also add as a valid item S  •a for every production S  a A  a•bb is updated to A  ab•b A  a•Xb disappears if X ≠ b a, b: terminals A, B: variables X, Y: mixed symbols a, b: mixed strings notation C  •d

How to update viable items Updating valid items on “reduce b to B” First, we backtrack to viable items before reduce Then, we apply same rules as for “shift B” (as if B were a terminal) A  a•Bb is updated to A  aB•b A  a•Xb disappears if X ≠ B C  •d is added for every valid item A  a•Cb and production C  •d

Viable item updates by eNFA States of eNFA will be items (plus a start state q0) For every item S  •a we have a transition For every item A  •X we have a transition For every item A  a•Cb and production C  •d e q0 S  •a X A  •X A  X• e A  •C C  •d

Example A  aAb | ab  A A  •aAb A  a•Ab A  aA•b a  b q0 

Convert eNFA to DFA states correspond to sets of valid items 4 A  aA•b a A b 2 A  a•Ab A  a•b A  •aAb A  •ab 5 1 A  aAb• A  •aAb A •ab a b 3 A  ab• die states correspond to sets of valid items transitions are labeled by variables / terminals

Attempt at parsing with DFA Stack Input DFA state  a aa aab aA aabb abb bb b 1 2 3 ? A  •aAb A  •ab A  a•Ab A  a•b A  ab• A  aA•b S R A  aAb | ab A  aAb  aabb

Remember the state in stack! Input DFA state 1 1a2 1a2a2 1a2a2b3 1a2A4 1a2A4b5 1A aabb abb bb b  1 2 3 4 5 A  •aAb A  •ab A  a•Ab A  a•b A  ab• A  aA•b A  aAb• S R A  aAb | ab A  aAb  aabb

LR(0) grammars and deterministic PDAs The parsing procedure can be implemented by a deterministic pushdown automaton A PDA is deterministic if in every state there is at most one possible transition for every input symbol and pop symbol, including e Example: PDA for w#wR is deterministic, but PDA for wwR is not

LR(0) grammars and deterministic PDAs Not every PDA can be made deterministic Since PDAs are equivalent to CFLs, LR(0) parsing algorithm must fail for some CFLs! When does LR(0) parsing algorithm fail?

Outline of LR(0) parsing algorithm Algorithm can perform two actions: What if: no complete item is valid there is one valid item, and it is complete shift (S) reduce (R) some valid items complete, some not more than one valid complete item S / R conflict R / R conflict

Hierarchy of context-free grammars parse using CYK algorithm (slow) LR(∞) grammars … to be continued… java perl python … LR(1) grammars LR(0) grammars parse using LR(0) algorithm