Parsers for programming languages The Chinese University of Hong Kong Fall 2008 CSC 3130: Automata theory and formal languages Parsers for programming languages Andrej Bogdanov http://www.cse.cuhk.edu.hk/~andrejb/csc3130
CFG of the java programming language Identifier: IDENTIFIER QualifiedIdentifier: Identifier { . Identifier } Literal: IntegerLiteral FloatingPointLiteral CharacterLiteral StringLiteral BooleanLiteral NullLiteral Expression: Expression1 [AssignmentOperator Expression1]] AssignmentOperator: = += -= *= /= &= |= … from http://java.sun.com/docs/books/jls /second_edition/html/syntax.doc.html#52996
Parsing java programs Simple java program: about 1000 symbols … class Point2d { /* The X and Y coordinates of the point--instance variables */ private double x; private double y; private boolean debug; // A trick to help with debugging public Point2d (double px, double py) { // Constructor x = px; y = py; debug = false; // turn off debugging } public Point2d () { // Default constructor this (0.0, 0.0); // Invokes 2 parameter Point2D constructor // Note that a this() invocation must be the BEGINNING of // statement body of constructor public Point2d (Point2d pt) { // Another consructor x = pt.getX(); y = pt.getY(); … Simple java program: about 1000 symbols
Parsing algorithms How long would it take to parse this? Can we parse faster? No! CYK is the fastest known general-purpose parsing algorithm exhaustive algorithm about 1080 years (longer than life of universe) CYK algorithm about 1 week!
Another way of thinking Scientist: Find an algorithm that can parse strings in any grammar Engineer: Design your grammar so it has a very fast parsing algorithm
An example input: abaabbc S Tc(1) T TA(2) | A(3) Stack Input Action S Tc(1) T TA(2) | A(3) A aTb(4) | ab(5) a ab A T Ta Taa Taab TaA TaT TaTb TA Tc S abaabbc baabbc aabbc abbc bbc bc c shift reduce (5) reduce (3) reduce (4) reduce (2) reduce (1) input: abaabbc S T A T T A A a b a a b b c
Items S Tc(1) T TA(2) T A(3) A aTb(4) A ab(5) S •Tc Stack Input Action a ab A T Ta • abaabbc baabbc aabbc abbc shift reduce (5) reduce (3) Idea of parsing algorithm: Try to match complete items to top of stack
Some terminology input: abaabbc S Tc(1) T TA(2) | A(3) Stack Input Action S Tc(1) T TA(2) | A(3) A aTb(4) | ab(5) a ab A T Ta Taa Taab TaA TaT TaTb TA Tc S abaabbc baabbc aabbc abbc bbc bc c shift reduce (5) reduce (3) reduce (4) reduce (2) reduce (1) input: abaabbc handle valid items: a•Tb, a•b valid items: T•a, T•c, aT•b
Outline of LR(0) parsing algorithm As the string is being read, it is pushed on a stack Algorithm keeps track of all valid items Algorithm can perform two actions: no complete item is viable there is one valid item, and it is complete shift reduce
Running the algorithm A aAb | ab A aAb aabb A Stack Input Valid Items a aa aab aA aAb A aabb abb bb b A •aAb A •ab A a•Ab A a•b A ab• A aA•b A aAb• S R A aAb | ab A aAb aabb
Running the algorithm A aAb | ab A aAb aabb A Stack Input Valid Items a aa aab aA aAb A aabb abb bb b A •aAb A •ab A a•Ab A a•b A ab• A aA•b A aAb• S R A aAb | ab A aAb aabb
How to update viable items Initial set of valid items Updating valid items on “shift b” After these updates, for every valid item A a•Cb and production C •d, we also add as a valid item S •a for every production S a A a•bb is updated to A ab•b A a•Xb disappears if X ≠ b a, b: terminals A, B: variables X, Y: mixed symbols a, b: mixed strings notation C •d
How to update viable items Updating valid items on “reduce b to B” First, we backtrack to viable items before reduce Then, we apply same rules as for “shift B” (as if B were a terminal) A a•Bb is updated to A aB•b A a•Xb disappears if X ≠ B C •d is added for every valid item A a•Cb and production C •d
Viable item updates by eNFA States of eNFA will be items (plus a start state q0) For every item S •a we have a transition For every item A •X we have a transition For every item A a•Cb and production C •d e q0 S •a X A •X A X• e A •C C •d
Example A aAb | ab A A •aAb A a•Ab A aA•b a b q0
Convert eNFA to DFA states correspond to sets of valid items 4 A aA•b a A b 2 A a•Ab A a•b A •aAb A •ab 5 1 A aAb• A •aAb A •ab a b 3 A ab• die states correspond to sets of valid items transitions are labeled by variables / terminals
Attempt at parsing with DFA Stack Input DFA state a aa aab aA aabb abb bb b 1 2 3 ? A •aAb A •ab A a•Ab A a•b A ab• A aA•b S R A aAb | ab A aAb aabb
Remember the state in stack! Input DFA state 1 1a2 1a2a2 1a2a2b3 1a2A4 1a2A4b5 1A aabb abb bb b 1 2 3 4 5 A •aAb A •ab A a•Ab A a•b A ab• A aA•b A aAb• S R A aAb | ab A aAb aabb
LR(0) grammars and deterministic PDAs The parsing procedure can be implemented by a deterministic pushdown automaton A PDA is deterministic if in every state there is at most one possible transition for every input symbol and pop symbol, including e Example: PDA for w#wR is deterministic, but PDA for wwR is not
LR(0) grammars and deterministic PDAs Not every PDA can be made deterministic Since PDAs are equivalent to CFLs, LR(0) parsing algorithm must fail for some CFLs! When does LR(0) parsing algorithm fail?
Outline of LR(0) parsing algorithm Algorithm can perform two actions: What if: no complete item is valid there is one valid item, and it is complete shift (S) reduce (R) some valid items complete, some not more than one valid complete item S / R conflict R / R conflict
Hierarchy of context-free grammars parse using CYK algorithm (slow) LR(∞) grammars … to be continued… java perl python … LR(1) grammars LR(0) grammars parse using LR(0) algorithm