Parsers for programming languages


1 Parsers for programming languages
CSC 3130: Automata theory and formal languages
Andrej Bogdanov
The Chinese University of Hong Kong, Fall 2009

2 CFG of the Java programming language
Identifier:
    IDENTIFIER
QualifiedIdentifier:
    Identifier { . Identifier }
Literal:
    IntegerLiteral
    FloatingPointLiteral
    CharacterLiteral
    StringLiteral
    BooleanLiteral
    NullLiteral
Expression:
    Expression1 [AssignmentOperator Expression1]
AssignmentOperator:
    =
    +=
    -=
    *=
    /=
    &=
    |=
from /second_edition/html/syntax.doc.html#52996

3 Parsing Java programs
class Point2d {
    /* The X and Y coordinates of the point--instance variables */
    private double x;
    private double y;
    private boolean debug;  // A trick to help with debugging

    public Point2d (double px, double py) {  // Constructor
        x = px;
        y = py;
        debug = false;  // turn off debugging
    }

    public Point2d () {  // Default constructor
        this (0.0, 0.0);  // Invokes 2 parameter Point2d constructor
        // Note that a this() invocation must be the BEGINNING of
        // statement body of constructor
    }

    public Point2d (Point2d pt) {  // Another constructor
        x = pt.getX();
        y = pt.getY();
        …
Simple Java program: about 1000 symbols

4 Parsing algorithms
How long would it take to parse this?
    exhaustive algorithm: about 10^80 years (longer than the life of the universe)
    CYK algorithm: about 1 week!
Can we parse faster? No! CYK is the fastest known general-purpose parsing algorithm.
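For reference, a minimal sketch of the CYK idea in Python. It assumes a grammar in Chomsky normal form; the toy grammar for {a^n b^n : n ≥ 1} and the names cyk, UNARY and BINARY are illustrative assumptions, not taken from the slides. The three nested loops over substring length, start position and split point are where the cubic running time comes from.

```python
def cyk(word, unary, binary, start):
    """CYK recognizer for a grammar in Chomsky normal form.

    unary  maps a terminal to the variables A with a rule A -> terminal.
    binary lists triples (A, B, C), one per rule A -> B C.
    """
    n = len(word)
    if n == 0:
        return False  # this sketch ignores the empty word
    # table[i][l] holds the variables that derive word[i:i+l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, symbol in enumerate(word):
        table[i][1] = set(unary.get(symbol, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split]
                right = table[i + split][length - split]
                for head, first, second in binary:
                    if first in left and second in right:
                        table[i][length].add(head)
    return start in table[0][n]


# Hypothetical CNF grammar for {a^n b^n : n >= 1}:
#   S -> X B | X Y,  Y -> S B,  X -> a,  B -> b
UNARY = {"a": {"X"}, "b": {"B"}}
BINARY = [("S", "X", "B"), ("S", "X", "Y"), ("Y", "S", "B")]

print(cyk("aabb", UNARY, BINARY, "S"))
print(cyk("abab", UNARY, BINARY, "S"))
```

The table has about n^2 entries and each entry tries up to n split points, which is why a general-purpose parser of this kind is slow on inputs with around 1000 symbols.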

5 Another way of thinking
Scientist: Find an algorithm that can parse strings in any grammar
Engineer: Design your grammar so it has a very fast parsing algorithm

6 An example
Grammar:
    S → Tc (1)
    T → TA (2) | A (3)
    A → aTb (4) | ab (5)
input: abaabbc

Stack     Input      Action
(empty)   abaabbc    shift
a         baabbc     shift
ab        aabbc      reduce (5)
A         aabbc      reduce (3)
T         aabbc      shift
Ta        abbc       shift
Taa       bbc        shift
Taab      bc         reduce (5)
TaA       bc         reduce (3)
TaT       bc         shift
TaTb      c          reduce (4)
TA        c          reduce (2)
T         c          shift
Tc        (empty)    reduce (1)
S         (empty)

[Figure: parse tree of abaabbc, built bottom-up by the reductions above]

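7 Items
Grammar:
    S → Tc (1)
    T → TA (2)
    T → A (3)
    A → aTb (4)
    A → ab (5)
An item is a production with a • marking a position in its right-hand side, e.g. S → •Tc

Stack     Input      Action
(empty)   abaabbc    shift
a         baabbc     shift
ab        aabbc      reduce (5)
A         aabbc      reduce (3)
T         aabbc      shift
Ta        abbc

Idea of parsing algorithm: Try to match complete items to top of stack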

8 Some terminology
Same grammar, input (abaabbc) and shift-reduce trace as in slide 6, with these annotations on the trace:
    handle
    valid items: a•Tb, a•b
    valid items: T•a, T•c, aT•b

9 Outline of LR(0) parsing algorithm
As the string is being read, it is pushed on a stack.
The algorithm keeps track of all valid items.
The algorithm can perform two actions:
    shift: when no complete item is valid
    reduce: when there is one valid item, and it is complete

10 Running the algorithm
Grammar: A → aAb | ab
Derivation: A ⇒ aAb ⇒ aabb

Stack     Input     Valid items                               Action
(empty)   aabb      A → •aAb, A → •ab                         S
a         abb       A → a•Ab, A → a•b, A → •aAb, A → •ab      S
aa        bb        A → a•Ab, A → a•b, A → •aAb, A → •ab      S
aab       b         A → ab•                                   R
aA        b         A → aA•b                                  S
aAb       (empty)   A → aAb•                                  R
A         (empty)

11 How to update valid items
Notation: a, b: terminals; A, B: variables; X, Y: mixed symbols; α, β, δ: mixed strings

Initial set of valid items:
    S → •α for every production S → α
Updating valid items on “shift b”:
    A → α•bβ is updated to A → αb•β
    A → α•Xβ disappears if X ≠ b
After these updates, for every valid item A → α•Cβ and production C → δ, we also add C → •δ as a valid item.

12 How to update valid items
Updating valid items on “reduce β to B”:
First, we backtrack to the items that were valid before β was pushed on the stack.
Then, we apply the same rules as for “shift B” (as if B were a terminal):
    A → α•Bβ is updated to A → αB•β
    A → α•Xβ disappears if X ≠ B
    C → •δ is added for every valid item A → α•Cβ and production C → δ
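The update rules can be sketched in Python as two small set operations. This is an illustration under assumptions: items are encoded as (variable, right-hand side, dot position) triples, the function names closure, initial_items and advance are made up here, and the "also add C → •δ" rule is applied to the initial set as well. The grammar is A → aAb | ab from slide 10.

```python
# Grammar from the running example: A -> aAb | ab (start variable A).
GRAMMAR = {"A": [("a", "A", "b"), ("a", "b")]}
START = "A"


def closure(items, grammar):
    """Add C -> •δ for every item A -> α•Cβ and production C -> δ."""
    result = set(items)
    pending = list(result)
    while pending:
        head, body, dot = pending.pop()
        if dot < len(body) and body[dot] in grammar:
            variable = body[dot]
            for production in grammar[variable]:
                item = (variable, production, 0)
                if item not in result:
                    result.add(item)
                    pending.append(item)
    return frozenset(result)


def initial_items(grammar, start):
    """S -> •α for every production S -> α (closed under the rule above)."""
    return closure({(start, body, 0) for body in grammar[start]}, grammar)


def advance(items, symbol, grammar):
    """Update on "shift symbol"; a reduce to B reuses this with symbol = B."""
    moved = {(h, b, d + 1) for (h, b, d) in items
             if d < len(b) and b[d] == symbol}   # other items disappear
    return closure(moved, grammar)


def show(items):
    """Render a set of items in the dotted notation used on the slides."""
    return {f"{h} -> {''.join(b[:d])}•{''.join(b[d:])}" for (h, b, d) in items}


start_items = initial_items(GRAMMAR, START)
after_a = advance(start_items, "a", GRAMMAR)
after_ab = advance(after_a, "b", GRAMMAR)
after_aA = advance(after_a, "A", GRAMMAR)   # reduce ab to A: back up, then "shift A"
print(show(start_items), show(after_a), show(after_ab), show(after_aA))
```

Note that a reduce does not need its own update code: it backs up to the item set that was valid before the right-hand side was pushed and then calls advance with the variable.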

13 Viable item updates by NFA
States of the NFA will be items (plus a start state q0).
For every item S → •α we have a transition
    q0 --ε--> S → •α
For every item A → α•Xβ we have a transition
    A → α•Xβ --X--> A → αX•β
For every item A → α•Cβ and production C → δ we have a transition
    A → α•Cβ --ε--> C → •δ

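14 Example
A → aAb | ab
[Figure: NFA for this grammar. The part that survives in the transcript is the chain
    q0 --ε--> A → •aAb --a--> A → a•Ab --A--> A → aA•b --b--> A → aAb•]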

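15 Convert NFA to DFA
[Figure: DFA for A → aAb | ab]
    State 1: A → •aAb, A → •ab
    State 2: A → a•Ab, A → a•b, A → •aAb, A → •ab
    State 3: A → ab•
    State 4: A → aA•b
    State 5: A → aAb•
    Transitions: 1 --a--> 2, 2 --a--> 2, 2 --b--> 3, 2 --A--> 4, 4 --b--> 5; everything else goes to the "die" state
States correspond to sets of valid items.
Transitions are labeled by variables / terminals.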
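A minimal Python sketch of building this DFA directly as sets of items, without going through the NFA explicitly. The encoding of items as (variable, right-hand side, dot position) triples and the names closure, advance and build_dfa are assumptions made for illustration. Missing transitions play the role of the "die" state, and symbols are tried terminals-first only so that the state numbers come out as on the slide.

```python
from collections import deque

GRAMMAR = {"A": [("a", "A", "b"), ("a", "b")]}
START = "A"


def closure(items, grammar):
    """Follow the ε-transitions: add C -> •δ whenever the dot is before C."""
    result = set(items)
    pending = list(result)
    while pending:
        head, body, dot = pending.pop()
        if dot < len(body) and body[dot] in grammar:
            for production in grammar[body[dot]]:
                item = (body[dot], production, 0)
                if item not in result:
                    result.add(item)
                    pending.append(item)
    return frozenset(result)


def advance(items, symbol, grammar):
    """Follow the transitions labeled `symbol`, then the ε-transitions."""
    moved = {(h, b, d + 1) for (h, b, d) in items
             if d < len(b) and b[d] == symbol}
    return closure(moved, grammar)


def build_dfa(grammar, start):
    """Return (state numbering, transition table) of the item-set DFA."""
    symbols = sorted({s for bodies in grammar.values() for body in bodies for s in body},
                     key=lambda s: (s in grammar, s))   # terminals first
    first = closure({(start, body, 0) for body in grammar[start]}, grammar)
    numbers = {first: 1}
    transitions = {}
    queue = deque([first])
    while queue:
        state = queue.popleft()
        for symbol in symbols:
            target = advance(state, symbol, grammar)
            if not target:
                continue   # no entry: the DFA "dies" on this symbol
            if target not in numbers:
                numbers[target] = len(numbers) + 1
                queue.append(target)
            transitions[(numbers[state], symbol)] = numbers[target]
    return numbers, transitions


numbers, transitions = build_dfa(GRAMMAR, START)
print(len(numbers), transitions)
```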

16 Shift states and reduce states
Same DFA as in slide 15.
States 1, 2, 4 are shift states (no complete item).
States 3, 5 are reduce states (a single item, and it is complete).

17 Attempt at parsing with DFA
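Grammar: A → aAb | ab
Derivation: A ⇒ aAb ⇒ aabb

Stack     Input    DFA state    Action
(empty)   aabb     1            S
a         abb      2            S
aa        bb       2            S
aab       b        3            R
aA        b        ?

DFA states: 1: A → •aAb, A → •ab; 2: A → a•Ab, A → a•b (plus A → •aAb, A → •ab); 3: A → ab•; 4: A → aA•b
After the reduce the stack is aA, but the DFA has only followed the path for aab, so its state for aA is not known.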

18 Remember the state in stack!
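Grammar: A → aAb | ab
Derivation: A ⇒ aAb ⇒ aabb

Stack      Input     DFA state    Action
1          aabb      1            S
1a2        abb       2            S
1a2a2      bb        2            S
1a2a2b3    b         3            R
1a2A4      b         4            S
1a2A4b5    (empty)   5            R
1A         (empty)

DFA states: 1: A → •aAb, A → •ab; 2: A → a•Ab, A → a•b (plus A → •aAb, A → •ab); 3: A → ab•; 4: A → aA•b; 5: A → aAb•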
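The trace can be reproduced with a short Python sketch that keeps DFA states (sets of items) on the stack. This is an illustration under assumptions: the item encoding, the function names, the tuple representation of parse trees, and the acceptance rule (input consumed and the start variable left on top of the initial state, as in the final row "1A") are choices made here, not taken from the slides.

```python
GRAMMAR = {"A": [("a", "A", "b"), ("a", "b")]}
START = "A"


def closure(items, grammar):
    """Add C -> •δ for every item with the dot in front of a variable C."""
    result = set(items)
    pending = list(result)
    while pending:
        head, body, dot = pending.pop()
        if dot < len(body) and body[dot] in grammar:
            for production in grammar[body[dot]]:
                item = (body[dot], production, 0)
                if item not in result:
                    result.add(item)
                    pending.append(item)
    return frozenset(result)


def advance(items, symbol, grammar):
    """DFA transition on a terminal (shift) or a variable (after a reduce)."""
    moved = {(h, b, d + 1) for (h, b, d) in items
             if d < len(b) and b[d] == symbol}
    return closure(moved, grammar)


def parse(word, grammar=GRAMMAR, start=START):
    """Return a parse tree as nested tuples, or None if the word is rejected."""
    states = [closure({(start, body, 0) for body in grammar[start]}, grammar)]
    trees = []   # one subtree per symbol currently on the stack
    pos = 0
    while True:
        state = states[-1]
        complete = [(h, b) for (h, b, d) in state if d == len(b)]
        if complete and len(state) > 1:
            raise ValueError("S/R or R/R conflict: grammar is not LR(0)")
        if complete:
            # Reduce: pop one state per right-hand-side symbol, which puts
            # the state remembered from before the handle back on top.
            head, body = complete[0]
            cut = len(trees) - len(body)
            node = (head, *trees[cut:])
            del trees[cut:]
            del states[cut + 1:]
            target = advance(states[-1], head, grammar)
            if not target:
                done = head == start and len(states) == 1 and pos == len(word)
                return node if done else None
            states.append(target)
            trees.append(node)
        else:
            # Shift: push the next input symbol together with the new state.
            if pos == len(word):
                return None
            target = advance(state, word[pos], grammar)
            if not target:
                return None
            states.append(target)
            trees.append(word[pos])
            pos += 1


print(parse("aabb"))
```

Because every stack entry carries its state, a reduce never has to rerun the DFA from the bottom of the stack.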

19 Reconstructing the parse tree
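Grammar: A → aAb | ab
Derivation: A ⇒ aAb ⇒ aabb

Stack    Input     DFA state    Action
1        aabb      1            S
12       abb       2            S
122      bb        2            S
1223     b         3            R
124      b         4            S
1245     (empty)   5            R

DFA states: 1: A → •aAb, A → •ab; 2: A → a•Ab, A → a•b (plus A → •aAb, A → •ab); 3: A → ab•; 4: A → aA•b; 5: A → aAb•
[Figure: parse tree of aabb: root A with children a, A, b, where the inner A has children a, b]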

20 LR(0) grammars and deterministic PDAs
The parsing procedure can be implemented by a deterministic pushdown automaton.
A PDA is deterministic if in every state there is at most one possible transition for every input symbol and pop symbol, including ε.
Example: the PDA for w#w^R is deterministic, but the PDA for ww^R is not.
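To make the first example concrete, here is a small Python simulation of a deterministic PDA for w#w^R over {a, b}. The two control states "push" and "pop" and the function name are assumptions for illustration. In every configuration the next move is fully determined by the state, the input symbol and the top of the stack, which is exactly what fails for ww^R, where the machine would have to guess the middle.

```python
def accepts_marked_palindrome(word):
    """Simulate a deterministic PDA for { w#w^R : w in {a, b}* }."""
    stack = []
    state = "push"          # before the marker: push every symbol
    for symbol in word:
        if state == "push":
            if symbol == "#":
                state = "pop"            # the marker tells us where the middle is
            elif symbol in "ab":
                stack.append(symbol)
            else:
                return False             # symbol outside the alphabet
        else:
            # after the marker: each symbol must match the popped one
            if not stack or stack.pop() != symbol:
                return False
    return state == "pop" and not stack


print(accepts_marked_palindrome("ab#ba"))
print(accepts_marked_palindrome("ab#ab"))
```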

21 LR(0) grammars and deterministic PDAs
Not every PDA can be made deterministic.
Since PDAs are equivalent to CFGs, the LR(0) parsing algorithm must fail for some CFG, e.g. a CFG for
    L = {ww^R : w ∈ {a, b}*}
Why does the LR(0) parsing algorithm fail?

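22 Example 1
L = {ww^R : w ∈ {a, b}*}
A → aAa | bAb | ε
[Figure: NFA for this grammar, with start state q0 and ε-transitions; the items that survive in the transcript are A → •aAa and A → a•Aa]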

23 Example 1
L = {ww^R : w ∈ {a, b}*}
A → aAa | bAb | ε
[Figure: DFA for this grammar; the items that survive in the transcript are A → • and A → bAb•]
shift-reduce conflict

24 Example 1
L = {ww^R : w ∈ {a, b}*}
A → aAa | bAb | ε
input: abba
[Figure: DFA for this grammar, with two points reached while reading abba each marked "shift or reduce?"]

25 When you can’t LR(0) parse
The algorithm can perform two actions:
    shift (S): no complete item is valid
    reduce (R): there is one valid item, and it is complete
What if:
    some valid items are complete, some not: S / R conflict
    more than one valid complete item: R / R conflict
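These four cases can be checked mechanically once the sets of valid items are known. The Python sketch below is an illustration under assumptions: the item encoding, the function names, and the choice to report R/R when a state has both kinds of conflict are made up here. It classifies every DFA state for three grammars: A → aAb | ab, the grammar A → aAa | bAb | ε for ww^R, and A → aAa | bAb | # for w#w^R.

```python
from collections import Counter, deque


def closure(items, grammar):
    """Add C -> •δ for every item with the dot in front of a variable C."""
    result = set(items)
    pending = list(result)
    while pending:
        head, body, dot = pending.pop()
        if dot < len(body) and body[dot] in grammar:
            for production in grammar[body[dot]]:
                item = (body[dot], production, 0)
                if item not in result:
                    result.add(item)
                    pending.append(item)
    return frozenset(result)


def advance(items, symbol, grammar):
    moved = {(h, b, d + 1) for (h, b, d) in items
             if d < len(b) and b[d] == symbol}
    return closure(moved, grammar)


def item_sets(grammar, start):
    """All non-empty sets of valid items reachable from the initial set."""
    symbols = {s for bodies in grammar.values() for body in bodies for s in body}
    first = closure({(start, body, 0) for body in grammar[start]}, grammar)
    seen = {first}
    queue = deque([first])
    while queue:
        state = queue.popleft()
        for symbol in symbols:
            target = advance(state, symbol, grammar)
            if target and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def classify(state):
    complete = sum(1 for (_, body, dot) in state if dot == len(body))
    if complete == 0:
        return "shift"
    if len(state) == 1:
        return "reduce"
    if complete > 1:
        return "R/R conflict"   # reported even if incomplete items are present too
    return "S/R conflict"


def summary(grammar, start):
    return dict(Counter(classify(state) for state in item_sets(grammar, start)))


NESTED = {"A": [("a", "A", "b"), ("a", "b")]}
PALINDROME = {"A": [("a", "A", "a"), ("b", "A", "b"), ()]}     # () encodes ε
MARKED = {"A": [("a", "A", "a"), ("b", "A", "b"), ("#",)]}

for grammar in (NESTED, PALINDROME, MARKED):
    print(summary(grammar, "A"))
```

A grammar can be parsed by the LR(0) algorithm exactly when no state is reported as a conflict.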

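26 Example 2
L = {w#w^R : w ∈ {a, b}*}
A → aAa | bAb | #
[Figure: NFA for this grammar]
    q0 --ε--> A → •aAa, A → •bAb, A → •#
    A → •aAa --a--> A → a•Aa --A--> A → aA•a --a--> A → aAa•
    A → •bAb --b--> A → b•Ab --A--> A → bA•b --b--> A → bAb•
    A → •# --#--> A → #•
    A → a•Aa and A → b•Ab each have ε-transitions to A → •aAa, A → •bAb, A → •#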

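27 Example 2
L = {w#w^R : w ∈ {a, b}*}
A → aAa | bAb | #
[Figure: DFA for this grammar, with states numbered 1 to 6 on the slide. Its states are the item sets]
    A → •aAa, A → •bAb, A → •#   (start)
    A → a•Aa, A → •aAa, A → •bAb, A → •#
    A → b•Ab, A → •aAa, A → •bAb, A → •#
    A → aA•a
    A → bA•b
    A → aAa•
    A → bAb•
    A → #•
No S/R or R/R conflicts!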

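28 Example 2: parsing
[Figure: parse tree with leaves b a # a b, and a trace table with columns Stack, State and Action (S / R)]
Stack entries as they survive in the transcript: 1, 12, 122, 1226, 1223, 12234, 123, 1236
States as they survive in the transcript: 1, 2, 6, 3, 4, 5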

29 Hierarchy of context-free grammars
[Figure: nested classes of grammars, outermost first]
    context-free grammars: parse using CYK algorithm (slow)
    LR(∞) grammars
    LR(1) grammars: java, perl, python (to be continued…)
    LR(0) grammars: parse using LR(0) algorithm

