Building Readahead FSMs for Grammars


95.3002 Building Readahead FSMs for Grammars

95.3002 Goals for This Section

Goals: to understand what a readahead FSM does, to understand how to build one, and to understand what we need to build one: a slightly augmented grammar.

95.3002 What’s a Readahead FSM?

How does a parser work again? Find the right end of a handle (while munching inputs, stacking stuff on 3 stacks that grow on the right, and moving R, the indicator for the right end of a handle). Find the left end of a handle (while traversing the stack from right to left and moving L). Reduce to a nonterminal A (using the stack contents between L and R, build a new tree and replace everything by an A-token). Repeat until there is no more input (EndOfFile encountered); equivalently, until an accept table is reached.

How does a parser work again? Find the right end of a handle. Find the left end of a handle. Reduce to a nonterminal A. A readahead FSM is used to guide this process: an unterminated @G edge indicates that we can be at the right end of a handle for G. Example grammar: G {EndOfFile} -> a *. Its readahead FSM is 1 --'|-'--> 2 with an a-loop on 2 and an @G edge, where @G is a short form for Follow(G). A state is final if it has an @ transition. Most books use "goalpost" symbols '|-' and '-|' (instead of EndOfFile) for goals, but they don't discuss scanners with multiple end goalposts. Since a readahead FSM always has 1 initial state, and a state is final exactly when it has an @ transition, it's not worth marking which states are initial or final.

Note: when we write 2 --@G--> 3, where @G = Follow(G) = {a, b, c}, we mean the three transitions 2 --{a}--> 3, 2 --{b}--> 3, and 2 --{c}--> 3, since @G is lookahead information; i.e., a look attribute, not a read attribute.

If a grammar is more complicated, there is still exactly one readahead FSM for the whole grammar. Grammar: G {EndOfFile} -> A *. A -> a. Its readahead FSM: 1 --'|-'--> 2, an A-loop on 2 with @G, and 2 --a--> 3 with @A. It's convenient to add an extra production and call the result an augmented grammar: G' -> '|-' G {EndOfFile}. G -> A *. A -> a. This also causes @G' to be added to the readahead FSM.
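
The augmentation step above can be sketched in a few lines. The dictionary-of-productions encoding is my own, not the course's; only the idea (add one extra goal production in front) comes from the slides.

```python
# Sketch: augmenting a grammar with an extra goal production.
# A grammar maps a nonterminal to a list of right parts (symbol lists).
def augment(grammar, goal):
    """Add goal' -> '|-' goal {EndOfFile} in front of the original productions."""
    new_goal = goal + "'"
    augmented = {new_goal: [["'|-'", goal, "{EndOfFile}"]]}
    augmented.update(grammar)  # keep the original productions unchanged
    return augmented

g = {"G": [["A", "*"]], "A": [["a"]]}
print(augment(g, "G"))
```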

How do we use it to find the right end of a handle? Ignoring the mechanism involving peek and next to process the input, let's just assume the input is all there. Augmented grammar: G' -> '|-' G {EndOfFile}. G -> A *. A -> a. Readahead FSM: 1 --'|-'--> 2; an A-loop on 2 (@G); 2 --a--> 5 (@A); 2 --G--> 3; 3 --{EndOfFile}--> 4 (@G'). Input: |- a a EndOfFile. The readahead FSM finds the right end of the handle; see R. Trace |- a and reach state 5 with @A. Humans know where the left end is (the FSM doesn't), but it told us to reduce to A because of @A. A human (us, for now) will find the left end and reduce, giving |- A a EndOfFile. Let's repeat.

How do we use it to find the right end of a handle? (Same augmented grammar and readahead FSM.) Showing R only; at each step, restart from the beginning and go as far as you can:
|- a a EndOfFile: reached state 5 with @A.
|- A a EndOfFile: reached state 5 with @A again.
|- A A EndOfFile: reached state 2 with @G.
|- G EndOfFile: reached state 4 with @G'. This means STOP.
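
The restart-from-the-beginning search can be sketched in a few lines. The transition table transcribes the example FSM above; the dict encoding, the "EOF" token name, and the omission of lookahead-set checks are my simplifications.

```python
# Readahead FSM for G' -> '|-' G {EndOfFile}. G -> A *. A -> a.
trans = {
    (1, "|-"): 2, (2, "A"): 2, (2, "a"): 5,
    (2, "G"): 3, (3, "EOF"): 4,
}
# @-marks: state -> nonterminal to reduce to (lookahead sets omitted here).
reduce_to = {5: "A", 2: "G", 4: "G'"}

def find_handle(tokens):
    """Run from state 1 as far as possible; report the last @-marked state."""
    state, last = 1, None
    for i, tok in enumerate(tokens):
        if (state, tok) not in trans:
            break
        state = trans[(state, tok)]
        if state in reduce_to:
            last = (i, reduce_to[state])
    return last  # (index of the handle's right end, nonterminal to reduce to)

print(find_handle(["|-", "a", "a", "EOF"]))  # right end at the first a, reduce to A
```

After each reduction, the caller rewrites the token list and calls find_handle again, exactly as the slide's trace does; @G' signals STOP.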

We have performed a sequence of reductions: |- a a EndOfFile => |- A a EndOfFile => |- A A EndOfFile => |- G EndOfFile. Flipping the order gives the derivation G => AA => Aa => aa, which seems to replace the rightmost nonterminal at each step (called a right derivation). These kinds of parsers: 1. work left to right; 2. work bottom up; 3. simulate the reverse of a right derivation.

It is not necessary to restart from the beginning each time. (Same augmented grammar and readahead FSM.) Also showing the table number stack; look to the left of the handle for where to restart:
-1 |-2 a5 a EndOfFile: reached state 5 with @A; resume at 2 (to the left of a).
-1 |-2 A2 a5 EndOfFile: reached state 5 with @A again; resume at 2 (to the left of a).
-1 |-2 A2 A2 EndOfFile: reached state 2 with @G; resume at 2 (to the left of the first A).
-1 |-2 G3 EndOfFile: reached state 4 with @G'. This means STOP.

A more complex readahead FSM makes it clearer. (Same augmented grammar.) The FSM: 1 --'|-'--> 2; 2 --A--> 3; an A-loop on 3; 2 --G--> 4; 4 --{EndOfFile}--> 5 (@G'); a-transitions from 2 and 3 to 6 (@A); @G on 2 and 3. Look to the left of the handle for where to restart:
1 2 6: -1 |-2 a6 a EndOfFile: reached state 6 with @A; resume at 2 (to the left of a).
1 2 3 6: -1 |-2 A3 a6 EndOfFile: reached state 6 with @A again; resume at 3 (to the left of a).
1 2 3 3: -1 |-2 A3 A3 EndOfFile: reached state 3 with @G; resume at 2 (to the left of the first A).
1 2 4 5: -1 |-2 G4 EndOfFile: reached state 5 with @G'. This means STOP.

95.3002 Building Readahead FSMs

Building readahead FSMs: by hand via tracing, by hand via dot pushing, or algorithmically.

95.3002 Building Readahead FSMs By Hand Via Tracing

Building readahead FSMs by hand via tracing. Start with the augmented grammar (with its extra production) and expand the right part of the extra production into an FSM, adding @G' at the end since it is a G'-production. Augmented grammar: G' -> '|-' G {EndOfFile}. G -> A *. A -> a. Initial FSM: 1 --'|-'--> 2 --G--> 3 --{EndOfFile}--> 4 (@G'). Then pick another nonterminal in the FSM and expand it the same way; if they have all been expanded, you're done. YES, you need a loop: the G expansion is 0 or more A's followed by @G.

But there is 1 constraint: the FSM we build MUST be deterministic. So if there is already a path in the FSM for AbBc, and the new path to be added is AbBcDd@X, the new path must run along the old path, at the end of which we add Dd@X.

What to watch out for when creating loops. Let's expand G in the FSM 1 --'|-'--> 2 --G--> 3 --{EndOfFile}--> 4 (@G'). An expansion with 1 A (a single A-loop on 2, with @G) is more minimal. An expansion with 2 A's (2 --A--> 5 with an A-loop on 5, and @G on both 2 and 5) eliminates potential future conflicts.

Continuing with the same augmented grammar: after expanding G, the FSM is 1 --'|-'--> 2 --G--> 3 --{EndOfFile}--> 4 (@G') with 2 --A--> 5, an A-loop on 5, and @G on 2 and 5. Expanding the first A adds 2 --a--> 6 with @A. Expanding the second A is on the next slide.

Share states or not? When expanding the second A, it is more minimal to share state 6 (adding 5 --a--> 6 with @A) rather than creating a new state 7.

Creating a Readahead FSM by Hand via Tracing: Another Example

Another example. Grammar: E {EndOfFile} -> E + T | E - T | T . T -> T * P | T / P | P . P -> '(' E ')' | i . Step 1: create the augmented grammar: E' -> '|-' E {EndOfFile} . E -> E + T | E - T | T . T -> T * P | T / P | P . P -> '(' E ')' | i . Step 2: create the initial expansion of E': 1 --'|-'--> 2 --E--> 3 --{EndOfFile}--> 4 (@E').

Another example. All subsequent steps: pick a nonterminal to expand, expand it, and mark it expanded with a checkmark. Pick E.

Expanding E (keep it deterministic: the first E of E + T was already there): add + T @E (states 5, 6), - T @E (states 7, 8), and T @E (state 9). It looks like states 5 and 7 are EQUAL, and 6 and 8 are EQUAL, so don't duplicate.

Without duplicating states: the - transition goes to state 5 (shared with +), and the E -> T alternative becomes T @E (state 7). Keep it deterministic: the first E of E + T was already there.

A more compact way of drawing it: label the shared transition '+, -' (to state 5), with T @E from there (state 6) and the separate T @E branch (state 7).

(E checked.) Pick T.

Expanding T (keep it deterministic: the first T of T * P was already there): add *, / P @T (states 8, 9) and P @T (state 10).

Move states up to make room, and pick P.

Expand P: add '(' E ')' @P (states 11, 12, 13) and i @P (state 14).

(E, T, and P checked.) Pick E again: this time the E inside the parentheses.

Avoid duplicating if it's going to be exactly the same: the inner E gets T (to 7) and +, - (to 5) transitions that reuse existing states. Keep it deterministic: the first E of E + T was already there.

Pick T again.

Avoid duplicating if it's going to be exactly the same: add *, / (to 8) and P (to 10) transitions that reuse existing states. Keep it deterministic: the first T of T * P was already there.

Make some room.

Pick P again.

Expanding this P adds '(' (to 11) and i (to 14) transitions that reuse existing states.

Pick the next P.

Expanding it likewise adds '(' (to 11) and i (to 14) transitions.

Pick T again.

Expanding T: T * P is already there and T / P is already there; only the P branch is not there, so add P (to 10), plus '(' (to 11) and i (to 14) where needed.

Again, T * P already there, T / P already there, P not there. Pick the last P.

Expanding it adds '(' (to 11) and i (to 14). Every nonterminal has been checked, SO WE ARE DONE.

This is the readahead FSM for the grammar E -> E + T | E - T | T . T -> T * P | T / P | P . P -> '(' E ')' | i . It has states 1 through 14, built entirely by tracing and state sharing.

What's wrong with this technique? It is impossible to automate, since deciding where to loop back is not so obvious. Doing it by hand for a toy grammar is doable, but not for a real grammar: the grammar for C++ is estimated to be about 1000 productions.

95.3002 Building Readahead FSMs By Hand Via Dot Pushing

Building readahead FSMs by dot pushing. Introduce the notion of where you are in a production by using a dot. Rule 1: dots are allowed only on transition labels and at the VERY END, not in front of metasymbols or brackets. Example, for production A -> a b c* d: A -> .a b c* d at the beginning; A -> a .b c* d after moving the dot right past a; A -> a b .c* .d after moving the dot right past b (2 dots represent 2 dotted productions); A -> a b c* d. after moving the dot right past d. You need a clear rule for moving a dot right.

Moving the dot right. Rule 2: move a dot right by determining which symbols can come after (you need to understand regular expressions to do this); it may move to more than one place, and because regular expressions can be complex, it can be hard to do. For A -> .a (b | c?)* d at the beginning, moving the dot right past a gives A -> a (.b | .c?)* .d, a more compact way of showing the valid positions A -> a (.b | c?)* d, A -> a (b | .c?)* d, and A -> a (b | c?)* .d. After moving the dot right past c (ignoring b and d), we again get A -> a (.b | .c?)* .d: because of the *, you can be back at .b and .c if you iterate again AND at .d if you don't.

Moving the dot right: now try A -> .a (b | c?)* d - .a b* at the beginning; where does the dot go after a? It can be done, but it's hard in general. Can you write a program to do it? That's why existing compiler books only allow grammars with productions of the form A -> abc | aBcde | g | PQ; i.e., strings separated by "|". Such grammars don't allow arbitrarily long handles; so even if those parsers automatically constructed trees (they don't), you could not get a FUNCTION PARAMETER list with an arbitrary number of children.

We need 2 building rules to proceed. Push Right Rule (.a): move the dot over the a to all the places it can go; notation for pushing right over a: →a. Push Down Rule (.A): if the symbol is a nonterminal A, move the dot in front of each "starting" symbol on the right side of each A-production; it's possible to get a dot at the end if there is nothing to its left. Notation for pushing down below A: ↓A.

What's the plan? Use a set of dotted productions to represent a readahead state: first the set obtained before down operations (dotted production 1, dotted production 2, ...), then the set added by down operations (more dotted productions). Doing something repeatedly is sometimes called taking the closure. Two states are equal if they contain exactly the same dotted productions; i.e., the same productions with the dots at exactly the same places.

An example. To make sure there is no confusion, let's use a really big dot. Augmented grammar: G' -> '|-' G {EndOfFile}. G -> A *. A -> a. Two unequal states: {G' -> '|-' . G {EndOfFile}; G -> . A *.; A -> . a} (3 productions, 4 dots; G -> . A *. has two dots because of the *) versus {G -> . A *.; A -> . a} (2 productions, 3 dots).

An algorithm:
1. Create a "states" collection holding the first state, built from the extra production of the augmented grammar.
2. Pick the next state in the collection to process.
3. Apply downs.
4. Make a pass to determine the transitions.
5. For each transition, construct a potential successor by applying right. If it already exists (there is one equal to it already there), use the existing one; otherwise, add the potential successor to the collection.
This stops once all states have been processed.
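
The steps above can be sketched as a worklist algorithm. The encoding is mine: items stand for right-part FSM states (or dotted productions), down maps an item to the items its downs add, and right maps an item to its labeled successors; the toy instance mirrors the G'/G/A example with right-part states 1-7.

```python
def close(items, down):
    """Repeatedly apply the down relation ('taking the closure')."""
    result = set(items)
    work = list(items)
    while work:
        p = work.pop()
        for q in down.get(p, ()):
            if q not in result:
                result.add(q)
                work.append(q)
    return frozenset(result)

def build(start_item, down, right):
    """Worklist construction: states are closed sets of items."""
    start = close({start_item}, down)
    states, edges, work = [start], {}, [start]
    while work:
        state = work.pop()
        by_label = {}                      # group rights by transition name
        for p in state:
            for label, q in right.get(p, ()):
                by_label.setdefault(label, set()).add(q)
        for label, succ_items in by_label.items():
            succ = close(succ_items, down) # potential successor, then downs
            if succ not in states:         # reuse an equal existing state
                states.append(succ)
                work.append(succ)
            edges[(state, label)] = succ
    return states, edges

# Toy instance: G' -> 1 -'|-'- 2 -G- 3 -{EOF}- 4; G -> 5 (A-loop); A -> 6 -a- 7.
down = {2: [5], 5: [6]}   # down[p] = initial right-part states added below p
right = {1: [("'|-'", 2)], 2: [("G", 3)], 3: [("{EOF}", 4)],
         5: [("A", 5)], 6: [("a", 7)]}
states, edges = build(1, down, right)
print(len(states))  # 6 readahead states, matching the example
```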

The algorithm in pictures. The states collection is split into processed states (downs all done) and unprocessed states (no downs done); both are empty initially except for the start state. Step 1: pick a state to process and fill in its downs. Step 2: determine the transition names; this guarantees the FSM is deterministic. Step 3: compute the potential successors (rights, no downs), e.g., one each for transitions a, A, and d. Step 4: if a successor is new, add it to the states collection (on the unprocessed side); if it's there already, refer to the existing one.

Creating a readahead FSM: processing state 1. Augmented grammar: G' -> '|-' G {EndOfFile}. G -> A *. A -> a. State 1 is {G' -> .'|-' G {EndOfFile}}. Step 1 (fill in downs, using ↓): can't, since no dot sits in front of a nonterminal. Steps 2 and 3 (determine transition names and compute successors, using →): there is one, namely '|-', giving {G' -> '|-' . G {EndOfFile}}. Step 4: it's NEW, so it becomes state 2.

Processing state 2. Step 1: fill in downs (using ↓) until there are no more downs to do. State 2 is {G' -> '|-' . G {EndOfFile}} plus G -> . A *. (2 dots, because of the *) and A -> . a.

Processing state 2. Step 2: determine the transition names: G, A, a, and @G (from the dot at the end of G -> . A *.).

Processing state 2. Step 3: compute the potential successors (no downs): the G-successor {G' -> '|-' G {. EndOfFile}}, the A-successor {G -> . A *.}, and the a-successor {A -> a .}. All are new, so they become states 3, 4, and 5.

Processing state 3, {G' -> '|-' G {. EndOfFile}}. Step 1 (fill in downs): nothing. Step 2: the only transition name is {EndOfFile}. Step 3: the successor {G' -> '|-' G {EndOfFile}.} is NEW, so it becomes state 6; working ahead, it gets @G'.

Processing state 4. Step 1: fill in downs (using ↓) until there are no more downs to do. State 4 is {G -> . A *.} plus A -> . a.

Processing state 4. Step 2: determine the transition names: A, a, and @G (from the dot at the end).

Processing state 4. Step 3: compute the potential successors (no downs). The A-successor {G -> . A *.} already exists as state 4 (compare only the portion without downs), and the a-successor {A -> a .} already exists as state 5.

Processing state 5, {A -> a .}. Step 1 (fill in downs): nothing. Step 2: the only transition is @A. Steps 3 and 4: nothing to do.

Drawing it all together. State 1 {G' -> .'|-' G {EndOfFile}} --'|-'--> State 2 {G' -> '|-' . G {EndOfFile}; G -> . A *.; A -> . a}. State 2 --G--> State 3 {G' -> '|-' G {. EndOfFile}} --{EndOfFile}--> State 6 {G' -> '|-' G {EndOfFile}.} (@G'). State 2 --A--> State 4 {G -> . A *.; A -> . a} (@G, also on state 2). State 2 --a--> State 5 {A -> a .} (@A). State 4 --A--> State 4, and State 4 --a--> State 5.

Discarding the dotted productions and numbering, this can be redrawn: 1 --'|-'--> 2; 2 --G--> 3; 3 --{EndOfFile}--> 6 (@G'); 2 --A--> 4 with an A-loop on 4 (@G on 2 and 4); 2 --a--> 5 and 4 --a--> 5 (@A on 5).

Redrawn: the readahead FSM via dot pushing and the readahead FSM via tracing are the same FSM (6 states), though drawn slightly differently.

Attaching a type to each state: every state of the readahead FSM gets the type Ra (Ra = Readahead).

What's wrong with this technique? The push right rule →a is difficult to automate because the right side is a regular expression; the rest can be automated. There is no problem with looping states (taken care of automatically by the algorithm). Doing it by hand for a toy grammar is doable, but still not possible for a real grammar: the grammar for Java is probably a little less than 1000 productions.

95.3002 Building Readahead FSMs Algorithmically

What's different? Just that we are using productions with FSMs in the right part. We are still using →a and ↓A, but we will now call them relations. Let's redefine the relations more exactly; with the relations, we can describe the algorithm concisely.

2 relations. Is a relation really a set? The right relation →X = {(p, q) : there is a production right part of the form A -> ... p --X--> q ...}; it says how to go right. The down relation ↓A = {(p, r) : there is a production right part of the form ? -> ... p --A--> q ..., and r is an initial state of a right part of the form A -> r ...}; it says how to go down.

Relations

What's a relation? A relation is a set of pairs; e.g., denote the equality relation on the non-negative integers as =. You're already familiar with =, but perhaps not with this way of looking at it: = is {[0,0], [1,1], [2,2], ...}, with a domain and a range. Denote the "add 1" relation on the non-negative integers as Plus1; then Plus1 = {[0,1], [1,2], [2,3], ...}. We are more used to writing a = b rather than "[a,b] is in =".

What does it mean to apply a relation? Applying a relation R to a set A (denoted AR) means constructing a new set B such that if p is in A and [p,q] is in R (or p R q), then q is in B. So {10, 40} Plus1 = {11, 41}. You can apply a relation to the result too: {10, 40} Plus1 Plus1 = {11, 41} Plus1 = {12, 42}. More generally, you can use regular expressions on the relations and even inverses: Plus1-1 = {(q,p) | (p,q) is in Plus1}. Examples: {10, 40} (Plus1 | Plus1 Plus1) = {11, 41, 12, 42}; {10, 40} Plus1? = {10, 40, 11, 41}; {10, 40} Plus1-1 = {9, 39}.
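
Relation application is just set manipulation, so it can be sketched directly. Plus1 is truncated to a finite range here; the function names are mine.

```python
# A relation as a set of pairs, and applying it to a set of elements.
Plus1 = {(n, n + 1) for n in range(100)}  # finite slice of the "add 1" relation

def apply(A, R):
    """A R = { q : p in A and (p, q) in R }."""
    return {q for p in A for (p2, q) in R if p2 == p}

def inverse(R):
    """R^-1 = { (q, p) : (p, q) in R }."""
    return {(q, p) for (p, q) in R}

print(apply({10, 40}, Plus1))                 # {11, 41}
print(apply(apply({10, 40}, Plus1), Plus1))   # {12, 42}
print(apply({10, 40}, inverse(Plus1)))        # {9, 39}
```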

A note about relations. Given a relation R and the fact that a R b: the set {a} R contains b, among others, and the set {b} R-1 contains a, among others. For example, if a R b, a R c, and d R b, then {a} R = {b, c} and {b} R-1 = {a, d}. More sophisticated variations: {a} R* = {a, b, c} (a is already there; b and c come from {a} R) and {a, d} R+ = {b, c} (b also comes from {d} R).
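
The starred application {a} R* can be sketched as a fixpoint computation; the helper name is mine, and the example data matches the slide's a R b, a R c, d R b.

```python
def star(A, R):
    """A R* : everything in A, plus everything reachable via R, repeatedly."""
    result = set(A)
    work = list(A)
    while work:
        p = work.pop()
        for (p2, q) in R:
            if p2 == p and q not in result:
                result.add(q)
                work.append(q)
    return result

R = {("a", "b"), ("a", "c"), ("d", "b")}
print(star({"a"}, R))  # {'a', 'b', 'c'}
```

This is exactly the shape of "fill in downs until there are no more downs to do": closure is ↓* applied as a fixpoint.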

A small extension: when we write the right relation without indicating an X, as in →, we mean the union of →X over all X (equivalently, →X for any X). Likewise, when we write the down relation without indicating an A, as in ↓, we mean the union of ↓A over all A.

Back to Building Readahead FSMs

Recall the 2 relations. The right relation →X = {(p, q) : there is a production right part of the form A -> ... p --X--> q ...}, to go right. The down relation ↓A = {(p, r) : there is a production right part of the form ? -> ... p --A--> q ..., and r is an initial state of a right part of the form A -> r ...}, to go down.
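
Both relations can be extracted mechanically from the right-part FSMs. The per-production encoding below is my own; the state numbers follow the example grammar's right-part states (1-7) from the next slide.

```python
# Deriving the right and down relations from right-part FSMs.
productions = {
    "G'": {"initial": 1, "trans": [(1, "'|-'", 2), (2, "G", 3), (3, "{EOF}", 4)]},
    "G":  {"initial": 5, "trans": [(5, "A", 5)]},   # A-loop: G -> A *.
    "A":  {"initial": 6, "trans": [(6, "a", 7)]},
}

right = {}  # X -> { (p, q) : p --X--> q in some right part }
down = {}   # A -> { (p, r) : p --A--> q somewhere, r initial for A }
for prod in productions.values():
    for p, x, q in prod["trans"]:
        right.setdefault(x, set()).add((p, q))
        if x in productions:  # x is a nonterminal, so it also yields a down pair
            down.setdefault(x, set()).add((p, productions[x]["initial"]))

print(right["A"])  # {(5, 5)}
print(down["G"])   # {(2, 5)}
```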

Implications: our example grammar now looks like G' -> 1 --'|-'--> 2 --G--> 3 --{EndOfFile}--> 4; G -> 5 with an A-loop on 5; A -> 6 --a--> 7. Let me call these right part states. We already have an algorithm that does this conversion, and we use the same algorithm as before, substituting right part states for dotted productions.

Instead of the dotted-production picture from before (states 1 through 6, each a set of dotted productions), we build the same states out of right part states. These are all new states, because no other state has the same set; we need to build their successors too.

It looks like this. With the right part states 1-7 above: State 1 = {1}; State 2 = {2, 5, 6}; State 3 = {3}; State 4 = {5, 6}; State 5 = {7}; State 6 = {4}; with '|-' from State 1 to State 2, then G to State 3, {EndOfFile} to State 6 (@G'), A to State 4 (@G), and a to State 5 (@A). Manually check it. What do the @ symbols mean? Do the states contain state objects or state numbers?

Programming considerations. The readahead state should contain state objects (not state numbers). That way, you can get the transitions you need directly from the right part state objects to compute right and down, and you can even renumber the states without affecting anything. If you insist on using state numbers, you will need a scheme to get the right part state objects from the state numbers ALL OVER THE PLACE.

This suggests a class hierarchy: FiniteStateMachineState (isInitial, isFinal, transitions) with subclass ReadaheadState (withoutDowns, withDowns); past names for these were "unclosured" and "closured". Example: State 2 has withoutDowns = {2} and withDowns = {2, 5, 6}. The act of repeatedly applying down is called "taking the closure" in textbooks. It's easier to compute right from 1 collection than from 2 independent collections. In Smalltalk: withDowns := withoutDowns shallowCopy.

Other programming considerations. You don't need set objects. orderedCollection addIfIdenticalAbsent: anObject, orderedCollection includesIdentical: anObject, and orderedCollection includesAllIdentical: aCollection use ==, whereas orderedCollection includes: anObject and orderedCollection includesAll: aCollection use =. Consider an = implemented in class ReadaheadState:

= anotherReadaheadState
    self withoutDowns size = anotherReadaheadState withoutDowns size ifFalse: [^false].
    (self withoutDowns includesAllIdentical: anotherReadaheadState withoutDowns) ifFalse: [^false].
    (anotherReadaheadState withoutDowns includesAllIdentical: self withoutDowns) ifFalse: [^false].
    ^true
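
The identity-based comparison can be sketched in Python as well; the class names are stand-ins for the hierarchy above, and id() plays the role of Smalltalk's ==.

```python
# Sketch: readahead states compare equal when their withoutDowns collections
# hold identically the same right-part state objects (order-independent).
class RightPartState:
    """Stand-in for a right-part FSM state object."""
    pass

class ReadaheadState:
    def __init__(self, without_downs):
        self.without_downs = list(without_downs)
        self.with_downs = list(without_downs)  # closure later adds to this copy

    def __eq__(self, other):
        mine = {id(s) for s in self.without_downs}    # identity, not value
        theirs = {id(s) for s in other.without_downs}
        return mine == theirs

a, b = RightPartState(), RightPartState()
print(ReadaheadState([a, b]) == ReadaheadState([b, a]))  # True: same objects
print(ReadaheadState([a]) == ReadaheadState([a, b]))     # False
```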

When you implement this, don't bother with the @ transitions: you can tell what they are from the right part states (the ones that are final). Same picture as before: State 1 = {1}, State 2 = {2, 5, 6}, State 3 = {3}, State 4 = {5, 6}, State 5 = {7}, State 6 = {4}.

Review: we described an algorithm that builds a readahead FSM that works from left to right to find a handle. The readahead FSM was built using right and down operations; i.e., →a and ↓A. Can we describe the steps of the process in terms of the relations?

The table building algorithm. Let IG' be the initial goal dotted production (or equivalent) for an augmented grammar. The initial readahead state before closure is {IG'}; the initial readahead state after closure is {IG'} ↓*. Let R be any readahead state after closure; i.e., a set of dotted productions (or equivalent). The M-successor of R before closure is R →M, provided this set has something in it (i.e., something in R has an M-successor). The M-successor of R after closure is R →M ↓*, provided R →M has something in it.

What's wrong with this technique? Nothing, as far as it goes. But in the next section we will want to build a readback FSM, and if we don't prepare for that task DURING THE PROCESS of building a readahead FSM, it will become difficult. This means you should NOT start to implement this YET; let's carry on with the next task. PS: scanners don't need readback, so you could implement it now for scanners. But why implement it twice? Wait for the parser version and use it for scanners too.