Lexical Analysis: DFA Minimization & Wrap Up

Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these materials for their personal use.

Automating Scanner Construction

RE → NFA (Thompson's construction)
- Build an NFA for each term
- Combine them with ε-moves

NFA → DFA (subset construction)
- Build the simulation

DFA → Minimal DFA (today)
- Hopcroft's algorithm

DFA → RE (not really part of scanner construction)
- All-pairs, all-paths problem
- Union together paths from s0 to a final state

[Figure: The Cycle of Constructions: RE → NFA → DFA → minimal DFA]

DFA Minimization: The Big Picture

- Discover sets of equivalent states
- Represent each such set with just one state

Two states are equivalent if and only if:
- The set of paths leading to them are equivalent
- ∀ α ∈ Σ, transitions on α lead to equivalent states (DFA)
- α-transitions to distinct sets ⇒ the states must be in distinct sets

This induces a partition P of S:
- Each s ∈ S is in exactly one set p_i ∈ P
- The algorithm iteratively partitions the DFA's states

DFA Minimization: Details of the Algorithm

- Group states into maximal-size sets, optimistically
- Iteratively subdivide those sets, as needed
- States that remain grouped together are equivalent

The initial partition, P0, has two sets: {F} and {Q-F}, where the DFA is D = (Q, Σ, δ, q0, F).

Splitting a set ("partitioning a set by a"):
- Assume q_a, q_b ∈ s, with δ(q_a, a) = q_x and δ(q_b, a) = q_y
- If q_x and q_y are not in the same set, then s must be split
- If q_a has a transition on a but q_b does not, a splits s
- One state in the final DFA cannot have two transitions on a

DFA Minimization: The Algorithm

    P ← {F, Q-F}
    while (P is still changing)
        T ← {}
        for each set S ∈ P
            for each α ∈ Σ
                partition S by α into S1 and S2
                T ← T ∪ S1 ∪ S2
        if T ≠ P then P ← T

Why does this work?
- The partition P ⊆ 2^Q
- It starts with the two subsets of Q, {F} and {Q-F}
- The while loop takes P_i to P_{i+1} by splitting one or more sets
- P_{i+1} is at least one step closer to the partition with |Q| sets
- There is a maximum of |Q| splits

Note that:
- Partitions are never combined
- The initial partition ensures that the final states stay intact

This is a fixed-point algorithm!
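As a concrete rendering of this fixed-point loop, here is a minimal Python sketch. The DFA encoding (a state set, an alphabet, a complete transition function as a nested dict, and a final-state set) and all names are illustrative assumptions, not code from the course; splitting here compares successor blocks for every symbol at once, which is one way to realize "partition S by α".

```python
# Minimal sketch of the fixed-point minimization algorithm above.
# Assumes a complete DFA: delta[q][a] is defined for every state q
# and symbol a. All names here are illustrative.

def minimize(states, alphabet, delta, finals):
    """Refine the partition {F, Q-F} until it stops changing."""
    partition = [set(finals), set(states) - set(finals)]
    partition = [block for block in partition if block]   # drop an empty block

    changed = True
    while changed:
        changed = False
        refined = []
        for block in partition:
            # Two states stay together only if, for every symbol,
            # their successors lie in the same block of the partition.
            groups = {}
            for q in block:
                key = tuple(
                    next(i for i, b in enumerate(partition) if delta[q][a] in b)
                    for a in alphabet
                )
                groups.setdefault(key, set()).add(q)
            refined.extend(groups.values())
            if len(groups) > 1:
                changed = True
        partition = refined
    return partition
```

Each surviving block becomes one state of the minimal DFA; note that blocks are only ever split, never merged, matching the argument above.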

Key Idea: Splitting S around α

[Figure: the original set S ⊆ Q; S has transitions on α to R, Q, and T]

The algorithm partitions S around α.

Key Idea: Splitting S around α (continued)

[Figure: the original set S ⊆ Q split into S1 and S2; the α-transitions into T and R leave S1, and S2 is everything in S - S1]

Could we split S2 further? Yes, but it does not help asymptotically.

DFA Minimization: Refining the Algorithm

As written, the algorithm examines every S ∈ P on each iteration:
- This does a lot of unnecessary work
- We only need to examine S if some T, reachable from S, has split

Reformulate the algorithm using a worklist:
- Start the worklist with the initial partition, {F} and {Q-F}
- When a split divides S into S1 and S2, place S2 on the worklist

This version looks at each S ∈ P many fewer times. It is a well-known, widely used algorithm due to John Hopcroft.

Hopcroft's Algorithm

    W ← {F, Q-F}; P ← {F, Q-F}       // W is the worklist, P the current partition
    while (W is not empty) do
        select and remove S from W    // S is a set of states
        for each α in Σ do
            let I_α ← δ_α⁻¹(S)        // I_α is the set of all states that can reach S on α
            for each R in P such that R ∩ I_α ≠ ∅ and R ⊄ I_α do
                partition R into R1 and R2:  R1 ← R ∩ I_α;  R2 ← R - R1
                replace R in P with R1 and R2
                if R ∈ W then replace R in W with R1 and add R2 to W
                else if |R1| ≤ |R2| then add R1 to W
                else add R2 to W
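Below is a hedged Python sketch of Hopcroft's algorithm following the pseudocode above. The DFA encoding (a complete delta as a nested dict) is the same assumption as before; partition blocks are frozensets so they can live inside Python sets, and the "add the smaller half" branch is the detail behind the algorithm's O(|Q| log |Q|) split count.

```python
# Sketch of Hopcroft's worklist algorithm; names and the DFA encoding
# are illustrative assumptions. Blocks are frozensets (hashable).

def hopcroft(states, alphabet, delta, finals):
    finals = frozenset(finals)
    others = frozenset(states) - finals
    P = {block for block in (finals, others) if block}   # current partition
    W = set(P)                                           # worklist

    while W:
        S = W.pop()                      # select and remove a set S from W
        for a in alphabet:
            # I is delta_a^{-1}(S): every state that reaches S on a
            I = {q for q in states if delta[q][a] in S}
            # split each block R that I cuts properly
            for R in [r for r in P if r & I and not r <= I]:
                R1, R2 = frozenset(R & I), frozenset(R - I)
                P.remove(R)
                P.update((R1, R2))
                if R in W:               # replace R in W with its two halves
                    W.remove(R)
                    W.update((R1, R2))
                else:                    # enqueue only the smaller half
                    W.add(R1 if len(R1) <= len(R2) else R2)
    return P
```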

A Detailed Example

Remember (a|b)*abb? (from the last lecture)

[Figure: our first NFA for (a|b)*abb, with states q0 through q4, an (a|b) loop, and final state q4; beside it, the subset-construction table]

Applying the subset construction: iteration 3 adds nothing to S, so the algorithm halts. A DFA state is accepting if its subset contains q4 (the final state).
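Since this slide applies the subset construction, a compact sketch of that construction may help. The NFA encoding here is an assumption: moves[(q, a)] lists the a-transitions out of q, eps[q] lists its ε-moves, and missing keys mean "no transition".

```python
# Sketch of the subset construction (NFA -> DFA); the NFA encoding
# via the `moves` and `eps` dicts is an illustrative assumption.
from collections import deque

def eps_closure(states, eps):
    """All states reachable from `states` using only epsilon-moves."""
    stack, closure = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in eps.get(q, ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return frozenset(closure)

def subset_construction(start, alphabet, moves, eps):
    d0 = eps_closure({start}, eps)        # the DFA's start state
    dfa, work = {}, deque([d0])
    while work:                           # halts when an iteration adds nothing
        S = work.popleft()
        if S in dfa:
            continue
        dfa[S] = {}
        for a in alphabet:
            step = {r for q in S for r in moves.get((q, a), ())}
            dfa[S][a] = eps_closure(step, eps)
            work.append(dfa[S][a])
    return d0, dfa
```

A DFA state S is accepting exactly when S contains an NFA final state, matching the "contains q4" rule above.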

A Detailed Example

The DFA for (a|b)*abb:

[Figure: the resulting DFA with states s0 through s4, deterministic a- and b-transitions, and final state s4]

- Not much bigger than the original
- All transitions are deterministic
- Use the same code skeleton as before

A Detailed Example

Applying the minimization algorithm to the DFA:

[Figure: minimization merges s0 and s2 into one state, leaving s1, s3, and the final state s4; the result is a four-state DFA]
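The merged machine is small enough to write down directly. Here is a sketch that encodes its transition table and simulates it; the name s02 for the merged state is an assumption made for readability.

```python
# The minimized DFA for (a|b)*abb as an explicit transition table.
# "s02" names the merged {s0, s2} state; s4 is the only accepting state.

DELTA = {
    "s02": {"a": "s1", "b": "s02"},
    "s1":  {"a": "s1", "b": "s3"},
    "s3":  {"a": "s1", "b": "s4"},
    "s4":  {"a": "s1", "b": "s02"},
}
START, FINALS = "s02", {"s4"}

def accepts(word):
    state = START
    for ch in word:
        state = DELTA[state][ch]   # deterministic: one move per character
    return state in FINALS

assert accepts("abb") and accepts("babb") and accepts("aababb")
assert not accepts("ab") and not accepts("abba")
```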

DFA Minimization

What about a(b|c)*? First, the subset construction:

[Figure: Thompson's NFA for a(b|c)* with states q0 through q9 joined by ε-moves, and the resulting DFA with states s0 through s3, where s1, s2, and s3 are final states]

DFA Minimization

Then, apply the minimization algorithm to produce the minimal DFA:

[Figure: the four-state DFA collapses to two states: an a-transition from s0 to s1, and a (b|c) loop on the final state s1]

In lecture 5, we observed that a human would design a simpler automaton than Thompson's construction and the subset construction did. Minimizing that DFA produces the one that a human would design!

Abbreviated Register Specification

Start with a regular expression:

    r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9

[Figure: The Cycle of Constructions: RE → NFA → DFA → minimal DFA]

Abbreviated Register Specification

Thompson's construction produces:

[Figure: an NFA with start state s0 and final state sf, joining the ten alternatives r0, r1, r2, …, r8, r9 with ε-moves]

To make it fit, we've eliminated the ε-transition between "r" and "0".

[Figure: The Cycle of Constructions: RE → NFA → DFA → minimal DFA]

Abbreviated Register Specification

The subset construction builds:

[Figure: a DFA with start state s0, an r-transition, and ten distinct final states sf0 through sf9, one per digit 0, 1, 2, …, 8, 9]

This is a DFA, but it has a lot of states …

[Figure: The Cycle of Constructions: RE → NFA → DFA → minimal DFA]

Abbreviated Register Specification

The DFA minimization algorithm builds:

[Figure: the minimal DFA: from s0, an r-transition, then a single transition on 0,1,2,3,4,5,6,7,8,9 into the lone final state sf]

This looks like what a skilled compiler writer would do!

[Figure: The Cycle of Constructions: RE → NFA → DFA → minimal DFA]

Limits of Regular Languages

Advantages of regular expressions:
- Simple and powerful notation for specifying patterns
- Automatic construction of fast recognizers
- Many kinds of syntax can be specified with REs

Example: an expression grammar

    Term → [a-zA-Z] ([a-zA-Z] | [0-9])*
    Op   → + | - | × | /
    Expr → (Term Op)* Term

Of course, this would generate a DFA …

If REs are so useful … why not use them for everything?
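The three token REs above can be written directly in Python's re notation; this is a sketch, with the grammar's × rendered as the ASCII *, and the names TERM, OP, and EXPR are assumptions.

```python
# The slide's token REs in Python re syntax (a sketch).
import re

TERM = r"[a-zA-Z][a-zA-Z0-9]*"      # [a-zA-Z] ([a-zA-Z] | [0-9])*
OP   = r"[+\-*/]"                   # + | - | × | /  (× written as *)
EXPR = re.compile(rf"(?:{TERM}{OP})*{TERM}")   # (Term Op)* Term

assert EXPR.fullmatch("a+b*c2")
assert not EXPR.fullmatch("a++b")
```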

Limits of Regular Languages

Not all languages are regular (RLs ⊂ CFLs ⊂ CSLs). You cannot construct DFAs to recognize these languages:
- L = { p^k q^k }   (parenthesis languages)
- L = { wcw^r | w ∈ Σ* }

Neither of these is a regular language (nor an RE). But this is a little subtle; you can construct DFAs for:
- Strings with alternating 0's and 1's:  (ε | 1)(01)*(ε | 0)
- Strings with an even number of 0's and 1's

REs can count bounded sets and bounded differences.
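As a quick check that the alternating-0s-and-1s language really is regular, the RE (ε | 1)(01)*(ε | 0) transcribes into Python's re notation, where the ε alternatives become ? operators (a sketch):

```python
# (eps | 1)(01)*(eps | 0) for strings of alternating 0's and 1's.
import re

ALT = re.compile(r"1?(?:01)*0?")

for s in ["", "0", "1", "0101", "1010", "10101"]:
    assert ALT.fullmatch(s)          # alternating: accepted
for s in ["00", "11", "0110"]:
    assert ALT.fullmatch(s) is None  # repeated digit: rejected
```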

What can be so hard?

Poor language design can complicate scanning.

Reserved words are important:
    if then then then = else; else else = then      (PL/I)

Insignificant blanks (Fortran and Algol 68):
    do 10 i = 1,25
    do 10 i = 1.25

String constants with special characters (C, C++, Java, …):
    newline, tab, quote, comment delimiters, …

Finite closures (Fortran 66 and Basic):
- Limited identifier length
- Adds states to count length

What can be so hard? (Fortran 66/77)

[Figure: a Fortran code fragment, not reproduced in this transcript]

How does a compiler scan this?
- The first pass finds and inserts blanks
- It can add extra words or tags to create a scannable language
- The second pass is a normal scanner

(Example due to Dr. F.K. Zadeck)

Building Faster Scanners from the DFA

Table-driven recognizers waste effort:
- Read (and classify) the next character
- Find the next state
- Assign to the state variable
- Trip through the case logic in action()
- Branch back to the top

The table-driven skeleton (see the Python rendering below):

    char ← next character
    state ← s0
    call action(state, char)
    while (char ≠ eof)
        state ← δ(state, char)
        call action(state, char)
        char ← next character
    if state is a final state then report acceptance
    else report failure

We can do better:
- Encode states and actions in the code
- Do transition tests locally
- Generate ugly, spaghetti-like code
- Take (many) fewer operations per input character
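Here is a runnable rendering of that skeleton, specialized to the r Digit Digit* register spec used on the next slide; the character classifier, the table encoding, and the error state se are assumptions for illustration. Note the per-character work the slide lists: classify, index the table, assign, repeat.

```python
# A table-driven recognizer for r digit digit* (a sketch).

def char_class(ch):
    """Classify the next character, as the skeleton's first step does."""
    if ch == "r":
        return "r"
    if ch.isdigit():
        return "digit"
    return "other"

DELTA = {                       # the transition table
    ("s0", "r"): "s1",
    ("s1", "digit"): "s2",
    ("s2", "digit"): "s2",
}
FINALS = {"s2"}

def table_scan(text):
    state = "s0"
    for ch in text:
        state = DELTA.get((state, char_class(ch)), "se")  # se: error state
        if state == "se":
            return False        # report failure
    return state in FINALS      # accept only if we stop in a final state

assert table_scan("r17") and not table_scan("r") and not table_scan("x1")
```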

Building Faster Scanners from the DFA

A direct-coded recognizer for r Digit Digit*:

    goto s0

    s0: word ← Ø
        char ← next character
        if (char = 'r') then goto s1
        else goto se

    s1: word ← word + char
        char ← next character
        if ('0' ≤ char ≤ '9') then goto s2
        else goto se

    s2: word ← word + char
        char ← next character
        if ('0' ≤ char ≤ '9') then goto s2
        else if (char = eof) then report success
        else goto se

    se: print error message
        return failure

- Many fewer operations per character
- Almost no memory operations
- Even faster with careful use of fall-through cases
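Python has no goto, so a faithful transcription isn't possible; this sketch expresses the same three states with structured control flow instead, keeping the direct-coded idea that each state's transition test is inlined rather than looked up in a table.

```python
# Direct-coded style for r digit digit*, with gotos replaced by
# straight-line tests and a loop (a sketch).

def direct_scan(text):
    i, n = 0, len(text)
    # state s0: the first character must be 'r'
    if i >= n or text[i] != "r":
        return False            # goto se
    i += 1
    # state s1: the next character must be a digit
    if i >= n or not text[i].isdigit():
        return False            # goto se
    i += 1
    # state s2: consume digits until end of input
    while i < n and text[i].isdigit():
        i += 1
    return i == n               # success only if we stopped at eof

assert direct_scan("r17") and not direct_scan("r") and not direct_scan("r1x")
```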

Building Faster Scanners

Hashing keywords versus encoding them directly:
- Some (well-known) compilers recognize keywords as identifiers and check them in a hash table
- Encoding keywords in the DFA is a better idea: O(1) cost per transition, and it avoids a hash lookup on each identifier

It is hard to beat a well-implemented DFA scanner.
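For contrast, here is a sketch of the hash-style approach the slide argues against: scan a word as a generic identifier, then classify it with a set lookup. The keyword list and names are assumptions; encoding the keywords into the DFA's states avoids this per-identifier lookup entirely.

```python
# Keyword classification by table lookup (the approach being compared).
KEYWORDS = {"if", "then", "else", "while"}

def classify(word):
    """Every scanned word pays one lookup to decide keyword vs identifier."""
    return ("keyword" if word in KEYWORDS else "identifier", word)

assert classify("then") == ("keyword", "then")
assert classify("thenx") == ("identifier", "thenx")
```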

Building Scanners

The point: all this technology lets us automate scanner construction.
- The implementer writes down the regular expressions
- The scanner generator builds the NFA, DFA, and minimal DFA, then writes out the (table-driven or direct-coded) code
- This reliably produces fast, robust scanners

For most modern language features, this works. You should think twice before introducing a feature that defeats a DFA-based scanner. The ones we've seen (e.g., insignificant blanks, non-reserved keywords) have not proven particularly useful or long-lasting.

Extra Slides Start Here

Some Points of Disagreement with EaC

"Table-driven scanners are not fast."
- EaC doesn't say they are slow; it says you can do better
- Scanning is a small part of the work in a compiler, so in most cases it cannot yield a large percentage improvement: decide where to spend your effort!

"Faster code can be generated by embedding the scanner in the code."
- This was shown for both LR-style parsers and for scanners in the 1980s; flex and its derivatives are an example

"Hashed lookup of keywords is slow."
- EaC doesn't say it is slow; it says that the effort can be folded into the scanner so that it has no extra cost
- Compilers like GCC use hash lookup. A word must fail in the lookup to be classified as an identifier; with collisions in the table, this can add up
- At any rate, the cost is unneeded, since the DFA can do it for O(1) cost per character. But again, what percentage of the total cost of compiling is this?


Key Idea: Splitting S around α

Find a partition block I that has an α-transition into S.

[Figure: blocks I, S, and R; the highlighted part of I must have an α-transition to some other state]