Automating Scanner Construction


Automating Scanner Construction RENFA (Thompson’s construction) Build an NFA for each term Combine them with -moves NFA DFA (subset construction) Build the simulation DFA Minimal DFA Hopcroft’s algorithm DFA RE All pairs, all paths problem Union together paths from s0 to a final state minimal DFA RE NFA DFA The Cycle of Constructions from Cooper & Torczon

RE NFA using Thompson’s Construction Key idea NFA pattern for each symbol & each operator Join them with  moves in precedence order S0 S1 a S3 S4 b NFA for ab  S0 S1 a NFA for a NFA for a | b S0 S1 S2 a S3 S4 b S5  S0 S1  S3 S4 NFA for a* a Ken Thompson, CACM, 1968 from Cooper & Torczon

Example of Thompson's Construction
Let's try a ( b | c )*:
1. NFAs for a, b, and c
2. NFA for b | c
3. NFA for ( b | c )*
[Figures: the intermediate NFAs for each step, states S0–S9]

Example of Thompson's Construction (continued)
4. NFA for a ( b | c )*
Of course, a human would design something simpler, such as a two-state automaton with a single a-transition followed by a self-loop on b | c. But we can automate production of the more complex one.
[Figures: the full Thompson NFA for a ( b | c )* and the simpler hand-built automaton]

NFA DFA with Subset Construction Need to build a simulation of the NFA Two key functions Move(si,a) is set of states reachable by a from si -closure(si) is set of states reachable by  from si The algorithm Start state derived from s0 of the NFA Take its -closure Work outward, trying each    and taking its -closure Iterative algorithm that halts when the states wrap back on themselves Sounds more complex than it is… from Cooper & Torczon

NFA DFA with Subset Construction The algorithm: s0 -closure(q0n ) while ( S is still changing ) for each si  S for each    s? -closure(move(si,)) if ( s?  S ) then add s? to S as sj T[si,]  sj Let’s think about why this works The algorithm halts: 1. S contains no duplicates (test before adding) 2. 2Qn is finite 3. while loop adds to S, but does not remove from S (monotone)  the loop halts S contains all the reachable NFA states It tries each character in each si. It builds every possible NFA configuration.  S and T form the DFA from Cooper & Torczon

NFA DFA with Subset Construction Example of a fixed-point computation Monotone construction of some finite set Halts when it stops adding to the set Proofs of halting & correctness are similar These computations arise in many contexts Other fixed-point computations Canonical construction of sets of LR(1) items Quite similar to the subset construction Classic data-flow analysis (& Gaussian Elimination) Solving sets of simultaneous set equations We will see many more fixed-point computations from Cooper & Torczon

NFA DFA with Subset Construction Remember ( a | b )* abb ? Applying the subset construction: Iteration 3 adds nothing to S, so the algorithm halts a | b q0 q1 q4 q2 q3  a b contains q4 (final state) from Cooper & Torczon

NFA DFA with Subset Construction The DFA for ( a | b )* abb Not much bigger than the original All transitions are deterministic Use same code skeleton as before s0 a s1 b s3 s4 s2 from Cooper & Torczon

Where Are We? Why Are We Doing This?
- RE → NFA (Thompson's construction): build an NFA for each term; combine them with ε-moves. ✓
- NFA → DFA (subset construction): build the simulation. ✓
- DFA → minimal DFA: Hopcroft's algorithm.
- DFA → RE: an all-pairs, all-paths problem; union together the paths from s0 to a final state.
Enough theory for today.
The Cycle of Constructions: RE → NFA → DFA → minimal DFA

Building Faster Scanners from the DFA
Table-driven recognizers waste a lot of effort on every character:
- read (and classify) the next character
- find the next state
- assign to the state variable
- trip through the case logic in action()
- branch back to the top of the loop

    char ← next character
    state ← s0
    while (char ≠ eof)
        state ← δ(state, char)
        call action(state, char)
        char ← next character
    if (state ∈ final states) then report acceptance
    else report failure

We can do better: encode the states and actions directly in the code, do the transition tests locally, and generate ugly, spaghetti-like code. The result takes (many) fewer operations per input character.
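The table-driven skeleton above looks like this in code. This sketch (table layout and names are my own) uses the DFA for ( a | b )* abb from the earlier slides; missing table entries play the role of the error state. Every character costs a table lookup, a state assignment, and a dispatch through `action` — exactly the per-character overhead the slide complains about.

```python
# Transition table for the (a|b)*abb DFA: s0={0}, s1={0,1},
# s2={0,2}, s3={0,3} (accepting).
DELTA = {
    ("s0", "a"): "s1", ("s0", "b"): "s0",
    ("s1", "a"): "s1", ("s1", "b"): "s2",
    ("s2", "a"): "s1", ("s2", "b"): "s3",
    ("s3", "a"): "s1", ("s3", "b"): "s0",
}
FINAL = {"s3"}

def action(state, char):
    # Per-state semantic action, e.g. appending char to the lexeme.
    pass

def recognize(word):
    state = "s0"
    for char in word:                      # read the next character
        state = DELTA.get((state, char))   # find the next state
        if state is None:
            return False                   # dead/error state
        action(state, char)                # dispatch through action()
    return state in FINAL
```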

Building Faster Scanners from the DFA
A direct-coded recognizer for r Digit Digit* performs many fewer operations per character and almost no memory operations; it can be made even faster with careful use of fall-through cases.

    goto s0
    s0: word ← Ø
        char ← next character
        if (char = 'r') then goto s1
        else goto se
    s1: word ← word + char
        char ← next character
        if ('0' ≤ char ≤ '9') then goto s2
        else goto se
    s2: word ← word + char
        char ← next character
        if ('0' ≤ char ≤ '9') then goto s2
        else if (char = eof) then report acceptance
        else goto se
    se: print error message
        return failure
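Since Python has no goto, the direct-coded recognizer above can be transliterated by letting the program counter carry the state (the function name and eof convention are my own assumptions): s0 and s1 become straight-line checks, and the s2 self-loop becomes a while loop. There is no transition table and no generic dispatch loop.

```python
def recognize_r_digits(text):
    """Direct-coded recognizer for r Digit Digit*.  Returns the
    lexeme on acceptance, None on failure (the se state)."""
    chars = iter(text)
    word = ""
    # s0: expect 'r'
    char = next(chars, None)            # None plays the role of eof
    if char != "r":
        return None                     # se: error
    # s1: expect at least one digit
    word += char
    char = next(chars, None)
    if char is None or not "0" <= char <= "9":
        return None                     # se: error
    # s2: consume digits; accept only at eof
    while char is not None and "0" <= char <= "9":
        word += char
        char = next(chars, None)
    if char is None:
        return word                     # report acceptance
    return None                         # trailing junk -> se
```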

Building Faster Scanners
Hashing keywords versus encoding them directly: some compilers recognize keywords as ordinary identifiers and then check them in a hash table (some well-known compilers do this!). Encoding the keywords in the DFA is a better idea: it costs O(1) per transition and avoids a hash lookup on each identifier. It is hard to beat a well-implemented DFA scanner.
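The first strategy on this slide, the one the slide argues against, is easy to sketch (the keyword set and function are illustrative, not from any particular compiler): scan every word with one identifier rule, then probe a hash table to reclassify keywords. The better alternative would fold each keyword's spelling into the DFA's states, so the classification falls out of the final state with no per-identifier probe.

```python
# Strategy 1: recognize keywords as identifiers, then check a hash
# table (a Python set) -- one extra lookup per identifier scanned.
KEYWORDS = {"if", "else", "while", "return"}

def classify(word):
    return "keyword" if word in KEYWORDS else "identifier"
```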