Lexical Analysis: Building Scanners
Comp 412, Fall 2010

Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.

Table-Driven Scanners

The common strategy is to simulate DFA execution: a table plus a skeleton scanner.
- So far, we have used a simplified skeleton; in practice, the skeleton is more complex:
  - character classification for table compression
  - building the lexeme
  - recognizing subexpressions
- Practice is to combine all the REs into one DFA
- Must recognize individual words without hitting EOF

    state ← s0
    while (state ≠ exit) do
        char  ← NextChar()          // read next character
        state ← δ(state, char)      // take the transition

[Figure: transition diagram of the DFA for r Digit Digit* (start state s0, an edge on ‘r’, then edges and a loop on 0 … 9 into the final state sf)]
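Below is a minimal sketch of this skeleton in C, written for a hypothetical three-state DFA that recognizes r Digit Digit* (register names). The state names, the full 256-column transition table, and the build_table()/member() helpers are illustrative assumptions, not code from the course materials.

    enum { S0 = 0, S1, S2, S_ERR, NUM_STATES };      /* S2 is the accepting state */

    static int delta[NUM_STATES][256];               /* full transition table */

    static void build_table(void) {
        /* default every entry to the error state */
        for (int s = 0; s < NUM_STATES; s++)
            for (int c = 0; c < 256; c++)
                delta[s][c] = S_ERR;
        delta[S0]['r'] = S1;                         /* s0 --r--> s1   */
        for (int c = '0'; c <= '9'; c++) {
            delta[S1][c] = S2;                       /* first digit    */
            delta[S2][c] = S2;                       /* Digit* loop    */
        }
    }

    /* membership test: is s in L(r Digit Digit*)? */
    static int member(const char *s) {
        int state = S0;
        while (*s != '\0' && state != S_ERR)
            state = delta[state][(unsigned char)*s++];  /* take the transition */
        return state == S2;
    }

After build_table() runs once, member("r17") returns 1 and member("rx") returns 0. The 4 × 256 table is wasteful; the character classification on the next slide shrinks it.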

Table-Driven Scanners: Character Classification

Group characters together by their actions in the DFA:
- Combine identical columns in the transition table δ
- Indexing δ by class shrinks the table

The idea works well in ASCII (or EBCDIC):
- compact, byte-oriented character sets
- limited range of values
It is not clear how the idea extends to larger character sets (Unicode).

    state ← s0
    while (state ≠ exit) do
        char  ← NextChar()          // read next character
        cat   ← CharCat(char)       // classify character
        state ← δ(state, cat)       // take the transition

The obvious algorithm for combining identical columns is O(|Σ|² |S|). Can you do better?
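Here is the same hypothetical register DFA with classification added: three character classes suffice, so the transition table shrinks from 256 columns to 3. The class names and table layout are illustrative assumptions.

    enum { CAT_R = 0, CAT_DIGIT, CAT_OTHER, NUM_CATS };
    enum { S0 = 0, S1, S2, S_ERR, NUM_STATES };      /* S2 accepts */

    static unsigned char char_cat[256];              /* CharCat as a lookup table */

    static const int delta[NUM_STATES][NUM_CATS] = {
        /*            'r'     digit   other  */
        /* S0    */ { S1,     S_ERR,  S_ERR },
        /* S1    */ { S_ERR,  S2,     S_ERR },
        /* S2    */ { S_ERR,  S2,     S_ERR },
        /* S_ERR */ { S_ERR,  S_ERR,  S_ERR },
    };

    static void build_cats(void) {
        for (int c = 0; c < 256; c++) char_cat[c] = CAT_OTHER;
        for (int c = '0'; c <= '9'; c++) char_cat[c] = CAT_DIGIT;
        char_cat['r'] = CAT_R;
    }

    static int member(const char *s) {
        int state = S0;
        while (*s != '\0' && state != S_ERR) {
            int cat = char_cat[(unsigned char)*s++];   /* classify character   */
            state = delta[state][cat];                 /* take the transition  */
        }
        return state == S2;
    }

Call build_cats() once at start-up; after that, member() behaves exactly like the full-table version.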

Table-Driven Scanners: Building the Lexeme

The scanner produces a syntactic category (part of speech), but most applications want the lexeme (the word itself), too. This problem is trivial: save the characters as they are read.

    state  ← s0
    lexeme ← empty string
    while (state ≠ exit) do
        char   ← NextChar()          // read next character
        lexeme ← lexeme + char       // concatenate onto lexeme
        cat    ← CharCat(char)       // classify character
        state  ← δ(state, cat)       // take the transition

Table-Driven Scanners: Choosing a Category from an Ambiguous RE

We want one DFA, so we combine all the REs into one.
- Some strings may fit the RE for more than one syntactic category
  - keywords versus general identifiers
  - we would like to encode both into the RE & recognize them
- The scanner must choose a category for ambiguous final states
  - classic answer: specify priority by the order of the REs (return the first)

Alternate implementation strategy (quite popular): build a hash table of keywords & fold keywords into identifiers, preloading the keywords into the hash table. This makes sense if
- the scanner will enter all identifiers in the table anyway, and
- the scanner is hand coded.
Otherwise, let the DFA handle them (O(1) cost per character); a separate keyword table can make matters worse. A sketch of the hash-table approach appears below.
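This is a minimal sketch of the fold-keywords-into-identifiers strategy, with a hypothetical token-kind enum and a linear scan standing in for a real hash table (which would be preloaded at scanner start-up):

    #include <string.h>

    enum kind { TOK_ID, TOK_IF, TOK_WHILE };

    static const struct { const char *word; enum kind kind; } keywords[] = {
        { "if", TOK_IF }, { "while", TOK_WHILE },
    };

    /* called once per identifier-shaped lexeme the DFA accepts */
    static enum kind classify(const char *lexeme) {
        for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
            if (strcmp(lexeme, keywords[i].word) == 0)
                return keywords[i].kind;       /* the keyword takes priority  */
        return TOK_ID;                         /* otherwise a plain identifier */
    }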

Table-Driven Scanners: Scanning a Stream of Words

Real scanners do not look for one word per input stream.
- We want the scanner to find all the words in the input stream, in order
- We want the scanner to return one word at a time
- Syntactic solution: we can insist on delimiters
  - blank, tab, punctuation, …
  - do you want to force blanks everywhere? in expressions?
- Implementation solution
  - run the DFA to error or EOF, then back up to the last accepting state

We need the scanner to return a token, not a boolean.
- A token is a pair ⟨part of speech, lexeme⟩
- Use a map from the DFA’s states to parts of speech (PoS)

Table-Driven Scanners: Handling a Stream of Words

    // recognize words
    state  ← s0
    lexeme ← empty string
    clear stack
    push(bad)
    while (state ≠ se) do
        char   ← NextChar()
        lexeme ← lexeme + char
        if state ∈ SA then
            clear stack
        push(state)
        cat   ← CharCat(char)
        state ← δ(state, cat)
    end

    // clean up final state
    while (state ∉ SA and state ≠ bad) do
        state ← pop()
        truncate lexeme
        roll back the input one character
    end

    // report the results
    if (state ∈ SA) then return ⟨PoS(state), lexeme⟩
    else return invalid

PoS: state → part of speech. This needs a clever buffering scheme, such as double buffering, to support the roll back. This scanner differs significantly from the ones shown in Chapter 2 of EaC 1e.
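The sketch below renders this scanner in C for the hypothetical register DFA used earlier. The stack of states, the BAD sentinel, and the cursor-based input interface are illustrative assumptions; a production scanner would buffer the input (e.g., double buffering) rather than hold it all in one string. The guard on BAD keeps a completely failed scan from rolling back past its starting position.

    #include <stddef.h>

    enum { S0, S1, S2, S_ERR, NUM_STATES };
    #define BAD (-1)                            /* sentinel "state" */

    static int is_accepting(int s) { return s == S2; }

    static int delta(int state, int c) {
        switch (state) {
        case S0: return c == 'r' ? S1 : S_ERR;
        case S1:
        case S2: return (c >= '0' && c <= '9') ? S2 : S_ERR;
        default: return S_ERR;
        }
    }

    /* scans one word starting at in[*pos]; returns its length, or -1 */
    static int next_word(const char *in, size_t *pos) {
        int stack[1024], top = 0;
        stack[top++] = BAD;                     /* push(bad)            */
        int state = S0;
        size_t start = *pos;

        while (state != S_ERR && in[*pos] != '\0') {
            int c = in[(*pos)++];               /* char <- NextChar()   */
            if (is_accepting(state)) top = 0;   /* clear stack          */
            stack[top++] = state;               /* push(state)          */
            state = delta(state, c);            /* take the transition  */
        }
        while (!is_accepting(state) && state != BAD) {
            state = stack[--top];               /* state <- pop()       */
            if (state != BAD) (*pos)--;         /* roll back one char   */
        }
        return is_accepting(state) ? (int)(*pos - start) : -1;
    }

On "r17+", next_word() consumes "r17+", pops once back to the accepting state, rolls the ‘+’ back, and returns 3; the caller extracts the lexeme from in[start .. start+2] and maps the final state to its part of speech.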

Avoiding Excess Rollback

Some REs can produce quadratic rollback.
- Consider ab | (ab)* c and its DFA
- Input “ababababc”
  - s0, s1, s3, s4, s3, s4, s3, s4, s5
- Input “abababab”
  - s0, s1, s3, s4, s3, s4, s3, s4, then roll back 6 characters
  - s0, s1, s3, s4, s3, s4, then roll back 4 characters
  - s0, s1, s3, s4, then roll back 2 characters
  - s0, s1, s3

This behavior is preventable.
- Have the scanner remember paths that fail on particular inputs
- A simple modification creates the “maximal munch scanner”

[Figure: the six-state DFA (s0 … s5) for ab | (ab)* c, with edges on a, b, and c — not too pretty]

Maximal Munch Scanner

    // recognize words
    state  ← s0
    lexeme ← empty string
    clear stack
    push(⟨bad, bad⟩)
    while (state ≠ se) do
        char     ← NextChar()
        InputPos ← InputPos + 1
        lexeme   ← lexeme + char
        if Failed[state, InputPos] then
            break
        if state ∈ SA then
            clear stack
        push(⟨state, InputPos⟩)
        cat   ← CharCat(char)
        state ← δ(state, cat)
    end

    // clean up final state
    while (state ∉ SA and state ≠ bad) do
        Failed[state, InputPos] ← true
        ⟨state, InputPos⟩ ← pop()
        truncate lexeme
        roll back the input one character
    end

    // report the results
    if (state ∈ SA) then return ⟨PoS(state), lexeme⟩
    else return invalid

    InitializeScanner()
        InputPos ← 0
        for each state s in the DFA do
            for i ← 0 to |input| do
                Failed[s, i] ← false
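Here is a C sketch of the maximal munch scanner. To stay short it reuses the hypothetical register DFA rather than ab | (ab)* c, and it checks Failed just before consuming a character instead of just after, which is equivalent bookkeeping; the flattened Failed array and the paired state/position stacks are illustrative assumptions.

    #include <stdbool.h>
    #include <stdlib.h>

    enum { S0, S1, S2, S_ERR, NUM_STATES };
    #define BAD (-1)

    static int is_accepting(int s) { return s == S2; }
    static int delta(int s, int c) {
        if (s == S0) return c == 'r' ? S1 : S_ERR;
        if (s == S1 || s == S2) return (c >= '0' && c <= '9') ? S2 : S_ERR;
        return S_ERR;
    }

    static bool *failed;      /* Failed[s][pos], flattened; space ∝ |input| */
    static size_t width;      /* |input| + 1 columns */

    static void init_scanner(size_t input_len) {
        width  = input_len + 1;
        failed = calloc(NUM_STATES * width, sizeof *failed);   /* all false */
    }

    /* scans one word starting at in[*pos]; returns its length, or -1 */
    static int next_word(const char *in, size_t *pos) {
        int sstk[1024]; size_t pstk[1024]; int top = 0;
        sstk[top] = BAD; pstk[top] = *pos; top++;
        int state = S0;
        size_t start = *pos;

        while (state != S_ERR && in[*pos] != '\0' &&
               !failed[state * width + *pos]) {      /* skip known dead ends */
            if (is_accepting(state)) top = 0;        /* clear stack          */
            sstk[top] = state; pstk[top] = *pos; top++;
            state = delta(state, in[(*pos)++]);      /* take the transition  */
        }
        while (!is_accepting(state) && state != BAD) {
            failed[state * width + *pos] = true;     /* remember the failure */
            top--;                                   /* pop ⟨state, pos⟩     */
            state = sstk[top];
            *pos  = pstk[top];                       /* roll back the input  */
        }
        return is_accepting(state) ? (int)(*pos - start) : -1;
    }

Each ⟨state, position⟩ pair can be marked in Failed at most once, which is how the scheme achieves the linear-time bound, at the cost of the Failed array.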

Maximal Munch Scanner (continued)

The scanner uses a bit array Failed to track dead-end paths.
- Initialize both InputPos & Failed in InitializeScanner()
- Failed requires space ∝ |input stream|
  - a clever implementation can reduce the space requirement

It avoids quadratic rollback.
- Produces an efficient scanner
- Can your favorite language cause quadratic rollback?
  - If so, the solution is inexpensive
  - If not, you might encounter the problem in other applications of these technologies

Thomas Reps, “‘Maximal munch’ tokenization in linear time”, ACM TOPLAS, 20(2), March 1998.

Table-Driven Versus Direct-Coded Scanners

Table-driven scanners make heavy use of indexing:
- read the next character
- classify it
- find the next state
- branch back to the top

    state ← s0
    while (state ≠ exit) do
        char  ← NextChar()
        cat   ← CharCat(char)       // index into CharCat
        state ← δ(state, cat)       // index into δ

Alternative strategy: direct coding.
- Encode the state in the program counter
  - each state is a separate piece of code
- Do the transition tests locally and branch directly to the next state
- Generates ugly, spaghetti-like code
- More efficient than the table-driven strategy
  - fewer memory operations, though possibly more branches
  - code locality, as opposed to random access into δ

Table-Driven Versus Direct-Coded Scanners: Overhead of Table Lookup

Each lookup in CharCat or δ involves an address calculation and a memory operation:
- CharCat(char) becomes
      @CharCat0 + char × w
  where w is the size of an element of CharCat
- δ(state, cat) becomes
      @δ0 + (state × cols + cat) × w
  where cols is the number of columns in δ and w is the size of an element of δ

The references to CharCat and δ expand into multiple operations, so there is a fair amount of overhead work per character. Avoid the table lookups and the scanner will run faster.
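The snippet below makes that address arithmetic explicit: it is the row-major indexing a compiler generates for delta[state][cat], written out by hand (the table size is illustrative):

    #define COLS 3                      /* columns in delta */

    int lookup(const int *delta0, int state, int cat) {
        /* address = @delta0 + (state * COLS + cat) * sizeof(int);
           C pointer arithmetic supplies the scaling by sizeof(int) */
        return *(delta0 + state * COLS + cat);
    }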

Building Faster Scanners from the DFA

A direct-coded recognizer for r Digit Digit* (note that s2 resets count to 1, so sout rolls back only the one character that carried the DFA past its last accepting state):

    start: accept ← se
           lexeme ← “”
           count  ← 0
           goto s0

    s0:    char   ← NextChar()
           lexeme ← lexeme + char
           count++
           if (char = ‘r’) then goto s1 else goto sout

    s1:    char   ← NextChar()
           lexeme ← lexeme + char
           count++
           if (‘0’ ≤ char ≤ ‘9’) then goto s2 else goto sout

    s2:    char   ← NextChar()
           lexeme ← lexeme + char
           count  ← 1
           accept ← s2
           if (‘0’ ≤ char ≤ ‘9’) then goto s2 else goto sout

    sout:  if (accept ≠ se) then
               for i ← 1 to count do RollBack()
               report success
           else report failure

Key points: fewer (and simpler) memory operations, no character classifier, and the generator can use multiple strategies for test & branch.

Building Faster Scanners from the DFA (continued)

Two further observations about the same direct-coded recognizer:
- If the end-of-state test is complex (e.g., it has many cases), the scanner generator should consider other schemes:
  - table lookup (with classification?)
  - binary search
- Direct coding the maximal munch scanner is easy, too.
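Here is the recognizer rendered in C, using goto so that each DFA state is a separate piece of code and the current state lives in the program counter. The one-string input interface and the next_char()/roll_back() helpers are illustrative assumptions.

    static const char *inp;                   /* current input position */
    static int  next_char(void) { return (unsigned char)*inp++; }
    static void roll_back(void) { inp--; }    /* back up one character  */

    /* returns the length of an "r Digit Digit*" word at s, or -1 */
    int recognize(const char *s) {
        int accept = 0, count = 0, c;         /* accept: reached s2?    */
        inp = s;
    s0: c = next_char(); count++;
        if (c == 'r') goto s1; else goto sout;
    s1: c = next_char(); count++;
        if (c >= '0' && c <= '9') goto s2; else goto sout;
    s2: c = next_char(); count = 1; accept = 1;
        if (c >= '0' && c <= '9') goto s2; else goto sout;
    sout:
        if (accept) {
            while (count-- > 0) roll_back();  /* push back unused chars */
            return (int)(inp - s);            /* length of the lexeme   */
        }
        return -1;                            /* never reached s2       */
    }

next_char() returns ‘\0’ at the end of the string; since ‘\0’ fails every test, control always falls into sout rather than reading past the terminator. For example, recognize("r17+") returns 3 and leaves the cursor on the ‘+’.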

What About Hand-Coded Scanners?

Many (most?) modern compilers use hand-coded scanners.
- Starting from a DFA simplifies design & understanding
- Avoiding the straitjacket of a tool allows flexibility
  - Computing the value of an integer
    - in lex or flex, many folks use sscanf() & touch the characters many times
    - a hand-coded scanner can use the old assembly trick and compute the value as the digits appear (see the sketch below)
  - Combining similar states (serial or parallel)
- Scanners are fun to write
  - compact, comprehensible, easy to debug, …
  - don’t get too cute (e.g., perfect hashing for keywords)
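A sketch of that trick: accumulate the integer’s value digit by digit while scanning, touching each character exactly once (overflow checking omitted for brevity; the cursor interface is an illustrative assumption):

    /* scan a run of digits at *cursor, returning its value and
       advancing the cursor past the number */
    long scan_int(const char **cursor) {
        const char *p = *cursor;
        long value = 0;
        while (*p >= '0' && *p <= '9')
            value = value * 10 + (*p++ - '0');   /* shift-and-add per digit */
        *cursor = p;
        return value;
    }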

Building Scanners: The Point

All of this technology lets us automate scanner construction.
- The implementer writes down the regular expressions
- The scanner generator builds the NFA, the DFA, the minimal DFA, and then writes out the (table-driven or direct-coded) code
- This reliably produces fast, robust scanners

For most modern language features, this works. You should think twice before introducing a feature that defeats a DFA-based scanner; the ones we have seen (e.g., insignificant blanks, non-reserved keywords) have not proven particularly useful or long lasting.

Of course, not everything fits into a regular language…