1 CIS 461 Compiler Design and Construction Fall 2012 slides derived from Tevfik Bultan et al. Lecture-Module 5 More Lexical Analysis.

2 First Phase: Lexical Analysis (Scanning)
The scanner maps a stream of characters into tokens, the basic units of syntax.
–The characters that form a word are its lexeme
–Its syntactic category is called its token
The scanner discards white space and comments.
The scanner works as a subroutine of the parser.
[Diagram: source code → Scanner → (token) → Parser → IR; the parser asks the scanner to "get next token"; both report errors.]

3 Lexical Analysis
Specify tokens using regular expressions.
Translate the regular expressions to finite automata.
Use the finite automata to generate tables or code for the scanner.
[Diagram: specifications (regular expressions) → Scanner Generator → tables or code; source code → Scanner → tokens.]

4 Example
Consider the problem of recognizing register names in an assembler.
Register → R (0|1|2|…|9) (0|1|2|…|9)*
–Allows registers of arbitrary number
–Requires at least one digit
The RE corresponds to a recognizer (or DFA).
[Diagram, recognizer for Register: s0 (initial state) --R--> s1 --(0|1|2|…|9)--> s2 (accepting state), with a self-loop on s2 labeled (0|1|2|…|9).]

5 What if we need a tighter specification for registers?
R Digit Digit* allows arbitrary numbers.
–Accepts R00000
–Accepts R99999
What if we want to limit it to R0 through R31?
Write a tighter regular expression:
–Register → R ( (0|1|2) (0|1|2|…|9|ε) | (4|5|6|7|8|9) | (3 (0|1|ε)) )
–Register → R0|R1|R2|…|R31|R00|R01|R02|…|R09
This produces a more complex DFA:
–Has more states
–Same cost per transition
–Same basic implementation

6 Tighter register specification
The DFA for Register → R ( (0|1|2) (0|1|2|…|9|ε) | (4|5|6|7|8|9) | (3 (0|1|ε)) )
–Accepts a more constrained set of registers
–Same set of actions, more states
[Diagram: s0 --R--> s1; s1 --0,1,2--> s2; s2 --(0|1|2|…|9)--> s3; s1 --3--> s5; s5 --0,1--> s6; s1 --4,5,6,7,8,9--> s4.]

7 Tighter register specification
To implement the recognizer:
–Use the same code skeleton
–Use the transition table and final states for the new RE: Final = { s2, s3, s4, s5, s6 }
–Bigger tables, more space, same asymptotic costs
–Better syntax checking at the same cost
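The "same code skeleton, different table" idea can be sketched in Python as follows. This is a minimal illustration, not the course's implementation: the helper names (`make_table`, `recognize`) and the dictionary encoding of the transition table are assumptions, and the state numbering follows one reading of the DFA diagram for the tighter specification.

```python
DIGITS = "0123456789"

def make_table(edges):
    """Expand {(state, chars): next_state} into {(state, char): next_state}."""
    return {(s, c): t for (s, chars), t in edges.items() for c in chars}

# Register -> R Digit Digit*
LOOSE = make_table({
    ("s0", "R"): "s1",
    ("s1", DIGITS): "s2",
    ("s2", DIGITS): "s2",
})
LOOSE_FINAL = {"s2"}

# Register -> R ( (0|1|2)(Digit|ε) | (4|...|9) | 3(0|1|ε) )
TIGHT = make_table({
    ("s0", "R"): "s1",
    ("s1", "012"): "s2",
    ("s2", DIGITS): "s3",
    ("s1", "3"): "s5",
    ("s5", "01"): "s6",
    ("s1", "456789"): "s4",
})
TIGHT_FINAL = {"s2", "s3", "s4", "s5", "s6"}

def recognize(word, table, final, start="s0"):
    """One table lookup per input character; the skeleton never changes."""
    state = start
    for ch in word:
        state = table.get((state, ch))  # a missing entry plays the role of an error state
        if state is None:
            return False
    return state in final
```

For example, `recognize("R99999", LOOSE, LOOSE_FINAL)` succeeds while `recognize("R32", TIGHT, TIGHT_FINAL)` fails; only the table and the final-state set differ between the two calls.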

8 Non-deterministic Finite Automata
Non-deterministic finite automaton (NFA) for the RE ( a | b )* abb
[Diagram: s0 --ε--> s1; s1 has a self-loop on a | b; s1 --a--> s2 --b--> s3 --b--> s4.]
This is a little different:
s0 has a transition on ε (the empty string)
–ε-transitions are allowed
s1 has two transitions on "a"
–The transition function δ : S × Σ → 2^S maps (state, symbol) pairs to sets of states
This is a non-deterministic finite automaton (NFA).

9 Non-deterministic Finite Automata
An NFA accepts a string x iff there exists a path through the transition graph from s0 to a final state such that the edge labels spell x.
Transitions on ε consume no input.
To "run" (simulate) the NFA:
–Start in s0 and take all the transitions for each character
–At each iteration add the states reachable by ε-transitions
Why study NFAs?
–They are the key to automating the RE → DFA construction
–We can paste together NFAs with ε-transitions
[Diagram: two NFAs joined by an ε-transition become one NFA.]

10 NFA Simulation
Two key functions:
–Move(q, a) is the set of states reachable by "a" from the states in q
–ε-closure(q) is the set of states reachable by ε from the states in q

states = ε-closure({s0});
char = get_next_char();
while (char != EOF) {
    states = ε-closure(Move(states, char));
    char = get_next_char();
}
if (states ∩ Final is not empty) report acceptance;
else report failure;
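A runnable version of this simulation loop might look like the Python sketch below. The encoding is an assumption made for illustration: states are integers, ε is represented by the empty string, and the transition function is a dictionary from (state, symbol) to a set of states. The sample NFA is the one for ( a | b )* abb.

```python
EPS = ""  # label used here for ε-transitions (an encoding choice)

# NFA for (a|b)*abb: (state, symbol) -> set of next states
DELTA = {
    (0, EPS): {1},
    (1, "a"): {1, 2},
    (1, "b"): {1},
    (2, "b"): {3},
    (3, "b"): {4},
}
START, FINAL = 0, {4}

def eps_closure(states, delta):
    """All states reachable from `states` using only ε-transitions."""
    closure, work = set(states), list(states)
    while work:
        s = work.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                work.append(t)
    return closure

def move(states, ch, delta):
    """All states reachable from `states` by one transition on `ch`."""
    return {t for s in states for t in delta.get((s, ch), ())}

def nfa_accepts(word, delta=DELTA, start=START, final=FINAL):
    """Track the set of states the NFA could be in after each character."""
    states = eps_closure({start}, delta)
    for ch in word:
        states = eps_closure(move(states, ch, delta), delta)
    return bool(states & final)
```

Each input character costs one pass over the current state set, which is where the O(|N| × |x|) simulation bound comes from.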

11 Relationship between NFAs and DFAs
A DFA is a special case of an NFA:
–A DFA has no ε-transitions
–A DFA's transition function is single-valued
–The same rules will work
A DFA can be simulated with an NFA
–Obviously
An NFA can be simulated with a DFA
–Less obvious
–Simulate the sets of possible states that the NFA can reach with states of the DFA
–Possible exponential blowup in the state space
–Still, one state transition per character in the input stream

12 Automating Scanner Construction
To build a scanner:
1 Write down the RE that specifies the tokens
2 Translate the RE to an NFA
3 Build the DFA that simulates the NFA
4 Systematically shrink the DFA
5 Turn it into code or a table
Scanner generators:
–Lex, Flex, JLex, … (inside JavaCC) work along these lines
–The algorithms are well known and well understood
–The interface to the parser is important

13 Automating Scanner Construction
RE → NFA (Thompson's construction)
–Build an NFA for each term
–Combine them with ε-moves
NFA → DFA (subset construction)
–Build the simulation
DFA → Minimal DFA
–Hopcroft's algorithm
DFA → RE
–All pairs, all paths problem
–Union together paths from s0 to a final state
[Diagram, The Cycle of Constructions: RE → NFA → DFA → minimal DFA, with DFA → RE closing the cycle.]

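14 RE → NFA using Thompson's Construction
Key idea:
–An NFA pattern for each symbol and each operator
–Join them with ε-moves in precedence order
[Diagram, NFA for a: s0 --a--> s1.]
[Diagram, NFA for ab: s0 --a--> s1 --ε--> s3 --b--> s4.]
[Diagram, NFA for a | b: s0 --ε--> s1 --a--> s2 --ε--> s5 and s0 --ε--> s3 --b--> s4 --ε--> s5.]
[Diagram, NFA for a*: s0 --ε--> s1 --a--> s3 --ε--> s4, plus s3 --ε--> s1 (repeat) and s0 --ε--> s4 (skip).]
Ken Thompson, CACM, 1968
Subsection in the textbook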

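15 Example: Thompson's Construction
Let's try a ( b | c )*
1. a, b, & c
[Diagram: three two-state NFAs, each s0 --(symbol)--> s1.]
2. b | c
[Diagram: six-state NFA, s0 through s5, with the b and c branches joined by ε-moves.]
3. ( b | c )*
[Diagram: eight-state NFA, s0 through s7, wrapping the b | c NFA with ε-moves for repetition and skipping.]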

16 Example: Thompson's Construction
4. a ( b | c )*
[Diagram: the resulting ten-state NFA, s0 through s9.]
Of course, a human would design something simpler:
[Diagram: s0 --a--> s1, with a self-loop on s1 labeled b | c.]
But we can automate production of the more complex one.
Given a regular expression r, the generated NFA is of size |N| = O(|r|):
–At most two new states are created at each step
–Each state has at most two incoming and two outgoing transitions
Simulating an NFA constructed with Thompson's construction on a string x takes O(|N| × |x|) time.
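The construction can be sketched as a handful of combinators, one per operator. The Python below is an illustrative sketch under stated assumptions: the `Thompson` class, its method names, and the `accepts` helper are hypothetical, ε is encoded as the empty string, and the caller builds the expression bottom-up instead of parsing RE text. Following the slides, every symbol, alternation, and closure adds two new states, and concatenation adds only an ε-move.

```python
from collections import defaultdict

EPS = ""  # encoding choice for ε

class Thompson:
    """Builds NFA fragments; each fragment is a (start, accept) pair."""
    def __init__(self):
        self.delta = defaultdict(set)  # (state, symbol) -> set of states
        self.n = 0                     # number of states created so far

    def _new(self):
        self.n += 1
        return self.n - 1

    def sym(self, ch):
        s, f = self._new(), self._new()
        self.delta[(s, ch)].add(f)
        return s, f

    def concat(self, x, y):
        self.delta[(x[1], EPS)].add(y[0])       # join with an ε-move
        return x[0], y[1]

    def alt(self, x, y):
        s, f = self._new(), self._new()
        self.delta[(s, EPS)] |= {x[0], y[0]}    # branch into either operand
        self.delta[(x[1], EPS)].add(f)
        self.delta[(y[1], EPS)].add(f)
        return s, f

    def star(self, x):
        s, f = self._new(), self._new()
        self.delta[(s, EPS)] |= {x[0], f}       # enter the loop or skip it
        self.delta[(x[1], EPS)] |= {x[0], f}    # repeat or leave
        return s, f

def accepts(nfa, frag, word):
    """Simulate the fragment on `word` with ε-closure and move."""
    start, final = frag
    def closure(states):
        seen, work = set(states), list(states)
        while work:
            s = work.pop()
            for t in nfa.delta.get((s, EPS), ()):
                if t not in seen:
                    seen.add(t)
                    work.append(t)
        return seen
    cur = closure({start})
    for ch in word:
        cur = closure({t for s in cur for t in nfa.delta.get((s, ch), ())})
    return final in cur

# a ( b | c )*
nfa = Thompson()
frag = nfa.concat(nfa.sym("a"), nfa.star(nfa.alt(nfa.sym("b"), nfa.sym("c"))))
```

With this encoding, `nfa.n` is 10 for a ( b | c )*, matching the ten states s0 through s9 in the example.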

17 NFA → DFA with Subset Construction
Need to simulate the NFA with the DFA.
The algorithm:
States of the DFA correspond to sets of states of the NFA
–The DFA keeps track of all the states the NFA can reach after reading an input string
The start state of the DFA is derived from the start state of the NFA
–Take the ε-closure of the start state of the NFA
Work forward, trying each α ∈ Σ and taking its ε-closure
Iterative algorithm that halts when the states wrap back on themselves

18 NFA → DFA with Subset Construction

s0 = ε-closure({q0});
Initialize S with {s0};
while (there exists an unmarked si ∈ S) {
    mark si;
    for each α in Σ {
        s? = ε-closure(move(si, α));
        if (s? is not in S)
            add s? to S as an unmarked state sj;
        T[si, α] = sj;
    }
}

Here q0 is the initial state of the NFA and s0 is the initial state of the DFA.
S is the set of states of the constructed DFA.
T is the transition function of the DFA.
A DFA state is an accepting state if it contains an accepting NFA state.
Why does this algorithm work?
The algorithm halts:
1. S contains no duplicates (tests before adding)
2. 2^Q is finite (Q is the set of states of the NFA)
3. The while loop adds to S, but does not remove from S (monotone)
⇒ the loop halts
S contains all the reachable NFA states:
–It tries each character in each si
–It builds every possible NFA configuration
⇒ S and T form the DFA
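The worklist loop above translates almost line for line into Python. This is a sketch under assumptions: DFA states are represented as frozensets of NFA states, ε is the empty string, and an empty move result is treated as "no transition" rather than as an explicit dead state. The sample input is the NFA for ( a | b )* abb.

```python
EPS = ""  # encoding choice for ε

def eps_closure(states, delta):
    """ε-closure as a frozenset so it can serve as a DFA state."""
    closure, work = set(states), list(states)
    while work:
        s = work.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                work.append(t)
    return frozenset(closure)

def move(states, ch, delta):
    return {t for s in states for t in delta.get((s, ch), ())}

def subset_construction(delta, q0, nfa_final, alphabet):
    s0 = eps_closure({q0}, delta)
    S = [s0]          # DFA states discovered so far
    T = {}            # DFA transition function: (state, symbol) -> state
    unmarked = [s0]
    while unmarked:
        si = unmarked.pop()               # "mark" si by removing it from the worklist
        for a in alphabet:
            sj = eps_closure(move(si, a, delta), delta)
            if not sj:
                continue                  # no NFA state reachable: leave T undefined
            if sj not in S:
                S.append(sj)
                unmarked.append(sj)
            T[(si, a)] = sj
    final = {s for s in S if s & nfa_final}   # contains an accepting NFA state
    return S, T, s0, final

# NFA for (a|b)*abb with q0 = 0 and accepting state 4
NFA_DELTA = {(0, EPS): {1}, (1, "a"): {1, 2}, (1, "b"): {1},
             (2, "b"): {3}, (3, "b"): {4}}
S, T, s0, final = subset_construction(NFA_DELTA, 0, {4}, "ab")
```

On this input the loop discovers five DFA states, and the only accepting one is the set containing NFA state 4.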

19 NFA → DFA with Subset Construction
Example of a fixed-point computation:
–Monotone construction of some finite set
–Halts when it stops adding to the set
–Proofs of halting and correctness are similar
–These computations arise in many contexts
Other fixed-point computations:
–Canonical construction of sets of LR(1) items (quite similar to the subset construction)
–Data-flow analysis algorithms for compiler optimization
–Automated verification: verification of an invariant condition can be written as a fixpoint computation

20 Example: NFA → DFA with Subset Construction
Remember ( a | b )* abb ?
[Diagram: q0 --ε--> q1; q1 has a self-loop on a | b; q1 --a--> q2 --b--> q3 --b--> q4.]
Applying the subset construction:
[Table of iterations; the surviving annotation marks the DFA state that contains q4 as the final state.]
Iteration 4 does not add a new state to S and all the states are marked, so the algorithm halts.

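21 Example: NFA → DFA with Subset Construction
The DFA for ( a | b )* abb
[Diagram: s0 --a--> s1, s0 --b--> s2; s1 --a--> s1, s1 --b--> s3; s2 --a--> s1, s2 --b--> s2; s3 --a--> s1, s3 --b--> s4; s4 --a--> s1, s4 --b--> s2.]
–Not much bigger than the original
–All transitions are deterministic
–Use the same code skeleton as before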

22 DFA Minimization
Important result: every regular language is recognized by a minimum-state DFA that is unique up to state names.
Main ideas in DFA minimization:
–Discover sets of equivalent states
–Represent each such set with just one state
Two states are equivalent if and only if:
–The sets of paths leading to them are equivalent
–∀ α ∈ Σ, transitions on α lead to equivalent states
–Transitions to distinct sets ⇒ states must be in distinct sets
A partition P of S:
–Each s ∈ S is in exactly one set pi ∈ P
–The algorithm iteratively partitions the DFA's states

23 DFA Minimization
Details of the algorithm:
–Group states into maximal-size sets, optimistically
–Iteratively subdivide those sets, looking for states that can be distinguished from each other
–States that remain grouped together are equivalent
The initial partition, P0, has two sets, {F} and {Q−F}, where D = (Q, Σ, δ, q0, F).
Splitting a set:
–Split a set S into smaller groups if states in S have transitions to different sets for the same input
–Assume q1, q2, q3 ∈ S and α ∈ Σ, with δ(q1, α) = qx, δ(q2, α) = qy, and δ(q3, α) = qz
–If qx, qy, and qz are not in the same set, then S must be split
–One state in the final DFA cannot have two transitions on α

24 DFA Minimization
The algorithm:

P = { F, {Q−F} };
while (P is still changing) {
    T = { };
    for each set S ∈ P {
        partition S into S1, S2, …, Sk so that two states stay together
            only if, for every α ∈ Σ, their transitions on α lead to the same set of P;
        T = T ∪ {S1, S2, …, Sk};
    }
    if (T != P) P = T;
}

Why does this work?
–Partition P ⊆ 2^Q
–Start off with 2 subsets of Q, {F} and {Q−F}
–The while loop takes Pi → Pi+1 by splitting 1 or more sets
–Pi+1 is at least one step closer to the partition with |Q| sets
–Maximum of |Q| splits
Note that:
–Partitions are never combined
–The initial partition ensures that final states are intact
This is a fixed-point algorithm!
Subsection in the textbook
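A small Python sketch of this partition refinement is shown below. It is the simple fixed-point version described on the slide, not an optimized worklist implementation of Hopcroft's algorithm. The function names and the dictionary encoding of δ are assumptions; a missing transition is treated as leading to an implicit dead block.

```python
def block_of(P, state):
    """The set in partition P that contains `state` (None for a missing transition)."""
    return next((b for b in P if state in b), None)

def minimize(states, alphabet, delta, final):
    """Return the coarsest partition of `states` into equivalent groups.

    delta maps (state, symbol) -> state and may be partial.
    """
    symbols = sorted(alphabet)
    P = {frozenset(final), frozenset(states - final)} - {frozenset()}
    while True:
        T = set()
        for S in P:
            groups = {}
            for s in S:
                # Signature: which set of P each symbol leads to.
                key = tuple(block_of(P, delta.get((s, a))) for a in symbols)
                groups.setdefault(key, set()).add(s)
            T |= {frozenset(g) for g in groups.values()}
        if T == P:          # nothing was split: fixed point reached
            return P
        P = T

# DFA for (a|b)*abb from the subset construction, states 0..4 standing for s0..s4
ABB = {(0, "a"): 1, (0, "b"): 2, (1, "a"): 1, (1, "b"): 3, (2, "a"): 1,
       (2, "b"): 2, (3, "a"): 1, (3, "b"): 4, (4, "a"): 1, (4, "b"): 2}
abb_blocks = minimize({0, 1, 2, 3, 4}, "ab", ABB, {4})
```

Each returned frozenset becomes one state of the minimal DFA; on the ( a | b )* abb automaton the only merge is s0 with s2.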

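25 Example: DFA Minimization
Enough theory, does this stuff work?
–Recall our example: ( a | b )* abb
[Diagram, before: the five-state DFA s0 through s4 from the subset construction.]
[Diagram, after: s0 and s2 merged into one start state {s0, s2}; {s0, s2} --a--> s1, {s0, s2} --b--> {s0, s2}; s1 --a--> s1, s1 --b--> s3; s3 --a--> s1, s3 --b--> s4; s4 --a--> s1, s4 --b--> {s0, s2}; s4 is the final state.]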

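26 Example: DFA Minimization
What about a ( b | c )* ?
[Diagram: the ten-state Thompson NFA, q0 through q9, with q0 --a--> q1, q4 --b--> q5, q6 --c--> q7 and ε-moves elsewhere.]
First, the subset construction:
[Diagram: DFA with s0 --a--> s1; s1 --b--> s2, s1 --c--> s3; s2 --b--> s2, s2 --c--> s3; s3 --b--> s2, s3 --c--> s3. Final states: s1, s2, s3.]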

27 Example: DFA Minimization
Then, apply the minimization algorithm to produce the minimal DFA.
[Diagram, before: the four-state DFA s0 through s3, with final states s1, s2, s3.]
[Diagram, after: s0 (start state) --a--> s1 (final state), with a self-loop on s1 labeled b | c.]
We previously noted that a human would design a simpler automaton than Thompson's construction did. The algorithms produce that same DFA!

28 Last Link in the Cycle: DFA to RE Translation
The basic idea is to eliminate the states of the DFA one by one and replace them with regular expressions on the transitions.
First, add a new start state and a new final state, and connect them to the start and final states of the input DFA with ε-transitions.
–These are the only states that will not be eliminated; the regular expression on the transition between them will be the final result
Delete the states of the input DFA one by one using this rule:
–Before: qi --R1--> qd, qd --R2--> qd (self-loop), qd --R3--> qj, and qi --R4--> qj
–After eliminating state qd: qi --(R1)(R2)*(R3) | (R4)--> qj
Apply this rule for each qi and qj (also for cases where qi = qj).

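29 Example: DFA to RE Translation
Initial automaton: states 1 (start), 2, and 3 (2 and 3 final), with transitions 1 --a--> 2, 1 --b--> 3, 2 --a--> 1, 2 --b--> 2, 3 --a--> 2, 3 --b--> 1.
Step 1) After adding new start and final states: S --ε--> 1, 2 --ε--> F, 3 --ε--> F.
Step 2) After deleting state 1: S --a--> 2, S --b--> 3, 2 --aa | b--> 2, 2 --ab--> 3, 3 --ba | a--> 2, 3 --bb--> 3, 2 --ε--> F, 3 --ε--> F.
Step 3) After deleting state 2: S --a(aa|b)*--> F, S --a(aa|b)*ab | b--> 3, 3 --(ba|a)(aa|b)*ab | bb--> 3, 3 --(ba|a)(aa|b)* | ε--> F.
Step 4) After deleting state 3, we get the result:
(a(aa|b)*ab|b)((ba|a)(aa|b)*ab|bb)*((ba|a)(aa|b)*|ε)|a(aa|b)*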
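The elimination rule can be sketched in Python as below. Several things here are assumptions made for illustration: the function names, the use of `None` for "no edge" and the empty string for ε, the output in Python `re` syntax with non-capturing groups, and the three-state sample DFA (start state 1, final states 2 and 3), which is one reading of the example's diagram. The new states are named "S" and "F" and are assumed not to clash with existing state names.

```python
import re

def alt(x, y):
    """R | S, where None stands for 'no edge'."""
    if x is None:
        return y
    if y is None:
        return x
    return x if x == y else f"(?:{x}|{y})"

def cat(x, y):
    """RS; a missing edge on either side means no path."""
    return None if x is None or y is None else x + y

def star(x):
    """R*; the star of a missing or ε self-loop is just ε."""
    return "" if not x else f"(?:{x})*"

def dfa_to_re(states, delta, start, finals):
    """delta: (state, symbol) -> state. Returns a regex string, or None if no string is accepted."""
    S, F = "S", "F"
    edge = {(S, start): ""}                     # ε from the new start state
    for f in finals:
        edge[(f, F)] = ""                       # ε into the new final state
    for (p, a), q in delta.items():
        edge[(p, q)] = alt(edge.get((p, q)), re.escape(a))
    remaining = [S, F] + list(states)
    for qd in states:                           # eliminate the input states one by one
        remaining.remove(qd)
        loop = star(edge.get((qd, qd)))
        for qi in remaining:
            for qj in remaining:
                through = cat(cat(edge.get((qi, qd)), loop), edge.get((qd, qj)))
                new = alt(through, edge.get((qi, qj)))   # (R1)(R2)*(R3) | (R4)
                if new is not None:
                    edge[(qi, qj)] = new
        edge = {k: v for k, v in edge.items() if qd not in k}
    return edge.get((S, F))

def dfa_accepts(word, delta, start, finals):
    q = start
    for ch in word:
        q = delta.get((q, ch))
        if q is None:
            return False
    return q in finals

EXAMPLE = {(1, "a"): 2, (1, "b"): 3, (2, "a"): 1,
           (2, "b"): 2, (3, "a"): 2, (3, "b"): 1}
example_re = dfa_to_re([1, 2, 3], EXAMPLE, 1, {2, 3})
```

The resulting string is more heavily parenthesized than the hand-simplified expression in the example, but `re.fullmatch(example_re, w)` should agree with `dfa_accepts(w, EXAMPLE, 1, {2, 3})` on every input, which is an easy way to check the translation on short strings.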

30 Equivalence of FAs and REs
RE → NFA (Thompson's construction)
–Build an NFA for each term
–Combine them with ε-moves
NFA → DFA (subset construction)
–Build the simulation
DFA → Minimal DFA
–Hopcroft's algorithm
DFA → RE
–All pairs, all paths problem
–Union together paths from s0 to a final state
[Diagram, The Cycle of Constructions: RE → NFA → DFA → minimal DFA, with DFA → RE closing the cycle.]