CS308 Compiler Principles Lexical Analyzer Fan Wu Department of Computer Science and Engineering Shanghai Jiao Tong University Fall 2012.

Presentation transcript:


Lexical Analyzer
The lexical analyzer reads the source program character by character to produce tokens.
– strips out comments and whitespace
– returns a token when the parser asks for one
– correlates error messages with the source program

Token
A token is a pair of a token name and an optional attribute value.
– The token name identifies the kind of lexical unit, and hence the pattern its lexemes must match
– The attribute stores the lexeme of the token
Tokens
– Keyword: “begin”, “if”, “else”, …
– Identifier: a string of letters or digits, starting with a letter
– Integer: a non-empty string of digits
– Punctuation symbol: “,”, “;”, “(”, “)”, …
Regular expressions are widely used to specify the patterns of tokens.
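The token classes listed above can be sketched with Python's `re` module. This is only an illustration: the names `TOKEN_SPEC` and `tokenize` and the three-keyword list are assumptions, not part of the lecture. Python's alternation takes the first alternative that matches rather than the longest one, so keywords are listed before identifiers.

```python
import re

# Hypothetical token patterns, in priority order (keywords before identifiers).
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:begin|if|else)\b"),
    ("ID",      r"[A-Za-z][A-Za-z0-9]*"),   # letter (letter | digit)*
    ("INT",     r"[0-9]+"),                 # non-empty string of digits
    ("PUNCT",   r"[,;()]"),
    ("SKIP",    r"\s+"),                    # whitespace is stripped, not returned
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token_name, lexeme) pairs for the input text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Calling `tokenize("if (x1) y;")` yields pairs such as `("KEYWORD", "if")` and `("ID", "x1")`, mirroring the token name and lexeme attribute described above.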

Token Example [figure]

Input Buffering
Why does a compiler need buffers?
Buffer pairs: two buffers that are reloaded alternately
Two pointers
– lexemeBegin
– forward
Sentinels: a special character that marks the end of a buffer

Lookahead with Sentinels [figure]

Terminology of Languages
Alphabet: a finite set of symbols
– ASCII
– Unicode
String: a finite sequence of symbols over an alphabet
– ε is the empty string
– |s| is the length of string s
– Concatenation: xy represents x followed by y
– Exponentiation: s^n = s s … s (n times), s^0 = ε
Language: a set of strings over some fixed alphabet
– ∅, the empty set, is a language
– The set of well-formed C programs is a language

Operations on Languages
Union: L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }
Concatenation: L1 L2 = { s1 s2 | s1 ∈ L1 and s2 ∈ L2 }
(Kleene) Closure: L* = L^0 ∪ L^1 ∪ L^2 ∪ … (zero or more concatenations of L)
Positive Closure: L+ = L^1 ∪ L^2 ∪ L^3 ∪ … (one or more concatenations of L)

Example
L1 = {a,b,c,d}, L2 = {1,2}
L1 ∪ L2 = {a,b,c,d,1,2}
L1 L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
L1* = all strings using letters a,b,c,d, including the empty string
L1+ = all strings using letters a,b,c,d, excluding the empty string
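The language operations in this example can be tried out on finite sets of Python strings. The sketch below is illustrative only; the function names are assumptions, and because L* is infinite the closure is truncated at a chosen maximum exponent.

```python
from itertools import product


def union(l1, l2):
    """L1 ∪ L2."""
    return l1 | l2


def concat(l1, l2):
    """L1 L2 = { s1 s2 | s1 in L1 and s2 in L2 }."""
    return {x + y for x, y in product(l1, l2)}


def power(lang, n):
    """L^n, with L^0 = { empty string }."""
    result = {""}
    for _ in range(n):
        result = concat(result, lang)
    return result


def closure_up_to(lang, max_n, positive=False):
    """Finite approximation of L* (or L+ if positive): union of L^i for i <= max_n."""
    start = 1 if positive else 0
    out = set()
    for n in range(start, max_n + 1):
        out |= power(lang, n)
    return out


L1 = {"a", "b", "c", "d"}
L2 = {"1", "2"}
```

Here `concat(L1, L2)` gives the eight strings a1 … d2 from the example, and the empty string belongs to `closure_up_to(L1, 2)` but not to `closure_up_to(L1, 2, positive=True)`.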

Regular Expressions
A regular expression is a representation of a language, built by applying operators to the symbols of some alphabet.
A regular expression is built up from smaller regular expressions (using defining rules).
Each regular expression r denotes a language L(r).
A language denoted by a regular expression is called a regular set.

Regular Expressions (Rules)
Regular expressions over alphabet Σ
Reg. Expr          Language it denotes
ε                  L(ε) = {ε}
a ∈ Σ              L(a) = {a}
(r1) | (r2)        L(r1) ∪ L(r2)
(r1) (r2)          L(r1) L(r2)
(r)*               (L(r))*
(r)                L(r)
Extensions
(r)+ = (r)(r)*     (L(r))+
(r)? = (r) | ε     L(r) ∪ {ε}          zero or one instance
[a1-an]            L(a1|a2|…|an)       character class

Regular Expressions (cont.)
We may remove parentheses by using precedence rules:
– * highest
– concatenation second highest
– | lowest
(a(b)*)|(c) can be written as ab*|c
Example:
– Σ = {0,1}
– 0|1 => {0,1}
– (0|1)(0|1) => {00,01,10,11}
– 0* => {ε,0,00,000,0000,…}
– (0|1)* => all strings of 0s and 1s, including the empty string

Regular Definitions
We can give names to regular expressions, and use these names as symbols to define other regular expressions.
A regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
…
dn → rn
where each di is a new symbol and each ri is a regular expression over the symbols in Σ ∪ {d1, d2, ..., di-1}, that is, the alphabet plus the previously defined symbols.

Regular Definitions Example
Example: Identifiers in Pascal
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter (letter | digit)*
– Written without regular definitions, the regular expression for identifiers is much harder to read:
(A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) )*

Grammar; Regular Definitions [figure]

Transition Diagram
State: represents a condition that could occur during scanning
– start/initial state
– accepting/final state: lexeme found
– intermediate state
Edge: directed from one state to another, labeled with one symbol or a set of symbols

Transition Diagram for relop
Transition diagram for “relop → < | > | <= | >= | = | <>” [figure]

Transition-Diagram-Based Lexical Analyzer
Implementation of the relop transition diagram [figure]
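The slide's own implementation is not preserved in this transcript. As a rough stand-in, the Python sketch below walks a relop transition diagram by hand, assuming the six operators <, <=, <>, =, > and >=. The function name `get_relop` and the attribute names LT, LE, NE, EQ, GT and GE are illustrative assumptions.

```python
def get_relop(text, pos=0):
    """Try to recognize a relational operator starting at text[pos].

    Returns (token_name, attribute, next_pos), or None if no relop starts here.
    Each branch corresponds to following one edge of the transition diagram.
    """
    c = text[pos] if pos < len(text) else ""
    nxt = text[pos + 1] if pos + 1 < len(text) else ""
    if c == "<":
        if nxt == "=":
            return ("relop", "LE", pos + 2)
        if nxt == ">":
            return ("relop", "NE", pos + 2)
        # Any other character: accept "<" and retract (do not consume nxt).
        return ("relop", "LT", pos + 1)
    if c == "=":
        return ("relop", "EQ", pos + 1)
    if c == ">":
        if nxt == "=":
            return ("relop", "GE", pos + 2)
        # Any other character: accept ">" and retract.
        return ("relop", "GT", pos + 1)
    return None
```

The "retract" comments mark the accepting states that are reached on a lookahead character which is not part of the lexeme, so the forward pointer moves back by one.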

Transition Diagram for Others
A transition diagram for ids and keywords [figure]
A transition diagram for unsigned numbers [figure]

Practice
Draw the transition diagram that recognizes the regular expression a(a|b)*a [figure]

Finite Automata
A finite automaton is a recognizer that takes a string and answers “yes” if the string matches a pattern of a specified language, and “no” otherwise.
Two kinds:
– Nondeterministic finite automaton (NFA): no restriction on the labels of its edges
– Deterministic finite automaton (DFA): for each state and each input symbol, exactly one edge labeled with that symbol leaves the state
NFAs and DFAs have the same expressive power: both recognize exactly the regular languages.
Either an NFA or a DFA may be used as a lexical analyzer.

Nondeterministic Finite Automaton (NFA)
An NFA consists of:
– S: a set of states
– Σ: a set of input symbols (alphabet)
– A transition function: maps state-symbol pairs to sets of states
– s0: a start (initial) state
– F: a set of accepting states (final states)
An NFA can be represented by a transition graph.
It accepts a string x if and only if there is a path from the start state to one of the accepting states such that the edge labels along this path spell out x.
Remarks
– The same symbol can label edges from one state to several different states
– An edge may be labeled by ε, the empty string

NFA Example (1)
The language recognized by this NFA is (a|b)*ab [figure]

NFA Example (2)
NFA accepting aa*|bb* [figure]

Implementing an NFA
S ← ε-closure({s0})              { set of all states accessible from s0 by ε-transitions }
c ← nextchar()
while (c != eof) do
begin
    S ← ε-closure(move(S,c))     { move(S,c): set of all states accessible from a state in S by a transition on c }
    c ← nextchar()
end
if (S ∩ F != ∅) then             { if S contains an accepting state }
    return “yes”
else
    return “no”
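The NFA simulation above translates into Python along the following lines. This is a sketch under assumptions: the transition function is a dict from (state, symbol) to a set of states, the empty string stands for ε, and the example automaton is one possible NFA for aa*|bb* whose state numbering is hypothetical (the slide's figure is not available).

```python
EPS = ""  # assumed encoding of an ε-label


def eps_closure(states, delta):
    """All states reachable from `states` using only ε-transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def move(states, symbol, delta):
    """All states reachable from a state in `states` by one transition on `symbol`."""
    result = set()
    for s in states:
        result |= set(delta.get((s, symbol), ()))
    return result


def nfa_accepts(text, delta, start, accepting):
    """Simulate the NFA on `text`, tracking the set of current states."""
    current = eps_closure({start}, delta)
    for c in text:
        current = eps_closure(move(current, c, delta), delta)
    return bool(current & accepting)


# Hypothetical NFA for aa*|bb*: state 0 branches by ε into the two alternatives.
DELTA = {
    (0, EPS): {1, 3},
    (1, "a"): {2}, (2, "a"): {2},
    (3, "b"): {4}, (4, "b"): {4},
}
ACCEPTING = {2, 4}
```

With these definitions `nfa_accepts("aaa", DELTA, 0, ACCEPTING)` is true, while the empty string and "ab" are rejected.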

Deterministic Finite Automaton (DFA)
A Deterministic Finite Automaton (DFA) is a special form of an NFA.
– No state has an ε-transition
– For each symbol a and state s, there is at most one edge labeled a leaving s.
The language recognized by this DFA is also (a|b)*ab [figure]

Algorithm to Simulate DFA
Transition table and DFA for (a|b)*ab; simulating a DFA [figure]

Implementing a DFA
s ← s0                   { start from the initial state }
c ← nextchar()           { get the next character from the input string }
while (c != eof) do      { do until the end of the string }
begin
    s ← move(s,c)        { transition function }
    c ← nextchar()
end
if (s in F) then         { if s is an accepting state }
    return “yes”
else
    return “no”
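A table-driven version of this loop might look as follows in Python. The transition table is a hand-written DFA for (a|b)*ab with assumed state numbers 0, 1 and 2 (the slide's own table is not preserved here), and a missing table entry is treated as rejection.

```python
def dfa_accepts(text, dtran, start, accepting):
    """Run the DFA over `text`; dtran maps (state, symbol) to the next state."""
    state = start
    for c in text:
        state = dtran.get((state, c))
        if state is None:          # no edge for this symbol: reject
            return False
    return state in accepting


# Assumed DFA for (a|b)*ab: state 1 means "just saw a", state 2 means "just saw ab".
DTRAN = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}
```

Unlike the NFA simulation, each input character costs a single table lookup, which is why the DFA form is the fast one.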

NFA vs. DFA
        Compactness   Readability   Speed
NFA     Good          Good          Slow
DFA     Bad           Bad           Fast
DFAs are widely used to build lexical analyzers.
[figure: an NFA and a DFA, both recognizing the language 0*1*]

NFA vs. DFA (second example)
        Compactness   Readability   Speed
NFA     Good          Good          Slow
DFA     Bad           Bad           Fast
DFAs are widely used to build lexical analyzers.
[figure: an NFA and a DFA, both recognizing the language (a|b)*ab]

Test Yourself
1) What languages are recognized by the two FAs (a) and (b)? [figure]
2) Over the alphabet {0,1}, design a DFA that accepts all strings containing three ‘0’s.

Regular Expression → NFA
McNaughton-Yamada-Thompson (MYT) construction
– Simple and systematic
– Guarantees that the resulting NFA has exactly one start state and one final state.
– Construction starts from the simplest parts (alphabet symbols).
– For a complex regular expression, the NFAs of its subexpressions are combined to create its NFA.

MYT Construction
Basic rules: for subexpressions with no operators
– For expression ε: a start state i with an ε-labeled edge to an accepting state f
– For a symbol a in the alphabet Σ: a start state i with an a-labeled edge to an accepting state f

MYT Construction Cont’d
Inductive rules: for constructing larger NFAs from the NFAs of subexpressions (let N(r1) and N(r2) denote NFAs for regular expressions r1 and r2, respectively)
– For regular expression r1 | r2: a new start state i has ε-edges to the start states of N(r1) and N(r2), and the accepting states of N(r1) and N(r2) have ε-edges to a new accepting state f [figure]

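MYT Construction Cont’d
– For regular expression r1 r2: the start state i of N(r1) becomes the start state, the accepting state f of N(r2) becomes the accepting state, and the accepting state of N(r1) is merged with the start state of N(r2) [figure]
– For regular expression r*: a new start state i and a new accepting state f, with ε-edges from i to the start state of N(r), from the accepting state of N(r) to f, from i directly to f, and from the accepting state of N(r) back to its start state [figure]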

Example: (a|b)*a
Construction steps, in order: a, b, (a|b), (a|b)*, (a|b)*a [figures]

Properties of the Constructed NFA
1. N(r) has at most twice as many states as there are operators and operands in r.
– This bound follows from the fact that each step of the algorithm creates at most two new states.
2. N(r) has one start state and one accepting state. The accepting state has no outgoing transitions, and the start state has no incoming transitions.
3. Each state of N(r) other than the accepting state has either one outgoing transition on a symbol in Σ or two outgoing transitions, both on ε.

Conversion of an NFA to a DFA
Approach: Subset Construction
– each state of the constructed DFA corresponds to a set of NFA states
Details
① Create the transition table Dtran for the DFA
② Insert ε-closure(s0) into Dstates as the initial state
③ Pick an unvisited state T in Dstates
④ For each input symbol a, create the state ε-closure(move(T, a)), and add it to Dstates and Dtran
⑤ Repeat steps ③ and ④ until all states in Dstates are visited

The Subset Construction [figure]

NFA to DFA Example
NFA for (a|b)*abb; transition table for the DFA; equivalent DFA [figures]

Converting an NFA into a DFA (Example)
[figure: NFA with states 0 to 8]
S0 = ε-closure({0}) = {0,1,2,4,7}; S0 goes into DS as an unmarked state
mark S0
    ε-closure(move(S0,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1; S1 goes into DS
    ε-closure(move(S0,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2; S2 goes into DS
    transfunc[S0,a] ← S1, transfunc[S0,b] ← S2
mark S1
    ε-closure(move(S1,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
    ε-closure(move(S1,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
    transfunc[S1,a] ← S1, transfunc[S1,b] ← S2
mark S2
    ε-closure(move(S2,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
    ε-closure(move(S2,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
    transfunc[S2,a] ← S1, transfunc[S2,b] ← S2

Converting an NFA into a DFA (Example)
S0 is the start state of the DFA since 0 is a member of S0 = {0,1,2,4,7}
S1 is an accepting state of the DFA since 8 is a member of S1 = {1,2,3,4,6,7,8}
[figure: resulting DFA with states S0, S1, S2]
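The worked example can be reproduced with a short Python sketch of the subset construction. The NFA encoding below is an assumption reconstructed from the closures listed in the example (the figure itself is missing): states 0 to 8, the empty string standing for ε, and state 8 accepting. The function names are illustrative.

```python
EPS = ""  # assumed encoding of an ε-label


def eps_closure(states, delta):
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def move(states, symbol, delta):
    result = set()
    for s in states:
        result |= set(delta.get((s, symbol), ()))
    return result


def subset_construction(delta, start, accepting, alphabet):
    """Return (start_state, dstates, dtran, d_accepting); DFA states are frozensets."""
    start_set = frozenset(eps_closure({start}, delta))
    dstates = {start_set}
    unmarked = [start_set]
    dtran = {}
    while unmarked:
        T = unmarked.pop()                     # mark T
        for a in alphabet:
            U = frozenset(eps_closure(move(T, a, delta), delta))
            if not U:
                continue                       # no transition; dead state left implicit
            dtran[(T, a)] = U
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
    d_accepting = {S for S in dstates if S & accepting}
    return start_set, dstates, dtran, d_accepting


# NFA inferred from the example's closures (an assumption, since the figure is missing).
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "a"): {3}, (4, "b"): {5},
    (3, EPS): {6}, (5, EPS): {6},
    (6, EPS): {1, 7}, (7, "a"): {8},
}
```

Running `subset_construction(NFA, 0, {8}, "ab")` produces three DFA states, matching S0, S1 and S2 from the example.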

Regular Expression → DFA
First, augment the given regular expression by concatenating a special symbol #:
r → (r)# (augmented regular expression)
Second, create a syntax tree for the augmented regular expression.
– All leaves are alphabet symbols (plus # and the empty string)
– All inner nodes are operators
Third, number each alphabet symbol (plus #); these numbers are the positions.

Regular Expression → DFA Cont’d
(a|b)*a → (a|b)*a# (augmented regular expression)
Syntax tree of (a|b)*a# [figure]
– each symbol is at a leaf
– each symbol is numbered (positions): a = 1, b = 2, a = 3, # = 4
– inner nodes are operators

followpos
Then we define the function followpos for the positions (the numbers assigned to leaves).
followpos(i): the set of positions that can follow position i in the strings generated by the augmented regular expression.
Example: (a|b)*a#
followpos(1) = {1,2,3}
followpos(2) = {1,2,3}
followpos(3) = {4}
followpos(4) = {}
followpos is defined only for leaves, not for inner nodes.

firstpos, lastpos, nullable
To compute followpos, we need three more functions defined for all nodes (not just the leaves) of the syntax tree.
– firstpos(n): the set of the positions of the first symbols of strings generated by the subexpression rooted at n.
– lastpos(n): the set of the positions of the last symbols of strings generated by the subexpression rooted at n.
– nullable(n): true if the empty string is among the strings generated by the subexpression rooted at n; false otherwise

Usage of the Functions
(a|b)*a → (a|b)*a# (augmented regular expression)
Syntax tree of (a|b)*a# with two marked nodes n and m [figure]
nullable(n) = false
nullable(m) = true
firstpos(n) = {1, 2, 3}
lastpos(n) = {3}

Computing nullable, firstpos, lastpos
For each kind of node n:
– leaf labeled ε: nullable(n) = true; firstpos(n) = ∅; lastpos(n) = ∅
– leaf labeled with position i: nullable(n) = false; firstpos(n) = {i}; lastpos(n) = {i}
– or-node c1 | c2: nullable(n) = nullable(c1) or nullable(c2); firstpos(n) = firstpos(c1) ∪ firstpos(c2); lastpos(n) = lastpos(c1) ∪ lastpos(c2)
– cat-node c1 c2: nullable(n) = nullable(c1) and nullable(c2); firstpos(n) = if nullable(c1) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1); lastpos(n) = if nullable(c2) then lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
– star-node c1*: nullable(n) = true; firstpos(n) = firstpos(c1); lastpos(n) = lastpos(c1)

How to evaluate followpos
Two rules define the function followpos:
1. If n is a concatenation node with left child c1 and right child c2, and i is a position in lastpos(c1), then all positions in firstpos(c2) are in followpos(i).
2. If n is a star node, and i is a position in lastpos(n), then all positions in firstpos(n) are in followpos(i).
If firstpos and lastpos have been computed for each node, followpos of each position can be computed by making one depth-first traversal of the syntax tree.

Example: (a|b)*a#
[figure: syntax tree annotated with firstpos (red) and lastpos (blue) at each node]
Then we can calculate followpos:
followpos(1) = {1,2,3}
followpos(2) = {1,2,3}
followpos(3) = {4}
followpos(4) = {}
After we calculate the follow positions, we are ready to create the DFA for the regular expression.
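The rules for nullable, firstpos, lastpos and followpos can be checked with a single recursive traversal. The Python sketch below is illustrative: the tuple encoding of the syntax tree (`"leaf"`, `"eps"`, `"or"`, `"cat"`, `"star"`) and the function name `analyze` are assumptions, not the lecture's notation.

```python
def analyze(node, followpos):
    """Return (nullable, firstpos, lastpos) for node, filling followpos in place."""
    kind = node[0]
    if kind == "eps":
        return True, set(), set()
    if kind == "leaf":                       # ("leaf", symbol, position)
        pos = node[2]
        followpos.setdefault(pos, set())
        return False, {pos}, {pos}
    if kind == "or":
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        for i in l1:                         # rule 1: lastpos(c1) is followed by firstpos(c2)
            followpos[i] |= f2
        first = (f1 | f2) if n1 else f1
        last = (l1 | l2) if n2 else l2
        return n1 and n2, first, last
    if kind == "star":
        _, f, l = analyze(node[1], followpos)
        for i in l:                          # rule 2: lastpos(n) is followed by firstpos(n)
            followpos[i] |= f
        return True, f, l
    raise ValueError(f"unknown node kind: {kind}")


# Syntax tree of (a|b)*a# with positions a=1, b=2, a=3, #=4.
TREE = ("cat",
        ("cat",
         ("star", ("or", ("leaf", "a", 1), ("leaf", "b", 2))),
         ("leaf", "a", 3)),
        ("leaf", "#", 4))

FOLLOWPOS = {}
NULLABLE, FIRSTPOS, LASTPOS = analyze(TREE, FOLLOWPOS)
```

For this tree the traversal yields the same followpos sets as the example, and `FIRSTPOS` of the root is {1,2,3}, which becomes the start state of the DFA.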

Algorithm (RE → DFA)
1. Create the syntax tree of (r)#
2. Calculate nullable, firstpos, lastpos, followpos
3. Put firstpos(root) into the states of the DFA as an unmarked state.
4. while (there is an unmarked state S in the states of the DFA) do
   – mark S
   – for each input symbol a do
       let s1,...,sn be the positions in S whose symbol is a
       S’ ← followpos(s1) ∪ ... ∪ followpos(sn)
       Dtran[S,a] ← S’
       if (S’ is not in the states of the DFA) put S’ into the states of the DFA as an unmarked state
The start state of the DFA is firstpos(root).
The accepting states of the DFA are all states containing the position of #.

Example: (a|b)*a#
followpos(1)={1,2,3} followpos(2)={1,2,3} followpos(3)={4} followpos(4)={}
S1 = firstpos(root) = {1,2,3}
mark S1
    a: followpos(1) ∪ followpos(3) = {1,2,3,4} = S2, Dtran[S1,a] = S2
    b: followpos(2) = {1,2,3} = S1, Dtran[S1,b] = S1
mark S2
    a: followpos(1) ∪ followpos(3) = {1,2,3,4} = S2, Dtran[S2,a] = S2
    b: followpos(2) = {1,2,3} = S1, Dtran[S2,b] = S1
start state: S1
accepting states: {S2}
[figure: resulting DFA with states S1 and S2]

Example: (a|ε)bc*#
followpos(1)={2} followpos(2)={3,4} followpos(3)={3,4} followpos(4)={}
S1 = firstpos(root) = {1,2}
mark S1
    a: followpos(1) = {2} = S2, Dtran[S1,a] = S2
    b: followpos(2) = {3,4} = S3, Dtran[S1,b] = S3
mark S2
    b: followpos(2) = {3,4} = S3, Dtran[S2,b] = S3
mark S3
    c: followpos(3) = {3,4} = S3, Dtran[S3,c] = S3
start state: S1
accepting states: {S3}
[figure: resulting DFA with states S1, S2, S3]

Minimizing Number of DFA States
For any regular language, there is always a unique minimum-state DFA, which can be constructed from any DFA of the language.
Algorithm:
– Partition the set of states into two groups:
    G1: set of accepting states
    G2: set of non-accepting states
– For each group G, partition G into subgroups such that states s1 and s2 are in the same subgroup iff for all input symbols a, states s1 and s2 have transitions to states in the same group.
– Repeat the previous step on the new groups until no group can be split further.
– The start state of the minimized DFA is the group containing the start state of the original DFA.
– The accepting states of the minimized DFA are the groups containing the accepting states of the original DFA.

Minimizing DFA: Example (1)
[figure: DFA with states 1, 2, 3 over {a, b}]
G1 = {2}
G2 = {1,3}
G2 cannot be partitioned because
Dtran[1,a]=2  Dtran[1,b]=3
Dtran[3,a]=2  Dtran[3,b]=3
So the minimized DFA (with minimum states) has two states [figure]

Minimizing DFA: Example (2)
[figure: DFA with states 1 to 4 over {a, b}]
Groups: {1,2,3} {4}
     a         b
  1 -> 2    1 -> 3
  2 -> 2    2 -> 3
  3 -> 4    3 -> 3
{1,2,3} splits into {1,2} and {3}; no more partitioning
[figure: minimized DFA]
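The partition-refinement idea behind these examples can be sketched in Python as below. The function names are illustrative assumptions. The sample DFA is the three-state automaton S0, S1, S2 obtained earlier by the subset construction, which collapses to two groups.

```python
def group_of(state, partition):
    """Index of the group containing `state`, or -1 for a missing transition."""
    for i, group in enumerate(partition):
        if state in group:
            return i
    return -1


def minimize(states, alphabet, dtran, accepting):
    """Return the final partition of `states` as a list of frozensets."""
    accepting = set(accepting)
    partition = [g for g in (accepting, set(states) - accepting) if g]
    while True:
        new_partition = []
        for group in partition:
            # States stay together iff they go to the same group on every symbol.
            buckets = {}
            for s in group:
                key = tuple(group_of(dtran.get((s, a)), partition) for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            new_partition.extend(buckets.values())
        if len(new_partition) == len(partition):   # nothing was split: done
            return [frozenset(g) for g in new_partition]
        partition = new_partition


# DFA from the subset-construction example; S1 is the accepting state.
DTRAN = {
    ("S0", "a"): "S1", ("S0", "b"): "S2",
    ("S1", "a"): "S1", ("S1", "b"): "S2",
    ("S2", "a"): "S1", ("S2", "b"): "S2",
}
GROUPS = minimize({"S0", "S1", "S2"}, "ab", DTRAN, {"S1"})
```

Here `GROUPS` contains {S1} and {S0, S2}, which is the same two-state DFA that the direct RE → DFA construction produces for (a|b)*a.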

Architecture of a Lexical Analyzer [figure]

An NFA for Lex program
– Create an NFA for each regular expression
– Combine all the NFAs into one
– Introduce a new start state
– Connect it with ε-transitions to the start states of the NFAs

Pattern Matching with NFA
① The lexical analyzer reads the input and calculates the set of states it is in at each symbol.
② Eventually, it reaches a point with no next state.
③ It looks backwards in the sequence of sets of states, until it finds a set that includes one or more accepting states.
④ It picks the accepting state associated with the earliest pattern in the list from the Lex program.
⑤ It performs the action associated with that pattern.

Pattern Matching with NFA: Example
Input: aaba
Report pattern: a*b+
[figure]

Pattern Matching with DFA
① Convert the NFA for all the patterns into an equivalent DFA. For each DFA state that contains more than one accepting NFA state, choose the pattern defined earliest as the output of that DFA state.
② Simulate the DFA until there is no next state.
③ Trace back to the nearest accepting DFA state, and perform the associated action.
Input: abba
Report pattern: abb
[figure]
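The three steps can be sketched as a longest-match driver in Python. The DFA below is hand-derived by subset construction for the patterns a, abb and a*b+ (in that priority order), assuming a combined NFA numbered 0 to 8. The figure is not preserved in this transcript, so the state names and the helper name `longest_match_tokens` are assumptions.

```python
def longest_match_tokens(text, dtran, start, accept_pattern):
    """Split text into (pattern, lexeme) pairs using the longest-match rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        state = start
        last_end, last_pattern = None, None
        i = pos
        # Step 2: simulate the DFA until there is no next state.
        while i < len(text) and (state, text[i]) in dtran:
            state = dtran[(state, text[i])]
            i += 1
            if state in accept_pattern:        # remember the latest accepting point
                last_end, last_pattern = i, accept_pattern[state]
        # Step 3: fall back to the nearest accepting state seen.
        if last_end is None:
            raise ValueError(f"no pattern matches at offset {pos}")
        tokens.append((last_pattern, text[pos:last_end]))
        pos = last_end
    return tokens


# DFA states are named after the NFA state sets they stand for (assumed numbering).
DTRAN = {
    ("0137", "a"): "247", ("0137", "b"): "8",
    ("247", "a"): "7",    ("247", "b"): "58",
    ("7", "a"): "7",      ("7", "b"): "8",
    ("8", "b"): "8",
    ("58", "b"): "68",
    ("68", "b"): "8",
}
# "68" holds accepting NFA states for both abb and a*b+; abb is defined earlier, so it wins.
ACCEPT = {"247": "a", "8": "a*b+", "58": "a*b+", "68": "abb"}
```

On input abba the driver reports abb first, and on aaba it reports a*b+ with lexeme aab, in line with the two examples above.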

Summary
How lexical analysers work
– Convert REs to NFA
– Convert NFA to DFA
– Minimise DFA
– Use the minimised DFA to recognise tokens in the input
– Use priorities and the longest-match rule

Homework
Exercise (c)
Exercise (c)
Exercise (a)
Due date: Sept. 29, 2012 (Saturday)