LEXICAL ANALYSIS
Phung Hua Nguyen
University of Technology
2006
Faculty of IT - HCMUT

Outline
Introduction to Lexical Analysis
Token specification
  –Language
  –Regular Expressions (REs)
Token recognition
  –REs → NFA (Thompson’s construction, Algorithm 3.3)
  –NFA → DFA (subset construction, Algorithm 3.2)
  –DFA → minimal DFA (Algorithm 3.6)
Programming
Introduction
The lexical analyzer must:
  –Read the input characters
  –Produce as output a sequence of tokens
  –Eliminate white space and comments
[Figure: the source program feeds the lexical analyzer; the parser sends "get next token" requests to the lexical analyzer and receives a token in return; both use the symbol table.]
Why?
  –Simplify design
  –Improve compiler efficiency
  –Enhance compiler portability
Tokens, Patterns, Lexemes
Token | Sample lexemes | Informal description of pattern
const | const | const
if | if | if
relation | <, <=, =, <>, >, >= | < or <= or = or <> or > or >=
id | pi, count, x2 | letter followed by letters or digits
num | 3.14, 25, 6.02E3 | any numeric constant
literal | "core dumped" | any characters between " and " except "
Alphabet, Strings and Languages
Alphabet ∑: any finite set of symbols
  –The Vietnamese alphabet {a, á, à, ả, ã, ạ, b, c, d, đ, …}
  –The binary alphabet {0, 1}
  –The ASCII alphabet
String: a finite sequence of symbols drawn from ∑
  –Length |s| of a string s: the number of symbols in s
  –The empty string, denoted ε, has |ε| = 0
Language: any set of strings over ∑
  –Two special cases: ∅, the empty set, and {ε}, the set containing only the empty string
Examples of Languages
∑ = {a, á, à, ả, ã, ạ, b, c, d, đ, …}
  –The Vietnamese language
∑ = {0, 1}
  –A string is an instruction
  –The set of Pentium instructions
∑ = the ASCII set
  –A string is a program
  –The set of C programs
Terms (Fig. 3.7)
Term | Definition
prefix of s | a string obtained by removing 0 or more trailing symbols of s; e.g. ban is a prefix of banana
suffix of s | a string formed by deleting 0 or more leading symbols of s; e.g. na is a suffix of banana
substring of s | a string obtained by deleting a prefix and a suffix from s; e.g. nan is a substring of banana
proper prefix, suffix or substring of s | any nonempty string x that is, respectively, a prefix, suffix or substring of s such that s ≠ x
String operations
String concatenation
  –If x and y are strings, xy is the string formed by appending y to x.
   E.g.: x = hom, y = nay, so xy = homnay
  –ε is the identity: εy = y; xε = x
String exponentiation
  –s^0 = ε
  –s^i = s^(i-1) s
   E.g. s = 01: s^0 = ε, s^2 = 0101, s^3 = 010101
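The two string operations can be tried out directly. The following is a minimal Python sketch, not part of the original slides; the function name `power` is chosen here for illustration, and Python's empty string `""` stands in for ε.

```python
def power(s: str, i: int) -> str:
    """Return s^i using the recursive definition s^0 = ε, s^i = s^(i-1) s."""
    if i == 0:
        return ""                 # "" plays the role of the empty string ε
    return power(s, i - 1) + s    # s^i = s^(i-1) followed by s


x, y = "hom", "nay"
xy = x + y                        # concatenation: append y to x
print(xy)
print(power("01", 2))
```

Concatenating with `""` on either side leaves a string unchanged, which is the identity property of ε stated above.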
Language Operations (Fig. 3.8)
Term | Definition
union: L ∪ M | L ∪ M = { s | s ∈ L or s ∈ M }
concatenation: LM | LM = { st | s ∈ L and t ∈ M }
Kleene closure: L* | L* = L^0 ∪ L ∪ LL ∪ LLL ∪ …, where L^0 = {ε}; 0 or more concatenations of L
positive closure: L+ | L+ = L ∪ LL ∪ LLL ∪ …; 1 or more concatenations of L
Examples
L = {A, B, …, Z, a, b, …, z}
D = {0, 1, …, 9}
Example | Language
L ∪ D | letters and digits
LD | strings consisting of a letter followed by a digit
L^4 | all four-letter strings
L* | all strings of letters, including ε
L(L ∪ D)* | all strings of letters and digits beginning with a letter
D+ | all strings of one or more digits
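For small finite languages the operations above can be computed with ordinary sets. This is a hedged Python sketch: the helper names `concat`, `lang_power` and `closure_up_to` are invented for illustration, the two-symbol sets stand in for the full letter and digit sets, and because L* is infinite the closure is only approximated up to a chosen bound n.

```python
from itertools import product


def concat(L, M):
    """LM = { st | s in L and t in M }."""
    return {s + t for s, t in product(L, M)}


def lang_power(L, n):
    """L^n, with L^0 = {ε} (ε is represented by the empty string)."""
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result


def closure_up_to(L, n):
    """Finite approximation of L*: the union L^0 ∪ L^1 ∪ ... ∪ L^n."""
    out = set()
    for i in range(n + 1):
        out |= lang_power(L, i)
    return out


L = {"a", "b"}          # a tiny stand-in for the set of letters
D = {"0", "1"}          # a tiny stand-in for the set of digits
print(sorted(L | D))            # union
print(sorted(concat(L, D)))     # a letter followed by a digit
```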
Regular Expressions (REs) over Alphabet ∑
RL below stands for regular language.
Inductive base:
  1. ε is a RE, denoting the RL {ε}
  2. a ∈ ∑ is a RE, denoting the RL {a}
Inductive step: suppose r and s are REs, denoting the languages L(r) and L(s). Then
  3. (r)|(s) is a RE, denoting the RL L(r) ∪ L(s)
  4. (r)(s) is a RE, denoting the RL L(r)L(s)
  5. (r)* is a RE, denoting the RL (L(r))*
  6. (r) is a RE, denoting the RL L(r)
Precedence and Associativity
Precedence:
  –“*” has the highest precedence
  –“concatenation” has the second highest precedence
  –“|” has the lowest precedence
Associativity:
  –all are left-associative
E.g.: (a)|((b)*(c)) is equivalent to a|b*c
Unnecessary parentheses can be removed
Example
∑ = {a, b}
  1. a|b denotes {a, b}
  2. (a|b)(a|b) denotes {aa, ab, ba, bb}
  3. a* denotes {ε, a, aa, aaa, aaaa, …}
  4. (a|b)* denotes ?
  5. a|a*b denotes ?
Notational Shorthands
One or more instances +: r+ = rr*
  –denotes the language (L(r))+
  –has the same precedence and associativity as *
Zero or one instance ?: r? = r|ε
  –denotes the language L(r) ∪ {ε}
Character classes
  –[abc] denotes a|b|c
  –[A-Z] denotes A|B|…|Z
  –[a-zA-Z_][a-zA-Z0-9_]* denotes ?
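The shorthands above carry over almost unchanged to practical regex libraries. As a hedged illustration (the slides do not name a tool), the sketch below uses Python's standard `re` module; `matches` is a helper name introduced here, and `re.fullmatch` is used so that the whole string must belong to the language. Note that Python's engine supports more than the regular expressions defined in this lecture, so only the operators shown above are used.

```python
import re

# Pattern taken from the character-class example above.
IDENT = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def matches(pattern, s: str) -> bool:
    """True iff the entire string s is in the language of the pattern."""
    return re.fullmatch(pattern, s) is not None


for candidate in ["count", "x2", "_tmp", "2x", ""]:
    print(repr(candidate), matches(IDENT, candidate))
```

Trying strings such as `"aab"` against `a|a*b`, or the empty string against `(a|b)*`, is a quick way to check answers to the exercises on the previous slides.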
Overview
RE → NFA → DFA → mDFA (minimal DFA)
Nondeterministic finite automata
A nondeterministic finite automaton (NFA) is a mathematical model that consists of
  –a finite set of states S
  –a set of input symbols ∑
  –a transition function move that maps a state and an input symbol (or ε) to a set of states
  –a start state s0
  –a finite set of final or accepting states F
Transition graph
[Figure: notation for a state (A), a transition (an edge labelled a from A to B), the start state, and a final state.]
Transition table
State | Input symbol a | Input symbol b
0 | {0,1} | {0}
1 | - | {2}
2 | - | {3}
Acceptance
An NFA accepts an input string x iff there is some path in the transition graph from the start state to some accepting state such that the edge labels along this path spell out x.
[Figure: trace of the possible paths through states A and B on the input 0 1 0; a path with no available transition ends in error.]
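The acceptance condition can be checked mechanically by tracking the set of all states the NFA could be in after each symbol. The sketch below is illustrative Python, not from the slides. It encodes the transition table shown earlier (states 0 to 2 on inputs a and b) and assumes, since the table does not say so, that state 0 is the start state and state 3 is the only accepting state; this NFA has no ε-transitions, so none are handled.

```python
# Transition table: (state, symbol) -> set of next states.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0          # assumption: state 0 is the start state
ACCEPTING = {3}    # assumption: state 3 is the accepting state


def accepts(x: str) -> bool:
    """True iff some path from START to an accepting state spells out x."""
    current = {START}
    for ch in x:
        # Follow every possible transition on ch from every current state.
        current = set().union(*(NFA.get((s, ch), set()) for s in current))
    return bool(current & ACCEPTING)


for word in ["abb", "aabb", "ab", ""]:
    print(repr(word), accepts(word))
```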
Deterministic finite automata
A deterministic finite automaton (DFA) is a special case of NFA in which
  1. no state has an ε-transition, and
  2. for each state s and input symbol a, there is at most one edge labeled a leaving s.
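Thompson’s construction of NFA from REs
The construction is guided by the syntactic structure of the RE r.
  –For ε: an NFA with start state i, final state f, and a single edge labelled ε from i to f
  –For a in ∑: an NFA with start state i, final state f, and a single edge labelled a from i to f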
Thompson’s construction (cont’d)
Suppose N(s) and N(t) are NFAs for REs s and t
  –For s|t: [figure]
  –For st: [figure]
  –For s*: [figure]
  –For (s), use N(s) itself
[Figures not reproduced: the composite NFAs built from N(s) and N(t), each with start state i and final state f.]
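Since the diagrams for the composite cases are not reproduced, here is a hedged Python sketch of one common way to realise them. An NFA fragment is a triple (start, final, edges), where each edge is (source, label, target) and a label of `None` means ε. All function names are illustrative. One deliberate simplification: `concat` links the two fragments with an ε-edge instead of merging the final state of N(s) with the start state of N(t); both variants accept the same language. The expression (a|b)*abb is used only as a test case.

```python
from itertools import count

_ids = count()


def new_state():
    return next(_ids)


def symbol(a):
    """NFA for a single symbol a, or for ε when a is None."""
    i, f = new_state(), new_state()
    return i, f, [(i, a, f)]


def union(n1, n2):
    """NFA for s|t: new start and final states joined to both fragments by ε."""
    (i1, f1, e1), (i2, f2, e2) = n1, n2
    i, f = new_state(), new_state()
    return i, f, e1 + e2 + [(i, None, i1), (i, None, i2),
                            (f1, None, f), (f2, None, f)]


def concat(n1, n2):
    """NFA for st (simplified: ε-link instead of merging states)."""
    (i1, f1, e1), (i2, f2, e2) = n1, n2
    return i1, f2, e1 + e2 + [(f1, None, i2)]


def star(n):
    """NFA for s*: allow skipping N(s) and looping back through it."""
    i1, f1, e1 = n
    i, f = new_state(), new_state()
    return i, f, e1 + [(i, None, i1), (f1, None, f),
                       (f1, None, i1), (i, None, f)]


def eps_closure(states, edges):
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for src, label, dst in edges:
            if src == s and label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(nfa, x):
    i, f, edges = nfa
    current = eps_closure({i}, edges)
    for ch in x:
        moved = {dst for src, label, dst in edges
                 if src in current and label == ch}
        current = eps_closure(moved, edges)
    return f in current


# (a|b)*abb, built bottom-up following the structure of the expression.
nfa = concat(concat(concat(star(union(symbol("a"), symbol("b"))),
                           symbol("a")), symbol("b")), symbol("b"))
print(accepts(nfa, "babb"), accepts(nfa, "ab"))
```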
Subset construction
s: an NFA state; T: a set of NFA states
Operation | Description
ε-closure(s) | Set of NFA states reachable from state s on ε-transitions alone
ε-closure(T) | Set of NFA states reachable from some state s in T on ε-transitions alone
move(T, a) | Set of NFA states to which there is a transition on input a from some state s in T
Subset construction (cont’d)
Let s0 be the start state of the NFA.
Initially, ε-closure(s0) is the only state in Dstates and it is unmarked;
while there is an unmarked state T in Dstates do begin
    mark T;
    for each input symbol a do begin
        U := ε-closure(move(T, a));
        if U is not in Dstates then
            add U as an unmarked state to Dstates;
        DTran[T, a] := U
    end
end
DFA
Let (∑, S, T, F, s0) be the original NFA. The DFA is:
  –The alphabet: ∑
  –The states: all states in Dstates
  –The transitions: DTran
  –The accepting states: all states in Dstates containing at least one accepting state in F of the NFA
  –The start state: ε-closure(s0)
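The subset construction and the resulting DFA components can be sketched in Python as follows. This is an illustrative sketch, not the lecture's own code: the names mirror the pseudocode (`dstates`, `dtran`), DFA states are `frozenset`s of NFA states, and ε-transitions are stored under the symbol `None`. The sample NFA is the transition table shown earlier, with the assumption that state 0 is the start state and state 3 is accepting.

```python
def eps_closure(T, trans):
    """States reachable from any state in T on ε-transitions alone."""
    stack, closure = list(T), set(T)
    while stack:
        s = stack.pop()
        for t in trans.get((s, None), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(T, a, trans):
    """States reachable on input a from some state in T."""
    return {t for s in T for t in trans.get((s, a), set())}


def subset_construction(start, accepting, alphabet, trans):
    d_start = eps_closure({start}, trans)
    dstates = {d_start}
    unmarked = [d_start]
    dtran = {}
    while unmarked:
        T = unmarked.pop()                      # mark T
        for a in alphabet:
            U = eps_closure(move(T, a, trans), trans)
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    # A DFA state accepts if it contains at least one accepting NFA state.
    d_accepting = {T for T in dstates if T & accepting}
    return d_start, dstates, dtran, d_accepting


nfa_trans = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
d_start, dstates, dtran, d_accepting = subset_construction(
    0, {3}, "ab", nfa_trans)
for (T, a), U in sorted(dtran.items(), key=lambda kv: (sorted(kv[0][0]), kv[0][1])):
    print(sorted(T), a, "->", sorted(U))
```

If move(T, a) is empty, U is the empty set, which then acts as a dead state of the DFA; the sample NFA never produces it.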
Minimise a DFA
Initially, create two groups of states (each group becomes one state of the minimal DFA):
  1. one is the set of all final states: F
  2. the other is the set of all non-final states: S - F
while (more splits are possible) {
    Let G = {s1, …, sn} be a group and c be any char in ∑
    Let t1, …, tn be the successor states of s1, …, sn under c
    if (t1, …, tn don't all belong to the same group) {
        Split G into new groups so that si and sj remain in the same group
        iff ti and tj are in the same group
    }
}
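Example
[Figure: DFA with states A, B, C, D, E and transitions on a and b.]
Step 1: {A,B,C,D} {E}
  For a, the successors of A, B, C, D are {B,B,B,B}
  For b, the successors are {C,D,C,E}
  Split: {A,B,C} {D} {E}
Step 2: For b, the successors of A, B, C are {C,D,C}
  Split: {A,C} {B} {D} {E}
Step 3: For a, the successors of A, C are {B,B}
  For b, the successors are {C,C}
  Terminate
[Figure: the minimal DFA with states A, B, D, E, where A stands for the group {A,C}.]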
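The splitting procedure can be replayed in code on this example. The following Python sketch is illustrative only. The transitions of A, B, C and D are read off the steps above; the transitions of E (to B on a and to C on b) and the choice of E as the only final state are assumptions, since the figure is not reproduced. The function expects a total transition function.

```python
def minimise(states, alphabet, delta, finals):
    """Refine {finals, non-finals} until no group can be split further."""
    partition = [g for g in (set(finals), set(states) - set(finals)) if g]
    changed = True
    while changed:
        changed = False
        for group in partition:
            for c in alphabet:
                # Bucket the members of the group by the group that
                # contains their successor under c.
                buckets = {}
                for s in group:
                    target = next(i for i, g in enumerate(partition)
                                  if delta[(s, c)] in g)
                    buckets.setdefault(target, set()).add(s)
                if len(buckets) > 1:            # successors disagree: split
                    partition.remove(group)
                    partition.extend(buckets.values())
                    changed = True
                    break
            if changed:
                break                           # restart with the new partition
    return partition


delta = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",   # assumed: not recoverable from the steps
}
groups = minimise("ABCDE", "ab", delta, {"E"})
print(sorted(sorted(g) for g in groups))
```

The search restarts after every split, which is simple rather than efficient; it is enough to reproduce the three steps of the example.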
Input Buffering
[Figure: the scanner reads from an input buffer split into two halves, holding "begin…" and terminated by eof.]
if (forward at end of first half) {
    reload second half
    forward++
}
else if (forward at end of second half) {
    reload first half
    forward = 0
}
else forward++
Input Buffering (with sentinels)
[Figure: the same two-half buffer, with an eof sentinel at the end of each half.]
forward = forward + 1
if (forward↑ = eof) {
    if (forward at end of first half) {
        reload second half
        forward++
    }
    else if (forward at end of second half) {
        reload first half
        forward = 0
    }
    else terminate the analysis
}
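The sentinel scheme can be sketched in Python as below. This is a hedged illustration rather than the lecture's implementation: the class name `TwoHalfBuffer`, the half size `n`, and the use of the NUL character as the eof sentinel (assumed never to occur in the source text) are all choices made here. The buffer layout is: first half at positions 0..n-1, a sentinel at n, second half at n+1..2n, and a sentinel at 2n+1. The lexeme-beginning pointer is not modelled.

```python
import io

EOF = "\0"   # assumed sentinel: must not occur in the source text


class TwoHalfBuffer:
    def __init__(self, stream, n=4):
        self.stream, self.n = stream, n
        self.buf = [EOF] * (2 * n + 2)     # both sentinel slots stay EOF
        self._load(0)                      # fill the first half
        self.forward = -1

    def _load(self, start):
        """Read up to n characters into the half beginning at start."""
        chunk = self.stream.read(self.n)
        for k in range(self.n):
            # A short read leaves EOF inside the half, marking end of input.
            self.buf[start + k] = chunk[k] if k < len(chunk) else EOF

    def next_char(self):
        """Advance forward; return the next character, or None at end of input."""
        self.forward += 1
        if self.buf[self.forward] == EOF:          # the only test in the common case
            if self.forward == self.n:             # end of first half
                self._load(self.n + 1)
                self.forward += 1
            elif self.forward == 2 * self.n + 1:   # end of second half
                self._load(0)
                self.forward = 0
            else:
                return None                        # eof inside a half: terminate
        ch = self.buf[self.forward]
        return None if ch == EOF else ch


def read_all(buffer):
    chars = []
    while True:
        c = buffer.next_char()
        if c is None:
            return "".join(chars)
        chars.append(c)


print(read_all(TwoHalfBuffer(io.StringIO("begin x := 1 end"), n=4)))
```

The point of the sentinel is visible in `next_char`: ordinary characters cost a single comparison, and the end-of-half checks run only when that comparison hits eof.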
Transition Diagrams
relop:
  state 0 (start) --'<'--> state 1
  state 1 --'='--> state 2: return(relop, LE)
  state 1 --'>'--> state 3: return(relop, NE)
  state 1 --other--> state 4: return(relop, LT)
id: letter(letter|digit)*
  state 5 (start) --letter--> state 6
  state 6 --letter or digit--> state 6
  state 6 --other--> state 7: return(id, lexeme)
A transition diagram is a DFA in which no edge leaves a final state.
Implementation
Token nexttoken() {
    while (true) {
        switch (state) {
        case 0:                               // start of the relop diagram
            c = nextchar();
            if (c == '<') state = 1;
            else state = fail(0);             // not a relop: try the next diagram
            break;
        case 1:
            c = nextchar();
            if (c == '=') state = 2;
            else if (c == '>') state = 3;
            else state = 4;
            break;
        case 2: retract(0); return new Token(relop, "<=");
        case 3: retract(0); return new Token(relop, "<>");
        case 4: retract(1); return new Token(relop, "<");   // give back the "other" char
        case 5:                               // start of the id diagram
            c = nextchar();
            if (Character.isLetter(c)) state = 6;
            else state = fail(5);
            break;
        case 6:
            c = nextchar();
            if (Character.isLetter(c) || Character.isDigit(c)) continue;  // stay in state 6
            else state = 7;
            break;
        case 7: retract(1); return new Token(id, getLexeme());
        }
    }
}
Implementation (cont’d)
int fail(int current_state) {
    forward = beginning;              // rescan the lexeme from its start
    switch (current_state) {
    case 0: return 5;                 // relop diagram failed: try the id diagram
    case 5: error(); break;           // no diagram left to try
    }
    return -1;                        // no valid state; reached only if error() returns
}
void retract(int flag) {
    if (flag == 1) move forward back
    get lexeme from beginning to forward
    move forward onward
    beginning = forward
    state = 0
}
[Input buffer: b│e│g│i│n│:│=│ │ │…]
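To see the state machine run end to end, here is a hedged Python transcription of the two transition diagrams. It is a sketch under simplifying assumptions, not the lecture's code: the input is a plain string instead of a two-half buffer, `next_token` returns the token together with the position where the next scan should begin, tokens are plain tuples, and state numbers 0 to 7 follow the diagrams.

```python
def next_token(text, beginning):
    """Scan one token starting at index beginning; return (token, new_position)."""
    state, forward = 0, beginning

    def nextchar():
        nonlocal forward
        c = text[forward] if forward < len(text) else ""   # "" marks end of input
        forward += 1
        return c

    while True:
        if state == 0:                        # relop diagram
            if nextchar() == "<":
                state = 1
            else:                             # fail(0): rewind, try the id diagram
                forward, state = beginning, 5
        elif state == 1:
            c = nextchar()
            state = 2 if c == "=" else 3 if c == ">" else 4
        elif state == 2:
            return ("relop", "LE"), forward
        elif state == 3:
            return ("relop", "NE"), forward
        elif state == 4:                      # "other": retract one character
            return ("relop", "LT"), forward - 1
        elif state == 5:                      # id diagram
            if nextchar().isalpha():
                state = 6
            else:                             # fail(5): no diagram matches
                raise ValueError(f"unexpected input at position {beginning}")
        elif state == 6:
            c = nextchar()
            if not (c.isalpha() or c.isdigit()):
                state = 7
        elif state == 7:                      # retract, then emit the lexeme
            forward -= 1
            return ("id", text[beginning:forward]), forward


pos, source = 0, "count<=x2"
while pos < len(source):
    token, pos = next_token(source, pos)
    print(token)
```

Returning the new position plays the role of `beginning = forward` in `retract`: the caller passes it back in to scan the following token.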