Two issues in lexical analysis Specifying tokens (regular expression) Identifying tokens specified by regular expression.
How to recognize tokens specified by regular expressions? A recognizer for a language is a program that takes a string x as input and answers “yes” if x is a sentence of the language and “no” otherwise. In the context of lexical analysis, given a string and a regular expression, a recognizer of the language specified by the regular expression answer “yes” if the string is in the language. A regular expression can be compiled into a recognizer (automatically) by constructing a finite automata which can be deterministic or non-deterministic.
Non-deterministic finite automata (NFA) A non-deterministic finite automata (NFA) is a mathematical model that consists of: (a 5-tuple a set of states Q a set of input symbols a transition function that maps state-symbol pairs to sets of states. A state q0 that is distinguished as the start (initial) state A set of states F distinguished as accepting (final) states. An NFA accepts an input string x if and only if there is some path in the transition graph from the start state to some accepting state. Show an NFA example (page 116, Figure 3.21).
For example, here is an NFA that recognizes the language ???. An NFA is non-deterministic in that (1) same character can label two or more transitions out of one state (2) empty string can label transitions. For example, here is an NFA that recognizes the language ???. An NFA can easily implemented using a transition table. State a b 0 {0, 1} {0} 1 - {2} 2 - {3} a 1 2 3 a b b b
The algorithm that recognizes the language accepted by NFA. Input: an NFA (transition table) and a string x (terminated by eof). output “yes” if accepted, “no” otherwise. S = e-closure({s0}); a = nextchar; while a != eof do begin S = e-closure(move(S, a)); a := next char; end if (intersect (S, F) != empty) then return “yes” else return “no” Note: e-closure({S}) are the state that can be reached from states in S through transitions labeled by the empty string.
Example: recognizing ababb from previous NFA Example2: Use the example in Fig. 3.27 for recognizing ababb Space complexity O(|S|), time complexity O(|S|^2|x|)??
Construct an NFA from a regular expression: Input: A regular expression r over an alphabet Output: An NFA N accepting L( r ) Algorithm (3.3, pages 122): For , construct the NFA For a in , construct the NFA Let N(s) and N(t) be NFA’s for regular s and t: for s|t, construct the NFA N(s|t): For st, construct the NFA N(st): For s*, construct the NFA N(s*): a N(s) N(t) N(s) N(t) N(s)
Example: r = (a|b)*abb. Example: using algorithm 3.3 to construct N( r ) for r = (ab | a)*b* | b.
Using NFA, we can recognize a token in O(|S|^2|X|) time, we can improve the time complexity by using deterministic finite automaton instead of NFA. An NFA is deterministic (a DFA) if no transitions on empty-string for each state S and an input symbol a, there is at most one edge labeled a leaving S. What is the time complexity to recognize a token when a DFA is used?
Algorithm to convert an NFA to a DFA that accepts the same language (algorithm 3.2, page 118) initially e-closure(s0) is the only state in Dstates and it is unmarked while there is an unmarked state T in Dstates do begin mark T; for each input symbol a do begin U := e-closure(move(T, a)); if (U is not in Dstates) then add U as an unmarked state to Dstates; Dtran[T, a] := U; end end; Initial state = e-closure(s0), Final state = ?
Example: page 120, fig 3.27. Question: for a NFA with |S| states, at most how many states can its corresponding DFA have?