Lexical Analysis.


1 Lexical Analysis

2 Lexical Analysis
What is lexical analysis?
How to specify the acceptable tokens in a language?
  E.g., identifiers, integers, strings with an odd number of 1’s, etc.
How to identify tokens based on the language?
  Intuitively: RE to NFA to DFA to scanner
  Combine multiple REs
  Is it necessary to look ahead?
  Rules to resolve the potential ambiguities
Tool: lex (input: REs, output: scanner) (tutorial)

3 What is Lexical Analysis?
Scanner = lexical analyzer
  Input: program viewed as a string
  Output: a sequence of tokens
  Process: match the language definitions and identify the tokens
Token: an indivisible unit with a certain logical meaning in the language
  Keywords: if, else, while, etc.
  Identifiers: abc, xyz, etc.
  Integers: 123, 456, etc.
  Operators: +, *, etc.

4 How to Specify Acceptable Tokens
Use regular expressions (REs)
  Formal description, no ambiguity
RE notation
  Σ : the set of all characters that are acceptable in the language
  X | Y : alternation (X and Y are strings that are already defined in the language)
  X • Y : concatenation
  X* : repetition (zero or more occurrences of X)
  ε : empty string
  ( ) : enclose an expression
Precedence, highest first: ( ), *, •, |

5 How to Specify Acceptable Tokens
Some additional conventions that are not part of the formal RE notation
  X+ : X • X* (one or more occurrences of X)
  X? : optional, zero or one occurrence of X
  [a-f] : one of the characters in the range
  [a-f, A-F] : multiple ranges (or characters)

6 How to Specify Acceptable Tokens
Examples
  digit = [0-9]
  alphabet = [a-z, A-Z]
  sign = + | - | ε
  decimal-point = .
  identifier = alphabet • (alphabet | digit)*
  integer = sign • (0 | [1-9] • digit*)
  float = integer | integer • decimal-point • digit* | sign • decimal-point • digit+
  keyword = if | else | while | … (actually i • f | …)
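The example definitions on this slide can be tried out with Python's re module. This is only a sketch: the constant names and the matches helper are made up for illustration, and Python regex syntax stands in for the formal RE notation (for instance, [+-]? plays the role of the sign definition, which allows an empty sign).

```python
import re

# Building blocks from the slide, written as Python regex fragments.
DIGIT = r"[0-9]"
ALPHABET = r"[a-zA-Z]"
SIGN = r"[+-]?"  # + | - | empty
IDENTIFIER = rf"{ALPHABET}(?:{ALPHABET}|{DIGIT})*"
INTEGER = rf"{SIGN}(?:0|[1-9]{DIGIT}*)"
# float = integer | integer . digit* | sign . digit+
FLOAT = rf"(?:{INTEGER}\.{DIGIT}*|{SIGN}\.{DIGIT}+|{INTEGER})"


def matches(pattern: str, text: str) -> bool:
    """Return True when the whole text is described by the pattern."""
    return re.fullmatch(pattern, text) is not None
```

For example, matches(IDENTIFIER, "abc1") holds, while matches(INTEGER, "007") does not, because the integer definition forbids leading zeros.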

7 How to Identify Tokens
Intuitively
  read (ch); case (ch) an alphabet: … a numeral: …
  For each case, the scanner needs to read further, and there are many more possibilities
  The various cases have to be merged together
  The program can be very complicated and hard to understand
Finite state automata (FA)
  Make the scanning process much easier
  An RE can be mapped to an NFA automatically
  An NFA can be converted to a DFA automatically
  A DFA can be converted to a scanner automatically

8 Finite State Automata
Finite state automaton
  A = (S, Σ, s0, F, T)
  S: all states in the FA
  Σ: all symbols accepted by the language
  s0: the start state
  F: accepting states
  T: all transitions, S × Σ → S (or S × (Σ ∪ {ε}) → S)
Why finite
  Has a bounded number of states
  Uses bounded memory space

9 Finite State Automata
Processing input string
  Starting from s0, make a transition on the automaton for each input character
  If no transition is possible for the input character → error
  When the input string is fully consumed
    If at a final state → accept
    Otherwise → error
  If accepted, then s ∈ L(RE)
    L(RE): the language defined by RE
[Figure: input string s → FA representing RE → Accept/No]
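The processing rule on this slide can be sketched in a few lines of Python. The run_dfa function and the ODD_ONES table are hypothetical names chosen for this illustration; the example automaton accepts strings with an odd number of 1’s, one of the token examples mentioned earlier in the deck.

```python
def run_dfa(transitions, start, accepting, text):
    """Simulate a DFA on text and return True if the text is accepted."""
    state = start
    for ch in text:
        key = (state, ch)
        if key not in transitions:
            return False  # no transition possible for this character: error
        state = transitions[key]
    # The input is fully consumed: accept only at a final state.
    return state in accepting


# Hypothetical DFA over {0, 1}: accepts strings with an odd number of 1's.
ODD_ONES = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd", ("odd", "1"): "even",
}
```

Calling run_dfa(ODD_ONES, "even", {"odd"}, "10110") follows one transition per character and accepts, since the string contains three 1’s.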

10 NFA and DFA
NFA
  Allows one state-symbol pair to be mapped to multiple states
  Allows ε transitions ( (s1, ε) → s2 )
  RE → NFA mapping is very straightforward
DFA
  Deterministic mapping
  One state-symbol pair can be mapped to only one state
  No ε transitions
[Figures: an example NFA and an example DFA, each with states 1, 2, 3 and transitions on a and b; (a|b)*abb]

11 NFA and RE
Example NFA, Σ = {0,1}
  Accepting any number of 1’s followed by a single 0
    RE = 1*0
  Accepting any string with 00 at the end
    RE = (0|1)*00
[Figures: NFAs for the two REs]

12 Systematically Constructing NFA from RE
For a single character: a
For a simple substring: ab
Repetition: A*, A+
  Examples: (ab)+, a*b
[Figures: NFA fragments for each case]

13 Systematically Constructing NFA from RE
Alternation
  Given NFAs for A and B, construct A|B
Concatenation
  Concatenate substrings A and B
[Figures: NFA fragments for A|B and for A followed by B]

14 Lexical Analysis based on NFA
Example NFA
  Σ = {a,b}
  RE = b*(a|b)
  For input b
    From state 1, should it go to state 2 or back to state 1?
    Suppose it goes to state 2
    For bbbb, it then needs to backtrack
Processing an input string based on an NFA
  May need to backtrack
  May end up exploring all the paths in the NFA
  In a more complicated NFA, the cost can be very high
Solution: convert the NFA to a DFA
[Figure: NFA for b*(a|b) with states 1 and 2]

15 Converting NFA to DFA
Conversion
  Given an NFA, find the DFA with the minimum number of states that has the same behavior as the NFA for all inputs
ε-closure
  Given a set of NFA states T, ε-closure(T) is the set of states that are reachable through ε-transitions from any state in T (including the states of T themselves)
Move (T, a)
  All states that are reachable from T after reading a character a
  Don’t forget to include the ε-closure of all these states

16 Converting NFA to DFA (cont.)
Simulating the NFA: subset construction
Construct the initial state of the DFA
  By finding the ε-closure of the initial state of the NFA
From a state T in the DFA, for each input character a
  Find the set of states in Move (T, a)
  Make Move (T, a) a state in the DFA if it is not there yet
  If Move (T, a) contains at least one final state of the NFA, mark it as a final state in the DFA
  For convenience, consider the characters with the same Move (T, a) all together
Repeat the step above for all states in the DFA that have not been processed yet (use a stack to keep track of them)
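The steps above can be sketched as follows. This is a minimal illustration, not the deck's own code: the function names, the use of an empty string as the ε label, and the dictionary layout of the NFA are all assumptions of this sketch. The example NFA is the intuitive four-state automaton for (a|b)*abb, numbered 0 to 3 here.

```python
EPS = ""  # label used in this sketch for an epsilon transition (an assumption)


def eps_closure(nfa, states):
    """All NFA states reachable from `states` using only epsilon transitions."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(nfa, states, ch):
    """States reachable from `states` on ch, including their epsilon closure."""
    targets = {t for s in states for t in nfa.get((s, ch), ())}
    return eps_closure(nfa, targets)


def subset_construction(nfa, start, finals, alphabet):
    """Return (dfa_transitions, dfa_start, dfa_finals); DFA states are frozensets."""
    dfa_start = eps_closure(nfa, {start})
    dfa, dfa_finals = {}, set()
    seen, todo = {dfa_start}, [dfa_start]  # todo is the stack of unprocessed states
    while todo:
        T = todo.pop()
        if T & finals:  # contains at least one NFA final state
            dfa_finals.add(T)
        for ch in alphabet:
            U = move(nfa, T, ch)
            if not U:
                continue  # no transition on ch from T
            dfa[(T, ch)] = U
            if U not in seen:
                seen.add(U)
                todo.append(U)
    return dfa, dfa_start, dfa_finals


# Intuitive NFA for (a|b)*abb: state 0 loops on a and b, then a, b, b leads to 3.
NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
```

Running subset_construction(NFA, 0, {3}, "ab") on this NFA produces four DFA states, namely {0}, {0,1}, {0,2}, and {0,3}, with {0,3} as the only final state.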

17 Converting NFA to DFA (cont.)
(a|b)*abb (Intuitive Construction)
[Figure: NFA with states S0, S1, S2, S3 and transitions on a and b]

18 (a|b)*abb (Intuitive Construction)

19 RE-NFA-DFA (different construction)
(a|b)*abb Formal Construction

20 (a|b)*abb

21 (a|b)*abb

22 DFA Minimization
(a|b)*abb
  A, C, and E have the same transitions
  But E is a final state and A and C are not
  So only A and C can be merged
[Figure: DFA for (a|b)*abb, before and after merging A and C]

23 DFA Minimization
The previous method does not always produce the minimal DFA
[Figures: two example DFAs over states S0 to S4 with a transition table; one is labeled "Actually can be minimized further", the other "Cannot merge further"]

24 DFA Minimization
How to identify states that can be merged?
  Starting from states s and t, for all strings x
    If the acceptance decision is always the same, then s and t are indistinguishable (equivalent)
  Final states and non-final states can never be merged
Method
  Initialization: divide the states into two groups, final states and non-final states
  Division within a group G
    If, for each input symbol a, two states s and t in G have transitions on a to the same group, then s and t stay in the same group
    Otherwise, divide G and put s and t into different groups
  Repeat the division until the grouping no longer changes
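The grouping method can be sketched as a small partition-refinement loop. The minimize function and its argument layout are hypothetical. The DELTA table is the five-state DFA that the textbook subset construction commonly yields for (a|b)*abb; the deck's own figure did not survive extraction, so treat the table as an assumption.

```python
def minimize(states, alphabet, delta, finals):
    """Return the final grouping of `states` as a list of sets.

    `delta` maps (state, symbol) -> state. A missing entry is treated as a
    move to a dead state, represented by None.
    """
    # Initialization: final states versus non-final states.
    groups = [g for g in (set(finals), set(states) - set(finals)) if g]
    while True:
        def group_of(s):
            for i, g in enumerate(groups):
                if s in g:
                    return i
            return None  # dead state

        new_groups = []
        for g in groups:
            buckets = {}
            for s in g:
                # States stay together only if every symbol leads to the same group.
                signature = tuple(group_of(delta.get((s, a))) for a in alphabet)
                buckets.setdefault(signature, set()).add(s)
            new_groups.extend(buckets.values())
        if len(new_groups) == len(groups):
            return new_groups  # no group was divided: the grouping is stable
        groups = new_groups


# Assumed DFA for (a|b)*abb with states A to E; E is the only final state.
DELTA = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}
```

On this table the loop ends with the groups {A, C}, {B}, {D}, and {E}, so A and C are merged while E stays separate because it is a final state.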

25 DFA Minimization
Why does the DFA minimization work?
  If s and t are in the same group after the nth round of group division, then s and t are indistinguishable for all strings x of length n or less
Reasoning
  At the k-th iteration, assume that s and t are indistinguishable for all strings of length k or less
  After the division of group G at iteration k+1, s and t are still in the same group
  This means s and t are indistinguishable for all strings of length k+1 or less, since one more input symbol is tested in the division process

26 Implementing a Language Processor from a DFA
Follow the execution of the DFA
  loop
    case State is
      when state1 =>
        case Next_Character is
          when 'a' => State := state3;
          when 'b' => State := state1;
          ……
          when others => End_token_processing;
        end case;
      when state2 … ……
    end case;
  end loop;

27 DFA and Scanner -- Issues
RE → NFA → DFA → language processor
  Only determines whether the entire input string is accepted
  Not good enough for lexical analysis
Scanner
  A scanner is supposed to recognize many different tokens
    Each token can be defined as an RE
  A scanner does not process the entire input string at once
    It is not clear how to cut the input string into tokens
[Figure: input string s → DFA representing RE → Accept/No]

28 DFA and Scanner -- Issues
How to build a DFA for a scanner
  For each token, construct an RE: RE1, RE2, …
  RE = RE1 | RE2 | …
  Multiple final states, each for a specific token
    When converting the NFA to a DFA, the final states for REi should be carried along
  Allow acceptance of substrings when matching REi

29 Scanner using FA -- Multiple REs
Build an NFA for multiple REs
Detailed NFA-DFA conversion steps
Modified from the notes of Prof. Amaral, Univ. Alberta
[Figure: combined NFA with states 1 to 15 for the tokens IF, ID, NUM, and error; edges labeled i, f, a-z, 0-9, and any character]

30 Scanner using FA -- Ambiguity
Consider if11 → potential ambiguity
  Is it keyword if and number 11?
  Or is it just id if11?
  IF, ID (state 3 is for IF and state 8 is for ID)
  Consider longest match
Consider if → still has ambiguity
  Is it a keyword or an id? It satisfies both
  Consider first match
[Figure: DFA with final states for ID, NUM, IF, and error; edges labeled i, f, a-h, j-z, a-e, g-z, a-z, 0-9, and other]

31 Scanner using FA -- Ambiguity Resolution
Longest match
  Implies the need to look ahead
  When to terminate?
    When there is no further feasible transition
  What if the termination condition is met at a non-final state?
    Need to backtrack
    Not the case in most modern languages
First match
  The REs should be arranged properly
    E.g., keyword REs should appear before the id RE
In practice, for keywords
  There is no need to have a DFA with all keywords in it
    Reducing the number of states saves space
  Simply recognize all of them as identifiers and use a different DFA or a hash table to identify keywords
Note: There are cases other than keywords where first match is needed (e.g., >=)
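A small Python sketch of these rules follows, assuming a simplified token set (NUM, ID, and white space) and a keyword table consulted after an identifier is matched. The names TOKEN_SPEC, KEYWORDS, and tokenize are made up for this illustration, and Python's re module stands in for a generated DFA.

```python
import re

KEYWORDS = {"if", "else", "while"}  # keyword table checked after matching an ID
TOKEN_SPEC = [  # rule order decides ties (first match)
    ("NUM", r"[0-9]+"),
    ("ID", r"[a-z][a-z0-9]*"),
    ("SKIP", r"[ \t\n]+"),  # white space is matched but never returned
]


def tokenize(text):
    """Yield (kind, lexeme) pairs using longest match, then first match."""
    pos = 0
    while pos < len(text):
        best_kind, best_len = None, 0
        for kind, pattern in TOKEN_SPEC:
            m = re.compile(pattern).match(text, pos)
            # Strict > keeps the earlier rule when two rules match the same length.
            if m and len(m.group()) > best_len:
                best_kind, best_len = kind, len(m.group())
        if best_kind is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        lexeme = text[pos:pos + best_len]
        pos += best_len
        if best_kind == "SKIP":
            continue
        if best_kind == "ID" and lexeme in KEYWORDS:
            best_kind = lexeme.upper()  # keywords are recognized as ids first
        yield best_kind, lexeme
```

With this sketch, the input "if if11 11" comes out as the keyword if, the identifier if11, and the number 11, which matches the longest-match reading of if11 discussed on the previous slide.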

32 Scanner using FA -- Modified Example
T1 = abb*
T2 = ba
Input: abbbbbba
  Process abbbbbb; when the next a comes, what will happen?
    Go back to state 1, move to state 2, and find that the input cannot be accepted
    But this is an acceptable string with two tokens
What can be done
  Backtracking
  Lookahead
[Figure: combined DFA for T1 (states 1, 2, 3) and T2 (states 4, 5)]

33 Scanner using FA -- Another Example
T1 = abb*
T2 = ca
Input: abbbbcacaabb
  Process abbbb; when c comes, what to do?
It is a bit different from a regular FA
  When there is no transition for the next input symbol
    If at a final state, accept the partial string and return the corresponding token id
    Go back to the initial state
[Figure: combined DFA for T1 (states 1, 2, 3) and T2 (states 4, 5)]

34 Scanner using FA -- Modified Example
T1 = abb*
T2 = bca
Input: abbbbbbca
  Process abbbbbb; when the next c comes, what will happen?
    Go back to state 1 and find that the input cannot be accepted
    But this is an acceptable string with two tokens
What can be done
  Backtracking
  Lookahead
[Figure: combined DFA for T1 (states 1, 2, 3) and T2 (states 4, 5, 6)]

35 Scanner using FA -- Backtracking
How will the scanner know where to stop and accept the token?
Backtracking
  Mark the states that may require backtracking
  After accepting, mark the location in the input
  In case the next input is not acceptable, go back to the marked place
Input: abbbbbbca
  ab*b*b*b*b*b*c -- cannot go further (each * is a marked location)
  Accept ab*b*b*b*b*b and process 'c' from the starting state, but this fails
  Backtrack to the nearest *, accept ab*b*b*b*b, and try 'bca' from the starting state, which succeeds
Problem: this could be costly; sometimes the scanner may need to backtrack over many earlier tokens
[Figure: combined DFA for T1 (states 1, 2, 3, with the marked state *) and T2 (states 4, 5, 6)]
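The backtracking idea can be sketched as a recursive search over the marked positions. The function names and the dictionary encoding of the DFA are assumptions of this sketch; the state numbers follow the T1 = abb*, T2 = bca example, with 1 as the starting state.

```python
def accept_positions(dfa, start, finals, text, pos):
    """Run the DFA from pos and return [(end, token)] for every final state passed."""
    marks, state = [], start
    for i in range(pos, len(text)):
        state = dfa.get((state, text[i]))
        if state is None:
            break  # cannot go further
        if state in finals:
            marks.append((i + 1, finals[state]))  # mark this input location
    return marks


def scan(dfa, start, finals, text, pos=0):
    """Return a list of (token, lexeme) pairs, or None if text cannot be split."""
    if pos == len(text):
        return []
    # Try the longest match first, then back up to earlier marked locations.
    for end, token in reversed(accept_positions(dfa, start, finals, text, pos)):
        rest = scan(dfa, start, finals, text, end)
        if rest is not None:
            return [(token, text[pos:end])] + rest
    return None


# Combined DFA for T1 = abb* (final state 3) and T2 = bca (final state 6).
DFA = {(1, "a"): 2, (2, "b"): 3, (3, "b"): 3,
       (1, "b"): 4, (4, "c"): 5, (5, "a"): 6}
FINALS = {3: "T1", 6: "T2"}
```

For the input abbbbbbca, scan first tries the longest T1 match, fails on the remaining ca, backs up by one marked location, and returns T1 for abbbbb followed by T2 for bca. In the worst case this search revisits many earlier tokens, which is the cost noted above.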

36 Scanner using FA -- Lookahead
How will the scanner know where to stop and accept the token?
Lookahead
  Allow the user to specify a lookahead string
  Accept only after the lookahead string matches
T1 = abb* /bc
T2 = bca
  At state 3, when seeing b, always look ahead to determine whether to return the token before the current "b" or continue the b* loop
Input: abbbbbbca
  abb: lookahead, no bc, continue
  abbb: lookahead, no bc, continue
  abbbbb: lookahead, there is bc, accept abbbbb
[Figure: combined DFA for T1 (states 1, 2, 3, with lookahead /bc) and T2 (states 4, 5, 6)]

37 Scanner using FA -- Lookahead
How will the scanner know where to stop and accept the token?
Lookahead
Input: abbbcaabbca
  abb: lookahead, there is bc, accept abb
  Continue to accept bca
    No need to look ahead
  Continue to ab: lookahead, there is bc, accept ab
  Continue to accept bca again
  Tokens: abb, bca, ab, bca
[Figure: combined DFA for T1 (states 1, 2, 3, with lookahead /bc) and T2 (states 4, 5, 6)]
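The lookahead rule T1 = abb* /bc can be imitated with a lookahead assertion in Python's re module. This is a sketch under assumptions: the names T1, T2, and tokens are made up here, and Python's engine reaches the answer by backing off inside ab+ rather than by checking the lookahead at every b as a lex-generated scanner would, although the tokens come out the same for these inputs.

```python
import re

# T1 = abb* /bc : match ab+ only when "bc" follows (the lookahead is not consumed).
T1 = re.compile(r"ab+(?=bc)")
# T2 = bca needs no lookahead.
T2 = re.compile(r"bca")


def tokens(text):
    """Split text into T1 and T2 lexemes, trying T1 before T2 at each position."""
    pos, out = 0, []
    while pos < len(text):
        m = T1.match(text, pos) or T2.match(text, pos)
        if m is None:
            raise ValueError(f"no token matches at position {pos}")
        out.append(m.group())
        pos = m.end()
    return out
```

Calling tokens("abbbcaabbca") reproduces the token sequence abb, bca, ab, bca from this slide.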

38 Scanner using FA -- Lookahead
Another lookahead example
Input: abbba
  abb: lookahead, there is ba, accept abb
  Continue with ba and accept ba
Input: abbbab
  abb: lookahead, there is ba, accept abb
  How to handle the remaining b? Error!
  → Change the lookahead string to ba$ or baa
    Only accept when seeing ba$ or baa
    When seeing bab, do not accept
Input: abbbaba
  The old lookahead string would work and the new set would not
abbbababababa……ba or abbbababababa……bab
  It is not possible to know until the end of the input
  No lookahead string of any fixed length will do
[Figure: combined DFA with states 1 to 5 and lookahead /ba marked with a question mark]

39 Scanner using FA -- White Space
White space includes blank, tab, and newline
  They are not tokens, but how to process them still needs to be defined
Use a branch in the NFA for white space processing
  Simply skip the white space
  (" " | "\n" | "\t") {/* do nothing */}
[Figure: NFA branch from the start state with edges on blank, \t, and \n leading to a "do nothing" state]

40 Scanner using FA -- What’s different
How the DFA for a scanner differs from a regular DFA
  Needs to have a token marker for each final state
    Final states for different tokens are distinguishable
  Needs to have ambiguity resolution rules
DFA execution is different
  Does not exhaust the entire input string
  Accepts when the DFA cannot go further and breaks the input string there
  Goes back to the starting state after accepting
  Needs to backtrack or look ahead

41 Miscellaneous -- Language Design Issues
In PL/1, keywords can be used as ids
  if then then then = else; else else = then;
  Cannot resolve the ambiguity until parsing time
    Unless lookahead rules are imposed
In FORTRAN, blanks are ignored (not just skipped in the FA)
  do 10 i = 1,25 (is it keyword "do" or identifier "do10i"?)
  do 10 i = 1.25 (= do10i = 1.25)
  Similar problem; lookahead is necessary
Make life easier: be strict?
  All keywords start with a different character
  All ids start with the character z
  ……
  Too many rules for the programmer to remember
  If an FA is used, these specialized rules do not help
Sometimes the FA can be more complicated
  E.g., the length of an id has to be <= 6 characters

42 Miscellaneous -- Regular Language
A language that can be specified by a regular expression is called a regular language
Can REs specify all languages? No
  But REs are quite powerful already
Keeping counts
  Famous example that an RE cannot specify: matching ( and )
    Any prefix may contain the same number of ( as ) or more, but the entire string must contain the same number of ( and )
  L = { p^k q^k, for any k }
How about Σ = {0,1}, L = { s | s has an even number of 0’s and an even number of 1’s }?

43 Miscellaneous -- Regular Language
Σ = {0,1}
L = { s | s has an even number of 0’s and an even number of 1’s }
  (00|11)*((01|10)(00|11)*(01|10)(00|11)*)*
Specify using a grammar
  E = 1 E 1 E | 0 E 0 E | ε

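As a sanity check on the regular expression for strings with an even number of 0’s and an even number of 1’s, the sketch below compares it against direct counting on every binary string up to a chosen length. The names EVEN_EVEN, by_counting, and agree_up_to are hypothetical, and an exhaustive check over short strings is only evidence, not a proof.

```python
import re
from itertools import product

# The RE from the slide: pairs 00/11 keep both parities, pairs 01/10 flip both,
# so the mixed pairs must occur an even number of times.
EVEN_EVEN = re.compile(r"(00|11)*((01|10)(00|11)*(01|10)(00|11)*)*")


def by_counting(s):
    """Reference definition: both symbol counts are even."""
    return s.count("0") % 2 == 0 and s.count("1") % 2 == 0


def agree_up_to(max_len):
    """True if the RE and the counting definition agree on all strings up to max_len."""
    for n in range(max_len + 1):
        for chars in product("01", repeat=n):
            s = "".join(chars)
            if (EVEN_EVEN.fullmatch(s) is not None) != by_counting(s):
                return False
    return True
```

The check works because only the parity of the two counts matters, which needs a fixed number of states, unlike the p^k q^k language where the count itself must be remembered.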
44 Miscellaneous -- NFA or DFA for Scanner
Processing input based on an NFA
  May need to backtrack and end up exploring all paths in the NFA
Converting an NFA to a DFA
  Assume that the NFA has K states
  The number of states of the corresponding DFA is bounded by 2^K
    This can be very inefficient
    But in practice, the number of states in the DFA is not much higher than the number of states in the original NFA
Can be a tradeoff between space and time
  Use an NFA to save space, use a DFA to save time
  Actually, a DFA is a special case of an NFA

45 Lexical Analysis -- Summary
Read Chapter 3 of the textbook
  Except for 3.9
REs → NFA → DFA → language processor
DFA for scanner
  Final state handling
  Ambiguity resolution
  Backtracking and lookahead
Lexical analysis tool: lex

