1 Lecture 2 Lexical Analysis Joey Paquet, 2000, 2002, 2012

2 Part I: Building a Lexical Analyzer

3 Roles of the Scanner
Removal of comments
- Comments are not part of the program's meaning
- Multiple-line comments? Nested comments?
Case conversion
- Is the lexical definition case-sensitive?
- For identifiers
- For keywords

4 Roles of the Scanner
Removal of white space
- Blanks, tabs, carriage returns
- Is it possible to identify tokens in a program without spaces?
Interpretation of compiler directives
- #include, #ifdef, #ifndef and #define are directives that "redirect the input" of the compiler
- May be done by a pre-compiler

5 Roles of the Scanner
Communication with the symbol table
- A symbol table entry is created when an identifier is encountered
- The lexical analyzer cannot create complete entries on its own; later phases fill in the rest
Conversion of the input file to a token stream
- The input file is a character stream
- Lexical conventions define the literals, operators, keywords and punctuation

6 Tokens and Lexemes
Token: an element of the lexical definition of the language.
Lexeme: a sequence of characters identified as an instance of a token.

Token     Lexemes
id        distance, rate, time, a, x
relop     >=, <, ==
openpar   (
if        if
then      then
assignop  =
semi      ;

7 Design of a Lexical Analyzer
Steps:
1. Construct a set of regular expressions (REs) that define the form of any valid token
2. Derive an NDFA from the REs
3. Derive a DFA from the NDFA
4. Translate the DFA to a state transition table
5. Implement the table
6. Implement the algorithm that interprets the table

8 Regular Expressions
Each regular expression r denotes a set of strings L(r):
ε : { ε }  (the empty string)
a : { a }  (a single symbol of the alphabet)
r | s : L(r) ∪ L(s)  (alternation)
s* : { sⁿ | s ∈ L(s) and n >= 0 }  (zero or more repetitions)
s+ : { sⁿ | s ∈ L(s) and n >= 1 }  (one or more repetitions)
Example: id ::= letter(letter | digit)*
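As a quick illustration, here is how such token definitions look as patterns in code (a minimal sketch using Python's re module; the patterns and the test string are assumptions of this example, not part of the lecture):

    import re

    # Illustrative token patterns modeled on the slides:
    # id ::= letter(letter|digit)*, num written as one-or-more digits here.
    TOKEN_RES = [
        ("id",  re.compile(r"[a-zA-Z][a-zA-Z0-9]*")),
        ("num", re.compile(r"[0-9]+")),
    ]

    for name, pattern in TOKEN_RES:
        m = pattern.match("rate2 = 1")
        if m:
            print(name, "matches lexeme:", m.group())  # id matches lexeme: rate2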

9 Derive NDFA from REs
Could derive a DFA directly from the REs, but:
- It is much easier to build an NDFA first, then derive the DFA from it
- There is no standard way of deriving a DFA directly from REs
- Use Thompson's construction (cf. Louden)
[Figure: NDFA for letter(letter|digit)*, with letter and digit edges]
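To make the idea concrete, here is an ε-NFA for letter(letter|digit)* in code (a minimal sketch in the spirit of Thompson's construction; the state numbers and the dictionary representation are assumptions of this example):

    # epsilon-NFA for letter(letter|digit)*; "eps" marks epsilon edges.
    NFA = {
        (0, "letter"): {1},     # the initial letter
        (1, "eps"):    {2, 5},  # enter the repetition loop, or accept
        (2, "letter"): {3},
        (2, "digit"):  {4},
        (3, "eps"):    {2, 5},
        (4, "eps"):    {2, 5},
    }
    START, FINAL = 0, 5

    def eps_closure(states):
        """All states reachable through epsilon edges alone."""
        stack, seen = list(states), set(states)
        while stack:
            for t in NFA.get((stack.pop(), "eps"), set()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    def accepts(classes):       # e.g. ["letter", "digit"]
        current = eps_closure({START})
        for c in classes:
            moved = set()
            for s in current:
                moved |= NFA.get((s, c), set())
            current = eps_closure(moved)
        return FINAL in current

    print(accepts(["letter", "digit", "letter"]))  # True
    print(accepts(["digit"]))                      # False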

10 Derive DFA from NDFA
Use the subset construction (cf. Louden)
- The resulting DFA may then be optimized (state minimization)
- A DFA is easier to implement: no ε edges, and it is deterministic (no backtracking through the automaton)
[Figure: DFA for letter(letter|digit)*, with letter, digit and [other] edges]
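A sketch of the subset construction itself, applied to the ε-NFA from the previous slide (the state numbers, class names and output format are assumptions of this example; each DFA state is a frozenset of NDFA states):

    # Each DFA state is the epsilon-closure of a set of NFA states.
    NFA = {
        (0, "letter"): {1},
        (1, "eps"):    {2, 5},
        (2, "letter"): {3},
        (2, "digit"):  {4},
        (3, "eps"):    {2, 5},
        (4, "eps"):    {2, 5},
    }
    START, CLASSES = 0, ("letter", "digit")

    def eps_closure(states):
        stack, seen = list(states), set(states)
        while stack:
            for t in NFA.get((stack.pop(), "eps"), set()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return frozenset(seen)

    def subset_construction():
        start = eps_closure({START})
        dfa, todo = {}, [start]
        while todo:
            state = todo.pop()
            for c in CLASSES:
                moved = set()
                for s in state:
                    moved |= NFA.get((s, c), set())
                target = eps_closure(moved)
                dfa[state, c] = target
                if target and (target, CLASSES[0]) not in dfa and target not in todo:
                    todo.append(target)
        return dfa, start

    dfa, start = subset_construction()
    for (state, c), target in sorted(dfa.items(), key=str):
        print(sorted(state), "--" + c + "-->", sorted(target))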

11 Generate State Transition Table
For id ::= letter(letter|digit)*:

state  letter  digit  [other]  final
1      2       -      -        N
2      2       2      3        N
3      -       -      -        Y
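The same table in code (a minimal sketch; the nested-dict layout and the char_class helper are assumptions of this example):

    # State transition table for id ::= letter(letter|digit)*.
    # State 3 is final; it is reached on the first character that cannot
    # extend the identifier, which the scanner must then back up over.
    TABLE = {
        1: {"letter": 2},
        2: {"letter": 2, "digit": 2, "other": 3},
    }
    FINAL = {3}

    def char_class(c):
        if c.isalpha(): return "letter"
        if c.isdigit(): return "digit"
        return "other"

    state = 1
    for c in "x1 ":
        state = TABLE[state][char_class(c)]
    print(state in FINAL)  # True: "x1" recognized when " " is read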

12 Implementation Concerns
Backtracking
- Principle: a token is normally recognized only when the next character is read.
- Problem: that next character may be part of the next token.
- Example: in "x<1", "<" is recognized only when "1" is read. In such cases, we have to back up one character to continue token recognition.
- The occurrence of these cases can be encoded in the state transition table.
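One common way to support this (a sketch; the class and method names are assumptions, chosen to mirror the nextChar/backupChar operations used later in the lecture):

    class BufferedInput:
        """Character stream with one-character backup, enough for the
        scanner to un-read the lookahead that ended a token."""
        def __init__(self, text):
            self.text, self.pos = text, 0
        def next_char(self):
            c = self.text[self.pos] if self.pos < len(self.text) else ""
            self.pos += 1
            return c
        def backup_char(self):
            self.pos -= 1

    src = BufferedInput("x<1")
    print(src.next_char(), src.next_char(), src.next_char())  # x < 1
    src.backup_char()       # "1" belongs to the next token: un-read it
    print(src.next_char())  # 1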

13 Implementation Concerns
Ambiguity
- Problem: some tokens' lexemes are prefixes or subsets of other tokens' lexemes.
- Example: "n-1". Is it <n><-><1> or <n><-1>?
- Solutions:
  - Postpone the decision to the syntactic analyzer
  - Do not allow a sign prefix on numbers in the lexical specification
  - Interact with the syntactic analyzer to find a solution (induces coupling)

14 Example
Alphabet: {:, *, =, (, ), <, >, {, }, [a..z], [0..9]}
Simple tokens: {(, ), {, }, :, <, >}
Composite tokens: {:=, >=, <=, <>, (*, *)}
Words:
  id ::= letter(letter | digit)*
  num ::= digit+

15 Example
Backtracking: must back up a character when we read a character that is part of the next token. Occurrences of these cases are coded in the table.
Ambiguity problems:

Character  Possible tokens
:          :, :=
>          >, >=
<          <, <=, <>
(          (, (*
*          *, *)

16 [Figure: the complete DFA for the example language. States 1-20, with edges labeled l (letter), d (digit), sp (space), {, }, (, ), *, :, =, <, >. The legend distinguishes plain final states from final states that require backing up one character.]

17 Table-driven Scanner (Table)
[State transition table derived from the DFA: one row per state (1-20), one column per character class ({, }, (, *, ), :, =, <, >, letter, digit, sp), plus a final/backup column. From state 1: letter leads to 2, digit to 4, { to 6, ( to 8, : to 12, < to 14, > to 17. Final states: 3 yes [id], 5 yes [num], 7 no [{...}], 11 no [(*...*)], 13 no [:=], 15 no [<=], 16 no [<>], 18 no [>=], plus final states for [error] and the remaining single-character tokens [various].]
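An excerpt of how such a table can be encoded (a sketch showing only a few rows; the dictionary layout and the state chosen for ":" alone are assumptions of this example):

    # Excerpt of the state transition table as nested dicts.
    TABLE = {
        1:  {"letter": 2, "digit": 4, ":": 12},
        2:  {"letter": 2, "digit": 2, "other": 3},  # scanning an id
        4:  {"digit": 4, "other": 5},               # scanning a num
        12: {"=": 13, "other": 19},                 # ":" seen: ":=" or ":" alone
    }
    # Final states: token type, and whether one character must be backed up.
    FINAL = {3: ("id", True), 5: ("num", True), 13: (":=", False), 19: (":", True)}

    state = TABLE[TABLE[1]["letter"]]["other"]
    print(FINAL[state])  # ('id', True): id recognized, back up one character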

18 Table-driven Scanner (Algorithm)
nextToken()
  state = 1
  token = null
  do
    lookup = nextChar()
    state = Table(state, lookup)
    if (isFinalState(state))
      token = createToken()
      if (Table(state, "backup") == yes)
        backupChar()
  until (token != null)
  return (token)
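A runnable version of the same loop (a minimal sketch: the two-token table, the character classifier and the returned tuple are assumptions used to keep the example self-contained):

    # Table-driven scanner for just id and num, driven by the same loop
    # as the pseudocode above.
    TABLE = {
        1: {"letter": 2, "digit": 4},
        2: {"letter": 2, "digit": 2, "other": 3},
        4: {"digit": 4, "other": 5},
    }
    FINAL = {3: ("id", True), 5: ("num", True)}  # (token type, backup?)

    def char_class(c):
        if c.isalpha(): return "letter"
        if c.isdigit(): return "digit"
        return "other"

    def next_token(text, pos):
        """Return (token type, lexeme, new position)."""
        state, start = 1, pos
        while state not in FINAL:
            c = text[pos] if pos < len(text) else " "  # pad end of input
            state = TABLE[state][char_class(c)]
            pos += 1
        kind, backup = FINAL[state]
        if backup:
            pos -= 1                                   # un-read the lookahead
        return kind, text[start:pos], pos

    print(next_token("rate2 ", 0))  # ('id', 'rate2', 5)
    print(next_token("42+", 0))     # ('num', '42', 2)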

19 Table-driven Scanner
nextToken(): extracts the next token in the program (called by the syntactic analyzer)
nextChar(): reads the next character in the input program
backupChar(): backs up one character in the input file

20 Table-driven Scanner
isFinalState(state): returns TRUE if state is a final state
table(state, column): returns the value corresponding to [state, column] in the state transition table
createToken(): creates and returns a structure that contains the token type, its location in the source code, and its value (for literals)
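The structure createToken() returns might look like this (a sketch; the field names are assumptions based on the description above):

    from dataclasses import dataclass

    @dataclass
    class Token:
        """Token type, source location, and (for literals) the value."""
        type: str
        line: int
        column: int
        value: object = None

    print(Token("num", line=3, column=7, value=42))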

21 Hand-written Scanner
nextToken()
  c = nextChar()
  case (c) of
    "[a..z],[A..Z]":
      while (c in {[a..z],[A..Z],[0..9]}) do
        c = nextChar()
      s = makeUpString()
      if ( isReservedWord(s) ) then
        token = createToken(RESWORD, null)
      else
        token = createToken(ID, s)
      backupChar()
    "[0..9]":
      while (c in [0..9]) do
        c = nextChar()
      v = makeUpValue()
      token = createToken(NUM, v)
      backupChar()

22 Hand-written Scanner
    "{":
      c = nextChar()
      while ( c != "}" ) do
        c = nextChar()
    "(":
      c = nextChar()
      if ( c == "*" ) then
        repeat
          while ( c != "*" ) do
            c = nextChar()
          c = nextChar()
        until ( c == ")" )
      else
        token = createToken(LPAR, null)
        backupChar()
    ":":
      c = nextChar()
      if ( c == "=" ) then
        token = createToken(ASSIGNOP, null)
      else
        token = createToken(COLON, null)
        backupChar()

23 Hand-written Scanner
    "<":
      c = nextChar()
      if ( c == "=" ) then
        token = createToken(LEQ, null)
      else if ( c == ">" ) then
        token = createToken(NEQ, null)
      else
        token = createToken(LT, null)
        backupChar()
    ">":
      c = nextChar()
      if ( c == "=" ) then
        token = createToken(GEQ, null)
      else
        token = createToken(GT, null)
        backupChar()
    ")": token = createToken(RPAR, null)
    "*": token = createToken(STAR, null)
    "=": token = createToken(EQ, null)
  end case
  return token
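A runnable excerpt of the same style (a sketch covering only the "<", ">" and ":" cases; the token names follow the pseudocode, everything else is an assumption of this example):

    def next_token(text, pos):
        """Hand-written recognition of <, <=, <>, >, >=, : and :=."""
        c, pos = text[pos], pos + 1
        nxt = text[pos] if pos < len(text) else ""
        if c == "<":
            if nxt == "=": return "LEQ", pos + 1
            if nxt == ">": return "NEQ", pos + 1
            return "LT", pos        # lookahead not consumed: the "backup"
        if c == ">":
            return ("GEQ", pos + 1) if nxt == "=" else ("GT", pos)
        if c == ":":
            return ("ASSIGNOP", pos + 1) if nxt == "=" else ("COLON", pos)
        raise ValueError("unexpected character: " + c)

    print(next_token("<>", 0))  # ('NEQ', 2)
    print(next_token("<1", 0))  # ('LT', 1)
    print(next_token(":=", 0))  # ('ASSIGNOP', 2)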

24 Part II: Error Recovery in Lexical Analysis

25 Possible Lexical Errors
Depends on the accepted conventions:
- Invalid character
- Letter not allowed to terminate a number
- Numerical overflow
- Identifier too long
- End of line reached before the end of a string
Are these lexical errors?

26 Accepted or Not?
- 123a : <Error>, or <num><id>?
- Numerical overflow : <Error> related to the machine's limitations
- "Hello world <CR> : either the <CR> is skipped, or <Error>
- ThisIsAVeryLongVariableName = 1 : limit identifier length?

27 Error Recovery Techniques
Finding only the first error is not acceptable.
Panic mode: skip characters until a valid character is read (see the sketch below).
Guess mode: do pattern matching between erroneous strings and valid strings
- Example: beggin vs. begin
- Rarely implemented
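Panic mode is simple enough to sketch (the valid-start predicate and the error-token shape here are assumptions of this example):

    def panic_mode_skip(text, pos):
        """Skip characters until one that can start a token is found,
        reporting the skipped span as a single error."""
        def can_start_token(c):  # assumed predicate for the sketch
            return c.isalnum() or c in "(){}:<>=*"
        start = pos
        while pos < len(text) and not can_start_token(text[pos]):
            pos += 1
        return ("error", text[start:pos]), pos

    print(panic_mode_skip("@#!rate", 0))  # (('error', '@#!'), 3)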

28 Conclusions

29 Possible Implementations
Lexical analyzer generator (e.g. Lex)
- Safe and quick
- But: the tool must be learned, and it may be unable to handle unusual situations
Table-driven lexical analyzer
- General, adaptable method; the same driver function can be used for all table-driven lexical analyzers
- But: building the transition table can be tedious and error-prone

30 Possible Implementations
Hand-written scanner
- Can be optimized, can handle any unusual situation, and is easy to build for most languages
- But: error-prone, and less adaptable or maintainable

31 Lexical Analyzer's Modularity
Why should the lexical analyzer and the syntactic analyzer be separated?
- Modularity/maintainability: the system is more modular, thus more maintainable
- Efficiency: modularity means task specialization, which makes optimization easier
- Reusability: the whole lexical analyzer can be replaced without changing other parts

