Lecture 2: Lexical Analysis
Joey Paquet, 2000, 2002
Part I: Building a Lexical Analyzer
Roles of the Scanner
– Removal of comments
  – Comments are not part of the program's meaning
  – Multiple-line comments? Nested comments?
– Case conversion
  – Is the lexical definition case sensitive?
    – For identifiers
    – For keywords
Roles of the Scanner
– Removal of white space
  – Blanks, tabs, carriage returns and line feeds
  – Is it possible to identify tokens in a program without spaces?
– Interpretation of compiler directives
  – #include, #ifdef, #ifndef and #define are directives that "redirect the input" of the compiler
  – May be done by a precompiler
Roles of the Scanner
– Communication with the symbol table
  – A symbol table entry is created when an identifier is encountered
  – The lexical analyzer cannot create the whole entry by itself
– Preparation of the output listing
  – Output the analyzed code
  – Output error messages and warnings
  – Output a table of summary data
Tokens and Lexemes
– Token: an element of the lexical definition of the language.
– Lexeme: a sequence of characters identified as an instance of a token.

  Token     Lexemes
  id        distance, rate, time, a, x
  relop     >=, <, ==
  openpar   (
  if        if
  then      then
  assignop  =
  semi      ;
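To make the distinction concrete, here is a minimal sketch of a token record in Python; the field names (type, lexeme, line) are illustrative choices, not from the slides:

    from dataclasses import dataclass

    @dataclass
    class Token:
        type: str    # the token class, e.g. "id", "relop", "num"
        lexeme: str  # the matched character sequence, e.g. "distance", ">="
        line: int    # where it was found, for error reporting

    # One token type covers many lexemes:
    t1 = Token("id", "distance", 1)
    t2 = Token("id", "rate", 1)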
Design of a Lexical Analyzer
Steps:
1. Construct a set of regular expressions (REs) that define the form of all valid tokens
2. Derive an NDFA from the REs
3. Derive a DFA from the NDFA
4. Translate the DFA to a state transition table
5. Implement the table
6. Implement the algorithm to interpret the table
Regular Expressions
– id ::= letter(letter|digit)*
– ∅ : { } (the empty language)
– ε : { ε } (the language containing only the empty string)
– a : { a }
– r | s : { r | r ∈ r̂ } ∪ { s | s ∈ ŝ }
– s* : { sⁿ | s ∈ ŝ and n ≥ 0 }
– s⁺ : { sⁿ | s ∈ ŝ and n ≥ 1 }
(ŝ denotes the set of strings generated by the expression s)
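As a quick illustration of the id rule, here is a sketch using Python's re module; the pattern transcribes letter(letter|digit)* with lowercase letters, as in the example alphabet used later, and the test strings are made up:

    import re

    # letter(letter|digit)*: starts with a letter, then letters or digits
    ID = re.compile(r"[a-z][a-z0-9]*")

    print(ID.fullmatch("distance") is not None)  # True
    print(ID.fullmatch("x1") is not None)        # True
    print(ID.fullmatch("1x") is not None)        # False: must start with a letter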
Derive NDFA from REs
– We could derive a DFA directly from the REs, but:
  – It is much easier to derive an NDFA, then derive the DFA from it
  – There is no standard way of deriving DFAs from REs
  – Use Thompson's construction (as in Louden)

[NFA diagram for letter(letter|digit)*]
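To make the result concrete, here is a hand-built sketch of an NFA for letter(letter|digit)* represented as plain data; the state numbering and the use of None for ε-transitions are my own encoding, and the automaton is simplified rather than the literal output of Thompson's construction:

    # (state, symbol) -> set of next states; None marks an epsilon transition
    nfa = {
        (0, "letter"): {1},
        (1, None):     {2, 5},   # after the first letter: loop or accept
        (2, "letter"): {3},
        (2, "digit"):  {4},
        (3, None):     {2, 5},
        (4, None):     {2, 5},
    }
    nfa_start, nfa_finals = 0, {5}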
Derive DFA from NDFA
– Use subset construction (as in Louden)
– May be optimized
– Easier to implement:
  – No ε-edges
  – Deterministic (no backtracking)

[DFA diagram for letter(letter|digit)*, with letter/digit loops and an [other] transition to the final state]
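Here is a minimal sketch of the subset construction, applied to the nfa dictionary above; epsilon_closure and the worklist loop are the standard ingredients, but the signatures and names are illustrative:

    def epsilon_closure(states, nfa):
        # All NFA states reachable from `states` through epsilon (None) moves
        stack, closure = list(states), set(states)
        while stack:
            s = stack.pop()
            for t in nfa.get((s, None), ()):
                if t not in closure:
                    closure.add(t)
                    stack.append(t)
        return frozenset(closure)

    def subset_construction(nfa, start, finals, alphabet):
        start_set = epsilon_closure({start}, nfa)
        seen, worklist, dfa = {start_set}, [start_set], {}
        while worklist:
            S = worklist.pop()
            for a in alphabet:
                move = set()
                for s in S:
                    move |= nfa.get((s, a), set())
                T = epsilon_closure(move, nfa)
                if not T:
                    continue              # no move on this symbol
                dfa[(S, a)] = T
                if T not in seen:
                    seen.add(T)
                    worklist.append(T)
        dfa_finals = {S for S in seen if S & finals}
        return dfa, start_set, dfa_finals

    # Each DFA state is a frozen set of NFA states:
    dfa, d0, dF = subset_construction(nfa, nfa_start, nfa_finals, {"letter", "digit"})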
Generate State Transition Table

[DFA diagram: states 0, 1, 2; letter from 0 to 1; letter/digit looping on 1; [other] from 1 to 2]

  State  letter  digit  other  Final
  0      1       –      –      N
  1      1       1      2      N
  2      –       –      –      Y
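This table translates directly into data. A minimal sketch, assuming a dictionary encoding and a classify helper of my own choosing (missing entries represent error transitions):

    table = {
        (0, "letter"): 1,
        (1, "letter"): 1,
        (1, "digit"):  1,
        (1, "other"):  2,
    }
    final = {2}   # state 2 is reached one character "too late"

    def classify(c):
        # Map a raw input character onto a table column
        if c.isalpha():
            return "letter"
        if c.isdigit():
            return "digit"
        return "other"

Since state 2 is only reached after reading the character that follows the identifier, it is exactly the kind of final state that requires backing up, which the next slides address.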
Implementation Concerns
– Backtracking
  – Principle: a token is normally recognized only when the next character is read.
  – Problem: maybe this character is part of the next token.
  – Example: x<1. "<" is recognized only when "1" is read. In this case, we have to back up one character so that token recognition can resume in the right place.
  – The occurrence of these cases can be encoded in the state transition table.
Implementation Concerns
– Ambiguity
  – Problem: some tokens' lexemes are subsets of other tokens' lexemes.
  – Example: n-1. Is "-1" a single token (a signed number) or two tokens ("-" and "1")?
  – Solutions:
    – Postpone the decision to the syntactic analyzer
    – Do not allow a sign prefix on numbers in the lexical specification
    – Interact with the syntactic analyzer to find a solution (induces coupling)
Example
– Alphabet:
  – {:, *, =, (, ), <, >, {, }, [a..z], [0..9]}
– Simple tokens:
  – {(, ), {, }, :, <, >}
– Composite tokens:
  – {:=, >=, <=, <>, (*, *)}
– Words:
  – id ::= letter(letter | digit)*
  – num ::= digit*
Example
– Ambiguity problems:

  Character   Possible tokens
  :           :, :=
  <           <, <=, <>
  >           >, >=
  (           (, (*
  *           *, *)

– Backtracking:
  – Must back up a character when we read a character that is part of the next token.
  – Occurrences of backtracking are coded in the table.
[DFA diagram: start state 1, with transitions on l (letter), d (digit), {, }, (, *, ), :, =, <, > and sp (space) through states 2–20; double circles mark final states, distinguishing final states with and without backtracking]
Table-driven Scanner (Table)

  State  l   d   {   }   (   *   )   :   =   <   >   sp  Backup  Token
  1      2   4   6   20  8   20  20  12  19  14  17  1
  2      2   2   3   3   3   3   3   3   3   3   3   3
  3      1   1   1   1   1   1   1   1   1   1   1   1   yes    [id]
  4      5   4   5   5   5   5   5   5   5   5   5   5
  5      1   1   1   1   1   1   1   1   1   1   1   1   yes    [num]
  6      6   6   6   7   6   6   6   6   6   6   6   6
  7      1   1   1   1   1   1   1   1   1   1   1   1   no     [{…}]
  8      20  20  20  20  20  9   20  20  20  20  20  20
  9      9   9   9   9   9   10  9   9   9   9   9   9
  10     9   9   9   9   9   10  11  9   9   9   9   9
  11     1   1   1   1   1   1   1   1   1   1   1   1   no     [(*…*)]
  12     20  20  20  20  20  20  20  20  13  20  20  20
  13     1   1   1   1   1   1   1   1   1   1   1   1   no     [:=]
  14     20  20  20  20  20  20  20  20  15  20  16  20
  15     1   1   1   1   1   1   1   1   1   1   1   1   no     [<=]
  16     1   1   1   1   1   1   1   1   1   1   1   1   no     [<>]
  17     20  20  20  20  20  20  20  20  18  20  20  20
  18     1   1   1   1   1   1   1   1   1   1   1   1   no     [>=]
  19     1   1   1   1   1   1   1   1   1   1   1   1   no     [=]
  20     1   1   1   1   1   1   1   1   1   1   1   1   yes    [various]
Table-driven Scanner (Algorithm)

nextToken()
    state = 1
    token = null
    do
        lookup = nextChar()
        state = table(state, lookup)
        if ( isFinalState(state) )
            token = createToken()
            if ( table(state, "backup") == yes )
                backupChar()
    until ( token != null )
    return ( token )
Table-driven Scanner
– nextToken()
  – Extracts the next token in the program (called by the syntactic analyzer)
– nextChar()
  – Reads the next character (spaces excepted) in the input program
– backupChar()
  – Backs up one character in the input file
Table-driven Scanner
– isFinalState(state)
  – Returns TRUE if state is a final state
– table(state, column)
  – Returns the value corresponding to [state, column] in the state transition table
– createToken()
  – Creates and returns a structure that contains the token type, its location in the source code, and its value (for literals)
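Putting the algorithm and these helper routines together, here is a minimal runnable sketch in Python. It drives the small three-state id table from earlier rather than the full 20-state table, and the Scanner class, its buffering, and the tuple token format are illustrative assumptions rather than the slides' design:

    class Scanner:
        # Transition table for the id DFA: 0 = start, 1 = in id, 2 = final (backup)
        TABLE = {(0, "letter"): 1, (1, "letter"): 1, (1, "digit"): 1, (1, "other"): 2}
        FINAL = {2: True}   # final state -> does it require backing up?

        def __init__(self, text):
            self.text, self.pos = text, 0

        def nextChar(self):
            # "\0" acts as an end-of-input sentinel
            c = self.text[self.pos] if self.pos < len(self.text) else "\0"
            self.pos += 1
            return c

        def backupChar(self):
            self.pos -= 1

        def nextToken(self):
            while self.pos < len(self.text) and self.text[self.pos].isspace():
                self.pos += 1                      # skip white space
            if self.pos >= len(self.text):
                return None                        # end of input
            start, state = self.pos, 0
            while True:
                c = self.nextChar()
                col = ("letter" if c.isalpha()
                       else "digit" if c.isdigit() else "other")
                state = self.TABLE.get((state, col))
                if state is None:
                    raise ValueError(f"lexical error at position {self.pos - 1}")
                if state in self.FINAL:
                    if self.FINAL[state]:          # last char belongs to the next token
                        self.backupChar()
                    return ("id", self.text[start:self.pos])

    s = Scanner("distance rate x1")
    while (t := s.nextToken()) is not None:
        print(t)   # ('id', 'distance'), ('id', 'rate'), ('id', 'x1')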
Hand-written Scanner

nextToken()
    c = nextChar()
    case (c) of
        "[a..z],[A..Z]":
            c = nextChar()
            while ( c in {[a..z],[A..Z],[0..9]} ) do
                s = makeUpString()
                c = nextChar()
            if ( isReservedWord(s) ) then
                token = createToken(RESWORD, null)
            else
                token = createToken(ID, s)
            backupChar()
        "[0..9]":
            c = nextChar()
            while ( c in [0..9] ) do
                v = makeUpValue()
                c = nextChar()
            token = createToken(NUM, v)
            backupChar()
Hand-written Scanner (continued)

        "{":
            c = nextChar()
            while ( c != "}" ) do
                c = nextChar()
            token = createToken(LBRACK, null)
        "(":
            c = nextChar()
            if ( c == "*" ) then
                c = nextChar()
                repeat
                    while ( c != "*" ) do
                        c = nextChar()
                    c = nextChar()
                until ( c == ")" )
                return
            else
                token = createToken(LPAR, null)
                backupChar()
        ":":
            c = nextChar()
            if ( c == "=" ) then
                token = createToken(ASSIGNOP, null)
            else
                token = createToken(COLON, null)
                backupChar()
Hand-written Scanner (continued)

        "<":
            c = nextChar()
            if ( c == "=" ) then
                token = createToken(LEQ, null)
            else if ( c == ">" ) then
                token = createToken(NEQ, null)
            else
                token = createToken(LT, null)
                backupChar()
        ">":
            c = nextChar()
            if ( c == "=" ) then
                token = createToken(GEQ, null)
            else
                token = createToken(GT, null)
                backupChar()
        ")": token = createToken(RPAR, null)
        "}": token = createToken(RBRACK, null)
        "*": token = createToken(STAR, null)
        "=": token = createToken(EQ, null)
    end case
    return token
Part II: Error Recovery in Lexical Analysis
Possible Lexical Errors
– Depends on the accepted conventions:
  – Invalid character
  – Letter not allowed to terminate a number
  – Numerical overflow
  – Identifier too long
  – End of line before end of string
– Are these lexical errors?
Accepted or Not?
– 123a
  – An invalid number, or the number 123 followed by the identifier a?
– 123456789012345678901234567
  – Related to the machine's limitations
– "Hello world
  – Either the end of line is skipped or an error is reported
– ThisIsAVeryLongVariableName = 1
  – Limit identifier length?
Error Recovery Techniques
– Finding only the first error is not acceptable
– Panic mode:
  – Skip characters until a valid character is read
– Guess mode:
  – Do pattern matching between erroneous strings and valid strings
  – Example: beggin vs. begin
  – Rarely implemented
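A minimal sketch of panic mode, assuming a VALID set of characters on which the scanner can restart; the function name and the error-message format are made up:

    # Characters the scanner knows how to start or continue a token with
    VALID = set("abcdefghijklmnopqrstuvwxyz0123456789:*=(){}<> \t\n")

    def panic_mode_skip(text, pos, errors):
        # Skip and report invalid characters until scanning can resume
        start = pos
        while pos < len(text) and text[pos] not in VALID:
            pos += 1
        errors.append(f"invalid characters {text[start:pos]!r} skipped at {start}")
        return pos

Because the scanner resumes at the returned position instead of stopping, later errors in the same program are still detected and reported.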
Conclusions
Possible Implementations
– Lexical analyzer generator (e.g. Lex)
  + Safe, quick
  – Must learn the software; unable to handle unusual situations
– Table-driven lexical analyzer
  + General and adaptable method; the same driver function can be used for all table-driven lexical analyzers
  – Building the transition table can be tedious and error-prone
Possible Implementations
– Hand-written lexical analyzer
  + Can be optimized; can handle any unusual situation; easy to build for most languages
  – Error-prone; not adaptable or maintainable
Lexical Analyzer's Modularity
– Why should the lexical analyzer and the syntactic analyzer be separated?
  – Modularity/maintainability: the system is more modular, thus more maintainable
  – Efficiency: modularity means task specialization, which makes optimization easier
  – Reusability: the whole lexical analyzer can be changed without affecting other parts