Lecture 2: Lexical Analysis
Joey Paquet, 2000, 2002
Part I: Building a Lexical Analyzer
Roles of the Scanner
– Removal of comments
  – Comments are not part of the program's meaning
  – Multiple-line comments? Nested comments?
– Case conversion
  – Is the lexical definition case sensitive?
    – For identifiers
    – For keywords
Roles of the Scanner
– Removal of white space
  – Blanks, tabs, carriage returns and line feeds
  – Is it possible to identify tokens in a program without spaces?
– Interpretation of compiler directives
  – #include, #ifdef, #ifndef and #define are directives that "redirect the input" of the compiler
  – May be done by a precompiler
Roles of the Scanner
– Communication with the symbol table
  – A symbol table entry is created when an identifier is encountered
  – The lexical analyzer cannot create the whole entry by itself
– Preparation of the output listing
  – Output the analyzed code
  – Output error messages and warnings
  – Output a table of summary data
Tokens and Lexemes
– Token: an element of the lexical definition of the language.
– Lexeme: a sequence of characters identified as an instance of a token.

  Token     Lexemes
  id        distance, rate, time, a, x
  relop     >=, <, ==
  openpar   (
  if        if
  then      then
  assignop  =
  semi      ;
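To make the distinction concrete, here is a minimal sketch of a token record in Python; the field names (type, lexeme, line) are illustrative choices, not from the slides:

    from dataclasses import dataclass

    @dataclass
    class Token:
        type: str    # the token class, e.g. "id", "relop", "num"
        lexeme: str  # the matched character sequence, e.g. "distance", ">="
        line: int    # where it was found, for error reporting

    # One token type covers many lexemes:
    t1 = Token("id", "distance", 1)
    t2 = Token("id", "rate", 1)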
Design of a Lexical Analyzer
Steps:
1. Construct a set of regular expressions (REs) that define the form of all valid tokens
2. Derive an NDFA from the REs
3. Derive a DFA from the NDFA
4. Translate the DFA to a state transition table
5. Implement the table
6. Implement the algorithm to interpret the table
Regular Expressions
– id ::= letter(letter|digit)*
– ∅ : { } (the empty language)
– ε : { ε } (the language containing only the empty string)
– a : { a }
– r | s : { r | r ∈ r̂ } ∪ { s | s ∈ ŝ }
– s* : { sⁿ | s ∈ ŝ and n ≥ 0 }
– s⁺ : { sⁿ | s ∈ ŝ and n ≥ 1 }
(ŝ denotes the set of strings generated by the expression s)
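As a quick illustration of the id rule, here is a sketch using Python's re module; the pattern transcribes letter(letter|digit)* with lowercase letters, as in the example alphabet used later, and the test strings are made up:

    import re

    # letter(letter|digit)*: starts with a letter, then letters or digits
    ID = re.compile(r"[a-z][a-z0-9]*")

    print(ID.fullmatch("distance") is not None)  # True
    print(ID.fullmatch("x1") is not None)        # True
    print(ID.fullmatch("1x") is not None)        # False: must start with a letter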
Derive NDFA from REs
– We could derive a DFA directly from the REs, but:
  – It is much easier to derive an NDFA, then derive the DFA from it
  – There is no standard way of deriving DFAs from REs
  – Use Thompson's construction (as in Louden)

[NFA diagram for letter(letter|digit)*]
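To make the result concrete, here is a hand-built sketch of an NFA for letter(letter|digit)* represented as plain data; the state numbering and the use of None for ε-transitions are my own encoding, and the automaton is simplified rather than the literal output of Thompson's construction:

    # (state, symbol) -> set of next states; None marks an epsilon transition
    nfa = {
        (0, "letter"): {1},
        (1, None):     {2, 5},   # after the first letter: loop or accept
        (2, "letter"): {3},
        (2, "digit"):  {4},
        (3, None):     {2, 5},
        (4, None):     {2, 5},
    }
    nfa_start, nfa_finals = 0, {5}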
Derive DFA from NDFA
– Use subset construction (as in Louden)
– May be optimized
– Easier to implement:
  – No ε-edges
  – Deterministic (no backtracking)

[DFA diagram for letter(letter|digit)*, with letter/digit loops and an [other] transition to the final state]
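Here is a minimal sketch of the subset construction, applied to the nfa dictionary above; epsilon_closure and the worklist loop are the standard ingredients, but the signatures and names are illustrative:

    def epsilon_closure(states, nfa):
        # All NFA states reachable from `states` through epsilon (None) moves
        stack, closure = list(states), set(states)
        while stack:
            s = stack.pop()
            for t in nfa.get((s, None), ()):
                if t not in closure:
                    closure.add(t)
                    stack.append(t)
        return frozenset(closure)

    def subset_construction(nfa, start, finals, alphabet):
        start_set = epsilon_closure({start}, nfa)
        seen, worklist, dfa = {start_set}, [start_set], {}
        while worklist:
            S = worklist.pop()
            for a in alphabet:
                move = set()
                for s in S:
                    move |= nfa.get((s, a), set())
                T = epsilon_closure(move, nfa)
                if not T:
                    continue              # no move on this symbol
                dfa[(S, a)] = T
                if T not in seen:
                    seen.add(T)
                    worklist.append(T)
        dfa_finals = {S for S in seen if S & finals}
        return dfa, start_set, dfa_finals

    # Each DFA state is a frozen set of NFA states:
    dfa, d0, dF = subset_construction(nfa, nfa_start, nfa_finals, {"letter", "digit"})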
Generate State Transition Table

[DFA diagram: states 0, 1, 2; letter from 0 to 1; letter/digit looping on 1; [other] from 1 to 2]

  State  letter  digit  other  Final
  0      1       –      –      N
  1      1       1      2      N
  2      –       –      –      Y
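This table translates directly into data. A minimal sketch, assuming a dictionary encoding and a classify helper of my own choosing (missing entries represent error transitions):

    table = {
        (0, "letter"): 1,
        (1, "letter"): 1,
        (1, "digit"):  1,
        (1, "other"):  2,
    }
    final = {2}   # state 2 is reached one character "too late"

    def classify(c):
        # Map a raw input character onto a table column
        if c.isalpha():
            return "letter"
        if c.isdigit():
            return "digit"
        return "other"

Since state 2 is only reached after reading the character that follows the identifier, it is exactly the kind of final state that requires backing up, which the next slides address.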
Implementation Concerns
– Backtracking
  – Principle: a token is normally recognized only when the next character is read.
  – Problem: maybe this character is part of the next token.
  – Example: x<1. "<" is recognized only when "1" is read. In this case, we have to back up one character so that token recognition can resume in the right place.
  – The occurrence of these cases can be encoded in the state transition table.
Implementation Concerns
– Ambiguity
  – Problem: some tokens' lexemes are subsets of other tokens' lexemes.
  – Example: n-1. Is "-1" a single token (a signed number) or two tokens ("-" and "1")?
  – Solutions:
    – Postpone the decision to the syntactic analyzer
    – Do not allow a sign prefix on numbers in the lexical specification
    – Interact with the syntactic analyzer to find a solution (induces coupling)
Example
– Alphabet:
  – {:, *, =, (, ), <, >, {, }, [a..z], [0..9]}
– Simple tokens:
  – {(, ), {, }, :, <, >}
– Composite tokens:
  – {:=, >=, <=, <>, (*, *)}
– Words:
  – id ::= letter(letter | digit)*
  – num ::= digit*
Example
– Ambiguity problems:

  Character   Possible tokens
  :           :, :=
  <           <, <=, <>
  >           >, >=
  (           (, (*
  *           *, *)

– Backtracking:
  – Must back up a character when we read a character that is part of the next token.
  – Occurrences of backtracking are coded in the table.
[DFA diagram: start state 1, with transitions on l (letter), d (digit), {, }, (, *, ), :, =, <, > and sp (space) through states 2–20; double circles mark final states, distinguishing final states with and without backtracking]
Table-driven Scanner (Table)

  State  l   d   {   }   (   *   )   :   =   <   >   sp  Backup  Token
  1      2   4   6   20  8   20  20  12  19  14  17  1
  2      2   2   3   3   3   3   3   3   3   3   3   3
  3      1   1   1   1   1   1   1   1   1   1   1   1   yes    [id]
  4      5   4   5   5   5   5   5   5   5   5   5   5
  5      1   1   1   1   1   1   1   1   1   1   1   1   yes    [num]
  6      6   6   6   7   6   6   6   6   6   6   6   6
  7      1   1   1   1   1   1   1   1   1   1   1   1   no     [{…}]
  8      20  20  20  20  20  9   20  20  20  20  20  20
  9      9   9   9   9   9   10  9   9   9   9   9   9
  10     9   9   9   9   9   10  11  9   9   9   9   9
  11     1   1   1   1   1   1   1   1   1   1   1   1   no     [(*…*)]
  12     20  20  20  20  20  20  20  20  13  20  20  20
  13     1   1   1   1   1   1   1   1   1   1   1   1   no     [:=]
  14     20  20  20  20  20  20  20  20  15  20  16  20
  15     1   1   1   1   1   1   1   1   1   1   1   1   no     [<=]
  16     1   1   1   1   1   1   1   1   1   1   1   1   no     [<>]
  17     20  20  20  20  20  20  20  20  18  20  20  20
  18     1   1   1   1   1   1   1   1   1   1   1   1   no     [>=]
  19     1   1   1   1   1   1   1   1   1   1   1   1   no     [=]
  20     1   1   1   1   1   1   1   1   1   1   1   1   yes    [various]
Table-driven Scanner (Algorithm)

nextToken()
    state = 1
    token = null
    do
        lookup = nextChar()
        state = table(state, lookup)
        if ( isFinalState(state) )
            token = createToken()
            if ( table(state, "backup") == yes )
                backupChar()
    until ( token != null )
    return ( token )
Table-driven Scanner
– nextToken()
  – Extracts the next token in the program (called by the syntactic analyzer)
– nextChar()
  – Reads the next character (spaces excepted) in the input program
– backupChar()
  – Backs up one character in the input file
Table-driven Scanner
– isFinalState(state)
  – Returns TRUE if state is a final state
– table(state, column)
  – Returns the value corresponding to [state, column] in the state transition table
– createToken()
  – Creates and returns a structure that contains the token type, its location in the source code, and its value (for literals)
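Putting the algorithm and these helper routines together, here is a minimal runnable sketch in Python. It drives the small three-state id table from earlier rather than the full 20-state table, and the Scanner class, its buffering, and the tuple token format are illustrative assumptions rather than the slides' design:

    class Scanner:
        # Transition table for the id DFA: 0 = start, 1 = in id, 2 = final (backup)
        TABLE = {(0, "letter"): 1, (1, "letter"): 1, (1, "digit"): 1, (1, "other"): 2}
        FINAL = {2: True}   # final state -> does it require backing up?

        def __init__(self, text):
            self.text, self.pos = text, 0

        def nextChar(self):
            # "\0" acts as an end-of-input sentinel
            c = self.text[self.pos] if self.pos < len(self.text) else "\0"
            self.pos += 1
            return c

        def backupChar(self):
            self.pos -= 1

        def nextToken(self):
            while self.pos < len(self.text) and self.text[self.pos].isspace():
                self.pos += 1                      # skip white space
            if self.pos >= len(self.text):
                return None                        # end of input
            start, state = self.pos, 0
            while True:
                c = self.nextChar()
                col = ("letter" if c.isalpha()
                       else "digit" if c.isdigit() else "other")
                state = self.TABLE.get((state, col))
                if state is None:
                    raise ValueError(f"lexical error at position {self.pos - 1}")
                if state in self.FINAL:
                    if self.FINAL[state]:          # last char belongs to the next token
                        self.backupChar()
                    return ("id", self.text[start:self.pos])

    s = Scanner("distance rate x1")
    while (t := s.nextToken()) is not None:
        print(t)   # ('id', 'distance'), ('id', 'rate'), ('id', 'x1')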
Hand-written Scanner

nextToken()
    c = nextChar()
    case (c) of
        "[a..z],[A..Z]":
            c = nextChar()
            while ( c in {[a..z],[A..Z],[0..9]} ) do
                s = makeUpString()
                c = nextChar()
            if ( isReservedWord(s) ) then
                token = createToken(RESWORD, null)
            else
                token = createToken(ID, s)
            backupChar()
        "[0..9]":
            c = nextChar()
            while ( c in [0..9] ) do
                v = makeUpValue()
                c = nextChar()
            token = createToken(NUM, v)
            backupChar()
Hand-written Scanner (continued)

        "{":
            c = nextChar()
            while ( c != "}" ) do
                c = nextChar()
            token = createToken(LBRACK, null)
        "(":
            c = nextChar()
            if ( c == "*" ) then
                c = nextChar()
                repeat
                    while ( c != "*" ) do
                        c = nextChar()
                    c = nextChar()
                until ( c == ")" )
                return
            else
                token = createToken(LPAR, null)
                backupChar()
        ":":
            c = nextChar()
            if ( c == "=" ) then
                token = createToken(ASSIGNOP, null)
            else
                token = createToken(COLON, null)
                backupChar()
Hand-written Scanner (continued)

        "<":
            c = nextChar()
            if ( c == "=" ) then
                token = createToken(LEQ, null)
            else if ( c == ">" ) then
                token = createToken(NEQ, null)
            else
                token = createToken(LT, null)
                backupChar()
        ">":
            c = nextChar()
            if ( c == "=" ) then
                token = createToken(GEQ, null)
            else
                token = createToken(GT, null)
                backupChar()
        ")": token = createToken(RPAR, null)
        "}": token = createToken(RBRACK, null)
        "*": token = createToken(STAR, null)
        "=": token = createToken(EQ, null)
    end case
    return token
Part II: Error Recovery in Lexical Analysis
Possible Lexical Errors
– Depends on the accepted conventions:
  – Invalid character
  – Letter not allowed to terminate a number
  – Numerical overflow
  – Identifier too long
  – End of line before end of string
– Are these lexical errors?
Accepted or Not?
– 123a
  – An invalid number, or the number 123 followed by the identifier a?
– 123456789012345678901234567
  – Related to the machine's limitations
– "Hello world
  – Either the end of line is skipped or an error is reported
– ThisIsAVeryLongVariableName = 1
  – Limit identifier length?
Error Recovery Techniques
– Finding only the first error is not acceptable
– Panic mode:
  – Skip characters until a valid character is read
– Guess mode:
  – Do pattern matching between erroneous strings and valid strings
  – Example: beggin vs. begin
  – Rarely implemented
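A minimal sketch of panic mode, assuming a VALID set of characters on which the scanner can restart; the function name and the error-message format are made up:

    # Characters the scanner knows how to start or continue a token with
    VALID = set("abcdefghijklmnopqrstuvwxyz0123456789:*=(){}<> \t\n")

    def panic_mode_skip(text, pos, errors):
        # Skip and report invalid characters until scanning can resume
        start = pos
        while pos < len(text) and text[pos] not in VALID:
            pos += 1
        errors.append(f"invalid characters {text[start:pos]!r} skipped at {start}")
        return pos

Because the scanner resumes at the returned position instead of stopping, later errors in the same program are still detected and reported.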
Conclusions
Possible Implementations
– Lexical analyzer generator (e.g. Lex)
  + Safe, quick
  – Must learn the software; unable to handle unusual situations
– Table-driven lexical analyzer
  + General and adaptable method; the same driver function can be used for all table-driven lexical analyzers
  – Building the transition table can be tedious and error-prone
Possible Implementations
– Hand-written lexical analyzer
  + Can be optimized; can handle any unusual situation; easy to build for most languages
  – Error-prone; not adaptable or maintainable
Lexical Analyzer's Modularity
– Why should the lexical analyzer and the syntactic analyzer be separated?
  – Modularity/maintainability: the system is more modular, thus more maintainable
  – Efficiency: modularity means task specialization, which makes optimization easier
  – Reusability: the whole lexical analyzer can be changed without affecting other parts