1
Lecture 6 Lexical Analysis Xiaoyin Wang CS5363 Programming Languages and Compilers
2
Last Class Finite State Machines From Grammar to FSM DFA vs. NFA
3
Today’s Class Scanner Concepts Tokens Basic Strategy Implementation: two-block buffer Lexical Errors
4
Scanner Example Input text: // this statement does very little if (x >= y) y = 42; Token Stream: IF LPAREN ID(x) GEQ ID(y) RPAREN ID(y) BECOMES INT(42) SCOLON Note: tokens are atomic items, not character strings
5
Lexical Analyzer in Perspective source program → lexical analyzer → token → parser; the parser calls back with "get next token"; both components consult the symbol table Important Issue: What are the responsibilities of each box? Focus on Lexical Analyzer and Parser
6
The Input Read string input Might be a sequence of characters (Unix) Might be a sequence of lines (VMS) Character set: ASCII, ISO Latin-1, ISO 10646 (Unicode), others (EBCDIC, JIS, etc.)
7
The Output A series of tokens Punctuation ( ) ;, [ ] Operators + - ** := Keywords begin end if Identifiers Square_Root String literals “hello this is a string” Character literals ‘x’ Numeric literals 123 4_5.23e+2 16#ac#
8
Lexical Analysis: Terminology token: a name for a set of input strings with related structure. –Example: “identifier,” “integer constant” pattern: a rule describing the set of strings associated with a token. –Example: “a letter followed by zero or more letters, digits, or underscores.” lexeme: the actual input string that matches a pattern. –Example: count
9
Examples Input: count = 123 Tokens: identifier: rule "letter followed by …", lexeme count; assg_op: rule =, lexeme =; integer_const: rule "digit followed by …", lexeme 123
10
Attributes for Tokens If more than one lexeme can match the pattern for a token, the scanner must indicate the actual lexeme that matched. This information is given using an attribute associated with the token. Example: the program statement count = 123 yields the following token-attribute pairs: <identifier, pointer to the string "count">, <assg_op, > (no attribute needed), <integer_const, the integer value 123>
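The token-attribute pairing above can be sketched as a small tagged struct. This is an illustrative sketch: the `TokenType` tags, the field names, and the `tokenize_example` helper are invented for the example, not part of the slides.

```c
/* Illustrative token kinds for the example "count = 123" */
typedef enum { IDENT, ASSIGN, INT_CONST } TokenType;

typedef struct {
    TokenType type;
    const char *lexeme; /* for IDENT, would point into the symbol table */
    long value;         /* meaningful only for INT_CONST */
} Token;

/* Build the three token-attribute pairs for "count = 123". */
static int tokenize_example(Token out[3]) {
    out[0] = (Token){ IDENT, "count", 0 };
    out[1] = (Token){ ASSIGN, "=", 0 };
    out[2] = (Token){ INT_CONST, "123", 123 };
    return 3;
}
```

The attribute slot is only filled when the token class has more than one possible lexeme; assg_op carries no attribute.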
11
Free form vs Fixed form Free form languages White space does not matter Tabs, spaces, new lines, carriage returns Only the ordering of tokens is important Fixed format languages Layout is critical Fortran, label in cols 1-6 COBOL, area A B Lexical analyzer must worry about layout
12
Lexical Analyzer in Perspective LEXICAL ANALYZER Scan Input Remove WS, NL, … Identify Tokens Create Symbol Table Insert Tokens into ST Generate Errors Send Tokens to Parser PARSER Perform Syntax Analysis Actions Dictated by Token Order Update Symbol Table Entries Create Abstract Rep. of Source Generate Errors And More…. (We’ll see later)
13
What Factors Have Influenced the Functional Division of Labor ? Separation of Lexical Analysis From Parsing Presents a Simpler Conceptual Model From a Software Engineering Perspective Division Emphasizes High Cohesion and Low Coupling Implies Well Specified Parallel Implementation
14
What Factors Have Influenced the Functional Division of Labor ? Separation Increases Compiler Efficiency (I/O Techniques to Enhance Lexical Analysis) Separation Promotes Portability. This is critical today, when platforms (OSs and Hardware) are numerous and varied! Emergence of Platform Independence - Java
15
Today’s Class Scanner Concepts Tokens Basic Strategy Implementation: two-block buffer Lexical Errors
16
Introducing Basic Terminology
Token     | Sample Lexemes        | Informal Description of Pattern
const     | const                 | const
if        | if                    | if
relation  | <, >, >=              | < or >= or >
id        | pi, count, D2         | letter followed by letters and digits
num       | 3.1416, 0, 6.02E23    | any numeric constant
literal   | "core dumped"         | any characters between " and " except "
Classifies Pattern. Actual values are critical. Info is: 1. Stored in symbol table 2. Returned to parser
17
Punctuation Typically individual special characters Such as '(', ')' Sometimes double characters E.g. (* treated as a kind of bracket Returned just as identity of token And perhaps location: For error messages and debugging purposes
18
Punctuation in TL 15.0 LP: ( RP: ) SC: ; All of them are single characters, very easy to identify
19
Operators Like punctuation No real difference for lexical analyzer Typically single or double special chars Operators + - Operations := Returned just as identity of token And perhaps location
20
Operators in TL 15.0 ASGN: := MULTIPLICATIVE: * | div | mod ADDITIVE: + | - COMPARE: = | != | < | <= Up to 3 characters
21
Keywords Reserved identifiers E.g. BEGIN END in Pascal, if in C Returned just as token identity With possible location information Unreserved keywords (e.g. PL/1) Handled as identifiers (parser distinguishes)
22
Keywords in TL 15.0 program | begin | end if | then | else do | while true | false var | as | int | bool Reserved identifiers: can be identified together with identifiers
23
Identifiers Rules differ –Length, allowed characters, separators Need to build table –So that a1 is recognized as a1 –Typical structure: hash table Lexical analyzer returns token type –And key to table entry –Table entry includes location information
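A minimal sketch of such a table, using open addressing with linear probing so that every occurrence of a1 maps to the same entry. The `symtab_intern` name, the table size, and the hash function are illustrative choices; entries are never evicted, and the sketch stores the caller's string rather than copying the lexeme.

```c
#include <string.h>

#define SYMTAB_SIZE 211  /* small prime; illustrative choice */

static const char *symtab[SYMTAB_SIZE];

static unsigned hash_name(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % SYMTAB_SIZE;
}

/* Return the table slot for `name`, inserting it on first sight,
 * so that every occurrence of "a1" maps to the same entry.
 * Assumes the table never fills. */
static int symtab_intern(const char *name) {
    unsigned i = hash_name(name);
    while (symtab[i] && strcmp(symtab[i], name) != 0)
        i = (i + 1) % SYMTAB_SIZE;      /* linear probing */
    if (!symtab[i]) symtab[i] = name;   /* real table would copy the lexeme */
    return (int)i;
}
```

The lexical analyzer would return the token type plus this slot index as the key to the table entry.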
24
Identifiers in TL 15.0 [A-Z][A-Z0-9]* Simple and differentiated from keywords
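A direct check of the pattern [A-Z][A-Z0-9]* might look like this; the `is_tl_identifier` helper is a hypothetical name, and a generated scanner would fold this test into its automaton rather than call a function.

```c
/* Does `s` match the TL 15.0 identifier pattern [A-Z][A-Z0-9]* ? */
static int is_tl_identifier(const char *s) {
    if (!(*s >= 'A' && *s <= 'Z')) return 0;   /* must start with A-Z */
    for (s++; *s; s++)
        if (!((*s >= 'A' && *s <= 'Z') || (*s >= '0' && *s <= '9')))
            return 0;                          /* only A-Z or 0-9 after */
    return 1;
}
```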
25
Numeric Literals Also need a table Typically record value –E.g. 123 = 0123 = 01_23 (Ada) –But usually do not use int for values Because may have different characteristics Float stuff much more complex –Denormal numbers, correct rounding –Very delicate stuff
26
Number Literals in TL 15.0 [1-9][0-9]* | 0 Simple and differentiated from keywords
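The pattern [1-9][0-9]* | 0 forbids leading zeros; a hand-written check could look like this (the `is_tl_number` helper is an illustrative name):

```c
/* Does `s` match the TL 15.0 number pattern [1-9][0-9]* | 0 ? */
static int is_tl_number(const char *s) {
    if (s[0] == '0') return s[1] == '\0';      /* "0" alone; no leading zeros */
    if (!(s[0] >= '1' && s[0] <= '9')) return 0;
    for (s++; *s; s++)
        if (!(*s >= '0' && *s <= '9')) return 0;
    return 1;
}
```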
27
String Literals Text must be stored Actual characters are important –Not like identifiers –Character set issues –Table needed Lexical analyzer returns key to table May or may not be worth hashing
28
Character Literals Similar issues to string literals Lexical analyzer returns: Token type Identity of the character Note: cannot assume the character set of the host machine; it may differ from the target's
29
Handling Comments Comments have no effect on program Can therefore be eliminated by scanner But may need to be retrieved by tools Error detection issues –E.g. unclosed comments Scanner does not return comments
30
Comments in TL 15.0 Starts with ‘%’ until the end of the line No in-line comments Easy to remove
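Removing such comments is a single loop. A sketch, assuming the input is a NUL-terminated string and the '%' convention above (the `skip_comment` name is illustrative):

```c
/* Skip a TL 15.0 comment: '%' up to (but not past) the end of the line.
 * Returns the index of the first character after the comment body. */
static int skip_comment(const char *src, int pos) {
    if (src[pos] != '%') return pos;           /* not a comment: no-op */
    while (src[pos] != '\n' && src[pos] != '\0')
        pos++;
    return pos;
}
```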
31
Case Sensitivity Some languages have case equivalence: Pascal, Ada Some do not: C, Java Lexical analyzer ignores case if needed: This_Routine = THIS_RouTine Error analysis may need exact casing
32
Today’s Class Scanner Tokens Basic Strategy Implementation: two-block buffer Lexical Errors
33
Basic Strategy Parse input to generate the token stream Model the whole language as a regular language Parse the input!
34
Basic Strategy Each token can be modeled with a finite state machine Language = Token*
35
Example of Token Models <=
36
Example of Token Models Number literal
37
Example of Token Models Number literal
38
Combine All Models Token*...
39
Basic Strategy However, this will not work… The reason: the grammar is ambiguous, because one token can be a prefix of another 'abc' can be 'a' and 'bc' | 'ab' and 'c' | 'abc' '<=' can be '<' and '=' | '<='
40
Basic Strategy The rule to remove the ambiguity: Always identify the shortest token: the basic strategy works unchanged, but programs become hard to write Always identify the longest token: we need to revise the basic strategy, adding a 'backspace' operation after finding a token
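The longest-match-plus-backspace idea can be illustrated on '<' vs '<=': consume the lookahead character only when it extends the lexeme, and otherwise leave it in the input for the next token. The `scan_rel` helper and the token names are invented for this sketch.

```c
typedef enum { TOK_LT, TOK_LEQ, TOK_OTHER } RelTok;

/* Longest-match scan at position *pos in `src`: try to extend "<"
 * to "<=", and back up (leave *pos before the extra character) when
 * the lookahead does not belong to the lexeme. */
static RelTok scan_rel(const char *src, int *pos) {
    if (src[*pos] == '<') {
        (*pos)++;                    /* consume '<' */
        if (src[*pos] == '=') {      /* lookahead extends the lexeme */
            (*pos)++;
            return TOK_LEQ;
        }
        return TOK_LT;               /* lookahead char stays in the input */
    }
    return TOK_OTHER;
}
```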
41
Combine All Models with Backspace Token*...
42
An Example Combined DFA with backspace (reconstructed from the slide diagrams):
States: q0 (start), q1 through q7, and qx (error)
q0 on [1-9] goes to q1; q1 on [0-9] stays in q1; q1 on a non-digit emits number
q0 on 0 goes to q2; q2 on a non-digit emits 0
q0 on > goes to q3; q3 on = goes to q4, which emits >= on any character; q3 on any other character emits >
q0 on [A-Z] goes to q5; q5 on [A-Z0-9] stays in q5; q5 on any other character emits id
q0 on ( goes to q6, which emits ( on any character
q0 on ) goes to q7, which emits ) on any character
After each emitted token the machine backspaces one character and returns to q0.
The animation steps the input IF(A2>=35) through this machine, emitting the token stream: IF ( id: A2 >= Num: 35 )
59
Transition diagrams Transition diagram for relop
60
Transition diagrams (cont.) Transition diagram for reserved words and identifiers
61
Transition diagrams (cont.) Transition diagram for unsigned numbers
62
Transition diagrams (cont.) Transition diagram for whitespace
63
Finite Automata: An Example A finite automaton to match C-style in-line comments:
64
Today’s Class Scanner Tokens Basic Strategy Implementation: two-block buffer Lexical Errors
65
Issues to Address Speed –Lexical analysis can take a lot of time –Minimize processing per character I/O is also an issue (read large blocks) –We compile frequently Compilation time is important –Especially during development
66
Interface to Lexical Analyzer Convert entire file to a file of tokens –Lexical analyzer is separate phase Parser calls lexical analyzer –Get next token –This approach avoids extra I/O –Parser builds tree as we go along
67
Basic Scanning technique Use 1 character of look-ahead –Obtain char with getc() Do a case analysis –Based on lookahead char –Based on current lexeme Outcome –If char can extend lexeme, all is well, go on. –If char cannot extend lexeme: Figure out what the complete lexeme is and return its token Put the lookahead back into the symbol stream
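A sketch of this case analysis for integer lexemes, with a string cursor standing in for getc()/ungetc(); the `Input`, `next_char`, `push_back`, and `scan_int` names are all illustrative.

```c
#include <ctype.h>

/* A string cursor stands in for the input stream of getc()/ungetc(). */
typedef struct { const char *src; int pos; } Input;

static int next_char(Input *in) {
    int c = (unsigned char)in->src[in->pos];
    if (c != '\0') in->pos++;
    return c ? c : -1;                 /* -1 plays the role of EOF */
}

static void push_back(Input *in) { if (in->pos > 0) in->pos--; }

/* Accumulate digits while the lookahead can extend the lexeme;
 * on a non-digit, put the lookahead back for the next token. */
static int scan_int(Input *in) {
    int value = 0, c;
    while ((c = next_char(in)) != -1 && isdigit(c))
        value = value * 10 + (c - '0');
    if (c != -1) push_back(in);        /* lookahead char was not ours */
    return value;
}
```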
68
Input Buffering Scanner performance is crucial: –This is the only part of the compiler that examines the entire input program one character at a time. –Disk input can be slow. –The scanner accounts for ~25-30% of total compile time. We need lookahead to determine when a match has been found. Scanners use double-buffering to minimize the overheads associated with this.
69
I/O - Key For Successful Lexical Analysis Character-at-a-time I/O Block / Buffered I/O Utilize Block of memory Stage data from source to buffer block at a time Tradeoffs ?
70
I/O - Key For Successful Lexical Analysis Maintain two blocks - why? Asynchronous I/O fills one block while lexical analysis runs on the 2nd block [diagram: Block 1 | Block 2, with a pointer advancing through them] When done, issue I/O; still process the token in the 2nd block
71
Buffer Pairs Use two N-byte buffers (N = size of a disk block; typically, N = 1024 or 4096). Read N bytes into one half of the buffer each time. If input has less than N bytes, put a special EOF marker in the buffer. While one buffer is being processed, read N bytes into the other buffer (“circular buffers”).
72
Buffer pairs (cont'd) Code:
if (fwd at end of first half) {
    reload second half;
    set fwd to point to beginning of second half;
} else if (fwd at end of second half) {
    reload first half;
    set fwd to point to beginning of first half;
} else
    fwd++;
This takes two tests for each advance of the fwd pointer.
73
Buffer pairs: Sentinels Objective: Optimize the common case by reducing the number of tests to one per advance of fwd. Idea: Extend each buffer half to hold a sentinel at the end. This is a special character that cannot occur in a program (e.g., EOF). It signals the need for some special action (fill other buffer-half, or terminate processing).
74
Buffer pairs with sentinels (cont'd) Code:
fwd++;
if ( *fwd == EOF ) {  /* special processing needed */
    if (fwd at end of first half) {
        reload second half; set fwd to beginning of second half;
    } else if (fwd at end of second half) {
        reload first half; set fwd to beginning of first half;
    } else  /* EOF inside a half: end of input */
        terminate processing;
}
The common case now needs just a single test per character.
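A toy, self-contained version of a sentinel buffer pair: a string stands in for the disk file, '\0' plays the role of the EOF sentinel, and N is deliberately tiny so refills actually happen. All names and sizes are illustrative.

```c
#define N 8                      /* tiny buffer half, for demonstration */

typedef struct {
    char buf[2 * N + 2];         /* two halves, each followed by a sentinel */
    const char *src;             /* stands in for the input file */
    int src_pos;
    int fwd;                     /* forward pointer into buf */
} BufPair;

/* Fill one half from the source; '\0' doubles as sentinel and EOF. */
static void fill_half(BufPair *b, int half) {
    int base = half * (N + 1), i;
    for (i = 0; i < N && b->src[b->src_pos] != '\0'; i++)
        b->buf[base + i] = b->src[b->src_pos++];
    b->buf[base + i] = '\0';     /* sentinel (marks EOF if i < N) */
}

static void buf_init(BufPair *b, const char *src) {
    b->src = src; b->src_pos = 0; b->fwd = 0;
    fill_half(b, 0);
}

/* Advance and return the next character: one sentinel test per char in
 * the common case, extra work only at a half boundary or at real EOF. */
static int buf_next(BufPair *b) {
    char c = b->buf[b->fwd++];
    if (c != '\0') return c;               /* common case: single test */
    if (b->fwd - 1 == N) {                 /* sentinel at end of half 1 */
        fill_half(b, 1);
        b->fwd = N + 1;
        return buf_next(b);
    }
    if (b->fwd - 1 == 2 * N + 1) {         /* sentinel at end of half 2 */
        fill_half(b, 0);
        b->fwd = 0;
        return buf_next(b);
    }
    return -1;                             /* '\0' inside a half: real EOF */
}
```

A real scanner would also track the lexeme-begin pointer and issue the refills asynchronously; this sketch only shows the one-test-per-character structure.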
75
Handling Reserved Words Hard-wire them directly into the scanner automaton: harder to modify; increases the size and complexity of the automaton; performance benefits unclear (fewer tests, but cache effects due to larger code size). Fold them into “identifier” case, then look up a keyword table: simpler, smaller code; table lookup cost can be mitigated using perfect hashing.
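A sketch of the second approach, using the TL 15.0 keywords from the earlier slide: scan a word as an identifier first, then look it up. A plain linear scan stands in here for the perfect-hash lookup mentioned above; the function and constant names are illustrative.

```c
#include <string.h>

/* TL 15.0 keywords, folded into the identifier case. */
static const char *tl_keywords[] = {
    "program", "begin", "end", "if", "then", "else", "do", "while",
    "true", "false", "var", "as", "int", "bool"
};

enum { TOK_KEYWORD = 1, TOK_IDENT = 2 };

/* Classify an already-scanned word as keyword or ordinary identifier. */
static int classify_word(const char *lexeme) {
    size_t n = sizeof tl_keywords / sizeof tl_keywords[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(lexeme, tl_keywords[i]) == 0)
            return TOK_KEYWORD;
    return TOK_IDENT;  /* not reserved */
}
```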
76
Implementing Finite Automata 1 Encoded as program code: each state corresponds to a labeled code fragment; state transitions are represented as control transfers. E.g.:
while ( TRUE ) {
  …
  state_k:
    ch = NextChar();   /* buffer mgt happens here */
    switch (ch) {
      case … : goto …;  /* state transition */
      …
    }
  state_m:  /* token-reading state, such as qx */
    copy lexeme to where parser can get at it;
    return token_type;
  …
}
77
Direct-Coded Automaton: Example
int scanner() {
    char ch;
  state_1:                       /* initial state */
    ch = NextChar();
    switch (ch) {
      case 'a': goto state_2;
      case 'b': goto state_3;
      default:  Error();
    }
  state_2:
    …
  state_3:
    ch = NextChar();
    switch (ch) {
      case 'a': goto state_2;
      default:  return SUCCESS;
    }
}
78
Implementing Finite Automata 2 Table-driven automata (e.g., lex, flex): Use a table to encode transitions: next_state = T(curr_state, next_char); Use one bit in the state number to indicate whether it is a final (or error) state. If so, consult a separate table for what action to take. [diagram: table T indexed by current state (rows) and next input character (columns)]
79
Table-Driven Automaton: Example
#define IsFinal(s) ((s) < 0)
int scanner() {
    int ch;
    int currState = 1;
    while (TRUE) {
        ch = NextChar();
        if (ch == EOF) return 0;        /* fail */
        currState = T[currState][ch];
        if (IsFinal(currState)) {
            return 1;                   /* success */
        }
    } /* while */
}
Transition table T:
          input a   input b
state 1      2         3
state 2      2         3
state 3      2         -
80
Transition-diagram-based analyzer
TOKEN getToken() {
    TOKEN retToken = new(Token);
    while (1) { /* repeat character processing until a return or failure occurs */
        switch (state) {
        case 1:
            c = nextchar();
            if (c == 'a') state = 2;
            else if (c == 'b') state = 3;
            else fail();      /* report lexical error */
            break;
        case 2:
            …
            break;
        case 3:
            retract();
            retToken.attribute = TOK;
            return retToken;
        }
    }
}
81
What do we do on finding a match? A match is found when: –The current automaton state is a token reading state Actions on finding a match: –if appropriate, copy lexeme (or other token attribute) to where the parser can access it; –save any necessary scanner state so that scanning can subsequently resume at the right place; –return a value indicating the token found.
82
Other implementations Regular expression matching for simple lexical analysis: Check whether the current string starts with operators / punctuation / comments Check enclosing tokens earlier than prefix tokens, e.g., check '>=' then check '>' If it starts with a digit, read until a non-digit character, then check against the regexp for numbers If it starts with a letter, read until a character that is neither a letter nor a digit, then check against the regexp for identifiers Otherwise: lexical error
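Checking enclosing tokens before their prefixes amounts to ordering the candidate list so that two-character operators precede their one-character prefixes. The `match_operator` helper is illustrative, and the operator set shown mixes TL and generic operators for the sake of the example.

```c
#include <string.h>

/* Try multi-character operators before their prefixes: ">=" before ">",
 * ":=" before nothing shorter.  Returns the matched length (0 if none). */
static int match_operator(const char *s) {
    static const char *ops[] = { ">=", "<=", "!=", ":=",   /* longer first */
                                 ">", "<", "=", "+", "-", "*" };
    size_t n = sizeof ops / sizeof ops[0];
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(ops[i]);
        if (strncmp(s, ops[i], len) == 0)
            return (int)len;
    }
    return 0;  /* no operator at this position */
}
```

Reversing the order (trying ">" before ">=") would wrongly split ">=" into two tokens, which is exactly the ambiguity the longest-match rule resolves.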
83
Today’s Class Scanner Tokens Basic Strategy Implementation: two-block buffer Lexical Errors
84
Handling Lexical Errors Error Handling is very localized, with Respect to Input Source For example: whil ( x := 0 ) do generates no lexical errors in PASCAL In what Situations do Errors Occur? –Prefix of remaining input doesn’t match any defined token
85
Handling Lexical Errors Possible error recovery actions: –Deleting or Inserting Input Characters –Replacing or Transposing Characters Or, skip over to next separator to “ignore” problem
86
Lexical Error Examples TL 15.0 var TX as 98 VAR ABC as int ABC := 123AB ABC := 123ab B $= 1E20
87
Remove Delimiters White space, tabs, new lines, and carriage returns are common delimiters In the scanner, these delimiters are skipped: They are not viewed as tokens May result in lexical errors (e.g. if your language does not allow carriage returns) You simply go back to q0 after viewing a delimiter
88
TL 15.0 The language asks for at least one delimiter after each token –To make it easier So a + b is valid, but a+b is not.
89
Change to the model Token*... Error
90
Today’s Class Scanner Tokens Basic Strategy Implementation: two-block buffer Lexical Errors
91
Next Class Parser LL(1) grammar LL(1) parser LL(k) grammar and parsers