1
Lecture 6 Lexical Analysis Xiaoyin Wang CS5363 Programming Languages and Compilers
2
Last Class Finite State Machines From Grammar to FSM DFA vs. NFA
3
Today’s Class Scanner Concepts Tokens Basic Strategy Implementation: two-block buffer Lexical Errors
4
Scanner Example Input text: // this statement does very little if (x >= y) y = 42; Token Stream: IF LPAREN ID(x) GEQ ID(y) RPAREN ID(y) BECOMES INT(42) SCOLON Note: tokens are atomic items, not character strings
5
Lexical Analyzer in Perspective source program → lexical analyzer → token → parser; the parser calls back with "get next token"; both components consult the symbol table Important Issue: What are the responsibilities of each box? Focus on Lexical Analyzer and Parser
6
The Input Read string input Might be a sequence of characters (Unix) Might be a sequence of lines (VMS) Character set: ASCII, ISO Latin-1, ISO 10646 (Unicode), others (EBCDIC, JIS, etc.)
7
The Output A series of tokens Punctuation ( ) ;, [ ] Operators + - ** := Keywords begin end if Identifiers Square_Root String literals “hello this is a string” Character literals ‘x’ Numeric literals 123 4_5.23e+2 16#ac#
8
Lexical Analysis: Terminology token: a name for a set of input strings with related structure. –Example: “identifier,” “integer constant” pattern: a rule describing the set of strings associated with a token. –Example: “a letter followed by zero or more letters, digits, or underscores.” lexeme: the actual input string that matches a pattern. –Example: count
9
Examples Input: count = 123 Tokens: identifier: rule "letter followed by …", lexeme count; assg_op: rule =, lexeme =; integer_const: rule "digit followed by …", lexeme 123
10
Attributes for Tokens If more than one lexeme can match the pattern for a token, the scanner must indicate the actual lexeme that matched. This information is given using an attribute associated with the token. Example: the program statement count = 123 yields the following token-attribute pairs: <identifier, pointer to the string "count">, <assg_op, > (no attribute needed), <integer_const, the integer value 123>
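The token-attribute pairing above can be sketched as a small tagged struct. This is an illustrative sketch: the `TokenType` tags, the field names, and the `tokenize_example` helper are invented for the example, not part of the slides.

```c
/* Illustrative token kinds for the example "count = 123" */
typedef enum { IDENT, ASSIGN, INT_CONST } TokenType;

typedef struct {
    TokenType type;
    const char *lexeme; /* for IDENT, would point into the symbol table */
    long value;         /* meaningful only for INT_CONST */
} Token;

/* Build the three token-attribute pairs for "count = 123". */
static int tokenize_example(Token out[3]) {
    out[0] = (Token){ IDENT, "count", 0 };
    out[1] = (Token){ ASSIGN, "=", 0 };
    out[2] = (Token){ INT_CONST, "123", 123 };
    return 3;
}
```

The attribute slot is only filled when the token class has more than one possible lexeme; assg_op carries no attribute.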
11
Free form vs Fixed form Free form languages White space does not matter Tabs, spaces, new lines, carriage returns Only the ordering of tokens is important Fixed format languages Layout is critical Fortran, label in cols 1-6 COBOL, area A B Lexical analyzer must worry about layout
12
Lexical Analyzer in Perspective LEXICAL ANALYZER Scan Input Remove WS, NL, … Identify Tokens Create Symbol Table Insert Tokens into ST Generate Errors Send Tokens to Parser PARSER Perform Syntax Analysis Actions Dictated by Token Order Update Symbol Table Entries Create Abstract Rep. of Source Generate Errors And More…. (We’ll see later)
13
What Factors Have Influenced the Functional Division of Labor ? Separation of Lexical Analysis From Parsing Presents a Simpler Conceptual Model From a Software Engineering Perspective Division Emphasizes High Cohesion and Low Coupling Implies Well Specified Parallel Implementation
14
What Factors Have Influenced the Functional Division of Labor ? Separation Increases Compiler Efficiency (I/O Techniques to Enhance Lexical Analysis) Separation Promotes Portability. This is critical today, when platforms (OSs and Hardware) are numerous and varied! Emergence of Platform Independence - Java
15
Today’s Class Scanner Concepts Tokens Basic Strategy Implementation: two-block buffer Lexical Errors
16
Introducing Basic Terminology
Token     | Sample Lexemes        | Informal Description of Pattern
const     | const                 | const
if        | if                    | if
relation  | <, >, >=              | < or >= or >
id        | pi, count, D2         | letter followed by letters and digits
num       | 3.1416, 0, 6.02E23    | any numeric constant
literal   | "core dumped"         | any characters between " and " except "
Classifies Pattern. Actual values are critical. Info is: 1. Stored in symbol table 2. Returned to parser
17
Punctuation Typically individual special characters Such as '(', ')' Sometimes double characters E.g. (* treated as a kind of bracket Returned just as identity of token And perhaps location: For error messages and debugging purposes
18
Punctuation in TL 15.0 LP: ( RP: ) SC: ; All of them are single characters, very easy to identify
19
Operators Like punctuation No real difference for lexical analyzer Typically single or double special chars Operators + - Operations := Returned just as identity of token And perhaps location
20
Operators in TL 15.0 ASGN: := MULTIPLICATIVE: * | div | mod ADDITIVE: + | - COMPARE: = | != | < | <= Up to 3 characters
21
Keywords Reserved identifiers E.g. BEGIN END in Pascal, if in C Returned just as token identity With possible location information Unreserved keywords (e.g. PL/1) Handled as identifiers (parser distinguishes)
22
Keywords in TL 15.0 program | begin | end if | then | else do | while true | false var | as | int | bool Reserved identifiers: can be identified together with identifiers
23
Identifiers Rules differ –Length, allowed characters, separators Need to build table –So that a1 is recognized as a1 –Typical structure: hash table Lexical analyzer returns token type –And key to table entry –Table entry includes location information
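A minimal sketch of such a table, using open addressing with linear probing so that every occurrence of a1 maps to the same entry. The `symtab_intern` name, the table size, and the hash function are illustrative choices; entries are never evicted, and the sketch stores the caller's string rather than copying the lexeme.

```c
#include <string.h>

#define SYMTAB_SIZE 211  /* small prime; illustrative choice */

static const char *symtab[SYMTAB_SIZE];

static unsigned hash_name(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % SYMTAB_SIZE;
}

/* Return the table slot for `name`, inserting it on first sight,
 * so that every occurrence of "a1" maps to the same entry.
 * Assumes the table never fills. */
static int symtab_intern(const char *name) {
    unsigned i = hash_name(name);
    while (symtab[i] && strcmp(symtab[i], name) != 0)
        i = (i + 1) % SYMTAB_SIZE;      /* linear probing */
    if (!symtab[i]) symtab[i] = name;   /* real table would copy the lexeme */
    return (int)i;
}
```

The lexical analyzer would return the token type plus this slot index as the key to the table entry.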
24
Identifiers in TL 15.0 [A-Z][A-Z0-9]* Simple and differentiated from keywords
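A direct check of the pattern [A-Z][A-Z0-9]* might look like this; the `is_tl_identifier` helper is a hypothetical name, and a generated scanner would fold this test into its automaton rather than call a function.

```c
/* Does `s` match the TL 15.0 identifier pattern [A-Z][A-Z0-9]* ? */
static int is_tl_identifier(const char *s) {
    if (!(*s >= 'A' && *s <= 'Z')) return 0;   /* must start with A-Z */
    for (s++; *s; s++)
        if (!((*s >= 'A' && *s <= 'Z') || (*s >= '0' && *s <= '9')))
            return 0;                          /* only A-Z or 0-9 after */
    return 1;
}
```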
25
Numeric Literals Also need a table Typically record value –E.g. 123 = 0123 = 01_23 (Ada) –But usually do not use int for values Because may have different characteristics Float stuff much more complex –Denormal numbers, correct rounding –Very delicate stuff
26
Number Literals in TL 15.0 [1-9][0-9]* | 0 Simple and differentiated from keywords
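The pattern [1-9][0-9]* | 0 forbids leading zeros; a hand-written check could look like this (the `is_tl_number` helper is an illustrative name):

```c
/* Does `s` match the TL 15.0 number pattern [1-9][0-9]* | 0 ? */
static int is_tl_number(const char *s) {
    if (s[0] == '0') return s[1] == '\0';      /* "0" alone; no leading zeros */
    if (!(s[0] >= '1' && s[0] <= '9')) return 0;
    for (s++; *s; s++)
        if (!(*s >= '0' && *s <= '9')) return 0;
    return 1;
}
```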
27
String Literals Text must be stored Actual characters are important –Not like identifiers –Character set issues –Table needed Lexical analyzer returns key to table May or may not be worth hashing
28
Character Literals Similar issues to string literals Lexical analyzer returns: Token type Identity of the character Note: cannot assume the character set of the host machine; it may differ from the target's
29
Handling Comments Comments have no effect on program Can therefore be eliminated by scanner But may need to be retrieved by tools Error detection issues –E.g. unclosed comments Scanner does not return comments
30
Comments in TL 15.0 Starts with ‘%’ until the end of the line No in-line comments Easy to remove
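Removing such comments is a single loop. A sketch, assuming the input is a NUL-terminated string and the '%' convention above (the `skip_comment` name is illustrative):

```c
/* Skip a TL 15.0 comment: '%' up to (but not past) the end of the line.
 * Returns the index of the first character after the comment body. */
static int skip_comment(const char *src, int pos) {
    if (src[pos] != '%') return pos;           /* not a comment: no-op */
    while (src[pos] != '\n' && src[pos] != '\0')
        pos++;
    return pos;
}
```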
31
Case Sensitivity Some languages have case equivalence: Pascal, Ada Some do not: C, Java Lexical analyzer ignores case if needed: This_Routine = THIS_RouTine Error analysis may need exact casing
32
Today’s Class Scanner Tokens Basic Strategy Implementation: two-block buffer Lexical Errors
33
Basic Strategy Parse input to generate the token stream Model the whole language as a regular language Parse the input!
34
Basic Strategy Each token can be modeled with a finite state machine Language = Token*
35
Example of Token Models <=
36
Example of Token Models Number literal
37
Example of Token Models Number literal
38
Combine All Models Token*...
39
Basic Strategy However, this will not work… The reason: the grammar is ambiguous, because one token can be a prefix of another 'abc' can be 'a' and 'bc' | 'ab' and 'c' | 'abc' '<=' can be '<' and '=' | '<='
40
Basic Strategy The rule to remove the ambiguity: Always identify the shortest token: the basic strategy works unchanged, but programs become hard to write Always identify the longest token: we need to revise the basic strategy, adding a 'backspace' operation after finding a token
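The longest-match-plus-backspace idea can be illustrated on '<' vs '<=': consume the lookahead character only when it extends the lexeme, and otherwise leave it in the input for the next token. The `scan_rel` helper and the token names are invented for this sketch.

```c
typedef enum { TOK_LT, TOK_LEQ, TOK_OTHER } RelTok;

/* Longest-match scan at position *pos in `src`: try to extend "<"
 * to "<=", and back up (leave *pos before the extra character) when
 * the lookahead does not belong to the lexeme. */
static RelTok scan_rel(const char *src, int *pos) {
    if (src[*pos] == '<') {
        (*pos)++;                    /* consume '<' */
        if (src[*pos] == '=') {      /* lookahead extends the lexeme */
            (*pos)++;
            return TOK_LEQ;
        }
        return TOK_LT;               /* lookahead char stays in the input */
    }
    return TOK_OTHER;
}
```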
41
Combine All Models with Backspace Token*...
42
An Example Combined DFA with backspace (reconstructed from the slide diagrams):
States: q0 (start), q1 through q7, and qx (error)
q0 on [1-9] goes to q1; q1 on [0-9] stays in q1; q1 on a non-digit emits number
q0 on 0 goes to q2; q2 on a non-digit emits 0
q0 on > goes to q3; q3 on = goes to q4, which emits >= on any character; q3 on any other character emits >
q0 on [A-Z] goes to q5; q5 on [A-Z0-9] stays in q5; q5 on any other character emits id
q0 on ( goes to q6, which emits ( on any character
q0 on ) goes to q7, which emits ) on any character
After each emitted token the machine backspaces one character and returns to q0.
The animation steps the input IF(A2>=35) through this machine, emitting the token stream: IF ( id: A2 >= Num: 35 )
59
Transition diagrams Transition diagram for relop
60
Transition diagrams (cont.) Transition diagram for reserved words and identifiers
61
Transition diagrams (cont.) Transition diagram for unsigned numbers
62
Transition diagrams (cont.) Transition diagram for whitespace
63
Finite Automata: An Example A finite automaton to match C-style in-line comments:
64
Today’s Class Scanner Tokens Basic Strategy Implementation: two-block buffer Lexical Errors
65
Issues to Address Speed –Lexical analysis can take a lot of time –Minimize processing per character I/O is also an issue (read large blocks) –We compile frequently Compilation time is important –Especially during development
66
Interface to Lexical Analyzer Convert entire file to a file of tokens –Lexical analyzer is separate phase Parser calls lexical analyzer –Get next token –This approach avoids extra I/O –Parser builds tree as we go along
67
Basic Scanning technique Use 1 character of look-ahead –Obtain char with getc() Do a case analysis –Based on lookahead char –Based on current lexeme Outcome –If char can extend lexeme, all is well, go on. –If char cannot extend lexeme: Figure out what the complete lexeme is and return its token Put the lookahead back into the symbol stream
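A sketch of this case analysis for integer lexemes, with a string cursor standing in for getc()/ungetc(); the `Input`, `next_char`, `push_back`, and `scan_int` names are all illustrative.

```c
#include <ctype.h>

/* A string cursor stands in for the input stream of getc()/ungetc(). */
typedef struct { const char *src; int pos; } Input;

static int next_char(Input *in) {
    int c = (unsigned char)in->src[in->pos];
    if (c != '\0') in->pos++;
    return c ? c : -1;                 /* -1 plays the role of EOF */
}

static void push_back(Input *in) { if (in->pos > 0) in->pos--; }

/* Accumulate digits while the lookahead can extend the lexeme;
 * on a non-digit, put the lookahead back for the next token. */
static int scan_int(Input *in) {
    int value = 0, c;
    while ((c = next_char(in)) != -1 && isdigit(c))
        value = value * 10 + (c - '0');
    if (c != -1) push_back(in);        /* lookahead char was not ours */
    return value;
}
```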
68
Input Buffering Scanner performance is crucial: –This is the only part of the compiler that examines the entire input program one character at a time. –Disk input can be slow. –The scanner accounts for ~25-30% of total compile time. We need lookahead to determine when a match has been found. Scanners use double-buffering to minimize the overheads associated with this.
69
I/O - Key For Successful Lexical Analysis Character-at-a-time I/O Block / Buffered I/O Utilize Block of memory Stage data from source to buffer block at a time Tradeoffs ?
70
I/O - Key For Successful Lexical Analysis Maintain two blocks - why? Asynchronous I/O fills one block while lexical analysis runs on the 2nd block [diagram: Block 1 | Block 2, with a pointer advancing through them] When done, issue I/O; still process the token in the 2nd block
71
Buffer Pairs Use two N-byte buffers (N = size of a disk block; typically, N = 1024 or 4096). Read N bytes into one half of the buffer each time. If input has less than N bytes, put a special EOF marker in the buffer. While one buffer is being processed, read N bytes into the other buffer (“circular buffers”).
72
Buffer pairs (cont'd) Code:
if (fwd at end of first half) {
    reload second half;
    set fwd to point to beginning of second half;
} else if (fwd at end of second half) {
    reload first half;
    set fwd to point to beginning of first half;
} else
    fwd++;
This takes two tests for each advance of the fwd pointer.
73
Buffer pairs: Sentinels Objective: Optimize the common case by reducing the number of tests to one per advance of fwd. Idea: Extend each buffer half to hold a sentinel at the end. This is a special character that cannot occur in a program (e.g., EOF). It signals the need for some special action (fill other buffer-half, or terminate processing).
74
Buffer pairs with sentinels (cont'd) Code:
fwd++;
if ( *fwd == EOF ) {  /* special processing needed */
    if (fwd at end of first half) {
        reload second half; set fwd to beginning of second half;
    } else if (fwd at end of second half) {
        reload first half; set fwd to beginning of first half;
    } else  /* EOF inside a half: end of input */
        terminate processing;
}
The common case now needs just a single test per character.
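A toy, self-contained version of a sentinel buffer pair: a string stands in for the disk file, '\0' plays the role of the EOF sentinel, and N is deliberately tiny so refills actually happen. All names and sizes are illustrative.

```c
#define N 8                      /* tiny buffer half, for demonstration */

typedef struct {
    char buf[2 * N + 2];         /* two halves, each followed by a sentinel */
    const char *src;             /* stands in for the input file */
    int src_pos;
    int fwd;                     /* forward pointer into buf */
} BufPair;

/* Fill one half from the source; '\0' doubles as sentinel and EOF. */
static void fill_half(BufPair *b, int half) {
    int base = half * (N + 1), i;
    for (i = 0; i < N && b->src[b->src_pos] != '\0'; i++)
        b->buf[base + i] = b->src[b->src_pos++];
    b->buf[base + i] = '\0';     /* sentinel (marks EOF if i < N) */
}

static void buf_init(BufPair *b, const char *src) {
    b->src = src; b->src_pos = 0; b->fwd = 0;
    fill_half(b, 0);
}

/* Advance and return the next character: one sentinel test per char in
 * the common case, extra work only at a half boundary or at real EOF. */
static int buf_next(BufPair *b) {
    char c = b->buf[b->fwd++];
    if (c != '\0') return c;               /* common case: single test */
    if (b->fwd - 1 == N) {                 /* sentinel at end of half 1 */
        fill_half(b, 1);
        b->fwd = N + 1;
        return buf_next(b);
    }
    if (b->fwd - 1 == 2 * N + 1) {         /* sentinel at end of half 2 */
        fill_half(b, 0);
        b->fwd = 0;
        return buf_next(b);
    }
    return -1;                             /* '\0' inside a half: real EOF */
}
```

A real scanner would also track the lexeme-begin pointer and issue the refills asynchronously; this sketch only shows the one-test-per-character structure.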
75
Handling Reserved Words Hard-wire them directly into the scanner automaton: harder to modify; increases the size and complexity of the automaton; performance benefits unclear (fewer tests, but cache effects due to larger code size). Fold them into “identifier” case, then look up a keyword table: simpler, smaller code; table lookup cost can be mitigated using perfect hashing.
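A sketch of the second approach, using the TL 15.0 keywords from the earlier slide: scan a word as an identifier first, then look it up. A plain linear scan stands in here for the perfect-hash lookup mentioned above; the function and constant names are illustrative.

```c
#include <string.h>

/* TL 15.0 keywords, folded into the identifier case. */
static const char *tl_keywords[] = {
    "program", "begin", "end", "if", "then", "else", "do", "while",
    "true", "false", "var", "as", "int", "bool"
};

enum { TOK_KEYWORD = 1, TOK_IDENT = 2 };

/* Classify an already-scanned word as keyword or ordinary identifier. */
static int classify_word(const char *lexeme) {
    size_t n = sizeof tl_keywords / sizeof tl_keywords[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(lexeme, tl_keywords[i]) == 0)
            return TOK_KEYWORD;
    return TOK_IDENT;  /* not reserved */
}
```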
76
Implementing Finite Automata 1 Encoded as program code: each state corresponds to a labeled code fragment; state transitions are represented as control transfers. E.g.:
while ( TRUE ) {
  …
  state_k:
    ch = NextChar();   /* buffer mgt happens here */
    switch (ch) {
      case … : goto …;  /* state transition */
      …
    }
  state_m:  /* token-reading state, such as qx */
    copy lexeme to where parser can get at it;
    return token_type;
  …
}
77
Direct-Coded Automaton: Example
int scanner() {
    char ch;
  state_1:                       /* initial state */
    ch = NextChar();
    switch (ch) {
      case 'a': goto state_2;
      case 'b': goto state_3;
      default:  Error();
    }
  state_2:
    …
  state_3:
    ch = NextChar();
    switch (ch) {
      case 'a': goto state_2;
      default:  return SUCCESS;
    }
}
78
Implementing Finite Automata 2 Table-driven automata (e.g., lex, flex): Use a table to encode transitions: next_state = T(curr_state, next_char); Use one bit in the state number to indicate whether it is a final (or error) state. If so, consult a separate table for what action to take. [diagram: table T indexed by current state (rows) and next input character (columns)]
79
Table-Driven Automaton: Example
#define IsFinal(s) ((s) < 0)
int scanner() {
    int ch;
    int currState = 1;
    while (TRUE) {
        ch = NextChar();
        if (ch == EOF) return 0;        /* fail */
        currState = T[currState][ch];
        if (IsFinal(currState)) {
            return 1;                   /* success */
        }
    } /* while */
}
Transition table T:
          input a   input b
state 1      2         3
state 2      2         3
state 3      2         -
80
Transition-diagram-based analyzer
TOKEN getToken() {
    TOKEN retToken = new(Token);
    while (1) { /* repeat character processing until a return or failure occurs */
        switch (state) {
        case 1:
            c = nextchar();
            if (c == 'a') state = 2;
            else if (c == 'b') state = 3;
            else fail();      /* report lexical error */
            break;
        case 2:
            …
            break;
        case 3:
            retract();
            retToken.attribute = TOK;
            return retToken;
        }
    }
}
81
What do we do on finding a match? A match is found when: –The current automaton state is a token reading state Actions on finding a match: –if appropriate, copy lexeme (or other token attribute) to where the parser can access it; –save any necessary scanner state so that scanning can subsequently resume at the right place; –return a value indicating the token found.
82
Other implementations Regular expression matching for simple lexical analysis: Check whether the current string starts with operators / punctuation / comments Check enclosing tokens earlier than prefix tokens, e.g., check '>=' then check '>' If it starts with a digit, read until a non-digit character, then check against the regexp for numbers If it starts with a letter, read until a character that is neither a letter nor a digit, then check against the regexp for identifiers Otherwise: lexical error
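Checking enclosing tokens before their prefixes amounts to ordering the candidate list so that two-character operators precede their one-character prefixes. The `match_operator` helper is illustrative, and the operator set shown mixes TL and generic operators for the sake of the example.

```c
#include <string.h>

/* Try multi-character operators before their prefixes: ">=" before ">",
 * ":=" before nothing shorter.  Returns the matched length (0 if none). */
static int match_operator(const char *s) {
    static const char *ops[] = { ">=", "<=", "!=", ":=",   /* longer first */
                                 ">", "<", "=", "+", "-", "*" };
    size_t n = sizeof ops / sizeof ops[0];
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(ops[i]);
        if (strncmp(s, ops[i], len) == 0)
            return (int)len;
    }
    return 0;  /* no operator at this position */
}
```

Reversing the order (trying ">" before ">=") would wrongly split ">=" into two tokens, which is exactly the ambiguity the longest-match rule resolves.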
83
Today’s Class Scanner Tokens Basic Strategy Implementation: two-block buffer Lexical Errors
84
Handling Lexical Errors Error Handling is very localized, with Respect to Input Source For example: whil ( x := 0 ) do generates no lexical errors in PASCAL In what Situations do Errors Occur? –Prefix of remaining input doesn’t match any defined token
85
Handling Lexical Errors Possible error recovery actions: –Deleting or Inserting Input Characters –Replacing or Transposing Characters Or, skip over to next separator to “ignore” problem
86
Lexical Error Examples TL 15.0 var TX as 98 VAR ABC as int ABC := 123AB ABC := 123ab B $= 1E20
87
Remove Delimiters White space, tabs, new lines, and carriage returns are common delimiters In the scanner, these delimiters are skipped: They are not viewed as tokens May result in lexical errors (e.g. if your language does not allow carriage returns) You simply go back to q0 after viewing a delimiter
88
TL 15.0 The language asks for at least one delimiter after each token –To make it easier So a + b is valid, but a+b is not.
89
Change to the model Token*... Error
90
Today’s Class Scanner Tokens Basic Strategy Implementation: two-block buffer Lexical Errors
91
Next Class Parser LL(1) grammar LL(1) parser LL(k) grammar and parsers