Lexical Analysis. The Input Read string input Might be sequence of characters (Unix) Might be sequence of lines (VMS) Character set: ASCII ISO Latin-1.

Lexical Analysis

The Input
- Read string input
  - Might be a sequence of characters (Unix)
  - Might be a sequence of lines (VMS)
- Character set:
  - ASCII
  - ISO Latin-1
  - ISO 10646 (16-bit = Unicode): Ada, Java
  - Others (EBCDIC, JIS, etc.)

The Output
- A series of tokens: kind, location, name (if any)
  - Punctuation: ( ) ; , [ ]
  - Operators: + - ** :=
  - Keywords: begin end if while try catch
  - Identifiers: Square_Root
  - String literals: "press Enter to continue"
  - Character literals: 'x'
  - Numeric literals
    - Integer: 123
    - Floating point: 4_5.23e+2
    - Based representation: 16#ac#

Free form vs Fixed form
- Free-form languages (most modern ones)
  - White space does not matter. Ignore these: tabs, spaces, new lines, carriage returns
  - Only the ordering of tokens is important
- Fixed-format languages (historical)
  - Layout is critical
  - Fortran: label in cols 1-5, continuation mark in col 6
  - COBOL: areas A and B
  - Lexical analyzer must know about layout to find tokens

Punctuation: Separators
- Typically individual special characters such as ( { } : .. (two dots)
- Sometimes double characters: lexical scanner looks for the longest token
  - (* and /* and -- are comment openers in various languages
- Returned just as identity (kind) of token
  - And perhaps location, for error messages and debugging purposes
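The "longest token" rule above can be sketched in C as follows. This is only an illustration: the table PUNCT and the function longest_punct are hypothetical names, and the set of spellings is an arbitrary sample, not the punctuation of any particular language.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sample of punctuation and operator spellings.
   The order does not matter: the longest matching entry wins. */
static const char *const PUNCT[] = {
    ":", ":=", ".", "..", "(", "(*", "*", "**", "<", "<="
};

/* Returns the length of the longest table entry that is a prefix of s,
   or 0 if no entry matches. */
size_t longest_punct(const char *s)
{
    size_t best = 0;
    for (size_t i = 0; i < sizeof PUNCT / sizeof PUNCT[0]; i++) {
        size_t n = strlen(PUNCT[i]);
        if (n > best && strncmp(s, PUNCT[i], n) == 0)
            best = n;
    }
    return best;
}
```

With this table, ":= 1" yields 2 (the assignment symbol) while ": x" yields 1 (a plain colon). A production scanner would normally branch on the first character instead of searching a table.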

Operators
- Like punctuation
  - No real difference for lexical analyzer
  - Typically single or double special chars
    - Operators: + - == <=
    - Operations: := =>
- Returned as kind of token
  - And perhaps location

Keywords
- Reserved identifiers
  - E.g. BEGIN END in Pascal, if in C, catch in C++
- May be distinguished from identifiers typographically
  - E.g. the keyword mode vs the identifier mode in Algol-68 (the keyword is marked, for example set in bold)
- Returned as kind of token
  - With possible location information
- Oddity: unreserved keywords in PL/1
  - IF IF THEN THEN = THEN + 1;
  - Handled as identifiers (parser disambiguates)

Identifiers
- Rules differ
  - Length, allowed characters, separators
- Need to build a names table
  - Single entry for all occurrences of Var1
  - Language may be case insensitive: same entry for VAR1, vAr1, Var1
  - Typical structure: hash table
- Lexical analyzer returns token kind
  - And key (index) to table entry
  - Table entry includes location information

Organization of names table
- Most common structure is a hash table
  - With a fixed number of headers
  - Chain according to hash code
  - Serial search on one chain
- Hash code computed from characters (e.g. sum mod table size)
- No hash code is perfect! Expect collisions.
- Avoid any arbitrary limits on table or chain size.
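A minimal C sketch of such a names table, under stated assumptions: the names name_entry, headers and intern are hypothetical, 211 headers is an arbitrary choice, and the hash is the simple "sum mod table size" mentioned above (real compilers usually pick a better mixing function).

```c
#include <stdlib.h>
#include <string.h>

#define N_HEADERS 211              /* fixed number of chain headers (arbitrary) */

struct name_entry {
    char *text;                    /* spelling of the name */
    struct name_entry *next;       /* next entry on the same chain */
};

static struct name_entry *headers[N_HEADERS];

/* Hash code computed from the characters: sum mod table size. */
static unsigned hash(const char *s, size_t len)
{
    unsigned sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += (unsigned char)s[i];
    return sum % N_HEADERS;
}

/* Returns the single entry for the name s[0..len-1], creating it on first
   use. Returns NULL only if memory is exhausted. */
struct name_entry *intern(const char *s, size_t len)
{
    unsigned h = hash(s, len);

    /* Serial search on one chain. */
    for (struct name_entry *e = headers[h]; e != NULL; e = e->next)
        if (strlen(e->text) == len && memcmp(e->text, s, len) == 0)
            return e;

    struct name_entry *e = malloc(sizeof *e);
    if (e == NULL)
        return NULL;
    e->text = malloc(len + 1);
    if (e->text == NULL) {
        free(e);
        return NULL;
    }
    memcpy(e->text, s, len);
    e->text[len] = '\0';
    e->next = headers[h];          /* chains grow without any fixed limit */
    headers[h] = e;
    return e;
}
```

The names "ab" and "ba" have the same character sum, so they collide and share a chain, yet each gets its own entry. Two calls with "Var1" return the same pointer, which can serve as the key handed to the parser.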

String Literals
- Text must be stored
- Actual characters are important
  - Not like identifiers: must preserve casing
  - Character set issues: uniform internal representation
- Table needed
  - Lexical analyzer returns key into table
  - May or may not be worth hashing to avoid duplicates

Character Literals
- Similar issues to string literals
- Lexical analyzer returns
  - Token kind
  - Identity of character
- Cannot assume the character set of the host machine; the target's may be different

Numeric Literals
- Need a table to store the numeric value
  - E.g. 123 = 0123 = 01_23 (Ada)
  - But cannot use a predefined host type for values, because the target may have different bounds
- Floating point representations much more complex
  - Denormals, correct rounding
  - Very delicate to compute correct value
- Host / target issues
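To illustrate why 123, 0123 and 01_23 share one value, here is a hedged C sketch that evaluates Ada-style integer literals, including the based form 16#ac#. The function names are hypothetical. It stores the value in a uint64_t and reports overflow, which is a simplifying assumption: as noted above, a real cross-compiler cannot rely on a host type and would use its own arbitrary-precision arithmetic.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>

/* Value of one digit character in bases up to 16, or -1 if not a digit. */
static int digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = (char)tolower((unsigned char)c);
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

/* Accumulates digits of the given base starting at *p. A single underscore
   is skipped only when it sits between two digits. Returns false if no
   digit is found or the value does not fit in 64 bits. */
static bool scan_digits(const char **p, unsigned base, uint64_t *value)
{
    const char *s = *p;
    uint64_t v = 0;
    bool any = false;

    for (;;) {
        if (*s == '_' && any && digit_value(s[1]) >= 0 &&
            (unsigned)digit_value(s[1]) < base)
            s++;
        int d = digit_value(*s);
        if (d < 0 || (unsigned)d >= base)
            break;
        if (v > (UINT64_MAX - (uint64_t)d) / base)
            return false;                      /* would overflow */
        v = v * base + (uint64_t)d;
        any = true;
        s++;
    }
    *p = s;
    *value = v;
    return any;
}

/* Evaluates a complete literal such as "123", "01_23" or "16#ac#". */
bool integer_literal_value(const char *s, uint64_t *value)
{
    uint64_t v;

    if (!scan_digits(&s, 10, &v))
        return false;
    if (*s == '#') {                           /* based literal: v is the base */
        if (v < 2 || v > 16)
            return false;
        unsigned base = (unsigned)v;
        s++;
        if (!scan_digits(&s, base, &v) || *s != '#')
            return false;
        s++;
    }
    if (*s != '\0')
        return false;
    *value = v;
    return true;
}
```

Exponents and real literals are deliberately left out; computing a correctly rounded floating point value is the delicate part the slide warns about.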

Handling Comments
- Comments have no effect on program
- Can be eliminated by scanner
  - But may need to be retrieved by tools
- Error detection issues
  - E.g. unclosed comments
- Scanner skips over comments and returns next meaningful token
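A small C sketch of the skipping step, assuming C-style /* ... */ comments purely for illustration (the function name is hypothetical). It shows the error case as well: an unclosed comment is reported instead of silently swallowing the rest of the file.

```c
#include <stddef.h>

/* Skips blanks and C-style comments starting at s.
   Returns a pointer to the next meaningful character (possibly the
   terminating '\0'), or NULL if a comment is opened but never closed. */
const char *skip_blanks_and_comments(const char *s)
{
    for (;;) {
        while (*s == ' ' || *s == '\t' || *s == '\n' || *s == '\r')
            s++;
        if (s[0] != '/' || s[1] != '*')
            return s;                       /* not a comment opener */
        s += 2;
        while (s[0] != '\0' && !(s[0] == '*' && s[1] == '/'))
            s++;
        if (s[0] == '\0')
            return NULL;                    /* unclosed comment */
        s += 2;                             /* step over the closer */
    }
}
```

A scanner that must keep comments for tools such as pretty-printers would record the skipped span instead of discarding it.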

Case Equivalence
- Some languages are case-insensitive
  - Pascal, Ada
- Some are not
  - C, Java
- Lexical analyzer ignores case if needed
  - This_Routine = THIS_RouTine
- Error analysis may need exact casing
  - Friendly diagnostics follow user's conventions
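One way to ignore case without losing the user's spelling is to compare names case-insensitively and leave the stored text untouched. The C sketch below assumes plain ASCII identifiers; the function name is hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>

/* True if a and b spell the same identifier when case is ignored.
   Neither string is modified, so diagnostics can still show the
   casing the user wrote. */
bool same_identifier(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return false;
        a++;
        b++;
    }
    return *a == *b;        /* both strings must end together */
}
```

In a names table the same folding would also have to be applied when computing the hash code, so that VAR1 and Var1 land on the same chain.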

Performance Issues
- Speed
  - Lexical analysis can become a bottleneck
  - Minimize processing per character
    - Skip blanks fast
  - I/O is also an issue (read large blocks)
- We compile frequently
  - Compilation time is important, especially during development
- Communicate with parser through global variables

General Approach
- Define set of token kinds:
  - An enumeration type (tok_int, tok_if, tok_plus, tok_left_paren, tok_assign etc.)
  - Or a series of integer definitions in more primitive languages…
- Some tokens carry associated data
  - E.g. key for identifier table
- May be useful to build tree node
  - For identifiers, literals etc.
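In C, the token kinds and their associated data might look like the sketch below. The enumerators mirror the names on the slide; the struct layout, NO_KEY and the helper functions are assumptions made for illustration.

```c
/* Token kinds, following the names used on the slide. */
enum token_kind {
    TOK_INT, TOK_IDENT, TOK_IF, TOK_PLUS, TOK_LEFT_PAREN, TOK_ASSIGN
};

#define NO_KEY (-1)         /* marks tokens that carry no table key */

struct token {
    enum token_kind kind;
    int line;               /* location, for diagnostics */
    int column;
    int key;                /* index into names or literal table, or NO_KEY */
};

/* Only identifiers and literals carry associated data. */
int token_has_key(enum token_kind kind)
{
    return kind == TOK_INT || kind == TOK_IDENT;
}

/* Builds a token, dropping the key for kinds that do not use one. */
struct token make_token(enum token_kind kind, int line, int column, int key)
{
    struct token t = { kind, line, column, token_has_key(kind) ? key : NO_KEY };
    return t;
}
```

A parser that builds tree nodes directly from tokens could embed this struct at the start of its leaf node type.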

Interface to Lexical Analyzer
- Either: convert entire file to a file of tokens
  - Lexical analyzer is a separate phase
- Or: parser calls lexical analyzer to supply next token
  - This approach avoids extra I/O
  - Parser builds tree incrementally, using successive tokens as tree nodes

Relevant Formalisms
- Type 3 (Regular) Grammars
- Regular Expressions
- Finite State Machines
- Equivalent in expressive power
- Useful for program construction, even if hand-written

Regular Grammars
- Non-terminals (arbitrary names)
- Terminals (characters)
- Productions limited to the following:
  - Non-terminal → terminal
  - Non-terminal → terminal Non-terminal
- Treat character class (e.g. digit) as terminal
- Regular grammars cannot count:
  - Size limits on identifiers and literals can only be expressed by spelling out one non-terminal per length, which is impractical
  - Cannot express proper nesting (parentheses)
- Can be generalized by allowing terminal* instead of a single terminal

Regular Grammars
- Grammar for real literals with no exponent:
    digit   ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    REAL    ::= digit REAL1
    REAL1   ::= digit REAL1      (arbitrary size)
    REAL1   ::= . INTEGER
    INTEGER ::= digit INTEGER    (arbitrary size)
    INTEGER ::= digit
- Start symbol is REAL

Regular Expressions
- Regular expressions (RE) are defined by an alphabet (terminal symbols) and three operations:
  - Alternation: RE1 | RE2
  - Concatenation: RE1 RE2
  - Repetition: RE* (zero or more RE’s)
- The languages defined by RE’s are exactly the languages defined by regular grammars
- Regular expressions are more convenient for some applications

Specifying RE’s in Unix Tools
- Single characters: a b c d \x
- Alternation: [bcd] [b-z] ab|cd [^}]
- Any character: . (period)
- Match sequence of characters: x* y+
- Concatenation: abc[d-q]
- Optional RE: [0-9]+(\.[0-9]*)?
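These patterns can be tried from C with the POSIX regex functions regcomp, regexec and regfree from <regex.h>. The sketch below assumes a POSIX system and extended RE syntax (REG_EXTENDED); the wrapper matches_whole is a hypothetical helper, not part of any library.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the whole of text matches the extended RE pattern.
   Also returns false if the pattern fails to compile. */
bool matches_whole(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return false;
    /* POSIX reports the leftmost, longest match; require it to span
       the entire string. */
    bool ok = regexec(&re, text, 1, &m, 0) == 0 &&
              m.rm_so == 0 && text[m.rm_eo] == '\0';
    regfree(&re);
    return ok;
}
```

Note that the backslash in the last pattern above has to be doubled inside a C string literal: "[0-9]+(\\.[0-9]*)?".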

Finite State Machines
- A language defined by a grammar is a (possibly infinite) set of strings
- An automaton is a device that determines, by reading a string one character at a time, whether the string belongs to a specified language
- A finite state machine (FSM) is an automaton that recognizes regular languages (regular expressions)
- Simplest automaton: memory is a single number (state)

Specifying an FSM
- A set of labeled states
- Directed arcs between states, labeled with a character
- One or more states may be terminal (accepting)
- A distinguished state is the start state
- Automaton makes transition from state S1 to S2
  - If and only if the arc from S1 to S2 is labeled with the next character in the input
- A string belongs to the language if, after reading the string, the automaton has reached an accepting state

Building FSM from Grammar
- One state for each non-terminal
- A rule of the form Nt1 → terminal
  - Generates a transition from S1 to an accepting state
- A rule of the form Nt1 → terminal Nt2
  - Generates a transition from S1 to S2 on an arc labeled by the terminal
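Applied to the grammar for real literals given earlier (REAL, REAL1, INTEGER), this construction gives one state per non-terminal plus an accepting state. The C sketch below is one possible rendering, with hypothetical names. The two INTEGER rules both start with a digit, so the machine obtained directly from the grammar is non-deterministic; here that choice is resolved by letting the accepting state loop on digits.

```c
#include <ctype.h>
#include <stdbool.h>

/* One state per non-terminal, plus the accepting state reached by
   the rule INTEGER ::= digit. */
enum state { S_REAL, S_REAL1, S_INTEGER, S_INTEGER_DONE };

/* True if s is a real literal of the form digits '.' digits. */
bool is_real_literal(const char *s)
{
    enum state st = S_REAL;

    for (; *s != '\0'; s++) {
        bool digit = isdigit((unsigned char)*s) != 0;
        switch (st) {
        case S_REAL:                    /* REAL ::= digit REAL1 */
            if (!digit)
                return false;
            st = S_REAL1;
            break;
        case S_REAL1:                   /* REAL1 ::= digit REAL1 | . INTEGER */
            if (*s == '.')
                st = S_INTEGER;
            else if (!digit)
                return false;
            break;
        case S_INTEGER:                 /* INTEGER ::= digit INTEGER | digit */
        case S_INTEGER_DONE:
            if (!digit)
                return false;
            st = S_INTEGER_DONE;
            break;
        }
    }
    return st == S_INTEGER_DONE;        /* the only accepting state */
}
```

The function accepts "3.14" and "12.5" but rejects "3." and ".5", exactly as the grammar requires at least one digit on each side of the point.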

Building FSM’s from RE’s
- Every RE corresponds to a grammar
- For all regular expressions
  - A natural translation to FSM exists
- Alternation often leads to non-deterministic machines

Non-Deterministic FSM
- A non-deterministic FSM
  - Has at least one state
  - With two arcs to two distinct states
  - Labeled with the same character
- Example: from start state, a digit can begin an integer literal or a real literal
- Naïve implementation requires backtracking
  - Nasty

Deterministic FSM
- For all states S, for all characters C:
  - There is at most one arc from state S that is labeled with C
- Much easier to implement
  - No backtracking

From NFSM to DFSM
- There is an algorithm for converting a non-deterministic machine to a deterministic one
  - Result may have exponentially more states
  - Intuitively: need new states to express uncertainty about token: int or real
- Algorithm is efficient in practice (e.g. grep)
- Other algorithms for minimizing number of states of FSM, for showing equivalence, etc.
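The idea behind the conversion can be seen without building the full deterministic machine: track the set of NFSM states the input could be in, and treat each distinct set as one DFSM state. The C sketch below does this on the fly with a bitmask for the int-or-real example. It is not the complete conversion algorithm, and all names are hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>

/* NFSM states, one bit each. */
enum {
    START         = 1u << 0,
    INT_DIGITS    = 1u << 1,   /* accepting: integer literal      */
    REAL_INT_PART = 1u << 2,   /* digits before the point         */
    REAL_DOT      = 1u << 3,   /* just read the point             */
    REAL_FRACTION = 1u << 4    /* accepting: real literal         */
};

/* Applies one input character to every state in the set at once. */
static unsigned step(unsigned set, char c)
{
    bool digit = isdigit((unsigned char)c) != 0;
    unsigned next = 0;

    if ((set & START) && digit)
        next |= INT_DIGITS | REAL_INT_PART;   /* the non-deterministic choice */
    if ((set & INT_DIGITS) && digit)
        next |= INT_DIGITS;
    if ((set & REAL_INT_PART) && digit)
        next |= REAL_INT_PART;
    if ((set & REAL_INT_PART) && c == '.')
        next |= REAL_DOT;
    if ((set & (REAL_DOT | REAL_FRACTION)) && digit)
        next |= REAL_FRACTION;
    return next;
}

/* Returns 'i' for an integer literal, 'r' for a real literal, 0 otherwise. */
char classify_number(const char *s)
{
    unsigned set = START;

    for (; *s != '\0' && set != 0; s++)
        set = step(set, *s);
    if (set & INT_DIGITS)
        return 'i';
    if (set & REAL_FRACTION)
        return 'r';
    return 0;
}
```

After the first digit the set is {INT_DIGITS, REAL_INT_PART}: that combined set is the "new state expressing uncertainty" mentioned above, and no backtracking is needed.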

Implementing the Scanner
- Three methods
  - Hand-coded approach: draw DFSM, then implement with loop and case statement
  - Hybrid approach: define tokens using regular expressions, convert to NFSM, apply algorithm to obtain minimal DFSM
    - Hand-code resulting DFSM
  - Automated approach: use regular expressions as input to a lexical scanner generator (e.g. FLEX)

Hand-coding
- Normal coding techniques
- Scan over white space and comments till non-blank character found
- Branch depending on first character:
  - If digit, scan numeric literal
  - If letter, scan identifier or keyword
  - If operator, check next character (++, etc.)
  - Need table to determine character type efficiently
- Return token found
- Write aggressive efficient code: goto’s, global variables
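The steps above can be sketched in C as follows. Everything here is an assumption for illustration: the character classes, the tiny operator set, the treatment of '_' as a letter, and the function names. Comments and keywords are left out to keep the sketch short.

```c
#include <stddef.h>

enum char_class { C_OTHER, C_BLANK, C_DIGIT, C_LETTER, C_OPER };
enum kind { K_EOF, K_NUMBER, K_NAME, K_OPERATOR, K_ERROR };

/* Character type table; zero-initialized, so the default is C_OTHER. */
static enum char_class class_of[256];

/* Fills the table once. Assumes an ASCII-compatible character set. */
void init_classes(void)
{
    const char *blanks = " \t\n\r", *opers = "+-*/=<>";

    for (int c = '0'; c <= '9'; c++) class_of[c] = C_DIGIT;
    for (int c = 'a'; c <= 'z'; c++) class_of[c] = C_LETTER;
    for (int c = 'A'; c <= 'Z'; c++) class_of[c] = C_LETTER;
    class_of['_'] = C_LETTER;
    for (; *blanks != '\0'; blanks++) class_of[(unsigned char)*blanks] = C_BLANK;
    for (; *opers != '\0'; opers++)   class_of[(unsigned char)*opers] = C_OPER;
}

/* Scans one token starting at *pos. On return, *start and *len delimit
   its text and *pos points just past it. */
enum kind next_token(const char **pos, const char **start, size_t *len)
{
    const unsigned char *p = (const unsigned char *)*pos;
    enum kind k;

    while (class_of[*p] == C_BLANK)         /* skip blanks fast */
        p++;
    const unsigned char *begin = p;

    switch (class_of[*p]) {                 /* branch on first character */
    case C_DIGIT:
        do p++; while (class_of[*p] == C_DIGIT);
        k = K_NUMBER;
        break;
    case C_LETTER:
        do p++; while (class_of[*p] == C_LETTER || class_of[*p] == C_DIGIT);
        k = K_NAME;
        break;
    case C_OPER:
        p++;
        /* Check next character for the two-character operators ++ <= >= == */
        if ((p[-1] == '+' && *p == '+') ||
            (*p == '=' && (p[-1] == '<' || p[-1] == '>' || p[-1] == '=')))
            p++;
        k = K_OPERATOR;
        break;
    default:
        if (*p == '\0') {
            k = K_EOF;
        } else {
            p++;                            /* unknown character */
            k = K_ERROR;
        }
        break;
    }
    *start = (const char *)begin;
    *len = (size_t)(p - begin);
    *pos = (const char *)p;
    return k;
}
```

After calling init_classes once, repeated calls to next_token on "count1 ++ 42" return a name, an operator of length 2, a number, and then K_EOF.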

Using grammar and FSM
- Start with regular grammar or RE
- Typically found in the language reference
- Example (Ada), Chapter 2, Lexical Elements:
    digit           ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    decimal-literal ::= integer [.integer] [exponent]
    integer         ::= digit {[underline] digit}
    exponent        ::= E [+] integer | E - integer

Using grammar and FSM
- Create one state for each non-terminal
- Label edges according to productions in grammar
- Each state becomes a label in the program
- Code for each state is a switch on next character, corresponding to edges out of current state
- If no possible transition on next character, then:
  - If state is accepting, return the corresponding token
  - If state is not accepting, report error

Hand-coded version:
- Each state is encoded as follows:
    <<state1>>
       case Next_Character is
          when 'a'    => goto state3;
          when 'b'    => goto state1;
          when others => End_of_token_processing;
       end case;
    <<state2>>
       …
- No explicit mention of state of automaton

Translating from FSM to code
- Variable holds current state:
    loop
       case State is
          when state1 =>
             case Next_Character is
                when 'a'    => State := state3;
                when 'b'    => State := state1;
                when others => End_token_processing;
             end case;
          when state2 => …
          …
       end case;
    end loop;

Automatic scanner construction
- FLEX builds a transition table, indexed by state and by character.
- Code gets transition from table:
    Tab : array (State, Character) of State := …
    begin
       while More_Input loop
          Curstate := Tab (Curstate, Next_Char);
          if Curstate = Error_State then …
       end loop;
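The same table-driven loop can be written in C. The sketch below is hand-built for a single token class, identifiers of the form [a-z][a-z0-9]*, and is only meant to show the shape of the code; it is not the table layout flex actually generates, and all names are hypothetical.

```c
#include <stddef.h>

enum { ERROR_STATE, S_START, S_IDENT, N_STATES };

/* Transition table indexed by state and by character.
   Zero-initialized, so every entry starts out as ERROR_STATE. */
static unsigned char tab[N_STATES][256];

/* Fills in the arcs for [a-z][a-z0-9]*. Assumes ASCII. */
void build_table(void)
{
    for (int c = 'a'; c <= 'z'; c++) {
        tab[S_START][c] = S_IDENT;
        tab[S_IDENT][c] = S_IDENT;
    }
    for (int c = '0'; c <= '9'; c++)
        tab[S_IDENT][c] = S_IDENT;
}

/* Returns the length of the longest prefix of s that is an identifier,
   or 0 if s does not start with one. */
size_t match_ident(const char *s)
{
    unsigned char state = S_START;
    size_t i = 0, accepted = 0;

    while (s[i] != '\0') {
        state = tab[state][(unsigned char)s[i]];
        if (state == ERROR_STATE)
            break;                  /* no transition: stop scanning */
        i++;
        if (state == S_IDENT)       /* S_IDENT is the accepting state */
            accepted = i;
    }
    return accepted;
}
```

The inner loop does one table lookup per character, which is why generated scanners can be reasonably fast despite being fully generic.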

Automatic FSM Generation
- Our example: FLEX
  - See home page for manual in HTML
- FLEX is given
  - A set of regular expressions
  - Actions associated with each RE
- It builds a scanner
  - Which matches RE’s and executes actions

Flex General Format
- Input to Flex is a set of rules:
    Regexp    actions (C statements)
    …
- Flex scans the longest matching Regexp
  - And executes the corresponding actions

An Example of a Flex scanner
    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*
    %%
    {DIGIT}+    {
                  printf ("an integer %s (%d)\n", yytext, atoi (yytext));
                }
    {DIGIT}+"."{DIGIT}*    {
                  printf ("a float %s (%g)\n", yytext, atof (yytext));
                }
    if|then|begin|end|procedure|function|program    {
                  printf ("a keyword: %s\n", yytext);
                }
    {ID}                printf ("an identifier %s\n", yytext);
    "+"|"-"|"*"|"/"     printf ("an operator %s\n", yytext);
    "="|":"|";"|":="    printf ("a separator %s\n", yytext);

Flex Example (continued)
    "{"[^}]*"}"    /* eat up Pascal-like comments */
    [ \t\n]+       /* eat white space */
    .              printf ("unrecognized character");
    %%

Assembling the flex program
    %{
    #include <stdlib.h>   /* for atof */
    %}
    … definitions, %%, and rules from the example scanner …
    %%
    int main (int argc, char **argv)
    {
        yyin = fopen (argv[1], "r");
        yylex ();
        return 0;
    }

Running flex
- Under Unix/Linux, flex is an executable program
- Run: flex -oex3.c ex3.lex
  - Without the -o flag, the output is written to lex.yy.c
- Follow with: gcc -oex3 ex3.c -lfl
- Finally, run: ex3 test0001.pas, where test0001.pas is a test program

Running flex on Other Systems
- For Ada fans
  - Look at aflex (www.adapower.com)
- For C++ fans
  - flex can run in C++ mode
  - Generates appropriate classes

Choice Between Methods?
- Hand-written scanners
  - Typically much faster execution
  - Easy to write (standard structure)
  - Preferable for good error recovery
- Flex approach
  - Simple to use
  - Easy to modify token language

Historical oddities
- Because early keypunch machines were unreliable, FORTRAN treats blanks as optional: lexical analysis and parsing are intertwined.
- DO10I=1.6
  - 3 tokens: identifier operator literal
  - DO10I = 1.6
- DO10I=1,6
  - 7 tokens: keyword stmt id operator literal comma literal
  - DO 10 I = 1, 6
- Celebrated NASA failure caused by this bug (?)
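The lookahead a FORTRAN scanner needs here can be sketched in C: a blank-free statement starting with DO is a loop only if a comma follows the '=' outside any parentheses. The sketch is deliberately simplified and the function name is hypothetical. It assumes blanks have already been removed and ignores character and Hollerith constants, which a real scanner would have to skip.

```c
#include <stdbool.h>

/* Decides whether a blank-free statement is a DO loop (true) or
   something else, such as an assignment to a variable whose name
   starts with DO (false). */
bool is_do_loop(const char *stmt)
{
    if (stmt[0] != 'D' || stmt[1] != 'O')
        return false;

    bool seen_equals = false;
    int depth = 0;                      /* parenthesis nesting level */

    for (const char *p = stmt + 2; *p != '\0'; p++) {
        if (*p == '(')
            depth++;
        else if (*p == ')')
            depth--;
        else if (*p == '=' && depth == 0)
            seen_equals = true;
        else if (*p == ',' && depth == 0 && seen_equals)
            return true;                /* top-level comma after '=' */
    }
    return false;
}
```

The whole statement has to be examined before the first token can be classified, which is what "lexical analysis and parsing are intertwined" means in practice.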
