Lexical Analysis and Scanning. Honors Compilers, Feb 5th 2001. Robert Dewar.


The Input
- Read string input
  - Might be sequence of characters (Unix)
  - Might be sequence of lines (VMS)
- Character set
  - ASCII
  - ISO Latin-1
  - ISO 10646 (16-bit = Unicode)
  - Others (EBCDIC, JIS, etc.)

The Output
- A series of tokens
  - Punctuation ( ) ; , [ ]
  - Operators + - ** :=
  - Keywords begin end if
  - Identifiers Square_Root
  - String literals "hello this is a string"
  - Character literals 'x'
  - Numeric literals 123 4_5.23e+2 16#ac#

Free Form vs Fixed Form
- Free form languages
  - White space does not matter
  - Tabs, spaces, new lines, carriage returns
  - Only the ordering of tokens is important
- Fixed format languages
  - Layout is critical
  - Fortran: label in cols 1-6
  - COBOL: Area A / Area B
  - Lexical analyzer must worry about layout

Punctuation
- Typically individual special characters
  - Such as + -
- Lexical analyzer does not know one use of : from another
- Sometimes double characters
  - E.g. (* treated as a kind of bracket
- Returned just as identity of token
  - And perhaps location
  - For error message and debugging purposes

Operators
- Like punctuation
  - No real difference for lexical analyzer
- Typically single or double special chars
  - Operators + -
  - Operations :=
- Returned just as identity of token
  - And perhaps location

Keywords
- Reserved identifiers
  - E.g. BEGIN END in Pascal, if in C
- May be distinguished from identifiers
  - E.g. mode (keyword) vs mode (identifier) in Algol-68, distinguished by stropping
- Returned just as token identity
  - With possible location information
- Unreserved keywords (e.g. PL/1)
  - Handled as identifiers (parser distinguishes)
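
A hedged sketch in C of the usual scheme: scan the name as an identifier, then check it against a table of reserved words. The table contents, token codes, and function name are invented for illustration.

   #include <string.h>

   /* Table of reserved words and their (illustrative) token codes. */
   static const struct { const char *text; int code; } reserved[] = {
       { "begin", 1 }, { "end", 2 }, { "if", 3 },
   };

   /* Return the keyword's token code, or -1 for an ordinary identifier. */
   int keyword_code(const char *name)
   {
       for (size_t i = 0; i < sizeof reserved / sizeof reserved[0]; i++)
           if (strcmp(name, reserved[i].text) == 0)
               return reserved[i].code;       /* reserved word */
       return -1;                             /* ordinary identifier */
   }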

Identifiers
- Rules differ between languages
  - Length, allowed characters, separators
- Need to build a table
  - So that every occurrence of junk1 is recognized as the same junk1
  - Typical structure: hash table
- Lexical analyzer returns token type
  - And key to table entry
  - Table entry includes location information

More on Identifier Tables
- Most common structure is a hash table
  - With fixed number of headers
  - Chain according to hash code
  - Serial search on one chain
- Hash code computed from characters
  - No hash code is perfect!
- Avoid any arbitrary limits
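
A minimal sketch in C of such a chained hash table. The header count, hash function, and field names are arbitrary choices for illustration (and strdup is POSIX), not what any particular compiler uses.

   #include <stdlib.h>
   #include <string.h>

   #define NUM_HEADERS 1024                  /* fixed number of chain headers */

   struct id_entry {
       char            *name;                /* identifier text */
       int              line, column;        /* location of first occurrence */
       struct id_entry *next;                /* chain of entries with same hash */
   };

   static struct id_entry *headers[NUM_HEADERS];

   static unsigned hash(const char *s)       /* hash code computed from characters */
   {
       unsigned h = 0;
       while (*s)
           h = h * 31 + (unsigned char)*s++;
       return h % NUM_HEADERS;
   }

   /* Return the unique entry for this identifier, creating it if necessary,
      so that every occurrence of "junk1" yields the same table key. */
   struct id_entry *id_lookup(const char *name, int line, int column)
   {
       unsigned h = hash(name);
       for (struct id_entry *e = headers[h]; e != NULL; e = e->next)
           if (strcmp(e->name, name) == 0)
               return e;                      /* serial search on one chain */
       struct id_entry *e = malloc(sizeof *e);
       e->name   = strdup(name);
       e->line   = line;
       e->column = column;
       e->next   = headers[h];
       headers[h] = e;
       return e;
   }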

String Literals
- Text must be stored
  - Actual characters are important
  - Not like identifiers
  - Character set issues
- Table needed
  - Lexical analyzer returns key to table
  - May or may not be worth hashing

Character Literals
- Similar issues to string literals
- Lexical analyzer returns
  - Token type
  - Identity of character
- Note: cannot assume character set of host machine; may be different

Numeric Literals
- Also need a table
- Typically record the value
  - E.g. 123 = 0123 = 01_23 (Ada)
- But cannot use int for values
  - Because the target's numeric types may differ from the host's
- Float stuff much more complex
  - Denormals, correct rounding
  - Very delicate stuff
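
To see the overflow concern concretely, here is a hedged C sketch that accumulates the value of a decimal literal (skipping Ada-style underscores) and reports when it no longer fits the host type; a real front end such as GNAT instead keeps literal values in its own arbitrary-precision form (uintp).

   #include <limits.h>
   #include <stdbool.h>

   /* Accumulate the value of a decimal literal such as "1_000".
      Returns false if the value does not fit in unsigned long long;
      a real compiler would then fall back to an arbitrary-precision type. */
   bool decimal_value(const char *text, unsigned long long *out)
   {
       unsigned long long v = 0;
       for (const char *p = text; *p != '\0'; p++) {
           if (*p == '_')
               continue;                              /* Ada-style separator */
           unsigned d = (unsigned)(*p - '0');
           if (v > (ULLONG_MAX - d) / 10)
               return false;                          /* would overflow */
           v = v * 10 + d;
       }
       *out = v;
       return true;
   }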

Handling Comments
- Comments have no effect on the program
  - Can therefore be eliminated by the scanner
  - But may need to be retrieved by tools
- Error detection issues
  - E.g. unclosed comments
- Scanner does not return comments

Case Equivalence
- Some languages have case equivalence
  - Pascal, Ada
- Some do not
  - C, Java
- Lexical analyzer ignores case if needed
  - This_Routine = THIS_RouTine
- Error analysis may need exact casing
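
A small C sketch of the comparison a scanner might use for such languages (the function name is invented); the exact casing of the first occurrence would still be kept separately for error messages.

   #include <ctype.h>

   /* Case-insensitive comparison, so This_Routine and THIS_RouTine
      are treated as the same identifier. */
   int same_name(const char *a, const char *b)
   {
       while (*a != '\0' && *b != '\0') {
           if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
               return 0;
           a++;
           b++;
       }
       return *a == *b;       /* both strings ended at the same point */
   }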

Issues to Address
- Speed
  - Lexical analysis can take a lot of time
  - Minimize processing per character
  - I/O is also an issue (read large blocks)
- We compile frequently
  - Compilation time is important
  - Especially during development

General Approach
- Define a set of token codes
  - An enumeration type
  - A series of integer definitions
  - These are just codes (no semantics)
- Some codes associated with data
  - E.g. key for identifier table
- May be useful to build tree node
  - For identifiers, literals, etc.
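
A hedged C sketch of such a set of token codes and of the record a scanner might hand back; all the names are invented for illustration.

   /* Token codes: just an enumeration, no semantics attached. */
   enum token_code {
       TOK_IDENT, TOK_INT_LIT, TOK_STRING_LIT,
       TOK_PLUS, TOK_MINUS, TOK_ASSIGN,
       TOK_BEGIN, TOK_END, TOK_IF,
       TOK_EOF
   };

   /* Some codes carry data, e.g. a key into the identifier or literal table. */
   struct token {
       enum token_code code;
       int             table_key;     /* valid only for identifiers and literals */
       int             line, column;  /* source location */
   };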

Interface to Lexical Analyzer
- One option: convert the entire file to a file of tokens
  - Lexical analyzer is a separate phase
- Other option: parser calls the lexical analyzer
  - Get next token (on demand)
  - This approach avoids extra I/O
  - Parser builds tree as we go along
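
In the call-on-demand style, the parser simply asks for one token at a time. A minimal sketch using the token record above; get_next_token is an assumed name, not a real API.

   struct token get_next_token(void);     /* provided by the scanner */

   void parse(void)
   {
       struct token t = get_next_token();
       while (t.code != TOK_EOF) {
           /* ... parser looks at t.code, builds tree nodes as it goes ... */
           t = get_next_token();
       }
   }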

Implementation of Scanner
- Given the input text
  - Generate the required tokens
  - Or provide them token by token on demand
- Before we describe implementations
  - We take a short break
  - To describe relevant formalisms

Relevant Formalisms
- Type 3 (Regular) Grammars
- Regular Expressions
- Finite State Machines

Regular Grammars Regular grammars Regular grammars Non-terminals (arbitrary names) Non-terminals (arbitrary names) Terminals (characters) Terminals (characters) Two forms of rules Two forms of rules Non-terminal ::= terminal Non-terminal ::= terminal Non-terminal ::= terminal Non-terminal Non-terminal ::= terminal Non-terminal One non-terminal is the start symbol One non-terminal is the start symbol Regular (type 3) grammars cannot count Regular (type 3) grammars cannot count No concept of matching nested parens No concept of matching nested parens

Regular Grammars Regular grammars Regular grammars E.g. grammar of reals with no exponent E.g. grammar of reals with no exponent REAL ::= 0 REAL1 (repeat for 1.. 9) REAL ::= 0 REAL1 (repeat for 1.. 9) REAL1 ::= 0 REAL1 (repeat for 1.. 9) REAL1 ::= 0 REAL1 (repeat for 1.. 9) REAL1 ::=. INTEGER REAL1 ::=. INTEGER INTEGER ::= 0 INTEGER (repeat for 1.. 9) INTEGER ::= 0 INTEGER (repeat for 1.. 9) INTEGER ::= 0 (repeat for 1.. 9) INTEGER ::= 0 (repeat for 1.. 9) Start symbol is REAL Start symbol is REAL

Regular Expressions
- Regular expressions (REs) defined by:
  - Any terminal character is an RE
  - Alternation RE | RE
  - Concatenation RE1 RE2
  - Repetition RE* (zero or more REs)
- Language of REs = type 3 grammars
- Regular expressions are more convenient

Specifying REs in Unix Tools
- Single characters: a b c d \x
- Alternation: [bcd] [b-z] ab|cd
- Match any character: .
- Match sequence of characters: x* y+
- Concatenation: abc[d-q]
- Optional: [0-9]+(.[0-9]*)?
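
Putting these operators together, typical token classes can be written as patterns like the following (illustrative only, not a complete language definition):

   identifier       [a-zA-Z][a-zA-Z0-9_]*
   integer literal  [0-9]+
   real literal     [0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?
   Ada comment      --.*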

Finite State Machines
- Languages and Automata
  - A language is a set of strings
  - An automaton is a machine
  - That determines if a given string is in the language or not
- FSMs are automata that recognize regular languages (regular expressions)

Definitions of FSM
- A set of labeled states
- Directed arcs labeled with characters
- One state is the distinguished start state
- A state may be marked as terminal
- Transition from state S1 to S2
  - If and only if there is an arc from S1 to S2
  - Labeled with the next character (which is eaten)
- Recognized if the machine ends up in a terminal state

Building FSM from Grammar
- One state for each non-terminal
- A rule of the form
  - Nont1 ::= terminal
  - Generates a transition from S1 to the final state
- A rule of the form
  - Nont1 ::= terminal Nont2
  - Generates a transition from S1 to S2
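
Applying this construction to the REAL grammar above gives the following machine (one state per non-terminal, plus a final state F). Note the two digit arcs leaving INTEGER, which is exactly the kind of non-determinism discussed below:

   Start state: REAL
   REAL     --digit-->  REAL1
   REAL1    --digit-->  REAL1
   REAL1    --.------>  INTEGER
   INTEGER  --digit-->  INTEGER
   INTEGER  --digit-->  F        (terminal state)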

Building FSMs from REs
- Every RE corresponds to a grammar
- For all regular expressions
  - A natural translation to an FSM exists
- We will not give details of the algorithm here

Non-Deterministic FSM
- A non-deterministic FSM
  - Has at least one state
  - With two arcs to two separate states
  - Labeled with the same character
- Which way to go?
  - Implementation requires backtracking
  - Nasty!

Deterministic FSM
- For all states S
  - For all characters C
  - There is either ONE or NO arc
    - From state S
    - Labeled with character C
- Much easier to implement
  - No backtracking

Dealing with ND FSM
- The construction naturally leads to an ND FSM
- For example, consider the FSM for
  - [0-9]+ | [0-9]+\.[0-9]+   (integer or real)
- We will naturally get a start state
  - With two sets of 0-9 branches
  - And thus non-determinism

Converting to Deterministic
- There is an algorithm for converting
  - From any ND FSM
  - To an equivalent deterministic FSM
- Algorithm is in the text book
- Example (given in terms of REs):
  - [0-9]+ | [0-9]+\.[0-9]+
  - becomes [0-9]+(\.[0-9]+)?

Implementing the Scanner
- Three methods
  - Completely informal: just write code
  - Use the FSM formalism:
    - Define tokens using regular expressions
    - Convert REs to an ND finite state machine
    - Convert the ND FSM to a deterministic FSM
    - Program the FSM
  - Use an automated program
    - To achieve the above three steps

Ad Hoc Code (forget FSMs)
- Write normal hand code
  - A procedure called Scan
  - Normal coding techniques
- Basically scan over white space and comments until a non-blank character is found
- Base subsequent processing on that character
  - E.g. a colon may be : or :=
  - / may be an operator or the start of a comment
- Return the token found
- Write aggressive, efficient code
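
A hedged fragment of what such hand-written code might look like in C; the token codes and function name are invented, and error handling and most token classes are omitted.

   /* Illustrative token codes for this fragment only. */
   enum tok { T_COLON, T_ASSIGN, T_PLUS, T_DIVIDE, T_EOF };

   /* Scan one token; *src is advanced past the characters consumed. */
   enum tok scan(const char **src)
   {
       const char *p = *src;
       while (*p == ' ' || *p == '\t' || *p == '\n')
           p++;                                          /* skip white space */
       enum tok code;
       switch (*p) {
       case ':':
           if (p[1] == '=') { code = T_ASSIGN; p += 2; } /* ":=" */
           else             { code = T_COLON;  p += 1; } /* ":"  */
           break;
       case '+':
           code = T_PLUS;
           p++;
           break;
       case '/':
           if (p[1] == '*') {                            /* slash-star starts a comment */
               p += 2;
               while (*p && !(p[0] == '*' && p[1] == '/'))
                   p++;
               if (*p) p += 2;
               *src = p;
               return scan(src);                         /* resume after the comment */
           }
           code = T_DIVIDE;                              /* plain division operator */
           p++;
           break;
       case '\0':
           code = T_EOF;
           break;
       default:
           /* ... letters and digits dispatch to identifier/number routines ... */
           code = T_EOF;                                 /* placeholder for brevity */
           p++;
           break;
       }
       *src = p;
       return code;
   }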

Using FSM Formalisms
- Start with a regular grammar or REs
  - Typically found in the language standard
- For example, for Ada (Chapter 2, Lexical Elements):

   digit           ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
   decimal-literal ::= integer [.integer] [exponent]
   integer         ::= digit {[underline] digit}
   exponent        ::= E [+] integer | E - integer

Using FSM Formalisms, cont.
- Given REs or a grammar
  - Convert to a finite state machine
  - Convert the ND FSM to a deterministic FSM
- Write a program to recognize tokens
  - Using the deterministic FSM

Implementing FSM (Method 1)
- Each state is code of the form:

   <<state1>>
      case Next_Character is
         when 'a'    => goto state3;
         when 'b'    => goto state1;
         when others => End_of_token_processing;
      end case;

   <<state2>>
      ...

Implementing FSM (Method 2)
- There is a variable called State

   loop
      case State is
         when state1 =>
            case Next_Character is
               when 'a'    => State := state3;
               when 'b'    => State := state1;
               when others => End_token_processing;
            end case;
         when state2 =>
            ...
      end case;
   end loop;

Implementing FSM (Method 3)
- Transition table indexed by state and character:

   T : array (State, Character) of State;

   while More_Input loop
      Curstate := T (Curstate, Next_Char);
      if Curstate = Error_State then
         ...
      end if;
   end loop;
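
The same idea in C, specialized to the integer-or-real language from earlier. For brevity the transition table is encoded as a small function rather than a 2-D array; a generated scanner would normally tabulate it.

   #include <ctype.h>
   #include <stdbool.h>

   /* States of the DFA for [0-9]+(\.[0-9]+)?  */
   enum state { S_START, S_INT, S_DOT, S_FRAC, S_ERROR };

   static enum state step(enum state s, char c)
   {
       int digit = isdigit((unsigned char)c);
       switch (s) {
       case S_START: return digit ? S_INT  : S_ERROR;
       case S_INT:   return digit ? S_INT  : (c == '.' ? S_DOT : S_ERROR);
       case S_DOT:   return digit ? S_FRAC : S_ERROR;
       case S_FRAC:  return digit ? S_FRAC : S_ERROR;
       default:      return S_ERROR;
       }
   }

   /* A string is accepted if the DFA ends in a terminal state. */
   bool is_number(const char *text)
   {
       enum state s = S_START;
       for (const char *p = text; *p != '\0' && s != S_ERROR; p++)
           s = step(s, *p);
       return s == S_INT || s == S_FRAC;
   }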

Automatic FSM Generation
- Our example: FLEX
  - See the home page for the manual in HTML
- FLEX is given
  - A set of regular expressions
  - Actions associated with each RE
- It builds a scanner
  - Which matches REs and executes actions

Flex General Format
- Input to Flex is a set of rules:

   regexp   actions (C statements)
   regexp   actions (C statements)
   ...

- Flex scans for the longest matching regexp
- And executes the corresponding actions

An Example of a Flex Scanner

   DIGIT  [0-9]
   ID     [a-z][a-z0-9]*

   %%

   {DIGIT}+ {
       printf ("an integer %s (%d)\n", yytext, atoi (yytext));
   }

   {DIGIT}+"."{DIGIT}* {
       printf ("a float %s (%g)\n", yytext, atof (yytext));
   }

   if|then|begin|end|procedure|function {
       printf ("a keyword: %s\n", yytext);
   }

Flex Example (continued)

   {ID}             printf ("an identifier %s\n", yytext);

   "+"|"-"|"*"|"/"  { printf ("an operator %s\n", yytext); }

   "--".*\n         /* eat Ada style comment */

   [ \t\n]+         /* eat white space */

   .                printf ("unrecognized character");

   %%

Assembling the Flex Program

   %{
   #include <math.h>   /* for atof */
   %}

   ... definitions and rules as in the previous slides ...

   %%

   main (argc, argv)
   int argc;
   char **argv;
   {
       yyin = fopen (argv[1], "r");
       yylex ();
   }

Running Flex
- flex is a program that is executed
  - The input is as we have given
  - The output is a running C program
- For Ada fans
  - Look at aflex
- For C++ fans
  - flex can run in C++ mode
  - Generates appropriate classes

Choice Between Methods?
- Hand-written scanners
  - Typically much faster execution
  - And pretty easy to write
  - And easier for good error recovery
- Flex approach
  - Simple to use
  - Easy to modify the token language

The GNAT Scanner
- Hand written (scn.adb / scn.ads)
- Basically a call does
  - Super quick scan past blanks/comments etc.
  - Big case statement
    - Process based on first character
  - Call special routines
    - Namet.Get_Name for identifiers (hashing)
    - Keywords recognized by special hash
    - Strings (stringt.ads)
    - Integers (uintp.ads)
    - Reals (ureal.ads)

More on the GNAT Scanner
- Entire source is read into memory
  - Single contiguous block
- Source location is an index into this block
  - Different index range for each source file
- See sinput.adb/ads for source management
- See scans.ads for definitions of tokens

More on the GNAT Scanner
- Read the scn.adb code
- Very easy reading

ASSIGNMENT TWO
- Write a flex or aflex program
- Recognize the tokens of an Algol-68S program
- Print out tokens in the style of the flex example
- Extra credit
  - Build a hash table for identifiers
  - Output the hash table key

Preprocessors
- Some languages allow preprocessing
  - This is a separate step
  - Input is source, output is expanded source
- Can either be done as a separate phase
  - Or embedded into the lexical analyzer
- Often done as a separate phase
  - Need to keep track of source locations

Nasty Glitches
- Separation of tokens
  - Not all languages have clear rules
- FORTRAN has optional spaces:

   DO10I=1.6
     identifier  operator  literal                          (i.e. DO10I = 1.6)

   DO10I=1,6
     keyword  stmt-label  loop-var  operator  literal  punct  literal
                                                             (i.e. DO 10 I = 1, 6)

- Modern languages avoid this kind of thing!