Lexical Analysis and Scanning Compiler Construction Lecture 2 Spring 2001 Robert Dewar.

The Input
  Read string input
    Might be sequence of characters (Unix)
    Might be sequence of lines (VMS)
  Character set
    ASCII
    ISO Latin-1
    ISO 10646 (16-bit = Unicode)
    Others (EBCDIC, JIS, etc.)

The Output
  A series of tokens
    Punctuation: ( ) ; , [ ]
    Operators: + - ** :=
    Keywords: begin end if
    Identifiers: Square_Root
    String literals: "hello this is a string"
    Character literals: 'x'
    Numeric literals: 123 4_5.23e+2 16#ac#

Free form vs Fixed form
  Free form languages
    White space does not matter
      Tabs, spaces, new lines, carriage returns
    Only the ordering of tokens is important
  Fixed format languages
    Layout is critical
      Fortran, label in cols 1-5 (col 6 marks a continuation line)
      COBOL, area A B
    Lexical analyzer must worry about layout

Punctuation
  Typically individual special characters
    Such as + -
    Lexical analyzer does not know : from :
      (it does not distinguish the different uses of the same character)
  Sometimes double characters
    E.g. (* treated as a kind of bracket
  Returned just as identity of token
    And perhaps location
      For error message and debugging purposes

Operators
  Like punctuation
    No real difference for lexical analyzer
  Typically single or double special chars
    Operators + -
    Operations :=
  Returned just as identity of token
    And perhaps location

Keywords
  Reserved identifiers
    E.g. BEGIN END in Pascal, if in C
  Maybe distinguished from identifiers
    E.g. mode (keyword, set in bold) vs mode (identifier) in Algol-68
  Returned just as token identity
    With possible location information
  Unreserved keywords (e.g. PL/1)
    Handled as identifiers (parser distinguishes)

Identifiers
  Rules differ
    Length, allowed characters, separators
  Need to build table
    So that every occurrence of junk1 is recognized as the same junk1
    Typical structure: hash table
  Lexical analyzer returns token type
    And key to table entry
    Table entry includes location information

More on Identifier Tables
  Most common structure is hash table
    With fixed number of headers
    Chain according to hash code
    Serial search on one chain
    Hash code computed from characters
    No hash code is perfect!
  Avoid any arbitrary limits
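
The chained table described above can be sketched in a few lines. This is only an illustration: the class name, the multiplier 31, and the default of 64 headers are assumptions made for the example, not anything prescribed by the lecture.

```python
class IdentifierTable:
    """Hash table with a fixed number of chain headers (illustrative sketch)."""

    def __init__(self, num_headers=64):
        self.headers = [[] for _ in range(num_headers)]  # one chain per header
        self.names = []                                   # key -> spelling

    def _hash(self, name):
        # Hash code computed from the characters of the name.
        code = 0
        for ch in name:
            code = (code * 31 + ord(ch)) % len(self.headers)
        return code

    def lookup(self, name):
        """Return the key for name, entering it the first time it is seen."""
        chain = self.headers[self._hash(name)]
        for key in chain:                  # serial search on one chain
            if self.names[key] == name:
                return key
        key = len(self.names)              # new entry: next free key
        self.names.append(name)
        chain.append(key)
        return key
```

The number of headers is fixed, but the chains and the name list grow as needed, so there is no arbitrary limit on how many identifiers (or how long) a program may use.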

String Literals
  Text must be stored
  Actual characters are important
    Not like identifiers
    Character set issues
  Table needed
    Lexical analyzer returns key to table
    May or may not be worth hashing

Character Literals
  Similar issues to string literals
  Lexical analyzer returns
    Token type
    Identity of character
  Note: cannot assume character set of host machine, may be different

Numeric Literals
  Also need a table
  Typically record value
    E.g. 123 = 0123 = 01_23 (Ada)
    But cannot use host int for values
      Because the target machine may have different characteristics
  Float stuff much more complex
    Denormals, correct rounding
    Very delicate stuff
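
As a small illustration of recording the value rather than the spelling, here is a sketch that evaluates Ada-style integer literals. The function name is made up for the example, and exponents are deliberately not handled. Python integers are unbounded, which sidesteps the host-int problem; a compiler written in a language with fixed-size integers would need a multi-precision package instead.

```python
def ada_integer_value(text):
    """Value of an Ada integer literal such as 123, 01_23 or 16#ac# (sketch)."""
    text = text.replace("_", "")            # underscores are only separators
    if "#" in text:
        base_part, digits, rest = text.split("#")
        if rest:
            raise ValueError("exponents are not handled in this sketch")
        return int(digits, int(base_part))  # based literal: base#digits#
    return int(text, 10)
```

With this, `ada_integer_value("123")`, `ada_integer_value("0123")` and `ada_integer_value("01_23")` all give the same value, so the table can store one entry for all three spellings.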

Handling Comments
  Comments have no effect on program
  Can therefore be eliminated by scanner
    But may need to be retrieved by tools
  Error detection issues
    E.g. unclosed comments
  Scanner does not return comments

Case Equivalence
  Some languages have case equivalence
    Pascal, Ada
  Some do not
    C, Java
  Lexical analyzer ignores case if needed
    This_Routine = THIS_RouTine
    Error analysis may need exact casing

Issues to Address
  Speed
    Lexical analysis can take a lot of time
    Minimize processing per character
      I/O is also an issue (read large blocks)
  We compile frequently
    Compilation time is important
      Especially during development

General Approach
  Define set of token codes
    An enumeration type
    A series of integer definitions
    These are just codes (no semantics)
  Some codes associated with data
    E.g. key for identifier table
  May be useful to build tree node
    For identifiers, literals etc
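
A minimal sketch of such token codes, using an enumeration plus a small record for the associated data. The particular codes and field names here are chosen only for illustration.

```python
from enum import Enum, auto
from typing import NamedTuple, Optional


class TokenCode(Enum):
    """Just codes, with no semantics attached."""
    IDENTIFIER = auto()
    INTEGER_LITERAL = auto()
    STRING_LITERAL = auto()
    KEYWORD_BEGIN = auto()
    KEYWORD_END = auto()
    SEMICOLON = auto()
    ASSIGN = auto()


class Token(NamedTuple):
    code: TokenCode
    location: int               # position in the source, for error messages
    key: Optional[int] = None   # table key, only for identifiers and literals
```

A token such as `Token(TokenCode.SEMICOLON, 17)` carries only its identity and location, while `Token(TokenCode.IDENTIFIER, 4, key=3)` also carries the key of its table entry.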

Interface to Lexical Analyzer
  Convert entire file to a file of tokens
    Lexical analyzer is separate phase
  Parser calls lexical analyzer
    Get next token
    This approach avoids extra I/O
    Parser builds tree as we go along

Implementation of Scanner
  Given the input text
    Generate the required tokens
    Or provide token by token on demand
  Before we describe implementations
    We take this short break
    To describe relevant formalisms

Relevant Formalisms
  Type 3 (Regular) Grammars
  Regular Expressions
  Finite State Machines

Regular Grammars
  Regular grammars
    Non-terminals (arbitrary names)
    Terminals (characters)
    Two forms of rules
      Non-terminal ::= terminal
      Non-terminal ::= terminal Non-terminal
    One non-terminal is the start symbol
  Regular (type 3) grammars cannot count
    No concept of matching nested parens

Regular Grammars
  Regular grammars
    E.g. grammar of reals with no exponent
      REAL    ::= 0 REAL1      (repeat for 1 .. 9)
      REAL1   ::= 0 REAL1      (repeat for 1 .. 9)
      REAL1   ::= . INTEGER
      INTEGER ::= 0 INTEGER    (repeat for 1 .. 9)
      INTEGER ::= 0            (repeat for 1 .. 9)
    Start symbol is REAL

Regular Expressions
  Regular expressions (RE) defined by
    Any terminal character is an RE
    Alternation: RE | RE
    Concatenation: RE1 RE2
    Repetition: RE* (zero or more RE’s)
  Language of RE’s = type 3 grammars
    Regular expressions are more convenient

Specifying RE’s in Unix Tools
  Single characters: a b c d \x
  Alternation: [bcd] [b-z] ab|cd
  Match any character: .
  Match sequence of characters: x* y+
  Concatenation: abc[d-q]
  Optional: [0-9]+(\.[0-9]*)?
    (the dot is escaped so that it matches only a literal period)
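
These notations can be tried out directly. The sketch below uses Python's standard `re` module as a stand-in for the Unix tools; for the operators listed here its syntax agrees with theirs. The helper name `matches` and the test strings are made up for the example.

```python
import re


def matches(pattern, text):
    """True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None


# (pattern, input, expected result)
EXAMPLES = [
    (r"[b-z]", "q", True),                  # character range
    (r"[b-z]", "a", False),
    (r"ab|cd", "cd", True),                 # alternation
    (r".", "x", True),                      # any character
    (r"x*y+", "xxy", True),                 # zero or more x, one or more y
    (r"x*y+", "xx", False),
    (r"abc[d-q]", "abcd", True),            # concatenation
    (r"[0-9]+(\.[0-9]*)?", "12", True),     # optional fractional part
    (r"[0-9]+(\.[0-9]*)?", "12.5", True),
    (r"[0-9]+(\.[0-9]*)?", ".5", False),
]
```

Note that without the backslash the last pattern is too generous: `matches(r"[0-9]+(.[0-9]*)?", "1a")` is true, because an unescaped dot matches any character.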

Finite State Machines
  Languages and Automata
    A language is a set of strings
    An automaton is a machine
      That determines if a given string is in the language or not
    FSM’s are automata that recognize regular languages (regular expressions)

Definitions of FSM
  A set of labeled states
  Directed arcs labeled with character
  A state may be marked as terminal
  Transition from state S1 to S2
    If and only if arc from S1 to S2
    Labeled with next character (which is eaten)
  Recognized if ends up in terminal state
  One state is distinguished start state

Building FSM from Grammar
  One state for each non-terminal
  A rule of the form
    Nont1 ::= terminal
    Generates transition from S1 to final state
  A rule of the form
    Nont1 ::= terminal Nont2
    Generates transition from S1 to S2
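
The construction can be sketched for the grammar of reals given earlier. The rule encoding (a triple with `None` for a rule that has no trailing non-terminal) and the function names are assumptions made for this example. Because INTEGER has both `INTEGER ::= 0 INTEGER` and `INTEGER ::= 0`, the resulting machine has two arcs with the same label out of one state, so the sketch tracks a set of current states rather than a single one.

```python
DIGITS = "0123456789"
FINAL = "<final>"   # the single final state added by the construction


def real_grammar():
    """Rules as (non-terminal, terminal, following non-terminal or None)."""
    rules = []
    for d in DIGITS:
        rules.append(("REAL", d, "REAL1"))
        rules.append(("REAL1", d, "REAL1"))
        rules.append(("INTEGER", d, "INTEGER"))
        rules.append(("INTEGER", d, None))
    rules.append(("REAL1", ".", "INTEGER"))
    return rules


def build_fsm(rules):
    """One state per non-terminal; each rule contributes one arc."""
    arcs = {}
    for left, terminal, right in rules:
        target = FINAL if right is None else right
        arcs.setdefault((left, terminal), set()).add(target)
    return arcs


def accepts(arcs, start, text):
    """Follow all possible arcs at once; accept if FINAL is reachable."""
    current = {start}
    for ch in text:
        current = {t for s in current for t in arcs.get((s, ch), ())}
    return FINAL in current
```

For instance, `accepts(build_fsm(real_grammar()), "REAL", "3.14")` holds, while `"3."` and `"314"` are rejected, since this grammar requires digits on both sides of the point.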

Building FSM’s from RE’s
  Every RE corresponds to a grammar
  For all regular expressions
    A natural translation to FSM exists
  We will not give details of algorithm here

Non-Deterministic FSM
  A non-deterministic FSM
    Has at least one state
      With two arcs to two separate states
      Labeled with the same character
  Which way to go?
    Implementation requires backtracking
    Nasty

Deterministic FSM
  For all states S
    For all characters C
      There is either ONE arc or NO arc
        From state S
        Labeled with character C
  Much easier to implement
    No backtracking

Dealing with ND FSM
  Construction naturally leads to ND FSM
  For example, consider FSM for
    [0-9]+ | [0-9]+\.[0-9]+ (integer or real)
  We will naturally get a start state
    With two sets of 0-9 branches
    And thus non-deterministic

Converting to Deterministic
  There is an algorithm for converting
    From any ND FSM
    To an equivalent deterministic FSM
  Algorithm is in the text book
  Example (given in terms of RE’s)
    [0-9]+ | [0-9]+\.[0-9]+
    becomes
    [0-9]+(\.[0-9]+)?
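
The usual textbook algorithm for this conversion is the subset construction: each state of the deterministic machine is a set of states of the ND machine. The sketch below applies it to the integer-or-real example. The state names S, A, B, C, D and the two input classes "digit" and "dot" are invented for the illustration, and the ND machine has no empty transitions, which keeps the code short.

```python
# ND FSM for [0-9]+ | [0-9]+\.[0-9]+ : two digit arcs leave the start state S.
NFA = {
    ("S", "digit"): {"A", "B"},
    ("A", "digit"): {"A"},          # A: inside an integer
    ("B", "digit"): {"B"},          # B: integer part of a real
    ("B", "dot"): {"C"},
    ("C", "digit"): {"D"},
    ("D", "digit"): {"D"},          # D: fraction part of a real
}
NFA_FINALS = {"A", "D"}
SYMBOLS = ("digit", "dot")


def to_deterministic(nfa, start, finals):
    """Subset construction; returns (arcs, start state, final states)."""
    start_set = frozenset([start])
    dfa, dfa_finals = {}, set()
    pending, seen = [start_set], {start_set}
    while pending:
        state = pending.pop()
        if state & finals:
            dfa_finals.add(state)
        for sym in SYMBOLS:
            target = frozenset(t for s in state for t in nfa.get((s, sym), ()))
            if not target:
                continue                      # no arc for this symbol
            dfa[(state, sym)] = target        # exactly one arc per (state, symbol)
            if target not in seen:
                seen.add(target)
                pending.append(target)
    return dfa, start_set, dfa_finals


def run(dfa, start, finals, text):
    """Drive the deterministic machine over text; no backtracking needed."""
    state = start
    for ch in text:
        sym = "digit" if ch in "0123456789" else "dot" if ch == "." else None
        state = dfa.get((state, sym))
        if state is None:
            return False
    return state in finals
```

The resulting machine has four states ({S}, {A, B}, {C}, {D}) and five arcs, which is the shape of the factored expression `[0-9]+(\.[0-9]+)?`.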

Implementing the Scanner
  Three methods
    Completely informal, just write code
    Define tokens using regular expressions
      Convert RE’s to ND finite state machine
      Convert ND FSM to deterministic FSM
      Program the FSM
    Use an automated program
      To achieve above three steps

Ad Hoc Code (forget FSM’s)
  Write normal hand code
    A procedure called Scan
    Normal coding techniques
  Basically scan over white space and comments till non-blank character found
  Base subsequent processing on character
    E.g. colon may be : or :=
    / may be operator or start of comment
  Return token found
  Write aggressive efficient code
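
A hand-coded scanner in this style might look like the following sketch. The token kinds, the choice of Ada-style `--` comments, and the function names are assumptions for the example, and it favors clarity over the aggressive efficiency recommended above.

```python
def scan(text, pos=0):
    """Return (kind, spelling, next_pos) for the next token at or after pos."""
    n = len(text)
    # Scan over white space and "--" comments until a token character is found.
    while pos < n:
        if text[pos] in " \t\r\n":
            pos += 1
        elif text.startswith("--", pos):
            while pos < n and text[pos] != "\n":
                pos += 1
        else:
            break
    if pos == n:
        return ("eof", "", pos)
    ch = text[pos]
    # Base subsequent processing on the first character.
    if ch == ":":                       # colon may be : or :=
        if text.startswith(":=", pos):
            return ("assign", ":=", pos + 2)
        return ("colon", ":", pos + 1)
    if ch.isalpha():
        end = pos + 1
        while end < n and (text[end].isalnum() or text[end] == "_"):
            end += 1
        return ("identifier", text[pos:end], end)
    if ch.isdigit():
        end = pos + 1
        while end < n and text[end].isdigit():
            end += 1
        return ("integer", text[pos:end], end)
    return ("punctuation", ch, pos + 1)


def tokens(text):
    """Call scan repeatedly, the way a parser would ask for the next token."""
    pos, result = 0, []
    while True:
        kind, spelling, pos = scan(text, pos)
        if kind == "eof":
            return result
        result.append((kind, spelling))
```

Each call to `scan` does the work for exactly one token, so a parser can call it on demand instead of building the whole token list first.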

Using FSM Formalisms
  Start with regular grammar or RE
    Typically found in the language standard
  For example, for Ada: Chapter 2. Lexical Elements
    digit           ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    decimal-literal ::= integer [.integer] [exponent]
    integer         ::= digit {[underline] digit}
    exponent        ::= E [+] integer | E - integer

Using FSM formalisms, cont
  Given RE’s or grammar
    Convert to finite state machine
    Convert ND FSM to deterministic FSM
  Write a program to recognize
    Using the deterministic FSM

Implementing FSM (Method 1)
  Each state is code of the form:
    <<state1>>
      case Next_Character is
        when 'a' => goto state3;
        when 'b' => goto state1;
        when others => End_of_token_processing;
      end case;
    …

Implementing FSM (Method 2)
  There is a variable called State
    loop
      case State is
        when state1 =>
          case Next_Character is
            when 'a' => State := state3;
            when 'b' => State := state1;
            when others => End_of_token_processing;
          end case;
        when state2 … …
      end case;
    end loop;

Implementing FSM (Method 3)
    T : array (State, Character) of State;

    while More_Input loop
      Curstate := T (Curstate, Next_Char);
      if Curstate = Error_State then … end if;
    end loop;
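
To make Method 3 concrete, here is a runnable sketch of a table-driven recognizer for `[0-9]+(\.[0-9]+)?`. The state names and the 128-column ASCII table are choices made for this example; the loop mirrors the one on the slide.

```python
START, INT, DOT, FRAC, ERROR = range(5)

# T[state][character code] -> next state; every entry starts as ERROR.
T = [[ERROR] * 128 for _ in range(5)]
for d in "0123456789":
    c = ord(d)
    T[START][c] = INT
    T[INT][c] = INT
    T[DOT][c] = FRAC
    T[FRAC][c] = FRAC
T[INT][ord(".")] = DOT

TERMINAL = {INT, FRAC}      # states in which a complete token has been seen


def recognize(text):
    """True if text is an integer or a real with digits after the point."""
    state = START
    for ch in text:
        code = ord(ch)
        state = T[state][code] if code < 128 else ERROR
        if state == ERROR:
            return False
    return state in TERMINAL
```

The per-character work is a single table lookup, which is why this form suits generated scanners; the cost is the size of the table.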

Automatic FSM Generation
  Our example, FLEX
    See home page for manual in HTML
  FLEX is given
    A set of regular expressions
    Actions associated with each RE
  It builds a scanner
    Which matches RE’s and executes actions

Flex General Format
  Input to Flex is a set of rules:
    Regexp    actions (C statements)
    …
  Flex scans the longest matching Regexp
    And executes the corresponding actions
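
The longest-match rule can be imitated by hand to see what it implies. The sketch below is not how flex works internally (flex compiles all rules into one deterministic FSM); it simply tries every rule at the current position and keeps the longest match, with ties going to the rule listed first, which is the documented flex behavior. The rule names and patterns are modeled on the example scanner that follows.

```python
import re

# Order matters only for ties: an earlier rule beats a later one of equal length.
RULES = [
    ("float",      re.compile(r"[0-9]+\.[0-9]*")),
    ("integer",    re.compile(r"[0-9]+")),
    ("keyword",    re.compile(r"if|then|begin|end|procedure|function")),
    ("identifier", re.compile(r"[a-z][a-z0-9]*")),
    ("operator",   re.compile(r"[-+*/]")),
    ("space",      re.compile(r"[ \t\n]+")),
]


def next_token(text, pos):
    """Return (rule name, spelling, end) for the longest match at pos."""
    best_name, best_end = None, pos
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        if m and m.end() > best_end:    # strictly longer wins; ties keep earlier rule
            best_name, best_end = name, m.end()
    return best_name, text[pos:best_end], best_end


def scan(text):
    pos, out = 0, []
    while pos < len(text):
        name, spelling, end = next_token(text, pos)
        if name is None:                # no rule matched: consume one character
            name, spelling, end = "unrecognized", text[pos], pos + 1
        if name != "space":
            out.append((name, spelling))
        pos = end
    return out
```

This is why `ifx` comes out as one identifier rather than the keyword `if` followed by `x`, while a bare `if` is a keyword because the keyword rule is listed before the identifier rule.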

An Example of a Flex scanner
    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*
    %%
    {DIGIT}+ {
        printf ("an integer %s (%d)\n", yytext, atoi (yytext));
    }
    {DIGIT}+"."{DIGIT}* {
        printf ("a float %s (%g)\n", yytext, atof (yytext));
    }
    if|then|begin|end|procedure|function {
        printf ("a keyword: %s\n", yytext);
    }

Flex Example (continued)
    {ID}               printf ("an identifier %s\n", yytext);
    "+"|"-"|"*"|"/"    { printf ("an operator %s\n", yytext); }
    "--".*\n           /* eat Ada style comment */
    [ \t\n]+           /* eat white space */
    .                  printf ("unrecognized character");
    %%

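Assembling the flex program
    %{
    #include <stdlib.h>   /* for atof */
    %}
    (definitions, %%, and rules from the previous two slides go here)
    %%
    int main (int argc, char **argv)
    {
        yyin = fopen (argv[1], "r");   /* scan the file named on the command line */
        yylex ();
        return 0;
    }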

Running flex
  flex is a program that is executed
    The input is as we have given
    The output is a C program (lex.yy.c), which is compiled to give the running scanner
  For Ada fans
    Look at aflex (www.adapower.com)
  For C++ fans
    flex can run in C++ mode
    Generates appropriate classes

Choice Between Methods?
  Hand written scanners
    Typically much faster execution
    And pretty easy to write
    And easier to give good error recovery
  Flex approach
    Simple to use
    Easy to modify token language

The GNAT Scanner
  Hand written (scn.adb/scn.ads)
  Basically a call does
    Super quick scan past blanks/comments etc
    Big case statement
    Process based on first character
    Call special routines
      Namet.Get_Name for identifier (hashing)
      Keywords recognized by special hash
      Strings (stringt.ads)
      Integers (uintp.ads)
      Reals (ureal.ads)

More on the GNAT Scanner
  Entire source read into memory
    Single contiguous block
  Source location is index into this block
    Different index range for each source file
  See sinput.adb/ads for source mgmt
  See scans.ads for definitions of tokens
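
The idea of a source location being just an index into one block can be illustrated as follows. This is not GNAT's sinput code, only a sketch under the assumption of a single file with "\n" line endings; the class and method names are invented. Line and column are computed only when a message needs them.

```python
import bisect


class SourceBuffer:
    """Whole source held as one block; a location is an index into it."""

    def __init__(self, text):
        self.text = text
        # Index of the first character of each line.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(text) if ch == "\n"]

    def line_and_column(self, location):
        """Convert an index into a 1-based (line, column) pair."""
        line = bisect.bisect_right(self.line_starts, location) - 1
        return line + 1, location - self.line_starts[line] + 1
```

Tokens then need to carry only one integer, and the comparatively expensive conversion happens only on the rare occasions when an error message or debugger asks for it.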

More on GNAT Scanner Read scn.adb code Very easy reading, e.g.

DTL (Dewar Trivial Language)
  DTL Grammar
    Program    ::= DECLARE Decls BEGIN Stmts
    Decls      ::= {Decl}*
    Stmts      ::= {Stmt}+
    Type       ::= INTEGER | REAL
    Identifier ::= letter (_{digit}+)*
    Decl       ::= DECL Identifier : Type

DTL (Continued)
    Integer_Literal ::= {digit}+
    Real_Literal    ::= {digit}+"."{digit}*
    Stmt            ::= Assignstmt | Ifstmt | Whilestmt
    Assignstmt      ::= Identifier := Expr
    Expr            ::= Literal | (Expr) Op (Expr)
    Op              ::= + | -
    Literal         ::= Integer_Literal | Real_Literal
    Ifstmt          ::= IF Expr Relop Expr THEN Stmts
    Whilestmt       ::= WHILE Expr Relop Expr DO Stmts
    Relop           ::= > | = | <=

DTL Example
    DECLARE
      DECL A_123 : INTEGER
      DECL B : REAL
    BEGIN
      A_123 := 23
      B := 2.4
      WHILE A_123 > (2) + (1) DO
        A_123 := A_123 - 1

ASSIGNMENT TWO
  Write a flex or aflex program
    Recognize tokens of DTL program
    Print out tokens in style of flex example
  Extra credit
    Build hash table for identifiers
    Output hash table key

Preprocessors
  Some languages allow preprocessing
  This is a separate step
    Input is source
    Output is expanded source
  Can either be done as separate phase
    Or embedded into the lexical analyzer
    Often done as separate phase
      Need to keep track of source locations

Nasty Glitches
  Separation of tokens
    Not all languages have clear rules
  FORTRAN has optional spaces
    DO10I=1.6
      identifier operator literal
      DO10I = 1.6
    DO10I=1,6
      keyword stmt loopvar operator literal punc literal
      DO 10 I = 1 , 6
  Modern languages avoid this kind of thing!
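
The FORTRAN case shows why the scanner cannot decide what `DO10I` is from its first characters alone: it has to look ahead for a comma. The sketch below illustrates that lookahead. It is a simplification invented for this example (blanks are stripped everywhere, including inside any character constants, and only commas outside parentheses count), not a real Fortran scanner.

```python
import re

# After blanks are removed a DO loop starts: DO <label> <variable> =
DO_PREFIX = re.compile(r"DO(\d+)([A-Z][A-Z0-9]*)=")


def classify(statement):
    """Return ("do-loop", label, variable) or ("other",)."""
    compact = statement.replace(" ", "").upper()
    m = DO_PREFIX.match(compact)
    if m:
        depth = 0
        for ch in compact[m.end():]:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == "," and depth == 0:
                # A top-level comma after "=" means loop bounds, so this is a DO.
                return ("do-loop", m.group(1), m.group(2))
    return ("other",)
```

So `classify("DO 10 I = 1, 6")` reports a loop with label 10 and variable I, while `classify("DO10I=1.6")` is just an assignment to a variable named DO10I; the two differ only in one character far to the right of the keyword.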
