
Slide 1: CMPSC 160 Translation of Programming Languages, Fall 2002
Lecture-Module #4: Lexical Analysis
Slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon

Slide 2: Announcements
- Programming assignment 1 will be posted on the class webpage
  - Due next Tuesday, October 15th (easy)
- Homework 1 is due now
- Lecture notes will be available on the class webpage

Slide 3: First Phase: Lexical Analysis (Scanning)
- The scanner maps the stream of characters into tokens, the basic units of syntax
- The characters that form a word are its lexeme; its syntactic category is called its token
- The scanner discards white space and comments
- The scanner works as a subroutine of the parser
[Diagram: source code goes into the Scanner; the Parser repeatedly asks the Scanner to "get next token" and receives a token; the Parser produces IR, and errors are reported]

Slide 4: Lexical Analysis
- Specify tokens using regular expressions
- Translate the regular expressions to finite automata
- Use the finite automata to generate tables or code for the scanner
[Diagram: specifications (regular expressions) go into the Scanner Generator, which produces the tables or code for a scanner that maps source code to tokens]

Slide 5: Automating Scanner Construction
To build a scanner:
1. Write down the RE that specifies the tokens
2. Translate the RE to an NFA
3. Build the DFA that simulates the NFA
4. Systematically shrink the DFA
5. Turn it into code or a table
- Scanner generators (Lex, Flex, JLex) work along these lines
- The algorithms are well known and well understood
- The interface to the parser is important

Slide 6: Automating Scanner Construction
- RE → NFA (Thompson's construction): build an NFA for each term and combine them with ε-moves
- NFA → DFA (subset construction): build the simulation
- DFA → minimal DFA: Hopcroft's algorithm
- DFA → RE: an all-pairs, all-paths problem; union together the paths from s0 to each final state
[Diagram: "The Cycle of Constructions": RE → NFA → DFA → minimal DFA]
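A rough sketch of the subset construction step follows; it is illustrative only, not the course's code, and the NFA representation (per-state move and ε-move maps), the names, and the use of Java collections are assumptions made for the sketch.

    import java.util.*;

    final class SubsetConstruction {
        // move.get(s).get(c): NFA states reachable from state s on character c
        // eps.get(s):         NFA states reachable from state s on an ε-move
        // Each key of the returned map (a set of NFA states) is one DFA state;
        // a DFA state is final whenever its set contains an NFA final state.
        static Map<Set<Integer>, Map<Character, Set<Integer>>> build(
                Map<Integer, Map<Character, Set<Integer>>> move,
                Map<Integer, Set<Integer>> eps,
                int nfaStart, Set<Character> alphabet) {

            Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
            Set<Integer> start = closure(Set.of(nfaStart), eps);
            Deque<Set<Integer>> work = new ArrayDeque<>();
            work.push(start);
            dfa.put(start, new HashMap<>());

            while (!work.isEmpty()) {
                Set<Integer> d = work.pop();
                for (char c : alphabet) {
                    Set<Integer> next = new HashSet<>();
                    for (int s : d) {                              // move(d, c)
                        next.addAll(move.getOrDefault(s, Map.of())
                                        .getOrDefault(c, Set.of()));
                    }
                    next = closure(next, eps);                     // ε-closure
                    if (next.isEmpty()) continue;
                    dfa.get(d).put(c, next);
                    if (!dfa.containsKey(next)) {                  // unseen DFA state
                        dfa.put(next, new HashMap<>());
                        work.push(next);
                    }
                }
            }
            return dfa;
        }

        // ε-closure: all NFA states reachable from 'states' using only ε-moves
        private static Set<Integer> closure(Set<Integer> states,
                                            Map<Integer, Set<Integer>> eps) {
            Deque<Integer> work = new ArrayDeque<>(states);
            Set<Integer> result = new HashSet<>(states);
            while (!work.isEmpty()) {
                for (int t : eps.getOrDefault(work.pop(), Set.of())) {
                    if (result.add(t)) work.push(t);
                }
            }
            return result;
        }
    }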

Slide 7: NFA vs. DFA Scanners
- Given a regular expression r, we can convert it to an NFA of size O(|r|)
- Given an NFA, we can convert it to a DFA of size O(2^|r|)
- We can simulate a DFA on a string x in O(|x|) time
- We can simulate an NFA N (built by Thompson's construction) on a string x in O(|N| × |x|) time

Recognizing an input string x for a regular expression r:

    Automaton type | Space complexity | Time complexity
    ---------------+------------------+-----------------
    NFA            | O(|r|)           | O(|r| × |x|)
    DFA            | O(2^|r|)         | O(|x|)
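To make the O(|x|) bound for DFAs concrete, here is a minimal table-driven simulator. It is a sketch under assumptions: the transition table delta has one row per state and one column per character code, -1 marks the dead (error) state, and all names are illustrative.

    // Simulates a DFA on an input string x with one table lookup per character.
    final class DfaSim {
        private final int[][] delta;        // delta[state][character] = next state, -1 = dead
        private final boolean[] accepting;  // accepting[state] = is this a final state?
        private final int start;

        DfaSim(int[][] delta, boolean[] accepting, int start) {
            this.delta = delta;
            this.accepting = accepting;
            this.start = start;
        }

        boolean accepts(String x) {
            int state = start;
            for (int i = 0; i < x.length(); i++) {
                state = delta[state][x.charAt(i)];   // O(1) work per character
                if (state < 0) return false;         // fell into the error state
            }
            return accepting[state];
        }
    }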

Slide 8: Scanner Generators: JLex, Lex, FLex
A specification file has three sections separated by %%:

    user code
    %%
    JLex directives
    %%
    regular expression rules

- The user code section is copied directly to the output file; the user code at the top (from the parser generator) specifies what the tokens are
- The directives section contains macro (regular) definitions (e.g., digits = [0-9]+) and state names
- Each rule consists of an optional state list, a regular expression, and an action
- States can be mixed with regular expressions: for each regular expression we can define a set of states in which it is valid (JLex, Flex)
- Standard format of a regular expression rule:  regular_expression { actions }

Slide 9: JLex, FLex, Lex
Regular expression rules:

    r_1  { action_1 }
    r_2  { action_2 }
    ...
    r_n  { action_n }

The actions are Java code for JLex, C code for FLex and Lex.

[Diagram: a new start state s0 with ε-moves to the automaton A_{r_i} for each regular expression r_i; the final states of each A_{r_i} become new final states, and there is an error state. For faster scanning, convert this NFA to a DFA and minimize the states.]

Rules used by scanner generators:
1. Continue scanning the input until reaching an error state
2. Accept the longest prefix that matches a regular expression and execute the corresponding action
3. If two patterns match the longest prefix, execute the action that is specified earlier
4. After a match, go back to the end of the accepted prefix in the input and start scanning for the next token
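Rules 1 through 4 describe the usual longest-match ("maximal munch") loop. The following sketch is illustrative rather than generator output; it assumes a combined DFA whose accept table already records, for each final state, the earliest rule that accepts there (which is how ties are broken in favor of the rule specified first).

    // Scans the longest prefix of 'input' starting at pos[0] that some rule matches.
    // Returns the matching rule's number (or -1) and advances pos[0] past the match.
    final class MaximalMunch {
        int[][] delta;   // combined DFA transition table; -1 marks the error state
        int[]   accept;  // accept[s] = earliest rule accepting in state s, or -1
        int     start;   // start state of the combined DFA

        int nextToken(String input, int[] pos) {
            int state = start, i = pos[0];
            int lastRule = -1, lastEnd = pos[0];
            while (i < input.length()) {
                state = delta[state][input.charAt(i)];
                if (state < 0) break;            // rule 1: stop at the error state
                i++;
                if (accept[state] >= 0) {        // rules 2 and 3: remember the longest
                    lastRule = accept[state];    //   accepted prefix and its earliest rule
                    lastEnd = i;
                }
            }
            pos[0] = lastEnd;                    // rule 4: back up to the accepted prefix
            return lastRule;
        }
    }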

Slide 10: A Simple Example
Recognize the following tokens:

    if  = "if"
    Id  = [a-z][a-z0-9]*
    Num = [0-9]+

Also take care of one-line comments and white space:

    Comment    = \/\/.*
    WhiteSpace = [\ \t\f\b\r\n]

Slide 11:

    /* User code */
    import java.io.*;   // For FileInputStream and its exceptions.

    /* ========================================== */
    class Type {
        static final int IF  = 0;
        static final int ID  = 1;
        static final int NUM = 2;
        static final int EOF = 3;
    }

    class Token {
        public int type;
        public String attribute;

        public Token(int t)           { type = t; }
        public Token(int t, String s) { type = t; attribute = s; }

        public static String spellingOf(int t) {
            switch (t) {
                case Type.IF  : return "IF";
                case Type.ID  : return "ID";
                case Type.NUM : return "NUM";
                default       : return "Undefined token type";
            }
        }

        public String toString() {
            switch (type) {
                case Type.ID  :
                case Type.NUM : return spellingOf(type) + ", " + attribute;
                default       : return spellingOf(type);
            }
        }
    }

Slide 12:

    /* ================================================= */
    class Example {
        public static void main(String[] args)
                throws FileNotFoundException, IOException {
            FileInputStream fis = new FileInputStream(args[0]);
            Lexer L = new Lexer(fis);
            Token T = L.next();
            while (T.type != Type.EOF) {
                System.out.println(T);
                T = L.next();
            }
        }
    }
    /* ================================================= */
    %%
    /* JLex directives */
    %class Lexer
    %function next
    %type Token
    %eofval{
        return new Token(Type.EOF);
    %eofval}

    /* white space */
    WhiteSpace = [\ \t\f\b\r\n]

    /* comments */
    Comment = \/\/.*

    Id  = [a-z][a-z0-9]*
    Num = [0-9]+
    %%

    {WhiteSpace}  { }
    {Comment}     { }
    "if"          { return new Token(Type.IF); }
    {Id}          { return new Token(Type.ID, yytext()); }
    {Num}         { return new Token(Type.NUM, yytext()); }

Slide 13: If the above JLex specification is in a file simple.jlx, you can generate a scanner for that specification as follows:

    % cd
    % setenv CLASSPATH ".:/fs/cs-cls/cs160/lib"
    % java JLex.Main simple.jlx
    % javac simple.jlx.java
    % java Example input1

Slide 14: Sample runs of the generated scanner

Input (the commas match no rule):

    if i1 // this is a comment
    if var15 15 1, 2, 3

Output:

    IF
    ID, i1
    IF
    ID, var15
    NUM, 15
    NUM, 1
    Undefined token type
    NUM, 2
    Undefined token type
    NUM, 3

Input:

    if i1 // this is a comment
    if var15 15 1 2 4253

Output:

    IF
    ID, i1
    IF
    ID, var15
    NUM, 15
    NUM, 1
    NUM, 2
    NUM, 4253

Slide 15: Building Faster Scanners from the DFA
Table-driven recognizers waste a lot of effort. For every input character they:
- Read (and classify) the next character
- Find the next state
- Assign to the state variable
- Branch back to the top

    state = s0;
    string = ε;
    char = get_next_char();
    while (char != eof) {
        state = δ(state, char);
        string = string + char;
        char = get_next_char();
    }
    if (state in Final) then report acceptance;
    else report failure;

We can do better:
- Encode state and actions in the code
- Do transition tests locally
- Generate ugly, spaghetti-like code (that is OK; this is automatically generated code)
- Takes (many) fewer operations per input character

Slide 16: Building Faster Scanners from the DFA
A direct-coded recognizer for the register regular expression r Digit Digit*:
- Many fewer operations per character
- The state is encoded as the location in the code

         goto s0;
    s0:  string ← ε;
         char ← get_next_char();
         if (char = 'r') then goto s1;
         else goto se;
    s1:  string ← string + char;
         char ← get_next_char();
         if ('0' ≤ char ≤ '9') then goto s2;
         else goto se;
    s2:  string ← string + char;
         char ← get_next_char();
         if ('0' ≤ char ≤ '9') then goto s2;
         else if (char = eof) then report acceptance;
         else goto se;
    se:  print error message;
         return failure;
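Because the scanners generated in this course are Java, which has no goto, a direct-coded recognizer encodes each state as a region of straight-line code rather than a labeled block. The sketch below is an illustrative Java version of the same register recognizer (an 'r' followed by one or more digits), not generated output.

    // Direct-coded recognizer for the register pattern  r Digit Digit*
    final class Register {
        static boolean recognize(String in) {
            int i = 0;
            // state s0: expect 'r'
            if (i >= in.length() || in.charAt(i) != 'r') return false;   // -> se
            i++;
            // state s1: expect the first digit
            if (i >= in.length() || in.charAt(i) < '0' || in.charAt(i) > '9') return false;
            i++;
            // state s2: any number of further digits, then end of input
            while (i < in.length() && in.charAt(i) >= '0' && in.charAt(i) <= '9') i++;
            return i == in.length();   // accept only when the whole input is consumed (eof)
        }
    }

For example, recognize("r17") returns true, while recognize("r") and recognize("x5") return false.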

Slide 17: Building Faster Scanners
Hashing keywords versus encoding them directly:
- Some compilers recognize keywords as identifiers and then check them in a hash table
- Encoding the keywords in the DFA is a better idea
  - O(1) cost per transition
  - Avoids a hash lookup on each identifier
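For contrast, the hash-table approach mentioned in the first bullet might look like the sketch below, reusing the Type and Token classes from slide 11; the KEYWORDS map and the classify method are assumptions for illustration, not part of the course code.

    import java.util.Map;

    final class Keywords {
        // Every keyword of the language, mapped to its token type.
        private static final Map<String, Integer> KEYWORDS = Map.of("if", Type.IF);

        // Scan every word as an identifier, then do one hash lookup to see
        // whether its lexeme is really a keyword.
        static Token classify(String lexeme) {
            Integer kw = KEYWORDS.get(lexeme);
            return (kw != null) ? new Token(kw) : new Token(Type.ID, lexeme);
        }
    }

Encoding the keywords directly in the DFA avoids this per-identifier lookup entirely, which is why the slide prefers it.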

Slide 18: What is hard about Lexical Analysis?
Poor language design can complicate scanning:
- Reserved words are important
  - In PL/I there are no reserved keywords, so you can write a valid statement like:
    if then then then = else; else else = then
- Significant blanks
  - In Fortran, blanks are not significant:
    do 10 i = 1,25   is a do loop
    do 10 i = 1.25   is an assignment to a variable named do10i
- Closures
  - A limited identifier length adds states to the automata to count length

Slide 19: What can be so hard? (Fortran 66/77)
How does a compiler handle this?
- The first pass finds and inserts blanks
- It can add extra words or tags to create a scannable language
- The second pass is a normal scanner

[The slide shows a Fortran 66/77 code fragment (not reproduced in this transcript) annotated with the features that make it hard to scan: macro definitions ("first A and B are converted to (6-2)"), a statement declaring that variables beginning with A and B are of a four-character string data type, the literal constant ")=(3", an assignment to a variable named DO9E1, an assignment to an array element, one statement split across two lines, an integer function, and a statement for formatting input and output.]

