1 CMPSC 160 Translation of Programming Languages, Fall 2002
Slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon
Lecture-Module #4: Lexical Analysis

2 Announcements
- Programming assignment 1 will be on the class webpage
  - Due next Tuesday, October 15th (easy)
- Homework 1 is due now
- Lecture notes will be available on the class webpage

3 First Phase: Lexical Analysis (Scanning)

The scanner maps a stream of characters into tokens, the basic units of syntax:
- The characters that form a word are its lexeme
- Its syntactic category is called its token
- The scanner discards white space and comments
- The scanner works as a subroutine of the parser

[Diagram: source code → Scanner → Parser → IR; the parser repeatedly asks the scanner to "get next token" and receives a token; both phases report errors]
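Because the scanner is a subroutine of the parser, the whole interface can be as small as one call. A minimal sketch in Java; the names (Scanner, nextToken) are illustrative assumptions, and Token is the class defined in the example later in this lecture:

    // The parser pulls tokens one at a time; the scanner consumes
    // characters, skipping white space and comments along the way.
    interface Scanner {
        Token nextToken();   // the parser's "get next token" request
    }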

4 Lexical Analysis

- Specify tokens using regular expressions
- Translate the regular expressions to finite automata
- Use the finite automata to generate tables or code for the scanner

[Diagram: specifications (regular expressions) → Scanner Generator → tables or code; source code → Scanner → tokens]

5 Automating Scanner Construction

To build a scanner:
1. Write down the RE that specifies the tokens
2. Translate the RE to an NFA
3. Build the DFA that simulates the NFA
4. Systematically shrink the DFA
5. Turn it into code or a table

Scanner generators (Lex, Flex, JLex) work along these lines. The algorithms are well-known and well-understood; the interface to the parser is important.

6 Automating Scanner Construction

- RE → NFA (Thompson's construction): build an NFA for each term and combine them with ε-moves
- NFA → DFA (subset construction): build the simulation
- DFA → minimal DFA (Hopcroft's algorithm)
- DFA → RE (an all-pairs, all-paths problem): union together the paths from s0 to a final state

The cycle of constructions: RE → NFA → DFA → minimal DFA
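As a concrete illustration of the NFA → DFA step, here is a minimal Java sketch of the subset construction. The NFA encoding (integer states, a reserved character for ε-moves, nested maps for the transition relation) is an assumption made for illustration, not part of the slides:

    import java.util.*;

    class SubsetConstruction {
        static final char EPS = 0;  // reserved character marking epsilon-moves

        // delta.get(s).get(c) = set of NFA states reachable from s on c
        final Map<Integer, Map<Character, Set<Integer>>> delta = new HashMap<>();

        // epsilon-closure: every state reachable through epsilon-moves alone
        Set<Integer> eclosure(Set<Integer> states) {
            Set<Integer> closure = new HashSet<>(states);
            Deque<Integer> work = new ArrayDeque<>(states);
            while (!work.isEmpty()) {
                int s = work.pop();
                for (int t : delta.getOrDefault(s, Map.of())
                                  .getOrDefault(EPS, Set.of()))
                    if (closure.add(t)) work.push(t);
            }
            return closure;
        }

        // one-symbol move applied to a set of NFA states
        Set<Integer> move(Set<Integer> states, char c) {
            Set<Integer> out = new HashSet<>();
            for (int s : states)
                out.addAll(delta.getOrDefault(s, Map.of())
                                .getOrDefault(c, Set.of()));
            return out;
        }

        // Each reachable set of NFA states becomes one DFA state.
        Map<Set<Integer>, Map<Character, Set<Integer>>> toDfa(
                int start, Set<Character> alphabet) {
            Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
            Deque<Set<Integer>> work = new ArrayDeque<>();
            work.push(eclosure(Set.of(start)));
            while (!work.isEmpty()) {
                Set<Integer> d = work.pop();
                if (dfa.containsKey(d)) continue;   // already processed
                Map<Character, Set<Integer>> row = new HashMap<>();
                dfa.put(d, row);
                for (char c : alphabet) {
                    Set<Integer> next = eclosure(move(d, c));
                    if (!next.isEmpty()) { row.put(c, next); work.push(next); }
                }
            }
            return dfa;
        }
    }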

7 NFA vs. DFA Scanners

- Given a regular expression r, we can convert it to an NFA of size O(|r|)
- Given an NFA, we can convert it to a DFA of size O(2^|r|)
- We can simulate a DFA on a string x in O(|x|) time
- We can simulate an NFA N (constructed by Thompson's construction) on a string x in O(|N| × |x|) time

Recognizing input string x for regular expression r:

    Automaton type    Space complexity    Time complexity
    NFA               O(|r|)              O(|r| × |x|)
    DFA               O(2^|r|)            O(|x|)
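The O(|x|) bound in the DFA row is just one table lookup per input character. A minimal sketch, assuming an illustrative encoding in which the transition table is indexed by state and character and -1 denotes the error state:

    // Simulate a DFA on input x in O(|x|) time: one lookup per character.
    // delta[state][ch] gives the next state, or -1 for the error state.
    static boolean simulateDfa(int[][] delta, boolean[] accepting, String x) {
        int state = 0;                             // s0 is the start state
        for (int i = 0; i < x.length(); i++) {
            state = delta[state][x.charAt(i)];     // constant-time transition
            if (state < 0) return false;           // fell into the error state
        }
        return accepting[state];                   // accept iff final at eof
    }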

8 Scanner Generators: JLex, Lex, FLex

A specification file has three sections, separated by %%:

    user code                    (copied directly to the output file)
    %%
    JLex directives              (macro (regular expression) definitions, e.g. digits = [0-9]+, and state names)
    %%
    regular expression rules     (each rule: optional state list, regular expression, action)

- The user code at the top (from the parser generator) specifies what the tokens are
- States can be mixed with regular expressions: for each regular expression we can define a set of states where it is valid (JLex, Flex)
- Standard format of a regular expression rule: regular_expression { actions }

9 JLex, FLex, Lex

Regular expression rules have the form

    r_1 { action_1 }
    r_2 { action_2 }
    ...
    r_n { action_n }

where the actions are Java code for JLex and C code for FLex and Lex.

[Diagram: a new start state s0 with ε-moves to an automaton A_{r_i} for each regular expression r_i, plus new final states and an error state. For faster scanning, this NFA is converted to a DFA and its states are minimized.]

Rules used by scanner generators:
1. Continue scanning the input until reaching an error state
2. Accept the longest prefix that matches a regular expression, and execute the corresponding action
3. If two patterns match the longest prefix, the action specified earlier is executed
4. After a match, go back to the end of the accepted prefix in the input and start scanning for the next token

A sketch of this longest-match loop follows.
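Rules 1, 2, and 4 together are the classic maximal-munch loop: run the automaton until it blocks, remember the most recent accepting position, and roll back to it. A minimal Java sketch, assuming an illustrative DFA table and an accepting[] array that maps each accepting state to the index of its earliest matching rule (which encodes rule 3):

    class MaxMunchScanner {
        static final int START = 0, ERROR = -1;
        final int[][] dfa;      // dfa[state][ch] -> next state, or ERROR
        final int[] accepting;  // accepting[state] -> rule index, or -1
        final String input;
        int tokenStart = 0;     // start of the not-yet-scanned input

        MaxMunchScanner(int[][] dfa, int[] accepting, String input) {
            this.dfa = dfa; this.accepting = accepting; this.input = input;
        }

        // Returns the rule index of the longest match; the caller then
        // executes the corresponding action with the lexeme.
        int next(StringBuilder lexeme) {
            int state = START, lastRule = -1, lastPos = -1, pos = tokenStart;
            while (pos < input.length()
                    && dfa[state][input.charAt(pos)] != ERROR) {   // rule 1
                state = dfa[state][input.charAt(pos)];
                pos++;
                if (accepting[state] >= 0) {       // remember the most recent
                    lastRule = accepting[state];   // accept (rules 2 and 3)
                    lastPos = pos;
                }
            }
            if (lastRule < 0) throw new IllegalStateException("lexical error");
            lexeme.append(input, tokenStart, lastPos);
            tokenStart = lastPos;   // rule 4: resume just past the prefix
            return lastRule;
        }
    }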

10 A Simple Example

Recognize the following tokens:

    Id  = [a-z][a-z0-9]*
    Num = [0-9]+
    if  = "if"

Also take care of one-line comments and white space:

    WhiteSpace = [\ \t\f\b\r\n]
    Comment    = \/\/.*

11 /* User code */
import java.io.*; // For FileInputStream and its exceptions.

/* ========================================== */
class Type {
    static final int IF  = 0;
    static final int ID  = 1;
    static final int NUM = 2;
    static final int EOF = 3;
}

class Token {
    public int type;
    public String attribute;

    public Token(int t) { type = t; }
    public Token(int t, String s) { type = t; attribute = s; }

    public static String spellingOf(int t) {
        switch (t) {
            case Type.IF  : return "IF";
            case Type.ID  : return "ID";
            case Type.NUM : return "NUM";
            default       : return "Undefined token type";
        }
    }

    public String toString() {
        switch (type) {
            case Type.ID  :
            case Type.NUM : return spellingOf(type) + ", " + attribute;
            default       : return spellingOf(type);
        }
    }
}

12 /* ================================================= */
class Example {
    public static void main(String[] args)
            throws FileNotFoundException, IOException {
        FileInputStream fis = new FileInputStream(args[0]);
        Lexer L = new Lexer(fis);
        Token T = L.next();
        while (T.type != Type.EOF) {
            System.out.println(T);
            T = L.next();
        }
    }
}

/* ================================================ */
%%
/* JLex directives */
%class Lexer
%function next
%type Token
%eofval{
    return new Token(Type.EOF);
%eofval}

/* white space */
WhiteSpace = [\ \t\f\b\r\n]

/* comments */
Comment = \/\/.*

Id  = [a-z][a-z0-9]*
Num = [0-9]+

%%

{WhiteSpace}  { }
{Comment}     { }
"if"          { return new Token(Type.IF); }
{Id}          { return new Token(Type.ID, yytext()); }
{Num}         { return new Token(Type.NUM, yytext()); }

13 If the above JLex specification is in a file simple.jlx, you can generate and run a scanner for it as follows:

    % cd
    % setenv CLASSPATH ".:/fs/cs-cls/cs160/lib"
    % java JLex.Main simple.jlx
    % javac simple.jlx.java
    % java Example input1

14 Sample runs. In the first input the commas are not among the specified tokens, which produces the "Undefined token type" message; the second input contains only recognized tokens.

Input:
    if i1 // this is a comment
    if var15 15
    1, 2, 3

Output:
    IF
    ID, i1
    IF
    ID, var15
    NUM, 15
    NUM, 1
    Undefined token type
    NUM, 2
    Undefined token type
    NUM, 3

Input:
    if i1 // this is a comment
    if var15 15
    1 2 4253

Output:
    IF
    ID, i1
    IF
    ID, var15
    NUM, 15
    NUM, 1
    NUM, 2
    NUM, 4253

15 Building Faster Scanners from the DFA

Table-driven recognizers waste a lot of effort per character:
- Read (and classify) the next character
- Find the next state
- Assign to the state variable
- Branch back to the top of the loop

    state = s0;
    string = ε;
    char = get_next_char();
    while (char != eof) {
        state = δ(state, char);
        string = string + char;
        char = get_next_char();
    }
    if (state in Final) then report acceptance;
    else report failure;

We can do better:
- Encode state and actions in the code
- Do transition tests locally
- Generate ugly, spaghetti-like code (that is OK, since it is automatically generated)
- Takes (many) fewer operations per input character

16 Building Faster Scanners from the DFA

A direct-coded recognizer for the register regular expression r Digit Digit*:
- Many fewer operations per character
- The state is encoded as the location in the code

    goto s0;
    s0: string ← ε;
        char ← get_next_char();
        if (char = 'r') then goto s1;
        else goto se;
    s1: string ← string + char;
        char ← get_next_char();
        if ('0' ≤ char ≤ '9') then goto s2;
        else goto se;
    s2: string ← string + char;
        char ← get_next_char();
        if ('0' ≤ char ≤ '9') then goto s2;
        else if (char = eof) then report acceptance;
        else goto se;
    se: print error message;
        return failure;
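For comparison, here is a minimal Java rendering of the same recognizer. Java has no goto, so each state becomes a position in straight-line code; the method name is an illustrative assumption:

    // Direct-coded recognizer for r Digit Digit*: no table lookups,
    // the current DFA state is simply where we are in the code.
    static boolean recognizeRegister(String in) {
        int i = 0;
        // state s0: expect 'r'
        if (i >= in.length() || in.charAt(i) != 'r') return false;   // se
        i++;
        // state s1: expect the first digit
        if (i >= in.length() || in.charAt(i) < '0' || in.charAt(i) > '9')
            return false;                                            // se
        i++;
        // state s2: consume any remaining digits
        while (i < in.length() && in.charAt(i) >= '0' && in.charAt(i) <= '9')
            i++;
        return i == in.length();   // accept iff we reached eof in s2
    }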

17 Building Faster Scanners

Hashing keywords versus encoding them directly:
- Some compilers recognize keywords as identifiers and check them in a hash table
- Encoding keywords in the DFA is a better idea:
  - O(1) cost per transition
  - Avoids a hash lookup on each identifier
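A minimal sketch of the hash-table alternative, reusing the Token and Type classes from the JLex example above; the KeywordFilter name and the KEYWORDS map are illustrative assumptions:

    import java.util.Map;

    // Scan keywords as identifiers, then probe a table afterwards.
    // Encoding keywords in the DFA avoids this per-identifier lookup.
    class KeywordFilter {
        static final Map<String, Integer> KEYWORDS = Map.of("if", Type.IF);

        static Token classify(String lexeme) {
            Integer kw = KEYWORDS.get(lexeme);   // a lookup on every identifier
            return (kw != null) ? new Token(kw)
                                : new Token(Type.ID, lexeme);
        }
    }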

18 What is hard about Lexical Analysis?

Poor language design can complicate scanning.

Reserved words are important:
- PL/I has no reserved keywords, so you can write a valid statement like:
      if then then then = else; else else = then

Significant blanks:
- In Fortran, blanks are not significant:
      do 10 i = 1,25    (a do loop)
      do 10 i = 1.25    (an assignment to a variable named do10i)

Closures:
- A limited identifier length adds states to the automaton, to count length

19 What can be so hard? (Fortran 66/77)

How does a compiler do this?
- A first pass finds and inserts blanks, and can add extra words or tags to create a scannable language
- The second pass is a normal scanner

[The slide's Fortran code examples did not survive transcription; the surviving annotations read: "macro definitions"; "First A and B are converted to (6-2)"; "this statement declares that variables that begin with A and B are of data-type four-character string"; ")=(3 is a literal constant"; "assigns value to variable DO9E1"; "assigns value to array element"; "one statement split into two lines"; "integer function A"; "a statement for formatting input, output".]