Lecture 2 Lexical Analysis Joey Paquet, 2000, 2002, 2012.

Part I: Building a Lexical Analyzer

Roles of the Scanner

- Removal of comments: comments are not part of the program's meaning. Multiple-line comments? Nested comments?
- Case conversion: is the lexical definition case sensitive? For identifiers? For keywords?

Roles of the Scanner

- Removal of white space: blanks, tabs, carriage returns. Is it possible to identify tokens in a program without spaces?
- Interpretation of compiler directives: #include, #ifdef, #ifndef, and #define are directives to "redirect the input" of the compiler; this may be done by a pre-compiler.

Roles of the Scanner

- Communication with the symbol table: a symbol table entry is created when an identifier is encountered; the lexical analyzer cannot create the whole entry by itself.
- Conversion of the input file to a token stream: the input file is a character stream; the lexical convention defines literals, operators, keywords, and punctuation.

Tokens and Lexemes

Token: an element of the lexical definition of the language.
Lexeme: a sequence of characters identified as a token.

Token      Lexemes
id         distance, rate, time, a, x
relop      >=, <, ==
openpar    (
if         if
then       then
assignop   =
semi       ;

Design of a Lexical Analyzer: Steps

1. Construct a set of regular expressions (REs) that define the form of any valid token
2. Derive an NDFA from the REs
3. Derive a DFA from the NDFA
4. Translate the DFA to a state transition table
5. Implement the table
6. Implement the algorithm that interprets the table

Regular Expressions

∅ : { }
s : { s | s ∈ ŝ }
a : { a }
r | s : { r | r ∈ r̂ } ∪ { s | s ∈ ŝ }
s* : { sⁿ | s ∈ ŝ and n ≥ 0 }
s+ : { sⁿ | s ∈ ŝ and n ≥ 1 }

Example: id ::= letter(letter|digit)*
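As a quick check, the id definition above can be exercised with Python's re module (a convenience for experimenting, not part of the scanner design; [A-Za-z] and [0-9] stand in for the letter and digit classes):

```python
import re

# id ::= letter(letter|digit)*
ID = re.compile(r'[A-Za-z][A-Za-z0-9]*')

print(bool(ID.fullmatch("rate")))  # True
print(bool(ID.fullmatch("x25")))   # True
print(bool(ID.fullmatch("2x")))    # False: must start with a letter
```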

Derive NDFA from REs

- One could derive a DFA directly from the REs, but it is much easier to derive an NDFA first, then derive the DFA from it; there is no standard way of deriving DFAs directly from REs.
- Use Thompson's construction (cf. Louden).

(NFA for id: a letter transition, then an ε transition into a loop on letter and digit.)

Derive DFA from NDFA

- Use subset construction (cf. Louden); the resulting DFA may then be optimized.
- The DFA is easier to implement: no ε edges, and deterministic (no backtracking).

(DFA for id: a letter transition into a state looping on letter and digit, with an [other] transition into the final state.)
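Subset construction is mechanical enough to sketch directly. Below is a minimal Python version run on a hand-coded ε-NFA for id (the state numbering and edges are illustrative assumptions, not Louden's exact construction):

```python
# Hand-coded epsilon-NFA for id ::= letter(letter|digit)*; edges labelled ''
# are epsilon moves. States 0-2 are an illustrative numbering.
nfa = {
    (0, 'letter'): {1},
    (1, ''): {2},          # epsilon into the repetition loop
    (2, 'letter'): {2},
    (2, 'digit'): {2},
}
nfa_final = {1, 2}

def eps_closure(states):
    """All NFA states reachable from 'states' via epsilon edges."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, ''), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def subset_construction(start):
    """Build DFA transitions; each DFA state is a set of NFA states."""
    start_set = eps_closure({start})
    dfa, states, worklist = {}, {start_set}, [start_set]
    while worklist:
        S = worklist.pop()
        for sym in ('letter', 'digit'):
            moved = set()
            for s in S:
                moved |= nfa.get((s, sym), set())
            if moved:
                T = eps_closure(moved)
                dfa[(S, sym)] = T
                if T not in states:
                    states.add(T)
                    worklist.append(T)
    return dfa, states

dfa, states = subset_construction(0)
dfa_final = {S for S in states if S & nfa_final}
print(len(states), len(dfa_final))  # 3 DFA states, 2 of them final
```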

Generate State Transition Table

State transition table for the identifier DFA ([other] is any character that is neither a letter nor a digit):

state   letter   digit   [other]   final
0       1        —       —         N
1       1        1       2         N
2       —        —       —         Y
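The table can be transcribed into a data structure and driven by a generic loop. A minimal Python rendering of the three-state identifier table (the character classification helper is an assumption for illustration):

```python
# States 0-2 from the identifier table; missing entries are errors.
table = {
    0: {'letter': 1},
    1: {'letter': 1, 'digit': 1, 'other': 2},
}
final = {2}

def classify(c):
    if c.isalpha():
        return 'letter'
    if c.isdigit():
        return 'digit'
    return 'other'

def is_identifier(s):
    """Run the table on s plus a sentinel blank; True if a final state is reached."""
    state = 0
    for c in s + ' ':
        state = table.get(state, {}).get(classify(c))
        if state is None:
            return False
        if state in final:
            return True
    return False

print(is_identifier("abc1"))  # True
print(is_identifier("1abc"))  # False
```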

Implementation Concerns: Backtracking

Principle: a token is normally recognized only when the next character is read.
Problem: that character may be part of the next token.
Example: in x<1, "<" is recognized only when "1" is read; in this case, we have to back up one character to continue token recognition.
The occurrence of these cases can be encoded in the state transition table.

Implementation Concerns: Ambiguity

Problem: some tokens' lexemes are subsets of other tokens' lexemes.
Example: n-1. Is it <n><-><1> or <n><-1>?
Solutions:
- Postpone the decision to the syntactic analyzer.
- Do not allow a sign prefix on numbers in the lexical specification.
- Interact with the syntactic analyzer to find a solution (induces coupling).

Example

Alphabet: {:, *, =, (, ), <, >, {, }, [a..z], [0..9]}
Simple tokens: {(, ), {, }, :, <, >}
Composite tokens: {:=, >=, <=, <>, (*, *)}
Words:
  id ::= letter(letter | digit)*
  num ::= digit*

Example

Backtracking: must back up one character when we read a character that is part of the next token; these occurrences are coded in the table.
Ambiguity problems:

Character   Possible tokens
:           :, :=
>           >, >=
<           <, <=, <>
(           (, (*
*           *, *)

(State transition diagram, states 1–20: from the start state, l (letter) enters the id loop on l and d, d (digit) enters the num loop, { … } and (* … *) enter the comment states, and :, <, > branch on a following =, or > for <>, to form :=, <=, <>, >=. Final states are marked, some of them with backtracking.)

Table-driven Scanner (Table)

(State transition table: rows are states 1–20; columns are the character classes l, d, {, }, (, *, ), :, =, <, >, sp, plus a final/backup flag. Final states include 3 [id, backup], 5 [num, backup], 7 [{…} comment], 11 [(*…*) comment], 13 [:=], 15 [<=], 16 [<>], 18 [>=], an [error] state, and [various] single-character tokens.)

Table-driven Scanner (Algorithm)

nextToken()
    state = 0
    token = null
    do
        lookup = nextChar()
        state = Table(state, lookup)
        if ( isFinalState(state) )
            token = createToken()
            if ( Table(state, "backup") == yes )
                backupChar()
    until ( token != null )
    return (token)
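The algorithm above can be made concrete in a few lines of Python. This sketch covers only the id and num rows of the table; the state numbers, the trailing-blank sentinel, and the error handling are illustrative assumptions:

```python
ERR = None  # error transitions are represented as missing entries

# table[state][character class] -> next state; final maps a final state
# to its token type and whether one character must be backed up.
table = {
    0: {'letter': 1, 'digit': 3},
    1: {'letter': 1, 'digit': 1, 'other': 2},  # scanning an id
    3: {'digit': 3, 'other': 4},               # scanning a num
}
final = {2: ('id', True), 4: ('num', True)}

def classify(c):
    if c.isalpha():
        return 'letter'
    if c.isdigit():
        return 'digit'
    return 'other'

def next_token(src, pos):
    """Return (token type, lexeme, new position); src needs a trailing blank."""
    state, start = 0, pos
    while state not in final:
        c = src[pos]
        pos += 1
        state = table.get(state, {}).get(classify(c), ERR)
        if state is ERR:
            return ('error', src[start:pos], pos)
    kind, backup = final[state]
    if backup:
        pos -= 1  # the last character read belongs to the next token
    return (kind, src[start:pos], pos)

print(next_token("x25 ", 0))  # ('id', 'x25', 3)
print(next_token("123 ", 0))  # ('num', '123', 3)
```

Note how the backup flag stored with the final state realizes the backtracking described earlier: the character that ended the token is pushed back for the next call.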

Table-driven Scanner

nextToken() : extracts the next token in the program (called by the syntactic analyzer)
nextChar() : reads the next character in the input program
backupChar() : backs up one character in the input file

Table-driven Scanner

isFinalState(state) : returns TRUE if state is a final state
table(state, column) : returns the value corresponding to [state, column] in the state transition table
createToken() : creates and returns a structure that contains the token type, its location in the source code, and its value (for literals)

Hand-written Scanner

nextToken()
    c = nextChar()
    case (c) of
        "[a..z],[A..Z]":
            while ( c in {[a..z],[A..Z],[0..9]} ) do
                c = nextChar()
            s = makeUpString()
            if ( isReservedWord(s) ) then
                token = createToken(RESWORD,null)
            else
                token = createToken(ID,s)
            backupChar()
        "[0..9]":
            while ( c in [0..9] ) do
                c = nextChar()
            v = makeUpValue()
            token = createToken(NUM,v)
            backupChar()

Hand-written Scanner

        "{":
            c = nextChar()
            while ( c != "}" ) do
                c = nextChar()
        "(":
            c = nextChar()
            if ( c == "*" ) then
                repeat
                    while ( c != "*" ) do
                        c = nextChar()
                    c = nextChar()
                until ( c == ")" )
            else
                token = createToken(LPAR,null)
                backupChar()
        ":":
            c = nextChar()
            if ( c == "=" ) then
                token = createToken(ASSIGNOP,null)
            else
                token = createToken(COLON,null)
                backupChar()

Hand-written Scanner

        "<":
            c = nextChar()
            if ( c == "=" ) then
                token = createToken(LEQ,null)
            else if ( c == ">" ) then
                token = createToken(NEQ,null)
            else
                token = createToken(LT,null)
                backupChar()
        ">":
            c = nextChar()
            if ( c == "=" ) then
                token = createToken(GEQ,null)
            else
                token = createToken(GT,null)
                backupChar()
        ")": token = createToken(RPAR,null)
        "*": token = createToken(STAR,null)
        "=": token = createToken(EQ,null)
    end case
    return token
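The same case-by-case style translates directly into a host language. A minimal Python fragment covering only identifiers and the <, <=, <> family (token names follow the pseudocode; everything else is an illustrative assumption):

```python
def next_token(src, pos):
    """Hand-written recognition for a small subset of the example language."""
    c = src[pos]
    pos += 1
    if c.isalpha():                      # id ::= letter(letter|digit)*
        start = pos - 1
        while pos < len(src) and src[pos].isalnum():
            pos += 1
        return ('ID', src[start:pos], pos)
    if c == '<':                         # <, <= or <> decided by one lookahead
        if pos < len(src) and src[pos] == '=':
            return ('LEQ', '<=', pos + 1)
        if pos < len(src) and src[pos] == '>':
            return ('NEQ', '<>', pos + 1)
        return ('LT', '<', pos)          # lookahead not consumed: implicit backup
    return ('ERROR', c, pos)

print(next_token("x<y", 1))  # ('LT', '<', 2)
print(next_token("<>", 0))   # ('NEQ', '<>', 2)
```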

Part II: Error Recovery in Lexical Analysis

Possible Lexical Errors

Depends on the accepted conventions:
- invalid character
- letter not allowed to terminate a number
- numerical overflow
- identifier too long
- end of line before end of string

Are these lexical errors?

Accepted or Not?

- 123a : <Error>, or <num><id>?
- 123456789012345678901234567 : <Error> related to the machine's limitations
- "Hello world <CR> : either the <CR> is skipped, or <Error>
- ThisIsAVeryLongVariableName = 1 : should identifier length be limited?

Error Recovery Techniques

- Finding only the first error is not acceptable.
- Panic mode: skip characters until a valid character is read.
- Guess mode: do pattern matching between erroneous strings and valid strings (e.g., beggin vs. begin). Rarely implemented.
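Panic mode is simple to sketch. In this hedged Python version, the set of characters that may start a token is an assumption taken from the example language's alphabet:

```python
import string

# Characters that may start a token in the example language (an assumption).
VALID_START = set(string.ascii_lowercase + string.digits + "(){}:<>=* ")

def skip_invalid(src, pos):
    """Panic mode: skip characters until one that can start a token."""
    start = pos
    while pos < len(src) and src[pos] not in VALID_START:
        pos += 1
    return src[start:pos], pos  # (skipped characters to report, resume position)

print(skip_invalid("@#x := 1", 0))  # ('@#', 2)
```

Reporting the skipped characters, rather than silently discarding them, lets the scanner emit one error message and continue looking for further errors.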

Conclusions

Possible Implementations

- Lexical analyzer generator (e.g., Lex): safe and quick, but one must learn the tool, and it may be unable to handle unusual situations.
- Table-driven lexical analyzer: a general and adaptable method; the same driver function can be used for all table-driven lexical analyzers, but building the transition table can be tedious and error-prone.

Possible Implementations

- Hand-written: can be optimized, can handle any unusual situation, and is easy to build for most languages, but it is error-prone and not adaptable or maintainable.

Lexical Analyzer's Modularity

Why should the lexical analyzer and the syntactic analyzer be separated?
- Modularity/maintainability: the system is more modular, thus more maintainable.
- Efficiency: modularity means task specialization, which allows easier optimization.
- Reusability: the whole lexical analyzer can be changed without changing other parts.