Lexical Analysis Natawut Nupairoj, Ph.D. Department of Computer Engineering Chulalongkorn University

Outline
- Overview
- Token, Lexeme, and Pattern
- Lexical Analysis Specification
- Lexical Analysis Engine

Front-End Components
The front end is a pipeline (diagram rendered as text):
  Source program (text stream, e.g. "m a i n ( ) {")
    → Scanner: groups characters into tokens (e.g. the identifier "main", the symbol "(") and returns the next token on request.
    → Parser: repeatedly requests next-token and constructs the parse tree.
    → Semantic Analyzer: checks semantic/contextual rules.
    → Intermediate Representation (file or in memory).
  All phases share the Symbol Table.

Tasks for Scanner
- Read the input and group characters into tokens for the parser.
- Strip comments and white space.
- Count line numbers.
- Create entries in the symbol table.
- Perform preprocessing functions.

Benefits
- Simpler design: the parser does not have to worry about comments and white space.
- More efficient scanner: the scanning process can be optimized on its own, using specialized buffering techniques.
- Portability: input symbols that vary across platforms are handled only in the scanner.

Basic Terminology
- Token: a set of strings. Ex: token = identifier.
- Lexeme: a sequence of characters in the source program matched by the pattern for a token. Ex: lexeme = counter.

Basic Terminology
- Pattern: a description of the strings that can belong to a particular token set. Ex: the pattern for identifier is a letter followed by letters or digits, i.e. {A,…,Z,a,…,z}{A,…,Z,a,…,z,0,…,9}*.

Token / Lexeme / Pattern examples:

  Token      Lexeme               Pattern
  const      const                const
  if         if                   if
  relation   <, <=, …, >=         comparison symbols
  id         counter, x, y        letter (letter | digit)*
  num        12.53, 1.42E-10      any numeric constant
  literal    "Hello World"        characters between " "

Language and Lexical Analysis
- Fixed-format input, e.g. FORTRAN: the alignment of a lexeme matters, which makes scanning difficult.
- No reserved words, e.g. PL/I: telling keywords from identifiers requires complex rules, as in
    if if = then then then := else; else else := then;

Regular Expression Revisited
- ε is a regular expression that denotes {ε}.
- If a is a symbol of the alphabet, then a is a regular expression that denotes {a}.
- Suppose r and s are regular expressions; then:
  - (r)|(s) is a regular expression denoting L(r) ∪ L(s).
  - (r)(s) is a regular expression denoting L(r)L(s).
  - (r)* is a regular expression denoting (L(r))*.

Precedence of Operators
Levels of precedence, from highest to lowest:
  1. Kleene closure (*)
  2. Concatenation
  3. Union (|)
All operators are left associative. Ex: a*b | cd* = ((a*)b) | (c(d*))

Regular Definition
A sequence of definitions
  d1 → r1
  d2 → r2
  ...
  dn → rn
where each di is a distinct name and each ri is a regular expression over Σ ∪ {d1, …, di-1}.

Examples
  letter → A | B | … | Z | a | b | … | z
  digit → 0 | 1 | … | 9
  id → letter ( letter | digit )*
  digits → digit digit*
  opt_fraction → . digits | ε
  opt_exponent → ( E ( + | - | ε ) digits ) | ε
  num → digits opt_fraction opt_exponent

Notational Shorthands
- One or more instances: r+ = rr*
- Zero or one instance: r? = r | ε, e.g. (rs)? = rs | ε
- Character class: [A-Za-z] = A | B | … | Z | a | b | … | z

Examples
  digit → [0-9]
  digits → digit+
  opt_fraction → ( . digits )?
  opt_exponent → ( E ( + | - )? digits )?
  num → digits opt_fraction opt_exponent
  id → [A-Za-z][A-Za-z0-9]*
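
As an aside (not part of the original slides), the num and id definitions above map directly onto POSIX extended regular expressions, so they can be checked mechanically. A minimal C sketch, assuming POSIX <regex.h>; NUM_RE, ID_RE, and matches() are illustrative names:

#include <regex.h>
#include <stdio.h>

/* num -> digits ( . digits )? ( E (+|-)? digits )?   id -> [A-Za-z][A-Za-z0-9]* */
static const char *NUM_RE = "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$";
static const char *ID_RE  = "^[A-Za-z][A-Za-z0-9]*$";

/* Return 1 if text matches pattern as a POSIX extended regular expression. */
static int matches(const char *pattern, const char *text) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) return 0;
    int ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}

int main(void) {
    const char *samples[] = { "31", "31.02", "31.02E-15", "counter", "2x" };
    for (int i = 0; i < 5; i++)
        printf("%-10s  num? %d  id? %d\n", samples[i],
               matches(NUM_RE, samples[i]), matches(ID_RE, samples[i]));
    return 0;
}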

Recognition of Tokens
- Consider the tokens from the grammar, together with their patterns and attributes.
- Draw transition diagrams (NFAs), marking the states that must retract the input pointer.

Example: Grammar
  stmt ::= if expr then stmt | if expr then stmt else stmt | expr
  expr ::= term relop term | term
  term ::= id | num

Example: Regular Definition
  if → if
  then → then
  else → else
  relop → < | <= | = | <> | > | >=
  id → letter ( letter | digit )*
  num → digit+ ( . digit+ )? ( E (+ | -)? digit+ )?
  delim → blank | tab | newline
  ws → delim+

Example: Pattern - Token - Attribute

  Regular Expression   Token   Attribute-Value
  ws                   -       -
  if                   if      -
  then                 then    -
  else                 else    -
  id                   id      index in symbol table
  num                  num     numeric value
  <                    relop   LT
  <=                   relop   LE
  =                    relop   EQ
  <>                   relop   NE
  ...                  ...     ...

Attributes for Tokens
For the input "if count >= 0 then ...", the scanner produces:
  <if, >
  <id, index for count in symbol table>
  <relop, GE>
  <num, integer value 0>
  <then, >
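
To make the <token, attribute> pairs concrete, the following is a minimal C sketch (not from the slides; TokenType, RelopValue, and Token are names assumed for illustration) of one way a scanner can represent them:

#include <stdio.h>

/* Illustrative token classes from the example grammar. */
typedef enum { TK_IF, TK_THEN, TK_ELSE, TK_ID, TK_NUM, TK_RELOP } TokenType;

/* Attribute values for the relop token. */
typedef enum { LT, LE, EQ, NE, GT, GE } RelopValue;

/* A token is a <type, attribute> pair; the attribute's meaning depends on
   the type: symbol-table index for TK_ID, numeric value for TK_NUM,
   a RelopValue for TK_RELOP, and unused for keywords. */
typedef struct {
    TokenType type;
    union {
        int sym_index;    /* TK_ID: index in the symbol table     */
        double value;     /* TK_NUM: numeric value of the lexeme  */
        RelopValue op;    /* TK_RELOP: which comparison operator  */
    } attr;
} Token;

int main(void) {
    /* "count >= 0" yields <id, index>, <relop, GE>, <num, 0>. */
    Token t1 = { TK_ID,    { .sym_index = 4 } };
    Token t2 = { TK_RELOP, { .op = GE } };
    Token t3 = { TK_NUM,   { .value = 0 } };
    printf("%d %d %g\n", t1.attr.sym_index, t2.attr.op, t3.attr.value);
    return 0;
}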

NFA – Lexical Analysis Engine
Transition diagram for relop (rendered as text; * marks states that retract the input pointer):
  start -- '<' --> state 1
  state 1 -- '=' --> state 2: return(relop, LE)
  state 1 -- '>' --> state 3: return(relop, NE)
  state 1 -- other --> state 4*: return(relop, LT)
  start -- '=' --> state 5: return(relop, EQ)
  start -- '>' --> state 6
  state 6 -- '=' --> state 7: return(relop, GE)
  state 6 -- other --> state 8*: return(relop, GT)
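
A transition diagram like this translates almost mechanically into code: one branch per state, with lookahead for the retracting states. A hedged C sketch (not the course's code; scan_relop and its position-pointer convention are assumptions of the sketch):

#include <stdio.h>

typedef enum { LT, LE, EQ, NE, GT, GE, NOT_RELOP } Relop;

/* Simulate the relop transition diagram over the string s, starting at *pos.
   On success *pos is advanced past the operator; in the "other" states we
   simply do not consume the lookahead character, which is the retraction
   that the * in the diagram calls for. */
Relop scan_relop(const char *s, int *pos) {
    int i = *pos;
    switch (s[i++]) {
    case '<':
        if (s[i] == '=') { *pos = i + 1; return LE; }  /* state 2 */
        if (s[i] == '>') { *pos = i + 1; return NE; }  /* state 3 */
        *pos = i; return LT;                           /* state 4*: retract */
    case '=':
        *pos = i; return EQ;                           /* state 5 */
    case '>':
        if (s[i] == '=') { *pos = i + 1; return GE; }  /* state 7 */
        *pos = i; return GT;                           /* state 8*: retract */
    default:
        return NOT_RELOP;                              /* not a relational op */
    }
}

int main(void) {
    const char *src = ">=0";
    int p = 0;
    printf("relop code %d, next char '%c'\n", scan_relop(src, &p), src[p]);
    return 0;
}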

Handle Numbers
The pattern for numbers contains optional parts:
  num → digit+ ( . digit+ )? ( E (+ | -)? digit+ )?
Examples: 31, 31.02, 31.02E-15
Always take the longest possible match: try to match the longest pattern first; only if that fails, try the next possible pattern.

Handle Numbers
Transition diagrams for num (rendered as text; states 12-27, * marks retracting states, each accepting state executes return(num, getnum())):
- States 12-19: digit+ . digit+ E (+|-)? digit+  (number with fraction and exponent)
- States 20-24: digit+ . digit+  (number with fraction only)
- States 25-27: digit+  (plain integer)
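
As a rough illustration (again, not from the slides), the three number diagrams collapse into one hand-written routine that always takes the longest match and retracts when an optional part fails to appear; scan_num and its calling convention are assumed names:

#include <ctype.h>
#include <stdio.h>

/* Scan the longest prefix of s starting at *pos that matches
   num -> digit+ ( . digit+ )? ( E (+|-)? digit+ )?
   Returns 1 and advances *pos on a match, 0 otherwise. */
int scan_num(const char *s, int *pos) {
    int i = *pos;
    if (!isdigit((unsigned char)s[i])) return 0;
    while (isdigit((unsigned char)s[i])) i++;            /* digit+           */
    if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {
        i++;                                             /* consume '.'      */
        while (isdigit((unsigned char)s[i])) i++;        /* fraction digits  */
    }
    if (s[i] == 'E') {                                   /* optional exponent */
        int j = i + 1;
        if (s[j] == '+' || s[j] == '-') j++;
        if (isdigit((unsigned char)s[j])) {
            while (isdigit((unsigned char)s[j])) j++;
            i = j;                                       /* exponent matched */
        }                                                /* else retract: leave 'E' unread */
    }
    *pos = i;
    return 1;
}

int main(void) {
    const char *tests[] = { "31", "31.02", "31.02E-15", "31.E5" };
    for (int k = 0; k < 4; k++) {
        int p = 0;
        scan_num(tests[k], &p);
        printf("%-10s matched %d characters\n", tests[k], p);
    }
    return 0;
}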

Handle Keywords
Two approaches:
- Encode the keywords (if, then, etc.) directly into the NFA: complex NFA with too many states.
- Use the symbol table: simple, but requires some tricks.
Transition diagram for identifiers/keywords (* marks the retracting state):
  state 9 -- letter --> state 10
  state 10 -- letter or digit --> state 10
  state 10 -- other --> state 11*: return(gettoken(), install_id())

Handle Keywords
The symbol table contains both the lexeme and the token type. Initialize it with all keywords and their corresponding token types:
  lexeme: if      token type: if
  lexeme: then    token type: then
  lexeme: else    token type: else

Handle Keywords
(Diagram) The scanner initializes the symbol table before scanning starts. Initial contents (Lexeme / Token Type): 1 if, 2 then, 3 else; entries 4 and 5 are still free. The parser shares this table.

Handle Keywords
gettoken():
- If the id is not found in the table, return token type ID.
- Otherwise, return the token type stored in the table.
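
A minimal C sketch of this gettoken() behavior, assuming a keyword-primed table as on the previous slides (the table layout and names are illustrative, not the course's actual code):

#include <stdio.h>
#include <string.h>

typedef enum { TK_ID, TK_IF, TK_THEN, TK_ELSE } TokenType;
typedef struct { const char *lexeme; TokenType type; } Entry;

/* Symbol table primed with the keywords and their token types. */
static Entry symtab[256] = { {"if", TK_IF}, {"then", TK_THEN}, {"else", TK_ELSE} };
static int n_entries = 3;

/* gettoken: if the lexeme is not in the table it is an ordinary identifier;
   otherwise return the token type stored in the table (for a keyword, that
   is the keyword's own token type). */
TokenType gettoken(const char *lexeme) {
    for (int i = 0; i < n_entries; i++)
        if (strcmp(symtab[i].lexeme, lexeme) == 0) return symtab[i].type;
    return TK_ID;
}

int main(void) {
    printf("%d %d\n", gettoken("if"), gettoken("count"));  /* TK_IF, TK_ID */
    return 0;
}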

Handle Keywords
(Diagram) Scanning the source text "if count <= ...": the scanner reads the lexeme "i f", calls gettoken, finds "if" in the symbol table (entry 1, token type if), and passes token if to the parser as next-token.

Handle Keywords
install_id():
- If the id is not found in the table, it is a new id: insert it into the table and return a pointer to the new entry.
- If the id is found and its type is ID, return a pointer to that entry.
- Otherwise, it is a keyword: return 0.
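
And a matching C sketch of install_id() under the same assumptions; it returns 0 for keywords so that their attribute field stays empty, and otherwise returns the entry's index:

#include <stdio.h>
#include <string.h>

typedef enum { TK_ID, TK_IF, TK_THEN, TK_ELSE } TokenType;
typedef struct { const char *lexeme; TokenType type; } Entry;

static Entry symtab[256] = { {"if", TK_IF}, {"then", TK_THEN}, {"else", TK_ELSE} };
static int n_entries = 3;

/* install_id: return the index of the id's entry, inserting a new entry if
   needed; return 0 when the lexeme turns out to be a keyword.  (A real
   scanner would copy the lexeme into storage owned by the table.) */
int install_id(const char *lexeme) {
    for (int i = 0; i < n_entries; i++)
        if (strcmp(symtab[i].lexeme, lexeme) == 0)
            return symtab[i].type == TK_ID ? i + 1 : 0;  /* found: id or keyword */
    symtab[n_entries] = (Entry){ lexeme, TK_ID };        /* new id: insert it    */
    return ++n_entries;                                  /* 1-based entry index  */
}

int main(void) {
    int a = install_id("count");   /* new id: inserted at entry 4 */
    int b = install_id("count");   /* already present: still 4    */
    int c = install_id("if");      /* keyword: returns 0          */
    printf("%d %d %d\n", a, b, c);
    return 0;
}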

Handle Keywords
(Diagram) For the same lexeme "i f", install_id finds "if" in the table as a keyword, so the token if is passed to the parser with no attribute value.

Handle Keywords
(Diagram) The scanner next reads the lexeme "c o u n t" and calls gettoken; "count" is not found in the table, so token type id is returned and passed to the parser.

Handle Keywords
(Diagram) install_id inserts "count" into the symbol table as entry 4 with token type id and returns index 4; the scanner passes <id, 4> to the parser.