COMP 3438 – Part II - Lecture 2: Lexical Analysis (I)
Dr. Zili Shao, Department of Computing, The Hong Kong Polytechnic Univ.


Overview of the Subject (COMP 3438) – Course Organization (this lecture in red)
Part I: Unix System Programming (Device Driver Development) – Overview of Unix Sys. Prog. (Process, File System); Overview of Device Driver Development; Character Device Driver Development; Introduction to Block Device Driver.
Part II: Compiler Design – Overview of Compiler Design; Lexical Analysis (HW #3); Syntax Analysis (HW #4).

The Outline
Part I: Introduction to Lexical Analysis
1. Input (a source program) and output (tokens)
2. How to specify tokens? Regular expressions
3. How to recognize tokens?
   Regular expression → Lex (software tool)
   Regular expression → finite automaton (write our own)
Part II: Regular Expressions
Part III: Finite Automata (write your own – Homework #3)

Part I: Introduction to Lexical Analysis
- Why do we need lexical analysis? Its input and output.
- How to specify tokens: regular expressions.
- How to recognize tokens – two methods:
  Regular expression → software tool: Lex
  Regular expression → finite automata (write your own program)

Why do we need lexical analysis?
Given a program, how do we group its characters into meaningful "words"?
Example: a C program segment, stored in a file as a string of characters:
  if (i==j)
      z = 0;
  else
      z = 1;
i.e., the character string "if (i==j)\n\t\tz=0;\telse\n\t\tz=1;\n".
How do we identify, from this string of characters, that "if" and "else" are keywords and that "i", "j", "z" are variables? (Similarly, in English, to understand "I love you" you must first identify the words "I", "love", and "you".)

Lexical Analysis (Input & Output)
In lexical analysis, a source program is read from left to right and grouped into tokens: sequences of characters with a collective meaning.
INPUT (source program): if (i==j)\n\t\tz=0;\telse\n\t\tz=1;\n
  → Lexical Analyzer →
OUTPUT (tokens):
  Token     Lexeme (value)
  keyword   if
  ID        i
  operator  ==
  ID        j
  ...       ...
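To make the input/output relationship concrete, here is a minimal tokenizer sketch in Python. It is illustrative only, not the course's implementation: the token names (KEYWORD, ID, NUM, OPERATOR, PUNCT, WS) and the regular expressions are my own choices for this one C fragment.

```python
import re

# Illustrative token patterns, tried left to right at each position.
# Keywords are listed before ID so "if" is not classified as an identifier.
TOKEN_SPEC = [
    ("KEYWORD",  r"\b(?:if|else)\b"),
    ("ID",       r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUM",      r"\d+"),
    ("OPERATOR", r"==|="),
    ("PUNCT",    r"[();]"),
    ("WS",       r"\s+"),          # whitespace is matched, then discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Scan left to right, emitting (token, lexeme) pairs; skip whitespace.
    (A real lexer would also report characters that match no pattern.)"""
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(source)
            if m.lastgroup != "WS"]

print(tokenize("if (i==j)\n\t\tz=0;\telse\n\t\tz=1;\n"))
```

Running it on the slide's C fragment yields the token/lexeme pairs shown above: ("KEYWORD","if"), ("PUNCT","("), ("ID","i"), ("OPERATOR","=="), and so on.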

What is a Token?
- A syntactic category.
  In English: noun, verb, adjective, ...
  In a programming language: identifier, integer, keyword, whitespace, ...
- Tokens correspond to sets of strings with a collective meaning:
  Identifier: strings of letters and digits, starting with a letter
  Integer: a non-empty string of digits
  Keyword: "else", "if", "while", ...

Example – Expression (Input & Output)
Input: ((48*10) div 12)**3
Lexical analysis produces the token stream:
  TokenName: LP   value: (
  TokenName: LP   value: (
  TokenName: NUM  value: 48
  TokenName: MPY  value: *
  TokenName: NUM  value: 10
  TokenName: RP   value: )
  TokenName: DIV  value: /
  TokenName: NUM  value: 12
  TokenName: RP   value: )
  TokenName: EXP  value: ^
  TokenName: NUM  value: 3
  TokenName: END  value: $
LEXICAL ANALYSIS FINISH
(The values for DIV and EXP are normalized forms of the lexemes "div" and "**".)

Example – Mini Java Program (Input & Output)
  program xyz;
  class hellow {
      method void main() {
          System.println('hellow\n');
      }
  }

What are Tokens For?
- To classify program substrings according to their syntactic role.
- As the output of lexical analysis, tokens are the input to the parser (syntax analysis).
- The parser relies on token distinctions, e.g., a keyword is treated differently than an ID.

How to Recognize Tokens (Lexical Analyzer)?
- First, specify tokens using regular expressions (patterns).
- Second, based on the regular expressions, implement the lexical analyzer in one of two ways:
  Method 1: use Lex, a software tool.
  Method 2: use finite automata (write your own program). (Homework #3)

Part II. Regular Expressions
- Alphabets, strings, languages
- Regular expressions
- Regular sets (regular languages)

Specifying Tokens
- A token is specified by a pattern.
- A pattern is a set of rules describing the formation of the token.
- The lexical analyzer uses the pattern to identify a lexeme: a sequence of characters in the input that matches the pattern.
- Once matched, the corresponding token is recognized.
Example: the rule (pattern) for ID (identifier) is: a letter followed by letters and digits.
  abc1 and A1By match the pattern, so they are ID tokens; 1A does not match the pattern, so it is not an ID token.
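The ID pattern above can be checked mechanically. A small sketch in Python (the regex `[A-Za-z][A-Za-z0-9]*` is a direct transcription of "letter followed by letters and digits"):

```python
import re

# The ID pattern from the slide: a letter followed by letters and digits.
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def is_identifier(lexeme):
    # fullmatch: the ENTIRE lexeme must fit the pattern, not just a prefix
    return ID_PATTERN.fullmatch(lexeme) is not None

print(is_identifier("abc1"))  # True
print(is_identifier("A1By"))  # True
print(is_identifier("1A"))    # False: starts with a digit
```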

Rules, Tokens, Regular Expressions
- The rules for specifying token patterns are called regular expressions. A regular set (regular language) is the set of strings generated by a regular expression over an alphabet.
- So what exactly are alphabets, languages, regular expressions, and regular sets?

Alphabet and Strings
- An alphabet (Σ) is a finite set of symbols.
  e.g., {0,1} is the binary alphabet; {a,b,...,z,A,B,...,Z} is the English alphabet.
- A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
  e.g., 101 is a string over Σ={0,1}; wxyzabc is a string over Σ={a,b,c,...,z}.
- ε denotes the empty string (the string with no symbols).
- The length of a string w is denoted |w|.
  e.g., |ε| = 0; |101| = 3; |abcdef| = 6.

Σ* and Languages
- Σ* denotes the set of all strings over an alphabet Σ, including ε (the empty string).
  e.g., Σ* = {ε, 0, 1, 00, 01, 10, 11, 000, ...} over Σ={0,1}.
- Languages: any set of strings over an alphabet Σ – that is, any subset of Σ* – is called a language.
  e.g., Ø, {ε}, Σ*, and Σ are all languages; {abc, def, d, z} is a language over Σ={a,b,...,z}.
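Σ* is infinite, but its elements can be enumerated by length. A small sketch (the function name `strings_up_to` is my own) listing all strings over Σ={0,1} up to a given length, starting with ε:

```python
from itertools import product

def strings_up_to(sigma, n):
    """All strings over the alphabet sigma of length <= n, shortest first.
    Length 0 contributes exactly one string: the empty string (epsilon)."""
    out = []
    for length in range(n + 1):
        for symbols in product(sigma, repeat=length):
            out.append("".join(symbols))
    return out

print(strings_up_to("01", 2))
# ['', '0', '1', '00', '01', '10', '11']
```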

Operations on Languages
- Remember that a language is a set, so all operations on sets can be applied to languages.
- We are interested in: union, concatenation, and closures.
  Given two languages L and M:
    Union:            L ∪ M = { s | s is in L or s is in M }
    Concatenation:    LM    = { st | s is in L and t is in M }
    Kleene closure:   L*    = union of L^i for i ≥ 0 (zero or more concatenations of L)
    Positive closure: L+    = union of L^i for i ≥ 1 (one or more concatenations of L)

Precedence of Operators
- Precedence: Kleene closure > concatenation > union
  (analogous to: exponentiation > multiplication > addition).
- e.g., in 1 | 23*, the closure 3* applies first, then the concatenation 2(3*), then the union with 1.
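The precedence rules can be checked with Python's `re` module over the alphabet {a, b, c}; this is my own illustration, not from the slides. Because * binds tighter than concatenation, which binds tighter than |, the expression a|bc* is read as a | (b(c*)):

```python
import re

# a|bc* parses as a | (b(c*)) under the standard precedence rules.
pat = re.compile(r"a|bc*")

print(pat.fullmatch("a")    is not None)  # True: the left alternative
print(pat.fullmatch("bcc")  is not None)  # True: b followed by c*
print(pat.fullmatch("ac")   is not None)  # False: * does not apply to a
print(pat.fullmatch("bcbc") is not None)  # False: that would need (bc)*
```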

Examples of Operations on Languages
Given: L = {a, b}, M = {a, bb}
  L ∪ M = {a, b, bb}
  LM = {aa, abb, ba, bbb}
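Since languages are just sets of strings, the example above can be reproduced directly with Python sets (a sketch; the variable names are mine):

```python
# Languages are sets of strings, so union is set union and
# concatenation LM pairs every string of L with every string of M.
L = {"a", "b"}
M = {"a", "bb"}

union = L | M
concat = {x + y for x in L for y in M}   # LM

print(sorted(union))   # ['a', 'b', 'bb']
print(sorted(concat))  # ['aa', 'abb', 'ba', 'bbb']
```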

Another Example of Operations on Languages
Example 3.2 (page 93 in textbook): Let L be the set {A, B, ..., Z, a, b, ..., z} and D be the set {0, 1, ..., 9}. They are both languages. Here are some examples of new languages created from L and D by applying the operators defined above:
1. L ∪ D is the set of letters and digits;
2. LD is the set of strings consisting of a letter followed by a digit;
3. L^4 is the set of all four-letter strings;
4. L* is the set of all strings of letters, including ε, the empty string;
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter;
6. D+ is the set of all strings of one or more digits.

Defining Regular Expressions
The rules defining regular expressions over an alphabet Σ:
1. The empty string ε is a regular expression that denotes {ε}.
2. A single symbol a in Σ is a regular expression that denotes {a}.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then:
   a) (r) | (s) is a regular expression denoting L(r) ∪ L(s);
   b) (r)(s) is a regular expression denoting L(r)L(s);
   c) (r)* is a regular expression denoting (L(r))*;
   d) (r) is a regular expression denoting L(r).

Examples of Regular Expressions
Example 3.3 (page 95 in textbook): Let Σ = {a, b}.
1. The regular expression a | b denotes the set {a, b}.
2. The regular expression (a | b)(a | b) denotes {aa, ab, ba, bb}, the set of all strings of a's and b's of length two. Another regular expression for this same set is aa | ab | ba | bb.
3. The regular expression a* denotes the set of all strings of zero or more a's, i.e., {ε, a, aa, aaa, ...}.

Examples of Regular Expressions (cont.)
Example 3.3 (page 95 in textbook): Let Σ = {a, b}.
4. The regular expression (a | b)* denotes the set of all strings containing zero or more instances of an a or b, that is, the set of all strings of a's and b's. Another regular expression for this set is (a*b*)*.
5. The regular expression a | a*b denotes the set containing the string a and all strings consisting of zero or more a's followed by a b.

Regular Set, Regular Expression and Regular Definition
- Regular set (regular language): each regular expression r denotes a language L(r), called a regular set.
  e.g., let Σ = {a, b}; a|b denotes the set {a, b}.
- Regular definition: give distinct names (d1, d2, ...) to regular expressions (r1, r2, ...), like:
    d1 → r1
    d2 → r2
    d3 → r3
    ...

Example – Identifier in Pascal
Example 3.4 (pp. 96): a Pascal identifier is a string of letters and digits beginning with a letter.
Regular expression:
  LETTER → A | B | ... | Z | a | b | ... | z
  DIGIT  → 0 | 1 | ... | 9
  ID     → LETTER ( LETTER | DIGIT )*

Example – Unsigned Numbers in Pascal
Example 3.5 (pp. 96): unsigned numbers in Pascal are strings such as 5280, 39.37, 6.336E4, or 1.89E-4.
Regular expression:
  DIGIT         → 0 | 1 | ... | 9
  DIGITS        → DIGIT DIGIT*
  OPTIONAL_FRAC → . DIGITS | ε
  OPTIONAL_EXP  → ( E ( + | - | ε ) DIGITS ) | ε
  NUM           → DIGITS OPTIONAL_FRAC OPTIONAL_EXP
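The NUM definition above can be transcribed into a single Python regex to check which strings it accepts. This is a sketch: the named sub-definitions (DIGITS, the optional fraction, the optional exponent) are inlined into one pattern.

```python
import re

# NUM = DIGITS OPTIONAL_FRAC OPTIONAL_EXP, inlined:
#   DIGITS        -> [0-9]+
#   OPTIONAL_FRAC -> (\.[0-9]+)?
#   OPTIONAL_EXP  -> (E[+-]?[0-9]+)?
NUM = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")

for s in ["5280", "39.37", "6.336E4", "1.89E-4", ".37", "1.", "E4"]:
    print(s, "->", NUM.fullmatch(s) is not None)
```

The last three strings are rejected, matching the definition: a fraction or exponent is optional, but the leading DIGITS part is not, and a dot must be followed by digits.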

Notation Shorthands
1. One or more instances: r+ = r r*
2. Zero or one instance: r? = r | ε
3. Character classes: [a-z] = a|b|c|...|z
e.g., the original regular expression for unsigned numbers:
  DIGIT         → 0 | 1 | ... | 9
  DIGITS        → DIGIT DIGIT*
  OPTIONAL_FRAC → . DIGITS | ε
  OPTIONAL_EXP  → ( E ( + | - | ε ) DIGITS ) | ε
  NUM           → DIGITS OPTIONAL_FRAC OPTIONAL_EXP
becomes, with the notation shorthands:
  DIGITS        → [0-9]+
  OPTIONAL_FRAC → ( . DIGITS )?
  OPTIONAL_EXP  → ( E ( + | - )? DIGITS )?
  NUM           → DIGITS OPTIONAL_FRAC OPTIONAL_EXP

Recognizing Tokens
Given a string s and a regular expression r, is s ∈ L(r)?
e.g., let Σ = {a, b}, and let a | b be the given regular expression. Then:
  the string aa ∉ L(a|b);
  the string a ∈ L(a|b).
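The membership question s ∈ L(r) is exactly what a full regex match decides. A minimal sketch for the example r = a|b:

```python
import re

# Deciding s in L(r) for r = a|b: the whole string must match r.
r = re.compile(r"a|b")

print(r.fullmatch("a")  is not None)  # True:  a  is in L(a|b)
print(r.fullmatch("aa") is not None)  # False: aa is not in L(a|b)
```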

Implementation of Lexical Analysis
After regular expressions are obtained, we have two methods to implement a lexical analyzer:
- Use tools: lex (for C), flex (for C/C++), jlex (for Java).
  Specify tokens using regular expressions; the tool generates source code for the lexical analyzer.
- Use regular expressions and finite automata.
  Write code to express the recognition of tokens; table driven.
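To show what "table driven" means, here is a hand-built sketch of the second method for one token type, the identifier pattern LETTER (LETTER | DIGIT)*. It is an illustration in Python, not Lex's actual generated tables: states, character classes, and the table layout are my own choices.

```python
# A hand-built transition table for the identifier DFA.
# States: 0 = start, 1 = inside an identifier (accepting).
# Missing (state, class) entries mean "no transition": reject.

def char_class(c):
    if c.isalpha():
        return "letter"
    if c.isdigit():
        return "digit"
    return "other"

TABLE = {
    (0, "letter"): 1,
    (1, "letter"): 1,
    (1, "digit"):  1,
}
ACCEPTING = {1}

def accepts(s):
    """Run the DFA over s; accept iff we end in an accepting state."""
    state = 0
    for c in s:
        state = TABLE.get((state, char_class(c)))
        if state is None:
            return False
    return state in ACCEPTING

print(accepts("abc1"))  # True
print(accepts("1A"))    # False: no transition from start on a digit
print(accepts(""))      # False: the start state is not accepting
```

The driver loop never changes; supporting a new token language only means swapping in a different table, which is why generated lexers take this form.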

LEX: A Lexical Analyzer Generator
- Lex is a UNIX software tool (developed by M.E. Lesk and E. Schmidt from Bell Labs in 1972) that automatically constructs a lexical analyzer.
- Input: a specification containing regular expressions written in the Lex language (see the textbook and the LEX documentation on Blackboard).
  Assume that each token matches a regular expression; each expression also needs an action.
- Output: a C program.
- Especially useful when coupled with a parser generator (e.g., yacc).

How Does LEX Work?
Given an input file lex.l containing regular expressions that specify the tokens, LEX produces a C file lex.yy.c:
- lex.yy.c contains a tabular representation of the state transition graph of a finite automaton constructed from the regular expressions, and a routine yylex() that uses the table to recognize tokens.
- yylex() can be called as a subroutine, e.g., by a syntax analyzer generated by Yacc; alternatively, compile lex.yy.c and run it independently.
  Lex specification (lex.l) → LEX → lex.yy.c (contains the lexical analyzer yylex())

How Does LEX Work? (Example)
Pipeline: foo.l → lex → foolex.c → cc → foolex; foolex reads the input and emits tokens.
  > flex -o foolex.c foo.l
  > cc -o foolex foolex.c -lfl
  > more input
  begin if size>10 then size * -3.1415 end
  > foolex < input
  Keyword: begin
  Keyword: if
  Identifier: size
  Operator: >
  Integer: 10 (10)
  Keyword: then
  Identifier: size
  Operator: *
  Operator: -
  Float: 3.1415 (3.1415)
  Keyword: end

About LEX
Some materials related to LEX can be found from