Chapter 2 Lexical Analysis Nai-Wei Lin. Lexical Analysis Lexical analysis recognizes the vocabulary of the programming language and transforms a string.

Slides:



Advertisements
Similar presentations
Lexical Analysis Dragon Book: chapter 3.
Advertisements

COS 320 Compilers David Walker. Outline Last Week –Introduction to ML Today: –Lexical Analysis –Reading: Chapter 2 of Appel.
Lex -- a Lexical Analyzer Generator (by M.E. Lesk and Eric. Schmidt) –Given tokens specified as regular expressions, Lex automatically generates a routine.
COMP-421 Compiler Design Presented by Dr Ioanna Dionysiou.
1 CIS 461 Compiler Design and Construction Fall 2012 slides derived from Tevfik Bultan et al. Lecture-Module 5 More Lexical Analysis.
Chapter 3 Lexical Analysis Yu-Chen Kuo.
LEXICAL ANALYSIS Phung Hua Nguyen University of Technology 2006.
Lexical Analyzer Second lecture. Compiler Construction Outline Informal sketch of lexical analysis Identifies tokens in input string Issues in lexical.
 Lex helps to specify lexical analyzers by specifying regular expression  i/p notation for lex tool is lex language and the tool itself is refered to.
Winter 2007SEG2101 Chapter 81 Chapter 8 Lexical Analysis.
Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.
1 The scanning process Main goal: recognize words/tokens Snapshot: At any point in time, the scanner has read some input and is on the way to identifying.
1 Chapter 2: Scanning 朱治平. Scanner (or Lexical Analyzer) the interface between source & compiler could be a separate pass and places its output on an.
2. Lexical Analysis Prof. O. Nierstrasz
Chapter 3 Chang Chi-Chung. The Structure of the Generated Analyzer lexeme Automaton simulator Transition Table Actions Lex compiler Lex Program lexemeBeginforward.
COS 320 Compilers David Walker. Outline Last Week –Introduction to ML Today: –Lexical Analysis –Reading: Chapter 2 of Appel.
1 Chapter 3 Scanning – Theory and Practice. 2 Overview Formal notations for specifying the precise structure of tokens are necessary  Quoted string in.
Lexical Analysis Recognize tokens and ignore white spaces, comments
Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.
1 Lexical Analysis. 2 Contents  Introduction to lexical analyzer  Tokens  Regular expressions (RE)  Finite automata (FA) –deterministic and nondeterministic.
Chapter 3 Lexical Analysis
Topic #3: Lexical Analysis
1 Flex. 2 Flex A Lexical Analyzer Generator  generates a scanner procedure directly, with regular expressions and user-written procedures Steps to using.
1 Chapter 3 Scanning – Theory and Practice. 2 Overview Formal notations for specifying the precise structure of tokens are necessary –Quoted string in.
CS308 Compiler Principles Lexical Analyzer Fan Wu Department of Computer Science and Engineering Shanghai Jiao Tong University Fall 2012.
Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Machine-independent code improvement Target code generation Machine-specific.
Lexical Analysis - An Introduction. The Front End The purpose of the front end is to deal with the input language Perform a membership test: code  source.
Lexical Analysis - An Introduction Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at.
어휘분석 (Lexical Analysis). Overview Main task: to read input characters and group them into “ tokens. ” Secondary tasks: –Skip comments and whitespace;
Lexical Analysis - An Introduction Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at.
Lecture # 3 Chapter #3: Lexical Analysis. Role of Lexical Analyzer It is the first phase of compiler Its main task is to read the input characters and.
Review: Regular expression: –How do we define it? Given an alphabet, Base case: – is a regular expression that denote { }, the set that contains the empty.
Topic #3: Lexical Analysis EE 456 – Compiling Techniques Prof. Carl Sable Fall 2003.
COMP 3438 – Part II - Lecture 2: Lexical Analysis (I) Dr. Zili Shao Department of Computing The Hong Kong Polytechnic Univ. 1.
Lexical Analyzer (Checker)
Overview of Previous Lesson(s) Over View  An NFA accepts a string if the symbols of the string specify a path from the start to an accepting state.
CS412/413 Introduction to Compilers Radu Rugina Lecture 4: Lexical Analyzers 28 Jan 02.
Compiler Construction 2 주 강의 Lexical Analysis. “get next token” is a command sent from the parser to the lexical analyzer. On receipt of the command,
Lexical Analysis: Finite Automata CS 471 September 5, 2007.
1 Lexical Analysis and Lexical Analyzer Generators Chapter 3 COP5621 Compiler Construction Copyright Robert van Engelen, Florida State University,
Review: Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Intermediate code generator Code optimizer Code generator Symbol.
Pembangunan Kompilator.  A recognizer for a language is a program that takes a string x, and answers “yes” if x is a sentence of that language, and.
By Neng-Fa Zhou Lexical Analysis 4 Why separate lexical and syntax analyses? –simpler design –efficiency –portability.
1 Using Lex. Flex – Lexical Analyzer Generator A language for specifying lexical analyzers Flex compilerlex.yy.clang.l C compiler -lfl a.outlex.yy.c a.outtokenssource.
Joey Paquet, 2000, Lecture 2 Lexical Analysis.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
C Chuen-Liang Chen, NTUCS&IE / 35 SCANNING Chuen-Liang Chen Department of Computer Science and Information Engineering National Taiwan University Taipei,
Lexical Analysis.
1st Phase Lexical Analysis
Chapter 2 Scanning. Dr.Manal AbdulazizCS463 Ch22 The Scanning Process Lexical analysis or scanning has the task of reading the source program as a file.
1 Topic 2: Lexing and Flexing COS 320 Compiling Techniques Princeton University Spring 2016 Lennart Beringer.
Overview of Previous Lesson(s) Over View  A token is a pair consisting of a token name and an optional attribute value.  A pattern is a description.
1 Compiler Construction (CS-636) Muhammad Bilal Bashir UIIT, Rawalpindi.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
Deterministic Finite Automata Nondeterministic Finite Automata.
CS412/413 Introduction to Compilers Radu Rugina Lecture 3: Finite Automata 25 Jan 02.
Lecture 2 Compiler Design Lexical Analysis By lecturer Noor Dhia
Compilers Lexical Analysis 1. while (y < z) { int x = a + b; y += x; } 2.
COMP 3438 – Part II - Lecture 3 Lexical Analysis II Par III: Finite Automata Dr. Zili Shao Department of Computing The Hong Kong Polytechnic Univ. 1.
CS510 Compiler Lecture 2.
Chapter 3 Lexical Analysis.
The time complexity for e-closure(T).
Two issues in lexical analysis
Recognizer for a Language
Lexical Analysis Why separate lexical and syntax analyses?
פרק 3 ניתוח לקסיקאלי תורת הקומפילציה איתן אביאור.
Review: Compiler Phases:
Recognition of Tokens.
Finite Automata & Language Theory
Chapter 3. Lexical Analysis (2)
Presentation transcript:

Chapter 2 Lexical Analysis Nai-Wei Lin

Lexical Analysis Lexical analysis recognizes the vocabulary of the programming language and transforms a string of characters into a string of words or tokens Lexical analysis discards white spaces and comments between the tokens Lexical analyzer (or scanner) is the program that performs lexical analysis

Outline Scanners Tokens Regular expressions Finite automata Automatic conversion from regular expressions to finite automata FLex - a scanner generator

Scanners Scanner Parser Symbol Table token next token characters

Tokens A token is a sequence of characters that can be treated as a unit in the grammar of a programming language A programming language classifies tokens into a finite set of token types TypeExamples IDfoo i n NUM73 13 IFif COMMA,

Semantic Values of Tokens Semantic values are used to distinguish different tokens in a token type –,, –, – Token types affect syntax analysis and semantic values affect semantic analysis

Scanner Generators Scanner Generator Scanner definition in matalanguage Scanner Program in programming language Token types & semantic values

Languages A language is a set of strings A string is a finite sequence of symbols taken from a finite alphabet – The C language is the (infinite) set of all strings that constitute legal C programs – The language of C reserved words is the (finite) set of all alphabetic strings that cannot be used as identifiers in the C programs – Each token type is a language

Regular Expressions (RE) A language allows us to use a finite description to specify a (possibly infinite) set RE is the metalanguage used to define the token types of a programming language

Regular Expressions  is a RE denoting L = {  } If a  alphabet, then a is a RE denoting L = {a} Suppose r and s are RE denoting L(r) and L(s)  alternation: (r) | (s) is a RE denoting L(r)  L(s)  concatenation: (r) (s) is a RE denoting L(r)L(s)  repetition: (r) * is a RE denoting (L(r)) *  (r) is a RE denoting L(r)

Examples a | b{a, b} (a | b)(a | b){aa, ab, ba, bb} a * { , a, aa, aaa,...} (a | b) * the set of all strings of a ’ s and b ’ s a | a * bthe set containing the string a and all strings consisting of zero or more a ’ s followed by a b

Regular Definitions Names for regular expressions d 1  r 1 d 2  r 2... d n  r n where r i over alphabet  {d 1, d 2,..., d i-1 } Examples: letter  A | B |... | Z | a | b |... | z digit  0 | 1 |... | 9 identifier  letter ( letter | digit ) *

Notational Abbreviations One or more instances (r) + denoting (L(r)) + r * = r + |  r + = r r * Zero or one instance r? = r |  Character classes [abc] = a | b | c [a-z] = a | b |... | z [^abc] = any character except a | b | c Any character except newline.

Examples if {return IF;} [a-z][a-z0-9]* {return ID;} [0-9]+ {return NUM;} ([0-9]+ “. ” [0-9]*)|([0-9]* “. ” [0-9]+) {return REAL;} ( “ -- ” [a-z]* “ \n ” )|( “ ” | “ \n ” | “ \t ” )+ {/*do nothing for white spaces and comments*/}. { error(); }

Completeness of REs A lexical specification should be complete; namely, it always matches some initial substring of the input …./* match any */

Disambiguity of REs (1) Longest match disambiguation rules: the longest initial substring of the input that can match any regular expression is taken as the next token ([0-9]+ “. ” [0-9]*)|([0-9]* “. ” [0-9]+) /* REAL */ 0.9

Disambiguity of REs (2) Rule priority disambiguation rules: for a particular longest initial substring, the first regular expression that can match determines its token type if/* IF */ [a-z][a-z0-9]*/* ID */ if

Finite Automata A finite automaton is a finite-state transition diagram that can be used to model the recognition of a token type specified by a regular expression A finite automaton can be a nondeterministic finite automaton or a deterministic finite automaton

Nondeterministic Finite Automata (NFA) An NFA consists of – A finite set of states – A finite set of input symbols – A transition function that maps (state, symbol) pairs to sets of states – A state distinguished as start state – A set of states distinguished as final states

An Example RE: (a | b) * abb States: {1, 2, 3, 4} Input symbols: {a, b} Transition function: (1,a) = {1,2}, (1,b) = {1} (2,b) = {3}, (3,b) = {4} Start state: 1 Final state: {4} 1234 a b b a,b start

Acceptance of NFA An NFA accepts an input string s iff there is some path in the finite-state transition diagram from the start state to some final state such that the edge labels along this path spell out s The language recognized by an NFA is the set of strings it accepts

An Example 1423 abb a b start (a | b) * abbaabb {1}  {1,2}  {1,2}  {1,3}  {1,4} aabb

aaba An Example 1423 abb a b start (a | b) * abb {1}  {1,2}  {1,2}  {1,3}  {1} a aba

Another Example RE: aa * | bb * States: {1, 2, 3, 4, 5} Input symbols: {a, b} Transition function: (1,  ) = {2, 4}, (2, a) = {3}, (3, a) = {3}, (4, b) = {5}, (5, b) = {5} Start state: 1 Final states: {3, 5}

Finite-State Transition Diagram start aa * | bb * a b a b 5 aaa {1}  {1,2,4}  {3}  {3}  {3} aaa 

Operations on NFA states  -closure(s): set of states reachable from a state s on  -transitions alone  -closure(S): set of states reachable from some state s in S on  -transitions alone move(s, c): set of states to which there is a transition on input symbol c from a state s move(S, c): set of states to which there is a transition on input symbol c from some state s in S

An Example start aa * | bb * a b a b 5 aaa {1}  {1,2,4}  {3}  {3}  {3} aaa  S 0 = {1} S 1 =  -closure({1}) = {1,2,4} S 2 = move({1,2,4},a) = {3} S 3 =  -closure({3}) = {3} S 4 = move({3},a) = {3} S 5 =  -closure({3}) = {3} S 6 = move({3},a) = {3} S 7 =  -closure({3}) = {3} 3 is in {3, 5}  accept

Simulating an NFA Input: An input string ended with eof and an NFA with start state s 0 and final states F. Output: The answer “yes” if accepts, “no” otherwise. begin S :=  -closure({s 0 }); c := nextchar; while c <> eof do begin S :=  -closure(move(S, c)); c := nextchar end; if S  F <>  then return “yes” else return “no” end.

Computation of  -closure (a | b) * abb start a b ab 5 b  -closure({1}) = {1,2,3,5,8}  -closure({4}) = {2,3,4,5,7,8}

Computation of  -closure Input: An NFA and a set of NFA states S. Output: T =  -closure(S). begin push all states in S onto stack; T := S; while stack is not empty do begin pop t, the top element, off of stack; for each state u with an edge from t to u labeled  do if u is not in T then begin add u to T; push u onto stack end end; return T end.

Deterministic Finite Automata (DFA) A DFA is a special case of an NFA in which no state has an  -transition for each state s and input symbol a, there is at most one edge labeled a leaving s

An Example RE: (a | b) * abb States: {1, 2, 3, 4} Input symbols: {a, b} Transition function: (1,a) = 2, (2,a) = 2, (3,a) = 2, (4,a) = 2 (1,b) = 1, (2,b) = 3, (3,b) = 4, (4,b) = 1 Start state: 1 Final state: {4}

Finite-State Transition Diagram (a | b) * abb 1423 a b b a b start a b a

Acceptance of DFA A DFA accepts an input string s iff there is one path in the finite-state transition diagram from the start state to some final state such that the edge labels along this path spell out s The language recognized by a DFA is the set of strings it accepts

An Example (a | b) * abb 1423 a b b a b start a b a aabb 1  2  2  3  4 aabb

An Example (a | b) * abb 1423 a b b a b start a b a aaba 1  2  2  3  2 aaba

An Example (a | b) * abb 1423 a b b a b start a b a bbababb s = 1 s = move(1, b) = 1 s = move(1, a) = 2 s = move(2, b) = 3 s = move(3, a) = 2 s = move(2, b) = 3 s = move(3, b) = 4 4 is in {4}  accept

Simulating a DFA Input: An input string ended with eof and a DFA with start state s 0 and final states F. Output: The answer “yes” if accepts, “no” otherwise. begin s := s 0 ; c := nextchar; while c <> eof do begin s := move(s, c); c := nextchar end; if s is in F then return “yes” else return “no” end.

Combined Finite Automata start if start 3 1 a-z start 2 a-z, [a-z][a-z0-9]* ([0-9]+ “. ” [0-9]*) | ([0-9]* “. ” [0-9]+) if IF ID REAL

Combined Finite Automata  23 if  4 5 a-z  6 a-z, start IF ID REAL NFA

Combined Finite Automata i f 3 j-z 4 a-z, start IF ID REAL DFA a-z,0-9 a-h a-e g-z ID

Recognizing the Longest Match The automaton must keep track of the longest match seen so far and the position of that match until a dead state is reached Use two variables Last-Final (the state number of the most recent final state encountered) and Input-Position-at-Last-Final to remember the last time the automaton was in a final state

An Example i 3 j-z 4 a-z, start ID REAL DFA a-z,0-9 a-h a-e g-z ID S C L P i f f a i l ? iffail+ IF

Scanner Generators RE NFA DFA

Flex – A Scanner Generator A language for specifying scanners Flex compilerlex.yy.clang.l C compiler -lfl a.outlex.yy.c a.outtokenssource code

Flex Programs %{ auxiliary declarations %} regular definitions % translation rules % auxiliary procedures

Translation Rules P 1 action 1 P 2 action 2... P n action n where P i are regular expressions and action i are C program segments

Example 1 % username printf( “%s”, getlogin() ); By default, any text not matched by a flex scanner is copied to the output. This scanner copies its input file to its output with each occurrence of “username” being replaced with the user’s login name.

Example 2 %{ int lines = 0, chars = 0; %} % \n++lines; ++chars;.++chars;/* all characters except \n */ % main() { yylex(); printf(“lines = %d, chars = %d\n”, lines, chars); }

Example 3 %{ #define EOF0 #define LE25... %} delim[ \t\n] ws{delim}+ letter[A-Za-z] digit[0-9] id{letter}({letter}|{digit})* number{digit}+(\.{digit}+)?(E[+\-]?{digit}+)? %

Example 3 {ws}{ /* no action and no return */ } if{return (IF);} else{return (ELSE);} {id}{yylval=install_id(); return (ID);} {number}{yylval=install_num(); return (NUMBER);} “<=”{yylval=LE; return (RELOP);} “==”{yylval=EQ; return (RELOP);}... >{return(EOF);} % install_id() {... } install_num() {... }

Functions and Variables yylex() a function implementing the lexical analyzer and returning the token matched yytext a global pointer variable pointing to the lexeme matched yyleng a global variable giving the length of the lexeme matched yylval an external global variable storing the attribute of the token

NFA from Flex Programs P 1 | P 2 |... | P n N(P2)N(P2)... N(P1)N(P1) N(Pn)N(Pn) s0s0

Rules Look for the longest lexeme – number Look for the first-listed pattern that matches the longest lexeme – keywords and identifiers List frequently occurring patterns first – white space

Rules View keywords as exceptions to the rule of identifiers – construct a keyword table

Rules Start condition: r – match r only in start condition s Start conditions are declared in the first section using either %s or %x %s str A start condition is activated using the BEGIN action \ ” BEGIN(str); [^ ” ]*{/* eat up string body */} The default start condition is INITIAL \ ” BEGIN(INITIAL);

Lexical Error Recovery Error: none of patterns matches a prefix of the remaining input Panic mode error recovery – delete successive characters from the remaining input until the pattern-matching can continue

Maintaining Line Number Flex allows to maintain the number of the current line in the global variable yylineno using the following option mechanism %option yylineno in the first section

From a RE to an NFA Thompson’s construction algorithm – For , construct – For a in alphabet, construct fi starta if 

From a RE to an NFA Suppose N(s) and N(t) are NFA for RE s and t – for s | t, construct – for s t, construct start iN(s)N(t) f start i N(s) N(t) f isis itit fsfs ftft fsfs itit

From a RE to an NFA – for s *, construct – for (s), use N(s) iN(s) f start isis fsfs

An Example (a | b) * abb 21 a start 78 a 9 b 10 b 11 b 34 56

From an NFA to a DFA Subset construction Algorithm. Input: An NFA N. Output: A DFA D with states Dstates and trasition table Dtran. begin add  -closure(s 0 ) as an unmarked state to Dstates; while there is an unmarked state T in Dstates do begin mark T; for each input symbol a do begin U :=  -closure(move(T, a)); if U is not in Dstates then add U as an unmarked state to Dstates; Dtran[T, a] := U end end.

An Example (a | b) * abb start a b ab 5 b

An Example  -closure({1}) = {1,2,3,5,8} = A  -closure(move(A, a))=  -closure({4,9}) = {2,3,4,5,7,8,9} = B  -closure(move(A, b))=  -closure({6}) = {2,3,5,6,7,8} = C  -closure(move(B, a))=  -closure({4,9}) = B  -closure(move(B, b))=  -closure({6,10}) = {2,3,5,6,7,8,10} = D  -closure(move(C, a))=  -closure({4,9}) = B  -closure(move(C, b))=  -closure({6}) = C  -closure(move(D, a))=  -closure({4,9}) = B  -closure(move(D, b))=  -closure({6,11}) = {2,3,5,6,7,8,11} = E  -closure(move(E, a))=  -closure({4,9}) = B  -closure(move(E, b))=  -closure({6}) = C

An Example State Input Symbol ab A = {1,2,3,5,8} B = {2,3,4,5,7,8,9} C = {2,3,5,6,7,8} D = {2,3,5,6,7,8,10} E = {2,3,5,6,7,8,11} B B B B BC E C D C

An Example start {1,2,3,5,8} {2,3,4,5, 7,8,9} {2,3,5, 6,7,8} {2,3,5,6, 7,8,10} {2,3,5,6, 7,8,11} a a b b b a a b a b