LEXICAL ANALYSIS Phung Hua Nguyen University of Technology 2006.

Presentation transcript:


Faculty of IT - HCMUT, Lexical Analysis

Outline
Introduction to Lexical Analysis
Token specification
– Language
– Regular Expressions (REs)
Token recognition
– REs → NFA (Thompson’s construction, Algorithm 3.3)
– NFA → DFA (subset construction, Algorithm 3.2)
– DFA → minimal DFA (Algorithm 3.6)
Programming

Introduction
– Read the input characters
– Produce as output a sequence of tokens
– Eliminate white space and comments
[Figure: the source program feeds the lexical analyzer; the parser sends "get next token" and receives a token; both use the symbol table]

Why a separate lexical analysis phase?
– Simplify design
– Improve compiler efficiency
– Enhance compiler portability

Tokens, Patterns, Lexemes
Token    | Sample lexemes      | Informal description of pattern
const    | const               | const
if       | if                  | if
relation | <, <=, =, <>, >, >= | < or <= or = or <> or > or >=
id       | pi, count, x2       | letter followed by letters or digits
num      | 3.14, 25, 6.02E3    | any numeric constant
literal  | "core dumped"       | any characters between " and " except "


Alphabet, Strings and Languages
Alphabet ∑: any finite set of symbols
– The Vietnamese alphabet {a, á, à, ả, ã, ạ, b, c, d, đ, …}
– The binary alphabet {0, 1}
– The ASCII alphabet
String: a finite sequence of symbols drawn from ∑
– Length |s| of a string s: the number of symbols in s
– The empty string, denoted ε, has |ε| = 0
Language: any set of strings over ∑
– Two special cases: ∅, the empty set, and {ε}, the set containing only the empty string

Examples of Languages
∑ = {a, á, à, ả, ã, ạ, b, c, d, đ, …}
– The Vietnamese language
∑ = {0, 1}
– A string is an instruction
– The set of Pentium instructions
∑ = the ASCII set
– A string is a program
– The set of C programs

Terms (Fig. 3.7)
Term: Definition
– prefix of s: a string obtained by removing 0 or more trailing symbols of s; e.g. ban is a prefix of banana
– suffix of s: a string formed by deleting 0 or more leading symbols of s; e.g. na is a suffix of banana
– substring of s: a string obtained by deleting a prefix and a suffix from s; e.g. nan is a substring of banana
– proper prefix, suffix or substring of s: any nonempty string x that is, respectively, a prefix, suffix or substring of s such that s ≠ x

String operations
String concatenation
– If x and y are strings, xy is the string formed by appending y to x. E.g.: x = hom, y = nay ⇒ xy = homnay
– ε is the identity: εy = y; xε = x
String exponentiation
– s^0 = ε
– s^i = s^(i-1) s
E.g. s = 01: s^0 = ε, s^2 = 0101, s^3 = 010101
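The two string operations can be checked mechanically. The sketch below is only an illustration: the function name `power` is hypothetical, and Python's empty string `""` stands in for ε.

```python
def power(s: str, i: int) -> str:
    """Return s^i using the recursive definition s^0 = ε, s^i = s^(i-1) s."""
    if i == 0:
        return ""              # ε, the empty string
    return power(s, i - 1) + s


x, y = "hom", "nay"
print(x + y)                   # concatenation xy
print(power("01", 2))          # s^2 for s = 01
print("" + y == y, x + "" == x)  # ε is the identity for concatenation
```

Python's built-in `s * i` gives the same result; the recursive form is used here only because it mirrors the definition on the slide.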

Language Operations (Fig. 3.8)
Operation: Definition
– union, L ∪ M: L ∪ M = { s | s ∈ L or s ∈ M }
– concatenation, LM: LM = { st | s ∈ L and t ∈ M }
– Kleene closure, L*: L* = L^0 ∪ L ∪ LL ∪ LLL ∪ …, where L^0 = {ε}; 0 or more concatenations of L
– positive closure, L+: L+ = L ∪ LL ∪ LLL ∪ …; 1 or more concatenations of L
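For finite languages these operations map directly onto set operations. The following is a minimal sketch, assuming languages are represented as Python sets of strings; the helper names (`concat`, `lang_power`, `kleene_upto`) are made up for the illustration, and since L* is infinite whenever L contains a nonempty string, only a bounded approximation is computed.

```python
from itertools import product


def concat(L, M):
    """LM = { st | s in L and t in M }."""
    return {s + t for s, t in product(L, M)}


def lang_power(L, n):
    """L^0 = {ε}; L^n = L^(n-1) L."""
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result


def kleene_upto(L, k):
    """Finite approximation of L*: L^0 ∪ L^1 ∪ ... ∪ L^k."""
    out = set()
    for n in range(k + 1):
        out |= lang_power(L, n)
    return out


L = {"a", "b"}
D = {"0", "1"}
print(sorted(L | D))            # union
print(sorted(concat(L, D)))     # a letter followed by a digit
print(sorted(kleene_upto({"a"}, 3)))
```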

Examples
L = {A, B, …, Z, a, b, …, z}, D = {0, 1, …, 9}
Example   | Language
L ∪ D     | letters and digits
LD        | strings consisting of a letter followed by a digit
L^4       | all four-letter strings
L*        | all strings of letters, including ε
L(L ∪ D)* | all strings of letters and digits beginning with a letter
D+        | all strings of one or more digits

Regular Expressions (REs) over Alphabet ∑
Inductive base:
1. ε is a RE, denoting the regular language (RL) {ε}
2. a ∈ ∑ is a RE, denoting the RL {a}
Inductive step: suppose r and s are REs denoting the languages L(r) and L(s). Then
3. (r)|(s) is a RE, denoting the RL L(r) ∪ L(s)
4. (r)(s) is a RE, denoting the RL L(r)L(s)
5. (r)* is a RE, denoting the RL (L(r))*
6. (r) is a RE, denoting the RL L(r)

Precedence and Associativity
Precedence:
– "*" has the highest precedence
– concatenation has the second highest precedence
– "|" has the lowest precedence
Associativity:
– all are left-associative
E.g.: (a)|((b)*(c)) = a|b*c, so unnecessary parentheses can be removed

Example
∑ = {a, b}
1. a|b denotes {a, b}
2. (a|b)(a|b) denotes {aa, ab, ba, bb}
3. a* denotes {ε, a, aa, aaa, aaaa, …}
4. (a|b)* denotes ?
5. a|a*b denotes ?

Notational Shorthands
One or more instances, +: r+ = rr*
– denotes the language (L(r))+
– has the same precedence and associativity as *
Zero or one instance, ?: r? = r|ε
– denotes the language L(r) ∪ {ε}
Character classes
– [abc] denotes a|b|c
– [A-Z] denotes A|B|…|Z
– [a-zA-Z_][a-zA-Z0-9_]* denotes ?
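These shorthands carry over almost unchanged to practical regex libraries, so the identities can be spot-checked. The sketch below uses Python's standard `re` module, which supports more than the formal REs above; the helper names are hypothetical, and agreeing on a few sample strings is evidence for the identities, not a proof. The last pattern is the familiar shape of a C-style identifier.

```python
import re

# The character-class pattern from the slide.
IDENT = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_identifier(s: str) -> bool:
    """True if the whole string s matches the character-class pattern."""
    return IDENT.fullmatch(s) is not None


def same_on_samples(p: str, q: str, samples) -> bool:
    """Check that patterns p and q accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(p, s)) == bool(re.fullmatch(q, s)) for s in samples
    )


samples = ["", "a", "aa", "aaa", "b", "ab"]
print(same_on_samples(r"a+", r"aa*", samples))    # r+ = rr*
print(same_on_samples(r"a?", r"(a|)", samples))   # r? = r|ε
print(is_identifier("x2"), is_identifier("2x"))
```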


Overview
RE → NFA → DFA → minimal DFA (mDFA)

Nondeterministic finite automata
A nondeterministic finite automaton (NFA) is a mathematical model that consists of
– a finite set of states S
– a set of input symbols ∑
– a transition function move that maps each pair (state, symbol) to a set of states, where the symbol is in ∑ or is ε
– a start state s0
– a finite set of final or accepting states F

Transition graph
[Figure: legend showing how a state (A), a transition (A to B on a), the start state and a final state are drawn]

Transition table
State | Input symbol a | Input symbol b
0     | {0,1}          | {0}
1     | -              | {2}
2     | -              | {3}

Acceptance
An NFA accepts an input string x iff there is some path in the transition graph from the start state to some accepting state such that the edge labels along this path spell out x.
[Figure: example paths through states A and B on input 0 1 0, one of which gets stuck with an error]
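Acceptance can be tested without enumerating paths by tracking the set of all states the NFA could be in after each input symbol. The sketch below runs the transition table shown earlier; the table does not say which states are the start and accepting states, so taking 0 as the start state and 3 as the only accepting state is an assumption. That table has no ε-transitions, so no ε-closure step is needed here.

```python
# Transition table from the earlier slide; a missing entry means "no move".
MOVE = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0          # assumption: state 0 is the start state
ACCEPTING = {3}    # assumption: state 3 is the only accepting state


def accepts(x: str) -> bool:
    """True iff some path labelled x leads from START to an accepting state."""
    current = {START}
    for ch in x:
        nxt = set()
        for s in current:
            nxt |= MOVE.get((s, ch), set())
        current = nxt
    return bool(current & ACCEPTING)


for word in ["abb", "aabb", "ab", ""]:
    print(repr(word), accepts(word))
```

Under these assumptions the automaton accepts exactly the strings over {a, b} that end in abb.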

Deterministic finite automata
A deterministic finite automaton (DFA) is a special case of an NFA in which
1. no state has an ε-transition, and
2. for each state s and input symbol a, there is at most one edge labeled a leaving s.

Thompson’s construction of NFA from REs
The construction is guided by the syntactic structure of the RE r.
– For ε: i --ε--> f
– For a in ∑: i --a--> f
(i is a new start state and f a new final state.)

Thompson’s construction (cont’d)
Suppose N(s) and N(t) are NFAs for REs s and t
– For s|t, [figure]
– For st, [figure]
– For s*, [figure]
– For (s), use N(s) itself
[Figures: composite NFAs built from N(s) and N(t), each with a start state i, a final state f and ε-edges]
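Since the diagrams did not survive, here is a minimal sketch of the composition rules in code. The `NFA` class and function names are hypothetical; an edge is a triple (source, label, target) with label `None` standing for ε. One deliberate simplification: for st, the usual construction merges the final state of N(s) with the start state of N(t), whereas this sketch links them with an ε-edge, which accepts the same language.

```python
import itertools

_ids = itertools.count()   # supplies fresh state numbers


class NFA:
    def __init__(self, start, final, edges):
        self.start, self.final, self.edges = start, final, edges


def symbol(a):
    """NFA for a single symbol a, or for ε when a is None."""
    i, f = next(_ids), next(_ids)
    return NFA(i, f, [(i, a, f)])


def alt(s, t):
    """NFA for s|t: new i and f, ε-edges into and out of both operands."""
    i, f = next(_ids), next(_ids)
    extra = [(i, None, s.start), (i, None, t.start),
             (s.final, None, f), (t.final, None, f)]
    return NFA(i, f, s.edges + t.edges + extra)


def concat(s, t):
    """NFA for st (ε-link variant, see the note above)."""
    return NFA(s.start, t.final, s.edges + t.edges + [(s.final, None, t.start)])


def star(s):
    """NFA for s*: allow skipping N(s) entirely or looping back through it."""
    i, f = next(_ids), next(_ids)
    extra = [(i, None, s.start), (s.final, None, f),
             (s.final, None, s.start), (i, None, f)]
    return NFA(i, f, s.edges + extra)


def eps_closure(nfa, states):
    """All states reachable from `states` using ε-edges only."""
    stack, seen = list(states), set(states)
    while stack:
        u = stack.pop()
        for src, label, dst in nfa.edges:
            if src == u and label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(nfa, x):
    """Simulate the NFA on x by tracking the set of reachable states."""
    cur = eps_closure(nfa, {nfa.start})
    for ch in x:
        moved = {dst for src, label, dst in nfa.edges
                 if src in cur and label == ch}
        cur = eps_closure(nfa, moved)
    return nfa.final in cur


# (a|b)*abb
abb = concat(symbol("a"), concat(symbol("b"), symbol("b")))
n = concat(star(alt(symbol("a"), symbol("b"))), abb)
print(accepts(n, "babb"), accepts(n, "ab"))
```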


Subset construction
Operation: Description
– ε-closure(s): set of NFA states reachable from state s on ε-transitions alone
– ε-closure(T): set of NFA states reachable from some state s in T on ε-transitions alone
– move(T, a): set of NFA states to which there is a transition on input a from some state s in T
Here s is an NFA state and T is a set of NFA states.

Subset construction (cont’d)
Let s0 be the start state of the NFA.
Initially, ε-closure(s0) is the only state in Dstates, and it is unmarked.
while there is an unmarked state T in Dstates do begin
    mark T;
    for each input symbol a do begin
        U := ε-closure(move(T, a));
        if U is not in Dstates then
            add U as an unmarked state to Dstates;
        DTran[T, a] := U;
    end;
end;

DFA
Let (∑, S, T, F, s0) be the original NFA. The DFA is:
– The alphabet: ∑
– The states: all states in Dstates
– The transitions: DTran
– The accepting states: all states in Dstates containing at least one accepting state in F of the NFA
– The start state: ε-closure(s0)
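The subset construction and the resulting DFA can be sketched as follows. The representation is an assumption made for the example: the NFA's moves are a dictionary keyed by (state, symbol), with `None` as the symbol for ε, and each DFA state is a `frozenset` of NFA states. The sample NFA is the transition table shown earlier, again assuming start state 0 and accepting state 3.

```python
def eps_closure(states, move):
    """NFA states reachable from `states` on ε-transitions alone."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in move.get((s, None), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction(move, start, accepting, alphabet):
    d_start = eps_closure({start}, move)
    dstates, unmarked, dtran = {d_start}, [d_start], {}
    while unmarked:
        T = unmarked.pop()                  # mark T
        for a in alphabet:
            moved = set()
            for s in T:                     # move(T, a)
                moved |= set(move.get((s, a), ()))
            U = eps_closure(moved, move)
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    # A DFA state accepts if it contains at least one accepting NFA state.
    d_accepting = {T for T in dstates if T & accepting}
    return dstates, dtran, d_start, d_accepting


NFA_MOVE = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
dstates, dtran, d0, dacc = subset_construction(NFA_MOVE, 0, {3}, "ab")
for (T, a), U in sorted(dtran.items(), key=lambda kv: (sorted(kv[0][0]), kv[0][1])):
    print(sorted(T), a, "->", sorted(U))
```

If move(T, a) is empty for some T and a, the empty set shows up as an ordinary DFA state that acts as a dead state; that does not happen for this particular NFA.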


Minimise a DFA
Initially, create two groups of states:
1. one is the set of all final states: F
2. the other is the set of all non-final states: S - F
while (more splits are possible) {
    Let G = {s1, …, sn} be a group and c be any character in ∑
    Let t1, …, tn be the successor states of s1, …, sn under c
    if (t1, …, tn do not all belong to the same group) {
        Split G into new groups so that si and sj remain in the same
        group iff ti and tj are in the same group
    }
}
Each group that remains becomes one state of the minimal DFA.

Example
[Figure: DFA with states A, B, C, D, E and edges labeled a and b]
Step 1: {A,B,C,D} {E}
    For a: {B,B,B,B}
    For b: {C,D,C,E}
    Split: {A,B,C} {D} {E}
Step 2:
    For b: {C,D,C}
    Split: {A,C} {B} {D} {E}
Step 3:
    For a: {B,B}
    For b: {C,C}
    Terminate
[Figure: minimised DFA with states A, B, D, E]
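The refinement loop can be sketched in a few lines and run on this example. Two caveats: the transitions for A to D are read off the successor lists in the steps above, but the moves out of E are not recoverable from the transcript, so E going to B on a and to C on b is an assumption; and this variant splits a group on all input symbols at once rather than one symbol at a time, which reaches the same final partition. The names `minimise`, `group_index` and `DELTA` are hypothetical.

```python
def group_index(partition, state):
    """Index of the group in `partition` that contains `state`."""
    for i, group in enumerate(partition):
        if state in group:
            return i
    raise KeyError(state)


def minimise(states, alphabet, delta, finals):
    """Return the coarsest partition of `states` into equivalent groups."""
    partition = [g for g in (set(finals), set(states) - set(finals)) if g]
    changed = True
    while changed:
        changed = False
        refined = []
        for group in partition:
            # States stay together iff, for every symbol, their successors
            # fall into the same group of the current partition.
            buckets = {}
            for s in group:
                key = tuple(group_index(partition, delta[(s, c)]) for c in alphabet)
                buckets.setdefault(key, set()).add(s)
            refined.extend(buckets.values())
            if len(buckets) > 1:
                changed = True
        partition = refined
    return partition


DELTA = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",   # assumption, see the note above
}
groups = minimise("ABCDE", "ab", DELTA, {"E"})
print(sorted("".join(sorted(g)) for g in groups))
```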


Input Buffering
[Figure: a two-half input buffer holding "begin…" and ending in eof, read by the scanner]
if (forward at end of first half) {
    reload second half
    forward++
} else if (forward at end of second half) {
    reload first half
    forward = 0
} else
    forward++

Input Buffering (with sentinels)
[Figure: the same two-half buffer with an eof sentinel at the end of each half]
forward = forward + 1
if (forward↑ = eof) {
    if (forward at end of first half) {
        reload second half
        forward++
    } else if (forward at end of second half) {
        reload first half
        forward = 0
    } else
        terminate the analysis
}
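The sentinel scheme can be tried out with a small simulation. Everything here is an assumption made for illustration: the class name `DoubleBuffer`, the tiny half size, the use of `"\0"` as the eof sentinel (so the source text must not contain that character), and a layout of n characters, a sentinel, n characters, a sentinel. Only the forward pointer is modelled; the lexeme-beginning pointer is left out.

```python
import io

EOF = "\0"   # sentinel; assumes "\0" never occurs in the source text


class DoubleBuffer:
    def __init__(self, stream, n=4):
        self.stream, self.n = stream, n
        # Layout: [first half: n chars][EOF][second half: n chars][EOF]
        self.buf = [EOF] * (2 * n + 2)
        self._load(0)
        self.forward = -1

    def _load(self, base):
        """Fill one half starting at `base`, padding a short read with EOF."""
        chunk = self.stream.read(self.n)
        for i in range(self.n):
            self.buf[base + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        """Advance forward; return the next character, or None at end of input."""
        self.forward += 1
        if self.buf[self.forward] == EOF:          # the only test on the fast path
            if self.forward == self.n:             # sentinel ending the first half
                self._load(self.n + 1)
                self.forward += 1
            elif self.forward == 2 * self.n + 1:   # sentinel ending the second half
                self._load(0)
                self.forward = 0
            if self.buf[self.forward] == EOF:      # eof inside a half: real end
                return None
        return self.buf[self.forward]


def read_all(text, n=4):
    buf, out = DoubleBuffer(io.StringIO(text), n), []
    while (c := buf.next_char()) is not None:
        out.append(c)
    return "".join(out)


print(read_all("begin x := 1"))
```

The point of the sentinel is visible in `next_char`: for an ordinary character only one comparison is made, and the end-of-half checks run only when an eof is actually seen.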

Transition Diagrams
relop → < | <= | <>
    0 --'<'--> 1
    1 --'='--> 2: return(relop, LE)
    1 --'>'--> 3: return(relop, NE)
    1 --other--> 4: return(relop, LT)
id → letter (letter | digit)*
    5 --letter--> 6
    6 --letter or digit--> 6
    6 --other--> 7: return(id, lexeme)
A transition diagram is a DFA in which no edge leaves a final state.

Implementation
token nexttoken() {
    while (true) {
        switch (state) {
        case 0:  // start of the relop diagram
            c = nextchar();
            if (c == '<') state = 1;
            else state = fail(0);
            break;
        case 1:
            c = nextchar();
            if (c == '=') state = 2;
            else if (c == '>') state = 3;
            else state = 4;
            break;
        case 2: retract(0); return new Token(relop, "<=");
        case 3: retract(0); return new Token(relop, "<>");  // not on the original slide; added to match the relop diagram
        case 4: retract(1); return new Token(relop, "<");
        case 5:  // start of the id diagram
            c = nextchar();
            if (Character.isLetter(c)) state = 6;
            else state = fail(5);
            break;
        case 6:
            c = nextchar();
            if (Character.isLetter(c) || Character.isDigit(c)) continue;
            else state = 7;
            break;
        case 7: retract(1); return new Token(id, getLexeme());
        }
    }
}

Implementation (cont’d)
int fail(int current_state) {
    forward = beginning;            // rewind to the start of the lexeme
    switch (current_state) {
    case 0: return 5;               // relop failed: try the id diagram
    case 5: error();                // no diagram matches
    }
    return -1;                      // assumed unreachable: error() is expected to abort
}
void retract(int flag) {
    if (flag == 1) move forward back
    get lexeme from beginning to forward
    move forward onward
    beginning = forward
    state = 0
}
Input buffer: b│e│g│i│n│:│=│ │ │…
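For experimenting with the two diagrams, the state machine can be condensed into one runnable function. This is a sketch under several assumptions, not the slides' implementation: the input is a plain string indexed by `pos` instead of a buffer with beginning and forward pointers, the function name and the `(token, lexeme, new_pos)` return shape are made up, states 2, 3, 4 and 7 are folded into the returns, retracting is expressed by not advancing `pos`, and white space is not skipped.

```python
def next_token(text: str, pos: int):
    """Return (token, lexeme, new_pos) for the token starting at text[pos]."""
    state, start = 0, pos
    while True:
        # At end of input "" plays the role of "other".
        c = text[pos] if pos < len(text) else ""
        if state == 0:                       # start of the relop diagram
            if c == "<":
                state, pos = 1, pos + 1
            else:
                state = 5                    # fail(0): try the id diagram
        elif state == 1:
            if c == "=":
                return ("relop", "<=", pos + 1)          # state 2
            if c == ">":
                return ("relop", "<>", pos + 1)          # state 3
            return ("relop", "<", pos)                   # state 4: retract
        elif state == 5:                     # start of the id diagram
            if c.isalpha():
                state, pos = 6, pos + 1
            else:
                raise ValueError(f"no token starts at position {start}")
        elif state == 6:
            if c.isalnum():
                pos += 1                     # stay in state 6
            else:
                return ("id", text[start:pos], pos)      # state 7: retract


print(next_token("<=x", 0))
print(next_token("count1 < 2", 0))
```

Note that `str.isalpha` and `str.isalnum` also accept non-ASCII letters and digits, which is broader than the letter and digit classes on the slides.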
