Compilers: Lexical Analysis



while (y < z) { int x = a + b; y += x; }

Lexemes vs Tokens
• Input stream: while (y < z) { int x = a + b; y += x; }
• Token stream: T_While T_LeftParen T_Identifier y T_Less T_Identifier z T_RightParen T_OpenBrace T_Int T_Identifier x T_Assign T_Identifier a T_Plus T_Identifier b T_Semicolon T_Identifier y T_PlusAssign T_Identifier x T_Semicolon T_CloseBrace
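The mapping from input stream to token stream can be sketched in a few lines of Python. This is only an illustration: the token names come from the slide, but the patterns, their ordering (keywords before identifiers, `+=` before `+`), and the `tokenize` helper are assumptions, not the deck's scanner.

```python
import re

# Token names follow the slide; the patterns are illustrative assumptions.
# Order matters: Python tries alternatives left to right, so keywords come
# before identifiers and "+=" comes before "+".
TOKEN_SPEC = [
    ("T_While", r"while\b"),
    ("T_Int", r"int\b"),
    ("T_Identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("T_PlusAssign", r"\+="),
    ("T_Plus", r"\+"),
    ("T_Less", r"<"),
    ("T_Assign", r"="),
    ("T_LeftParen", r"\("),
    ("T_RightParen", r"\)"),
    ("T_OpenBrace", r"\{"),
    ("T_CloseBrace", r"\}"),
    ("T_Semicolon", r";"),
    ("SKIP", r"\s+"),  # whitespace is discarded, not turned into a token
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token name, lexeme) pairs for `source`."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for name, lexeme in tokenize("while (y < z) { int x = a + b; y += x; }"):
    print(name, lexeme)
```

Each pair carries both the token (the lexical class) and the lexeme (the matched text), which is the distinction the slide draws.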

Goals of Lexical Analysis (Scanner)
• Convert the physical description of a program into a sequence of tokens.
• Each token is associated with a lexeme.
• Each token may have optional attributes.
• The token stream will be used in the parser to recover the program structure.
[Figure: source → Scanner → tokens → Parser]

Challenges in Lexical Analysis
• How to partition the program into lexemes?
• How to label each lexeme correctly?

Defining a Lexical Analysis
• Define a set of tokens.
• Define the set of lexemes associated with each token.
• Define an algorithm for resolving conflicts that arise in the lexemes.
• Lexical analysis often consumes a surprising amount of the compiler's total execution time.

Choosing Tokens
• What tokens are useful here?
  for (int k = 0; k < myArray[5]; ++k) { cout << k << endl; }
• Tokens: For, Int, Identifier, IntConstant, <<, ++, =, <, ;, (, ), [, ], {, }

Choosing Good Tokens
• Give keywords their own tokens.
• Give different operators and punctuation symbols their own tokens.
• Discard irrelevant information (whitespace, comments).

Tokens in Programming Languages
• Operators & punctuation
  ● + - * / ( ) { } [ ] ; : :: < <= == = != ! …
  ● Each of these is a distinct lexical class
• Keywords
  ● if while for goto return switch void …
  ● Each of these is also a distinct lexical class (not a string)
• Identifiers
  ● A single ID lexical class, but parameterized by the actual id
• Integer constants
  ● A single INT lexical class, but parameterized by the int value
• Other constants, etc.

Defining Sets of Strings

String Terminology
• An alphabet Σ is a set of characters.
• A string over Σ is a finite sequence of elements from Σ.
• Example:
  ● Σ = {☺, ☼}
  ● Valid strings include ☺, ☺☺, ☺☼, ☺☺☼, etc.
  ● The empty string of no characters is denoted ε.

Languages
• A language is a set of strings.
• Example: Even numbers
  ● Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
  ● L = {0, 2, 4, 6, 8, 10, 12, 14, …}
• Example: C variable names
  ● Σ = ASCII characters
  ● L = {a, b, c, …, A, B, C, …, _, aa, ab, …}

Regular Languages
• The regular languages are the subset of all languages that can be defined by regular expressions.
• Any character is a regular expression matching itself (a is a regular expression for the character a).
• ε is a regular expression matching the empty string.

Operations on Regular Expressions
• If R1 and R2 are two regular expressions, then:
  ● R1R2 is a regular expression matching the concatenation of the two languages.
  ● R1 | R2 is a regular expression matching the union of the two languages.
  ● R1* is a regular expression matching the Kleene closure of the language (0 or more occurrences).
  ● (R) is a regular expression matching the same language as R.

Example  Example 3.4: Let Σ = {a, b}. (page 122) 1. The regular expression a|b denotes the language {a, b}. 2. (a|b)(a|b) denotes {aa, ab, ba, bb}, the language of all strings of length two over the alphabet Σ. Another regular expression for the same language is aa|ab|ba|b. 3. a* denotes the language consisting of all strings of zero or more a's, that is, {ε,a,aa,aaa,...} 15

Example  Example 3.4: Let Σ = {a, b}. (page 122) 4. (a|b)* denotes the set of all strings consisting of zero or more instances of a or b, that is, all strings of a's and b's: {e,a, b,aa, ab, ba, bb,aaa,...}. Another regular expression for the same language is (a*b*)* 5. a|a*b denotes the language {a, b, ab, aab, aaab,...}, that is, the string a and all strings consisting of zero or more a's and ending in b. 16

Abbreviations
• The basic operations generate all possible regular expressions, but there are common abbreviations used for convenience. Typical examples:

Abbr.     Meaning        Notes
r+        (rr*)          1 or more occurrences
r?        (r | ε)        0 or 1 occurrence
[a-z]     (a|b|…|z)      1 character in the given range
[abxyz]   (a|b|x|y|z)    1 of the given characters

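More Examples

re                      Meaning
[abc]+                  one or more characters, each a, b, or c
[abc]*                  zero or more characters, each a, b, or c
[0-9]+                  one or more decimal digits
[1-9][0-9]*             a nonzero digit followed by zero or more digits
[a-zA-Z][a-zA-Z0-9_]*   a letter followed by zero or more letters, digits, or underscores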

Regular Definitions
• For notational convenience, we may give names to certain regular expressions and use those names in subsequent expressions, as if the names were themselves symbols.

Regular Definitions
• Example: Even numbers
  (+|-|ε) (0|1|2|3|4|5|6|7|8|9)* (0|2|4|6|8)
• Regular definitions:
  Sign → + | -
  OptSign → Sign | ε            (Sign?)
  Digit → [0-9]                 (0 | 1 | … | 9)
  EvenDigit → [02468]           (0 | 2 | 4 | 6 | 8)
  EvenNumber → OptSign Digit* EvenDigit
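Regular definitions translate naturally into named pattern fragments that are composed into larger patterns. The sketch below mirrors the EvenNumber definitions with Python string constants; the constant names and the `is_even_number` helper are illustrative choices, not part of the slides.

```python
import re

# Each constant corresponds to one regular definition from the slide.
SIGN = r"[+-]"                # Sign      -> + | -
OPT_SIGN = f"(?:{SIGN})?"     # OptSign   -> Sign | ε
DIGIT = r"[0-9]"              # Digit     -> [0-9]
EVEN_DIGIT = r"[02468]"       # EvenDigit -> [02468]
# EvenNumber -> OptSign Digit* EvenDigit
EVEN_NUMBER = re.compile(f"{OPT_SIGN}{DIGIT}*{EVEN_DIGIT}")


def is_even_number(text):
    """True if the whole of `text` is in the EvenNumber language."""
    return EVEN_NUMBER.fullmatch(text) is not None


for sample in ["42", "-8", "+130", "7", ""]:
    print(repr(sample), is_even_number(sample))
```

Note that the definition accepts leading zeros (for example `008`), exactly as the regular expression on the slide does.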

More Examples  Example 3.5: (page 123) C identifiers are strings of letters, digits, and underscores. Here is a regular definition for the language of C identifiers. We shall conventionally use italics for the symbols defined in regular definitions. letter_ → A|B|….|Z|a|b|….|z |_ digit → 0 | 1 | …… | 9 id → letter _( letter_| digit )* 21

More Examples  Example 3.6: (page 123) Unsigned numbers (integer or floating point) are strings such as 5280, , 6.336E4, or 1.89E-4. The regular definitions: digit → 0 | 1 |..... | 9 digits → digit digit* optionalFraction →.digits | ε optionalExponent →( E ( + | - | ε ) digits ) | ε number → digits optionalFraction optionalExponent 22

More Examples
• Example 3.7 (page 124): We can rewrite the regular definition of Example 3.5 as:
  letter_ → [A-Za-z_]
  digit → [0-9]
  id → letter_ ( letter_ | digit )*
• The regular definition of Example 3.6 can also be simplified:
  digit → [0-9]
  digits → digit+
  number → digits (. digits)? ( E [+-]? digits )?
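The simplified definitions of Example 3.7 map almost one-to-one onto Python regular expressions. The sketch below is an illustration under that assumption; the constant names are made up for this example, and the only syntactic change is that the literal dot must be escaped as `\.`.

```python
import re

LETTER_ = r"[A-Za-z_]"   # letter_ -> [A-Za-z_]
DIGIT = r"[0-9]"         # digit   -> [0-9]
DIGITS = f"{DIGIT}+"     # digits  -> digit+

# id -> letter_ ( letter_ | digit )*
ID = re.compile(f"{LETTER_}(?:{LETTER_}|{DIGIT})*")
# number -> digits (. digits)? ( E [+-]? digits )?
NUMBER = re.compile(rf"{DIGITS}(?:\.{DIGITS})?(?:E[+-]?{DIGITS})?")

for text in ["5280", "6.336E4", "1.89E-4", ".5", "1.", "1E"]:
    print(text, bool(NUMBER.fullmatch(text)))

for text in ["x", "_tmp1", "myArray", "9lives"]:
    print(text, bool(ID.fullmatch(text)))
```

Strings such as `.5` and `1.` are rejected because the definition requires digits on both sides of the decimal point.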

Implementing Regular Expressions
• Regular expressions can be implemented using finite automata.
• There are two kinds of finite automata:
  ● NFAs (nondeterministic finite automata)
  ● DFAs (deterministic finite automata)
• The step of implementing the lexical analyzer

Finite State Automaton
• A finite set of states
  ● One marked as the initial state
  ● One or more marked as final states
  ● States sometimes labeled or numbered
• A set of transitions from state to state
  ● Each labeled with a symbol from Σ, or ε

Finite State Automaton
• Operates by reading input symbols (usually characters)
• A transition can be taken if it is labeled with the current symbol
• An ε-transition can be taken at any time
• Accept when a final state is reached and there is no more input
• Reject if no transition is possible, or if there is no more input and the automaton is not in a final state (DFA)

Example: FSA for “cat”
[Figure: Start state, then transitions labeled c, a, t in sequence, ending in the Accept state]

A Simple FSA


Example  Example 3.14: (page 148) The transition graph for an FSA recognizing the language of regular expression (a|b)*abb is shown in Fig

Transition Tables
• We can also represent an FSA by a transition table, whose rows correspond to states, and whose columns correspond to the input symbols and ε.
• The entry for a given state and input is the value of the transition function applied to those arguments.
• If the transition function has no information about that state-input pair, we put ∅ in the table for the pair.

• Example 3.15 (page 149): The transition table for the FSA of Example 3.14 is:

Acceptance of Input Strings by Automata
• Example 3.16 (page 149): The string aabb is accepted by the FSA of the previous example. The path labeled by aabb from state 0 to state 3 demonstrates this fact. [Figure: path from state 0 to state 3 with edges labeled a, a, b, b]
• Note that several paths labeled by the same string may lead to different states. For instance, another path labeled by aabb leads to state 0, which is not accepting. [Figure: path ending in state 0 with edges labeled a, a, b, b]
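Acceptance by a nondeterministic automaton can be simulated by tracking the set of states reachable after each input symbol. The transition table in the sketch below is an assumption: it is the usual four-state automaton for (a|b)*abb with start state 0 and accepting state 3, chosen because it is consistent with the two paths described above, but the slide's own figure and table are not reproduced in this transcript. A missing dictionary entry plays the role of ∅.

```python
# Assumed transition table for (a|b)*abb: rows are states, columns are symbols.
NFA = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
    3: {},
}
START = 0
ACCEPTING = {3}


def accepts(word):
    """True if some path labeled by `word` leads from START to an accepting state."""
    current = {START}
    for symbol in word:
        # Follow every transition on `symbol` from every state we might be in.
        current = {t for s in current for t in NFA[s].get(symbol, set())}
    return bool(current & ACCEPTING)


for word in ["aabb", "abb", "aab", ""]:
    print(repr(word), accepts(word))
```

The string aabb is accepted because at least one path ends in state 3, even though another path labeled by the same string stays in state 0.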

From Regular Expressions to NFAs
• Associate each regular expression with an NFA with the following properties:
  ● There is exactly one accepting state.
  ● There are no transitions out of the accepting state.
  ● There are no transitions into the starting state.

Base Cases

Construction for st

Construction for s|t

Construction for s*
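The four constructions (base case, st, s|t, s*) can be sketched as a small builder that returns a (start, accept) pair for each sub-expression. This is a hedged illustration, not the slides' figures: the `Builder` class and `accepts` function are hypothetical names, and concatenation links the two fragments with an ε-edge instead of merging states as the textbook drawing does. The three properties (one accepting state, nothing leaving it, nothing entering the start state) still hold for every fragment.

```python
from itertools import count

EPS = None  # label used for ε-transitions


class Builder:
    def __init__(self):
        self.ids = count()
        self.edges = []  # (source, label, target) triples

    def _new_pair(self):
        return next(self.ids), next(self.ids)

    def symbol(self, ch):          # base case: a single character
        start, accept = self._new_pair()
        self.edges.append((start, ch, accept))
        return start, accept

    def concat(self, s, t):        # construction for st
        self.edges.append((s[1], EPS, t[0]))
        return s[0], t[1]

    def union(self, s, t):         # construction for s|t
        start, accept = self._new_pair()
        self.edges += [(start, EPS, s[0]), (start, EPS, t[0]),
                       (s[1], EPS, accept), (t[1], EPS, accept)]
        return start, accept

    def star(self, s):             # construction for s*
        start, accept = self._new_pair()
        self.edges += [(start, EPS, s[0]), (start, EPS, accept),
                       (s[1], EPS, s[0]), (s[1], EPS, accept)]
        return start, accept


def accepts(builder, nfa, word):
    """Simulate the NFA fragment `nfa` = (start, accept) on `word`."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            state = stack.pop()
            for src, label, dst in builder.edges:
                if src == state and label is EPS and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    current = closure({nfa[0]})
    for ch in word:
        current = closure({dst for src, label, dst in builder.edges
                           if src in current and label == ch})
    return nfa[1] in current


b = Builder()
a_or_b_star = b.star(b.union(b.symbol("a"), b.symbol("b")))
nfa = b.concat(b.concat(b.concat(a_or_b_star, b.symbol("a")), b.symbol("b")), b.symbol("b"))
print(accepts(b, nfa, "aabb"), accepts(b, nfa, "ab"))
```

State numbers produced by this sketch will not match the numbering used in the slides' figures.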

Example  Construct the NFS for the regular expression (a l b)*abb  NFA for a l b 44

Example  NFA for (a l b)* 45

Example  NFA for (alb)*abb 46

Speeding Up The Scanner
• A DFA is like an NFA, but with tighter restrictions:
  ● Every state must have exactly one transition defined for every letter.
  ● ε-moves are not allowed.

From NFA to DFA
• The subset construction builds a transition table Dtran for the DFA.
• Each state of the DFA is a set of NFA states.
• In the operations ε-closure(s), ε-closure(T), and move(T, a), note that s is a single NFA state, while T is a set of NFA states.

From NFA to DFA
• Example 3.21 (page 154): For the NFA accepting (a | b)*abb, ε-closure(0) = {0, 1, 2, 4, 7} = A, since these are exactly the states reachable from state 0 via a path all of whose edges have label ε.
• Note that a path can have zero edges, so state 0 is reachable from itself by an ε-labeled path.

From NFA to DFA
• The input alphabet is {a, b}. Thus, our first step is to mark A and compute
  ● Dtran[A, a] = ε-closure(move(A, a))
  ● Dtran[A, b] = ε-closure(move(A, b))
• Among the states 0, 1, 2, 4, and 7, only 2 and 7 have transitions on a, to 3 and 8, respectively. Thus, move(A, a) = {3, 8}. Also, ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8}, so we conclude Dtran[A, a] = ε-closure(move(A, a)) = ε-closure({3, 8}) = B

From NFA to DFA
• Now, we must compute Dtran[A, b]. Among the states in A, only 4 has a transition on b, and it goes to 5. Thus, Dtran[A, b] = ε-closure(move(A, b)) = ε-closure({5}) = {1, 2, 4, 5, 6, 7} = C
• B = {1, 2, 3, 4, 6, 7, 8}
• Now, we must compute Dtran[B, a]. Dtran[B, a] = ε-closure(move(B, a)) = ε-closure({3, 8}) = B
• Now, we must compute Dtran[B, b]. Dtran[B, b] = ε-closure(move(B, b)) = ε-closure({5, 9}) = {1, 2, 4, 5, 6, 7, 9} = D

From NFA to DFA
• C = {1, 2, 4, 5, 6, 7}
• Now, we must compute Dtran[C, a]. Dtran[C, a] = ε-closure(move(C, a)) = ε-closure({3, 8}) = B
• Now, we must compute Dtran[C, b]. Dtran[C, b] = ε-closure(move(C, b)) = ε-closure({5}) = C
• D = {1, 2, 4, 5, 6, 7, 9}
• Now, we must compute Dtran[D, a]. Dtran[D, a] = ε-closure(move(D, a)) = ε-closure({3, 8}) = B
• Now, we must compute Dtran[D, b]. Dtran[D, b] = ε-closure(move(D, b)) = ε-closure({5, 10}) = {1, 2, 4, 5, 6, 7, 10} = E

From NFA to DFA
• E = {1, 2, 4, 5, 6, 7, 10}
• Now, we must compute Dtran[E, a]. Dtran[E, a] = ε-closure(move(E, a)) = ε-closure({3, 8}) = B
• Now, we must compute Dtran[E, b]. Dtran[E, b] = ε-closure(move(E, b)) = ε-closure({5}) = C
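The hand computation of Dtran can be reproduced with a short subset-construction sketch. The NFA edges below are an assumption: the figure is not part of this transcript, so they are reconstructed from the closures and moves quoted in the example (for instance, ε-closure(0) = {0, 1, 2, 4, 7}, with a-transitions from 2 to 3 and from 7 to 8). The function names are illustrative.

```python
EPS = "ε"

# Assumed edges of the NFA for (a|b)*abb, states 0..10, reconstructed from the text.
NFA = {
    0: {EPS: {1, 7}},
    1: {EPS: {2, 4}},
    2: {"a": {3}},
    3: {EPS: {6}},
    4: {"b": {5}},
    5: {EPS: {6}},
    6: {EPS: {1, 7}},
    7: {"a": {8}},
    8: {"b": {9}},
    9: {"b": {10}},
    10: {},
}


def eps_closure(states):
    """All NFA states reachable from `states` using only ε-edges."""
    stack, closure = list(states), set(states)
    while stack:
        for target in NFA[stack.pop()].get(EPS, ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return frozenset(closure)


def move(states, symbol):
    """NFA states reachable from `states` by one edge labeled `symbol`."""
    return {t for s in states for t in NFA[s].get(symbol, ())}


def subset_construction(start, alphabet):
    start_set = eps_closure({start})
    names = {start_set: "A"}      # DFA state name for each set of NFA states
    dtran, unmarked = {}, [start_set]
    while unmarked:
        T = unmarked.pop(0)       # "mark" T
        for symbol in alphabet:
            U = eps_closure(move(T, symbol))
            if U not in names:    # an empty U would become a dead state
                names[U] = chr(ord("A") + len(names))
                unmarked.append(U)
            dtran[names[T], symbol] = names[U]
    return names, dtran


names, dtran = subset_construction(0, "ab")
for state_set, name in names.items():
    print(name, sorted(state_set), dtran[name, "a"], dtran[name, "b"])
```

Processing the unmarked sets in first-in, first-out order gives the same A to E naming as the worked example.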

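From NFA to DFA
• Transition table Dtran for the DFA:

NFA states               DFA state   a   b
{0, 1, 2, 4, 7}          A           B   C
{1, 2, 3, 4, 6, 7, 8}    B           B   D
{1, 2, 4, 5, 6, 7}       C           B   C
{1, 2, 4, 5, 6, 7, 9}    D           B   E
{1, 2, 4, 5, 6, 7, 10}   E           B   C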
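Once Dtran is known, running the DFA is a single table lookup per input character, which is why the scanner is fast. The sketch below hard-codes the transitions computed in the preceding steps (A on a goes to B, A on b goes to C, and so on) and assumes E is the only accepting state because it is the only set containing NFA state 10; the function name is illustrative.

```python
# Dtran as computed in the worked example: DTRAN[state][symbol] -> next state.
DTRAN = {
    "A": {"a": "B", "b": "C"},
    "B": {"a": "B", "b": "D"},
    "C": {"a": "B", "b": "C"},
    "D": {"a": "B", "b": "E"},
    "E": {"a": "B", "b": "C"},
}


def dfa_accepts(word, start="A", accepting=frozenset({"E"})):
    """Run the DFA on `word`; symbols outside {a, b} raise KeyError."""
    state = start
    for symbol in word:
        state = DTRAN[state][symbol]  # exactly one move per symbol, no backtracking
    return state in accepting


for word in ["abb", "aabb", "babb", "ab", "abba"]:
    print(word, dfa_accepts(word))
```

Unlike the NFA simulation, no set of states needs to be tracked: the current state is always a single table row.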

From NFA to DFA
• The DFA for (a | b)*abb

Minimizing the Number of States of a DFA
• Example 3.40 (page 183): Let us reconsider the previous DFA.
• The initial partition consists of the two groups {A, B, C, D} and {E}, which are respectively the nonaccepting states and the accepting states.
• The group {E} cannot be split, because it has only one state.
• The other group {A, B, C, D} can be split, so we must consider the effect of each input symbol.

Minimizing the Number of States of a DFA
• On input a, each of these states goes to state B, so there is no way to distinguish these states using strings that begin with a.
• On input b, states A, B, and C go to members of group {A, B, C, D}, while state D goes to E, a member of another group. So, group {A, B, C, D} is split into {A, B, C} and {D}.
• For this round we now have the groups {A, B, C}, {D}, {E}.

Minimizing the Number of States of a DFA
• In the next round, we can split {A, B, C} into {A, C} and {B}, since A and C each go to a member of {A, B, C} on input b, while B goes to a member of another group, {D}.
• We cannot split the one remaining group with more than one state, since A and C each go to the same state (and therefore to the same group) on each input.
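The rounds of splitting described above can be sketched as a partition-refinement loop: in each round, states in the same group stay together only if every input symbol sends them to the same group. The table below repeats the DFA transitions derived earlier; the `minimize` function is an illustrative sketch that refines all groups at once per round, which for this DFA produces the same sequence of partitions as the text.

```python
DTRAN = {
    "A": {"a": "B", "b": "C"},
    "B": {"a": "B", "b": "D"},
    "C": {"a": "B", "b": "C"},
    "D": {"a": "B", "b": "E"},
    "E": {"a": "B", "b": "C"},
}
ACCEPTING = {"E"}


def minimize(dtran, accepting, alphabet):
    """Return the final partition of DFA states into indistinguishable groups."""
    # Initial partition: nonaccepting states and accepting states.
    partition = [set(dtran) - accepting, set(accepting)]
    while True:
        def group_of(state):
            return next(i for i, group in enumerate(partition) if state in group)

        refined = []
        for group in partition:
            buckets = {}
            for state in sorted(group):
                # Signature: which group each input symbol leads to.
                signature = tuple(group_of(dtran[state][c]) for c in alphabet)
                buckets.setdefault(signature, set()).add(state)
            refined.extend(buckets.values())
        if len(refined) == len(partition):  # no group was split: done
            return refined
        partition = refined


groups = minimize(DTRAN, ACCEPTING, "ab")
print(sorted(sorted(group) for group in groups))
```

The result merges A and C into one state, leaving a four-state DFA.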

Minimizing the Number of States of a DFA
• The minimum-state DFA will be:
[Figure: transition graph over states A, B, D, E with edges labeled a and b]

Minimizing the Number of States of a DFA
• The transition table of the minimum-state DFA will be:

State   a   b
A       B   A
B       B   D
D       B   E
E       B   A