Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science

2 List of Acronyms RE - regular expression FSM - finite state machine NFA - non-deterministic finite automaton DFA - deterministic finite automaton

3 The Structure of a Compiler (diagram: program text → tokenizer → token stream → parser → interm. rep. → code generator → machine code; the tokenizer and parser together form the syntactic analyzer)

4 Purpose of Lexical Analysis Converts a character stream into a token stream (diagram: the tokenizer reading the character stream "int main(void) { for (int i = 0; i < 10; i++) {..." and emitting tokens)

5 How the Tokenizer is Used Usually the tokenizer is used by the parser, which calls the getNextToken() function when it wants another token Often the tokenizer also includes a pushBack() function for putting the token back (so it can be read again) (diagram: program text → tokenizer → parser, with the parser requesting each token via getNextToken())
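A minimal sketch of this interface in Java (illustrative only — the Token and Tokenizer names and the use of a deque for pushed-back tokens are assumptions, not taken from the course code):

import java.util.ArrayDeque;
import java.util.Deque;

// A token pairs a token name with its value (usually the lexeme).
class Token {
    final String name;   // e.g., "VARIABLE"
    final String value;  // e.g., "myCounter"
    Token(String name, String value) { this.name = name; this.value = value; }
}

abstract class Tokenizer {
    private final Deque<Token> pushedBack = new ArrayDeque<>();

    // Called by the parser whenever it wants another token.
    public Token getNextToken() {
        return pushedBack.isEmpty() ? readToken() : pushedBack.pop();
    }

    // Put a token back so the next getNextToken() call returns it again.
    public void pushBack(Token t) { pushedBack.push(t); }

    // Scan the character stream for the next token.
    protected abstract Token readToken();
}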

6 Other Tokenizing Jobs Input reading and buffering Macro expansion (C's #define) File inclusion (C's #include) Stripping out comments

7 Tokens, Patterns, and Lexemes A token is a pair – token name (e.g., VARIABLE) – token value (e.g., "myCounter") A lexeme is a sequence of program characters that form a token – (e.g., "myCounter") A pattern is a description of the form that the lexemes of a token may take – e.g., character strings including A-Z, a-z, 0-9, and _
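As a concrete sketch of the pattern/lexeme/token relationship, the VARIABLE pattern from this slide can be tried out with java.util.regex (assuming, as is typical, that a variable name cannot start with a digit):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LexemeDemo {
    public static void main(String[] args) {
        // Pattern for the VARIABLE token: A-Z, a-z, 0-9, and _.
        Pattern variable = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");
        Matcher m = variable.matcher("myCounter = 0;");
        if (m.lookingAt()) {
            // The lexeme "myCounter" yields the token (VARIABLE, "myCounter").
            System.out.println("VARIABLE: " + m.group());
        }
    }
}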

8 A History Lesson Usually tokens are easy to recognize even without any context, but not always A tough example from Fortran (fixed format, where blanks are insignificant): DO 5 I = 1.25 is an assignment to the variable DO5I, while DO 5 I = 1,25 begins a DO loop – the scanner cannot tell which until it reaches the . or the ,

9 Lexical Errors Sometimes the current prefix of the input stream does not match any pattern – This is an error and should be logged The lexical analyzer may try to continue by – deleting characters until the input matches a pattern – deleting the first input character – adding an input character – replacing the first input character – transposing the first two input characters

10 Exercise Circle the lexemes in the following programs public static void main(String args[]) { System.out.println("Hello World!"); } float max(float a, float b) { return a > b ? a : b; }

11 Input Buffering Lexemes can be long and the pushBack function requires a mechanism for pushing them back One possible mechanism (suggested in the textbook) is a double buffer When we run off the end of one buffer we load the next buffer (diagram: two buffer halves holding "return (23); \n }\n" and " public static void ", with start and current pointers marking the lexeme in progress)
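A minimal sketch of such a double buffer (illustrative; a real scanner must also handle lexemes that straddle the boundary and support backing up across it):

import java.io.IOException;
import java.io.Reader;

class DoubleBuffer {
    private static final int SIZE = 4096;
    private final char[][] buffers = new char[2][SIZE];
    private final int[] loaded = new int[2]; // chars actually read into each half
    private final Reader in;
    private int buf = 0; // which half we are currently reading
    private int pos = 0; // position within that half

    DoubleBuffer(Reader in) throws IOException {
        this.in = in;
        loaded[0] = in.read(buffers[0]);
    }

    // Return the next character, or -1 at end of input.
    int next() throws IOException {
        if (pos >= loaded[buf]) {
            buf = 1 - buf; // ran off the end of this half: switch halves...
            pos = 0;
            loaded[buf] = in.read(buffers[buf]); // ...and load the next buffer
            if (loaded[buf] <= 0) return -1;
        }
        return buffers[buf][pos++];
    }

    // Pushing back within the current half is just moving the cursor.
    void pushBack() { if (pos > 0) pos--; }
}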

12 Tokenizing (so far) What a tokenizer does – reads character input and turns it into tokens What a token is – a token name and a value (usually the lexeme) How to read input – use a double buffer if some lookahead is necessary How does the tokenizer recognize tokens? How do we specify patterns?

13 Where to Next? We need a formal mechanism for defining the patterns that define tokens This mechanism is formal language theory Using formal language theory we can make tokenizers without writing any actual code

14 Strings and Languages An alphabet Σ is a set of symbols A string S over an alphabet Σ is a finite sequence of symbols in Σ The empty string, denoted ε, is a string of length 0 A language L over Σ is a countable set of strings over Σ

15 Examples of Languages The empty language L = ∅ The language L = { ε } containing only the empty string The set L of all syntactically correct C programs The set L of all valid variable names in Java The set L of all grammatically correct English sentences

16 String Concatenation If x and y are strings then the concatenation of x and y, denoted xy, is the string formed by appending y to x Example – x = "dog" – y = "house" – xy = "doghouse" If we treat concatenation as a "product" then we get exponentiation: – x^2 = "dogdog" – x^3 = "dogdogdog"

17 Operations on Languages We can form complex languages from simple ones using various operations Union: L ∪ M (also denoted L | M) – L ∪ M = { s : s ∈ L or s ∈ M } Concatenation – LM = { st : s ∈ L and t ∈ M } Kleene Closure L* – L* = L^0 ∪ L^1 ∪ L^2 ∪... Positive Closure L+ – L+ = L^1 ∪ L^2 ∪ L^3 ∪...
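These operations are easy to make concrete for small finite languages (a sketch; since L* is infinite, only its first few terms are built):

import java.util.LinkedHashSet;
import java.util.Set;

public class LangOps {
    // Concatenation: LM = { st : s ∈ L and t ∈ M }.
    static Set<String> concat(Set<String> l, Set<String> m) {
        Set<String> out = new LinkedHashSet<>();
        for (String s : l) for (String t : m) out.add(s + t);
        return out;
    }

    public static void main(String[] args) {
        Set<String> l = Set.of("a", "b");
        Set<String> d = Set.of("0", "1");

        Set<String> union = new LinkedHashSet<>(l);
        union.addAll(d); // L ∪ D = {a, b, 0, 1}
        System.out.println(union);

        System.out.println(concat(l, d)); // LD = {a0, a1, b0, b1}

        // L^0 ∪ L^1 ∪ L^2, the first terms of L* ("" is ε):
        Set<String> star = new LinkedHashSet<>(Set.of(""));
        star.addAll(l);
        star.addAll(concat(l, l)); // {ε, a, b, aa, ab, ba, bb}
        System.out.println(star);
    }
}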

18 Some Examples L = { A,B,C,...,Z,a,b,c,...,z } D = { 0,1,2,3,4,5,6,7,8,9 } L ∪ D LD L^4 L* L(L ∪ D)* D+

19 Regular Expressions Regular expressions provide a notation for defining languages A regular expression r denotes a language L(r) over a finite alphabet Σ Basics: – ε is a RE and L(ε) = { ε } – For each symbol a in Σ, a is a RE and L(a) = { a }

20 Regular Expression Operators Suppose r and s are regular expressions Union (choice) – (r)|(s) denotes L(r) ∪ L(s) Concatenation – (r)(s) denotes L(r)L(s) Kleene Closure – r* denotes (L(r))* Parenthesization – (r) denotes L(r) – Used to enforce a specific order of operations

21 Order of Operations in REs To avoid too many parentheses, we adopt the following conventions – The * operator has the highest level of precedence and is left associative – Concatenation has second highest precedence and is left associative – The | operator has lowest precedence and is left associative

22 Binary Examples For the alphabet Σ = { a,b } – a|b denotes the language { a, b } – (a|b)(a|b) denotes the language { aa, ab, ba, bb } – a* denotes { ε, a, aa, aaa, aaaa,... } – (a|b)* denotes all possible strings over Σ – a|a*b denotes the language { a, b, ab, aab, aaab,... }
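These examples can be checked directly with java.util.regex, where |, *, and parentheses have the same meaning (a sketch; Pattern.matches tests whether the whole string belongs to the language):

import java.util.regex.Pattern;

public class BinaryREs {
    public static void main(String[] args) {
        System.out.println(Pattern.matches("a|b", "a"));          // true
        System.out.println(Pattern.matches("(a|b)(a|b)", "ba"));  // true
        System.out.println(Pattern.matches("a*", ""));            // true: ε ∈ L(a*)
        System.out.println(Pattern.matches("(a|b)*", "abba"));    // true
        System.out.println(Pattern.matches("a|a*b", "aab"));      // true
        System.out.println(Pattern.matches("a|a*b", "ba"));       // false
    }
}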

23 Regular Definitions REs can quickly become complicated Regular definitions are multiline regular expressions Each line can refer to any of the preceding lines but not to itself or to subsequent lines letter_ = A|B|...|Z|a|b|...|z|_ digit = 0|1|2|3|4|5|6|7|8|9 id = letter_(letter_|digit)*

24 Regular Definition Example Floating point number example – Accepts 42, 42.25, 42.25E+23, 42E+23, 42E23,... digit = 0|1|2|3|4|5|6|7|8|9 digits = digit digit* optionalFraction = . digits | ε optionalExponent = (E (+|-|ε) digits) | ε number = digits optionalFraction optionalExponent

25 Exercises Write regular definitions for – All strings of lowercase letters that contain the five vowels in order – All strings of lowercase letters in which the letters are in ascending lexicographic order – Comments, consisting of a string surrounded by /* and */ without any intervening */

26 Extensions of Regular Expressions There are also several time-saving extensions of REs One or more instances – r+ = rr* Zero or one instance – r? = r|ε Character classes – [abcdef] = (a|b|c|d|e|f) – [A-Za-z] = (A|B|C|...|Y|Z|a|b|c|...|y|z) Others – See page 127 of the text for more common RE shorthands

27 Some Examples digit = [0-9] digits = digit+ number = digits (. digits)? (E[+-]? digits)? letter_ = [A-Za-z_] digit = [0-9] variable = letter_ (letter_|digit)*
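These shorthands carry over directly to java.util.regex, so the number definition can be tested as follows (a sketch; note that the dot must be escaped in a regex string):

import java.util.regex.Pattern;

public class NumberRE {
    public static void main(String[] args) {
        // number = digits (. digits)? (E[+-]? digits)?
        Pattern number = Pattern.compile("[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?");
        for (String s : new String[] { "42", "42.25", "42.25E+23", "42E23", "42." }) {
            System.out.println(s + " -> " + number.matcher(s).matches());
        }
        // Prints true for the first four strings and false for "42."
    }
}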

28 Recognizing Tokens We now have a notation for patterns that define tokens We want to make these into a tokenizer For this, we use the formalism of finite state machines

29 An FSM for Relational Operators (diagram: an FSM that recognizes the relational operators <, <=, <>, >, >=, =, and ==, with accepting states LT, LE, NE, GT, GE, and EQ)

30 FSM for Variable Names letter_ = [A-Za-z_] digit = [0-9] variable = letter_ (letter_|digit)* (diagram: state 0 goes to state 1 on letter_, and state 1 loops to itself on letter_ or digit)
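A switch-based encoding of this two-state machine might look as follows (a sketch in the style of the NumReader.java example on a later slide; class and method names here are illustrative):

public class VariableFSM {
    static boolean isLetter_(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
    }

    static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

    // Returns true if s is accepted; state 1 is the accepting state.
    static boolean accepts(String s) {
        int state = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (state) {
                case 0: // start state: must see a letter or _
                    if (isLetter_(c)) state = 1; else return false;
                    break;
                case 1: // loop on letter_ or digit
                    if (isLetter_(c) || isDigit(c)) state = 1; else return false;
                    break;
            }
        }
        return state == 1;
    }

    public static void main(String[] args) {
        System.out.println(accepts("myCounter")); // true
        System.out.println(accepts("2fast"));     // false
    }
}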

31 FSM for Numbers Build the FSM for the following: digit = [0-9] digits = digit+ number = digits (. digits)? ((E|e) digits)?

32 NumReader.java Look at NumReader.java example – Implements a token recognizer using a switch statement

33 The Story So Far We can write token types as regular expressions We want to convert these REs into (deterministic) finite automata (DFAs) From the DFA we can generate code – A single while loop containing a large switch statement Each state in S becomes a case – A table mapping S×Σ→S (current state, next symbol)→(new state) – A hash table mapping S×Σ→S – Elements of Σ may be grouped into character classes
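For example, the variable-name DFA from slide 30 can be driven by a table over character classes (a sketch; the names and layout are illustrative):

public class TableDFA {
    static final int LETTER_ = 0, DIGIT = 1, OTHER = 2; // character classes
    static final int ERR = -1;

    // TABLE[state][charClass] -> new state
    static final int[][] TABLE = {
        /* state 0 */ { 1, ERR, ERR },
        /* state 1 */ { 1, 1, ERR },
    };

    static int charClass(char c) {
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_') return LETTER_;
        if (c >= '0' && c <= '9') return DIGIT;
        return OTHER;
    }

    static boolean accepts(String s) {
        int state = 0;
        for (int i = 0; i < s.length() && state != ERR; i++) {
            state = TABLE[state][charClass(s.charAt(i))];
        }
        return state == 1; // state 1 is the accepting state
    }
}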

34 NumReader2.java Look at NumReader2.java example – Implements a tokenizer using a hashtable

35 Automatic Tokenizer Generators Generating FSMs by hand from regular expressions is tedious and error-prone Ditto for generating code from FSMs Luckily, it can be done automatically: regular expressions → lex → NFA → NFA2DFA → tokenizer

36 Non-Deterministic Finite Automata An NFA is a finite state machine whose edges are labelled with subsets of Σ Some edges may be labelled with ε The same label may appear on two or more outgoing edges at a vertex An NFA accepts a string s if s defines any path to any of its accepting states

37 NFA Example NFA that accepts apple or ape (diagram: from the start state, two ε-connected branches spell out a-p-p-l-e and a-p-e)

38 NFA Example NFA that accepts any binary string whose fourth-last symbol is 1 (diagram: edges labelled 0 and 1)

39 From Regular Expression to NFA Going from a RE to an NFA with one accepting state is easy (diagrams: for ε, a start state with a single ε-edge to an accepting state; for a symbol a, a start state with a single a-edge to an accepting state)

40 Union r|s (diagram: a new start state with ε-edges into the FSMs for r and s, and ε-edges from each of their accepting states into a new accepting state)

41 Concatenation rs (diagram: the FSM for r followed by the FSM for s, with the accepting state of r joined to the start state of s)

42 Kleene Closure r* (diagram: new start and accepting states joined by ε-edges that allow skipping the FSM for r entirely or looping back through it)
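Taken together, the last three slides are Thompson's construction. A compact sketch (illustrative names; each fragment has exactly one start and one accepting state, and a null edge label stands for ε):

import java.util.ArrayList;
import java.util.List;

class NfaState {
    List<Character> labels = new ArrayList<>(); // null = ε-edge
    List<NfaState> targets = new ArrayList<>();
    void edge(Character label, NfaState to) { labels.add(label); targets.add(to); }
}

class Fragment {
    final NfaState start, accept;
    Fragment(NfaState start, NfaState accept) { this.start = start; this.accept = accept; }

    // Base case: a single symbol a.
    static Fragment symbol(char a) {
        NfaState s = new NfaState(), t = new NfaState();
        s.edge(a, t);
        return new Fragment(s, t);
    }

    // Union r|s: ε-edges into and out of both fragments.
    static Fragment union(Fragment r, Fragment s) {
        NfaState start = new NfaState(), accept = new NfaState();
        start.edge(null, r.start);
        start.edge(null, s.start);
        r.accept.edge(null, accept);
        s.accept.edge(null, accept);
        return new Fragment(start, accept);
    }

    // Concatenation rs: join r's accepting state to s's start state.
    static Fragment concat(Fragment r, Fragment s) {
        r.accept.edge(null, s.start);
        return new Fragment(r.start, s.accept);
    }

    // Kleene closure r*: ε-edges allow skipping r or repeating it.
    static Fragment star(Fragment r) {
        NfaState start = new NfaState(), accept = new NfaState();
        start.edge(null, r.start);
        start.edge(null, accept);
        r.accept.edge(null, r.start);
        r.accept.edge(null, accept);
        return new Fragment(start, accept);
    }
}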

43 NFA to DFA So far – We can express token patterns as REs – We can convert REs to NFAs NFAs are hard to use – Given an NFA F and a string s, it is expensive to test whether F accepts s Instead, we first convert the NFA into a deterministic finite automaton – No ε-transitions – No repeated labels on outgoing edges

44 Converting an NFA into a DFA Converting an NFA into a DFA is easy but sometimes expensive Suppose the NFA has n states 1,...,n Each state of the DFA is labelled with one of the 2^n subsets of {1,...,n} The DFA will be in a state whose label contains i if the NFA could be in state i Any DFA state that contains an accepting state of the NFA is also an accepting state

45 NFA 2 DFA – Sketch of Algorithm Step 1 - Remove duplicate edge labels by using ε-transitions (diagram: a state with two outgoing a-edges is rewritten so that one a-edge is reached through a new ε-edge)

46 NFA 2 DFA Step 2: Starting at state 0, start expanding states – State i expands into every state reachable from i using only ε-transitions – Create new states, as necessary, for the neighbours of already-expanded states – Use a lookup table to make sure that each possible state (subset of {1,...,n}) is created only once
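A sketch of this step in Java (illustrative; the NFA is given as a transition map, EPS is a sentinel character standing for ε, and state 0 is assumed to be the NFA's start state):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SubsetConstruction {
    static final char EPS = 'ε'; // sentinel label for ε-transitions

    // nfa.get(state).get(symbol) = set of target states
    static Set<Integer> epsilonClosure(Map<Integer, Map<Character, Set<Integer>>> nfa,
                                       Set<Integer> states) {
        Set<Integer> closure = new HashSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int t : nfa.getOrDefault(s, Map.of()).getOrDefault(EPS, Set.of()))
                if (closure.add(t)) work.push(t); // newly reachable via ε only
        }
        return closure;
    }

    static Set<Integer> move(Map<Integer, Map<Character, Set<Integer>>> nfa,
                             Set<Integer> states, char a) {
        Set<Integer> out = new HashSet<>();
        for (int s : states)
            out.addAll(nfa.getOrDefault(s, Map.of()).getOrDefault(a, Set.of()));
        return out;
    }

    // Each DFA state is the subset of NFA states the NFA could be in;
    // the dfa map doubles as the lookup table so each subset is expanded once.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> toDfa(
            Map<Integer, Map<Character, Set<Integer>>> nfa, Set<Character> alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        work.push(epsilonClosure(nfa, Set.of(0)));
        while (!work.isEmpty()) {
            Set<Integer> d = work.pop();
            if (dfa.containsKey(d)) continue; // already expanded
            Map<Character, Set<Integer>> row = new HashMap<>();
            for (char a : alphabet) {
                Set<Integer> next = epsilonClosure(nfa, move(nfa, d, a));
                if (!next.isEmpty()) { row.put(a, next); work.push(next); }
            }
            dfa.put(d, row);
        }
        return dfa;
    }
}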

47 Example Convert this NFA into a DFA (diagram: the apple/ape NFA from slide 37, with its p, l, e, and ε edges)

48 Example Convert this NFA into a DFA

49 From REs to a Tokenizer We can convert from RE to NFA to DFA DFAs are easy to implement – Using a switch statement or a (hash)table For each token type we write a RE The lexical analysis generator then creates an NFA (or DFA) for each token type and combines them into one big NFA

50 From REs to a Tokenizer One giant NFA captures all token types (diagram: a new start state with ε-edges into the NFAs for tokens A, B, and C) Convert this to a DFA – If any state of the DFA contains an accepting state for more than one token then something is wrong with the language specification

51 Summary The Tokenizer converts the input character stream into a token stream Tokens can be specified using REs A software tool can be used to convert the list of REs into a tokenizer – Convert each RE to an NFA – Combine all NFAs into one big NFA – Convert this NFA into a DFA and the code that implements this DFA

52 Other Notes REs, NFAs, and DFAs are equivalent in terms of the languages they can define Converting from NFA to DFA can be expensive – An n-state NFA can result in a 2^n-state DFA None of these are powerful enough to parse programming languages but are usually good enough for tokens – Example: the language { a^n b^n : n = 1,2,3,... } is not recognizable by a DFA (why?)