CS-338 Compiler Design
Dr. Syed Noman Hasany
Assistant Professor
College of Computer, Qassim University

Chapter 3: Lexical Analysis

The Role of the Lexical Analyzer:
- It is the first phase of the compiler.
- It reads the input characters and produces as output a sequence of tokens that the parser uses for syntax analysis.
- It strips comments and white space (blanks, tabs, and newline characters) out of the source program.
- It also correlates error messages from the compiler with the source program, because it keeps track of line numbers.

Interaction of the Lexical Analyzer with the Parser

[Figure: the lexical analyzer sits between the source program and the parser. The parser requests "get next token"; the lexical analyzer returns the pair (token, tokenval). Both phases consult the symbol table and report errors.]

The Reason Why Lexical Analysis Is a Separate Phase
- Simplifies the design of the compiler: without a separate scanner, LL(1) or LR(1) parsing with one token of lookahead would not be possible, since the parser would have to match multiple characters or tokens at once.
- Provides an efficient implementation: systematic techniques to implement lexical analyzers by hand or automatically from specifications, and stream-buffering methods to scan the input.
- Improves portability: non-standard symbols and alternate character encodings can be normalized (e.g. trigraphs).

Attributes of Tokens

[Figure: the lexical analyzer converts a source statement such as y := 31 + 28*x into a stream of (token, tokenval) pairs for the parser; tokenval is the token attribute, e.g. the numeric value of a num or the symbol-table entry of an id.]

Tokens, Patterns, and Lexemes
- A token is a classification of lexical units. For example: id and num.
- Lexemes are the specific character strings that make up a token. For example: abc and 123.
- Patterns are rules describing the set of lexemes belonging to a token. For example: "letter followed by letters and digits" and "non-empty sequence of digits".

Tokens, Patterns, and Lexemes
A lexeme is a sequence of characters from the source program that is matched by the pattern for a token.

Tokens, Patterns, and Lexemes

Token     Sample Lexemes         Informal Description of Pattern
const     const                  const
if        if                     if
relation  <, <=, =, <>, >, >=    < or <= or = or <> or > or >=
id        pi, count, D2          letter followed by letters and digits
num       3.1416, 0, 6.02E23     any numeric constant
literal   "core dumped"          any characters between " and " except "

The token classifies the pattern; the actual values (the lexemes) are critical. This information is:
1. Stored in the symbol table
2. Returned to the parser
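To make the (token, attribute) pairing concrete, here is a minimal C sketch of one way a hand-written scanner might represent tokens. The enum members and the union layout are illustrative assumptions, not definitions from these slides.

/* Hypothetical token representation for a hand-written scanner. */
enum token_name { TOK_CONST, TOK_IF, TOK_RELATION, TOK_ID, TOK_NUM, TOK_LITERAL };

struct token {
    enum token_name name;      /* the token, e.g. TOK_ID                  */
    union {
        int    sym_index;      /* TOK_ID: index of the symbol-table entry */
        double num_value;      /* TOK_NUM: numeric value of the lexeme    */
        int    relop_kind;     /* TOK_RELATION: LT, LE, EQ, NE, GT, GE    */
    } tokenval;                /* the attribute value                     */
};

The parser sees only the token name; the attribute carries the lexeme-specific information listed in the table above.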

3.2 Input Buffering
Examining ways of speeding up the reading of the source program:
- With the one-buffer technique, the last lexeme under processing will be overwritten when we reload the buffer.
- The two-buffer scheme handles large lookahead safely.

3.2.1 Buffer Pairs
Two buffers of the same size, say 4096 characters, are alternately reloaded. Two pointers into the input are maintained:
- Pointer lexeme_begin marks the beginning of the current lexeme.
- Pointer forward scans ahead until a pattern match is found.

Code to advance forward:

if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;

3.2.2 Sentinels

[Figure: the buffer pair holding the input E = M * C * * 2, with an eof sentinel at the end of each buffer half and an eof marking the end of the input itself.]

Code to advance forward with sentinels:

forward := forward + 1;
if forward^ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer means end of input */
        terminate lexical analysis
end
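The same scheme in C, as a minimal sketch. The buffer size, the helper names (fill_half, next_char), and the use of '\0' as the sentinel are assumptions for illustration; a real scanner must also cope with NUL bytes in the source, which this sketch ignores.

#include <stdio.h>

#define BUF_SIZE 4096

static char  buf[2 * (BUF_SIZE + 1)];  /* two halves, one sentinel slot each */
static char *forward;                  /* scanning pointer                   */
static FILE *src;

/* Reload one half and drop the sentinel after the last byte read. */
static void fill_half(char *start) {
    size_t n = fread(start, 1, BUF_SIZE, src);
    start[n] = '\0';             /* sentinel: half boundary or true end */
}

static void init_buffers(FILE *f) {
    src = f;
    fill_half(buf);
    forward = buf;
}

/* Return the next character, reloading halves as needed; -1 on end of input. */
static int next_char(void) {
    if (*forward == '\0') {
        if (forward == buf + BUF_SIZE) {                /* end of first half  */
            fill_half(buf + BUF_SIZE + 1);
            forward = buf + BUF_SIZE + 1;
        } else if (forward == buf + 2 * BUF_SIZE + 1) { /* end of second half */
            fill_half(buf);
            forward = buf;
        } else {
            return -1;       /* sentinel inside a half: real end of input */
        }
        if (*forward == '\0') return -1;  /* input ended exactly at a boundary */
    }
    return (unsigned char)*forward++;
}

The point of the sentinel is visible in next_char(): the common path performs a single test (*forward == '\0') per character instead of two boundary comparisons.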

Specification of Patterns for Tokens: Definitions
- An alphabet Σ is a finite set of symbols (characters).
- A string s is a finite sequence of symbols from Σ.
  - |s| denotes the length of string s.
  - ε denotes the empty string; thus |ε| = 0.
- A language is a specific set of strings over some fixed alphabet Σ.

Specification of Patterns for Tokens: String Operations
- The concatenation of two strings x and y is denoted by xy.
- The exponentiation of a string s is defined by
  s^0 = ε (the empty string: a string of length zero)
  s^i = s^(i-1) s for i > 0
- Note that sε = εs = s.

Specification of Patterns for Tokens: Language Operations
- Union: L ∪ M = { s | s ∈ L or s ∈ M }
- Concatenation: LM = { xy | x ∈ L and y ∈ M }
- Exponentiation: L^0 = { ε }; L^i = L^(i-1) L
- Kleene closure: L* = union of L^i for i = 0, …, ∞
- Positive closure: L+ = union of L^i for i = 1, …, ∞

Language Operations: Examples
Let L = { A, B, C, D } and D = { 1, 2, 3 }.
- L ∪ D = { A, B, C, D, 1, 2, 3 }
- LD = { A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3 }
- L^2 = { AA, AB, AC, AD, BA, BB, BC, BD, CA, …, DD }
- L^4 = L^2 L^2 = ??
- L* = { all possible strings over L, plus ε }
- L+ = L* - { ε }
- L(L ∪ D) = ??
- L(L ∪ D)* = ??

Specification of Patterns for Tokens: Regular Expressions
Basis symbols:
- ε is a regular expression denoting the language { ε }
- a ∈ Σ is a regular expression denoting { a }
If r and s are regular expressions denoting the languages L(r) and L(s) respectively, then:
- r | s is a regular expression denoting L(r) ∪ L(s)
- rs is a regular expression denoting L(r)L(s)
- r* is a regular expression denoting L(r)*
- (r) is a regular expression denoting L(r)
A language defined by a regular expression is called a regular set.

Examples:
- a | b
- (a | b)a*
- (a | b)*
- a | a*b
We assume that '*' has the highest precedence and is left associative, concatenation has the second-highest precedence and is left associative, and '|' has the lowest precedence and is left associative. Thus (a) | ((b)*(c)) = a | b*c.

Algebraic Properties of Regular Expressions

Axiom                        Description
r | s = s | r                | is commutative
r | (s | t) = (r | s) | t    | is associative
(r s) t = r (s t)            concatenation is associative
r (s | t) = r s | r t        concatenation distributes over |
(s | t) r = s r | t r        concatenation distributes over |
εr = r,  rε = r              ε is the identity element for concatenation
r* = (r | ε)*                relation between * and ε
r** = r*                     * is idempotent

Finite Automaton
Given an input string, we need a "machine" that has a regular expression hard-coded in it and can tell whether the input string matches the pattern described by that regular expression. A machine that determines whether a given string belongs to a language is called a finite automaton.

Deterministic Finite Automaton
Definition: a deterministic finite automaton (DFA) is a five-tuple (Σ, S, δ, s0, F) where:
- Σ is the alphabet
- S is the set of states
- δ is the transition function (δ: S × Σ → S)
- s0 is the starting state
- F is the set of final states (F ⊆ S)
Notation: we use a transition diagram to describe a DFA. States are nodes; transitions are directed, labeled edges; some states are marked as final; one state is marked as starting. If the automaton stops at a final state on end of input, then the input string belongs to the language.
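This definition maps directly onto a table-driven simulator. Below is a minimal C sketch; the array encoding of δ and the helper names are illustrative assumptions. It is wired with the transition table of example ⑥, (a|b)(a|b)b, from the examples that follow, with a dead state 0 absorbing undefined transitions.

#include <stdio.h>

#define NSTATES 5  /* states 0..4; 0 is the dead (error) state */
#define NSYMS   2  /* alphabet: index 0 = 'a', index 1 = 'b'   */

/* delta for (a|b)(a|b)b: s0 = 1, F = {4} (example 6 below). */
static const int delta[NSTATES][NSYMS] = {
    {0, 0},  /* dead state               */
    {2, 2},  /* state 1                  */
    {3, 3},  /* state 2                  */
    {0, 4},  /* state 3: only b leads on */
    {0, 0},  /* state 4: final           */
};

/* Run the DFA over s; return 1 iff it stops in a final state. */
static int accepts(const char *s) {
    int state = 1;                               /* s0 = 1 */
    for (; *s; s++) {
        int sym = (*s == 'a') ? 0 : (*s == 'b') ? 1 : -1;
        if (sym < 0) return 0;                   /* symbol not in the alphabet */
        state = delta[state][sym];
    }
    return state == 4;                           /* F = {4} */
}

int main(void) {
    printf("%d %d %d\n", accepts("abb"), accepts("bab"), accepts("ab"));
    return 0;  /* prints: 1 1 0 */
}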

① a
Σ = {a}, L = {a}
S = {1, 2}, δ(1, a) = 2, s0 = 1, F = {2}

② a|b
Σ = {a, b}, L = {a, b}
S = {1, 2}, δ(1, a) = 2, δ(1, b) = 2, s0 = 1, F = {2}

③ a(a|b)
Σ = {a, b}, L = {aa, ab}
S = {1, 2, 3}, δ(1, a) = 2, δ(2, a) = 3, δ(2, b) = 3, s0 = 1, F = {3}

④ a*
Σ = {a}, L = {ε, a, aa, aaa, aaaa, …}
S = {1}, δ(1, a) = 1, s0 = 1, F = {1}

⑤ a⁺
Σ = {a}, L = {a, aa, aaa, aaaa, …}
S = {1, 2}, δ(1, a) = 2, δ(2, a) = 2, s0 = 1, F = {2}
Note: a⁺ = aa*

⑥ (a|b)(a|b)b
Σ = {a, b}, L = {aab, abb, bab, bbb}
S = {1, 2, 3, 4}, δ(1, a) = 2, δ(1, b) = 2, δ(2, a) = 3, δ(2, b) = 3, δ(3, b) = 4, s0 = 1, F = {4}

⑦ (a|b)*
Σ = {a, b}, L = {ε, a, b, aa, ab, ba, bb, aaa, …} (all strings over {a, b})
S = {1}, δ(1, a) = 1, δ(1, b) = 1, s0 = 1, F = {1}

⑧ (a|b)⁺
Σ = {a, b}, L = {a, b, aa, ab, ba, bb, …} (all non-empty strings over {a, b})
S = {1, 2}, δ(1, a) = 2, δ(1, b) = 2, δ(2, a) = 2, δ(2, b) = 2, s0 = 1, F = {2}
Note: (a|b)⁺ = (a|b)(a|b)*

⑨ a⁺|b⁺
Σ = {a, b}, L = {a, aa, aaa, …, b, bb, bbb, …}
S = {1, 2, 3}, δ(1, a) = 2, δ(2, a) = 2, δ(1, b) = 3, δ(3, b) = 3, s0 = 1, F = {2, 3}

⑩ a(a|b)*
Σ = {a, b}, L = {a, aa, ab, aaa, aab, aba, abb, …} (all strings over {a, b} beginning with a)
S = {1, 2}, δ(1, a) = 2, δ(2, a) = 2, δ(2, b) = 2, s0 = 1, F = {2}

⑪ a(b|a)b⁺
Σ = {a, b}, L = {aab, abb, aabb, abbb, aabbb, …}
S = {1, 2, 3, 4}, δ(1, a) = 2, δ(2, a) = 3, δ(2, b) = 3, δ(3, b) = 4, δ(4, b) = 4, s0 = 1, F = {4}

⑫ ab*a(a⁺|b⁺)
Σ = {a, b}, L = {aaa, aab, abaa, abbaa, …, abbab, abbabbb, …}
S = {1, 2, 3, 4, 5}, δ(1, a) = 2, δ(2, b) = 2, δ(2, a) = 3, δ(3, a) = 4, δ(4, a) = 4, δ(3, b) = 5, δ(5, b) = 5, s0 = 1, F = {4, 5}

Specification of Patterns for Tokens: Regular Definitions
Regular definitions introduce a naming convention:
  d1 → r1
  d2 → r2
  …
  dn → rn
where each ri is a regular expression over Σ ∪ {d1, d2, …, d(i-1)}.
Any dj in ri can be textually substituted into ri to obtain an equivalent set of definitions.

Specification of Patterns for Tokens: Regular Definitions
Example:
  letter → A | B | … | Z | a | b | … | z
  digit → 0 | 1 | … | 9
  id → letter ( letter | digit )*
Regular definitions are not recursive:
  digits → digit digits | digit    -- wrong!

Specification of Patterns for Tokens: Notational Shorthand
The following shorthands are often used:
  r+ = rr*
  r? = r | ε
  [a-z] = a | b | c | … | z
Examples:
  digit → [0-9]
  num → digit+ (. digit+)? ( E (+ | -)? digit+ )?
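As a quick sanity check of the num pattern, the sketch below tests it with the POSIX <regex.h> API (available on Unix-like systems); the test strings are arbitrary examples, not from the slides:

#include <regex.h>
#include <stdio.h>

int main(void) {
    /* POSIX extended regex for: digit+ (. digit+)? ( E (+|-)? digit+ )? */
    const char *pat = "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$";
    regex_t re;
    if (regcomp(&re, pat, REG_EXTENDED) != 0) return 1;

    const char *tests[] = { "0", "3.1416", "6.02E23", "1.", "E23" };
    for (int i = 0; i < 5; i++)
        printf("%-8s %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "num" : "no match");

    regfree(&re);
    return 0;  /* "1." and "E23" do not match */
}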

Regular Definitions and Grammars

Grammar:
  stmt → if expr then stmt
       | if expr then stmt else stmt
       | ε
  expr → term relop term
       | term
  term → id
       | num

Regular definitions:
  if → if
  then → then
  else → else
  relop → < | <= | <> | > | >= | =
  id → letter ( letter | digit )*
  num → digit+ (. digit+)? ( E (+ | -)? digit+ )?

Constructing Transition Diagrams for Tokens
- Transition diagrams (TDs) are used to represent the tokens; they are automata!
- As characters are read, the relevant TDs are used to attempt to match the lexeme to a pattern.
- Each TD has:
  - States, represented by circles
  - Actions, represented by arrows between states
  - A start state, the beginning of a pattern (arrowhead)
  - Final state(s), the end of a pattern (concentric circles)
- Each TD is deterministic: there is no need to choose between two different actions.

Example: All RELOPs

[Transition diagram, states 0-8: from start state 0, '<' leads to a state that on '=' returns (relop, LE), on '>' returns (relop, NE), and on any other character retracts (*) and returns (relop, LT); '=' leads directly to a state returning (relop, EQ); '>' leads to a state that on '=' returns (relop, GE) and on any other character retracts (*) and returns (relop, GT).]
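The same diagram as a self-contained C sketch. The function name, the enum, and reading from a string via a moving pointer are illustrative assumptions; "retracting" corresponds to simply not advancing past the lookahead character.

#include <stdio.h>

enum relop { LT, LE, EQ, NE, GT, GE, NOT_RELOP };

/* Recognize a relational operator at *pp, advancing past what is consumed. */
static enum relop scan_relop(const char **pp) {
    const char *p = *pp;
    enum relop r;
    switch (*p++) {
    case '<':
        if (*p == '=')      { p++; r = LE; }  /* "<="                   */
        else if (*p == '>') { p++; r = NE; }  /* "<>"                   */
        else                {      r = LT; }  /* "<": retract lookahead */
        break;
    case '=':
        r = EQ;
        break;
    case '>':
        if (*p == '=')      { p++; r = GE; }  /* ">="                   */
        else                {      r = GT; }  /* ">": retract lookahead */
        break;
    default:
        return NOT_RELOP;                     /* no relop at this input */
    }
    *pp = p;                                  /* commit consumed input  */
    return r;
}

int main(void) {
    const char *s = "<=x";
    printf("%d\n", scan_relop(&s));  /* prints 1 (LE); s now points at "x" */
    return 0;
}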

Example TDs: id and delim

Keyword or id:
[TD, states 9-11: from start state 9, a letter leads to state 10; state 10 loops on letter or digit; any other character leads to state 11 (*, retract one character), which returns (install_id(), gettoken()).]

delim:
[TD, states 28-30: from start state 28, a delim leads to state 29; state 29 loops on delim; any other character leads to state 30 (*, retract one character); no token is returned.]

Combined TD for Keywords and IDs
install_id() decides the attribute:
- It checks the accepted lexeme against the list of keywords; if it matches, zero is returned.
- Otherwise it looks the lexeme up in the symbol table; if it is found, its address is returned.
- If the lexeme is not found in the symbol table, install_id() first installs the ID in the symbol table and returns the address of the newly created entry.
gettoken() decides the token:
- If install_id() returned zero, the keyword itself (or its numeric form) is returned as the token.
- Otherwise the token ID is returned.
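A minimal C sketch of how this pair could work together. The fixed-size table, linear search, and the names here (symtab, keywords, last_entry, the token codes) are assumptions for illustration only.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define TOK_ID 256                   /* token code for identifiers (assumed) */

static const char *keywords[] = { "if", "then", "else", NULL };

static char *symtab[1024];           /* toy symbol table: lexeme strings   */
static int   nsyms = 1;              /* slot 0 reserved: 0 means "keyword" */
static int   last_entry;             /* result of the last install_id()    */

/* Return 0 for a keyword, else the symbol-table index of the lexeme. */
static int install_id(const char *lexeme) {
    for (int i = 0; keywords[i]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return last_entry = 0;
    for (int i = 1; i < nsyms; i++)
        if (strcmp(symtab[i], lexeme) == 0)
            return last_entry = i;               /* already installed      */
    symtab[nsyms] = strdup(lexeme);              /* install new identifier */
    return last_entry = nsyms++;
}

/* Return the token: a per-keyword code, or TOK_ID for identifiers. */
static int gettoken(const char *lexeme) {
    if (last_entry == 0)
        for (int i = 0; keywords[i]; i++)
            if (strcmp(lexeme, keywords[i]) == 0)
                return 1 + i;                    /* keyword's own token code */
    return TOK_ID;
}

int main(void) {
    printf("%d %d\n", install_id("count"), gettoken("count")); /* 1 256 */
    printf("%d %d\n", install_id("if"), gettoken("if"));       /* 0 1   */
    return 0;
}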

Example TDs: Unsigned #s

[Figure: three transition diagrams. The first recognizes digit+ (. digit+)? ( E (+ | -)? digit+ )?; the second (states 20-24) recognizes digit+ . digit+; the third (states 25-27) recognizes digit+. Each final state is starred (*), i.e. it retracts one character before accepting.]

Questions:
- Is ordering important for unsigned #s?
- Why are there no TDs for then, else, if?

Keyword Recognition
- All keywords / reserved words are matched as ids.
- After the match, the symbol table or a special keyword table is consulted.
- The keyword table contains string versions of all keywords and their associated token values (e.g. entries for if, begin, then).
- If a match is not found, then it is assumed that an id has been discovered.

Transition Diagrams & Lexical Analyzers

state = 0;
token nexttoken()
{
    while (1) {
        switch (state) {
        case 0:
            c = nextchar();          /* c is lookahead character */
            if (c == blank || c == tab || c == newline) {
                state = 0;
                lexeme_beginning++;  /* advance beginning of lexeme */
            }
            else if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else state = fail();
            break;
        /* ... cases 1-8 here ... */

        case 9:
            c = nextchar();
            if (isletter(c)) state = 10;
            else state = fail();
            break;
        case 10:
            c = nextchar();
            if (isletter(c)) state = 10;
            else if (isdigit(c)) state = 10;
            else state = 11;
            break;
        case 11:
            retract(1);
            install_id();
            return ( gettoken() );
        /* ... cases here ... */
        case 25:
            c = nextchar();
            if (isdigit(c)) state = 26;
            else state = fail();
            break;
        case 26:
            c = nextchar();
            if (isdigit(c)) state = 26;
            else state = 27;
            break;
        case 27:
            retract(1);
            install_num();
            return ( NUM );
        }
    }
}

Case numbers correspond to transition diagram states!

When Failures Occur:

int state = 0, start = 0;
int lexical_value;    /* to "return" the second component of the token */

int fail()
{
    forward = token_beginning;
    switch (start) {
    case 0:  start = 9;  break;
    case 9:  start = 12; break;
    case 12: start = 20; break;
    case 20: start = 25; break;
    case 25: recover();  break;
    default: /* compiler error */ break;
    }
    return start;
}

Using a Lex Generator

Lex source program (lex.l) → [Lex compiler] → lex.yy.c
lex.yy.c → [C compiler] → a.out
Input stream → [a.out] → sequence of tokens
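To make the pipeline concrete, here is a minimal sketch of a Lex specification; the rules and the token printing are illustrative inventions, not from the slides. Running lex on it produces lex.yy.c, and compiling that with a C compiler yields a scanner:

%{
/* Declarations copied verbatim into lex.yy.c. */
#include <stdio.h>
%}
delim   [ \t\n]
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{delim}+  { /* skip white space: no token returned */ }
if        { printf("IF\n"); }
{id}      { printf("ID(%s)\n", yytext); }
{number}  { printf("NUM(%s)\n", yytext); }
.         { printf("OTHER(%s)\n", yytext); }
%%
int yywrap(void) { return 1; }   /* no further input files */
int main(void)   { yylex(); return 0; }

Note that the rule for if is listed before {id}: when two rules match a lexeme of the same length, Lex picks the earlier one, which is how keywords win over identifiers.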