CS-338 Compiler Design Dr. Syed Noman Hasany Assistant Professor College of Computer, Qassim University.

Chapter 3: Lexical Analyzer

THE ROLE OF THE LEXICAL ANALYZER
• It is the first phase of the compiler.
• It reads the input characters and produces as output a sequence of tokens that the parser uses for syntax analysis.
• It strips comments and white space (blank, tab, and newline characters) from the source program.
• It also correlates error messages from the compiler with the source program, because it keeps track of line numbers.

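Interaction of the Lexical Analyzer with the Parser
[Figure: the lexical analyzer reads the source program and, on each "get next token" request from the parser, returns a (token, tokenval) pair. The diagram also shows the symbol table and an error output.]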

The Reason Why Lexical Analysis is a Separate Phase
Simplifies the design of the compiler
  – LL(1) or LR(1) parsing with 1 token lookahead would not be possible otherwise (multiple characters/tokens to match)
Provides efficient implementation
  – Systematic techniques to implement lexical analyzers by hand or automatically from specifications
  – Stream buffering methods to scan input
Improves portability
  – Non-standard symbols and alternate character encodings can be normalized (e.g. trigraphs)

Attributes of Tokens
[Figure: the lexical analyzer reads the input y := 31 + 28*x and passes each token to the parser together with its tokenval (the token attribute).]
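As a rough illustration of the token/tokenval pairing, the following C sketch scans the slide's input y := 31 + 28*x. The names Token, TokenType, next_token and the TOK_* codes are assumptions made for this example, not part of the course material; ids carry their lexeme as attribute and nums carry their numeric value.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token classes for this sketch. */
typedef enum { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_PLUS, TOK_TIMES,
               TOK_END, TOK_ERROR } TokenType;

typedef struct {
    TokenType token;   /* token class handed to the parser */
    char text[32];     /* tokenval of an id: its lexeme */
    int value;         /* tokenval of a num: its numeric value */
} Token;

/* Scan one token starting at *p and advance *p past its lexeme. */
Token next_token(const char **p)
{
    Token t = { TOK_END, "", 0 };
    const char *s;

    assert(p != NULL && *p != NULL);
    s = *p;
    while (*s == ' ' || *s == '\t' || *s == '\n')
        s++;                                   /* strip white space */

    if (isalpha((unsigned char)*s)) {          /* letter (letter | digit)* */
        size_t n = 0;
        t.token = TOK_ID;
        while (isalnum((unsigned char)*s)) {
            if (n < sizeof t.text - 1)
                t.text[n++] = *s;              /* overlong ids are truncated */
            s++;
        }
        t.text[n] = '\0';
    } else if (isdigit((unsigned char)*s)) {   /* digit+ (overflow ignored) */
        t.token = TOK_NUM;
        while (isdigit((unsigned char)*s))
            t.value = t.value * 10 + (*s++ - '0');
    } else if (s[0] == ':' && s[1] == '=') {
        t.token = TOK_ASSIGN; s += 2;
    } else if (*s == '+') {
        t.token = TOK_PLUS; s++;
    } else if (*s == '*') {
        t.token = TOK_TIMES; s++;
    } else if (*s != '\0') {
        t.token = TOK_ERROR; s++;              /* unknown character */
    }
    *p = s;
    return t;
}
```

Calling next_token repeatedly on the slide's input yields id, assign, num, plus, num, times, id, and then TOK_END; only the id and num tokens carry a meaningful attribute.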

Tokens, Patterns, and Lexemes
A token is a classification of lexical units
  – For example: id and num
Lexemes are the specific character strings that make up a token
  – For example: abc and 123
Patterns are rules describing the set of lexemes belonging to a token
  – For example: "letter followed by letters and digits" and "non-empty sequence of digits"

Tokens, Patterns, and Lexemes
A lexeme is a sequence of characters from the source program that is matched by a pattern for a token.
[Figure: a lexeme is matched by a pattern, which identifies a token.]

Tokens, Patterns, and Lexemes

Token      Sample Lexemes          Informal Description of Pattern
const      const                   const
if         if                      if
relation   <, <=, =, <>, >, >=     < or <= or = or <> or > or >=
id         pi, count, D2           letter followed by letters and digits
num        3.1416, 0, 6.02E23      any numeric constant
literal    "core dumped"           any characters between " and " except "

The token classifies the pattern. The actual lexeme values are critical; this information is:
1. Stored in the symbol table
2. Returned to the parser

3.2 Input Buffering
Examining ways of speeding up the reading of the source program
  – In the one-buffer technique, the lexeme being processed is overwritten when the buffer is reloaded.
  – The two-buffer scheme handles large lookahead safely.

3.2.1 Buffer Pairs
Two buffers of the same size, say 4096, are alternately reloaded.
Two pointers to the input are maintained:
  – Pointer lexeme_Begin marks the beginning of the current lexeme.
  – Pointer forward scans ahead until a pattern match is found.

if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;

3.2.2 Sentinels
[Figure: buffer pair with an eof sentinel at the end of each half; contents shown: E = M * eof | C * * 2 eof eof]

forward := forward + 1;
if forward^ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof inside a half marks the real end of input */
        terminate lexical analysis
end
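The sentinel scheme above can be sketched in C as follows. This is a minimal sketch under stated assumptions: the names Scanner, reload, scanner_init and next_char are invented for the example, an in-memory string stands in for the source file, the half size N is kept tiny so that reloads actually happen (the slides suggest 4096), and the NUL character plays the role of eof, which assumes NUL never occurs in the source text. Here forward indexes the next unread character, so the common case costs a single test per character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N 4             /* size of each half; tiny for demonstration only */
#define SENTINEL '\0'   /* assumption: NUL never appears in the source text */

typedef struct {
    const char *src;       /* unread input; stands in for the source file */
    char buf[2 * N + 2];   /* half 1: [0..N-1], sentinel at [N],
                              half 2: [N+1..2N], sentinel at [2N+1] */
    size_t forward;        /* index of the next character to read */
} Scanner;

/* Fill the half starting at 'start' with up to N characters. A full read
   leaves the end-of-half sentinel in place; a short read puts a sentinel
   right after the data, which marks the real end of input. */
static void reload(Scanner *s, size_t start)
{
    size_t n = strlen(s->src);
    if (n > N)
        n = N;
    memcpy(s->buf + start, s->src, n);
    s->src += n;
    s->buf[start + n] = SENTINEL;
}

void scanner_init(Scanner *s, const char *text)
{
    assert(s != NULL && text != NULL);
    s->src = text;
    s->buf[N] = SENTINEL;
    s->buf[2 * N + 1] = SENTINEL;
    reload(s, 0);
    s->forward = 0;
}

/* Return the next input character, or SENTINEL at the real end of input. */
char next_char(Scanner *s)
{
    assert(s != NULL);
    for (;;) {
        char c = s->buf[s->forward];
        if (c != SENTINEL) {                    /* common case: one test */
            s->forward++;
            return c;
        }
        if (s->forward == N) {                  /* end of first half */
            reload(s, N + 1);
            s->forward = N + 1;
        } else if (s->forward == 2 * N + 1) {   /* end of second half */
            reload(s, 0);
            s->forward = 0;
        } else {
            return SENTINEL;                    /* eof inside a half */
        }
    }
}
```

Reading "E = M * C**2" through next_char crosses a half boundary every four characters and returns the text unchanged, followed by SENTINEL. The sketch does not track lexeme_Begin, so it does not protect a lexeme that spans more than one reload.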

Specification of Patterns for Tokens: Definitions
An alphabet Σ is a finite set of symbols (characters)
A string s is a finite sequence of symbols from Σ
  – |s| denotes the length of string s
  – ε denotes the empty string, thus |ε| = 0
A language is a specific set of strings over some fixed alphabet Σ

Specification of Patterns for Tokens: String Operations
The concatenation of two strings x and y is denoted by xy
The exponentiation of a string s is defined by
  s^0 = ε (the empty string: a string of length zero)
  s^i = s^{i-1} s for i > 0
Note that sε = εs = s
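The recursive definition of string exponentiation translates directly into a loop. The helper below is a small sketch; the name str_pow and its buffer-based signature are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write s^i (s concatenated with itself i times) into out, which has room
   for cap bytes. Returns 0 on success, -1 if the result does not fit. */
int str_pow(const char *s, unsigned i, char *out, size_t cap)
{
    size_t len, k;

    assert(s != NULL && out != NULL);
    len = strlen(s);
    if (cap == 0)
        return -1;
    if (i != 0 && len > (cap - 1) / i)
        return -1;                       /* len * i + 1 bytes would not fit */

    out[0] = '\0';                       /* s^0 is the empty string */
    for (k = 0; k < i; k++)              /* s^i = s^{i-1} s */
        memcpy(out + k * len, s, len);
    out[len * i] = '\0';
    return 0;
}
```

With s = "ab", str_pow produces the empty string for i = 0 and "ababab" for i = 3.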

Specification of Patterns for Tokens: Language Operations
Union: L ∪ M = {s | s ∈ L or s ∈ M}
Concatenation: LM = {xy | x ∈ L and y ∈ M}
Exponentiation: L^0 = {ε}; L^i = L^{i-1} L
Kleene closure: L* = ∪_{i=0,…,∞} L^i
Positive closure: L+ = ∪_{i=1,…,∞} L^i

Language Operations Examples
L = {A, B, C, D}    D = {1, 2, 3}
L ∪ D = {A, B, C, D, 1, 2, 3}
LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3}
L^2 = {AA, AB, AC, AD, BA, BB, BC, BD, CA, …, DD}
L^4 = L^2 L^2 = ??
L* = {all possible strings over L, plus ε}
L+ = L* - {ε}
L (L ∪ D) = ??
L (L ∪ D)* = ??
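For finite languages, concatenation can be computed by pairing every string of the first language with every string of the second. The sketch below assumes short strings and a caller-supplied output array; lang_concat and MAX_STR are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_STR 16   /* assumed bound on string length, including the NUL */

/* Store LM = { xy | x in L and y in M } in out and return the number of
   strings written. Duplicates are not removed in this sketch, and the
   function stops early if out (cap entries) or a string buffer is full. */
size_t lang_concat(const char *const L[], size_t nl,
                   const char *const M[], size_t nm,
                   char out[][MAX_STR], size_t cap)
{
    size_t i, j, n = 0;

    assert(L != NULL && M != NULL && out != NULL);
    for (i = 0; i < nl; i++) {
        for (j = 0; j < nm; j++) {
            if (n == cap || strlen(L[i]) + strlen(M[j]) >= MAX_STR)
                return n;
            strcpy(out[n], L[i]);
            strcat(out[n], M[j]);
            n++;
        }
    }
    return n;
}
```

Applied to L = {A, B, C, D} and D = {1, 2, 3} it yields the 12 strings A1 through D3 in the order listed on the slide, and applied to L with itself it yields the 16 strings of L^2.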

Specification of Patterns for Tokens: Regular Expressions
Basis symbols:
  – ε is a regular expression denoting language {ε}
  – a ∈ Σ is a regular expression denoting {a}
If r and s are regular expressions denoting languages L(r) and M(s) respectively, then
  – r | s is a regular expression denoting L(r) ∪ M(s)
  – rs is a regular expression denoting L(r)M(s)
  – r* is a regular expression denoting L(r)*
  – (r) is a regular expression denoting L(r)
A language defined by a regular expression is called a regular set

Examples (over the symbols a and b):
  – a | b
  – (a | b)
  – a*
  – (a | b)*
  – a | a*b
We assume that * has the highest precedence and is left associative, concatenation has the second-highest precedence and is left associative, and | has the lowest precedence and is left associative. Under these conventions:
  (a) | ((b)*(c)) = a | b*c

Algebraic Properties of Regular Expressions

AXIOM                          DESCRIPTION
r | s = s | r                  | is commutative
r | (s | t) = (r | s) | t      | is associative
(r s) t = r (s t)              concatenation is associative
r (s | t) = r s | r t          concatenation distributes over |
(s | t) r = s r | t r
εr = r                         ε is the identity element for concatenation
rε = r
r* = (r | ε)*                  relation between * and ε
r** = r*                       * is idempotent

Finite Automaton
Given an input string, we need a "machine" that has a regular expression hard-coded in it and can tell whether the input string matches the pattern described by the regular expression.
A machine that determines whether a given string belongs to a language is called a finite automaton.

Deterministic Finite Automaton
Definition: Deterministic Finite Automaton
  – a five-tuple (Σ, S, δ, s_0, F) where
      Σ is the alphabet
      S is the set of states
      δ is the transition function (S × Σ → S)
      s_0 is the starting state
      F is the set of final states (F ⊆ S)
Notation:
  – Use a transition diagram to describe a DFA: states are nodes; transitions are directed, labeled edges; some states are marked as final; one state is marked as starting
If the automaton stops at a final state on end of input, then the input string belongs to the language.

① a
Σ = {a}; L = {a}; S = {1, 2}; δ(1,a) = 2; s_0 = 1; F = {2}

② a|b
Σ = {a, b}; L = {a, b}; S = {1, 2}; δ(1,a) = 2, δ(1,b) = 2; s_0 = 1; F = {2}

③ a(a|b)
Σ = {a, b}; L = {aa, ab}; S = {1, 2, 3}; δ(1,a) = 2, δ(2,a) = 3, δ(2,b) = 3; s_0 = 1; F = {3}

④ a*
Σ = {a}; L = {ε, a, aa, aaa, aaaa, …}; S = {1}; δ(1,a) = 1; s_0 = 1; F = {1}

⑤ a⁺
Σ = {a}; L = {a, aa, aaa, aaaa, …}; S = {1, 2}; δ(1,a) = 2, δ(2,a) = 2; s_0 = 1; F = {2}
Note: a⁺ = aa*

⑥ (a|b)(a|b)b
Σ = {a, b}; L = {aab, abb, bab, bbb}; S = {1, 2, 3, 4}; δ(1,a) = 2, δ(1,b) = 2, δ(2,a) = 3, δ(2,b) = 3, δ(3,b) = 4; s_0 = 1; F = {4}

⑦ (a|b)*
Σ = {a, b}; L = {ε, a, b, aa, bb, ba, ab, aaa, …, bbb, …, abab, …, baba, bbba, …}; S = {1}; δ(1,a) = 1, δ(1,b) = 1; s_0 = 1; F = {1}

⑧ (a|b)⁺
Σ = {a, b}; L = {a, b, aa, ab, ba, bb, aaa, …}; S = {1, 2}; δ(1,a) = 2, δ(1,b) = 2, δ(2,a) = 2, δ(2,b) = 2; s_0 = 1; F = {2}
Note: (a|b)⁺ = (a|b)(a|b)*

⑨ a⁺|b⁺
Σ = {a, b}; L = {a, aa, aaa, …, b, bb, bbb, …}; S = {1, 2, 3}; δ(1,a) = 2, δ(2,a) = 2, δ(1,b) = 3, δ(3,b) = 3; s_0 = 1; F = {2, 3}

⑩ a(a|b)*
Σ = {a, b}; L = {a, aa, ab, aaa, aab, aba, abb, …}; S = {1, 2}; δ(1,a) = 2, δ(2,a) = 2, δ(2,b) = 2; s_0 = 1; F = {2}

⑪ a(b|a)b⁺
Σ = {a, b}; L = {aab, abb, aabb, …, abbb, abbbb, …}; S = {1, 2, 3, 4}; δ(1,a) = 2, δ(2,a) = 3, δ(2,b) = 3, δ(3,b) = 4, δ(4,b) = 4; s_0 = 1; F = {4}

⑫ ab*a(a⁺|b⁺)
Σ = {a, b}; L = {aaa, aab, abaa, abbaa, …, abbab, abbabbb, …}; S = {1, 2, 3, 4, 5}; δ(1,a) = 2, δ(2,b) = 2, δ(2,a) = 3, δ(3,a) = 4, δ(4,a) = 4, δ(3,b) = 5, δ(5,b) = 5; s_0 = 1; F = {4, 5}

Specification of Patterns for Tokens: Regular Definitions
Regular definitions introduce a naming convention:
  d_1 → r_1
  d_2 → r_2
  …
  d_n → r_n
where each r_i is a regular expression over Σ ∪ {d_1, d_2, …, d_{i-1}}
Any d_j in r_i can be textually substituted in r_i to obtain an equivalent set of definitions

Specification of Patterns for Tokens: Regular Definitions
Example:
  letter → A | B | … | Z | a | b | … | z
  digit → 0 | 1 | … | 9
  id → letter ( letter | digit )*
Regular definitions are not recursive:
  digits → digit digits | digit    wrong!

Specification of Patterns for Tokens: Notational Shorthand
The following shorthands are often used:
  r+ = rr*
  r? = r | ε
  [a-z] = a | b | c | … | z
Examples:
  digit → [0-9]
  num → digit+ (. digit+)? ( E (+ | -)? digit+ )?
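To see how the ? and + shorthands behave, the num definition can be hand-coded as a longest-prefix matcher. This is only a sketch: the function name match_num is an assumption, and each optional group is consumed only if it matches completely, so "12." matches just "12".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return the length of the longest prefix of s that matches
   num -> digit+ (. digit+)? ( E (+ | -)? digit+ )?   or 0 if none does. */
size_t match_num(const char *s)
{
    size_t i = 0;

    assert(s != NULL);
    if (!isdigit((unsigned char)s[i]))
        return 0;
    while (isdigit((unsigned char)s[i]))        /* digit+ */
        i++;

    if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {   /* (. digit+)? */
        i++;
        while (isdigit((unsigned char)s[i]))
            i++;
    }

    if (s[i] == 'E') {                          /* ( E (+ | -)? digit+ )? */
        size_t j = i + 1;
        if (s[j] == '+' || s[j] == '-')
            j++;
        if (isdigit((unsigned char)s[j])) {
            while (isdigit((unsigned char)s[j]))
                j++;
            i = j;                              /* accept the exponent part */
        }
    }
    return i;
}
```

For instance, match_num("6.02E23") returns 7, while match_num("1E+") returns 1 because the exponent group has no digits and is therefore skipped.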

Regular Definitions and Grammars
Grammar:
  stmt → if expr then stmt
       | if expr then stmt else stmt
       | ε
  expr → term relop term
       | term
  term → id
       | num
Regular definitions:
  if → if
  then → then
  else → else
  relop → < | <= | <> | > | >= | =
  id → letter ( letter | digit )*
  num → digit+ (. digit+)? ( E (+ | -)? digit+ )?

Constructing Transition Diagrams for Tokens
Transition Diagrams (TDs) are used to represent the tokens; these are automata.
As characters are read, the relevant TDs are used to attempt to match a lexeme to a pattern.
Each TD has:
  – States: represented by circles
  – Actions: represented by arrows between states
  – Start state: beginning of a pattern (arrowhead)
  – Final state(s): end of pattern (concentric circles)
Each TD is deterministic: there is no need to choose between 2 different actions.

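Example: All RELOPs
[Transition diagram; start state 0; * marks a final state that retracts the input by one character]
  0 --<--> 1
  1 --=--> 2          return(relop, LE)
  1 -->--> 3          return(relop, NE)
  1 --other--> 4 *    return(relop, LT)
  0 --=--> 5          return(relop, EQ)
  0 -->--> 6
  6 --=--> 7          return(relop, GE)
  6 --other--> 8 *    return(relop, GT)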
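The relop diagram can also be coded directly as nested tests on the first two characters. The sketch below is one possible rendering, not the course's code: the enum RelOp and the function match_relop are invented names, and returning a lexeme length of 1 on the "other" edges plays the role of the retraction marked by *.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { LT, LE, EQ, NE, GT, GE } RelOp;   /* attribute values */

/* Return the length of the relop lexeme at the start of s (0 if there is
   none) and store its attribute in *op. */
size_t match_relop(const char *s, RelOp *op)
{
    assert(s != NULL && op != NULL);
    switch (s[0]) {
    case '<':
        if (s[1] == '=') { *op = LE; return 2; }
        if (s[1] == '>') { *op = NE; return 2; }
        *op = LT;                 /* "other": lookahead is not consumed */
        return 1;
    case '=':
        *op = EQ;
        return 1;
    case '>':
        if (s[1] == '=') { *op = GE; return 2; }
        *op = GT;                 /* "other": lookahead is not consumed */
        return 1;
    default:
        return 0;                 /* not a relop */
    }
}
```

Given "<=x" the function reports length 2 with attribute LE; given "<x" it reports length 1 with attribute LT and leaves x for the next token.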

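Example TDs: id and delim
[Transition diagrams; * marks a final state that retracts the input by one character]
Keyword or id:
  start --> 9
  9 --letter--> 10
  10 --letter or digit--> 10
  10 --other--> 11 *    return(install_id(), gettoken())
delim:
  start --> 28
  28 --delim--> 29
  29 --delim--> 29
  29 --other--> 30 *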

Combine TD for KW and IDs
install_id(): decides the attribute
  – It checks the accepted lexeme against the list of keywords; if it matches, zero is returned.
  – Otherwise it looks up the lexeme in the symbol table; if it is found, its address is returned.
  – If the lexeme is not found in the symbol table, install_id() first installs the ID in the symbol table and returns the address of the newly created entry.
gettoken(): decides the token
  – If zero was returned by install_id(), the keyword itself (or its numeric form) is returned as the token.
  – Otherwise the token "ID" is returned.

Example TDs: Unsigned #s
[Figure: three transition diagrams. States 12-19 use edges labeled digit, ., E, + | - and other; states 20-24 use edges labeled digit, . and other; states 25-27 use edges labeled digit and other. Starred final states retract the input by one character.]
Questions:
  – Is ordering important for unsigned #s?
  – Why are there no TDs for then, else, if?

Keywords Recognition
All keywords / reserved words are matched as ids.
After the match, the symbol table or a special keyword table is consulted.
The keyword table contains string versions of all keywords and their associated token values:
  if       0
  begin    0
  then     0
  ...
If a match is not found, then it is assumed that an id has been discovered.
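The keyword-table lookup can be sketched in C as below. Several details are assumptions made for this example: the token codes, the fixed-size toy symbol table, the use of a 1-based index as a stand-in for a symbol-table address, and the lexeme parameter (the slides' install_id() and gettoken() take no arguments and presumably work on the current lexeme).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes. */
enum { TOK_ID = 256, TOK_IF, TOK_THEN, TOK_ELSE, TOK_BEGIN };

static const struct { const char *lexeme; int token; } keywords[] = {
    { "if", TOK_IF }, { "then", TOK_THEN },
    { "else", TOK_ELSE }, { "begin", TOK_BEGIN },
};

#define MAX_SYMS 64
static char symtab[MAX_SYMS][32];   /* toy symbol table: lexemes only */
static int nsyms = 0;

/* Return the keyword's token code, or 0 if lexeme is not a keyword. */
static int keyword_token(const char *lexeme)
{
    size_t i;
    for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(keywords[i].lexeme, lexeme) == 0)
            return keywords[i].token;
    return 0;
}

/* Return 0 for a keyword; otherwise the 1-based symbol table index of the
   lexeme, installing it first if needed. Returns -1 if it cannot be stored. */
int install_id(const char *lexeme)
{
    int i;

    assert(lexeme != NULL);
    if (keyword_token(lexeme) != 0)
        return 0;
    for (i = 0; i < nsyms; i++)
        if (strcmp(symtab[i], lexeme) == 0)
            return i + 1;                       /* already installed */
    if (nsyms == MAX_SYMS || strlen(lexeme) >= sizeof symtab[0])
        return -1;
    strcpy(symtab[nsyms], lexeme);
    return ++nsyms;                             /* newly created entry */
}

/* Return the keyword's own token, or TOK_ID for any other lexeme. */
int gettoken(const char *lexeme)
{
    int kw;

    assert(lexeme != NULL);
    kw = keyword_token(lexeme);
    return kw != 0 ? kw : TOK_ID;
}
```

A linear search is enough for a handful of keywords; a real lexer would more likely use a hash table or a perfect hash for both tables.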

Transition Diagrams & Lexical Analyzers

state = 0;

token nexttoken()
{
    while (1) {
        switch (state) {
        case 0:
            c = nextchar();   /* c is lookahead character */
            if (c == blank || c == tab || c == newline) {
                state = 0;
                lexeme_beginning++;   /* advance beginning of lexeme */
            }
            else if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else state = fail();
            break;
        … /* cases 1-8 here */

        case 9:
            c = nextchar();
            if (isletter(c)) state = 10;
            else state = fail();
            break;
        case 10:
            c = nextchar();
            if (isletter(c)) state = 10;
            else if (isdigit(c)) state = 10;
            else state = 11;
            break;
        case 11:
            retract(1);
            install_id();
            return ( gettoken() );
        … /* cases 12-24 here */
        case 25:
            c = nextchar();
            if (isdigit(c)) state = 26;
            else state = fail();
            break;
        case 26:
            c = nextchar();
            if (isdigit(c)) state = 26;
            else state = 27;
            break;
        case 27:
            retract(1);
            install_num();
            return ( NUM );
        }
    }
}

Case numbers correspond to transition diagram states!

When Failures Occur:

int state = 0, start = 0;
int lexical_value;   /* to "return" second component of token */

int fail()
{
    forward = token_beginning;   /* rescan the lexeme with the next diagram */
    switch (start) {
    case 0:  start = 9;  break;
    case 9:  start = 12; break;
    case 12: start = 20; break;
    case 20: start = 25; break;
    case 25: recover();  break;
    default: /* compiler error */ break;
    }
    return start;
}

Using a Lex Generator
  Lex source program lex.l → Lex compiler → lex.yy.c
  lex.yy.c → C compiler → a.out
  Input stream input.c → a.out → sequence of tokens

