Published by Vivian George. Modified over 9 years ago.
1
CS-338 Compiler Design
Dr. Syed Noman Hasany, Assistant Professor, College of Computer, Qassim University
2
Chapter 3: Lexical Analyzer
The Role of the Lexical Analyzer
The lexical analyzer is the first phase of the compiler.
It reads the input characters and produces a sequence of tokens that the parser uses for syntax analysis.
It strips comments and white space (blank, tab, and newline characters) from the source program.
It also correlates compiler error messages with the source program, which it can do because it keeps track of line numbers.
3
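Interaction of the Lexical Analyzer with the Parser
[Figure: the parser sends "get next token" to the lexical analyzer; the lexical analyzer reads the source program and returns a (token, tokenval) pair; both components use the symbol table and can report errors.]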
4
The Reason Why Lexical Analysis is a Separate Phase
Simplifies the design of the compiler
– LL(1) or LR(1) parsing with one token of lookahead would not be possible otherwise, because recognizing a single token can require matching multiple characters
Provides efficient implementation
– Systematic techniques to implement lexical analyzers by hand or automatically from specifications
– Stream buffering methods to scan input
Improves portability
– Non-standard symbols and alternate character encodings can be normalized (e.g. trigraphs)
5
Attributes of Tokens
[Figure: the lexical analyzer reads the input y := 31 + 28*x and passes the parser a stream of (token, tokenval) pairs, where tokenval is the token attribute.]
6
Tokens, Patterns, and Lexemes
A token is a classification of lexical units
– For example: id and num
Lexemes are the specific character strings that make up a token
– For example: abc and 123
Patterns are rules describing the set of lexemes belonging to a token
– For example: “letter followed by letters and digits” and “non-empty sequence of digits”
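The two example patterns above can be checked mechanically. The following is a minimal sketch, not a full scanner: the names TokenKind, matches_id, matches_num and classify are hypothetical and chosen only for illustration. It takes a complete lexeme and reports which token, if any, its pattern belongs to.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token classes for this sketch. */
typedef enum { TOK_ID, TOK_NUM, TOK_NONE } TokenKind;

/* Pattern for id: a letter followed by letters and digits. */
static int matches_id(const char *s) {
    if (!isalpha((unsigned char)s[0])) return 0;
    for (size_t i = 1; s[i] != '\0'; i++)
        if (!isalnum((unsigned char)s[i])) return 0;
    return 1;
}

/* Pattern for num: a non-empty sequence of digits. */
static int matches_num(const char *s) {
    if (s[0] == '\0') return 0;
    for (size_t i = 0; s[i] != '\0'; i++)
        if (!isdigit((unsigned char)s[i])) return 0;
    return 1;
}

/* Map a lexeme to the token whose pattern it matches. */
TokenKind classify(const char *lexeme) {
    if (matches_id(lexeme)) return TOK_ID;
    if (matches_num(lexeme)) return TOK_NUM;
    return TOK_NONE;
}
```

With these definitions, classify("abc") yields TOK_ID and classify("123") yields TOK_NUM, while a string such as "2D" matches neither pattern.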
7
Tokens, Patterns, and Lexemes
A lexeme is a sequence of characters from the source program that is matched by a pattern for a token.
[Figure: lexeme, pattern, token]
8
Tokens, Patterns, and Lexemes
Token      Sample Lexemes          Informal Description of Pattern
const      const                   const
if         if                      if
relation   <, <=, =, <>, >, >=     < or <= or = or <> or >= or >
id         pi, count, D2           letter followed by letters and digits
num        3.1416, 0, 6.02E23      any numeric constant
literal    "core dumped"           any characters between " and " except "
The token classifies the pattern.
The actual lexeme values are critical. This information is:
1. Stored in the symbol table
2. Returned to the parser
9
3.2 Input Buffering
This section examines ways of speeding up the reading of the source program.
– With a single buffer, the lexeme currently being processed can be overwritten when the buffer is reloaded.
– A two-buffer scheme handles large lookahead safely.
10
3.2.1 Buffer Pairs
Two buffers of the same size, say 4096, are alternately reloaded.
Two pointers to the input are maintained:
– Pointer lexeme_Begin marks the beginning of the current lexeme.
– Pointer forward scans ahead until a pattern match is found.
11
if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;
12
3.2.2 Sentinels
[Figure: a buffer pair with a sentinel eof at the end of each half: E = M * eof | C * * 2 eof ... eof]
13
forward := forward + 1;
if forward^ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof inside a half marks the end of the input */
        terminate lexical analysis
end
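The sentinel scheme above can be sketched in C. This is an illustration under stated assumptions, not the slides' implementation: the input comes from an in-memory string instead of a file, the NUL character stands in for eof (so the input must not contain NUL), HALF is kept tiny so that reloads happen often, and the names Scanner, reload, scanner_init and next_char are hypothetical. Unlike the pseudocode, forward here points at the next unread character, which avoids starting one position before the buffer.

```c
#include <stddef.h>
#include <string.h>

#define HALF 8           /* size of each half; the slides suggest e.g. 4096 */
#define SENTINEL '\0'    /* stands in for eof; assumes no NUL in the input */

typedef struct {
    char buf[2 * (HALF + 1)];  /* two halves, each followed by a sentinel slot */
    const char *src;           /* unread input (in-memory stand-in for a file) */
    char *forward;             /* next character to be read */
} Scanner;

/* Copy up to HALF characters into one half and terminate it with the sentinel. */
static void reload(Scanner *s, char *half) {
    size_t n = strlen(s->src);
    if (n > HALF) n = HALF;
    memcpy(half, s->src, n);
    half[n] = SENTINEL;
    s->src += n;
}

void scanner_init(Scanner *s, const char *input) {
    s->src = input;
    reload(s, s->buf);
    s->forward = s->buf;
}

/* Return the next character, or -1 at end of input.
   The common case costs a single comparison against the sentinel. */
int next_char(Scanner *s) {
    for (;;) {
        char c = *s->forward;
        if (c != SENTINEL) {
            s->forward++;
            return (unsigned char)c;
        }
        if (s->forward == s->buf + HALF) {                /* end of first half */
            reload(s, s->buf + HALF + 1);
            s->forward = s->buf + HALF + 1;
        } else if (s->forward == s->buf + 2 * HALF + 1) { /* end of second half */
            reload(s, s->buf);
            s->forward = s->buf;
        } else {
            return -1;  /* sentinel inside a half: real end of input */
        }
    }
}
```

The sketch leaves out the lexeme_Begin pointer, so it does not protect a lexeme that is still being scanned from being overwritten by a reload.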
14
Specification of Patterns for Tokens: Definitions
An alphabet Σ is a finite set of symbols (characters)
A string s is a finite sequence of symbols from Σ
– |s| denotes the length of string s
– ε denotes the empty string, thus |ε| = 0
A language is a specific set of strings over some fixed alphabet Σ
15
Specification of Patterns for Tokens: String Operations
The concatenation of two strings x and y is denoted by xy
The exponentiation of a string s is defined by
s^0 = ε (the empty string: a string of length zero)
s^i = s^(i-1) s for i > 0
Note that sε = εs = s
16
Specification of Patterns for Tokens: Language Operations
Union: L ∪ M = {s | s ∈ L or s ∈ M}
Concatenation: LM = {xy | x ∈ L and y ∈ M}
Exponentiation: L^0 = {ε}; L^i = L^(i-1) L
Kleene closure: L* = ∪ i=0,…,∞ L^i
Positive closure: L+ = ∪ i=1,…,∞ L^i
17
Language Operations Examples
L = {A, B, C, D}
D = {1, 2, 3}
L ∪ D = {A, B, C, D, 1, 2, 3}
LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3}
L^2 = {AA, AB, AC, AD, BA, BB, BC, BD, CA, … DD}
L^4 = L^2 L^2 = ??
L* = {all possible strings over L, plus ε}
L+ = L* − {ε}
L (L ∪ D) = ??
L (L ∪ D)* = ??
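Concatenation of finite languages, as in the LD and L^2 examples above, can be computed directly. The following is a small sketch: the function name lang_concat and the fixed MAX_LEN limit are assumptions made for illustration, and strings longer than the limit would be silently truncated.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LEN 16   /* longest stored string, including the terminator */

/* Write LM = { xy | x in L and y in M } into out and return its size.
   Duplicates are skipped so the result is a set.
   out must have room for nl * nm entries. */
size_t lang_concat(const char *const L[], size_t nl,
                   const char *const M[], size_t nm,
                   char out[][MAX_LEN]) {
    size_t count = 0;
    for (size_t i = 0; i < nl; i++) {
        for (size_t j = 0; j < nm; j++) {
            char xy[MAX_LEN];
            snprintf(xy, sizeof xy, "%s%s", L[i], M[j]);
            size_t k = 0;
            while (k < count && strcmp(out[k], xy) != 0) k++;
            if (k == count) strcpy(out[count++], xy);
        }
    }
    return count;
}
```

For L = {A, B, C, D} and D = {1, 2, 3} this produces the 12 strings of LD in the order listed above, and concatenating L with itself gives the 16 strings of L^2. By the same counting, L (L ∪ D) has 4 × 7 = 28 strings and L^4 has 4^4 = 256.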
18
Specification of Patterns for Tokens: Regular Expressions
Basis symbols:
– ε is a regular expression denoting language {ε}
– a is a regular expression denoting {a}, for each symbol a in Σ
If r and s are regular expressions denoting languages L(r) and M(s) respectively, then
– r|s is a regular expression denoting L(r) ∪ M(s)
– rs is a regular expression denoting L(r)M(s)
– r* is a regular expression denoting L(r)*
– (r) is a regular expression denoting L(r)
A language defined by a regular expression is called a regular set
19
Examples:
– Let Σ = {a, b}: a | b, (a | b), a*, (a | b)*, a | a*b
– We assume that ‘*’ has the highest precedence and is left associative. Concatenation has the second highest precedence and is left associative, and ‘|’ has the lowest precedence and is left associative.
– With these conventions, (a) | ((b)*(c)) = a | b*c
20
Algebraic Properties of Regular Expressions
AXIOM                              DESCRIPTION
r | s = s | r                      | is commutative
r | (s | t) = (r | s) | t          | is associative
(r s) t = r (s t)                  concatenation is associative
r (s | t) = r s | r t              concatenation distributes over |
(s | t) r = s r | t r
εr = r                             ε is the identity element for concatenation
rε = r
r* = (r | ε)*                      relation between * and ε
r** = r*                           * is idempotent
21
Finite Automaton
Given an input string, we need a “machine” that has a regular expression hard-coded in it and can tell whether the input string matches the pattern described by that regular expression.
A machine that determines whether a given string belongs to a language is called a finite automaton.
22
Deterministic Finite Automaton
Definition: Deterministic Finite Automaton
– a five-tuple (Σ, S, δ, s0, F) where
Σ is the alphabet
S is the set of states
δ is the transition function (S × Σ → S)
s0 is the starting state
F is the set of final states (F ⊆ S)
Notation:
– Use a transition diagram to describe a DFA: states are nodes, transitions are directed, labeled edges, some states are marked as final, and one state is marked as starting
If the automaton stops at a final state at the end of the input, then the input string belongs to the language.
23
① a
Σ = {a}, L = {a}, S = {1, 2}, δ(1,a) = 2, s0 = 1, F = {2}
24
② a|b
Σ = {a, b}, L = {a, b}, S = {1, 2}, δ(1,a) = 2, δ(1,b) = 2, s0 = 1, F = {2}
25
③ a(a|b)
Σ = {a, b}, L = {aa, ab}, S = {1, 2, 3}, δ(1,a) = 2, δ(2,a) = 3, δ(2,b) = 3, s0 = 1, F = {3}
26
④ a*
Σ = {a}, L = {ε, a, aa, aaa, aaaa, …}, S = {1}, δ(1,ε) = 1, δ(1,a) = 1, s0 = 1, F = {1}
27
⑤ a⁺
Σ = {a}, L = {a, aa, aaa, aaaa, …}, S = {1, 2}, δ(1,a) = 2, δ(2,a) = 2, s0 = 1, F = {2}
Note: a⁺ = aa*
28
⑥ (a|b)(a|b)b
Σ = {a, b}, L = {aab, abb, bab, bbb}, S = {1, 2, 3, 4}, δ(1,a) = 2, δ(1,b) = 2, δ(2,a) = 3, δ(2,b) = 3, δ(3,b) = 4, s0 = 1, F = {4}
29
⑦ (a|b)*
Σ = {a, b}, L = {ε, a, b, aa, bb, ba, ab, aaa, …, bbb, …, abab, …, baba, bbba, …}, S = {1}, δ(1,a) = 1, δ(1,b) = 1, s0 = 1, F = {1}
30
⑧ (a|b)⁺
Σ = {a, b}, L = {a, b, aa, ab, ba, bb, aaa, …, bbb, …}, S = {1, 2}, δ(1,a) = 2, δ(1,b) = 2, δ(2,a) = 2, δ(2,b) = 2, s0 = 1, F = {2}
Note: (a|b)⁺ = (a|b)(a|b)*
31
⑨ a⁺|b⁺
Σ = {a, b}, L = {a, aa, aaa, …, b, bb, bbb, …}, S = {1, 2, 3}, δ(1,a) = 2, δ(2,a) = 2, δ(1,b) = 3, δ(3,b) = 3, s0 = 1, F = {2, 3}
32
⑩ a(a|b)*
Σ = {a, b}, L = {a, aa, ab, aaa, aba, abb, abbb, …} (every string that starts with a), S = {1, 2}, δ(1,a) = 2, δ(2,a) = 2, δ(2,b) = 2, s0 = 1, F = {2}
33
⑪ a(b|a)b⁺
Σ = {a, b}, L = {aab, abb, aabb, …, abbb, abbbb, …}, S = {1, 2, 3, 4}, δ(1,a) = 2, δ(2,a) = 3, δ(2,b) = 3, δ(3,b) = 4, δ(4,b) = 4, s0 = 1, F = {4}
34
⑫ ab*a(a⁺|b⁺)
Σ = {a, b}, L = {aaa, aab, abaa, abbaa, …, abbab, abbabbb, …}, S = {1, 2, 3, 4, 5}, δ(1,a) = 2, δ(2,b) = 2, δ(2,a) = 3, δ(3,a) = 4, δ(4,a) = 4, δ(3,b) = 5, δ(5,b) = 5, s0 = 1, F = {4, 5}
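A DFA given as a transition function can be run with a simple table lookup. The sketch below encodes the transitions of example ⑫ and is only one possible encoding: the names delta, is_final and dfa_accepts, and the use of 0 to mark a missing transition, are assumptions made for illustration.

```c
#include <stddef.h>

#define NUM_STATES 5
#define NO_MOVE 0    /* states are numbered from 1, so 0 marks a missing transition */

/* delta[state][symbol] for example 12: column 0 is 'a', column 1 is 'b'.
   Row 0 is unused so that rows line up with the state numbers. */
static const int delta[NUM_STATES + 1][2] = {
    {NO_MOVE, NO_MOVE},
    {2, NO_MOVE},        /* delta(1,a)=2 */
    {3, 2},              /* delta(2,a)=3, delta(2,b)=2 */
    {4, 5},              /* delta(3,a)=4, delta(3,b)=5 */
    {4, NO_MOVE},        /* delta(4,a)=4 */
    {NO_MOVE, 5},        /* delta(5,b)=5 */
};

/* F = {4, 5} */
static const int is_final[NUM_STATES + 1] = {0, 0, 0, 0, 1, 1};

/* Return 1 if the DFA stops in a final state at the end of the input, else 0. */
int dfa_accepts(const char *input) {
    int state = 1;                                    /* s0 = 1 */
    for (size_t i = 0; input[i] != '\0'; i++) {
        if (input[i] != 'a' && input[i] != 'b') return 0;  /* not in the alphabet */
        state = delta[state][input[i] == 'b'];
        if (state == NO_MOVE) return 0;               /* no transition: reject */
    }
    return is_final[state];
}
```

Running it on the sample strings of example ⑫, such as aaa, aab, abaa and abbab, accepts each of them, while ab and aaab are rejected.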
35
Specification of Patterns for Tokens: Regular Definitions
Regular definitions introduce a naming convention:
d1 → r1
d2 → r2
…
dn → rn
where each ri is a regular expression over Σ ∪ {d1, d2, …, d(i-1)}
Any dj in ri can be textually substituted in ri to obtain an equivalent set of definitions
36
Specification of Patterns for Tokens: Regular Definitions
Example:
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*
Regular definitions are not recursive:
digits → digit digits | digit    (wrong!)
37
Specification of Patterns for Tokens: Notational Shorthand
The following shorthands are often used:
r+ = rr*
r? = r | ε
[a-z] = a | b | c | … | z
Examples:
digit → [0-9]
num → digit+ (. digit+)? ( E (+ | -)? digit+ )?
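The num definition maps almost symbol for symbol onto a POSIX extended regular expression. The sketch below assumes a POSIX system that provides <regex.h>; the function name is_num is hypothetical. The pattern is anchored with ^ and $ so that the whole string must match.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if s matches num, 0 if it does not,
   and -1 if the pattern cannot be compiled. */
int is_num(const char *s) {
    /* digit+ (. digit+)? ( E (+|-)? digit+ )?  with digit = [0-9] */
    static const char *pattern = "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$";
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int matched = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

Strings such as 3.1416, 0 and 6.02E23 match, while "1." and "E5" do not, because the definition requires digits on both sides of the point and before the exponent. Compiling the pattern on every call keeps the sketch short; a real scanner would compile it once.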
38
Regular Definitions and Grammars
Grammar:
stmt → if expr then stmt | if expr then stmt else stmt | ε
expr → term relop term | term
term → id | num
Regular definitions:
if → if
then → then
else → else
relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*
num → digit+ (. digit+)? ( E (+ | -)? digit+ )?
39
Constructing Transition Diagrams for Tokens
Transition diagrams (TDs) are used to represent the tokens; they are automata.
As characters are read, the relevant TDs are used to attempt to match a lexeme to a pattern.
Each TD has:
– States: represented by circles
– Actions: represented by arrows between states
– Start state: beginning of a pattern (arrowhead)
– Final state(s): end of a pattern (concentric circles)
Each TD is deterministic: there is no need to choose between two different actions.
40
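Example: All RELOPs
[Figure: transition diagram for relational operators; * marks states that retract the lookahead character]
start → 0
0 --'<'--> 1
1 --'='--> 2: return(relop, LE)
1 --'>'--> 3: return(relop, NE)
1 --other--> 4*: return(relop, LT)
0 --'='--> 5: return(relop, EQ)
0 --'>'--> 6
6 --'='--> 7: return(relop, GE)
6 --other--> 8*: return(relop, GT)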
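The relop diagram translates into a few nested tests. The sketch below is one way to write it: the names Relop and scan_relop are hypothetical, and the function reports the lexeme length through a pointer instead of moving a shared forward pointer. The retracting states 4 and 8 simply do not count the lookahead character.

```c
#include <stddef.h>

typedef enum { LT, LE, EQ, NE, GT, GE, NOT_RELOP } Relop;

/* Follow the relop diagram from state 0 on the text at s.
   On success *len is the number of characters in the lexeme;
   if s does not start with a relational operator, *len is 0. */
Relop scan_relop(const char *s, size_t *len) {
    switch (s[0]) {
    case '<':                                      /* state 1 */
        if (s[1] == '=') { *len = 2; return LE; }  /* state 2 */
        if (s[1] == '>') { *len = 2; return NE; }  /* state 3 */
        *len = 1; return LT;                       /* state 4, lookahead retracted */
    case '=':
        *len = 1; return EQ;                       /* state 5 */
    case '>':                                      /* state 6 */
        if (s[1] == '=') { *len = 2; return GE; }  /* state 7 */
        *len = 1; return GT;                       /* state 8, lookahead retracted */
    default:
        *len = 0; return NOT_RELOP;
    }
}
```

For the input "<5" the function returns LT with a length of 1, leaving the 5 for the next token, which is what the retraction in state 4 achieves.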
41
Example TDs: id and delim
Keyword or id:
start → 9 --letter--> 10 (loop on letter or digit) --other--> 11*: return(install_id(), gettoken())
delim:
start → 28 --delim--> 29 (loop on delim) --other--> 30*
42
Combine TD for KW and IDs
install_id() decides the attribute:
– It checks the accepted lexeme against the list of keywords; if it matches, zero is returned.
– Otherwise it looks the lexeme up in the symbol table; if it is found, the address of the entry is returned.
– If the lexeme is not found in the symbol table, install_id() first installs the ID in the symbol table and returns the address of the newly created entry.
gettoken() decides the token:
– If install_id() returned zero, the word itself (or its numeric form) is returned as the token.
– Otherwise the token “ID” is returned.
43
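Example TDs: Unsigned Numbers
[Figure: three transition diagrams; * marks states that retract the lookahead character]
– start → 12 --digit--> 13 (loop on digit) --.--> 14 --digit--> 15 (loop on digit) --E--> 16 --(+ | -)--> 17 --digit--> 18 (loop on digit) --other--> 19*; also 13 --E--> 16 and 16 --digit--> 18
– start → 20 --digit--> 21 (loop on digit) --.--> 22 --digit--> 23 (loop on digit) --other--> 24*
– start → 25 --digit--> 26 (loop on digit) --other--> 27*
Questions:
Is ordering important for unsigned numbers?
Why are there no TDs for then, else, if?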
44
Keywords Recognition
All keywords / reserved words are matched as ids.
After the match, the symbol table or a special keyword table is consulted.
The keyword table contains string versions of all keywords and their associated token values.
[Table: keywords if, begin, then, ..., each listed with the value 0]
If a match is not found, it is assumed that an id has been discovered.
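The keyword lookup and the install_id() / gettoken() behaviour described in these slides can be sketched as follows. Several details are assumptions made for illustration: the slides' functions take no arguments and work on the current lexeme, whereas here the lexeme is passed explicitly; the "address" of an entry is a 1-based index into a small fixed-size array; the token is returned as a string; and the keyword list is just a sample.

```c
#include <string.h>

#define MAX_SYMBOLS 64
#define MAX_LEXEME 32

/* Sample keyword table (hypothetical contents). */
static const char *const keywords[] = { "if", "then", "else", "begin" };

/* Toy symbol table: entry i holds the lexeme with address i + 1. */
static char symtab[MAX_SYMBOLS][MAX_LEXEME];
static int nsymbols = 0;

/* Return 0 for a keyword, otherwise the address (1-based index) of the
   lexeme's symbol table entry, installing it first if it is new.
   Return -1 if the table is full or the lexeme is too long. */
int install_id(const char *lexeme) {
    for (size_t k = 0; k < sizeof keywords / sizeof keywords[0]; k++)
        if (strcmp(lexeme, keywords[k]) == 0) return 0;
    for (int i = 0; i < nsymbols; i++)
        if (strcmp(lexeme, symtab[i]) == 0) return i + 1;
    if (nsymbols == MAX_SYMBOLS || strlen(lexeme) >= MAX_LEXEME) return -1;
    strcpy(symtab[nsymbols], lexeme);
    return ++nsymbols;
}

/* A keyword is its own token; anything else is the token "ID". */
const char *gettoken(int id_result, const char *lexeme) {
    return id_result == 0 ? lexeme : "ID";
}
```

A linear search keeps the sketch short; a production symbol table would more likely use hashing.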
45
Transition Diagrams & Lexical Analyzers
state = 0;
token nexttoken()
{
    while (1) {
        switch (state) {
        case 0:
            c = nextchar();   /* c is lookahead character */
            if (c == blank || c == tab || c == newline) {
                state = 0;
                lexeme_beginning++;   /* advance beginning of lexeme */
            }
            else if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else state = fail();
            break;
        … /* cases 1-8 here */
46
        case 9:
            c = nextchar();
            if (isletter(c)) state = 10;
            else state = fail();
            break;
        case 10:
            c = nextchar();
            if (isletter(c)) state = 10;
            else if (isdigit(c)) state = 10;
            else state = 11;
            break;
        case 11:
            retract(1);
            install_id();
            return ( gettoken() );
        … /* cases 12-24 here */
        case 25:
            c = nextchar();
            if (isdigit(c)) state = 26;
            else state = fail();
            break;
        case 26:
            c = nextchar();
            if (isdigit(c)) state = 26;
            else state = 27;
            break;
        case 27:
            retract(1);
            install_num();
            return ( NUM );
        }
    }
}
Case numbers correspond to transition diagram states!
47
When Failures Occur:
int state = 0, start = 0;
int lexical_value;   /* to "return" second component of token */

int fail()
{
    forward = token_beginning;
    switch (start) {
    case 0:  start = 9;  break;
    case 9:  start = 12; break;
    case 12: start = 20; break;
    case 20: start = 25; break;
    case 25: recover();  break;
    default: /* compiler error */ break;
    }
    return start;
}
48
Using a Lex Generator
Lex source program (lex.l) → Lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
Input stream (input.c) → a.out → sequence of tokens