Lexical Analysis Natawut Nupairoj, Ph.D.


1 Lexical Analysis
Natawut Nupairoj, Ph.D.
Department of Computer Engineering, Chulalongkorn University

2 Outline
Overview.
Token, Lexeme, and Pattern.
Lexical Analysis Specification.
Lexical Analysis Engine.

3 Front-End Components
Front-end pipeline: Source program (text stream) -> Scanner -> Parser -> Semantic Analyzer -> Intermediate Representation (file or in memory).
Scanner: groups characters into tokens and passes the next token to the Parser. Ex: the text stream m a i n ( ) { yields identifier main, then symbol (.
Parser: constructs the parse tree.
Semantic Analyzer: checks semantic/contextual constraints.
Symbol Table: consulted by the front-end components.

4 Tasks for Scanner
Read input and group tokens for the Parser.
Strip comments and white space.
Count line numbers.
Create an entry in the symbol table.
Perform preprocessing functions.

5 Benefits
Simpler design
  the parser doesn't worry about comments and white space.
More efficient scanner
  optimize the scanning process only.
  use specialized buffering techniques.
Portability
  handle standard symbols on different platforms.

6 Basic Terminology
Token
  a set of strings.
  Ex: token = identifier
Lexeme
  a sequence of characters in the source program matched by the pattern for a token.
  Ex: lexeme = counter

7 Basic Terminology
Pattern
  a description of the strings that can belong to a particular token set.
  Ex: pattern = a letter followed by letters or digits
      {A,…,Z,a,…,z}{A,…,Z,a,…,z,0,…,9}*

8 Token, Lexeme, Pattern
  Token      Lexeme             Pattern
  const      const              const
  if         if                 if
  relation   <, <=, …, >=       comparison symbols
  id         counter, x, y      letter (letter | digit)*
  num        12.53, 1.42E-10    any numeric constant
  literal    "Hello World"      characters between " and "
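The relationship between the three terms on slide 8 can be sketched in Python: a pattern is written as a regular expression, a lexeme is a concrete string, and a token is the name of the set the lexeme belongs to. The `PATTERNS` table and the `tokens_matching` helper are illustrative names, and the exact regular expressions (in particular the set of comparison symbols and the form of numeric constants) are assumptions, not taken from the slides.

```python
import re

# Hypothetical table: token name -> pattern, written as a Python regex.
PATTERNS = {
    "const": r"const",
    "if": r"if",
    "relation": r"<=|>=|<>|<|>|=",
    "id": r"[A-Za-z][A-Za-z0-9]*",
    "num": r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?",
    "literal": r'"[^"]*"',
}

def tokens_matching(lexeme):
    """Return the names of all tokens whose pattern matches the whole lexeme."""
    return [name for name, pat in PATTERNS.items() if re.fullmatch(pat, lexeme)]

print(tokens_matching("counter"))        # ['id']
print(tokens_matching("1.42E-10"))       # ['num']
print(tokens_matching('"Hello World"'))  # ['literal']
print(tokens_matching("if"))             # ['if', 'id']
```

The last line shows why keywords need special handling: the lexeme if matches the pattern for id as well as the pattern for if.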

9 Language and Lexical Analysis
Fixed-format input, e.g. FORTRAN
  must consider the alignment of a lexeme.
  difficult to scan.
No reserved words, e.g. PL/I
  keywords vs. id? Requires complex rules.
  Ex: if if = then then then := else; else else := then;

10 Regular Expression Revisited
ε is a regular expression that denotes {ε}.
If a is a symbol in the alphabet, a is a regular expression that denotes {a}.
Suppose r and s are regular expressions:
  (r)|(s) denotes L(r) ∪ L(s).
  (r)(s) denotes L(r)L(s).
  (r)* denotes (L(r))*.
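The three language operations behind these rules can be tried out on small finite sets of strings. The sketch below is only an illustration: the function names are made up here, and because L* is infinite whenever L contains a non-empty string, `kleene` takes a hypothetical `max_repeats` bound and returns a finite approximation.

```python
def union(l1, l2):
    """L(r) U L(s): every string that is in either language."""
    return l1 | l2

def concat(l1, l2):
    """L(r)L(s): every string xy with x from the first and y from the second."""
    return {x + y for x in l1 for y in l2}

def kleene(lang, max_repeats):
    """Approximate (L(r))* using at most max_repeats concatenations."""
    result = {""}      # zero repetitions gives the empty string (epsilon)
    current = {""}
    for _ in range(max_repeats):
        current = concat(current, lang)
        result |= current
    return result

print(sorted(union({"a"}, {"b"})))        # ['a', 'b']
print(sorted(concat({"a", "b"}, {"c"})))  # ['ac', 'bc']
print(sorted(kleene({"a"}, 3)))           # ['', 'a', 'aa', 'aaa']
```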

11 Precedence of Operator
Level of precedence, highest first:
  Kleene closure (*)
  concatenation
  union (|)
All operators are left associative.
Ex: a*b | cd* = ((a*)b) | (c(d*))
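Python's `re` module follows the same precedence for `*`, concatenation, and `|`, so the example on slide 11 can be checked by brute force. The sketch below compares the unparenthesized and fully parenthesized forms on every string over {a, b, c, d} up to a small length; `same_language` and its `max_len` parameter are names chosen for this illustration.

```python
import re
from itertools import product

implicit = re.compile(r"a*b|cd*")            # relies on operator precedence
explicit = re.compile(r"((a*)b)|(c(d*))")    # grouping spelled out

def same_language(max_len=5, alphabet="abcd"):
    """Check that both expressions accept the same strings up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(implicit.fullmatch(s)) != bool(explicit.fullmatch(s)):
                return False
    return True

print(same_language())  # True
```

A different grouping such as a*(b|c)d* is not equivalent: it accepts bd, which a*b | cd* rejects.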

12 Regular Definition
A sequence of definitions:
  d1 → r1
  d2 → r2
  ...
  dn → rn
di is a distinct name.
ri is a regular expression over Σ ∪ {d1, …, di-1}.

13 Examples
  letter → A | B | … | Z | a | b | … | z
  digit → 0 | 1 | … | 9
  id → letter ( letter | digit )*
  digits → digit digit*
  opt_fraction → . digits | ε
  opt_exponent → ( E ( + | - | ε ) digits ) | ε
  num → digits opt_fraction opt_exponent

14 Notational Shorthands
One or more instances
  r+ = rr*
Zero or one instance
  r? = r | ε
  (rs)? = rs | ε
Character class
  [A-Za-z] = A | B | … | Z | a | b | … | z

15 Examples
  digit → [0-9]
  digits → digit+
  opt_fraction → ( . digits )?
  opt_exponent → ( E ( + | - )? digits )?
  num → digits opt_fraction opt_exponent
  id → [A-Za-z][A-Za-z0-9]*
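A regular definition maps naturally onto string composition: each name is defined once and may be used in later definitions. The sketch below rebuilds the definitions from slide 15 as Python regular expressions. The variable `ident` stands for id (renamed because `id` is a Python builtin), and `is_num` and `is_id` are helper names introduced for this example.

```python
import re

# Each definition may refer only to names defined before it.
digit = r"[0-9]"
digits = rf"{digit}+"
opt_fraction = rf"(?:\.{digits})?"
opt_exponent = rf"(?:E[+-]?{digits})?"
num = rf"{digits}{opt_fraction}{opt_exponent}"
ident = r"[A-Za-z][A-Za-z0-9]*"

def is_num(s):
    """True if the whole string s matches the num definition."""
    return re.fullmatch(num, s) is not None

def is_id(s):
    """True if the whole string s matches the id definition."""
    return re.fullmatch(ident, s) is not None

print(is_num("31"), is_num("31.02"), is_num("31.02E-15"))  # True True True
print(is_num("31."), is_num(".5"))                         # False False
print(is_id("counter"), is_id("9lives"))                   # True False
```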

16 Recognition of Tokens
Consider tokens from the grammar.
  pattern
  attribute
Draw NFAs with retracting options.

17 Example: Grammar
  stmt ::= if expr then stmt
         | if expr then stmt else stmt
         | expr
  expr ::= term relop term
         | term
  term ::= id
         | num

18 Example: Regular Definition
  if → if
  then → then
  else → else
  relop → < | <= | = | <> | > | >=
  id → letter (letter | digit)*
  num → digit+ ( . digit+ )? ( E (+ | -)? digit+ )?
  delim → blank | tab | newline
  ws → delim+

19 Example: Pattern-Token-Attribute
  Regular Expression   Token   Attribute-Value
  ws                   -       -
  if                   if      -
  then                 then    -
  else                 else    -
  id                   id      Index in table
  num                  num     Index in table
  <                    relop   LT
  <=                   relop   LE
  =                    relop   EQ
  <>                   relop   NE
  ...                  ...     ...

20 Attributes for Tokens
Ex: if count >= 0 then ...
  <if, >
  <id, index for count in symbol table>
  <relop, GE>
  <num, integer value 0>
  <then, >
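The token and attribute pairs for if count >= 0 then can be reproduced with a small regex-driven scanner. This is a sketch under several assumptions: the names `tokenize`, `KEYWORDS`, `RELOPS`, and `TOKEN_RE` are invented here, the symbol table is a plain list holding identifiers only (so indexes start at 0), keywords carry `None` as their empty attribute, and num carries its numeric value as on slide 20.

```python
import re

KEYWORDS = {"if", "then", "else"}
RELOPS = {"<": "LT", "<=": "LE", "=": "EQ", "<>": "NE", ">": "GT", ">=": "GE"}

# Longer relational operators are listed before their one-character prefixes.
TOKEN_RE = re.compile("|".join([
    r"(?P<ws>[ \t\n]+)",
    r"(?P<num>[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?)",
    r"(?P<word>[A-Za-z][A-Za-z0-9]*)",
    r"(?P<relop><=|<>|>=|<|=|>)",
]))

def tokenize(source):
    """Return (tokens, symbols): (token, attribute) pairs and the id table."""
    symbols, tokens, pos = [], [], 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        pos = m.end()
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue                                   # ws produces no token
        if kind == "word" and lexeme in KEYWORDS:
            tokens.append((lexeme, None))              # keyword: no attribute
        elif kind == "word":
            if lexeme not in symbols:
                symbols.append(lexeme)
            tokens.append(("id", symbols.index(lexeme)))
        elif kind == "num":
            value = int(lexeme) if lexeme.isdigit() else float(lexeme)
            tokens.append(("num", value))
        else:
            tokens.append(("relop", RELOPS[lexeme]))
    return tokens, symbols

tokens, symbols = tokenize("if count >= 0 then")
print(tokens)
# [('if', None), ('id', 0), ('relop', 'GE'), ('num', 0), ('then', None)]
print(symbols)  # ['count']
```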

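21 NFA – Lexical Analysis Engine
Transition diagram for relop. A state marked * retracts the input by one character.
  From    Input   To   Action
  start   <       1
  1       =       2    return(relop, LE)
  1       >       3    return(relop, NE)
  1       other   4*   return(relop, LT)
  start   =       5    return(relop, EQ)
  start   >       6
  6       =       7    return(relop, GE)
  6       other   8*   return(relop, GT)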
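The relop diagram (states 1 to 8) can be hand-coded as nested branches, one per state. In this sketch the function name `relop`, the `pos` parameter, and the returned `(token, next_pos)` shape are assumptions made for illustration. Retracting in the * states is modeled by peeking at the next character and not consuming it.

```python
def relop(text, pos=0):
    """Return ((token, attribute), next_pos), or None if no relop starts at pos."""
    def peek(i):
        return text[i] if i < len(text) else ""   # "" stands for end of input

    c = peek(pos)
    if c == "<":                                  # state 1
        nxt = peek(pos + 1)
        if nxt == "=":
            return ("relop", "LE"), pos + 2       # state 2
        if nxt == ">":
            return ("relop", "NE"), pos + 2       # state 3
        return ("relop", "LT"), pos + 1           # state 4*: other is not consumed
    if c == "=":
        return ("relop", "EQ"), pos + 1           # state 5
    if c == ">":                                  # state 6
        if peek(pos + 1) == "=":
            return ("relop", "GE"), pos + 2       # state 7
        return ("relop", "GT"), pos + 1           # state 8*: other is not consumed
    return None

print(relop("<= 0"))  # (('relop', 'LE'), 2)
print(relop("<x"))    # (('relop', 'LT'), 1)
print(relop(">"))     # (('relop', 'GT'), 1)
```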

22 Handle Numbers
Pattern for number contains options.
  num → digit+ ( . digit+ )? ( E (+ | -)? digit+ )?
  Ex: 31, 31.02, 31.02E-15
Always get the longest possible match.
  match the longest first.
  if not match, try the next possible pattern.

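23 Handle Numbers
[Figure: three transition diagrams for num, tried in order. States 12 to 19 recognize a number with an exponent, states 20 to 24 a number with a fraction, and states 25 to 27 a number made of digits only. Each diagram ends with an "other" edge into a * (retracting) state; the accepting action is return(num, getnum()).]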
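The "match the longest first, then try the next pattern" strategy can be sketched by listing the alternatives from most to least specific and returning the first that matches. The three regular expressions below are derived from the num pattern on slide 22, not traced edge by edge from the diagrams, and `scan_num` and `NUM_PATTERNS` are names invented for this example.

```python
import re

# Most specific alternative first, falling back to shorter forms.
NUM_PATTERNS = [
    re.compile(r"[0-9]+(?:\.[0-9]+)?E[+-]?[0-9]+"),  # with exponent
    re.compile(r"[0-9]+\.[0-9]+"),                    # with fraction only
    re.compile(r"[0-9]+"),                            # digits only
]

def scan_num(text, pos=0):
    """Return (lexeme, next_pos) for the number at pos, or None."""
    for pattern in NUM_PATTERNS:
        m = pattern.match(text, pos)
        if m:
            return m.group(), m.end()
    return None

print(scan_num("31.02E-15;"))  # ('31.02E-15', 9)
print(scan_num("31.02E"))      # ('31.02', 5)
print(scan_num("31.x"))        # ('31', 2)
```

In the second and third calls the longer alternatives fail partway through, so the scanner falls back to a shorter pattern and leaves the unmatched characters (E, or .x) in the input.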

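24 Handle Keywords
Two approaches:
  encode keywords into an NFA (if, then, etc.)
    complex NFA (too many states).
  use symbol table
    simple.
    requires some tricks.
Transition diagram for identifiers and keywords. A state marked * retracts the input by one character.
  From   Input             To    Action
  9      letter            10
  10     letter or digit   10
  10     other             11*   return(gettoken(), install_id())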

25 Handle Keywords
Symbol table contains both lexeme and token type.
Initialize symbol table with all keywords and corresponding token types.
  lexeme: if     token type: if
  lexeme: then   token type: then
  lexeme: else   token type: else

26 Handle Keywords
[Figure: the Scanner and the Parser next to the initial Symbol Table.]
  Entry   Lexeme   Token Type
  1       if       if
  2       then     then
  3       else     else
  4
  5

27 Handle Keywords
gettoken():
  If id is not found in the table, return token type ID.
  Otherwise, return token type from the table.

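28 Handle Keywords
[Figure: the Scanner reads i f from the source program (text stream) i f c o u n t < = … and calls gettoken. The lexeme if is found in the Symbol Table, so the next-token for the Parser has token type if. Symbol Table: 1 if, 2 then, 3 else; entries 4 and 5 are empty.]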

29 Handle Keywords
install_id():
  If id is not found in the table, it's a new id. Insert the new id into the table and return a pointer to the new entry.
  If id is found and its type is ID, return a pointer to that entry.
  Otherwise, it's a keyword. Return 0.

30 Handle Keywords
[Figure: for the same lexeme i f, the Scanner calls install_id. The lexeme is found in the Symbol Table as a keyword, so the token passed to the Parser as next-token is if. Symbol Table: 1 if, 2 then, 3 else; entries 4 and 5 are empty.]

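31 Handle Keywords
[Figure: the Scanner reads c o u n t from the source program (text stream) i f c o u n t < = … and calls gettoken. The lexeme count is not found in the Symbol Table, so the next-token for the Parser has token type id. Symbol Table: 1 if, 2 then, 3 else; entries 4 and 5 are empty.]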

32 Handle Keywords
[Figure: for the same lexeme c o u n t, the Scanner calls install_id, which inserts count into the Symbol Table at entry 4 with token type id and returns 4. The token passed to the Parser as next-token is <id, 4>. Symbol Table: 1 if, 2 then, 3 else, 4 count (id); entry 5 is empty.]
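The keyword walkthrough on slides 25 to 32 can be condensed into a small class. This is a sketch, not the lecture's implementation: the class name `SymbolTable`, the list-of-tuples storage, and the private `_find` helper are assumptions. Entry 0 is left unused so that install_id() can return 0 for keywords, and the "pointer" to an entry is modeled as its index.

```python
class SymbolTable:
    def __init__(self, keywords=("if", "then", "else")):
        # Entry 0 is a placeholder; keywords occupy entries 1, 2, 3, ...
        self.entries = [None] + [(kw, kw) for kw in keywords]

    def _find(self, lexeme):
        """Return the entry index for lexeme, or 0 if it is not in the table."""
        for index in range(1, len(self.entries)):
            if self.entries[index][0] == lexeme:
                return index
        return 0

    def gettoken(self, lexeme):
        """Token type from the table, or "id" when the lexeme is not found."""
        index = self._find(lexeme)
        return "id" if index == 0 else self.entries[index][1]

    def install_id(self, lexeme):
        """Insert a new id, return an existing id's index, or 0 for a keyword."""
        index = self._find(lexeme)
        if index == 0:
            self.entries.append((lexeme, "id"))
            return len(self.entries) - 1
        if self.entries[index][1] == "id":
            return index
        return 0

table = SymbolTable()
print(table.gettoken("if"), table.install_id("if"))        # if 0
print(table.gettoken("count"), table.install_id("count"))  # id 4
print(table.install_id("count"))                           # 4
```

Storing keywords and identifiers in one table keeps the scanner's automaton small: a single identifier diagram handles both, and the table lookup decides which token to return.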
