1
Chapter 2: Lexical Analysis
2
[Figure: phases of a compiler. Source Program → Lexical Analyser → Syntax Analyser → Semantic Analyser → Intermediate Code Generator → Code Optimizer → Code Generator → Target Program. The Symbol Table Manager and the Error Handler interact with every phase. The highlighted phase is Lexical Analysis.]
3
Languages
An alphabet (Σ) is a finite set of symbols, e.g. {a, b, c}.
A symbol is an element of an alphabet, e.g. a.
A word is a finite sequence of symbols drawn from the alphabet Σ, e.g. abcaa.
A language (over alphabet Σ) is a set of words, e.g. {abcaa, abc, b, caa}.
4
Languages
Σ* denotes the set of all words over the alphabet Σ.
|s| denotes the length of string s.
ε denotes the word of length 0, the empty word.
∅ denotes the empty set, {}. Note that ∅ is not the same as {ε}, the language whose only word is the empty word.
Note 1: In language theory the terms sentence and word are often used as synonyms for the term string.
Note 2: A language (over alphabet Σ) is a set of strings (over alphabet Σ). For example, with Σ = {a}, one possible language is L = {ε, a, aa, aaa}.
bmz is a string of length 3.
5
Terms for parts of a string
DEFINITION
prefix of s: A string obtained by removing zero or more trailing symbols of string s; ban is a prefix of banana.
suffix of s: A string formed by deleting zero or more of the leading symbols of s; nana is a suffix of banana.
substring of s: A string obtained by deleting a prefix and a suffix from s; nan is a substring of banana. Every prefix and every suffix of s is a substring of s, but not every substring of s is a prefix or a suffix of s. For every string s, both s and ε are prefixes, suffixes, and substrings of s.
proper prefix, suffix, or substring of s: Any nonempty string x that is, respectively, a prefix, suffix, or substring of s such that s ≠ x.
subsequence of s: Any string formed by deleting zero or more not necessarily contiguous symbols from s; baaa is a subsequence of banana.
6
Terms for parts of a string (examples)
Let us take this string: banana
prefix: ε, b, ba, ban, ..., banana
suffix: ε, a, na, ana, ..., banana
substring: ε, b, a, n, ba, an, na, ..., banana
subsequence: ε, b, a, n, ba, bn, an, aa, na, nn, ..., banana
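The definitions of prefix, suffix, substring, and subsequence can be checked mechanically. The following is a minimal Python sketch; the function names are illustrative and not part of the slides, and the empty Python string "" stands in for ε.

```python
from itertools import combinations

def prefixes(s):
    # Remove zero or more trailing symbols: s[:0] is ε, s[:len(s)] is s itself.
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    # Remove zero or more leading symbols, shortest suffix first.
    return [s[i:] for i in range(len(s), -1, -1)]

def substrings(s):
    # Delete a prefix and a suffix: every contiguous slice s[i:j].
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def subsequences(s):
    # Delete zero or more symbols that need not be contiguous.
    # combinations() keeps the chosen symbols in their original order.
    return {"".join(c) for r in range(len(s) + 1) for c in combinations(s, r)}

def is_proper(x, s):
    # "Proper" means nonempty and different from s itself.
    return x != "" and x != s
```

For banana, "nan" is a substring, while "baaa" is a subsequence but not a substring, because its symbols are not contiguous in banana.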
7
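Operations on Strings
Concatenation: Concatenation of words is denoted by juxtaposition. If x and y are strings, then the concatenation of x and y is xy.
If x = dog and y = house, then xy = doghouse.
Concatenation is associative: x(yz) = (xy)z.
ε is the identity for concatenation: xε = εx = x.
Concatenation is not symmetric (not commutative): in general xy ≠ yx.
Exponentiation: s^0 = ε, s^1 = s, s^2 = ss, and in general s^i = s^(i-1) s for i > 0.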
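As a quick check of these identities, here is a small Python sketch. The names power, x, y, z, and EPSILON are chosen for illustration only; Python's + on strings plays the role of juxtaposition.

```python
EPSILON = ""  # the empty word ε

def power(s, i):
    """Return s^i, using s^0 = ε and s^i = s^(i-1) s."""
    return EPSILON if i == 0 else power(s, i - 1) + s

x, y, z = "dog", "house", "boat"

concatenated = x + y                              # doghouse
is_associative = x + (y + z) == (x + y) + z       # x(yz) = (xy)z
has_identity = x + EPSILON == EPSILON + x == x    # xε = εx = x
is_commutative_here = x + y == y + x              # False: doghouse vs housedog
```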
8
Operations on Languages
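Union of L and M, L ∪ M: L ∪ M = { s | s ∈ L or s ∈ M }
Concatenation of L and M, LM: LM = { st | s ∈ L and t ∈ M }
Kleene closure of L, L*: L* = L^0 ∪ L^1 ∪ L^2 ∪ ... (the union of L^i for all i ≥ 0)
Positive closure of L, L+: L+ = L^1 ∪ L^2 ∪ L^3 ∪ ... (the union of L^i for all i ≥ 1)
Here L^0 = {ε} and L^i = L^(i-1) L.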
9
Example
L is the set {A, B, ..., Z, a, b, ..., z} and D the set {0, 1, ..., 9}. Since a symbol can be regarded as a string of length one, the sets L and D are each finite languages. The following are some examples of new languages created from L and D:
1. L ∪ D is the set of letters and digits.
2. LD is the set of strings consisting of a letter followed by a digit.
3. L^4 is the set of all four-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
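The operations on languages can be tried out on finite sets of strings. The sketch below is only an illustration: the function names are invented here, and because L* and L+ are infinite, closure_up_to builds just the finite part up to a chosen maximum power.

```python
import string

def concat(L, M):
    # LM = { st | s in L and t in M }
    return {s + t for s in L for t in M}

def power(L, n):
    # L^0 = {ε}; L^i = L^(i-1) L
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result

def closure_up_to(L, max_power, positive=False):
    # Finite approximation of L* (or of L+ when positive=True):
    # the union of L^i for i from 0 (or 1) up to max_power.
    start = 1 if positive else 0
    result = set()
    for i in range(start, max_power + 1):
        result |= power(L, i)
    return result

L = set(string.ascii_letters)   # {A, ..., Z, a, ..., z}
D = set(string.digits)          # {0, ..., 9}

letters_and_digits = L | D      # L ∪ D
letter_then_digit = concat(L, D)  # LD
```

With these definitions, power(D, 2) contains the 100 two-digit strings, and closure_up_to(D, 2) additionally contains ε and the ten single digits.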
10
Operator Associativity
Grammar rules may influence operator associativity.
How to specify operator associativity for:
1. Multiplication operator (left associative) in FORTRAN:
   1 * 2 * 3 is evaluated as (1 * 2) * 3
   3 * 1 * 5 is evaluated as (3 * 1) * 5
2. Exponentiation (right associative) in FORTRAN:
   X ** Y ** Z is evaluated as X ** (Y ** Z)
11
Example 1
9 - 5 + 2
right-associativity: 9 - (5 + 2)
left-associativity: (9 - 5) + 2
The choice rests with the language designer, who must take into account intuitions and convenience. By convention, most arithmetic operations use left-associativity.
(We shall use this assumption in this course)
12
Example 2
left-associativity: 9 - 5 - 2
list → list + digit | list - digit | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
right-associativity: a = b = c
right → letter = right | letter
letter → a | b | c | … | z
13
Specifying Operator Associativity
* For left associativity, write the grammar rule so that the LHS appears at the beginning of its RHS. Such a rule is also known as (aka) left recursive.
* For right associativity, write the grammar rule so that the LHS appears at the end of its RHS. Such a rule is aka right recursive.
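The effect of left and right recursion on the shape of the parse can be seen in a small Python sketch. It assumes well-formed, single-character tokens and does no error checking; the function names and the nested-tuple tree format (operator, left, right) are choices made here for illustration.

```python
def parse_list(text):
    """list -> list + digit | list - digit | digit  (left recursive).

    The left recursion is realised as a loop that keeps folding the tree
    built so far into the LEFT operand of the next operator.
    """
    tokens = list(text.replace(" ", ""))
    tree = tokens[0]                          # the first digit
    for i in range(1, len(tokens), 2):
        op, digit = tokens[i], tokens[i + 1]
        tree = (op, tree, digit)
    return tree

def parse_right(text):
    """right -> letter = right | letter  (right recursive).

    The recursive call parses everything after the first '=' and becomes
    the RIGHT operand.
    """
    tokens = list(text.replace(" ", ""))
    if len(tokens) == 1:
        return tokens[0]
    return ("=", tokens[0], parse_right("".join(tokens[2:])))
```

parse_list("9-5-2") groups as (9-5)-2, while parse_right("a=b=c") groups as a=(b=c).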
14
Operator Precedence
Operator precedence defines the order in which an expression is evaluated when several different operators are present.
Grammar rules may influence operator precedence:
assign → ident = expr
ident → A | B | C
expr → ident + expr | ident * expr | ( expr ) | ident
Draw the parse tree for A = B * A + C
Operators generated lower in the parse tree are evaluated first and therefore have higher precedence than operators generated higher up in the parse tree.
15
Precedence Levels
( )     higher
^
* /
+ -     lower
<exp> ::= <exp> + <exp> | <exp> * <exp> | <const>
<const> ::= 0..9
Example: 9+5*2
[Figure: parse tree for 9+5*2, with exp and const nodes above the leaves 9, +, 5, *, 2]
16
Specifying Operator Precedence
Grammar rules can be made to exhibit operator precedence by introducing additional nonterminals and new rules.
assign → ident = expr
ident → A | B | C
expr → expr + term | term
term → term * factor | factor
factor → ( expr ) | ident
Draw the parse tree for A = B * A + C
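One way to see how the extra nonterminals encode precedence is a small recursive-descent parser for this grammar. The sketch below is an illustration, not a prescribed implementation: it assumes single-character tokens, replaces the left-recursive rules for expr and term with loops (which still yields left-associative trees), and returns nested tuples of the form (operator, left, right).

```python
def parse_assign(text):
    tokens = [c for c in text if not c.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expect(tok):
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r}, found {peek()!r}")
        advance()

    def ident():                       # ident -> A | B | C
        if peek() not in ("A", "B", "C"):
            raise SyntaxError(f"expected identifier, found {peek()!r}")
        return advance()

    def factor():                      # factor -> ( expr ) | ident
        if peek() == "(":
            advance()
            node = expr()
            expect(")")
            return node
        return ident()

    def term():                        # term -> term * factor | factor
        node = factor()
        while peek() == "*":
            advance()
            node = ("*", node, factor())
        return node

    def expr():                        # expr -> expr + term | term
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())
        return node

    target = ident()                   # assign -> ident = expr
    expect("=")
    tree = ("=", target, expr())
    if peek() is not None:
        raise SyntaxError(f"unexpected input {peek()!r}")
    return tree
```

For A = B * A + C the * node ends up below the + node, so the multiplication is grouped first; parentheses, as in A = B * (A + C), override this.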
17
Postfix Notation
The postfix notation for an expression E can be defined inductively as follows:
1. If E is a variable or constant, then the postfix notation for E is E itself.
2. If E is an expression of the form E1 op E2, where op is any binary operator, then the postfix notation for E is E1' E2' op, where E1' and E2' are the postfix notations for E1 and E2, respectively.
3. If E is an expression of the form ( E1 ), then the postfix notation for E1 is also the postfix notation for E.
The postfix notation for (9-5)+2 is 95-2+
The postfix notation for 9-(5+2) is 952+-
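The inductive definition translates almost directly into a recursive function. This Python sketch assumes the expression is already available as a tree of nested tuples (op, E1, E2) with strings at the leaves; that representation and the function name are choices made here. Parentheses do not appear in the tree, which corresponds to rule 3.

```python
def postfix(expr):
    # Rule 1: a variable or constant is its own postfix notation.
    if isinstance(expr, str):
        return expr
    # Rule 2: E1 op E2 becomes E1' E2' op.
    op, left, right = expr
    return postfix(left) + postfix(right) + op

# (9-5)+2 and 9-(5+2) as trees
first = ("+", ("-", "9", "5"), "2")
second = ("-", "9", ("+", "5", "2"))
```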
18
The Role of the Lexical Analyzer
The lexical analyzer is the first phase of a compiler.
The Main Task: to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.
[Figure: source program → lexical analyzer → parser. The parser sends "get next token" to the lexical analyzer, which returns a token; both consult the symbol table.]
19
The Role of the Lexical Analyzer (continued)
The lexical analyzer is the part of the compiler that reads the source text.
The Secondary Tasks:
1. Eliminating the following from the source program:
   a. comments
   b. whitespace
      1. tab
      2. newline characters
Example source text with a comment and irregular whitespace:
   // global variables
   a= ;
   write ( a);
   write (a, a*2);
20
The Role of the Lexical Analyzer (continued)
2. Correlating error messages from the compiler with the source program. The lexical analyzer may keep track of the number of newline characters seen, so that a line number can be associated with an error message.
3. Making a copy of the source program with errors marked (in some compilers).
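The secondary tasks can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: comments start with // and run to the end of the line, lexemes are alphanumeric runs or one of a handful of single characters, and the function name scan and the sample text are invented here.

```python
def scan(source):
    """Return (tokens, errors); tokens are (line_number, lexeme) pairs."""
    tokens, errors = [], []
    line = 1          # count newlines so errors can be tied to a line number
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == "\n":
            line += 1
            i += 1
        elif ch in " \t\r":
            i += 1                                  # eliminate whitespace
        elif source.startswith("//", i):
            while i < len(source) and source[i] != "\n":
                i += 1                              # eliminate the comment
        elif ch.isalnum():
            start = i
            while i < len(source) and source[i].isalnum():
                i += 1
            tokens.append((line, source[start:i]))
        elif ch in "=;(),*":
            tokens.append((line, ch))
            i += 1
        else:
            errors.append(f"line {line}: unexpected character {ch!r}")
            i += 1
    return tokens, errors

sample = "// global variables\na= 1;\nwrite ( a);\nwrite (a, a*2) @;\n"
tokens, errors = scan(sample)
```

The comment on line 1 produces no tokens, and the stray @ is reported against line 4 because three newline characters were counted before it.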
21
The Role of the Lexical Analyzer (continued)
Note: In some compilers, the lexical analyzer is divided into a cascade of two phases:
Scanning: the scanner is responsible for simple tasks.
Lexical Analysis: the lexical analyzer proper is responsible for more complex operations.
Example: a FORTRAN compiler uses a scanner to eliminate blanks from the input.
   Do 5 I = 1,25    tokens: reserved word (R.W.) Do, num 5, id I
   Do 5 I = 1.25    tokens: id Do5I
   Enter a Number ==> 13 2
   The number is 132
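The FORTRAN example can be imitated with a two-step Python sketch: a scanning step that removes blanks, followed by a decision on how to read the result. The regular expressions below are simplifying assumptions that cover only the two statements on the slide, not FORTRAN in general, and the statement is upper-cased because FORTRAN is not case sensitive.

```python
import re

# Hypothetical patterns, sufficient only for the two example statements.
DO_LOOP = re.compile(r"DO(\d+)([A-Z][A-Z0-9]*)=([^,]+),(.+)")
ASSIGNMENT = re.compile(r"([A-Z][A-Z0-9]*)=(.+)")

def classify(statement):
    # Scanning phase: eliminate blanks.
    squeezed = statement.replace(" ", "").upper()
    # Lexical analysis phase: a comma after '=' marks a DO loop.
    m = DO_LOOP.fullmatch(squeezed)
    if m:
        label, variable, start, stop = m.groups()
        return [("R.W.", "DO"), ("num", label), ("id", variable),
                ("num", start), ("num", stop)]
    m = ASSIGNMENT.fullmatch(squeezed)
    if m:
        return [("id", m.group(1)), ("num", m.group(2))]
    raise ValueError(f"cannot classify {statement!r}")
```

Only the comma versus the period after the 1 decides whether Do is a reserved word or the start of the identifier Do5I, so the analyzer has to look well past the first few characters.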
22
Advantages for Separating the Analysis Phase
The advantages of separating the analysis phase of compiling into lexical analysis and parsing:
1. Simpler Design: Separating lexical analysis from syntax analysis often simplifies one or the other of these phases. For example, comments and white space can be removed by the lexical analyzer, so the parser does not have to deal with them.
2. Improved Efficiency: A large amount of time in a compiler is spent reading the source program and partitioning it into tokens. Specialized buffering techniques for reading input characters and processing tokens can significantly speed up the performance of a compiler.
3. Enhanced Portability: Input alphabet peculiarities and other device-specific anomalies can be restricted to the lexical analyzer. The representation of non-standard symbols can be isolated in the lexical analyzer.
23
Symbol Table
* It is a data structure used to store information about various source language constructs.
  - During lexical analysis, the character string or lexeme forming an identifier is saved in a symbol table entry.
* Later phases of the compiler might add to this entry information such as the type of the identifier, its usage (variable or label) and its position in storage (address).
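A symbol table of this kind could be sketched as follows. The class name, method names, and the attribute keys used in the example ("type", "usage") are assumptions for illustration; real compilers use a variety of layouts.

```python
class SymbolTable:
    def __init__(self):
        self._index = {}     # lexeme -> position of its entry
        self.entries = []    # one dict of attributes per identifier

    def insert(self, lexeme):
        # Called by the lexical analyzer: save the lexeme once and
        # return the index of its entry (the "pointer" to the entry).
        if lexeme not in self._index:
            self._index[lexeme] = len(self.entries)
            self.entries.append({"lexeme": lexeme})
        return self._index[lexeme]

    def lookup(self, lexeme):
        # Return the entry index, or None if the lexeme was never inserted.
        return self._index.get(lexeme)

    def set_attribute(self, index, name, value):
        # Called by later phases to add type, usage, address, and so on.
        self.entries[index][name] = value

table = SymbolTable()
entry = table.insert("count")                    # during lexical analysis
table.set_attribute(entry, "type", "integer")    # added by a later phase
table.set_attribute(entry, "usage", "variable")
```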
24
Tokens, Patterns, and Lexemes
Lexeme: a string matched by the pattern of a token.
Token: a set of strings for which the same token is produced as output (for example, id or num).
Pattern: a rule associated with a token that describes the set of strings.
25
Attributes of Tokens
Attributes are used to distinguish different lexemes in a token.
E = M * C ** 2
<id, pointer to symbol-table entry for E>
<assign_op, >
<id, pointer to symbol-table entry for M>
<mult_op, >
<id, pointer to symbol-table entry for C>
<exp_op, >
<num, integer value 2>
Tokens affect syntax analysis; attributes affect semantic analysis.
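The token and attribute pairs for E = M * C ** 2 can be produced by a short tokenizer. The sketch below uses Python's re module; the token names follow the slide, while the patterns, the use of a plain dict as the symbol table, and the integer index standing in for a pointer are assumptions made here.

```python
import re

TOKEN_SPEC = [
    ("exp_op", r"\*\*"),      # listed before mult_op so ** is not read as two *
    ("mult_op", r"\*"),
    ("assign_op", r"="),
    ("num", r"\d+"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("skip", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text, symbol_table):
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "skip":
            continue
        if kind == "id":
            # Attribute: index of the identifier's symbol-table entry.
            if lexeme not in symbol_table:
                symbol_table[lexeme] = len(symbol_table)
            tokens.append((kind, symbol_table[lexeme]))
        elif kind == "num":
            tokens.append((kind, int(lexeme)))   # attribute: integer value
        else:
            tokens.append((kind, None))          # operators carry no attribute
    return tokens

symbols = {}
result = tokenize("E = M * C ** 2", symbols)
```

All three identifiers share the token id and are told apart only by their attribute, which is the point of the slide.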
26
Describing Tokens
* We use regular expressions to describe programming language tokens.
* A regular expression (RE) is defined inductively:
  a      an ordinary character stands for itself
  ε      the empty string
  R|S    either R or S (alternation), where R and S are REs
  RS     R followed by S (concatenation)
  R*     concatenation of R zero or more times
27
Language
A regular expression R describes a set of strings of characters denoted L(R).
L(R) = the language defined by R
L(abc) = { abc }
L(hello|goodbye) = { hello, goodbye }
L(1(0|1)*) = all binary numbers that start with a 1
Each token can be defined using a regular expression.
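Membership in L(R) can be tested with Python's re module, whose syntax for these three examples coincides with the notation above. The helper name in_language is introduced here for illustration; re.fullmatch is used so that the whole string, not just a prefix, must match.

```python
import re

def in_language(regex, s):
    # s is in L(regex) exactly when the entire string matches.
    return re.fullmatch(regex, s) is not None

checks = {
    "abc in L(abc)": in_language("abc", "abc"),
    "ab in L(abc)": in_language("abc", "ab"),
    "goodbye in L(hello|goodbye)": in_language("hello|goodbye", "goodbye"),
    "1011 in L(1(0|1)*)": in_language("1(0|1)*", "1011"),
    "011 in L(1(0|1)*)": in_language("1(0|1)*", "011"),
}
```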
28
Lexical Errors
Few errors are detectable at the lexical level, because the lexical analyzer has a very localized view of a source program. For example, in fi(a==x) … the lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier; since fi is a valid identifier, it returns the token for an identifier.
Error-Recovery Actions:
1. Panic Mode Recovery: we delete successive characters from the remaining input until the lexical analyzer can find a well-formed token.
2. Deleting an extraneous character
3. Inserting a missing character
4. Replacing an incorrect character by a correct character
5. Transposing two adjacent characters (o0O)
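Panic mode recovery can be sketched as a loop that drops one character at a time until a token pattern matches again. The token pattern below is a small assumed one (identifiers, integers, == and a few single-character operators), and the function name is invented for this illustration.

```python
import re

# Assumed token pattern for the sketch: identifier, integer, '==', or one operator.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|==|[=()+\-*/;]")

def tokenize_with_panic_mode(text):
    tokens, skipped = [], []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(text, pos)
        if m:
            tokens.append(m.group())
            pos = m.end()
        else:
            # Panic mode: delete the offending character and try again
            # from the next position.
            skipped.append(text[pos])
            pos += 1
    return tokens, skipped

tokens_found, characters_deleted = tokenize_with_panic_mode("fi(a==x) @# y = 1;")
```

Note that fi is accepted as an ordinary identifier; only @ and #, which cannot start any token, are deleted.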
29
Input Buffering
There are three general approaches to implementing a lexical analyzer:
1. Use a lexical-analyzer generator, such as the Lex compiler, to produce the lexical analyzer from a regular-expression-based specification. In this case, the generator provides routines for reading and buffering the input.
2. Write the lexical analyzer in a conventional systems-programming language, using the I/O facilities of that language to read the input.
3. Write the lexical analyzer in assembly language and explicitly manage the reading of input.
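As a rough illustration of the second approach, the sketch below reads the input through the language's own I/O facilities in fixed-size blocks and hands characters to the analyzer one at a time. Python stands in for the systems-programming language here, and the function name and default buffer size are arbitrary choices.

```python
import io

def buffered_chars(stream, buffer_size=4096):
    # Read one block at a time instead of one character per I/O call,
    # then yield the characters of each block in order.
    while True:
        block = stream.read(buffer_size)
        if not block:          # empty string signals end of input
            return
        yield from block

source = io.StringIO("a = b + 1")
characters = list(buffered_chars(source, buffer_size=4))
```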
30
End