Lexical Analysis
Organization
- What is Lexical Analysis?
- How do you Build a Lexical Analyzer?
- Lexical Specifications
- How do Lexical Analyzer Generators Work?
What is Lexical Analysis?
Token, Lexeme and Pattern
- A token is a pair: a token name and an optional token value.
- A pattern is a description of the form that the lexemes of a token may take.
- A lexeme is a sequence of characters in the source program that matches the pattern for a token.
- For example, in the statement count = 10; the lexeme count matches the identifier pattern and yields the token <id, pointer to the symbol-table entry for count>.
Lexical Analysis
Reading the input source program and producing a sequence of tokens.
Lexical Analysis
- Removes comments, white space, tabs and newlines from the source program.
- In conjunction with the parser, builds the symbol table used in various stages of the compiler.
- Even if preprocessing retains the comments, the lexical analyzer strips them. This can be demonstrated by running the cc1 command directly on an input source file that has no #include statement:
  (a) Take test2.c and run: gcc -### test2.c -o test2. This lists the commands that would be executed.
  (b) Execute the cc1 command from that list.
  (c) Inspect the generated test2.s.
Lexical Analysis
The lexical analyzer reads the character stream of the source code, checks for legal tokens, and passes tokens to the syntax analyzer on demand.
How do you Build a Lexical Analyzer ?
An Approach
- Take the input character by character and check for the various constructs of the language.
- The rules for identifying a keyword, operator, identifier or string literal will vary from language to language.
- A lot of design and coding effort goes into parsing the input, much of which could be common to lexical analyzers for any programming language.
- The complexity of such a lexical analyzer is very high, and adding a new construct to an existing language becomes difficult.
- Developing a lexical analyzer for a new language is cumbersome and involves almost the same effort as the ones previously developed.
Lexical Analyzer Generators
A Lexical Analyzer Generator is a tool that can generate code to perform Lexical Analysis of the input, given the rules for the basic building blocks of the Language. Rules for the basic building blocks of a Language are called its Lexical Specifications.
A Lexical Analyzer
Number of tokens?
int main(x,y)
int x,y;
/* find max of x and y */
{ return (x>y?x:y); }
lex
Illustration of generating a lexical analyzer using 'lex', a lexical analyzer generator.
Lexical Specifications
Introductory Concepts
Regular Expressions
- A pattern that describes a set of strings.
- Regular expressions can express regular languages by defining patterns over finite strings of symbols.
- The grammar defined by regular expressions is known as a regular grammar.
- The language defined by a regular grammar is known as a regular language.
A regular expression can be recursively defined as follows:
- ε is a regular expression that denotes the language containing only the empty string: L(ε) = {ε}.
- If a is a symbol in Σ, then a is a regular expression and L(a) = {a}, that is, a language with one string of length one.
- If r and s are regular expressions denoting languages L(r) and L(s), then r|s, rs and r* are also regular expressions, with L(r|s) = L(r) ∪ L(s), L(rs) = L(r)L(s) and L(r*) = (L(r))*.
Operations
The various operations on languages are:
- Union of two languages L and M: L ∪ M = {s | s is in L or s is in M}.
- Concatenation of two languages L and M: LM = {st | s is in L and t is in M}.
- Kleene closure of a language L: L* = zero or more occurrences of strings from L.
If x is a regular expression, then:
- x* means zero or more occurrences of x, i.e. it can generate { ε, x, xx, xxx, xxxx, … }.
- x+ means one or more occurrences of x, i.e. it can generate { x, xx, xxx, xxxx, … }; equivalently x.x*.
- x? means at most one occurrence of x, i.e. it can generate either {x} or {ε}.
- [ ] denotes alternation over characters: [ab] -> {a, b}.
- ( ) groups a sequence: (abc) -> {abc}.
- [a-z] is all lower-case letters of the English alphabet.
- [A-Z] is all upper-case letters of the English alphabet.
- [0-9] is all decimal digits.
- [-az] -> {-, a, z}
- [a-z] -> {a, b, c, d, …, z}
- [a\-z] -> {a, -, z}
- [^a-d] -> {e, f, g, h, i, j, k, l, m, …, z} (any character except a to d)
- ^[a-z] -> the match must begin with a character from a to z
- a$ -> the match must end with a
- . -> any character except newline
- hello -> {hello}
- gray|grey -> {gray, grey}
- gr(a|e)y -> {gray, grey}
- gr[ae]y -> {gray, grey}
- b[aeiou]bble -> {babble, bebble, bibble, bobble, bubble}
- [b-chm-pP]at|ot -> {bat, cat, hat, mat, nat, oat, pat, Pat, ot}
- colou?r -> {color, colour}
- go*gle -> {ggle, gogle, google, gooogle, …}
- go+gle -> {gogle, google, gooogle, …}
- z{3} -> {zzz}
- z{3,6} -> {zzz, zzzz, zzzzz, zzzzzz}
- ^dog -> begins with dog
- dog$ -> ends with dog
Representing occurrence of symbols using regular expressions
- letter = [a-z] | [A-Z]
- digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9, or [0-9]
- sign = [+-]
Representing language tokens using regular expressions:
- Decimal = (sign)?(digit)+
- Identifier = letter(letter | digit)*
Lexical Specification File
A lexical specification file has three sections:
- Declarations
- Translation rules
- Auxiliary functions
Declarations
- Regular definitions that can be used in the translation rules.
- C code enclosed within %{ and %}: #defines, C prototype declarations of the functions used in the translation rules, and #include statements for the library functions they use.
Translation Rules
Pattern-action pairs:
- The pattern is a regular expression and the action is a C language program segment.
- The action is typically a return statement indicating the type of token that has been matched.
Translation Rules
- Generated global variables can be used in the action statements: yytext contains the lexeme, and yyleng gives the length of the lexeme.
- For tokens that have no significance for the parser (such as white space and newlines), the action statement has no return statement.
Auxiliary Functions
- Definitions of the C functions used in the action statements. The whole section is copied "as is" into lex.yy.c.
- The yylex routine is called repeatedly to get the next token until the end of the input.
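The three sections can be seen together in a minimal lex specification. This is a sketch only: the token codes NUMBER and ID are hypothetical, and in a real compiler they would normally come from a parser-generated header.

```lex
%{
/* Declarations section: C code copied verbatim into lex.yy.c.
   NUMBER and ID are hypothetical token codes for illustration. */
#define NUMBER 256
#define ID     257
%}

digit   [0-9]
letter  [a-zA-Z]

%%
[ \t\n]+                    { /* white space: no return statement */ }
{digit}+                    { return NUMBER; }
{letter}({letter}|{digit})* { return ID; }
%%

/* Auxiliary functions section, copied "as is" into lex.yy.c. */
int yywrap(void) { return 1; }
```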