Lexical Analysis.

Presentation on theme: "Lexical Analysis."— Presentation transcript:

1 Lexical Analysis

2 Organization
What is Lexical Analysis?
How do you Build a Lexical Analyzer?
Lexical Specifications
How do Lexical Analysis Generators Work?

3 What is Lexical Analysis?

4 Token, lexeme and pattern
A token is a pair consisting of a token name and an optional token value.
A pattern is a description of the form that the lexemes of a token may take.
A lexeme is a sequence of characters in the source program that matches the pattern for a token.
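The three terms can be made concrete with a small C sketch. The token names (`TOK_ID`, `TOK_NUMBER`, `TOK_RELOP`), the `Token` struct and the `make_token` helper are all assumptions made up for illustration; the slides do not prescribe a representation.

```c
#include <string.h>

/* Token names; this set is illustrative, not taken from the slides. */
typedef enum { TOK_ID, TOK_NUMBER, TOK_RELOP } TokenName;

/* A token: a name plus an optional value.
   has_value is 0 when the token carries no value. */
typedef struct {
    TokenName name;
    int has_value;
    char value[32];   /* copy of the lexeme when a value is kept */
} Token;

/* Build a token from a lexeme that has already been matched
   against the pattern for the given token name. */
Token make_token(TokenName name, const char *lexeme, int keep_value) {
    Token t;
    t.name = name;
    t.has_value = keep_value;
    t.value[0] = '\0';
    if (keep_value) {
        strncpy(t.value, lexeme, sizeof t.value - 1);
        t.value[sizeof t.value - 1] = '\0';
    }
    return t;
}
```

For the lexeme `count`, `make_token(TOK_ID, "count", 1)` gives a token whose name is `TOK_ID` and whose value is the lexeme itself, while an operator such as `>` could be stored as `TOK_RELOP` with no value.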

5 Lexical Analysis
Reads the input source program and produces a sequence of tokens.

6 Lexical Analysis
Removes comments, white space, tabs and newlines from the source program.
In conjunction with the parser, creates the Symbol Table used in various stages of the compiler.
(1) Even if preprocessing retains the comments, the lexical analyzer strips them. This can be demonstrated by running the cc1 command directly on an input source that does not have a #include statement.
(a) Take test2.c and run the command gcc -### test2.c -o test2. This should list the commands that gcc would execute.
(b) Execute the cc1 command.
(c) Show test2.s.

7 Lexical Analysis
It reads the character stream of the source code, checks for legal tokens, and passes each token to the syntax analyzer on demand.

8 How do you Build a Lexical Analyzer?

9 An Approach
Take the input character by character and check for the various constructs of the language.
The rules defining how to identify a Keyword, Operator, Identifier or String Literal will vary from language to language.
A lot of design and coding effort goes into processing the input, and much of that effort could be common to the lexical analyzers of any programming language.
The complexity of the Lexical Analyzer would be very high, and adding a new construct to an existing language could become difficult.
Developing a Lexical Analyzer for a new language would be cumbersome and would involve almost the same effort as any of those previously developed.

10 Lexical Analyzer Generators
A Lexical Analyzer Generator is a tool that can generate code to perform Lexical Analysis of the input, given the rules for the basic building blocks of the Language. Rules for the basic building blocks of a Language are called its Lexical Specifications.

11 A Lexical Analyzer

12 Number of tokens?
int main(x,y)
int x,y;
/* find max of x and y */
{
    return (x>y?x:y);
}

13 lex
Illustration of the generation of a Lexical Analyzer using 'lex', a Lexical Analyzer Generator

14 Lexical Specifications

15 Introductory Concepts
Regular Expressions
A regular expression is a pattern that describes a set of strings.
Regular expressions describe languages, which may contain infinitely many strings, by defining a pattern over finite strings of symbols.
The grammar defined by regular expressions is known as a regular grammar. The language defined by a regular grammar is known as a regular language.

16 A Regular Expression can be recursively defined as follows:
ε is a Regular Expression denoting the language that contains only the empty string: L(ε) = {ε}.
If a is a symbol in Σ, then a is a Regular Expression and L(a) = {a}, that is, a language with one string, of length one.
Larger Regular Expressions are built from these two base cases using union, concatenation and Kleene closure.

17 Operations
The various operations on languages are:
The Union of two languages L and M is written L ∪ M = {s | s is in L or s is in M}.
The Concatenation of two languages L and M is written LM = {st | s is in L and t is in M}.
The Kleene Closure of a language L is written L* and denotes zero or more concatenations of L.
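A small worked example may help. The two languages L and M below are chosen purely for illustration and do not come from the slides.

```latex
\begin{align*}
L &= \{a, b\}, \qquad M = \{0, 1\} \\
L \cup M &= \{a, b, 0, 1\} \\
LM &= \{a0, a1, b0, b1\} \\
L^{*} &= \bigcup_{i \ge 0} L^{i}
       = \{\varepsilon, a, b, aa, ab, ba, bb, aaa, \dots\}
\end{align*}
```

Note that L* is infinite even though L is finite, and that it always contains the empty string ε (the case of zero concatenations).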

18 If x is a regular expression, then:
x* -> zero or more occurrences of x, i.e., it can generate {ε, x, xx, xxx, xxxx, …}
x+ -> one or more occurrences of x, i.e., it can generate {x, xx, xxx, xxxx, …}; equivalent to xx*
x? -> at most one occurrence of x, i.e., it can generate either {x} or {ε}
[ ] -> character class: matches any one of the listed characters, e.g. [ab] -> {a, b}
( ) -> grouping: treats the enclosed expression as a unit, e.g. (abc) -> {abc}; alternatives inside a group are separated by |, e.g. (a|b) -> {a, b}
[a-z] is all lower-case letters of the English alphabet.
[A-Z] is all upper-case letters of the English alphabet.
[0-9] is all decimal digits.

19
[-az] -> {-, a, z}
[a-z] -> {a, b, c, d, …, z}
[a\-z] -> {a, -, z}
[^a-d] -> any character other than a, b, c, d; among the lower-case letters, {e, f, g, h, i, j, k, l, m, …, z}
^[a-z] -> must begin with a letter from a to z
a$ -> must end with a
. -> any character except newline

20
hello -> {hello}
gray|grey -> {gray, grey}
gr(a|e)y -> {gray, grey}
gr[ae]y -> {gray, grey}
b[aeiou]bble -> {babble, bebble, bibble, bobble, bubble}
[b-chm-pP]at|ot -> {bat, cat, hat, mat, nat, oat, pat, Pat, ot}

21
colou?r -> {color, colour}
go*gle -> {ggle, gogle, google, gooogle, …}
go+gle -> {gogle, google, gooogle, …}
z{3} -> {zzz}
z{3,6} -> {zzz, zzzz, zzzzz, zzzzzz}
^dog -> begins with dog
dog$ -> ends with dog
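The examples above can be checked mechanically. The sketch below uses the POSIX `<regex.h>` interface (`regcomp`, `regexec`, `regfree`) with extended syntax, which is an assumption about the platform: it is not part of ISO C, and lex's own pattern syntax differs from POSIX in some details. The helper name `full_match` is hypothetical; it wraps the pattern in `^(` and `)$` so that the whole string must match.

```c
#include <regex.h>
#include <stdio.h>

/* Return 1 if the whole of text matches the POSIX extended regular
   expression pattern, 0 if it does not, and -1 if the pattern is
   too long or fails to compile. */
int full_match(const char *pattern, const char *text) {
    char anchored[256];
    regex_t re;
    int rc;
    int n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored) return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0) return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, `full_match("go*gle", "ggle")` succeeds while `full_match("go+gle", "ggle")` does not, which is exactly the difference between * and +. Because the helper adds its own anchors, it is meant for the unanchored patterns above rather than for ^dog or dog$.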

22 Representing occurrence of symbols using regular expressions
letter = [a-z] | [A-Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9, or [0-9]
sign = [+-]
Representing language tokens using regular expressions:
Decimal = (sign)?(digit)+
Identifier = letter(letter | digit)*
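The two token definitions translate directly into hand-coded recognizers. This is only a sketch: the function names `is_identifier` and `is_decimal` are hypothetical, and `isalpha`/`isdigit` are assumed to run in the default "C" locale, where they correspond to the letter and digit classes above.

```c
#include <ctype.h>

/* Identifier = letter (letter | digit)* */
int is_identifier(const char *s) {
    if (!isalpha((unsigned char)*s)) return 0;    /* must start with a letter */
    for (s++; *s; s++)
        if (!isalnum((unsigned char)*s)) return 0;
    return 1;
}

/* Decimal = (sign)? (digit)+ */
int is_decimal(const char *s) {
    if (*s == '+' || *s == '-') s++;              /* optional sign */
    if (!isdigit((unsigned char)*s)) return 0;    /* at least one digit */
    while (isdigit((unsigned char)*s)) s++;
    return *s == '\0';                            /* nothing may follow */
}
```

So `x1` is an Identifier but `1x` is not, and `-42` is a Decimal but a lone `+` is not, because (digit)+ requires at least one digit.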

23 Lexical Specification File
A lex specification file has three sections, separated by lines containing only %%:
Declarations
Translation rules
Auxiliary functions

24 Declarations
Regular definitions that can be used in the Translation Rules.
Enclosed within %{ and %}: #defines, C prototype declarations of the functions used in the Translation Rules, and #include statements for the library functions used in the Translation Rules.

25 Translation Rules
Pattern-Action pairs, where the Pattern is a Regular Expression and the Action is a C program segment.
The Action is typically a return statement indicating the type of token that has been matched.

26 Translation Rules
Generated global variables can be used in the Action statements: yytext contains the Lexeme and yyleng gives the length of the Lexeme.
For tokens that have no significance for the Parser (White Space, New Line, etc.), the Action has no return statement.

27 Auxiliary Functions
Definitions of the C functions used in the Action statements. The whole section is copied "as is" into lex.yy.c.
The yylex routine is called repeatedly to get the next token until the end of the input.
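The calling convention can be sketched in plain C without running lex. Everything below is a hand-written stand-in, not lex output: `demo_lex` plays the role of yylex, `demo_text` and `demo_leng` play the roles of yytext and yyleng, and the token codes are hypothetical. The sketch assumes the usual convention that the scanner returns 0 at the end of the input.

```c
#include <ctype.h>
#include <stddef.h>

enum { END = 0, WORD = 1, NUMBER = 2 };   /* hypothetical token codes */

static const char *input;   /* remaining input */
const char *demo_text;      /* start of the current lexeme (role of yytext) */
size_t demo_leng;           /* length of the current lexeme (role of yyleng) */

void demo_set_input(const char *s) { input = s; }

/* Return the next token code, or END at the end of the input. */
int demo_lex(void) {
    for (;;) {
        if (*input == '\0') return END;
        demo_text = input;
        if (isdigit((unsigned char)*input)) {
            while (isdigit((unsigned char)*input)) input++;
            demo_leng = (size_t)(input - demo_text);
            return NUMBER;             /* action with a return statement */
        } else if (isalpha((unsigned char)*input)) {
            while (isalpha((unsigned char)*input)) input++;
            demo_leng = (size_t)(input - demo_text);
            return WORD;
        } else {
            input++;   /* white space and anything else: no return, keep scanning */
        }
    }
}

/* The caller's side: keep asking for tokens until END. */
int count_numbers(const char *src) {
    int tok, n = 0;
    demo_set_input(src);
    while ((tok = demo_lex()) != END)
        if (tok == NUMBER) n++;
    return n;
}
```

The `while ((tok = demo_lex()) != END)` loop is the shape a parser or test driver would use with a real yylex; white space never reaches the caller because its branch has no return.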
