Lexical Analysis

Consider the program:

    #include <stdio.h>
    main()
    {
        double value = 0.95;
        printf("value = %f\n", value);
    }

How is this translated into meaningful machine instructions?

First, each separate entity must be recognised: e.g. the 5th line is processed as a sequence of individual tokens.

This process is known as lexical analysis.

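To make "recognising each separate entity" concrete, here is a minimal hand-written sketch in C that splits one line of source text into tokens. It is only an illustration under simplifying assumptions: the token classes and the names `next_token`, `put` and `enum kind` are hypothetical, numbers are taken to be runs of digits and dots, and keywords and comments are not handled. It is not the code that Lex generates.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Token classes used by this sketch only (hypothetical names). */
enum kind { T_END, T_IDENT, T_NUMBER, T_STRING, T_PUNCT };

/* Append c to buf if there is room, always leaving space for '\0'. */
static void put(char *buf, size_t size, size_t *n, char c)
{
    if (*n + 1 < size)
        buf[(*n)++] = c;
}

/* Copy the next token found at *src into buf, advance *src past it,
   and return its class. Over-long tokens are truncated in buf. */
enum kind next_token(const char **src, char *buf, size_t size)
{
    const char *p = *src;
    size_t n = 0;
    enum kind k;

    assert(size > 0);
    while (isspace((unsigned char)*p))          /* skip whitespace */
        p++;

    if (*p == '\0') {
        k = T_END;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        k = T_IDENT;                            /* letters, digits, '_' */
        while (isalnum((unsigned char)*p) || *p == '_')
            put(buf, size, &n, *p++);
    } else if (isdigit((unsigned char)*p)) {
        k = T_NUMBER;                           /* digits and dots only */
        while (isdigit((unsigned char)*p) || *p == '.')
            put(buf, size, &n, *p++);
    } else if (*p == '"') {
        k = T_STRING;
        put(buf, size, &n, *p++);               /* opening quote */
        while (*p != '\0' && *p != '"') {
            if (*p == '\\' && p[1] != '\0')     /* keep escape pairs intact */
                put(buf, size, &n, *p++);
            put(buf, size, &n, *p++);
        }
        if (*p == '"')
            put(buf, size, &n, *p++);           /* closing quote */
    } else {
        k = T_PUNCT;                            /* any other single char */
        put(buf, size, &n, *p++);
    }

    buf[n] = '\0';
    *src = p;
    return k;
}
```

Calling `next_token` repeatedly on the text `double value = 0.95;` yields the identifier `double`, the identifier `value`, the punctuation `=`, the number `0.95`, the punctuation `;`, and finally `T_END`.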
Application: Lex

Lex is a program generator:

    series of regular expressions  ->  lex  ->  a lexical analyser

Lex input file:

    ... definitions ...
    %%
    ... regular expression/action pairs ...
    %%
    ... user-defined functions ...

Lex Regular Expressions

meta-characters (do not match themselves):

    ( ) [ ] { } + / , ^ * | . \ " $ ? - %

Let c be a character, x, y regular expressions, s a string, m, n integers and i an identifier.

regular expressions:

    c         any character except meta-characters
    [...]     the list of chars enclosed (may be a range)
    [^...]    the list of chars not enclosed
    .         any ASCII char except newline
    xy        concatenation of x and y
    x/y       x, only if followed by y (y not read)
    x{m,n}    m to n occurrences of x
    ^x        x, only at beginning of line
    x$        x, only at end of line
    "s"       exactly what is in the quotes (except for "\" and following character)
    x*        zero or more occurrences of x
    x+        one or more occurrences of x
    x?        an optional x (zero or one occurrence)
    x|y       x or y
    {i}       definition of i

Lex Regular Expressions (cont.)

Meta-characters are obtained by preceding them with "\".

Regular expressions are terminated by a space or tab.

Backslash, tab and newline are represented by \\, \t, \n.

Definitions

If "identifier string" appears in the definitions section, string replaces identifier wherever {identifier} is used.

    L       [a-zA-Z]
    %%
    {L}+    ;

is the same as:

    %%
    [a-zA-Z]+    ;

Anything enclosed between %{ ... %} in this section will be copied straight into lex.yy.c. Include and define statements, all variables, all function definitions, and any comments should be placed here. E.g.

    %{
    #include <stdio.h>
    /* an example program */
    %}

Actions

An action is a C-language statement followed by ;

Example:

    [0-9]+       printf("Integer\n");
    [a-zA-Z]+    printf("String\n");

will output "Integer" after receiving a digit string, and "String" after receiving a character string.

Input:

    12+19=sum;

will result in:

    Integer
    +Integer
    =String
    ;

Note: the text matched by a regular expression is held in the string yytext. Its length is held in the integer yyleng.

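The behaviour of the two rules above can be imitated by hand, which may help show what the generated analyser does with each piece of input. The following C sketch is an assumption-laden imitation, not Lex output: `scan` and `append` are hypothetical names, the result is collected in a buffer instead of being printed, and `len` plays the role that yyleng plays in a real Lex action. Characters that match neither rule are copied through unchanged, as Lex does by default.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Append len bytes of s to out (capacity size), keeping it terminated.
   Returns the new number of bytes used. */
static size_t append(char *out, size_t size, size_t used,
                     const char *s, size_t len)
{
    while (len-- > 0 && used + 1 < size)
        out[used++] = *s++;
    out[used] = '\0';
    return used;
}

/* Hand-written imitation of the rules
       [0-9]+       printf("Integer\n");
       [a-zA-Z]+    printf("String\n");
   writing into out what the Lex program would print. */
void scan(const char *in, char *out, size_t size)
{
    size_t used = 0;

    assert(size > 0);
    out[0] = '\0';
    while (*in != '\0') {
        size_t len = 1;                     /* length of the match */
        if (isdigit((unsigned char)*in)) {
            while (isdigit((unsigned char)in[len]))
                len++;                      /* longest run of digits */
            used = append(out, size, used, "Integer\n", strlen("Integer\n"));
        } else if (isalpha((unsigned char)*in)) {
            while (isalpha((unsigned char)in[len]))
                len++;                      /* longest run of letters */
            used = append(out, size, used, "String\n", strlen("String\n"));
        } else {
            used = append(out, size, used, in, 1);  /* no rule: echo */
        }
        in += len;
    }
}
```

With the input `12+19=sum;` the buffer ends up holding `Integer`, `+Integer`, `=String` and `;` on successive lines, matching the result shown above.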
Running Lex

To run a lex program "example.l", type

    lex example.l
    cc lex.yy.c -ll
    a.out

"-ll" links the lex library. This library contains a "main" program, which calls yylex(). You can override this by defining your own "main".

Example Lex Program

    %{
    /* simple word recognition */
    %}
    L       [a-zA-Z]
    %%
    [ \t]+  ;       /* ignore whitespace */
    is|are  printf("verb: %s; ", yytext);
    a|the   printf("determiner: %s; ", yytext);
    dog |
    cat |
    male |
    female  printf("noun: %s; ", yytext);
    {L}+    printf("unknown: %s; ", yytext);
    .|\n    ECHO;
    %%
    main()
    {
        yylex();
    }

Example Session

    % word
    the dog is a male
    determiner: the; noun: dog; verb: is; determiner: a; noun: male;
    female cat dog is
    noun: female; noun: cat; noun: dog; verb: is;
    catdog is male
    unknown: catdog; verb: is; noun: male;
    ^D
    %

Practical 1: Lexical Analysis

Aim: To write a lexical analyser in C using Lex, for the language L, defined below.

    identifiers:       sequence of one or more letters; must be declared before use, as int or real
    integers:          optional sign, one or more digits
    reals:             optional sign, one or more digits, decimal point, one or more digits
    expressions:       bracketed expressions using +, -, *, / and :=
    comments:          start with !, run to end of line
    print statements:  either printi or printr, for printing integers and reals; one argument

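The definitions of identifiers, integers and reals above can be read as simple character tests. The C sketch below checks a complete string against those three definitions; it is only meant to make the definitions concrete and is no substitute for the Lex regular expressions the practical asks for. The names `classify`, `digits` and `enum lexeme` are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Classes recognised by this sketch (hypothetical names). */
enum lexeme { LX_OTHER, LX_IDENT, LX_INT, LX_REAL };

/* Skip a run of digits at *p and return how many were skipped. */
static int digits(const char **p)
{
    int n = 0;
    while (isdigit((unsigned char)**p)) {
        (*p)++;
        n++;
    }
    return n;
}

/* Decide whether the whole string s is an identifier, integer or real
   according to the definitions of language L. */
enum lexeme classify(const char *s)
{
    const char *p = s;

    assert(s != NULL);
    if (isalpha((unsigned char)*p)) {       /* one or more letters */
        while (isalpha((unsigned char)*p))
            p++;
        return *p == '\0' ? LX_IDENT : LX_OTHER;
    }
    if (*p == '+' || *p == '-')             /* optional sign */
        p++;
    if (digits(&p) == 0)                    /* one or more digits */
        return LX_OTHER;
    if (*p == '\0')
        return LX_INT;
    if (*p != '.')                          /* decimal point */
        return LX_OTHER;
    p++;
    if (digits(&p) == 0)                    /* one or more digits */
        return LX_OTHER;
    return *p == '\0' ? LX_REAL : LX_OTHER;
}
```

Under these rules `baboon` is an identifier, `-7` an integer and `+3.25` a real, while `3.` and `x1` fall outside all three classes.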
Example L Program

    ! example L program
    real a;
    real baboon;
    int x y;
    ! end of declarations
    x := 300;
    printi(x);
    y := 7 - x;
    a := / 3 * 5 - 5;
    baboon := a * y;
    printi(5);
    printr(baboon);

Required Structure

Output should be in the form of pairs.

Every element of the program should be classified. Thus, output for the 9th line should be:

Numbers should be converted from strings to the appropriate form.

The input must be described by regular expressions. You must use Lex.

A "tokens.h" file will be supplied, defining all the different tokens to be used. You should output the token names and not the associated numbers.

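One way to convert the text of a number to "the appropriate form" is with the standard C functions strtol and strtod. The sketch below assumes the text has already been matched by the analyser (in a Lex action it would be yytext); the helper names `text_to_int` and `text_to_real` are hypothetical. Note that strtod accepts more forms than the reals of L (exponents, for example), so it relies on the regular expression having filtered the input first.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Convert the text of an integer token to a long.
   Returns 1 on success, 0 if the text is not a valid in-range integer. */
int text_to_int(const char *text, long *value)
{
    char *end;

    assert(text != NULL && value != NULL);
    errno = 0;
    *value = strtol(text, &end, 10);        /* handles an optional sign */
    return end != text && *end == '\0' && errno != ERANGE;
}

/* Convert the text of a real token to a double.
   Returns 1 on success, 0 if the text could not be converted fully. */
int text_to_real(const char *text, double *value)
{
    char *end;

    assert(text != NULL && value != NULL);
    errno = 0;
    *value = strtod(text, &end);            /* handles an optional sign */
    return end != text && *end == '\0' && errno != ERANGE;
}
```

For example, `text_to_int("300", &n)` stores 300 in `n`, and `text_to_real("+3.5", &d)` stores 3.5 in `d`; both return 1.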