Lexical Analysis Mooly Sagiv html://www.cs.tau.ac.il/~msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Lexical Analysis Mooly Sagiv html://www.cs.tau.ac.il/~msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2

A motivating example Create a program that counts the number of lines in a given input text file

Solution int num_lines = 0; % \n ++num_lines;. ; % main() { yylex(); printf( "# of lines = %d\n", num_lines); }

Solution int num_lines = 0; % \n ++num_lines;. ; % main() { yylex(); printf( "# of lines = %d\n", num_lines); } initial ; newline \n other

Subjects Roles of lexical analysis The straightforward solution a manual scanner for C Regular Expressions Finite automata From regular languages into finite automata Flex

Basic Compiler Phases Source program (string) Fin. Assembly lexical analysis syntax analysis semantic analysis Tokens Abstract syntax tree Front-End Back-End

Example Tokens TypeExamples IDfoo n_14 last NUM73 00 517 082 REAL66.1.5 10. 1e67 5.5e-10 IFif COMMA, NOTEQ!= LPAREN( RPAREN)

Example NonTokens TypeExamples comment/* ignored */ preprocessor directive#include #define NUMS 5, 6 macroNUMS whitespace\t \n \b

Example void match0(char *s) /* find a zero */ { if (!strncmp(s, “0.0”, 3)) return 0. ; } VOID ID(match0) LPAREN CHAR DEREF ID(s) RPAREN LBRACE IF LPAREN NOT ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3) RPAREN RPAREN RETURN REAL(0.0) SEMI RBRACE EOF

input –program text (file) output –sequence of tokens Read input file Identify language keywords and standard identifiers Handle include files and macros Count line numbers Remove whitespaces Report illegal symbols Produce symbol table Lexical Analysis (Scanning)

Simplifies the syntax analysis –And language definition Modularity Reusability Efficiency Why Lexical Analysis

Writing Scanners in C…

A simplified scanner for C Token nextToken() { char c ; loop: c = getchar(); switch (c){ case ` `:goto loop ; case `;`: return SemiColumn; case `+`: c = getchar() ; switch (c) { case `+': return PlusPlus ; case '=’ return PlusEqual; default: putchar(c); return Plus; } case `<`: case `w`: }

Automatic Generation of Lexical Analysis Every Token can be defined using a regular expression Examples: –A regular expression for C identifier –A regular expression for strings A minimal finite automaton for Token recognition can be automatically generated

Flex Input – regular expressions and actions (C code) Output – A scanner program that reads the input and applies actions when input regular expression is matched flex regular expressions input program tokens scanner

Regular Expression Notations aAn ordinary character stands for itself M|NM or N MNM followed by N M*Zero or more times of M M+One or more times of M M?Zero or one occurrence of M [a-zA-Z]Character set alternation (single character).Any (single) character but newline “a.+”Quotation \Convert an operator into text

Ambiguity Resolving Find the longest matching token Between two tokens with the same length use the one declared first

A Flex specification of C Scanner Letter [a-zA-Z_] Digit [0-9] % [ \t]{;} [\n]{line_count++;} “;”{ return SemiColumn;} “++”{ return PlusPlus ;} “+=“{ return PlusEqual ;} “+”{ return Plus} “while”{ return While ; } {Letter}({Letter}|{Digit})*{ return Id ;} “<=”{ return LessOrEqual;} “<”{ return LessThen ;}

How to Implement Ambiguity Resolving Between two tokens with the same length use the one declared first Find the longest matching token

Running Example if{ return IF; } [a-z][a-z0-9]*{ return ID; } [0-9]+ { return NUM; } [0-9]”.”[0-9]*|[0-9]*”.”[0-9]+{ return REAL; } (\-\-[a-z]*\n)|(“ “|\n|\t){ ; }.{ error(); }

int edges[][256] ={ /* …, 0, 1, 2, 3,..., -, e, f, g, h, i, j,... */ /* state 0 */ {0,..., 0, 0, …, 0, 0, 0, 0, 0,..., 0, 0, 0, 0, 0, 0} /* state 1 */ {13,..., 7, 7, 7, 7, …, 9, 4, 4, 4, 4, 2, 4,..., 13, 13} /* state 2 */ {0, …, 4, 4, 4, 4,..., 0, 4, 3, 4, 4, 4, 4,..., 0, 0} /* state 3 */ {0, …, 4, 4, 4, 4, …, 0, 4, 4, 4, 4, 4, 4,, 0, 0} /* state 4 */{0, …, 4, 4, 4, 4,..., 0, 4, 4, 4, 4, 4, 4,..., 0, 0} /* state 5 */{0, …, 6, 6, 6, 6, …, 0, 0, 0, 0, 0, 0, 0, …, 0, 0} /* state 6 */ {0, …, 6, 6, 6, 6, …, 0, 0, 0, 0, 0, 0, 0,..., 0, 0} /* state 7 */... /* state 13 */{0, …, 0, 0, 0, 0, …, 0, 0, 0, 0, 0, 0, 0, …, 0, 0}

Pseudo Code for Scanner Token nextToken() { lastFinal = 0; currentState = 1 ; inputPositionAtLastFinal = input; currentPosition = input; while (not(isDead(currentState))) { nextState = edges[currentState][*currentPosition]; if (isFinal(nextState)) { lastFinal = nextState ; inputPositionAtLastFinal = currentPosition; } currentState = nextState; advance currentPosition; } input = inputPositionAtLastFinal ; return action[lastFinal]; }

Example Input: “if --not-a-com”

finalstateinput 01if --not-a-com 22 33 30 return IF

finalstateinput 01 --not-a-com 12 --not-a-com 120 --not-a-com found whitespace

finalstateinput 01--not-a-com 99 910--not-a-com 910--not-a-com 910--not-a-com 90 error

finalstateinput 01-not-a-com 99 90 error

Efficient Scanners Efficient state representation Input buffering Using switch and gotos instead of tables

Constructing Automaton from Specification Create a non-deterministic automaton (NDFA) from every regular expression Merge all the automata using epsilon moves (like the | construction) Construct a deterministic finite automaton (DFA) –State priority Minimize the automaton starting with separate accepting states

NDFA Construction if{ return IF; } [a-z][a-z0-9]*{ return ID; } [0-9]+ { return NUM; } [0-9]”.”[0-9]*|[0-9]*”.”[0-9]+{ return REAL; } (\-\-[a-z]*\n)|(“ “|\n|\t){ ; }.{ error(); }

DFA Construction

Minimization

%{ /* C declarations */ #include “tokens.h” /* Mapping of tokens into integers */ #include “errormsg.h” /* Shared by all the phases */ union {int ival; string sval; double fval;} yylval; int charPos=1 ; #define ADJ (EM_tokPos=charPos, charPos+=yyleng) %} /* Lex Definitions */ digits [0-9]+ % if { ADJ; return IF;} [a-z][a-z0-9] { ADJ; yylval.sval=String(yytext); return ID; } {digits} {ADJ; yylval.ival=atoi(yytext); return NUM; } ({digits}\.{digits}?)|({digits}?\.{digits}) { ADJ; yylval.fval=atof(yytext); return REAL; } (\-\-[a-z]*\n)|([\n\t]|" ")* { ADJ; }. { ADJ; EM_error(“illegal character''); }

Start States Regular expressions may be more complicated than automata –C comments Solutions –Conversion of automata into regular expressions –Start States % start s1 s2 % r1 { action0 ; BEGIN s1; } r1 { action1 ; BEGIN s2; } r2 { action2 ; BEGIN INITIAL};

Realistic Example % start Comment % ”/*'' { BEGIN Comment; } r1 { Usual actions; } r2 { Usual actions; }... rk { Usual actions; } ”*/”’ { BEGIN Initial; }.|\n ;

Summary For most programming languages lexical analyzers can be easily constructed Exceptions: –Fortran –PL/1 Flex is a useful tool beyond compilers

Lexical Analysis Mooly Sagiv html://www.cs.tau.ac.il/~msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Similar presentations

Presentation on theme: "Lexical Analysis Mooly Sagiv html://www.cs.tau.ac.il/~msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Lexical Analysis Mooly Sagiv html://www.cs.tau.ac.il/~msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Similar presentations

Presentation on theme: "Lexical Analysis Mooly Sagiv html://www.cs.tau.ac.il/~msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2."— Presentation transcript:

Similar presentations

About project

Feedback