Published by Basil Holt. Modified over 9 years ago.
1
Lexical Analysis with lex(1) and flex(1) © 2014 Clinton Jeffery
2
Reading
Read Sections 3-10 of Lexical Analysis with Flex
Check out the class lecture notes
Ask questions from either source
3
Traits of Scanners
Function: convert from chars to tokens
Identify and categorize kinds of tokens
Detect boundaries between tokens
Discard comments and whitespace
Remember line/col #’s for error reporting
Report lexical errors
Run as fast as possible
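Two of the duties listed above, remembering line/column and discarding whitespace and comments, can be sketched by hand in C. This is only an illustration of the bookkeeping involved, not what lex generates; the struct cursor type and the names advance() and skip_blank() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cursor over a NUL-terminated source string. */
struct cursor {
    const char *p;   /* next unread character */
    int line;        /* 1-based line number   */
    int col;         /* 1-based column number */
};

/* Consume one character, keeping line and column up to date. */
void advance(struct cursor *c)
{
    assert(c != NULL && *c->p != '\0');
    if (*c->p == '\n') {
        c->line++;
        c->col = 1;
    } else {
        c->col++;
    }
    c->p++;
}

/* Discard blanks, tabs, newlines and C block comments.
   Returns 0 on success, -1 on an unterminated comment (a lexical error). */
int skip_blank(struct cursor *c)
{
    assert(c != NULL);
    for (;;) {
        if (*c->p == ' ' || *c->p == '\t' || *c->p == '\n') {
            advance(c);
        } else if (c->p[0] == '/' && c->p[1] == '*') {
            advance(c);
            advance(c);
            while (*c->p != '\0' && !(c->p[0] == '*' && c->p[1] == '/'))
                advance(c);
            if (*c->p == '\0')
                return -1;
            advance(c);
            advance(c);
        } else {
            return 0;
        }
    }
}
```

After skip_blank() returns 0, the cursor's line and col give the position to record for the token that starts at c->p.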
4
Regular Expressions
ε is a r.e.
Any char in the alphabet is a r.e.
If r and s are r.e.’s then r | s is a r.e.
If r and s are r.e.’s then r s is a r.e.
If r is a r.e. then r* is a r.e.
If r is a r.e. then (r) is a r.e.
5
Common extensions to regular expression notation
r+ is equivalent to rr*
r? is equivalent to r|ε
[abc] is equivalent to a|b|c
[a-z] is equivalent to a|b|…|z
[^abc] is equivalent to any char except a, b, or c
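The equivalences above can be spot-checked with the POSIX regex library, whose extended syntax shares these operators with lex. The sketch below is an assumption-laden illustration: full_match() and agree() are hypothetical helpers, and agreement on a handful of sample strings is evidence, not a proof of equivalence. Since POSIX syntax has no portable way to write ε, r? is compared against a|ab for r = b.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* 1 if all of s matches the POSIX extended regex, 0 if not,
   -1 if the pattern is too long or does not compile. */
int full_match(const char *pattern, const char *s)
{
    char anchored[256];
    regex_t re;
    int n, hit;

    assert(pattern != NULL && s != NULL);
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    hit = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return hit;
}

/* 1 if both patterns accept and reject exactly the same samples. */
int agree(const char *p1, const char *p2,
          const char *const *samples, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        int a = full_match(p1, samples[i]);
        int b = full_match(p2, samples[i]);
        if (a < 0 || b < 0 || a != b)
            return 0;
    }
    return 1;
}
```

For example, agree("ab+", "abb*", samples, count) should hold for any sample set, while agree("ab+", "ab*", samples, count) fails as soon as the samples include "a".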
6
Lex’s extended regular expressions
\c      escapes for most operators
"s"     match C string as-is (superescape)
r{m,n}  match r between m and n times
r/s     match r when s follows
^r      match r when at beginning of line
r$      match r when at end of line
7
Lexical Attributes
A lexical attribute is a piece of information about a token
Compiler writer can define as needed
Typically:
– Category      integer code, used in parsing
– Lexeme        actual string as appears in source
– Line, column  location in source code
– Value         for literals, the binary they represent
8
Meanings of the word “token”
A single word from the source code
An integer code that categorizes a word
A set of lexical attributes that are computed from a single word of input
An instance of a class (given by category)
9
Lex public interface
FILE *yyin;     /* set before calling yylex() */
int yylex();    /* call once per token */
char yytext[];  /* chars matched by yylex(); flex declares char *yytext by default */
int yywrap();   /* end-of-file handler */
10
.l file format
header
%%
body
%%
helper functions
11
Lex header
C code inside %{ … %}
– prototypes for helper functions
– #include’s that #define integer token categories
Macro definitions, e.g.
letter  [a-zA-Z]
digit   [0-9]
ident   {letter}({letter}|{digit})*
Warning: macros are fraught with peril
12
Lex body
Regular expressions with semantic actions
" "      { /* discard */ }
{ident}  { return IDENT; }
"*"      { return ASTERISK; }
"."      { return PERIOD; }
Match the longest r.e. possible
Break ties with whichever appears first
If it fails to match: copy unmatched to stdout
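The two disambiguation rules above (longest match, then first rule) can be sketched in C by trying every rule at the current position. This is a simplified model, not how lex works internally (a generated scanner uses a single automaton rather than trying rules one by one). The category codes, the matcher functions, and the extra "if" keyword rule are assumptions added to make the tie-break visible.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical integer category codes. */
enum { NONE = 0, IF = 257, IDENT, ASTERISK, PERIOD };

/* A matcher returns how many leading chars of s it matches (0 = none). */
typedef size_t (*matcher)(const char *s);

size_t match_if(const char *s)   { return strncmp(s, "if", 2) == 0 ? 2 : 0; }
size_t match_star(const char *s) { return s[0] == '*' ? 1 : 0; }
size_t match_dot(const char *s)  { return s[0] == '.' ? 1 : 0; }

size_t match_ident(const char *s)
{
    size_t n = 0;

    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[n]))
        n++;
    return n;
}

struct rule { matcher match; int category; };

/* Rule order matters: "if" is listed before the identifier rule. */
const struct rule rules[] = {
    { match_if,    IF       },
    { match_ident, IDENT    },
    { match_star,  ASTERISK },
    { match_dot,   PERIOD   },
};

/* Choose the rule with the longest match at s; ties go to the earlier
   rule. Stores the match length in *len. Returns NONE with *len == 0
   when no rule matches. */
int longest_match(const char *s, size_t *len)
{
    int best = NONE;
    size_t i;

    assert(s != NULL && len != NULL);
    *len = 0;
    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i].match(s);
        if (n > *len) {   /* strictly longer only: later ties never win */
            *len = n;
            best = rules[i].category;
        }
    }
    return best;
}
```

On the input "if" both the keyword and identifier matchers report length 2, so the earlier rule wins and the result is IF; on "iffy" the identifier matcher reports 4 and wins on length. A caller that gets NONE would echo one character and move on, mirroring the default behaviour described above.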
13
Lex helper functions
Follow rules of ordinary C code
Compute lexical attributes
Do stuff the regular expressions can’t do
Write a yywrap() to switch files on EOF
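A minimal sketch of a yywrap() that switches files on EOF might look like the following. It relies on the documented convention that returning 0 means yyin now points at more input and returning nonzero means scanning is finished. The pending_files list is a hypothetical name, and the definition of yyin here is only a stand-in so the fragment compiles by itself; in a real build the generated scanner provides yyin.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in so the sketch compiles alone. In a real flex build the
   generated scanner defines yyin, and this line must be removed. */
FILE *yyin;

/* Hypothetical: NULL-terminated list of input files still to be scanned,
   for example copied from argv by main(). */
const char *const *pending_files;

/* Called by yylex() at end of file. Returns 0 after pointing yyin at the
   next readable file, or 1 when there is nothing left to scan. */
int yywrap(void)
{
    if (yyin != NULL && yyin != stdin)
        fclose(yyin);
    yyin = NULL;

    while (pending_files != NULL && *pending_files != NULL) {
        const char *path = *pending_files++;
        assert(path[0] != '\0');   /* an empty path is a caller bug */
        yyin = fopen(path, "r");
        if (yyin != NULL)
            return 0;              /* more input: scanning continues */
        perror(path);              /* skip files that cannot be opened */
    }
    return 1;                      /* no more input: yylex() returns 0 */
}
```

A helper like this would also be the natural place to reset a line counter and remember the new filename for error messages.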
14
struct token – typical compiler
struct token {
   int category;
   char *text;
   int linenumber;
   int column;
   char *filename;
   union literal value;
};
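As a sketch of how a helper function might fill in this struct after a match, the code below copies the lexeme (in a scanner this would be yytext, which is overwritten by the next match) and computes the value attribute for numeric literals. The members of union literal, the category codes INTLIT and REALLIT, and the helper names are assumptions; the slide does not define them.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed members; the slide leaves union literal undefined. */
union literal { long ival; double dval; char *sval; };

struct token {
    int category;
    char *text;
    int linenumber;
    int column;
    char *filename;
    union literal value;
};

enum { INTLIT = 300, REALLIT = 301 };   /* hypothetical category codes */

/* Heap copy of a string (like POSIX strdup); NULL if out of memory. */
char *copy_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *d = malloc(n);

    if (d != NULL)
        memcpy(d, s, n);
    return d;
}

void free_token(struct token *t)
{
    if (t != NULL) {
        free(t->text);
        free(t->filename);
        free(t);
    }
}

/* Build a token from a matched lexeme; NULL if out of memory. */
struct token *make_token(int category, const char *lexeme,
                         int line, int column, const char *filename)
{
    struct token *t;

    assert(lexeme != NULL && filename != NULL);
    t = calloc(1, sizeof *t);
    if (t == NULL)
        return NULL;
    t->category = category;
    t->linenumber = line;
    t->column = column;
    t->text = copy_string(lexeme);
    t->filename = copy_string(filename);
    if (t->text == NULL || t->filename == NULL) {
        free_token(t);
        return NULL;
    }
    if (category == INTLIT)
        t->value.ival = strtol(lexeme, NULL, 0);   /* base 0: 0x.., 0.. */
    else if (category == REALLIT)
        t->value.dval = strtod(lexeme, NULL);
    return t;
}
```

A semantic action could then call something like make_token(INTLIT, yytext, lineno, col, fname) before returning the category code, where lineno, col and fname are whatever position variables the scanner maintains.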
15
“string removal tool”
%%
"zap me"
16
whitespace trimmer
%%
[ \t]+     putchar(' ');
[ \t]+$    /* drop entirely */
17
string replacement
%%
username    printf("%s", getlogin());
18
Line/character counter
%{
int lines = 0, chars = 0;
%}
%%
\n    ++lines; ++chars;
.     ++chars;
%%
int main() {
    yylex();
    printf("lines: %d chars: %d\n", lines, chars);
    return 0;
}
19
Example: C/C++ reals
Allow .2 ? What about 2. ?
Is it: [0-9]*\.[0-9]*
Is it: ([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)
(In lex an unescaped . matches any character except newline, so the decimal point is written \.)
What about scientific notation? 3e4
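The questions above can be explored by trying the candidate patterns on a few inputs with the POSIX regex library, with the decimal point escaped as \. so that it matches only a literal dot. REAL_A and REAL_B are the two candidates from the slide; REAL_C is one possible extension for scientific notation and is purely an assumption (it ignores suffixes such as f and L, and hexadecimal floats). The helper matches() is a hypothetical name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Candidate patterns, anchored so the whole string must match. */
#define REAL_A "^([0-9]*\\.[0-9]*)$"
#define REAL_B "^([0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+)$"
/* Assumed extension: optional exponent, or digits with a required one. */
#define REAL_C "^(([0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+)([eE][+-]?[0-9]+)?" \
               "|[0-9]+[eE][+-]?[0-9]+)$"

/* 1 if s matches the POSIX extended regex, 0 if not, -1 on a bad pattern. */
int matches(const char *anchored_pattern, const char *s)
{
    regex_t re;
    int hit;

    assert(anchored_pattern != NULL && s != NULL);
    if (regcomp(&re, anchored_pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    hit = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return hit;
}
```

Under these patterns, REAL_A accepts a lone "." because both digit runs may be empty, REAL_B rejects it by requiring a digit on at least one side, and neither accepts 3e4 until an exponent part is added as in REAL_C.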
20
Tweaking the Input Stream
From within a semantic action after a match:
– yymore() – append the next match onto yytext, instead of replacing it
– yyless(n) – keep only the first n characters of the match; the rest go back onto the input
– unput(c) – place c back into input stream
– input() – reads next char of input
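The pushback behaviour of unput(), input() and yyless(n) can be pictured with a toy model in C. This is not flex's implementation (flex manipulates its own input buffer); struct input and the model_ functions are hypothetical, and returning -1 at end of input is this model's own convention.

```c
#include <assert.h>
#include <stddef.h>

#define PUSHBACK_MAX 64

/* Toy input stream: unread source text plus a stack of pushed-back chars. */
struct input {
    const char *src;             /* unread source text             */
    char pushed[PUSHBACK_MAX];   /* most recently pushed read first */
    size_t npushed;
};

/* Like input(): next char, pushed-back chars first; -1 at end of input. */
int model_input(struct input *in)
{
    assert(in != NULL);
    if (in->npushed > 0)
        return (unsigned char)in->pushed[--in->npushed];
    if (*in->src == '\0')
        return -1;
    return (unsigned char)*in->src++;
}

/* Like unput(c): c becomes the next char read. Returns -1 if full. */
int model_unput(struct input *in, char c)
{
    assert(in != NULL);
    if (in->npushed == PUSHBACK_MAX)
        return -1;
    in->pushed[in->npushed++] = c;
    return 0;
}

/* Like yyless(n): keep the first n chars of a match of length len and
   push the rest back, last char first, so they are reread in order. */
int model_yyless(struct input *in, const char *match, size_t len, size_t n)
{
    size_t i;

    assert(in != NULL && match != NULL && n <= len);
    for (i = len; i > n; i--)
        if (model_unput(in, match[i - 1]) != 0)
            return -1;
    return 0;
}
```

With a match of "foobar" and n = 3, the model keeps "foo" and the next three reads return b, a, r before any remaining source text. The same reverse-order idea applies when pushing back a whole string with unput().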