lex(1) and flex(1)
Lex public interface FILE *yyin; /* set before calling yylex() */ int yylex(); /* call once per token */ char yytext[];/* chars matched by yylex() */ int yywrap();/* end-of-file handler */
.l file format header % body % helper functions
Lex header C code inside %{ … %} – prototypes for helper functions – #include’s that #define integer token categories Macro definitions, e.g. letter[a-zA-Z] digit[0-9] ident{letter}({letter}|{digit})* Warning: macros are fraught with peril
Lex body Regular expressions with semantic actions “ “{ /* discard */ } {ident}{ return IDENT; } “*”{ return ASTERISK; } “.”{ return PERIOD; } Match the longest r.e. possible Break ties with whichever appears first If it fails to match: copy unmatched to stdout
Lex helper functions Follows rules of ordinary C code Compute lexical attributes Do stuff the regular expressions can’t do Write a yywrap() to switch files on EOF
Lex regular expressions \cescapes for most operators “s”match C string as-is (superescape) r{m,n}match r between m and n times r/smatch r when s follows ^rmatch r when at beginning of line r$match r when at end of line
struct token struct token { int category; char *text; int linenumber; int column; char *filename; union literal value; }
“string removal tool” % “zap me”
whitespace trimmer % [ \t]+putchar(‘ ‘); [ \t]+/* drop entirely */
string replacement % usernameprintf(“%s”, getlogin() );
Line/word counter int lines=0, chars=0; % \n++lines; ++chars;.++chars; % main() { yylex(); printf(“lines: %d chars: %d\n”, lines, chars); }
Example: C reals Is it: [0-9]*.[0-9]* Is it: ([0-9]+.[0-9]* | [0-9]*.[0-9]+)