1
Compiler Tools Lex/Yacc – Flex & Bison
2
Compiler Front End (from Engineering a Compiler)
Scanner (Lexical Analyzer)
  Maps a stream of characters into words, the basic unit of syntax.
  x = x + y ; becomes <id,x> <eq,=> <id,x> <plus_op,+> <id,y> <sc,;>
  The actual text of a word is its lexeme.
  Its part of speech (or syntactic category) is called its token type.
  The scanner discards white space and (often) comments.
  Speed is an issue in scanning, so use a specialized recognizer.
[Diagram: Source code -> Scanner -> tokens -> Parser -> Intermediate Representation; both Scanner and Parser report Errors]
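To make the scanner's job concrete, here is a minimal hand-written sketch in C that turns x = x + y ; into the token stream shown on the slide. The names next_token and the TOK_* constants are made up for this illustration; a flex-generated scanner would do the same job from regular expressions rather than hand-written tests.

```c
#include <ctype.h>
#include <stddef.h>

/* Token types (hypothetical names chosen for this sketch). */
enum token_type { TOK_ID, TOK_EQ, TOK_PLUS, TOK_SC, TOK_ERROR, TOK_END };

/*
 * Read one token starting at *src, copy its lexeme into lexeme[]
 * (cap must be at least 2), advance *src past it, and return its type.
 */
enum token_type next_token(const char **src, char *lexeme, size_t cap)
{
    const char *p = *src;
    size_t n = 0;

    while (isspace((unsigned char)*p))   /* the scanner discards white space */
        p++;

    if (*p == '\0') {                    /* end of input */
        lexeme[0] = '\0';
        *src = p;
        return TOK_END;
    }

    if (isalpha((unsigned char)*p)) {    /* identifier: letter (letter|digit)* */
        while (isalnum((unsigned char)*p)) {
            if (n + 1 < cap)             /* truncate over-long lexemes */
                lexeme[n++] = *p;
            p++;
        }
        lexeme[n] = '\0';
        *src = p;
        return TOK_ID;
    }

    /* everything else is treated as a single-character token */
    lexeme[0] = *p;
    lexeme[1] = '\0';
    *src = p + 1;
    switch (*p) {
    case '=': return TOK_EQ;
    case '+': return TOK_PLUS;
    case ';': return TOK_SC;
    default:  return TOK_ERROR;
    }
}
```

Calling next_token repeatedly on the string "x = x + y ;" yields TOK_ID, TOK_EQ, TOK_ID, TOK_PLUS, TOK_ID, TOK_SC and then TOK_END, with the lexeme buffer holding the matched text each time.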
3
The Front End (from Engineering a Compiler)
Parser
  Checks the stream of classified words (parts of speech) for grammatical correctness.
  Determines whether the code is syntactically well-formed.
  Guides checking at deeper levels than syntax.
  Builds an IR representation of the code.
Parsing is harder than scanning, so it is better to put more rules in the scanner (whitespace etc.).
[Diagram: Source code -> Scanner -> tokens -> Parser -> IR; both Scanner and Parser report Errors]
4
The Big Picture
Language syntax is specified with parts of speech, not words.
Syntax checking matches parts of speech against a grammar.
  1. goal -> expr
  2. expr -> expr op term
  3.       | term
  4. term -> number
  5.       | id
  6. op   -> +
  7.       | -
S = goal
T = { number, id, +, - }
N = { goal, expr, term, op }
P = { 1, 2, 3, 4, 5, 6, 7 }
Parts of speech, not words! No words here!
5
Why study lexical analysis?
We want to avoid writing scanners by hand.
Finite automata are used in other applications: grep, website filtering, various "find" commands.
Goals:
  To simplify specification and implementation of scanners
  To understand the underlying techniques and technologies
[Diagram: specifications (written as "regular expressions") -> Scanner Generator -> tables or code -> Scanner; source code -> Scanner -> parts of speech & words. Words are represented as indices into a global table.]
6
Finite Automata
Formally, a finite automaton is a five-tuple (S, Σ, δ, s0, SF) where
  S is the set of states, including the error state se. S must be finite.
  Σ is the alphabet or character set used by the recognizer, typically the union of the edge labels (transitions between states).
  δ(s,c) is a function that encodes transitions (i.e., on character c in Σ, the automaton moves from state s in S to state δ(s,c)).
  s0 is the designated start state.
  SF is the set of final states, drawn with a double circle in the transition diagram.
7
Finite Automata
Finite automaton to recognize fee and fie:
[Transition diagram: s0 -f-> s1; s1 -e-> s2 -e-> s3; s1 -i-> s4 -e-> s5]
S = {s0, s1, s2, s3, s4, s5, se}
Σ = {f, e, i}
δ(s,c): the set of transitions shown above
s0 = s0
SF = {s3, s5}
The set of words accepted by a finite automaton F forms a language L(F). It can also be described by regular expressions.
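The five-tuple for fee and fie maps directly onto a table-driven recognizer. The following is a minimal sketch in C; the function names accepts and char_class are made up for this illustration, and the transition layout (s0 -f-> s1, s1 -e-> s2 -e-> s3, s1 -i-> s4 -e-> s5) is an assumption that matches SF = {s3, s5}.

```c
#include <stdbool.h>

/* States s0..s5 plus the error state se. */
enum state { S0, S1, S2, S3, S4, S5, SE, NUM_STATES };

/* Map an input character to a column of the transition table. */
static int char_class(char c)
{
    switch (c) {
    case 'f': return 0;
    case 'e': return 1;
    case 'i': return 2;
    default:  return 3;   /* any character outside the alphabet */
    }
}

/* delta[s][class] encodes the transition function; missing edges go to se. */
static const enum state delta[NUM_STATES][4] = {
    /*          f   e   i   other */
    /* s0 */ { S1, SE, SE, SE },
    /* s1 */ { SE, S2, S4, SE },
    /* s2 */ { SE, S3, SE, SE },
    /* s3 */ { SE, SE, SE, SE },
    /* s4 */ { SE, S5, SE, SE },
    /* s5 */ { SE, SE, SE, SE },
    /* se */ { SE, SE, SE, SE },
};

/* Run the automaton over word; accept if it stops in a final state. */
bool accepts(const char *word)
{
    enum state s = S0;
    for (; *word != '\0'; word++)
        s = delta[s][char_class(*word)];
    return s == S3 || s == S5;
}
```

With this table, accepts("fee") and accepts("fie") are true, while "fe", "feee" and "foe" are rejected. The loop does one table lookup per character, which is why table-driven scanners are fast.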
8
Finite Automata Quick Exercise
Draw a finite automaton that can recognize CU | CSU | CSM | DU.
[Diagram: the fee/fie automaton from the previous slide, included for reference]
9
Regular Expressions in Lex*
The characters that form regular expressions include:
  .   matches any single character except newline
  *   matches zero or more copies of the preceding expression
  []  a character class that matches any character within the brackets. If the first character is ^, it matches any character except those within the brackets. A dash can be used for a character range, e.g., [0-4] is equivalent to [01234]. More in the book.
  ^   matches the beginning of a line as the first character of an expression (also negation within [], as listed above)
  $   matches the end of a line as the last character of an expression
  {}  indicates how many times the previous pattern is allowed to match, e.g., A{1,3} matches one to three occurrences of A
  \   used to escape metacharacters, e.g., \* is a literal asterisk, \" is a literal quote, \{ is a literal open brace, etc.
* from lex & yacc by Levine, Mason & Brown
10
Regular Expressions, continued
  +    matches one or more occurrences of the preceding expression, e.g., [0-9]+ matches "1", "11" or "1234" but not the empty string
  ?    matches zero or one occurrence of the preceding expression, e.g., -?[0-9]+ matches a signed number with an optional leading minus sign
  |    matches either the preceding or the following expression, e.g., cow|pig|sheep matches any of the three words
  "…"  interprets everything inside the quotation marks literally
  /    matches the preceding expression only if it is followed by the following expression, e.g., 0/1 matches "0" in "01" but not in "02". Material in the pattern following the / is not "consumed".
  ()   groups a series of regular expressions into a new regular expression, e.g., (01) becomes the character sequence 01. Useful when building up complex patterns with *, + and |.
11
Regular Expression Examples
digit: [0-9]
int with at least 1 digit: [0-9]+
int that can have 0 digits: [0-9]*
What about float?
  [0-9]*\.[0-9]+    // literal ., at least 1 digit after the . (what about 0 or 2?)
Combine int and float (notice the use of ()):
  ([0-9]+)|([0-9]*\.[0-9]+)
What about unary -?
  -?(([0-9]+)|([0-9]*\.[0-9]+))
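The final pattern can be tried outside of flex. The sketch below uses the POSIX regex API (regcomp, regexec, regfree from <regex.h>), which is an assumption about the build environment and is not what flex uses internally. The helper name is_number is made up, and the ^ and $ anchors are added so that the whole string must match.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if text, as a whole, matches -?(([0-9]+)|([0-9]*\.[0-9]+)). */
bool is_number(const char *text)
{
    regex_t re;
    bool matched;

    /* REG_EXTENDED selects the syntax with +, ? and |; REG_NOSUB skips captures. */
    if (regcomp(&re, "^-?(([0-9]+)|([0-9]*\\.[0-9]+))$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return false;                    /* pattern failed to compile */

    matched = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return matched;
}
```

Under these assumptions "42", "-3.14" and ".5" match, while "5." does not, because the float branch requires at least one digit after the decimal point.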
12
More Regular Expression Examples
What's a regular expression for matching quotes?
  \".*\" won't work for lines like "mine" and "yours" because lex matches the largest possible pattern.
  \"[^"\n]*["\n] will work by excluding " (this forces lex to stop as soon as a " is reached). The \n keeps a quoted string from exceeding one line.
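The behavior of \"[^"\n]*["\n] can be sketched by hand in C. The function name match_quoted is made up for this illustration; it returns how many characters the pattern would consume at the start of the string, or 0 if it does not match there.

```c
#include <stddef.h>

/*
 * Match an opening quote, then any characters other than a quote or
 * newline, then a closing quote or newline. Return the match length,
 * or 0 when there is no match at the start of s.
 */
size_t match_quoted(const char *s)
{
    size_t i = 0;

    if (s[i] != '"')                     /* must start with a quote */
        return 0;
    i++;
    while (s[i] != '\0' && s[i] != '"' && s[i] != '\n')
        i++;                             /* the [^"\n]* part */
    if (s[i] == '\0')
        return 0;                        /* input ended before " or newline */
    return i + 1;                        /* include the closing " or newline */
}
```

For the input "mine" and "yours" (with the quotes), the match stops after the first closing quote, 6 characters in, instead of running on to the last quote on the line as \".*\" would.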
13
Flex – Fast Lexical Analyzer
Here's where we'll put the regular expressions to good use!
[Diagram: regular expressions & C-code rules -> FLEX -> lex.yy.c (contains yylex()) -> compile -> executable scanner (program to recognize patterns in text), which analyzes and executes input]
14
Flex input file
3 sections:
  definitions
  %%
  rules
  %%
  user code
15
Definition Section Examples
Each definition has the form: name definition
DIGIT [0-9]
ID    [a-z][a-z0-9]*
A subsequent reference to
  {DIGIT}+"."{DIGIT}*
is identical to:
  ([0-9])+"."([0-9])*
16
Can include C-code in definitions
%{
/* This is a comment inside the definition */
#include <math.h> // may need headers
%}
17
Rules
The rules section of the flex input contains a series of rules of the form:
  pattern action
In the definitions and rules sections, any indented text or text enclosed in %{ and %} is copied verbatim to the output (with the %{ %}'s removed). The %{ %}'s must appear unindented on lines by themselves.
18
Example: Simple Pascal-like recognizer
Definitions section:
/* scanner for a toy Pascal-like language */
%{
/* need this for the call to atof() below */
#include <math.h>
%}
DIGIT [0-9]
ID    [a-z][a-z0-9]*
Notes:
  The %{ and %} are on lines by themselves, unindented.
  The lines between %{ and %} are inserted as-is into the resulting code.
  DIGIT and ID are definitions that can be used in the rules section.
19
Example continued
Rules section:
%%
{DIGIT}+    { printf("An integer: %s (%d)\n", yytext, atoi(yytext)); }
{DIGIT}+"."{DIGIT}*    { printf("A float: %s (%g)\n", yytext, atof(yytext)); }
if|then|begin|end|procedure|function    { printf("A keyword: %s\n", yytext); }
{ID}    { printf("An identifier: %s\n", yytext); }
"+"|"-"|"*"|"/"    { printf("An operator: %s\n", yytext); }
"{"[^}\n]*"}"    /* eat up one-line comments */
[ \t\n]+    /* eat up whitespace */
.    { printf("Unrecognized character: %s\n", yytext); }
Each rule is a pattern followed by an action. yytext is the text that matched the pattern (a char*).
20
Example continued
User code (required for flex, in library for lex):
%%
int main(int argc, char **argv)
{
    ++argv, --argc; /* skip over program name */
    if (argc > 0)
        yyin = fopen(argv[0], "r"); /* yyin is the lex input file */
    else
        yyin = stdin;
    yylex(); /* lexer function produced by lex */
    return 0;
}
21
Flex exercise #1 Download pascal.l
From a command prompt (Start->Run->cmd):
  flex -opascal.c -L pascal.l
  NOTE: without the -o option the output file will be called lex.yy.c. The -L option suppresses #line directives that cause problems with some compilers (e.g., DevC++).
Compile and execute pascal.c (batch file on Blackboard):
  gcc -opascal.exe -Lc:\progra~1\gnuwin32\lib pascal.c -lfl -ly
Execute the program. Type in digits, ids, keywords, etc. End the program with Ctrl-Z.
22
Flex exercise #2 Copy words.l (from lex & yacc)
Use flex, then compile and execute. What does it do?
Extend the example with 1 new part of speech.
Recognize lexemes R0-R9 as register names.
Recognize complex numbers, including for example -3+4i, +5-6i, +7i, 8i, -12i, but not 3++4i (hint: print a newline before displaying your complex number; the lexer may display 3+ and then recognize +4i).
23
Lex techniques
Hardcoding lists is not very effective. Often a symbol table is used instead. There is an example in lex & yacc; it is not covered in class, but see me if you're interested.
24
And now… Let’s continue with chapter 4!
25
Bison – like Yacc (yet another compiler compiler)
[Diagram: Context-free grammar in BNF form, LALR(1)* -> Bison -> Bison parser (C program), which groups tokens according to grammar rules]
The Bison parser provides yyparse.
You must provide:
  the lexical analyzer (e.g., flex)
  an error-handling routine named yyerror
  a main routine that calls yyparse
* LALR: Look-Ahead LR (left-to-right scan, rightmost derivation in reverse)
26
Bison Parser Same sections as flex (yacc came first): definitions, rules, C-Code
27
Bison Parser – Definition Section
Declares the tokens used in the grammar and the values used on the parser stack; may include C code within %{ %}.
Single-quoted characters can be used as tokens without declaring them, e.g., '+', '=', etc.
List the tokens; Bison will create a header with defines:
  %token NAME NUMBER
YYSTYPE determines the data type of the values returned by the lexer.
If the lexer returns different types depending on what is read, include a union:
  %union {
      char cval;
      char *sval;
      int ival;
  }
Types declared in the union can be used to specify types for tokens and also for non-terminals:
  %token <ival> NUMBER
  %type <sval> bibKey
28
Bison Parser – Rule Section
Use : between lhs and rhs, place ; at end.
  statement: NAME '=' expression
      | expression { printf("= %d\n", $1); }
      ;
  expression: NUMBER '+' NUMBER { $$ = $1 + $3; }
      | NUMBER '-' NUMBER { $$ = $1 - $3; }
      | NUMBER { $$ = $1; }
      ;
Unlike flex, bison doesn't care about line boundaries, so add white space for readability.
The symbol on the lhs of the first rule is the start symbol; you can override it with a %start declaration in the definition section.
$1, $3 refer to RHS values. $$ sets the value of the LHS.
In expression, $$ = $1 + $3 means it sets the value of the lhs (expression) to NUMBER ($1) + NUMBER ($3).
29
More on Symbol Values and Actions
Symbols in bison have values. The YYSTYPE typedef contains the value types.
The default for all values is int.
A rule's action is executed when the parser reduces that rule (it will have recognized both NUMBER symbols; the lexer should have returned a value via yylval).
  expression: NUMBER '+' NUMBER { $$ = $1 + $3; }
      | NUMBER '-' NUMBER { $$ = $1 - $3; }
      ;
30
More on Symbol Values and Actions
Example to return an int value:
  [0-9]+ { yylval = atoi(yytext); return NUMBER; }
yylval = atoi(yytext) sets the value for use in actions. This one just sets it to the numeric value of the string stored in yytext.
return NUMBER returns the recognized token.
In prior examples we just returned tokens, not values.
31
Bison Parser – C Section
At a minimum, provide yyerror and main routines:
  void yyerror(char *errmsg)
  {
      fprintf(stderr, "%s\n", errmsg);
  }
  int main(void)
  {
      yyparse();
      return 0;
  }
32
Bison Intro Exercise Download SimpleCalc.y and SimpleCalc.l
Create the calculator program:
  bison -d simpleCalc.y
  flex -L -osimpleCalc.c simpleCalc.l
  gcc -c simpleCalc.c
  gcc -c simpleCalc.tab.c
  gcc -Lc:\progra~1\gnuwin32\lib simpleCalc.o simpleCalc.tab.o -osimpleCalc.exe -lfl -ly
As a convenience, you can use the batch file mbison.bat instead of typing all the above:
  mbison simpleCalc
Test with valid sentences (e.g., 3+6-4) and invalid sentences.
33
Understanding simpleCalc
Explanation: When the lexer recognizes a number [0-9]+ it returns the token NUMBER and sets yylval to the corresponding integer value. When the lexer sees a newline it returns 0. If it sees a space or tab it ignores it. When it sees any other character it returns that character (the first character in the yytext buffer). If yyparse recognizes it, good! Otherwise the parser can generate an error.

simpleCalc.l:
%{
#include "simpleCalc.tab.h"
extern int yylval;
%}
%%
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
[ \t] ; /* ignore white space */
\n return 0; /* logical EOF */
. return yytext[0];
%%
/* 5. Other C code that we need */
void yyerror(char *errmsg)
{
    fprintf(stderr, "%s\n", errmsg);
}
int main(void)
{
    yyparse();
    return 0;
}

simpleCalc.tab.h:
#ifndef YYTOKENTYPE
# define YYTOKENTYPE
/* Put the tokens into the symbol table, so that GDB and other debuggers know about them. */
enum yytokentype { NAME = 258, NUMBER = 259 };
#endif
/* Tokens. */
#define NAME 258
#define NUMBER 259
34
Understanding simpleCalc, continued
simpleCalc.y:
  %token NAME NUMBER
  %%
  statement: NAME '=' expression
      | expression { printf("= %d\n", $1); }
      ;
  expression: expression '+' NUMBER { $$ = $1 + $3; }
      | expression '-' NUMBER { $$ = $1 - $3; }
      | NUMBER { $$ = $1; }
      ;
Explanation: When you execute simpleCalc and type an expression such as 1+2, the main program calls yyparse. This calls lex to recognize 1 as a NUMBER (puts 1 in yylval), calls lex which returns +, and calls lex to recognize 2 as a NUMBER. By this point the 1 has been reduced to an expression by the NUMBER rule, so the parser recognizes expression + NUMBER and "reduces" this rule, meaning it does the action { $$ = $1 + $3; }. It then recognizes expression as a statement, so it does the printf action.
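The effect of those reductions can be imitated by hand. The sketch below is not what bison generates (bison produces a table-driven LALR(1) parser); it is a plain C loop, with made-up names eval_expression, read_number and skip_blanks, that applies the same three actions left to right so the order of evaluation is easy to follow.

```c
#include <ctype.h>

/* Skip spaces and tabs, as the lexer's [ \t] rule does. */
static const char *skip_blanks(const char *p)
{
    while (*p == ' ' || *p == '\t')
        p++;
    return p;
}

/* Read a NUMBER ([0-9]+) at *pp; return 1 and store its value on success. */
static int read_number(const char **pp, int *value)
{
    const char *p = skip_blanks(*pp);

    if (!isdigit((unsigned char)*p))
        return 0;
    *value = 0;
    while (isdigit((unsigned char)*p))
        *value = *value * 10 + (*p++ - '0');
    *pp = p;
    return 1;
}

/* Evaluate NUMBER (('+'|'-') NUMBER)*; return 1 on success, 0 on a syntax error. */
int eval_expression(const char *input, int *result)
{
    const char *p = input;
    int acc, rhs;

    if (!read_number(&p, &acc))          /* expression: NUMBER  { $$ = $1; } */
        return 0;
    for (;;) {
        p = skip_blanks(p);
        if (*p == '\0' || *p == '\n')    /* newline acts as logical EOF */
            break;
        char op = *p++;
        if ((op != '+' && op != '-') || !read_number(&p, &rhs))
            return 0;                    /* syntax error */
        acc = (op == '+') ? acc + rhs    /* { $$ = $1 + $3; } */
                          : acc - rhs;   /* { $$ = $1 - $3; } */
    }
    *result = acc;
    return 1;
}
```

Because the grammar is left-recursive, the running value plays the role of the expression on the left: 10 - 2 - 3 is evaluated as (10 - 2) - 3 = 5, and an input such as 3++4 is rejected.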
35
Even more detail (if you’re curious)
Running flex creates simpleCalc.c. This creates the following case statement (I added the printf statements):
case 1:
YY_RULE_SETUP
printf("returning number value %d\n", atoi(yytext));
{ yylval = atoi(yytext); return NUMBER; }
YY_BREAK
case 2:
printf("ignoring white space\n");
; /* ignore white space */
case 3:
printf("recognized eof\n");
return 0; /* logical EOF */
case 4:
printf("returning other character %c\n", yytext[0]);
return yytext[0];
36
Continuing more detail
Running bison creates simpleCalc.tab.c:
switch (yyn) {
case 3:
#line 4 "simpleCalc.y"
{ printf("= %d\n", (yyvsp[0])); ;}
break;
case 4:
#line 7 "simpleCalc.y"
{ (yyval) = (yyvsp[-2]) + (yyvsp[0]); ;}
case 5:
#line 8 "simpleCalc.y"
{ (yyval) = (yyvsp[-2]) - (yyvsp[0]); ;}
case 6:
#line 9 "simpleCalc.y"
{ (yyval) = (yyvsp[0]); ;}
Notice the use of the value stack pointer yyvsp for the $ values.
NOTE: I added extra printf statements to each case, which is what you can see in the trace.
37
Continuing more detail
In exercise 2 you define a union. This gets translated to code within simpleCalc.tab.h:
#if ! defined (YYSTYPE) && ! defined (YYSTYPE_IS_DECLARED)
#line 1 "simpleCalcEx2.y"
typedef union YYSTYPE {
    float fval;
    int ival;
} YYSTYPE;
extern YYSTYPE yylval;
This is what makes yylval a union, so the lexer sets one member of it (e.g., yylval.ival or yylval.fval).
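A union stores only one of its members at a time, so something outside the union has to record which member is valid. In bison that role is played by the token or non-terminal type (<ival>, <fval>). The sketch below shows the same idea in plain C with an explicit tag; the names value, value_kind and as_float are made up for this illustration.

```c
/* Mirrors the members of the slide's YYSTYPE union. */
union value {
    float fval;
    int ival;
};

/* Records which member of the union currently holds data. */
enum value_kind { KIND_INT, KIND_FLOAT };

/* Read the active member and widen it to float if necessary. */
float as_float(union value v, enum value_kind kind)
{
    return kind == KIND_FLOAT ? v.fval : (float)v.ival;
}
```

For example, after v.ival = 3 the call as_float(v, KIND_INT) gives 3.0f. Reading v.fval at that point would instead reinterpret the integer's bits, which is the mistake the <ival>/<fval> declarations help avoid.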
38
Continuing more detail
Symbols you define in bison's CFG are added to a symbol table:
static const char *const yytname[] = {
    "$end", "error", "$undefined", "NUMBER", "FNUMBER", "NAME",
    "'='", "'+'", "'*'", "'('", "')'", "$accept",
    "statement", "expression", "term", "factor", 0
};
39
Continuing more detail
New rules make use of the union:
switch (yyn) {
case 3:
#line 15 "simpleCalcEx2.y"
{ printf("= %f\n", (yyvsp[0].fval)); ;}
break;
case 4:
#line 18 "simpleCalcEx2.y"
{ (yyval.fval) = (yyvsp[-2].fval) + (yyvsp[0].fval); ;}
case 5:
#line 19 "simpleCalcEx2.y"
{ (yyval.fval) = (yyvsp[0].fval); ;}
expression is defined as <fval>, and so is NUMBER.
40
Bison Exercise #1
Change simpleCalc to handle + and * with correct precedence using the grammar with terms and factors presented in chapter 4 of the text:
  Expr -> Expr + Term | Term
  Term -> Term * Factor | Factor
  Factor -> (Expr) | NUMBER
(Changed id to NUMBER for simplicity.)
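To see why the Expr/Term/Factor layering gives * a higher precedence than +, it can help to evaluate the grammar by hand. The sketch below is a recursive-descent evaluator in C, a different technique from the LALR(1) parser bison generates; the left-recursive rules are written as loops, and all function names are made up for this illustration.

```c
#include <ctype.h>

static const char *cur;   /* current position in the input */
static int ok;            /* cleared on the first syntax error */

static int parse_expr(void);

static void skip_blanks(void)
{
    while (*cur == ' ' || *cur == '\t')
        cur++;
}

/* Factor -> ( Expr ) | NUMBER */
static int parse_factor(void)
{
    int value = 0;

    skip_blanks();
    if (*cur == '(') {
        cur++;
        value = parse_expr();
        skip_blanks();
        if (*cur == ')')
            cur++;
        else
            ok = 0;                      /* missing closing parenthesis */
    } else if (isdigit((unsigned char)*cur)) {
        while (isdigit((unsigned char)*cur))
            value = value * 10 + (*cur++ - '0');
    } else {
        ok = 0;                          /* expected a number or '(' */
    }
    return value;
}

/* Term -> Term * Factor | Factor */
static int parse_term(void)
{
    int value = parse_factor();

    for (skip_blanks(); ok && *cur == '*'; skip_blanks()) {
        cur++;
        value *= parse_factor();
    }
    return value;
}

/* Expr -> Expr + Term | Term */
static int parse_expr(void)
{
    int value = parse_term();

    for (skip_blanks(); ok && *cur == '+'; skip_blanks()) {
        cur++;
        value += parse_term();
    }
    return value;
}

/* Evaluate the whole input; return 1 on success, 0 on a syntax error. */
int evaluate(const char *input, int *result)
{
    cur = input;
    ok = 1;
    *result = parse_expr();
    skip_blanks();
    return ok && *cur == '\0';
}
```

Since every operand of + is a complete Term, the multiplication in 2+3*4 is finished before the addition happens, giving 14, while (2+3)*4 gives 20. The bison version of the exercise expresses the same layering with the three grammar rules and lets the generated parser do the bookkeeping.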
41
Bison Exercise #2
Change simpleCalc.l to accept floating point values OR integers.
  Remove extern int yylval; (yylval is no longer simply an int).
  Update the #include of simpleCalc.tab.h if you change the name of your file.
  Use atof for floating point values.
  You will create a union in simpleCalc.y. Use the member names of that union in simpleCalc.l; for example, yylval.ival = atoi(yytext); sets the ival member of the union to an integer value.
Change simpleCalc.y to accept floating point values.
  Create a union, for example:
    %union { float fval; int ival; }
  Add %token statements for every token and %type statements for your non-terminals, for example:
    %token <ival> NUMBER
    %type <fval> expression
  Update factor to accept NUMBER or a floating point type of number (e.g., FNUMBER).
  The printf in statement needs to print a floating point value (printf("= %f\n", $1);).
42
Bison Exercise #3
Update simpleCalc to accept statements such as @myVar = 3.4*4
Output will be: myVar = 13.6
Purpose:
  adding another type to the union (char*). I called this sval.
  using a C function as part of the lexer to preprocess yytext before setting yylval.
Steps in simpleCalc.l:
  Add a prototype for a function named extract_name. The parameter to this function is a char* (you will pass in yytext). You can either return a char* or just modify the parameter, since it's an array. The prototype goes in the declaration section.
  Add the function extract_name to the C section. This function will just remove the @ from the front of the variable name. HINT: remember that C strings end in '\0'. You can modify this string in place, but for more extensive processing you might need to create your own C strings. You can use malloc, strdup and free in such a case.
  When you have recognized a variable (@ followed by upper or lower case letters, in our simple example), you will set yylval.sval = extract_name(yytext);
Steps in simpleCalc.y:
  Be sure you still have NAME = expression in your grammar, and add an action so it prints both the variable and the expression result.
  Declare NAME as a token of type <sval> (or whatever name you used in your union).
43
Bison Exercise #4
Modify simpleCalc.l so that it accepts input from a file. The last slide contains a main function that will read from a file.
Create a small input file with a single line of input, something like:
  @myVar = 8+3*2.5+6
44
Summary of steps (from online manual)
The actual language-design process using Bison, from grammar specification to a working compiler or interpreter, has these parts:
  Formally specify the grammar in a form recognized by Bison (i.e., machine-readable BNF).
  For each grammatical rule in the language, describe the action that is to be taken when an instance of that rule is recognized. The action is described by a sequence of C statements.
  Write a lexical analyzer to process input and pass tokens to the parser.
  Write a controlling function (main) that calls the Bison-produced parser.
  Write error-reporting routines.
45
Using files with Bison
The flex scanner that feeds the Bison parser reads its input from yyin. The following code can be used to take an optional command-line argument:
int main(int argc, char **argv)
{
    FILE *file = stdin; /* default when no file name is given */
    if (argc == 2) {
        file = fopen(argv[1], "r");
        if (!file) {
            fprintf(stderr, "Couldn't open %s\n", argv[1]);
            exit(1);
        }
    }
    yyin = file;
    yyparse();
    return 0;
}