lex & yacc CIS*2750 Winter 2013
CIS*2750 (W13) D. McCaughan
Scanners
A “scanner” turns an input stream in the source language into token codes
–in principle: takes some action when it recognizes a token in the input
–discards non-semantic content (i.e. whitespace, comments)
–may do other small jobs, like converting numeric constants
Example:
    if (a == 0) { /* increase b */ b++; }
is scanned as:
    IF LPAREN ID EQ CONSTANT RPAREN LBRACE ID INCR SEMI RBRACE
Scanners: Lexical Analysis
Analyze the structural components of input
–scanner: groups input characters into tokens
What is a token?
–a sequence of characters that can be treated as an atomic grammatical unit
–a language specifies a finite set of token types (the lexical units of the language), e.g. ID (“foo”, “bar”, “abc123”, …), IF (“if”), INTEGER, REAL, COMMA, NEQ, LPAREN, RPAREN, …
–some tokens carry additional semantic values, e.g. identifiers, string literals, numbers
Scanners: example
program:
    /* find a zero */
    float match0(char *s)
    {
        if (!strncmp(s, "0.0", 3))
            return 0.0;
    }
scanner tokenizes:
    FLOAT ID(match0) LPAREN CHAR STAR ID(s) RPAREN LBRACE
    IF LPAREN BANG ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA INTEGER(3) RPAREN RPAREN
    RETURN REAL(0.0) SEMI
    RBRACE EOF
Specifying tokens
Structure of tokens can be complex
–defining complex tokens ad hoc is a problem
–e.g. string literals
–e.g. floating point format
Need a formal language to specify token types without ambiguity
–permits review of design and validation of input
Regular expressions
–succinct, precise
–capable of representing infinite sets of strings
CAUTION: regular expressions cannot describe all sets of strings
–consider writing a regex for “strings containing an equal number of a and b characters”
Finite automata
Need a formalism that can be implemented in code
Finite Automaton: a simple idealized “computer” that recognizes strings belonging to regular sets:
–a finite set of states S
–a finite alphabet Σ
–a set of transitions between states based on the input read in a given state, T: (S × Σ) → S
–a specific start state s ∈ S
–a set of final (accepting) states F ⊆ S
–the set of all strings accepted by a given FA is the language it defines
Compare the above with what you understand about regular expressions… they are equivalent
Finite automata (cont.)
Can represent a FA using transition graphs
–directed graph
–each state is a vertex
    accepting states are marked as such
–each transition is a directed edge between states
    edges are labeled with a symbol from the alphabet
    a symbol can appear on only 1 outgoing edge from a given state
    an unlabelled edge is directed into the start state
[Figure: transition graph for identifiers, with edges labeled a-zA-Z_ and 0-9]
Finite automata (cont.)
Deterministic finite automata
–no two edges leaving the same state are labelled with the same input symbol
Processing
–begin in start state
–for each character in input do
    follow edge labelled with this character to next state
–after n transitions (for input of length n), if current state is a final state: ACCEPT string, else: REJECT string
Easily implemented
–CASE-based processing given global current state
–matrix-based transition table (table lookup)
    newstate = matrix[current_state][input]
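The table-lookup approach can be sketched in C as follows. This is only an illustration under stated assumptions: the identifier language (a letter or underscore followed by letters, digits or underscores), the DEAD state, the grouping of characters into input classes, and all names (`transition`, `classify`, `accepts_identifier`) are choices made for this sketch, not part of the course material.

```c
#include <stddef.h>

/* States of a small DFA for C-style identifiers (assumed example language). */
enum state { START, IN_ID, DEAD, NUM_STATES };

/* Input classes: grouping characters keeps the table small (an assumption
   of this sketch; a full table would have one column per character). */
enum input_class { LETTER, DIGIT, OTHER, NUM_CLASSES };

/* transition[current_state][input_class] gives the next state. */
static const enum state transition[NUM_STATES][NUM_CLASSES] = {
    /*            LETTER  DIGIT  OTHER */
    /* START */ { IN_ID,  DEAD,  DEAD },
    /* IN_ID */ { IN_ID,  IN_ID, DEAD },
    /* DEAD  */ { DEAD,   DEAD,  DEAD }
};

static enum input_class classify(char c)
{
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_')
        return LETTER;
    if (c >= '0' && c <= '9')
        return DIGIT;
    return OTHER;
}

/* Follows one transition per input character; returns 1 if the DFA ends
   in the accepting state IN_ID, 0 otherwise. */
int accepts_identifier(const char *s)
{
    enum state current = START;
    size_t i;

    for (i = 0; s[i] != '\0'; i++)
        current = transition[current][classify(s[i])];
    return current == IN_ID;
}
```

Under these assumptions, `accepts_identifier("abc123")` accepts and `accepts_identifier("1abc")` rejects. The DEAD state stands in for the missing edges of the transition graph, so every (state, input) pair has an entry in the table.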
Scanner generators
Writing scanners is a common requirement
–parsing is a ubiquitous activity
Process is repetitive, resulting code is similar in structure
Process is not difficult to automate
Scanner generators receive a specification file
–definitions of the tokens to be scanned
–non-procedural programming
    not “how”, but “what”
e.g. lex
lex
What is lex?
–a lexical analyzer (scanner) generator
–INPUT: a description file that uses regular expressions to specify patterns to be tokenized
–OUTPUT: source code that implements the scanner
–by default, the generated scanner reads from stdin and writes to stdout (this can be changed)
Specific variants
–flex for creating C/C++ lexers
–JFlex for creating Java lexers
–etc.
Structure of a lex file
Definitions section
–small building blocks of regular expressions to simplify the scanner specification
    declared outside of %{ %}
    special directives to change lex’s behaviour also appear here
–anything inside of %{ %} is copied verbatim into the final program (so should be C code)
    comments, #include, #define, variables (e.g. line counter), etc.
Layout:
    DEFINITIONS SECTION
    %%
    RULES SECTION
    %%
    USER CODE SECTION
Structure of a lex file (cont.)
Rules section
–a pattern (regex) and an action (program code) to execute when that pattern is found
    the action starts on the same line as the pattern
    a given piece of input is matched only once
    the longest possible match is always used
–“island” would match [a-zA-Z]+ before it would match is
User code section
–any legal program code, not enclosed in %{ %}
–copied verbatim into final program
–main(), other subroutines used (or expected) by actions from the rules section
–NOTE: comments outside of %{ %} must be indented!
Example
The simplest lex script:
–simply copies standard input to standard output
–ECHO is a special lex directive, not a C command
    %%
    .|\n    { ECHO; }
    %%
Running lex
Executing lex: lex <file>
    e.g. % lex example.l (produces lex.yy.c)
–outputs C source code for the lexer; by default this file is called lex.yy.c, which can be compiled normally
–some systems may require you to link in the lex library (i.e. -ll; for flex: -lfl)
    e.g. % gcc -Wall -ansi lex.yy.c -o scanner -lfl
Running lex (cont.)
Key points:
–automatically generates a function yylex() which, when called, begins scanning the input (stdin) for patterns and executing actions
    if actions have no return statements, yylex() won’t return until EOF
–internal variables are always available in actions
    yytext: text that matched the pattern
    yyleng: length of the string in yytext
    etc. (some implementations have built-in line counting, e.g. yylineno)
–if a main() routine is not explicitly provided, the lex library (-ll or -lfl) supplies a default one that simply calls yylex()
Example
Things to note here:
–variables declared in the definitions section; #define and #include also belong there
–special internal variables (yyleng) and functions (yylex)
–the definitions section runs up to the first %%, the rules sit between the two %% lines, and user code follows
    %{
    /* a word counting program */
    #include <stdio.h>
    unsigned char_count = 0, word_count = 0, line_count = 0;
    %}
    word    [^ \t\n]+
    eol     \n
    %%
    {word}  { word_count++; char_count += yyleng; }
    {eol}   { char_count++; line_count++; }
    .       { char_count++; }
    %%
    int main()
    {
        yylex();
        printf("l: %u - w: %u - c: %u\n", line_count, word_count, char_count);
        return 0;
    }
Example
    %{
    /* crude verb recognition program */
    %}
    %%
    [\t ]+      { /* ignore whitespace */ }
    is |
    are |
    was |
    being |
    do |
    did |
    would |
    can |
    have |
    go          { printf("%s: is a verb\n", yytext); }
    [a-zA-Z]+   { printf("%s: is not a verb\n", yytext); }
    .|\n        { ECHO; /* default catch-all */ }
    %%
    int main()
    {
        yylex();
        return 0;
    }
Example (cont.)
Compiled & run:
    % ./verb
    did I have fun?
    did: is a verb
    I: is not a verb
    have: is a verb
    fun: is not a verb
    ?
    ^D
Hints and tips
Error reporting
–you’ll want to be able to report (at least) a line number for unrecognized tokens (and other error conditions related to the parser to follow)
–consider using %option yylineno in flex
    you can easily implement this function yourself (how?)
–it can be useful to have special actions apply inside a comment (for example) or other semantic construct
    have a look at “start conditions” (lex manual) and <<EOF>> rules
Recall that tokens often have associated semantic values that must be recorded over time
–symbol table: a look-up table (typically a hash table) that permits storage and retrieval of data to be associated with a symbol
–consider how this would be integrated with lex
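One possible shape for such a symbol table is sketched below in C, as a chained hash table with a single look-up-or-insert function. Everything here is an assumption made for illustration: the names (`struct symbol`, `sym_lookup`), the bucket count, the hash function, and the single `int` value stored per symbol.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211   /* arbitrary prime, chosen for this sketch */

/* One entry per distinct name; entries in the same bucket are chained. */
struct symbol {
    char *name;
    int value;              /* whatever data is associated with the symbol */
    struct symbol *next;
};

static struct symbol *buckets[NBUCKETS];

static unsigned hash(const char *s)
{
    unsigned h = 0;

    while (*s != '\0')
        h = h * 31u + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Returns the entry for name, creating it (with value 0) if it does not
   exist yet. Returns NULL only if memory runs out. */
struct symbol *sym_lookup(const char *name)
{
    unsigned h = hash(name);
    struct symbol *sym;

    for (sym = buckets[h]; sym != NULL; sym = sym->next)
        if (strcmp(sym->name, name) == 0)
            return sym;

    sym = malloc(sizeof *sym);
    if (sym == NULL)
        return NULL;
    sym->name = malloc(strlen(name) + 1);
    if (sym->name == NULL) {
        free(sym);
        return NULL;
    }
    strcpy(sym->name, name);   /* private copy: the caller's buffer may be reused */
    sym->value = 0;
    sym->next = buckets[h];
    buckets[h] = sym;
    return sym;
}
```

In a lex action, a call such as `sym_lookup(yytext)` would be one way to connect the scanner to the table. The private copy of the name matters there because yytext is overwritten on the next match.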
Parsers
What we’ve seen to this point is lexical analysis
–only concerned with identifying the structural components of the input
Typically the sequences of tokens are also significant: this is syntax analysis (parsing), on which semantic analysis builds
–recognize sequences of tokens (or classes of tokens) and perform appropriate actions
“Parsers” validate the phrase structure of input
–specific sequences of tokens
–recognizer
–determine the semantics of the input
    consider parse trees (abstract syntax)
Parsers (cont.)
A language is defined by the phrase structure of its component expressions, e.g.:
    addition expression = ID ADDOP ID
        e.g. a + b
    decl = TYPE decls SEMICOLON
    decls = decls COMMA decls | ID
        e.g. int a, b, c;
Specification of languages
Consider defining phrases with regex’s
–e.g. addition expressions
    digits = [0-9]+
    sum = (digits “+”)* digits
    e.g. 28+301+9
–what about parentheses?
    digits = [0-9]+
    sum = expr “+” expr
    expr = “(” sum “)” | digits
    e.g. (109+23) … 61 … (1 + (250+3))
Specification of languages (cont.)
BUT… it is impossible for a DFA to recognize balanced parentheses (can’t count to arbitrary N)
–sum and expr are thus not regular expressions
Recall abbreviations in lex
–what does lex do with such abbreviations?
–RHS is substituted for LHS prior to generation of DFA
–try substituting abbreviations in the previous example
    explosion of abbreviations: expr refers to sum, which refers back to expr, so the substitution never ends
–abbreviations do not increase expressive power
What we need is recursive abbreviations
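To make the counting argument concrete, here is a minimal C sketch of a balanced-parentheses check. It relies on a counter, which is exactly the kind of memory a DFA with a fixed set of states lacks. The function name `balanced` is an assumption of this sketch, and a `long` is of course bounded too; it merely stands in for “arbitrary N”.

```c
/* Returns 1 if every ')' in s closes an earlier '(' and every '(' is
   eventually closed; returns 0 otherwise. Other characters are ignored. */
int balanced(const char *s)
{
    long depth = 0;   /* number of currently open parentheses */

    for (; *s != '\0'; s++) {
        if (*s == '(')
            depth++;
        else if (*s == ')' && --depth < 0)
            return 0;   /* a ')' with nothing left to close */
    }
    return depth == 0;
}
```

A DFA would need a distinct state for every possible value of `depth`, and there is no upper limit on how deeply the input may nest.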
Context Free Grammars (CFGs)
A precise method of specifying context free languages
Incorporate recursion into definitions
–counting
    e.g. balanced parentheses
–arbitrary repetition
    e.g. mathematical expressions
CFG Terminology
Non-terminals: variables that represent a language (UPPER CASE)
Terminals: atomic symbols in the language (lower case)
Productions: rules relating variables (written with →)
–the language associated with a given non-terminal contains strings formed by concatenating strings from the languages of other non-terminals, and possibly terminals
Start symbol: a special symbol that starts all derivations (S)
Backus-Naur Form (BNF)
From Hopcroft & Ullman, 1979
Describing natural language:
    <sentence> → <noun phrase> <verb phrase>
    <noun phrase> → <adjective> <noun phrase>
    <noun phrase> → <noun>
    <noun> → boy
    <adjective> → little
Generally not adequate for describing natural language (no accommodation of context)
Ideal for most programming languages
–Backus-Naur Form (BNF)
Productions
Example
–arithmetic expressions with + and - operators, id-class operands and balanced parentheses
    S → EXPR
    EXPR → EXPR + EXPR
    EXPR → EXPR - EXPR
    EXPR → ( EXPR )
    EXPR → id
–or, written more compactly:
    S → EXPR
    EXPR → EXPR + EXPR | EXPR - EXPR | ( EXPR ) | id
Derivations
To show a sentence is in the language defined by a grammar, we can perform a derivation: start with the start symbol and repeatedly replace any non-terminal by one of its RHSs
    S ⇒ EXPR
      ⇒ EXPR + EXPR
      ⇒ EXPR + id
      ⇒ id + id
    S ⇒ EXPR
      ⇒ EXPR - EXPR
      ⇒ ( EXPR ) - EXPR
      ⇒ ( EXPR ) - id
      ⇒ ( EXPR + EXPR ) - id
      ⇒ ( EXPR + id ) - id
      ⇒ ( id + id ) - id
Parse trees
A tree in which each symbol in a derivation is connected to the one from which it was derived
–several derivations can have the same parse tree
[Figure: parse tree for ( id + id ) - id, with S at the root, EXPR interior nodes and the tokens ( id + id ) - id as leaves]
Derivation sequence
Many different possible derivations of the same sentence
–if more than one non-terminal appears in the RHS of productions, we can choose which to expand first
Two obvious conventions:
–leftmost derivation
    choose leftmost non-terminal to expand
    top down (recursive descent) parsing
    easiest to write by hand
–rightmost derivation
    choose rightmost non-terminal to expand
    “canonical” derivation
    bottom up parsers (e.g. yacc)
Repetition and recursion
Two ways to specify recursion
Left recursion
–non-terminal appears as the first symbol on RHS of production (NOTE: for yacc it is better to use left recursion where possible, as it minimizes stack size)
–e.g. A → Az | z
Right recursion
–non-terminal appears as the last symbol on RHS of production
–e.g. A → zA | z
Either form defines the same language
–which one we use can have a significant effect depending on the parsing algorithm used
Example
Specifying a programming language (Pascal-like)
    PROGRAM → HEADER VARS BODY
    HEADER → program string ‘(’ IO ‘)’ ‘;’
    IO → input | output | inpout | none
    VARS → DECLS | void ‘;’
    DECLS → DECLS DECL | DECL
    DECL → TYPE IDS ‘;’
    TYPE → integer | real
    IDS → IDS ‘,’ id | id
    BODY → begin STMTS end
    STMTS → STMTS STMT | STMT
    STMT → EXPR ‘;’
    EXPR → EXPR ‘+’ EXPR | EXPR ‘=’ EXPR | ‘(’ EXPR ‘)’ | id | number
Errors in grammars
Ambiguity: effect on semantics
–consider 2 - 1 - 3
–(2 - 1) - 3 != 2 - (1 - 3): the first is -2, the second is 4
–deciding whether an arbitrary CFG is ambiguous is impossible in general
    algorithms exist for certain classes of grammar (such as those for which we can generate parsers)
Recall: grammar used to define a language
–errors in grammar: wrong language defined
–deciding whether two grammars define the same language is also impossible in the general case
Ambiguous grammars
A grammar is ambiguous if we can derive a sentence with two different parse trees
–semantics are no longer necessarily clear; e.g.
    S → EXPR
    EXPR → EXPR + EXPR | EXPR - EXPR | id
–NOTE: multiple ways to derive id + id - id
Leftmost derivation:
    S ⇒ EXPR
      ⇒ EXPR + EXPR
      ⇒ id + EXPR
      ⇒ id + EXPR - EXPR
      ⇒ id + id - EXPR
      ⇒ id + id - id
Rightmost derivation:
    S ⇒ EXPR
      ⇒ EXPR - EXPR
      ⇒ EXPR - id
      ⇒ EXPR + EXPR - id
      ⇒ EXPR + id - id
      ⇒ id + id - id
Ambiguous grammars (cont.)
[Figure: the two parse trees for id + id - id, one with + as the top-level operator and one with - as the top-level operator]
Resolving ambiguity
Disambiguating rules
–explicitly state which parse tree is correct
–no change required to grammar
Precedence
–stated order of derivations based on operator
–recall: subtrees will be evaluated before expressions represented by the root node, so the order of derivations is opposite to the order of evaluation
Associativity
–stated order of derivations based on location
–left associative: derivation from first choice
–right associative: derivation from last choice
Resolving ambiguity (cont.)
Rewrite the grammar
–accommodate concepts of precedence and associativity in the statement of the grammar
    write rules so that phrases to be evaluated first are derived later in the production sequence
Precedence
    EXPR → EXPR + EXPR | MEXPR
    MEXPR → MEXPR * MEXPR | AEXPR
    AEXPR → ( EXPR ) | number
Associativity
    EXPR → EXPR + MEXPR | MEXPR
    MEXPR → MEXPR * AEXPR | AEXPR
    AEXPR → ( EXPR ) | number
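The effect of the rewritten grammar can be seen in a small hand-written evaluator, sketched here in C. It is a recursive descent parser, not what yacc generates, and because recursive descent cannot handle left recursion directly, the left-recursive rules are written as loops (EXPR → MEXPR { + MEXPR }, MEXPR → AEXPR { * AEXPR }), which keeps + and * left associative. All function and variable names are assumptions of this sketch.

```c
#include <ctype.h>

static const char *p;   /* cursor into the input string */
static int ok;          /* cleared when a syntax error is seen */

static long parse_expr(void);

static void skip_spaces(void)
{
    while (*p == ' ' || *p == '\t')
        p++;
}

/* AEXPR -> ( EXPR ) | number */
static long parse_aexpr(void)
{
    long value = 0;

    skip_spaces();
    if (*p == '(') {
        p++;
        value = parse_expr();
        skip_spaces();
        if (*p == ')')
            p++;
        else
            ok = 0;     /* missing closing parenthesis */
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p))
            value = value * 10 + (*p++ - '0');
    } else {
        ok = 0;         /* neither '(' nor a number */
    }
    return value;
}

/* MEXPR -> AEXPR { * AEXPR }: binds tighter than + because EXPR calls it */
static long parse_mexpr(void)
{
    long value = parse_aexpr();

    skip_spaces();
    while (ok && *p == '*') {
        p++;
        value *= parse_aexpr();
        skip_spaces();
    }
    return value;
}

/* EXPR -> MEXPR { + MEXPR } */
static long parse_expr(void)
{
    long value = parse_mexpr();

    skip_spaces();
    while (ok && *p == '+') {
        p++;
        value += parse_mexpr();
        skip_spaces();
    }
    return value;
}

/* Returns 1 and stores the value in *result if input is a complete EXPR;
   returns 0 on a syntax error. */
int evaluate(const char *input, long *result)
{
    p = input;
    ok = 1;
    *result = parse_expr();
    skip_spaces();
    return ok && *p == '\0';
}
```

With this layering, `2 + 3 * 4` evaluates to 14 because the multiplication is derived deeper in the tree, while `(2 + 3) * 4` evaluates to 20.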
Common ambiguities
Mathematical expressions
–if parentheses are not required, operators that are not associative by nature may be ambiguous
Conditional expressions
–dangling else: in a nested conditional such as
    if condition if condition statements else statements
 the else can be attached to either if
Notes
Classes of grammars
–regular grammar (regex)
    A → zB | z OR (i.e. not both) A → Bz | z
–context free grammar (CFG)
    A → B (A is any non-terminal, B is any string)
–context sensitive grammar
    xAz → xBz (A is any non-terminal, B is any string)
–unrestricted grammar
    the languages it defines are also called recursively enumerable
Example: context issues in programming languages
–symbols defined prior to use
–cannot specify with CFGs
yacc
What is yacc?
–a parser generator
–INPUT: a description file that uses a BNF-like notation to specify sequences of tokens to be recognized as a semantic unit
–OUTPUT: source code that implements the parser
–yacc operates on tokens rather than the input directly
    requires a source of tokens (like lex!)
Specific variants
–bison, byacc for creating C/C++ parsers
–CUP for creating Java parsers
–etc.
Using lex & yacc together
The parser is the higher level routine
–it calls the lexer when it needs a token from the input
–the scanner sends tokens to the parser as codes
–not all input is of interest to the parser (whitespace, comments) so the lexer does not return these
What are the token codes?
–scanner and parser must agree
    solution: let yacc define the token codes
    tokens defined in the parser are automatically given small integer values, using #define macros in a header file that yacc generates
yacc and parsing
Shift/reduce parsing
–a yacc parser looks for rules that might match the tokens seen so far
–has a set of states: each reflects a possible position in one or more partially matched rules
–when it reads a token that doesn’t complete a rule, it pushes the token onto a stack and switches to a new state
    this is a shift
–when it reads a token that completes a rule, it pops the RHS symbols off the stack, pushes the LHS symbol onto the stack and switches to a new state
    this is a reduce
–whenever a rule is reduced, user code associated with the rule is executed
Shift/reduce parsing
e.g.
    statement → NAME = expression
    expression → NUMBER + NUMBER | NUMBER - NUMBER
Parse: A = 12 + 13
    stack: A              (shift A)
           A =            (shift =)
           A = 12         (shift 12)
           A = 12 +       (shift +)
           A = 12 + 13    (shift 13)
This matches the rule expression → NUMBER + NUMBER, so reduce: pop 13, +, 12 and push expression
    stack: A = expression
This matches the rule statement → NAME = expression, so reduce: pop expression, =, A and push statement
    stack: statement
End of input. Stack has been reduced to the start symbol, so the input was valid according to the grammar
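The trace above can be imitated with a deliberately naive C sketch: push each token (shift), then check whether the top of the stack matches the RHS of a rule and, if so, replace it with the LHS (reduce). Real yacc parsers are table-driven state machines with look-ahead; this sketch only pattern-matches the stack for this one tiny grammar. The names (`enum symbol`, `reduce`, `parse`, `MAX_STACK`) are assumptions made for illustration.

```c
/* Terminals and non-terminals of the example grammar. */
enum symbol { NAME, EQUALS, NUMBER, PLUS, MINUS, EXPRESSION, STATEMENT };

#define MAX_STACK 64

/* Tries to reduce the top three stack symbols by one rule.
   Returns 1 if a reduction was made, 0 otherwise. */
static int reduce(enum symbol *stack, int *top)
{
    int n = *top;

    if (n >= 3) {
        enum symbol a = stack[n - 3], b = stack[n - 2], c = stack[n - 1];

        /* expression -> NUMBER + NUMBER | NUMBER - NUMBER */
        if (a == NUMBER && (b == PLUS || b == MINUS) && c == NUMBER) {
            stack[n - 3] = EXPRESSION;   /* pop three symbols, push the LHS */
            *top = n - 2;
            return 1;
        }
        /* statement -> NAME = expression */
        if (a == NAME && b == EQUALS && c == EXPRESSION) {
            stack[n - 3] = STATEMENT;
            *top = n - 2;
            return 1;
        }
    }
    return 0;
}

/* Returns 1 if the token sequence reduces to a single STATEMENT. */
int parse(const enum symbol *tokens, int count)
{
    enum symbol stack[MAX_STACK];
    int top = 0;   /* number of symbols currently on the stack */
    int i;

    for (i = 0; i < count; i++) {
        if (top == MAX_STACK)
            return 0;                    /* stack overflow: give up */
        stack[top++] = tokens[i];        /* shift */
        while (reduce(stack, &top))      /* reduce while some rule matches */
            ;
    }
    return top == 1 && stack[0] == STATEMENT;
}
```

For the token sequence NAME EQUALS NUMBER PLUS NUMBER the stack ends as a single STATEMENT, mirroring the trace; a truncated input such as NAME EQUALS NUMBER PLUS leaves unreduced symbols on the stack and is rejected.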
Structure of a yacc file
Definitions section
–specify tokens and types for symbols, precedence and associativity rules
    declared outside of %{ %}
    tokens and types for symbols in the grammar (with %token and %type respectively)
    we can specify a non-integer token type (as a union) with %union
    a start symbol can be explicitly specified with %start
–anything inside of %{ %} is copied verbatim into the final program (so should be C code)
    comments, #include, #define, variables (e.g. symbol table)
Layout:
    DEFINITIONS SECTION
    %%
    RULES SECTION
    %%
    USER CODE SECTION
Structure of a yacc file (cont.)
Rules section
–a grammar rule and an action (program code) to execute when that rule is matched
    default start symbol is the LHS of the first rule
    NOTE: yacc cannot parse ambiguous grammars!
–a rule consists of a LHS and a list of alternative RHSs (using “:” instead of “→”), each optionally followed by an action consisting of program code, with a semi-colon terminating each rule
–the generated parser will execute any action present when it reduces a rule
User code section
–any legal program code, not enclosed in %{ %}
–copied verbatim into final program
–main(), other subroutines used (or expected) by actions from the rules section
    caution: there should only be one main() between lex and yacc (obviously)
Symbols
NOTE: yacc reverses the BNF conventions with respect to terminals and non-terminals
–non-terminals are lower-case; terminals are upper-case
Every symbol in a yacc grammar has a value
–symbols can be of different types by using the %union and %type directives
–the LHS is referred to as $$; the symbols on the RHS are referred to by position, as $1, $2, $3, …
–these shorthand notations are replaced in the generated code by the actual variable containing the value
Running yacc
Executing yacc: yacc -d -y <file>
    e.g. % yacc -d -y example.y (produces y.tab.[ch])
–outputs C source, by default named y.tab.c, and an include file for use by a scanner, named y.tab.h
–must also produce/compile a scanner and link it all together
Key points
–automatically provides a function yyparse()
    scans the input, shifting/reducing until the scanner reports the end of input (subsequent calls will reset the state and continue)
–internal variables are available to both lexer and parser (yyin: input stream; yylval: value of lexer token, etc.)
Example: an expression parser
(definitions up to the first %%, rules between the two %% lines, user code after)
    %{
    #include <stdio.h>
    %}
    %union {
        int ival;
        char *sval;
    }
    %token PLUS MINUS EQUALS
    %token <sval> NAME
    %token <ival> NUMBER
    %type <ival> expression
    %%
    statement  : NAME EQUALS expression   { printf("%s = %d\n", $1, $3); }
               | expression               { printf("= %d\n", $1); }
               ;
    expression : expression PLUS NUMBER   { $$ = $1 + $3; }
               | expression MINUS NUMBER  { $$ = $1 - $3; }
               | NUMBER                   { $$ = $1; }
               ;
    %%
    extern FILE *yyin;
    int yyerror(char *s)
    {
        fprintf(stderr, "%s\n", s);
        return 0;
    }
    int main()
    {
        if (yyin == NULL) yyin = stdin;
        while (!feof(yyin)) yyparse();
        return 0;
    }
Example: expression parser’s scanner
Things to note here:
–control source of input by setting yyin (in yacc)
–yyerror() is called by yacc on parse errors (and can be freely used in actions otherwise), and should be provided
–y.tab.h is the include file generated by yacc that contains the token definitions
–yacc parsers contain an internal variable called yylval that the lexer should set to contain any value associated with a token (the token itself is always returned as an integer, as defined by yacc)
–note: o.k. to return yytext[0], but not yytext: why?
    careful managing memory when copying strings (this coupling can’t be avoided with lex/yacc)
    %{
    #include <stdlib.h>
    #include <string.h>
    #include "y.tab.h"
    %}
    %%
    [a-zA-Z_][a-zA-Z0-9_]*  { yylval.sval = strdup(yytext); return(NAME); }
    [0-9]+                  { yylval.ival = atoi(yytext); return(NUMBER); }
    "="                     { return(EQUALS); }
    "+"                     { return(PLUS); }
    "-"                     { return(MINUS); }
    [ \t]                   { /* ignore whitespace */ }
    \n                      { return(0); /* logical EOF */ }
    %%
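To show what the scanner/parser contract amounts to, here is a hand-written C sketch of a scanner for the same tokens: it returns an integer token code and stores any associated value in a union, as the lex actions do with yylval. Several things are assumptions of this sketch and not what lex or yacc produce: the function name `next_token`, the `union value` type (standing in for the %union), reading from a string instead of yyin, the token code values (a real build takes them from y.tab.h), names consisting of a letter or underscore followed by letters, digits or underscores, and the -1 code for unrecognized characters.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes; 0 marks the logical end of input. */
enum { END = 0, NAME = 258, NUMBER, EQUALS, PLUS, MINUS };

/* Stand-in for the value type that %union describes. */
union value {
    int ival;
    char *sval;
};

/* Scans one token starting at *cursor, advances *cursor past it, stores
   any semantic value in *val and returns the token code. */
int next_token(const char **cursor, union value *val)
{
    const char *s = *cursor;
    int code;

    while (*s == ' ' || *s == '\t')          /* ignore whitespace */
        s++;

    if (*s == '\0' || *s == '\n') {          /* logical EOF */
        code = END;
    } else if (isalpha((unsigned char)*s) || *s == '_') {
        const char *start = s;
        size_t len;

        while (isalnum((unsigned char)*s) || *s == '_')
            s++;
        len = (size_t)(s - start);
        val->sval = malloc(len + 1);         /* copy: the caller must free it */
        if (val->sval != NULL) {
            memcpy(val->sval, start, len);
            val->sval[len] = '\0';
        }
        code = NAME;
    } else if (isdigit((unsigned char)*s)) {
        val->ival = 0;
        while (isdigit((unsigned char)*s))
            val->ival = val->ival * 10 + (*s++ - '0');
        code = NUMBER;
    } else {
        switch (*s++) {
        case '=': code = EQUALS; break;
        case '+': code = PLUS;   break;
        case '-': code = MINUS;  break;
        default:  code = -1;     break;      /* unrecognized character */
        }
    }
    *cursor = s;
    return code;
}
```

The NAME branch copies the matched text because the scanner's buffer is reused for the next token, which is the same reason the lex action calls strdup(yytext) and does not hand yytext itself to the parser.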
Understanding conflicts
Pointer model
–you can think of yacc processing as a “pointer” which moves through the yacc grammar as each token is read
–at first there is only 1 pointer; there may be >1 to represent partially recognized rules
–e.g. (↑ marks the pointer)
    start : A B C ;
  reads A and B
    start : A B ↑ C ;
This material is drawn from “lex & yacc (2e)”, Levine, Mason and Brown
Understanding conflicts (cont.)
Recall: a rule is reduced when a pointer reaches the end of the rule
e.g. (↑ marks a pointer)
    start : x | y ;
    x : A B z R ;
    y : A B z S ;
    z : C D ;
reads A and B
    x : A B ↑ z R ;
    y : A B ↑ z S ;
reads C
    z : C ↑ D ;
reads D
    z : C D ↑ ;
  z is reduced, so the pointers in x and y move past z:
    x : A B z ↑ R ;
    y : A B z ↑ S ;
Understanding conflicts (cont.)
Reduce/reduce conflict
–rule is reduced while there is more than one pointer
    start : x | y ;
    x : A ;
    y : A ;
reads A
    x : A ↑ ;
    y : A ↑ ;
  reduce rule x ? reduce rule y ?
Understanding conflicts (cont.)
Shift/reduce conflict
–one pointer reaches the end of a rule while another pointer is still in the middle of a rule
    start : x | y ;
    x : A R ;
    y : A ;
reads A
    x : A ↑ R ;
    y : A ↑ ;
  shift R in rule x ? reduce rule y ?
Understanding look-ahead issues
Keep in mind that the implementation of a parser algorithm is a separate issue from CFGs
yacc parsers use 1 token look-ahead
–the following is not a reduce/reduce error, as yacc makes decisions based on the next token as well
    start : x B | y C ;
    x : A ;
    y : A ;
–the following grammar is not ambiguous, but it requires 2 tokens of look-ahead
    yacc cannot do this, so: reduce/reduce error
    start : x B C | y B C ;
    x : A ;
    y : A ;
Understanding token typing
Default token type is int
%union identifies all possible C types that tokens can have, e.g.
    %union {
        char *str;
        double real;
        int integer;
    }
Permits symbols to be of type <str>, <real> or <integer>, with the type corresponding to the C type in the %union
Note: most of this is handled automatically for you; the declaration is what is important
Understanding token typing (cont.)
Now:
    %token <type> TOKEN1, TOKEN2, …
–declares all listed tokens to be of the stated type
    e.g. %token <str> NAME
–the NAME (terminal) token has an associated semantic value that corresponds to the type associated with the identifier str in the %union directive
What about non-terminals?
    %type <type> nonterm1, nonterm2, …
Issues
We are ignoring much in this overview:
–redefining input() and output() routines to work on sources other than streams (FILE *)
–default main() routines in yacc
–incorporating lexers and parsers as modules in a larger system
–changing the default names of files/internal functions/internal variables (necessary if you want more than one parser in a program)
–many internal variables/functions (yywrap, etc.)
–we are probably ignoring issues in covering the ignored issues
Additional resources
–Online manual
–“lex & yacc (2e)”, John Levine, Tony Mason & Doug Brown, O’Reilly, 1992
–The lex & yacc primer/HOWTO
–Google remains your friend (so I’m told)