Scanning & Parsing with Lex and YACC


Hans-Arno Jacobsen, ECE 297

Submissions: 99. Average for A2: 71%. Early submission bonus: 1. Full marks: 5.
16 teams attempted the nonce bonus; 7 got full marks. 7 teams attempted the ACC bonus.

Can we generate code to support mundane coding tasks and save time?
Scanning & parsing with Lex and YACC: powerful, but not easy.
This lecture gives you an example for Milestone 1.

CoursePeer – try it out!
    Developed by a former ECE297 student; many of the videos under tips & tricks are from him too.
    Short video about CoursePeer.
    To sign up and auto-enrol under ECE297, use this link: http://www.crspr.com/?rid=339
    We will have a quick demo and use it on Wednesday for our Q&A session.

Know your tools! Can we generate code based on a specification of what we want? Is the specification simpler than writing a program for doing the same task? Fully automated program generation has been a dream since the early days of computing.

Where do we need parsing in the storage server?
    Configuration file (file)
    Bulk loading of data files (file)
    Protocol messages (network)
    Command line arguments (string)

Parsing
    PROPERTY       VALUE
    server_host    localhost
    server_port    1111

default.conf – the way the disk may see it:
    server_host localhost \n server_port 1111 \n table marks \n # This data directory may be an absolute or relative path. \n data_directory ./data \n\n\n \EOF

Tokens:
    PROPERTY VALUE (TABLE TABLE-NAME)+
    server_host localhost server_port 1111 table marks data_directory ./data

Scenarios
Where would we like to save time by writing a quick language processor?

Conceptually speaking:
    Languages
        Data description language
        Script language
        Markup language
    System configurations
    Workload generation

In our storage servers:
    Languages
        Data schema & data
        Query language
        Output formatting (Web, LaTeX, PDF, Word, Excel)
    Storage server configuration
    Benchmarking

Parser generation from 30K feet
    Specification (written by developer) -> Generator -> Generated code
    Generated code + Other code (written by developer) -> Compiler / Linker -> Executable

Scanning & parsing I
    Input:      server_host localhost \n server_port 1111 \n table marks \n # This data
    Scanning:   produces the token stream PROPERTY VALUE PROPERTY VALUE …
    Parsing:    checks the token stream against PROPERTY VALUE (TABLE TABLE-NAME)+
    Processing: verify content, add to data structures, …

Regular expressions
    Patterns: (TABLE TABLE-NAME)+ matches
        TABLE TABLE-NAME
        TABLE TABLE-NAME TABLE TABLE-NAME
        TABLE TABLE-NAME TABLE TABLE-NAME TABLE TABLE-NAME …
    Regular expressions (formal languages)
    Extended regular expressions (UNIX)

Scanning & parsing II
Parsing is really two steps:
    Scanning (a.k.a. tokenizing or lexical analysis)
    Parsing, i.e., analysis of structure and syntax according to a grammar (i.e., a set of rules)
flex is the scanner generator (open source): Fast Lex, for lexical analysis
YACC is the parser generator: Yet Another Compiler Compiler, for structural and syntax analysis
Lex and YACC work together: the generated parser calls the generated scanner to obtain each token
We use flex (fast Lex) and Bison (GNU YACC)
There are myriad other tools for Java, C++, …, some of which combine Lex/Yacc into one tool (e.g., javacc)
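
To make the scanning step concrete, here is a minimal hand-written sketch of what a scanner for the configuration file does. It is not the flex-generated code: the enum, `classify`, and `scan_config` are hypothetical names chosen for illustration, and the sketch assumes words are separated by whitespace and that `#` starts a comment running to the end of the line.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Token kinds named after the tokens in config_parser.l; the enum itself is illustrative. */
enum token_kind { TOK_HOST_PROPERTY, TOK_PORT_PROPERTY, TOK_DDIR_PROPERTY,
                  TOK_TABLE, TOK_PORT_NUMBER, TOK_STRING };

/* Classify one word of length len (len is always at least 1 here). */
static enum token_kind classify(const char *word, size_t len)
{
    static const struct { const char *text; enum token_kind kind; } keywords[] = {
        { "server_host", TOK_HOST_PROPERTY },
        { "server_port", TOK_PORT_PROPERTY },
        { "data_directory", TOK_DDIR_PROPERTY },
        { "table", TOK_TABLE },
    };
    size_t i;

    for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        if (strlen(keywords[i].text) == len &&
            strncmp(word, keywords[i].text, len) == 0)
            return keywords[i].kind;
    }
    for (i = 0; i < len; i++) {
        if (!isdigit((unsigned char)word[i]))
            return TOK_STRING;          /* anything that is not all digits */
    }
    return TOK_PORT_NUMBER;             /* all digits */
}

/* Scan input into at most max token kinds; returns the number of tokens stored. */
size_t scan_config(const char *input, enum token_kind *out, size_t max)
{
    size_t n = 0;
    const char *p = input;

    assert(input != NULL && out != NULL);
    while (*p != '\0' && n < max) {
        if (isspace((unsigned char)*p)) {           /* skip whitespace */
            p++;
        } else if (*p == '#') {                     /* skip comment to end of line */
            while (*p != '\0' && *p != '\n')
                p++;
        } else {                                    /* read one word */
            const char *start = p;
            while (*p != '\0' && !isspace((unsigned char)*p))
                p++;
            out[n++] = classify(start, (size_t)(p - start));
        }
    }
    return n;
}
```

Calling `scan_config` on the text of default.conf yields the token kinds in order; the parsing step then only has to check that sequence against the grammar. A generated scanner does the same job from a much shorter specification.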

Objectives for today Cover the basics of Lex & Yacc Everybody should have an appreciation of the potential of these tools There is a lot more detail that remains unsaid To challenge you

Lex & YACC overview
    Input stream:
        server_host localhost \n server_port 1111 \n table marks \n # This data directory may be an absolute or relative path. \n data_directory ./data \n\n\n \EOF
    Lexical analyzer: input stream -> token stream (PROPERTY VALUE PROPERTY VALUE …)
    Structural analyzer: token stream -> output defined by actions in the parser specification (often an in-memory representation of the input)

Lexical Analysis with Lex

Lex introduction
    Input specification (*.l) -> flex -> lex.yy.c -> C compiler -> lexical analyzer
    Lexical analyzer: input stream -> token stream
    You generate the lexical analyzer by using flex (flex is fast Lex)
    You can control the name of the generated file (lex.yy.c)
    Synonyms: lexical analyzer, scanner, lexer, tokenizer

Lex
Input specification for lex – the "program"
    Three parts: Definitions, Rules, User code
    Use "%%" as a delimiter for each part
First part: Definitions
    Options used by flex inside the scanner
    Defines variables & macros
    Code within "%{" and "%}" is copied directly into the scanner (e.g., global variables, header files)
Second part: Rules
    Patterns and corresponding actions
    Actions are executed when the corresponding pattern(s) match
    Patterns are defined by regular expressions

Parsing the configuration file of Milestone 1
config_parser.l

Definitions (shorthands for use below):
    %{
    #include "config_parser.tab.h"
    ...
    %}
    a2Z     [a-zA-Z]
    host    server_host
    port    server_port
    dir     data_directory
    %%

Rules (pattern, then action):
    {host}      { return HOST_PROPERTY; }
    {port}      { return PORT_PROPERTY; }
    table       { return TABLE; }
    {dir}       { return DDIR_PROPERTY; }
    [\t\n ]+    { }
    #.*\n       { }
    {a2Z}*      { yylval.sval = strdup(yytext); return STRING; }
    [0-9]+      { yylval.pval = (int) atoi(yytext); return PORT_NUMBER; }
    .           { return yytext[0]; }
    …

flex pattern matching principles
    Actions are executed when patterns match
    Tokens are returned to the caller; on the next call, matching continues with the remaining input
    Patterns match a given input character or string only once; the input stream is consumed
    flex executes the action for the longest possible matching input
    Order of patterns in the spec. is important: when two patterns match the same longest input, the one listed first is chosen
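
The longest-match and rule-order principles can be sketched in plain C. This is an illustration only, not how flex is implemented (flex compiles all patterns into one automaton); the three matcher functions and `pick_rule` are hypothetical stand-ins for the patterns `table`, `[a-zA-Z]+`, and `[0-9]+`.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Each matcher returns how many leading characters of s it matches (0 = no match). */
static size_t match_table(const char *s)
{
    return strncmp(s, "table", 5) == 0 ? 5 : 0;
}

static size_t match_letters(const char *s)
{
    size_t n = 0;
    while (isalpha((unsigned char)s[n]))
        n++;
    return n;
}

static size_t match_digits(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* Rules in specification order: earlier rules win ties. */
static size_t (*const rules[])(const char *) = {
    match_table, match_letters, match_digits
};

/* Returns the index of the winning rule, or -1 if no rule matches.
   *len receives the length of the winning match. */
int pick_rule(const char *input, size_t *len)
{
    int best = -1;
    size_t best_len = 0;
    size_t i;

    assert(input != NULL && len != NULL);
    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i](input);
        if (n > best_len) {     /* strictly longer, so a later rule never wins a tie */
            best = (int)i;
            best_len = n;
        }
    }
    *len = best_len;
    return best;
}
```

On the input `table marks`, both the keyword rule and the letters rule match five characters, so the earlier keyword rule wins. On `tables`, the letters rule matches six characters and wins on length, which is why the keyword rules must come before the general string rule in the specification.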

Regular expressions
    Concise description of a character string
    Used widely in tools (editors, text retrieval, …)
Main operators
    A | B        matches A or B
    A (A | B)    matches A followed by A or B
    A*           0 or more occurrences of A
    A?           0 or 1 occurrence of A
    A+           1 or more occurrences of A
Note the flex syntax on the next slides.

flex regular expressions by example I
(Really: extended regular expressions)
    x           match the character 'x'
    .           any character (byte) except newline
    [xyz]       match either an 'x', a 'y', or a 'z'
    [abj-oZ]    match an 'a', a 'b', any letter from 'j' through 'o', or a 'Z'
    [^A-Z]      a "negated character class", i.e., any character EXCEPT those in the class
    [^A-Z\n]    any character EXCEPT an uppercase letter or a newline

flex regular expressions by example II
(r is any regular expression)
    r*          zero or more r's
    r+          one or more r's
    r?          zero or one r (that is, "an optional r")
    r{2,5}      anywhere from two to five r's
    r{2,}       two or more r's
    r{4}        exactly 4 r's
    <<EOF>>     an end-of-file

flex regular expressions There are many more expressions, see manual Form complex expressions E.g.: IP address, names, … The expression syntax is used in other tools as well (well worth learning)
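
As a sketch of how the operators combine into a more complex expression, the following uses the POSIX `regcomp`/`regexec` functions with an extended regular expression for a dotted-quad IP address. This is an assumption-laden illustration: the function name `looks_like_ipv4` is made up, the pattern only checks the shape (it does not verify that each part is at most 255), and POSIX extended syntax is close to but not identical to flex syntax.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if s has the shape d.d.d.d with 1 to 3 digits per part, else 0.
   Numeric ranges are NOT checked, so "999.1.1.1" is accepted. */
int looks_like_ipv4(const char *s)
{
    regex_t re;
    int matched;

    assert(s != NULL);
    /* [0-9]{1,3} is "one to three digits"; (\.[0-9]{1,3}){3} repeats ".digits" three times. */
    if (regcomp(&re, "^[0-9]{1,3}(\\.[0-9]{1,3}){3}$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    matched = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

The same building blocks (character class, interval, grouping, escaped dot) carry over to a flex rule, where the pattern would sit in the rules section next to its action.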

Parsing the configuration file of Milestone 1
config_parser.l

    %{
    #include "config_parser.tab.h"
    ...
    %}
    a2Z     [a-zA-Z]
    host    server_host
    port    server_port
    dir     data_directory
    %%
    {host}      { return HOST_PROPERTY; }
    {port}      { return PORT_PROPERTY; }
    table       { return TABLE; }
    {dir}       { return DDIR_PROPERTY; }
    [\t\n ]+    { }
    #.*\n       { }
    {a2Z}*      { yylval.sval = strdup(yytext); return STRING; }
    [0-9]+      { yylval.pval = (int) atoi(yytext); return PORT_NUMBER; }
    .           { return yytext[0]; }
    <<EOF>>     { return 0; }

yylval.sval and yylval.pval: user-defined variables in YACC (they convey the token value to YACC)

Example input:
    server_host localhost
    server_port 1111
    table marks
    data_directory ./data

Parsing with Yacc

YACC introduction
    Input specification (*.y) -> YACC -> y.tab.c -> C compiler -> syntax analyzer / parser
    Syntax analyzer / parser: token stream (e.g., via flex) -> output defined by actions in the parser specification
    You can control the name of the generated file (y.tab.c)
    From the specified grammar, YACC generates a parser which recognizes "sentences" according to the grammar

YACC
Input specification for YACC (similar to flex)
    Three parts: Definitions, Rules, User code
    Use "%%" as a delimiter for each part
First part: Definitions
    Definition of tokens for the second part and for use by flex
    Definition of variables for use by the parser code
Second part: Rules
    Grammar for the parser
Third part: User code
    The code in this part is copied into the parser generated by YACC

Configuration file parser Milestone 1
Definition section, config_parser.y

    %{
    #include <string.h>
    #include <stdio.h>
    #include <stdlib.h>   /* malloc, used in the rule actions */

    struct table *tl, *t;
    struct configuration *c;

    /* define a linked list of table names */
    struct table {
        char *table_name;
        struct table *next;
    };

    /* define a structure for the configuration information */
    struct configuration {
        char *host;
        int port;
        struct table *tlist;
        char *data_dir;
    };

Configuration file parser Milestone 1
Definition section cont'd., config_parser.y

    %}
    %union {
        char *sval;   // String value (user defined)
        int pval;     // Port number value (user defined)
    }
    %token <sval> STRING
    %token <pval> PORT_NUMBER
    %token HOST_PROPERTY PORT_PROPERTY DDIR_PROPERTY TABLE
    %%

Configuration file parser Milestone 1
(Grammar) Rules section (simplified), config_parser.y

    property_list:
          HOST_PROPERTY STRING
          PORT_PROPERTY PORT_NUMBER
          table_list
          data_directory
        ;
    table_list:
          table_list TABLE STRING
        | TABLE STRING
        ;
    data_directory:
          DDIR_PROPERTY STRING
        ;
    %%

(Grammar) Rules section (details), config_parser.y

    struct configuration {
        char *host;
        int port;
        struct table *tlist;
        char *data_dir;
    };
    struct configuration *c;

    data_directory:
          DDIR_PROPERTY STRING        /* $1 $2 */
          {
              c = (struct configuration *) malloc(sizeof(struct configuration));
              // Check c for NULL
              c->data_dir = strdup( $2 );
          }
        ;

(Grammar) Rules section (details), config_parser.y

    struct configuration {
        char *host;
        int port;
        struct table *tlist;
        char *data_dir;
    };
    struct configuration *c;

    property_list:
          HOST_PROPERTY STRING
          PORT_PROPERTY PORT_NUMBER
          table_list
          data_directory
          {
              c->host = strdup( $2 );
              c->port = $4;
              c->tlist = tl;
          }
        ;

Configuration file parser Milestone 1
(Grammar) Rules section (simplified), config_parser.y

    property_list:
          HOST_PROPERTY STRING
          PORT_PROPERTY PORT_NUMBER
          table_list
          data_directory
        ;
    table_list:
          table_list TABLE STRING
        | TABLE STRING
        ;
    data_directory:
          DDIR_PROPERTY STRING
        ;
    %%

table_list matches a sequence … TABLE STRING TABLE STRING

table_list is a recursive rule
Example table specification in configuration file:
    table MyCourses
    table MyMarks
    table MyFriends

    table_list:
          table_list TABLE STRING
        | TABLE STRING
        ;

Terminology
    table_list is called a non-terminal
    TABLE & STRING are terminals

Recursive rule execution
    table_list:
          table_list TABLE STRING
        | TABLE STRING
        ;

Input: table MyCourses table MyMarks table MyFriends

    Input read so far                                Matched as                  Reduced to
    table MyCourses                                  TABLE STRING                table_list
    table MyCourses table MyMarks                    table_list TABLE STRING     table_list
    table MyCourses table MyMarks table MyFriends    table_list TABLE STRING     table_list

    struct table *tl, *t;
    struct table {
        char *table_name;
        struct table *next;
    };

    table_list:
          table_list TABLE STRING     /* $1 $2 $3 */
          {
              t = (struct table *) malloc(sizeof(struct table));
              t->table_name = strdup( $3 );
              t->next = tl;       /* new node points at the current head */
              tl = t;             /* new node becomes the head */
          }
        | TABLE STRING                /* $1 $2 */
          {
              tl = (struct table *) malloc(sizeof(struct table));
              tl->table_name = strdup( $2 );
              tl->next = NULL;    /* first node ends the list */
          }
        ;

config_parser.y
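
The effect of these two actions can be tried outside the parser. The sketch below assumes the same `struct table` and uses a hypothetical helper, `table_list_add`, that prepends a node exactly as the actions do (it copies the name by hand rather than with `strdup`, and it checks the `malloc` results). `table_list_free` is likewise an illustrative addition.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct table {
    char *table_name;
    struct table *next;
};

/* Prepend a copy of name to the list, as the grammar actions do.
   Returns the new head, or NULL if an allocation fails (the old list is untouched). */
struct table *table_list_add(struct table *head, const char *name)
{
    struct table *t;
    size_t len;

    assert(name != NULL);
    t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    len = strlen(name) + 1;
    t->table_name = malloc(len);
    if (t->table_name == NULL) {
        free(t);
        return NULL;
    }
    memcpy(t->table_name, name, len);
    t->next = head;     /* t->next = tl */
    return t;           /* tl = t       */
}

/* Release every node and its name. */
void table_list_free(struct table *head)
{
    while (head != NULL) {
        struct table *next = head->next;
        free(head->table_name);
        free(head);
        head = next;
    }
}
```

Adding MyCourses, MyMarks, and MyFriends in that order leaves MyFriends at the head: because each new node is prepended, the list ends up in the reverse of the order in the configuration file. Code that depends on file order would need to append instead or reverse the list afterwards.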

How to invoke the parser

    int main (int argc, char **argv) {
        FILE *f;
        extern FILE *yyin;

        if (argc == 2) {
            f = fopen(argv[1], "r");
            if (!f) {
                … // error handling …
            }
            yyin = f;
            while ( ! feof(yyin) ) {
                if (yyparse() != 0) {
                    …
                    yyerror("");
                    exit(1);    /* non-zero exit status signals the parse error */
                }
            }
            fclose(f);
        }
        return 0;
    }

yylex() calls the generated scanner; by default it is called within yyparse()

In the Makefile

    lexer: config_parser.l
            ${LEX} config_parser.l
            ${CC} ${CFLAGS} ${INCLUDE} -c lex.yy.c

    yaccer: config_parser.y
            ${YACC} -d config_parser.y
            ${CC} ${CFLAGS} ${INCLUDE} -c config_parser.tab.c

    parser: config_parser.tab.o lex.yy.o
            ${CC} ${CFLAGS} ${INCLUDE} -c parser.c
            ${CC} -o p ${CFLAGS} ${INCLUDE} lex.yy.o \
                    config_parser.tab.o \
                    parser.o

Benefits
    Faster development compared to manual implementation
    Easier to change the specification and generate a new parser than to modify 1000s of lines of code to add, change, or delete a feature
    Less error-prone, as code is generated
Cost: learning curve
    Invest once, amortized over a 40+ year career

If you want to know more
    The lecture, examples, and some recommended reading are enough to tackle all of the parsing for Milestones 3 & 4
    3rd and 4th year lectures on Compilers may show you the algorithms behind & inside Lex & YACC
    Lectures on Computability and Theory of Computation may also show you these algorithms

A flex specification

The header:
    %{
    #include <stdio.h>
    #include "y.tab.h"
    int c;
    extern int yylval;
    %}
    %%

The "guts": regular expressions annotated with actions
    " "            ;
    [a-z]          { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
    [0-9]          { c = yytext[0]; yylval = c - '0'; return(DIGIT); }
    [^a-z0-9\b]    { c = yytext[0]; return(c); }

The header

    %{
    #include <stdio.h>
    #include "y.tab.h"
    int c;
    extern int yylval;
    %}
    %%

    int c: temporary variable(s)
    yylval: special variable, set in the scanner and read in the parser, for transferring the values associated with tokens to the parser
    %%: dividing line between header and rules section

The rules

    %%
    " "            ;
    [a-z]          { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
    [0-9]          { c = yytext[0]; yylval = c - '0'; return(DIGIT); }
    [^a-z0-9\b]    { c = yytext[0]; return(c); }

    yytext: the string associated with the token

The rules

    %%
    " "            ;
    [a-z]          { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
    [0-9]          { c = yytext[0]; yylval = c - '0'; return(DIGIT); }
    [^a-z0-9\b]    { c = yytext[0]; return(c); }

    [a-z]: sets yylval to the character's alphabetical order
    [0-9]: sets yylval to the digit's numerical value
    otherwise: simply returns that character; presumably it's an operator: +*-, etc.

Simple example
Implement a calculator which can recognize adding or subtracting of numbers

    [linux33]% ./y_calc
    1+101
    = 102
    [linux33]% ./y_calc
    1000-300+200+100
    = 1000
    [linux33]%

Example – the Lex part

Definitions:
    %{
    #include <stdlib.h>   /* atoi */
    #include <math.h>
    #include "y.tab.h"
    extern int yylval;
    %}
    %%

Rules (pattern, then action):
    [0-9]+    { yylval = atoi(yytext); return NUMBER; }
    [\t ]+    ;                 /* Do nothing for white space */
    \n        return 0;         /* End of the logic */
    .         return yytext[0];

Example – the Yacc part

Definitions:
    %{
    #include <stdio.h>    /* printf */
    %}
    %token NAME NUMBER
    %%

Rules:
    statement:  NAME '=' expression
             |  expression             { printf("= %d\n", $1); }
             ;
    expression: expression '+' NUMBER  { $$ = $1 + $3; }
             |  expression '-' NUMBER  { $$ = $1 - $3; }
             |  NUMBER                 { $$ = $1; }
             ;

Include the Yacc library (-ly) when linking
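
For comparison, here is a hand-written sketch of what the `expression` rules compute. It is not the code YACC generates (YACC emits a table-driven LALR(1) parser); `eval_expression` and its helpers are hypothetical names, and the sketch assumes non-negative integer operands with no overflow checking. The loop mirrors the left recursion in the grammar: the accumulator plays the role of `$1`, the new number the role of `$3`.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Skip spaces and tabs, like the scanner's white space rule. */
static void skip_blanks(const char **p)
{
    while (**p == ' ' || **p == '\t')
        (*p)++;
}

/* Read a NUMBER at *p and advance past it; returns 0 if there is no digit. */
static int read_number(const char **p, int *value)
{
    if (!isdigit((unsigned char)**p))
        return 0;
    *value = 0;
    while (isdigit((unsigned char)**p)) {
        *value = *value * 10 + (**p - '0');
        (*p)++;
    }
    return 1;
}

/* expression: expression '+' NUMBER | expression '-' NUMBER | NUMBER
   Returns 1 and stores the value in *result, or 0 on a syntax error.
   Input ends at '\0' or at a newline, as in the scanner's "\n return 0" rule. */
int eval_expression(const char *input, int *result)
{
    const char *p = input;
    int acc, operand;

    assert(input != NULL && result != NULL);
    skip_blanks(&p);
    if (!read_number(&p, &acc))             /* expression: NUMBER */
        return 0;
    for (;;) {
        char op;

        skip_blanks(&p);
        if (*p == '\0' || *p == '\n')
            break;
        op = *p;
        if (op != '+' && op != '-')
            return 0;
        p++;
        skip_blanks(&p);
        if (!read_number(&p, &operand))
            return 0;
        /* $$ = $1 + $3  or  $$ = $1 - $3 */
        acc = (op == '+') ? acc + operand : acc - operand;
    }
    *result = acc;
    return 1;
}
```

Because the grammar is left-recursive, operators group from the left, so `10-3-2` evaluates as `(10-3)-2`. Extending this hand-written version with precedence or parentheses takes real restructuring, whereas in the Yacc specification it is a matter of adding rules.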