Compiler Design 3. Lexical Analyzer, Flex


Compiler Design 3. Lexical Analyzer, Flex Kanat Bolazar January 26, 2010

Lexical Analyzer
The main task of the lexical analyzer is to read the input source program character by character and produce a sequence of tokens that the parser can use for syntactic analysis.
- Typical interface: the parser calls the lexical analyzer to produce one token at a time
  - Maintain the internal state of reading the input program (with lines)
  - Provide a function “getNextToken” that reads some characters from the current position in the input and returns a token to the parser
- Other tasks of the lexical analyzer include
  - Skipping or hiding whitespace and comments
  - Keeping track of line numbers for error reporting
    - Sometimes it can also produce the annotated lines for error reports
  - Producing the value of the token
  - Optional: inserting identifiers into the symbol table
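
The “getNextToken” interface described on this slide might look roughly like the following hand-written C sketch. The names Scanner, Token, scanner_init, and get_next_token, and the four token kinds, are assumptions made for illustration; they do not come from the slides or from Flex.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds for this sketch. */
typedef enum { TOK_INT, TOK_IDENT, TOK_OTHER, TOK_EOF } TokenKind;

typedef struct {
    TokenKind kind;
    const char *start;  /* first character of the lexeme */
    size_t length;      /* lexeme length in characters */
    int line;           /* line on which the token starts */
} Token;

/* Internal scanner state: position in the input and current line. */
typedef struct {
    const char *pos;
    int line;
} Scanner;

void scanner_init(Scanner *s, const char *input)
{
    s->pos = input;
    s->line = 1;
}

/* Read characters from the current position and return one token. */
Token get_next_token(Scanner *s)
{
    /* Skip whitespace, counting newlines for error reporting. */
    while (isspace((unsigned char)*s->pos)) {
        if (*s->pos == '\n')
            s->line++;
        s->pos++;
    }

    Token t = { TOK_EOF, s->pos, 0, s->line };
    if (*s->pos == '\0')
        return t;               /* end of input: report an EOF token */

    if (isdigit((unsigned char)*s->pos)) {
        t.kind = TOK_INT;
        while (isdigit((unsigned char)*s->pos))
            s->pos++;
    } else if (isalpha((unsigned char)*s->pos) || *s->pos == '_') {
        t.kind = TOK_IDENT;
        while (isalnum((unsigned char)*s->pos) || *s->pos == '_')
            s->pos++;
    } else {
        t.kind = TOK_OTHER;     /* any other single character */
        s->pos++;
    }
    t.length = (size_t)(s->pos - t.start);
    return t;
}
```

A parser would call scanner_init once and then call get_next_token repeatedly until it receives TOK_EOF. Keeping the position and line number in a struct, rather than in globals, is a design choice of this sketch.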

Character Level Scanning
- The lexical analyzer needs a well-defined set of valid characters
  - Produce invalid character errors
  - Delete invalid characters from the token stream so that they are not used in the parser's analysis
    - E.g. don’t want invisible characters in error messages
- For every end-of-line, keep track of line numbers for error reporting
- Skip over or hide whitespace and comments
  - If comments are nested (not common), must keep track of nesting to find the end of a comment
  - May produce hidden tokens, for convenience of scanner structure
- Always produce an end-of-file token
- Important that quoted strings and comments don’t get stuck if an unexpected end of file occurs
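
As a minimal sketch of the nesting and end-of-file points above, the C function below skips a nested comment while counting lines. It assumes ML-style (* ... *) delimiters, chosen only because they nest naturally; the function name skip_nested_comment is hypothetical.

```c
#include <stddef.h>

/*
 * Skip a comment written as (* ... *) that may contain nested comments.
 * p must point at the opening "(*". Returns a pointer just past the
 * matching close, or NULL if the input ends inside the comment.
 * *line is incremented for every newline skipped.
 */
const char *skip_nested_comment(const char *p, int *line)
{
    int depth = 0;
    while (*p != '\0') {
        if (p[0] == '(' && p[1] == '*') {
            depth++;                /* one level deeper */
            p += 2;
        } else if (p[0] == '*' && p[1] == ')') {
            depth--;                /* one level closed */
            p += 2;
            if (depth == 0)
                return p;           /* outermost comment is finished */
        } else {
            if (*p == '\n')
                (*line)++;          /* keep line numbers accurate */
            p++;
        }
    }
    return NULL;    /* unexpected end of file: caller reports an error */
}
```

Returning NULL instead of looping lets the caller report the unterminated comment and still produce the end-of-file token.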

Tokens, Token Types and Values
- The set of tokens is typically something like the following table
  - Or may have separate token types for different operators or reserved words
  - May want to keep the line number with each token
Token Type            Token Value          Informal Description
Integer constant      Numeric value        Numbers like 3, -5, 12 without decimal points
Floating constant     Numeric value        Numbers like 3.0, -5.1, 12.2456789
Reserved word         Word string          Words like if, then, class, …
Identifiers           Symbol table index   Words not reserved, starting with a letter or _ and containing only letters, _, and digits
Relations             Operator string      <, <=, ==, …
Operators             Operator string      =, +, -, ++, …
Char constant         Char value           ‘A’, …
String                                     “this is a string”, …
Hidden: end-of-line
Hidden: comment
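
One possible C representation of the token table is a tagged union, sketched below. The type and field names (TokenType, Token, classify_word, token_is_hidden) and the short reserved word list are illustrative assumptions, not part of the slides.

```c
#include <string.h>

typedef enum {
    TT_INT_CONST, TT_FLOAT_CONST, TT_RESERVED, TT_IDENT,
    TT_RELATION, TT_OPERATOR, TT_CHAR_CONST, TT_STRING,
    TT_EOL, TT_COMMENT          /* hidden: not passed to the parser */
} TokenType;

typedef struct {
    TokenType type;
    int line;                   /* line number kept with each token */
    union {
        long int_value;         /* integer constant */
        double float_value;     /* floating constant */
        const char *text;       /* reserved word, operator, or string */
        int symbol_index;       /* identifier: symbol table index */
        char char_value;        /* char constant */
    } value;
} Token;

/* Hidden tokens are recognized but never handed to the parser. */
int token_is_hidden(TokenType type)
{
    return type == TT_EOL || type == TT_COMMENT;
}

/* Decide whether a scanned word is reserved; the list is illustrative. */
TokenType classify_word(const char *word)
{
    static const char *const reserved[] = { "if", "then", "class" };
    for (size_t i = 0; i < sizeof reserved / sizeof reserved[0]; i++)
        if (strcmp(word, reserved[i]) == 0)
            return TT_RESERVED;
    return TT_IDENT;
}
```

The union keeps one value per token, and the type field tells the parser which member is valid.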

Token Actions
- Each token recognized can have an action function
  - Many token types produce a value
    - In the case of numeric values, make sure proper numeric errors are produced, e.g. integer overflow
  - Put identifiers in the symbol table
    - Note that at this time, no effort is made to distinguish scope; there will be one symbol table entry for each identifier
    - Later, separate scope instances will be produced
- Other types of actions
  - End-of-line (can be treated as a token type that doesn’t output to the parser)
    - Increment line number
    - Get next line of input to scan
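
The two value-producing actions named above could be sketched in C as follows. The function names intern_identifier and integer_value, the fixed table capacity, and the linear search are assumptions for illustration; a real compiler would more likely use a hash table.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SYMBOLS 256   /* arbitrary capacity for this sketch */

static char *symbols[MAX_SYMBOLS];
static int symbol_count = 0;

/*
 * Action for identifiers: return the table index of name, adding it on
 * first sight. Scope is ignored, so each spelling gets exactly one entry.
 * Returns -1 if the table is full or memory runs out.
 */
int intern_identifier(const char *name)
{
    for (int i = 0; i < symbol_count; i++)
        if (strcmp(symbols[i], name) == 0)
            return i;
    if (symbol_count == MAX_SYMBOLS)
        return -1;
    size_t n = strlen(name) + 1;
    char *copy = malloc(n);
    if (copy == NULL)
        return -1;
    memcpy(copy, name, n);
    symbols[symbol_count] = copy;
    return symbol_count++;
}

/*
 * Action for integer constants: convert the lexeme and detect overflow.
 * Returns 0 on success, -1 if the value does not fit in an int.
 */
int integer_value(const char *lexeme, int *out)
{
    errno = 0;
    long v = strtol(lexeme, NULL, 10);
    if (errno == ERANGE || v > INT_MAX || v < INT_MIN)
        return -1;
    *out = (int)v;
    return 0;
}
```

The index returned by intern_identifier is what the token table calls the symbol table index; a -1 from integer_value is where the scanner would report an integer overflow error.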

Testing
- Execute the lexical analyzer with test cases and compare the results with the expected results
- Test cases
  - Exercise every part of the lexical analyzer code
  - Produce every error message
  - Don’t have to be valid programs – just valid sequences of tokens

Lex and Yacc
- Two classical tools for compilers:
  - Lex: A Lexical Analyzer Generator
  - Yacc: “Yet Another Compiler Compiler”
- Lex creates programs that scan your tokens one by one.
- Yacc takes a grammar (sentence structure) and generates a parser.
[Diagram: Lexical Rules → Lex → yylex(); Grammar Rules → Yacc → yyparse(); Input → yylex() → yyparse() → Parsed Input]
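
To show how the two generated functions cooperate, here is a hand-written C stand-in for each: yyparse() pulls tokens from yylex() one at a time. Neither function below is tool output. The grammar (a sum of integers), the token code values, and the wrapper parse_sum are assumptions chosen to keep the sketch small; only the conventions that token code 0 means end of input and that yyparse() returns 0 on success follow Yacc.

```c
#include <ctype.h>

enum { END = 0, NUMBER = 258, PLUS = 259 };  /* token codes, values arbitrary */

static const char *input;   /* text being scanned */
static int yylval;          /* value of the most recent NUMBER token */
static int result;          /* sum computed by the parser */

/* Hand-written stand-in for the scanner Lex would generate. */
static int yylex(void)
{
    while (*input == ' ')
        input++;
    if (*input == '\0')
        return END;
    if (isdigit((unsigned char)*input)) {
        yylval = 0;
        while (isdigit((unsigned char)*input))
            yylval = yylval * 10 + (*input++ - '0');
        return NUMBER;
    }
    if (*input == '+') {
        input++;
        return PLUS;
    }
    input++;
    return -1;              /* invalid character */
}

/* Hand-written stand-in for a parser of: sum -> NUMBER (PLUS NUMBER)* */
static int yyparse(void)
{
    int tok = yylex();
    if (tok != NUMBER)
        return 1;
    result = yylval;
    while ((tok = yylex()) == PLUS) {
        if (yylex() != NUMBER)
            return 1;
        result += yylval;
    }
    return tok == END ? 0 : 1;
}

/* Parse text; returns 0 on success and stores the sum in *sum. */
int parse_sum(const char *text, int *sum)
{
    input = text;
    if (yyparse() != 0)
        return 1;
    *sum = result;
    return 0;
}
```

Calling parse_sum("1 + 2 + 39", &sum) drives the whole pipeline in the diagram: input goes to yylex(), tokens go to yyparse(), and the parsed result comes out.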

Flex: A Fast Scanner Generator
- Often, instead of the standard Lex and Yacc, Flex and Bison are used:
  - Flex: A fast lexical analyzer generator (free software, commonly used alongside GNU tools, though not itself a GNU project)
  - Bison: The GNU parser generator, a drop-in replacement for (backwards compatible with) Yacc
- Resources:
  - http://en.wikipedia.org/wiki/Flex_lexical_analyser
  - http://en.wikipedia.org/wiki/GNU_Bison
  - http://dinosaur.compilertools.net/ (the Lex & Yacc Page)

Flex Example 1: Delete This
Shortest Flex example, “deletethis.l”:
    %%
    deletethis
This scanner matches the word “deletethis” and discards it instead of echoing it; echoing is the default behavior only for unmatched text.
Compile and run it:
    $ flex deletethis.l              # creates lex.yy.c
    $ gcc -o scan lex.yy.c -lfl      # fl: flex library
    $ ./scan
    This deletethis is not deletethis useful.
    This  is not  useful.
    ^D

Flex Example 2: Replace This
Another very short Flex example, “replacer.l”:
    %%
    replacethis printf("replaced");
This scanner will match “replacethis” and replace it with “replaced”.
Compile and run it:
    $ flex -o replacer.yy.c replacer.l
    $ gcc -o replacer replacer.yy.c -lfl
    $ ./replacer
    This replacethis is not very replacethis useful.
    This replaced is not very replaced useful.
    Please dontreplacethisatall.
    Please dontreplacedatall.

Flex Example 3: Common Errors
Let's replace “the the” with “the”:
    %%
    the the printf(“the”);
    uhh
Unfortunately, this does not work. An unquoted pattern ends at the first space, so the second “the” is considered part of the C code:
    the                    <-- pattern
    the printf(“the”);     <-- action (C code)
Also, the matching open and close double quotes (“ ”) used in documents will give errors, so you must always replace them with straight quotes: “the” → "the"

Flex Example 3: Common Errors, cont'd
You discover such errors when you compile the C code, not when you use flex:
    $ flex -o errors.yy.c errors.l
    $ gcc -o errors errors.yy.c -lfl
    errors.l: In function ‘yylex’:
    errors.l:2: error: ‘the’ undeclared ...
The error is reported against our errors.l file, but we can also find it in errors.yy.c:
    case 1:
    YY_RULE_SETUP
    #line 2 "errors.l"        <-- For error reporting
    the printf("the");        <-- the ? not C code
    YY_BREAK
    case 2:

Flex Example 4: Replace Duplicate
Let's replace “the the” with “the”:
    %%
    "the the" printf("the");
This time, it works:
    $ flex -o duplicate.yy.c duplicate.l
    $ gcc -o duplicate duplicate.yy.c -lfl
    $ ./duplicate
    This is the the file.
    This is the file.
    This is the the the file.
    This is the the file.
    Lathe theory
    Latheory

Flex Example 4: Replace And Delete
Let's replace “the the” with “the” and delete “uhh”:
    %%
    "the the" printf("the");
    uhh
Run as before:
    This uhh is the the uhhh file.
    This  is the h file.
Generally, lexical rules are pattern-action pairs:
    pattern1 action1 (C code)
    pattern2 action2
    ...
Tokens almost never go across space chars as in "the the" above. Regular expressions are often needed and used.
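
The pattern-action idea can be imitated in plain C with a rule table, as in the sketch below. It handles only literal patterns (no regular expressions), and the names Rule and apply_rules are hypothetical. Like a Flex scanner, it prefers the longest match at each position, breaks ties in favor of the earlier rule, and echoes unmatched characters.

```c
#include <stddef.h>
#include <string.h>

/* A rule pairs a literal pattern with the text its action emits. */
typedef struct {
    const char *pattern;
    const char *replacement;   /* "" means the action emits nothing */
} Rule;

/*
 * Scan in, writing the result to out (capacity out_size, including the
 * terminating NUL). At each position the longest matching pattern wins,
 * with ties going to the earlier rule; unmatched characters are echoed.
 * Returns 0 on success, -1 if out is too small.
 */
int apply_rules(const Rule *rules, size_t n_rules,
                const char *in, char *out, size_t out_size)
{
    size_t used = 0;
    if (out_size == 0)
        return -1;
    while (*in != '\0') {
        const Rule *best = NULL;
        size_t best_len = 0;
        for (size_t i = 0; i < n_rules; i++) {
            size_t len = strlen(rules[i].pattern);
            if (len > best_len && strncmp(in, rules[i].pattern, len) == 0) {
                best = &rules[i];
                best_len = len;
            }
        }
        /* Matched: emit the replacement. Unmatched: echo one character. */
        const char *emit = best ? best->replacement : in;
        size_t emit_len = best ? strlen(emit) : 1;
        if (used + emit_len + 1 > out_size)
            return -1;
        memcpy(out + used, emit, emit_len);
        used += emit_len;
        in += best ? best_len : 1;
    }
    out[used] = '\0';
    return 0;
}
```

With the rules { "the the", "the" } and { "uhh", "" }, this function turns “Lathe theory” into “Latheory”, the same result the Flex scanner gives.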

Flex File Structure
In Lex and Flex, the general rule file structure is:
    definitions
    %%
    rules
    %%
    user code
Definitions:
    DIGIT [0-9]
    ID [a-z][a-z0-9]*
can be used later in rules with {DIGIT}, {ID}, etc:
    {DIGIT}+"."{DIGIT}*
This is the same as:
    ([0-9])+"."([0-9])*
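
The substitution of named definitions can be illustrated with a small C function that rewrites {NAME} as (body). This is a simplified sketch, not Flex's own implementation: the names Definition and expand_definitions are hypothetical, and it does not handle escaped quotes or braces inside character classes.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;
    const char *body;
} Definition;

/* Append len bytes of text to out, keeping it NUL-terminated. */
static int append(char *out, size_t out_size, size_t *used,
                  const char *text, size_t len)
{
    if (*used + len + 1 > out_size)
        return -1;
    memcpy(out + *used, text, len);
    *used += len;
    out[*used] = '\0';
    return 0;
}

/*
 * Copy pattern into out, replacing each {NAME} that matches a definition
 * with (body). Text inside double quotes and unknown {...} groups, such
 * as the repeat count {2,3}, are copied unchanged.
 * Returns 0 on success, -1 if out (capacity out_size) is too small.
 */
int expand_definitions(const Definition *defs, size_t n_defs,
                       const char *pattern, char *out, size_t out_size)
{
    size_t used = 0;
    int in_quotes = 0;
    if (out_size == 0)
        return -1;
    out[0] = '\0';
    while (*pattern != '\0') {
        const Definition *def = NULL;
        size_t name_len = 0;
        if (*pattern == '"')
            in_quotes = !in_quotes;
        if (*pattern == '{' && !in_quotes) {
            const char *close = strchr(pattern, '}');
            if (close != NULL) {
                name_len = (size_t)(close - pattern - 1);
                for (size_t i = 0; i < n_defs && def == NULL; i++)
                    if (strlen(defs[i].name) == name_len &&
                        strncmp(pattern + 1, defs[i].name, name_len) == 0)
                        def = &defs[i];
            }
        }
        if (def != NULL) {
            if (append(out, out_size, &used, "(", 1) != 0 ||
                append(out, out_size, &used, def->body, strlen(def->body)) != 0 ||
                append(out, out_size, &used, ")", 1) != 0)
                return -1;
            pattern += name_len + 2;    /* skip {NAME} */
        } else {
            if (append(out, out_size, &used, pattern, 1) != 0)
                return -1;
            pattern++;
        }
    }
    return 0;
}
```

The parentheses around each body matter: without them, a definition containing | would change meaning when a + or * follows the reference.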

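Flex Example 5: Count Lines
    %{
    #include <stdio.h>
    int num_lines = 0, num_chars = 0;   /* counters updated by the rules */
    %}
    %%
    \n      ++num_lines; ++num_chars;
    .       ++num_chars;
    %%
    int main(void)
    {
        yylex();    /* scan all of standard input */
        printf( "# of lines = %d, # of chars = %d\n", num_lines, num_chars );
        return 0;
    }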

Some Regular Expressions for Flex
    \"[^"]*\"                  string
    "\t"|"\n"|" "              whitespace (most common forms)
    [a-zA-Z]                   a single letter
    [a-zA-Z_][a-zA-Z0-9_]*     identifier: allows a, aX, a45__
    [0-9]*"."[0-9]+            allows .5 but not 5.
    [0-9]+"."[0-9]*            allows 5. but not .5
    [0-9]*"."[0-9]*            allows . by itself !!
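
The claims in this table can be checked outside Flex with the POSIX regex.h functions, as in this sketch. It assumes a POSIX system; the helper name matches_fully is hypothetical, and each Flex pattern has to be translated to POSIX extended syntax first (the quoted "." becomes \.).

```c
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Return 1 if the whole of text matches the POSIX extended regular
 * expression pattern, 0 if it does not, and -1 if the pattern is too
 * long for the local buffer or fails to compile.
 */
int matches_fully(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;

    /* Anchor the pattern so a partial match is not accepted. */
    int n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, matches_fully("[0-9]*\\.[0-9]*", ".") returns 1, confirming the warning in the last row that the pattern accepts a lone dot.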

Resources
- Aho, Lam, Sethi, and Ullman, Compilers: Principles, Techniques, and Tools, 2nd ed. Addison-Wesley, 2006. (The “purple dragon book”)
- Flex Manual. Available as a single PostScript file at the Lex and Yacc page online: http://dinosaur.compilertools.net/#flex
- http://en.wikipedia.org/wiki/Flex_lexical_analyser