Lexical Analysis
Textbook: Modern Compiler Design, Chapter 2.1



A motivating example
Create a program that counts the number of lines in a given input text file

Solution (Flex)
    %{
    #include <stdio.h>
    int num_lines = 0;
    %}
    %%
    \n      ++num_lines;
    .       ;
    %%
    int main(void)
    {
        yylex();
        printf("# of lines = %d\n", num_lines);
        return 0;
    }

Solution (Flex), as an automaton
[Figure: automaton for the same specification, with a state labeled "initial" and edges labeled "newline" (\n) and "other"]

JLex Spec File
User code
    – Copied directly to Java file
    – Possible source of javac errors down the road
%%
JLex directives
    – Define macros, state names
    – Examples: DIGIT= [0-9], LETTER= [a-zA-Z], YYINITIAL
%%
Lexical analysis rules
    – Optional state, regular expression, action
    – How to break input to tokens
    – Action when token matched
    – Example: {LETTER}({LETTER}|{DIGIT})*

Jlex linecount
File: lineCount
    import java_cup.runtime.*;
    %%
    %cup
    %{
      private int lineCounter = 0;
    %}
    %eofval{
      System.out.println("line number=" + lineCounter);
      return new Symbol(sym.EOF);
    %eofval}
    NEWLINE=\n
    %%
    {NEWLINE} { lineCounter++; }
    [^{NEWLINE}] { }

Outline
Roles of lexical analysis
What is a token
Regular expressions
Lexical analysis
Automatic Creation of Lexical Analysis
Error Handling

Basic Compiler Phases
[Figure: Source program (string) → lexical analysis → Tokens → syntax analysis → Abstract syntax tree → semantic analysis → Annotated Abstract syntax tree → Back-End → Fin. Assembly. The three analysis phases form the Front-End.]

Example Tokens
Type      Examples
ID        foo  n_14  last
NUM
REAL      e67  5.5e-10
IF        if
COMMA     ,
NOTEQ     !=
LPAREN    (
RPAREN    )

Example Non Tokens
Type                      Examples
comment                   /* ignored */
preprocessor directive    #include
                          #define NUMS 5, 6
macro                     NUMS
whitespace                \t \n \b

Example
    void match0(char *s) /* find a zero */
    {
        if (!strncmp(s, "0.0", 3))
            return 0. ;
    }

    VOID ID(match0) LPAREN CHAR DEREF ID(s) RPAREN
    LBRACE
    IF LPAREN NOT ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3) RPAREN RPAREN
    RETURN REAL(0.0) SEMI
    RBRACE
    EOF

Lexical Analysis (Scanning)
input
    – program text (file)
output
    – sequence of tokens
Read input file
Identify language keywords and standard identifiers
Handle include files and macros
Count line numbers
Remove whitespace
Report illegal symbols
[Produce symbol table]

Why Lexical Analysis
Simplifies the syntax analysis
    – And language definition
Modularity
Reusability
Efficiency

What is a token?
Defined by the programming language
Can be separated by spaces
Smallest units
Defined by regular expressions

A simplified scanner for C
    Token nextToken()
    {
        int c;                      /* int rather than char, so EOF can be represented */
    loop:
        c = getchar();
        switch (c) {
        case ' ':
            goto loop;
        case ';':
            return SemiColumn;
        case '+':
            c = getchar();
            switch (c) {
            case '+': return PlusPlus;
            case '=': return PlusEqual;
            default:  ungetc(c, stdin); return Plus;   /* push the lookahead back */
            }
        case '<':
        case 'w':
            /* ... */
        }
    }

Regular Expressions
Basic patterns    Matching
x                 The character x
.                 Any character except newline
[xyz]             Any of the characters x, y, z
R?                An optional R
R*                Zero or more occurrences of R
R+                One or more occurrences of R
R1R2              R1 followed by R2
R1|R2             Either R1 or R2
(R)               R itself
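To make the operator table concrete, here is a minimal sketch of a backtracking matcher for a small subset of it: literal characters, `.`, and the postfix operators `?`, `*` and `+` applied to a single character. The function names `full_match` and `match_char` are hypothetical, and character classes, alternation and grouping are deliberately left out. This is not how Lex-style tools work internally (they compile to automata); it only illustrates what each operator means.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Does pattern character pc match text character sc?
   '.' matches any character except newline, as in the table. */
static bool match_char(char pc, char sc)
{
    if (sc == '\0')
        return false;
    if (pc == '.')
        return sc != '\n';
    return pc == sc;
}

/* Returns true if pattern p matches the WHOLE of text s.
   Supported subset: literals, '.', and postfix '?', '*', '+'
   applied to the single preceding character. */
bool full_match(const char *p, const char *s)
{
    assert(p != NULL && s != NULL);
    if (*p == '\0')
        return *s == '\0';

    char c = p[0];
    char op = p[1];

    if (op == '?') {
        /* R? : either skip R, or consume exactly one R */
        return full_match(p + 2, s) ||
               (match_char(c, *s) && full_match(p + 2, s + 1));
    }
    if (op == '*' || op == '+') {
        if (op == '+') {            /* R+ requires one occurrence first */
            if (!match_char(c, *s))
                return false;
            s++;
        }
        for (;;) {                  /* then zero or more occurrences */
            if (full_match(p + 2, s))
                return true;
            if (!match_char(c, *s))
                return false;
            s++;
        }
    }
    /* R1R2 : match one character, then the rest of the pattern */
    return match_char(c, *s) && full_match(p + 1, s + 1);
}
```

For example, `full_match("ab*c", "ac")` and `full_match("colou?r", "color")` both succeed, while `full_match("ab+c", "ac")` fails because `+` needs at least one `b`.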

Escape characters in regular expressions
\ converts a single operator into text
    – a\+
    – (a\+\*)+
Double quotes surround text
    – "a+*"+
Esthetically ugly
But standard

Ambiguity Resolving
Find the longest matching token
Between two tokens with the same length use the one declared first

The Lexical Analysis Problem
Given
    – A set of token descriptions
        Token name
        Regular expression
    – An input string
Partition the string into tokens (class, value)
Ambiguity resolution
    – The longest matching token
    – Between two equal length tokens select the first
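The two ambiguity rules can be sketched in a few lines of C. The code below is an illustration under stated assumptions, not a generated scanner: each rule is a hand-written matcher that reports how long a prefix it accepts, and the names `classify`, `Rule`, `match_if`, `match_id` and `match_num` are hypothetical. The rules mirror the IF / ID / NUM descriptions used later in the slides.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A rule reports the length of the longest prefix of s it matches (0 = no match). */
typedef size_t (*Matcher)(const char *s);

static size_t match_if(const char *s)              /* if */
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t match_id(const char *s)              /* [a-z][a-z0-9]* */
{
    size_t n = 0;
    if (!islower((unsigned char)s[0]))
        return 0;
    while (islower((unsigned char)s[n]) || isdigit((unsigned char)s[n]))
        n++;
    return n;
}

static size_t match_num(const char *s)             /* [0-9]+ */
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

typedef struct {
    const char *name;
    Matcher match;
} Rule;

/* Declaration order matters: IF is declared before ID. */
static const Rule rules[] = {
    { "IF", match_if }, { "ID", match_id }, { "NUM", match_num }
};

/* Returns the name of the winning token class at the start of s and stores
   the token length in *len; returns NULL (and length 0) if no rule matches. */
const char *classify(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);
    const char *best = NULL;
    *len = 0;
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i].match(s);
        if (n > *len) {     /* strict '>' keeps the earlier rule on a tie */
            *len = n;
            best = rules[i].name;
        }
    }
    return best;
}
```

On the input `if (` both IF and ID match two characters, so the earlier declaration (IF) wins; on `iffy` the ID rule matches four characters and wins by length.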

A Jlex specification of C Scanner
    import java_cup.runtime.*;
    %%
    %cup
    %{
      private int lineCounter = 0;
    %}
    Letter= [a-zA-Z_]
    Digit= [0-9]
    %%
    "\t" { }
    "\n" { lineCounter++; }
    ";" { return new Symbol(sym.SemiColumn); }
    "++" { return new Symbol(sym.PlusPlus); }
    "+=" { return new Symbol(sym.PlusEq); }
    "+" { return new Symbol(sym.Plus); }
    "while" { return new Symbol(sym.While); }
    {Letter}({Letter}|{Digit})* { return new Symbol(sym.Id, yytext()); }
    "<=" { return new Symbol(sym.LessOrEqual); }
    "<" { return new Symbol(sym.LessThan); }

Jlex
Input
    – regular expressions and actions (Java code)
Output
    – A scanner program that reads the input and applies actions when input regular expression is matched
[Figure: regular expressions → Jlex → scanner; input program → scanner → tokens]

How to Implement Ambiguity Resolving
Between two tokens with the same length use the one declared first
Find the longest matching token

Pathological Example
    if                                { return IF; }
    [a-z][a-z0-9]*                    { return ID; }
    [0-9]+                            { return NUM; }
    [0-9]"."[0-9]*|[0-9]*"."[0-9]+    { return REAL; }
    (\-\-[a-z]*\n)|(" "|\n|\t)        { ; }
    .                                 { error(); }

    int edges[][256] = {
        /*               ..., 0, 1, 2, 3, ..., -, e, f, g, h, i, j, ... */
        /* state 0  */ { 0, ..., 0, 0, ..., 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0 },
        /* state 1  */ { 13, ..., 7, 7, 7, 7, ..., 9, 4, 4, 4, 4, 2, 4, ..., 13, 13 },
        /* state 2  */ { 0, ..., 4, 4, 4, 4, ..., 0, 4, 3, 4, 4, 4, 4, ..., 0, 0 },
        /* state 3  */ { 0, ..., 4, 4, 4, 4, ..., 0, 4, 4, 4, 4, 4, 4, ..., 0, 0 },
        /* state 4  */ { 0, ..., 4, 4, 4, 4, ..., 0, 4, 4, 4, 4, 4, 4, ..., 0, 0 },
        /* state 5  */ { 0, ..., 6, 6, 6, 6, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0 },
        /* state 6  */ { 0, ..., 6, 6, 6, 6, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0 },
        /* state 7  */ ...
        /* state 13 */ { 0, ..., 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0 }
    };

Pseudo Code for Scanner
    Token nextToken()
    {
        lastFinal = 0;
        currentState = 1;
        inputPositionAtLastFinal = input;
        currentPosition = input;
        while (not(isDead(currentState))) {
            nextState = edges[currentState][*currentPosition];
            advance currentPosition;
            if (isFinal(nextState)) {
                lastFinal = nextState;
                /* position just past the last character of the accepted token */
                inputPositionAtLastFinal = currentPosition;
            }
            currentState = nextState;
        }
        input = inputPositionAtLastFinal;
        return action[lastFinal];
    }
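The pseudo code can be turned into a small runnable sketch. The version below is an assumption-laden illustration: it uses its own six-state automaton for IF, ID and NUM only (not the 13-state table from the slides), fills the transition table at run time in a hypothetical `init_edges` function, and assumes an ASCII-like character set where `'a'..'z'` and `'0'..'9'` are contiguous. State 0 is the dead state, as in the slides.

```c
#include <assert.h>
#include <stddef.h>

enum { DEAD = 0, START = 1, S_I = 2, S_IF = 3, S_ID = 4, S_NUM = 5, NSTATES = 6 };
enum Token { T_NONE, T_IF, T_ID, T_NUM };

/* Static storage is zero-initialised, so every edge not set below leads to DEAD. */
static unsigned char edges[NSTATES][256];

/* action[s] is the token accepted in state s, or T_NONE if s is not final. */
static const enum Token action[NSTATES] = { T_NONE, T_NONE, T_ID, T_IF, T_ID, T_NUM };

static void set_range(int state, int lo, int hi, int target)
{
    for (int c = lo; c <= hi; c++)
        edges[state][c] = (unsigned char)target;
}

/* Must be called once before next_token. */
void init_edges(void)
{
    set_range(START, 'a', 'z', S_ID);
    set_range(START, '0', '9', S_NUM);
    edges[START]['i'] = S_I;                 /* possible start of "if" */
    for (int s = S_I; s <= S_ID; s++) {      /* identifier continuation */
        set_range(s, 'a', 'z', S_ID);
        set_range(s, '0', '9', S_ID);
    }
    edges[S_I]['f'] = S_IF;                  /* "if" seen so far */
    set_range(S_NUM, '0', '9', S_NUM);
}

/* Scans the longest token at *input and advances *input just past it.
   Returns T_NONE and leaves *input unchanged if no token starts there. */
enum Token next_token(const char **input)
{
    assert(input != NULL && *input != NULL);
    const char *pos = *input;
    const char *pos_at_last_final = *input;
    int state = START;
    int last_final = DEAD;

    while (state != DEAD) {
        state = edges[state][(unsigned char)*pos++];
        if (action[state] != T_NONE) {       /* remember the latest accepting state */
            last_final = state;
            pos_at_last_final = pos;
        }
    }
    *input = pos_at_last_final;              /* back up to the end of the last match */
    return action[last_final];
}
```

The terminating NUL has no outgoing edge, so the loop always stops at the end of the string. A caller that receives `T_NONE` has to skip at least one character itself (or report an error), otherwise it would loop forever on the same position.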

Example
Input: "if --not-a-com"

final  state  input
0      1      if --not-a-com
return IF

final  state  input
0      1       --not-a-com
12             --not-a-com
              not-a-com
found whitespace

final  state  input
0      1      --not-a-com
              not-a-com
9      10     --not-a-com
9      10     --not-a-com
9      0
error

final  state  input
0      1      -not-a-com
error

Efficient Scanners
Efficient state representation
Input buffering
Using switch and gotos instead of tables
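As an illustration of the last point, here is a sketch of a direct-coded automaton: each label is a state and each `goto` is an edge, so no transition table is consulted at run time. It recognizes only `[0-9]+` (NUM) and `[0-9]+"."[0-9]*` (REAL), a simplification of the REAL rule in the slides; `scan_number`, `Kind` and `DIGIT_CASES` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum Kind { K_NONE, K_NUM, K_REAL };

#define DIGIT_CASES \
    case '0': case '1': case '2': case '3': case '4': \
    case '5': case '6': case '7': case '8': case '9'

/* Returns the kind of the longest numeric token at the start of s and
   stores its length in *len (K_NONE and 0 if s does not start with a digit). */
enum Kind scan_number(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);
    const char *p = s;
    enum Kind last = K_NONE;
    *len = 0;

    /* start state: only a digit leads anywhere */
    switch (*p) {
    DIGIT_CASES: p++; goto in_num;
    default:     goto done;
    }

in_num:                         /* accepting state for NUM */
    last = K_NUM;
    *len = (size_t)(p - s);
    switch (*p) {
    DIGIT_CASES: p++; goto in_num;
    case '.':    p++; goto in_frac;
    default:     goto done;
    }

in_frac:                        /* accepting state for REAL */
    last = K_REAL;
    *len = (size_t)(p - s);
    switch (*p) {
    DIGIT_CASES: p++; goto in_frac;
    default:     goto done;
    }

done:
    return last;
}
```

Compared with the table-driven loop, the state is encoded in the program counter, which removes a memory lookup per character at the cost of larger code.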

Constructing Automaton from Specification
Create a non-deterministic automaton (NDFA) from every regular expression
Merge all the automata using epsilon moves (like the | construction)
Construct a deterministic finite automaton (DFA)
    – State priority
Minimize the automaton starting with separate accepting states
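The NDFA-to-DFA step is usually done with the subset construction. The sketch below shows it for a tiny automaton, under several assumptions that are not in the slides: at most four NDFA states stored as bits of an integer, a two-symbol alphabet, and hypothetical names (`Nfa`, `Dfa`, `eps_closure`, `nfa_move`, `subset_construction`). Accepting states, state priority and minimization are left out.

```c
#include <assert.h>
#include <stdint.h>

#define NFA_STATES 4
#define NSYMS      2            /* symbol 0 stands for 'a', symbol 1 for 'b' */
#define MAX_DFA    16           /* at most 2^NFA_STATES distinct subsets */

typedef uint32_t Set;           /* bit i set <=> NDFA state i is in the set */

typedef struct {
    Set eps[NFA_STATES];            /* epsilon moves out of each state */
    Set delta[NFA_STATES][NSYMS];   /* labelled moves out of each state */
} Nfa;

typedef struct {
    int count;                      /* number of DFA states found */
    Set subset[MAX_DFA];            /* NDFA states represented by each DFA state */
    int next[MAX_DFA][NSYMS];       /* DFA transition table */
} Dfa;

/* All states reachable from s using epsilon moves only (including s itself). */
Set eps_closure(const Nfa *n, Set s)
{
    Set prev;
    do {
        prev = s;
        for (int i = 0; i < NFA_STATES; i++)
            if (s & (1u << i))
                s |= n->eps[i];
    } while (s != prev);
    return s;
}

/* All states reachable from some state in s by one move on sym. */
Set nfa_move(const Nfa *n, Set s, int sym)
{
    Set out = 0;
    for (int i = 0; i < NFA_STATES; i++)
        if (s & (1u << i))
            out |= n->delta[i][sym];
    return out;
}

/* Each DFA state is a set of NDFA states; new sets are appended to the
   end of the array and processed in creation order (a simple worklist). */
void subset_construction(const Nfa *n, int nfa_start, Dfa *d)
{
    d->count = 1;
    d->subset[0] = eps_closure(n, 1u << nfa_start);
    for (int cur = 0; cur < d->count; cur++) {
        for (int sym = 0; sym < NSYMS; sym++) {
            Set t = eps_closure(n, nfa_move(n, d->subset[cur], sym));
            int j = 0;
            while (j < d->count && d->subset[j] != t)
                j++;                            /* look for an existing DFA state */
            if (j == d->count) {
                assert(d->count < MAX_DFA);
                d->subset[d->count++] = t;      /* new DFA state */
            }
            d->next[cur][sym] = j;
        }
    }
}
```

In a scanner generator, a DFA state would additionally be marked accepting if its subset contains an accepting NDFA state, and the rule declared first would win when several are present; that is where the "state priority" bullet comes in.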

NDFA Construction
    if              { return IF; }
    [a-z][a-z0-9]*  { return ID; }
    [0-9]+          { return NUM; }

DFA Construction

Minimization

Start States
It may be hard to specify regular expressions for certain constructs
    – Examples
        Strings
        Comments
Writing automata may be easier
Can combine both
Specify partial automata with regular expressions on the edges
    – No need to specify all states
    – Different actions at different states
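The idea behind start states can be sketched by hand: keep an explicit state variable and let the same input character trigger different actions depending on the state. The example below removes C-style `/* ... */` comments using four states. It is an illustration only; the names `strip_comments`, `CODE`, `SLASH`, `COMMENT` and `STAR` are hypothetical, and string literals containing `/*` are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum State { CODE, SLASH, COMMENT, STAR };

/* Copies src to dst with C-style comments removed.
   dst must have room for strlen(src) + 1 bytes. */
void strip_comments(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    enum State st = CODE;

    for (; *src != '\0'; src++) {
        char c = *src;
        switch (st) {
        case CODE:                      /* ordinary text: copy it */
            if (c == '/')
                st = SLASH;
            else
                *dst++ = c;
            break;
        case SLASH:                     /* saw '/', may start a comment */
            if (c == '*') {
                st = COMMENT;
            } else if (c == '/') {
                *dst++ = '/';           /* emit the earlier '/', stay in SLASH */
            } else {
                *dst++ = '/';
                *dst++ = c;
                st = CODE;
            }
            break;
        case COMMENT:                   /* inside a comment: drop everything */
            if (c == '*')
                st = STAR;
            break;
        case STAR:                      /* saw '*' inside a comment */
            if (c == '/')
                st = CODE;
            else if (c != '*')
                st = COMMENT;
            break;
        }
    }
    if (st == SLASH)
        *dst++ = '/';                   /* a trailing '/' was ordinary text */
    *dst = '\0';
}
```

Scanner generators expose the same mechanism declaratively: rules are tagged with the state in which they apply, and actions switch between states.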

Missing
Creating a lexical analyzer by hand
Table compression
Symbol Tables
Nested Comments
Handling Macros

Summary
For most programming languages lexical analyzers can be easily constructed automatically
Exceptions:
    – Fortran
    – PL/1
Lex/Flex/Jlex are useful beyond compilers