Compiler Baojian Hua bjhua@ustc.edu.cn Lexical Analysis (II) Compiler Baojian Hua bjhua@ustc.edu.cn.

Slides:



Advertisements
Similar presentations
COS 320 Compilers David Walker. Outline Last Week –Introduction to ML Today: –Lexical Analysis –Reading: Chapter 2 of Appel.
Advertisements

Compiler construction in4020 – lecture 2 Koen Langendoen Delft University of Technology The Netherlands.
4b Lexical analysis Finite Automata
Lex -- a Lexical Analyzer Generator (by M.E. Lesk and Eric. Schmidt) –Given tokens specified as regular expressions, Lex automatically generates a routine.
Lexical Analysis Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!
Finite Automata CPSC 388 Ellen Walker Hiram College.
COMP-421 Compiler Design Presented by Dr Ioanna Dionysiou.
 Lex helps to specify lexical analyzers by specifying regular expression  i/p notation for lex tool is lex language and the tool itself is refered to.
Compiler Construction
Winter 2007SEG2101 Chapter 81 Chapter 8 Lexical Analysis.
Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.
1 The scanning process Goal: automate the process Idea: –Start with an RE –Build a DFA How? –We can build a non-deterministic finite automaton (Thompson's.
Lexical Analysis Compiler Baojian Hua
COS 320 Compilers David Walker. Outline Last Week –Introduction to ML Today: –Lexical Analysis –Reading: Chapter 2 of Appel.
Automata and Regular Expression Discrete Mathematics and Its Applications Baojian Hua
Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.
CS 426 Compiler Construction
CPSC 388 – Compiler Design and Construction Scanners – Finite State Automata.
Finite-State Machines with No Output
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 1 Chapter 4 Chapter 4 Lexical analysis.
Lexical Analysis — Part II: Constructing a Scanner from Regular Expressions Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.
Lexical Analysis — Part II: Constructing a Scanner from Regular Expressions.
1 Outline Informal sketch of lexical analysis –Identifies tokens in input string Issues in lexical analysis –Lookahead –Ambiguities Specifying lexers –Regular.
Lexical Analysis - An Introduction. The Front End The purpose of the front end is to deal with the input language Perform a membership test: code  source.
1 Regular Expressions. 2 Regular expressions describe regular languages Example: describes the language.
Lecture # 3 Chapter #3: Lexical Analysis. Role of Lexical Analyzer It is the first phase of compiler Its main task is to read the input characters and.
Automating Construction of Lexers. Example in javacc TOKEN: { ( | | "_")* > | ( )* > | } SKIP: { " " | "\n" | "\t" } --> get automatically generated code.
Other Issues - § 3.9 – Not Discussed More advanced algorithm construction – regular expression to DFA directly.
Lexical Analysis (I) Compiler Baojian Hua
COMP 3438 – Part II - Lecture 2: Lexical Analysis (I) Dr. Zili Shao Department of Computing The Hong Kong Polytechnic Univ. 1.
Overview of Previous Lesson(s) Over View  An NFA accepts a string if the symbols of the string specify a path from the start to an accepting state.
4b 4b Lexical analysis Finite Automata. Finite Automata (FA) FA also called Finite State Machine (FSM) –Abstract model of a computing entity. –Decides.
CS412/413 Introduction to Compilers Radu Rugina Lecture 4: Lexical Analyzers 28 Jan 02.
TRANSITION DIAGRAM BASED LEXICAL ANALYZER and FINITE AUTOMATA Class date : 12 August, 2013 Prepared by : Karimgailiu R Panmei Roll no. : 11CS10020 GROUP.
1 November 1, November 1, 2015November 1, 2015November 1, 2015 Azusa, CA Sheldon X. Liang Ph. D. Computer Science at Azusa Pacific University Azusa.
Lexical Analysis: Finite Automata CS 471 September 5, 2007.
Lexical Analysis – Part I EECS 483 – Lecture 2 University of Michigan Monday, September 11, 2006.
By Neng-Fa Zhou Lexical Analysis 4 Why separate lexical and syntax analyses? –simpler design –efficiency –portability.
IN LINE FUNCTION AND MACRO Macro is processed at precompilation time. An Inline function is processed at compilation time. Example : let us consider this.
1 Lex & Yacc. 2 Compilation Process Lexical Analyzer Source Code Syntax Analyzer Symbol Table Intermed. Code Gen. Code Generator Machine Code.
YACC. Introduction What is YACC ? a tool for automatically generating a parser given a grammar written in a yacc specification (.y file) YACC (Yet Another.
Compiler Construction Sohail Aslam Lecture 9. 2 DFA Minimization  The generated DFA may have a large number of states.  Hopcroft’s algorithm: minimizes.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
C Chuen-Liang Chen, NTUCS&IE / 35 SCANNING Chuen-Liang Chen Department of Computer Science and Information Engineering National Taiwan University Taipei,
Lexical Analysis – Part II EECS 483 – Lecture 3 University of Michigan Wednesday, September 13, 2006.
Lexical Analysis.
Exercise Solution for Exercise (a) {1,2} {3,4} a b {6} a {5,6,1} {6,2} {4} {3} {5,6} { } b a b a a b b a a b a,b b b a.
1 Steps to use Flex Ravi Chotrani New York University Reviewed By Prof. Mohamed Zahran.
CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis.
1 February 23, February 23, 2016February 23, 2016February 23, 2016 Azusa, CA Sheldon X. Liang Ph. D. Computer Science at Azusa Pacific University.
using Deterministic Finite Automata & Nondeterministic Finite Automata
1 Topic 2: Lexing and Flexing COS 320 Compiling Techniques Princeton University Spring 2016 Lennart Beringer.
Overview of Previous Lesson(s) Over View  A token is a pair consisting of a token name and an optional attribute value.  A pattern is a description.
1 Compiler Construction (CS-636) Muhammad Bilal Bashir UIIT, Rawalpindi.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
LECTURE 5 Scanning. SYNTAX ANALYSIS We know from our previous lectures that the process of verifying the syntax of the program is performed in two stages:
Deterministic Finite Automata Nondeterministic Finite Automata.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture Ahmed Ezzat.
Department of Software & Media Technology
WELCOME TO A JOURNEY TO CS419 Dr. Hussien Sharaf Dr. Mohammad Nassef Department of Computer Science, Faculty of Computers and Information, Cairo University.
Lecture 2 Lexical Analysis
Lexical analysis Finite Automata
RegExps & DFAs CS 536.
Recognizer for a Language
Lexical Analysis Why separate lexical and syntax analyses?
Review: Compiler Phases:
4b Lexical analysis Finite Automata
Compiler Construction
4b Lexical analysis Finite Automata
Lexical Analysis Uses formalism of Regular Languages
Presentation transcript:

Compiler Baojian Hua bjhua@ustc.edu.cn Lexical Analysis (II) Compiler Baojian Hua bjhua@ustc.edu.cn

Recap lexical analyzer character sequence token sequence

Lexer Implementation Options: Write a lexer by hand from scratch Automatic lexer generator We’ve discussed the first approach, now we continue to discuss the second one

declarative specification Lexer Implementation declarative specification lexical analyzer

Regular Expressions How to specify a lexer? What’s a lexer-generator? Develop another language Regular expressions, along with others What’s a lexer-generator? Finite-state automata Another compiler…

Lexer Generator History Lexical analysis was once a performance bottleneck certainly not true today! As a result, early research investigated methods for efficient lexical analysis While the performance concerns are largely irrelevant today, the tools resulting from this research are still in wide use

History: A long-standing goal In this early period, a considerable amount of study went into the goal of creating an automatic compiler generator (aka compiler-compiler) declarative compiler specification compiler

History: Unix and C In the mid-1960’s at Bell Labs, Ritchie and others were developing Unix A key part of this project was the development of C and a compiler for it Johnson, in 1968, proposed the use of finite state machines for lexical analysis and developed Lex [CACM 11(12), 1968] Lex realized a part of the compiler-compiler goal by automatically generating fast lexical analyzers

The Lex-like tools The original Lex generated lexers written in C (C in C) Today every major language has its own lex tool(s): flex, sml-lex, Ocaml-lex, JLex, C#lex, … One example next: written in flex (GNU’s implementation of Lex) concepts and techniques apply to other tools

FLex Specification Lexical specification consists of 3 parts (yet another programming language): Definitions (RE definitions) %% Rules (association of actions with REs) User code (plain C code)

Definitions Code fragments that are available to the rule section REs: %{…%} REs: e.g., ALPHA [a-zA-Z] Options: e.g., %s STRING

Rules Rules: A rule consists of a pattern and an action: Pattern is a regular expression. Action is a fragment of ordinary C code. Longest match & rule priority used for disambiguation Rules may be prefixed with the list of lexers that are allowed to use this rule. <lexerList> regularExp {action}

Example %{ #include <stdio.h> %} ALPHA [a-zA-Z] %% <INITIAL>{ALPHA} {printf (“%c\n”), yytext);} <INITIAL>.|\n => {} int main () { yylex (); }

Lex Implementation Lex accepts REs (along with others) and produce FAs So Lex is a compiler from REs to FAs Internal: table-driven algorithm RE NFA DFA

Finite-state Automata (FA) Input String M {Yes, No} M = (, S, q0, F, ) Input alphabet Transition function State set Final states Initial state

Transition functions DFA : S    S NFA : S    (S)

DFA example Which strings of as and bs are accepted? 2 1 Transition function: { (q0,a)q1, (q0,b)q0, (q1,a)q2, (q1,b)q1, (q2,a)q2, (q2,b)q2 } 1 2 a b a,b

NFA example 1 Transition function: {(q0,a){q0,q1}, (q0,b){q1}, (q1,b){q0,q1}} 1 a,b a b

RE -> NFA: Thompson algorithm Break RE down to atoms construct small NFAs directly for atoms inductively construct larger NFAs from smaller NFAs Easy to implement a small recursion algorithm

RE -> NFA: Thompson algorithm -> c -> e1 e2 -> e1 | e2 -> e1*  c  e1 e2

RE -> NFA: Thompson algorithm -> c -> e1 e2 -> e1 | e2 -> e1* e1     e2    e1 

Example alpha = [a-z]; id = {alpha}+; %% ”if” => (…); /* Equivalent to: * “if” | {id} */

Example ”if” => (…); {id} => (…); i  f   …

NFA -> DFA: Subset construction algorithm (* subset construction: workList algorithm *) q0 <- e-closure (n0) Q <- {q0} workList <- q0 while (workList != []) remove q from workList foreach (character c) t <- e-closure (move (q, c)) D[q, c] <- t if (t\not\in Q) add t to Q and workList

NFA -> DFA: -closure /* -closure: fixpoint algorithm */ /* Dragon book Fig 3.33 gives a DFS-like * algorithm. * Here we give a recursive version. (Simpler) */ X <- \phi fun eps (t) = X <- X ∪ {t} foreach (s \in one-eps(t)) if (s \not\in X) then eps (s)

NFA -> DFA: -closure /* -closure: fixpoint algorithm */ /* Dragon book Fig 3.33 gives a DFS-like * algorithm. * Here we give a recursive version. (Simpler) */ fun e-closure (T) = X <- T foreach (t \in T) X <- X ∪ eps(t)

NFA -> DFA: -closure /* -closure: fixpoint algorithm */ /* And a BFS-like algorithm. */ X <- empty; fun e-closure (T) = Q <- T X <- T while (Q not empty) q <- deQueue (Q) foreach (s \in one-eps(q)) if (s \not\in X) enQueue (Q, s) X <- X ∪ s

Example ”if” => (…); {id} => (…); i  f  4 1 2 3  [a-z]   8  [a-z]   8 5 6 7 [a-z]

Example q0 = {0, 1, 5} Q = {q0} D[q0, ‘i’] = {2, 3, 6, 7, 8} Q ∪= q1 D[q0, _] = {6, 7, 8} Q ∪= q2 D[q1, ‘f’] = {4, 7, 8} Q ∪= q3  f i  4 1 2 3  [a-z]   8 5 6 7 f q1 q3 i _ [a-z] q0 q2

Example D[q1, _] = {7, 8} Q ∪= q4 D[q2, _] = {7, 8} Q  f i  1 2 3 4  [a-z]   8 5 6 7 f q3 q1 _ i _ [a-z] _ _ q0 q4 _ q2

Example q0 = {0, 1, 5} q1 = {2, 3, 6, 7, 8} q2 = {6, 7, 8} q3 = {4, 7, 8} q4 = {7, 8}  f i  1 2 3 4  [a-z]   8 5 6 7 f q3 letter [a-z] q1 i letter-f q4 q0 letter letter q2 letter-i

Example q0 = {0, 1, 5} q1 = {2, 3, 6, 7, 8} q2 = {6, 7, 8} q3 = {4, 7, 8} q4 = {7, 8}  f i  1 2 3 4  [_a-zA-Z]   8 5 6 7 f q3 letter q1 [_a-zA-Z0-9] i letter-f q4 q0 letter letter q2 letter-i

DFA -> Table-driven Algorithm Conceptually, an FA is a directed graph Pragmatically, many different strategies to encode an FA in the generated lexer Matrix (adjacency matrix) sml-lex Array of list (adjacency list) Hash table Jump table (switch statements) flex Balance between time and space

Example: Adjacency matrix ”if” => (…); {id} => (…); state\char i f letter-i-f other q0 q1 q2 error q4 q3 f q3 state q0 q1 q2 q3 q4 action ID IF letter q1 i letter-f q4 q0 letter letter q2 letter-i

DFA Minimization: Hopcroft’s Algorithm (Generalized) q3 letter q1 i letter-f q4 q0 letter letter letter-i q2 state q0 q1 q2 q3 q4 action ID IF

DFA Minimization: Hopcroft’s Algorithm (Generalized) q3 letter q1 i letter-f q4 q0 letter letter letter-i q2 state q0 q1 q2 q3 q4 action Id IF

DFA Minimization: Hopcroft’s Algorithm (Generalized) q3 q1 i letter letter-f q0 letter-i q2, q4 letter state q0 q1 q2, q4 q3 action ID IF

Summary A Lexer: input: stream of characters output: stream of tokens Writing lexers by hand is boring, so we use lexer generators RE -> NFA -> DFA -> table-driven algorithm Moral: don’t underestimate your theory classes! great application of cool theory developed in mathematics. we’ll see more cool apps. as the course progresses