Lexical Analysis Compiler Baojian Hua

Compiler
source program -> compiler -> target program

Front and Back Ends
source program -> front end -> IR -> back end -> target program

Front End
source code -> lexical analyzer -> tokens -> parser -> abstract syntax tree -> semantic analyzer -> IR

Lexical Analyzer
The lexical analyzer translates the source program into a stream of lexical tokens.
Source program: a stream of characters; the character set varies from language to language (ASCII, Unicode, ...).
Lexical token: a compiler-internal data structure that represents the occurrence of a terminal symbol; the representation varies from compiler to compiler.

Conceptually:
character sequence -> lexical analyzer -> token sequence

Example Recall the min-ML language from "code3":
prog -> decs
decs -> dec; decs
      |                                  (* empty *)
dec  -> val id = exp
      | val _ = printInt exp
exp  -> id | num | exp + exp
      | true | false
      | if (exp) then exp else exp
      | (exp)

Example
val x = 3;
val y = 4;
val z = if (2) then (x) else y;
val _ = printInt z;

Lexical analysis gives:

VAL IDENT(x) ASSIGN INT(3) SEMICOLON
VAL IDENT(y) ASSIGN INT(4) SEMICOLON
VAL IDENT(z) ASSIGN IF LPAREN INT(2) RPAREN THEN LPAREN IDENT(x) RPAREN ELSE IDENT(y) SEMICOLON
VAL UNDERSCORE ASSIGN PRINTINT IDENT(z) SEMICOLON
EOF

Lexer Implementation Options:
Write a lexer by hand from scratch: boring, error-prone, and too much work (see Dragon Book, sec. 3.4)
Use an automatic lexer generator: quick and easy

Lexer Implementation
declarative specification -> lexer generator -> lexical analyzer

Regular Expressions How do we specify a lexer? We develop another (declarative) language: regular expressions. What's a lexer generator, then? Another compiler...

Basic Definitions
Alphabet: the character set (say, ASCII or Unicode)
String: a finite sequence of characters from the alphabet
Language: a set of strings, finite or infinite (say, the C language)

Regular Expression (RE) Construction by induction:
each character c in the alphabet denotes the singleton language {c}
\eps denotes {\eps}; {} denotes the empty language
for REs M and N, the alternation M|N: (a|b) = {a, b}
for REs M and N, the concatenation MN: (a|b)(c|d) = {ac, ad, bc, bd}
for an RE M, the Kleene closure M*: (a|b)* = {\eps, a, aa, b, ab, abb, baa, ...}
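The alternation and concatenation cases above can be checked directly on small finite languages; a quick sketch in Python (an illustration, not part of the original slides):

```python
# Computing the finite languages denoted by small regular expressions,
# using plain Python sets of strings.
from itertools import product

A = {"a", "b"}          # language of (a|b)
B = {"c", "d"}          # language of (c|d)

# Alternation M|N is set union; concatenation MN pairs every string
# of M with every string of N.
union = A | B
concat = {m + n for m, n in product(A, B)}

print(sorted(union))    # ['a', 'b', 'c', 'd']
print(sorted(concat))   # ['ac', 'ad', 'bc', 'bd']
```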

Regular Expression Or, more formally:
e -> {} | \eps | c | e|e | e e | e*

Example C's identifier: starts with a letter ("_" counts as a letter), followed by zero or more letters or digits:
(_|a|b|...|z|A|B|...|Z)(_|a|b|...|z|A|B|...|Z|0|...|9)*
It's really error-prone and tedious...

Syntax Sugar More syntactic sugar:
[a-z] == a|b|...|z
e+ == one or more of e
e? == zero or one of e
"a*" == the literal string a*
e{i,j} == at least i and at most j occurrences of e
. == any character except \n
All of these can be translated into core REs.

Example Revisited C's identifier: starts with a letter ("_" counts as a letter), followed by zero or more letters or digits:
[_a-zA-Z][_a-zA-Z0-9]*
What about the keyword "if"?
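The identifier pattern above can be tried out with Python's re module (this sketch uses Python regex syntax rather than the sml-lex syntax of the slides):

```python
# The C-identifier pattern from the slide, tested with Python's re module.
import re

ident = re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")

assert ident.fullmatch("_tmp1") is not None
assert ident.fullmatch("x") is not None
assert ident.fullmatch("9lives") is None    # cannot start with a digit
# note: "if" also matches this pattern -- keywords need a separate,
# higher-priority rule, which is exactly the ambiguity discussed next.
assert ident.fullmatch("if") is not None
```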

Ambiguous Rules A single RE is not ambiguous, but a language is specified by many REs:
[_a-zA-Z][_a-zA-Z0-9]*
"if"
So, for a given string, which RE should match?

Ambiguous Rules Two conventions:
Longest match: the regular expression that matches the longest string takes precedence.
Rule priority: the regular expressions identifying tokens are written down in sequence; if two regular expressions match the same (longest) string, the first one in the sequence takes precedence.
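The two conventions can be sketched directly: try every rule at the current position, keep the longest match, and break ties by rule order. The rule names (IF, ID, NUM) are this example's own, not from the slides' code:

```python
# Longest match + rule priority ("maximal munch") disambiguation.
import re

rules = [                      # earlier rules have higher priority
    ("IF",  re.compile(r"if")),
    ("ID",  re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")),
    ("NUM", re.compile(r"[0-9]+")),
    ("WS",  re.compile(r"[ \t\n]+")),
]

def tokenize(s):
    pos, toks = 0, []
    while pos < len(s):
        best = None                       # (length, name, lexeme)
        for name, rx in rules:
            m = rx.match(s, pos)
            # strict '>' keeps the earlier rule on ties (rule priority)
            if m and (best is None or len(m.group()) > best[0]):
                best = (len(m.group()), name, m.group())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best[1] != "WS":
            toks.append((best[1], best[2]))
        pos += best[0]
    return toks

# "if" wins by priority; "iffy" is an ID by longest match.
print(tokenize("if iffy 42"))   # [('IF', 'if'), ('ID', 'iffy'), ('NUM', '42')]
```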

Lexer Generator History Lexical analysis was once a performance bottleneck (certainly not true today!). As a result, early research investigated methods for efficient lexical analysis. While the performance concerns are largely irrelevant today, the tools resulting from this research are still in wide use.

History: A Long-standing Goal In this early period, a considerable amount of study went into the goal of creating an automatic compiler generator (aka compiler-compiler):
declarative compiler specification -> compiler

History: Unix and C In the late 1960s at Bell Labs, Thompson, Ritchie, and others were developing Unix. A key part of this project was the development of C and a compiler for it. Thompson's 1968 CACM paper showed how to compile regular expressions into fast search programs (read the accompanying paper on the course page); building on such finite-state techniques, Lesk and Schmidt developed Lex, which realized part of the compiler-compiler goal by automatically generating fast lexical analyzers.

The Lex Tool The original Lex generated lexers written in C ("C in C"). Today every major language has its own lex tool(s): sml-lex, ocamllex, JLex, C#lex, ... Our topic next: sml-lex; the concepts and techniques apply to the other tools.

SML-Lex Specification A lexical specification consists of 3 parts (yet another programming language):
User Declarations (plain SML types, values, functions)
%%
SML-Lex Definitions (RE abbreviations, special stuff)
%%
Rules (associating REs with tokens; each token is represented in plain SML)

User Declarations The user can define various values that are available to the action fragments. Two values must be defined in this section:
type lexresult: the type of the value returned by each rule action
fun eof (): called by the lexer when the end of the input stream is reached (EOF)

SML-Lex Definitions The user can define regular expression abbreviations:
digits = [0-9]+;
letter = [a-zA-Z];
and can define multiple lexers to work together, each given a unique name:
%s lex1 lex2 lex3;

Rules A rule consists of a pattern and an action:
regularExp => (action);
The pattern is a regular expression. The action is a fragment of ordinary SML code. Longest match and rule priority are used for disambiguation. Rules may be prefixed with the list of lexers that are allowed to use the rule.

Rules Rule actions can use any value defined in the User Declarations section, including:
type lexresult: the type of the value returned by each rule action
val eof : unit -> lexresult: called by the lexer when the end of the input stream is reached
special variables:
yytext: the input substring matched by the regular expression
yypos: the file position of the beginning of the matched string
continue (): doesn't return a token; recursively calls the lexer

Example #1 (* A language called Toy *)
prog   -> word prog
        |                                (* empty *)
word   -> symbol
        | number
symbol -> [_a-zA-Z][_0-9a-zA-Z]*
number -> [0-9]+

Example #1 (* Lexer Toy; see the accompanying code for details *)
datatype token
  = Symbol of string * int
  | Number of string * int
exception End
type lexresult = unit
fun eof () = raise End
fun output x = ...
%%
letter = [_a-zA-Z];
digit = [0-9];
ld = {letter}|{digit};
symbol = {letter}{ld}*;
number = {digit}+;
%%
{symbol} => (output (Symbol (yytext, yypos)));
{number} => (output (Number (yytext, yypos)));

Example #2 (* Expression language with C-style comments, i.e. /* ... */ *)
prog -> stms
stms -> stm; stms
      |                                  (* empty *)
stm  -> id = e
      | print e
e    -> id
      | num
      | e bop e
      | (e)
bop  -> + | - | * | /

Sample Program
x = 4;
y = 5;
z = x+y*3;
print z;

Example #2 (* All terminals *)
prog -> stms
stms -> stm; stms
      |                                  (* empty *)
stm  -> id = e
      | print e
e    -> id
      | num
      | e bop e
      | (e)
bop  -> + | - | * | /

Example #2 in Lex (* Expression language; see the accompanying code for details. Part 1: user code *)
datatype token
  = Id of string * int
  | Number of string * int
  | Print of string * int
  | Plus of string * int
  | ... (* all the other tokens *)
exception End
type lexresult = unit
fun eof () = raise End
fun output x = ...

Example #2 in Lex, cont'd (* Part 2: lex definitions *)
%%
letter = [_a-zA-Z];
digit = [0-9];
ld = {letter}|{digit};
sym = {letter}{ld}*;
num = {digit}+;
ws = [\ \t];
nl = [\n];

Example #2 in Lex, cont'd (* Part 3: rules *)
%%
{ws} => (continue ());
{nl} => (continue ());
"+"  => (output (Plus (yytext, yypos)));
"-"  => (output (Minus (yytext, yypos)));
"*"  => (output (Times (yytext, yypos)));
"/"  => (output (Divide (yytext, yypos)));
"("  => (output (Lparen (yytext, yypos)));
")"  => (output (Rparen (yytext, yypos)));
"="  => (output (Assign (yytext, yypos)));
";"  => (output (Semi (yytext, yypos)));

Example #2 in Lex, cont'd (* Part 3: rules, cont'd *)
"print" => (output (Print (yytext, yypos)));
{sym}   => (output (Id (yytext, yypos)));
{num}   => (output (Number (yytext, yypos)));
"/*"    => (YYBEGIN COMMENT; continue ());  (* enter the COMMENT lexer *)
"*/"    => (YYBEGIN INITIAL; continue ());  (* in COMMENT: back to INITIAL *)
{nl}    => (continue ());                   (* in COMMENT: skip *)
.       => (continue ());                   (* in COMMENT: skip *)
.       => (error (...));                   (* in INITIAL: anything else *)

Lex Implementation Lex accepts regular expressions (along with other things), so SML-Lex is a compiler from REs to a lexer. Internally:
RE -> NFA -> DFA -> table-driven algorithm

Finite-state Automata (FA)
An FA M takes an input string and answers {Yes, No}.
M = (Σ, S, q0, F, δ)
Σ: the input alphabet
S: the state set
q0: the initial state
F: the final states
δ: the transition function

Transition Functions
DFA: δ : S × Σ -> S
NFA: δ : S × (Σ ∪ {ε}) -> ℘(S)

DFA Example Which strings of a's and b's are accepted? Transition function:
{ (q0,a) -> q1, (q0,b) -> q0,
  (q1,a) -> q2, (q1,b) -> q1,
  (q2,a) -> q2, (q2,b) -> q2 }
[state diagram omitted]
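The DFA above can be simulated in a few lines; a sketch in Python, assuming q2 is the (only) final state (the accepting states were shown in the figure, not in the transcript):

```python
# Simulating the example DFA; under the assumption that q2 is final,
# it accepts exactly the strings containing at least two a's.
delta = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q2", ("q1", "b"): "q1",
    ("q2", "a"): "q2", ("q2", "b"): "q2",
}

def accepts(s, start="q0", final={"q2"}):
    state = start
    for c in s:
        state = delta[(state, c)]   # DFA: exactly one successor per char
    return state in final

print(accepts("abab"))   # True  (two a's)
print(accepts("bba"))    # False (only one a)
```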

NFA Example Transition function:
{ (q0,a) -> {q0,q1}, (q0,b) -> {q1},
  (q1,a) -> {},      (q1,b) -> {q0,q1} }
[state diagram omitted]

RE -> NFA: Thompson's Algorithm Break the RE down into atoms; construct small NFAs directly for the atoms, then inductively construct larger NFAs from the smaller ones. Easy to implement as a small recursive algorithm.

RE -> NFA: Thompson's Algorithm Cases:
e -> \eps : a single \eps-edge from start to accept
  -> c : a single c-edge from start to accept
  -> e1 e2 : connect e1's accept state to e2's start state with an \eps-edge
  -> e1 | e2 : fresh start and accept states, with \eps-edges fanning out to e1 and e2 and back
  -> e1* : fresh start and accept states, with \eps-edges allowing zero or more passes through e1
[NFA fragment diagrams omitted]

Example
%%
letter = [_a-zA-Z];
digit = [0-9];
id = {letter}({letter}|{digit})*;
%%
"if" => (IF (yytext, yypos));
{id} => (Id (yytext, yypos));
(* Equivalent to: "if" | {id} *)

Example
"if" => (IF (yytext, yypos));
{id} => (Id (yytext, yypos));
[NFA diagram for "if" | {id} omitted]

NFA -> DFA: Subset Construction Algorithm
(* subset construction: workList algorithm *)
q0 <- e-closure (n0)
Q <- {q0}
workList <- {q0}
while (workList != \phi)
  remove q from workList
  foreach (character c)
    t <- e-closure (move (q, c))
    D[q, c] <- t
    if (t \not\in Q)
      add t to Q and workList
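The workList algorithm above can be run on a small example; this sketch hand-writes an NFA for a(b|c) (the NFA encoding is this example's own, not sml-lex output):

```python
# Subset construction (workList algorithm) on a hand-written NFA for a(b|c).
EPS = None
# transitions: state -> list of (label, target)
nfa = {0: [("a", 1)],
       1: [(EPS, 2), (EPS, 4)],
       2: [("b", 3)], 3: [],
       4: [("c", 5)], 5: []}
alphabet = {"a", "b", "c"}

def e_closure(states):
    """All NFA states reachable from `states` via eps edges."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for label, t in nfa[s]:
            if label is EPS and t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def move(states, c):
    return {t for s in states for label, t in nfa[s] if label == c}

q0 = e_closure({0})
Q, work, D = {q0}, [q0], {}
while work:
    q = work.pop()
    for c in alphabet:
        t = e_closure(move(q, c))
        if t:                      # skip the empty (dead) state
            D[q, c] = t
            if t not in Q:
                Q.add(t)
                work.append(t)

print(len(Q))   # 4 DFA states: {0}, {1,2,4}, {3}, {5}
```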

NFA -> DFA: e-closure
(* e-closure: fixpoint algorithm *)
(* Dragon Book Fig 3.33 gives a DFS-like algorithm.
 * Here we give a recursive version (simpler). *)
X <- \phi
fun eps (t) =
  X <- X ∪ {t}
  foreach (s \in one-eps (t))
    if (s \not\in X) then eps (s)

NFA -> DFA: e-closure, cont'd
(* e-closure of a set of states, using eps from the previous slide *)
fun e-closure (T) =
  X <- T
  foreach (t \in T)
    X <- X ∪ eps (t)

NFA -> DFA: e-closure
(* a BFS-like algorithm *)
fun e-closure (T) =
  Q <- T
  X <- T
  while (Q not empty)
    q <- deQueue (Q)
    foreach (s \in one-eps (q))
      if (s \not\in X)
        enQueue (Q, s)
        X <- X ∪ {s}
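The BFS version above maps directly onto a queue; a sketch in Python, where one_eps maps a state to its eps-successors (the tiny graph is invented for illustration):

```python
# BFS-style e-closure: X accumulates everything reachable via eps edges.
from collections import deque

one_eps = {0: [1, 2], 1: [3], 2: [], 3: [1]}

def e_closure(T):
    Q, X = deque(T), set(T)
    while Q:
        q = Q.popleft()              # deQueue
        for s in one_eps[q]:
            if s not in X:
                X.add(s)
                Q.append(s)          # enQueue
    return X

print(sorted(e_closure({0})))   # [0, 1, 2, 3]
```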

Example
"if" => (IF (yytext, yypos));
{id} => (Id (yytext, yypos));
[NFA diagram omitted: states 0-8, built from "if" | {id} by Thompson's construction; 4 (IF) and 8 (Id) are accept states]

Example
q0 = {0, 1, 5}                  Q = {q0}
D[q0, "i"] = {2, 3, 6, 7, 8}    add q1 to Q
D[q0, letter-"i"] = {6, 7, 8}   add q2 to Q
D[q1, "f"] = {4, 7, 8}          add q3 to Q
[diagram omitted]

Example
D[q1, ld-"f"] = {7, 8}          add q4 to Q
D[q2, ld] = {7, 8}              already in Q
D[q3, ld] = {7, 8}              already in Q
D[q4, ld] = {7, 8}              already in Q
[diagram omitted]

Example The resulting DFA states:
q0 = {0, 1, 5}
q1 = {2, 3, 6, 7, 8}
q2 = {6, 7, 8}
q3 = {4, 7, 8}
q4 = {7, 8}
[DFA diagram: q0 -"i"-> q1 -"f"-> q3; q0 -(letter-"i")-> q2; q1 -(ld-"f")-> q4; q2, q3, q4 -ld-> q4]

Table-driven Algorithm Conceptually, an FA is a directed graph. Pragmatically, there are many different strategies to encode an FA:
Matrix (adjacency matrix): sml-lex
Array of lists (adjacency list)
Hash table
Jump table (switch statements): flex
The choice is a balance between time and space.

Example
"if" => (IF (yytext, yypos));
{id} => (Id (yytext, yypos));

Transition table:
state\char   "i"    "f"    letter-"i"-"f"   digit   other
q0           q1     q2     q2               error   error
q1           q4     q3     q4               q4      error
q2           q4     q4     q4               q4      error
q3           q4     q4     q4               q4      error
q4           q4     q4     q4               q4      error

Action table:
state    q0   q1   q2   q3   q4
action   -    Id   Id   IF   Id
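A table-driven scanner for this example walks the table while remembering the last accepting state it saw (longest match). The encoding below is this sketch's own; sml-lex emits a packed matrix:

```python
# Table-driven scanning for the "if"/identifier example.
LETTERS = set("_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
DIGITS = set("0123456789")

def klass(c):
    """Map a character to its column in the transition table."""
    if c == "i": return "i"
    if c == "f": return "f"
    if c in LETTERS: return "letter"
    if c in DIGITS: return "digit"
    return "other"

trans = {
    ("q0", "i"): "q1", ("q0", "f"): "q2", ("q0", "letter"): "q2",
    ("q1", "i"): "q4", ("q1", "f"): "q3", ("q1", "letter"): "q4", ("q1", "digit"): "q4",
    ("q2", "i"): "q4", ("q2", "f"): "q4", ("q2", "letter"): "q4", ("q2", "digit"): "q4",
    ("q3", "i"): "q4", ("q3", "f"): "q4", ("q3", "letter"): "q4", ("q3", "digit"): "q4",
    ("q4", "i"): "q4", ("q4", "f"): "q4", ("q4", "letter"): "q4", ("q4", "digit"): "q4",
}
action = {"q1": "Id", "q2": "Id", "q3": "IF", "q4": "Id"}

def scan_one(s):
    """Scan one token from the front of s: returns (action, length)."""
    state, last = "q0", None
    for i, c in enumerate(s):
        state = trans.get((state, klass(c)))   # missing entry = error
        if state is None:
            break
        if state in action:
            last = (action[state], i + 1)      # remember longest accept
    return last

print(scan_one("if"))      # ('IF', 2)
print(scan_one("iffy"))    # ('Id', 4)  -- longest match beats IF
print(scan_one("x1+"))     # ('Id', 2)  -- stops at '+'
```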

DFA Minimization: Hopcroft's Algorithm Before minimization:
[DFA diagram: q0 -"i"-> q1 -"f"-> q3; q0 -(letter-"i")-> q2; q1 -(ld-"f")-> q4; q2, q3, q4 -ld-> q4]
state    q0   q1   q2   q3   q4
action   -    Id   Id   IF   Id

DFA Minimization: Hopcroft's Algorithm After minimization, q2 and q4 merge:
[DFA diagram: q0 -"i"-> q1 -"f"-> q3; q0 -(letter-"i")-> {q2,q4}; q1 -(ld-"f")-> {q2,q4}; {q2,q4}, q3 -ld-> {q2,q4}]
state    q0   q1   q2,q4   q3
action   -    Id   Id      IF
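The merge of q2 and q4 can be reproduced by partition refinement. This sketch uses Moore's simpler O(n^2) variant rather than Hopcroft's worklist version (which refines the same idea to reach O(n log n)); the char-class encoding matches the table above:

```python
# DFA minimization by iterative partition refinement: start from the
# partition induced by actions, then split any block whose members
# disagree on the block reached under some character class.
states = ["q0", "q1", "q2", "q3", "q4"]
classes = ["i", "f", "letter", "digit"]
trans = {
    ("q0", "i"): "q1", ("q0", "f"): "q2", ("q0", "letter"): "q2",
    ("q1", "i"): "q4", ("q1", "f"): "q3", ("q1", "letter"): "q4", ("q1", "digit"): "q4",
    ("q2", "i"): "q4", ("q2", "f"): "q4", ("q2", "letter"): "q4", ("q2", "digit"): "q4",
    ("q3", "i"): "q4", ("q3", "f"): "q4", ("q3", "letter"): "q4", ("q3", "digit"): "q4",
    ("q4", "i"): "q4", ("q4", "f"): "q4", ("q4", "letter"): "q4", ("q4", "digit"): "q4",
}
action = {"q1": "Id", "q2": "Id", "q3": "IF", "q4": "Id"}

groups = {}
for s in states:
    groups.setdefault(action.get(s), []).append(s)
parts = [set(g) for g in groups.values()]

while True:
    def block_of(s):
        for i, p in enumerate(parts):
            if s in p:
                return i      # missing transitions fall through to None

    new_parts = []
    for p in parts:
        sig = {}
        for s in p:
            key = tuple(block_of(trans.get((s, c))) for c in classes)
            sig.setdefault(key, set()).add(s)
        new_parts.extend(sig.values())
    if len(new_parts) == len(parts):   # refinement only ever splits blocks
        break
    parts = new_parts

print(sorted(sorted(p) for p in parts))
# [['q0'], ['q1'], ['q2', 'q4'], ['q3']] -- q2 and q4 collapse, as on the slide
```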

Summary A lexer:
input: stream of characters
output: stream of tokens
Writing lexers by hand is boring, so we use a lexer generator: ml-lex
RE -> NFA -> DFA -> table-driven algorithm
Moral: don't underestimate your theory classes! This is a great application of cool theory developed in mathematics, and we'll see more cool applications as the course progresses.