2. Lexical Analysis Prof. O. Nierstrasz

2. Lexical Analysis Prof. O. Nierstrasz Thanks to Jens Palsberg and Tony Hosking for their kind permission to reuse and adapt the CS132 and CS502 lecture notes. http://www.cs.ucla.edu/~palsberg/ http://www.cs.purdue.edu/homes/hosking/

Roadmap
Regular languages
Finite automata recognizers
From regular expressions to deterministic finite automata
Limits of regular languages
See Modern Compiler Implementation in Java (second edition), chapter 2.

Scanner
maps characters to tokens
the character string value for a token is a lexeme
eliminates white space (tabs, blanks, comments, etc.)
a key issue is speed ⇒ use a specialized recognizer
Example: x = x + y becomes <id,x> = <id,x> + <id,y>
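
The mapping from characters to tokens can be sketched in a few lines of Python. This is only an illustration, not the scanner from the lecture: the token classes `ws` and `op`, the table `TOKEN_SPEC`, and the function `scan` are names assumed here (the slide itself only shows `<id,x>`).

```python
import re

# Hypothetical token classes for this sketch; the slide only shows <id,x>.
TOKEN_SPEC = [
    ("ws", r"[ \t]+"),                # white space is recognized, then dropped
    ("id", r"[A-Za-z][A-Za-z0-9]*"),  # identifiers
    ("op", r"[=+]"),                  # the two operators used in the example
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(text):
    """Return a list of (token_class, lexeme) pairs, skipping white space."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "ws":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Calling `scan("x = x + y")` yields the identifier and operator tokens in order; the lexeme is carried as the second element of each pair.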

Specifying patterns
A scanner must recognize various parts of the language's syntax. Some parts are easy:
White space: <ws> ::= <ws> ' ' | <ws> '\t' | ' ' | '\t'
Keywords and operators: specified as literal patterns: do, end
Comments: opening and closing delimiters: /* … */

Specifying patterns
Other parts are much harder:
Identifiers: alphabetic followed by k alphanumerics (_, $, &, …)
Numbers:
integers: 0 or digit from 1-9 followed by digits from 0-9
decimals: integer '.' digits from 0-9
reals: (integer or decimal) 'E' (+ or -) digits from 0-9
complex: '(' real ',' real ')'
We need an expressive notation to specify these patterns!

Operations on languages
A language is a set of strings.
Operation          Definition
Union              L ∪ M = { s | s ∈ L or s ∈ M }
Concatenation      LM = { st | s ∈ L and t ∈ M }
Kleene closure     L* = ∪_{i=0}^{∞} L^i
Positive closure   L+ = ∪_{i=1}^{∞} L^i
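
For finite languages these operations can be tried out directly on Python sets of strings. The sketch below is an illustration only; the function names are assumptions, and because L* is infinite whenever L contains a non-empty string, `closure` computes a bounded approximation up to a chosen power `max_i`.

```python
def union(L, M):
    """Strings that are in L or in M."""
    return L | M

def concat(L, M):
    """Every string of L followed by every string of M."""
    return {s + t for s in L for t in M}

def power(L, i):
    """L^i: i-fold concatenation of L with itself; L^0 is {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def closure(L, max_i, start=0):
    """Bounded approximation: union of L^i for start <= i <= max_i.

    start=0 approximates the Kleene closure, start=1 the positive closure.
    """
    out = set()
    for i in range(start, max_i + 1):
        out |= power(L, i)
    return out
```

With `L = {"a"}`, `closure(L, 3)` contains the empty string while `closure(L, 3, start=1)` does not, which is the only difference between the two closures.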

Regular expressions describe regular languages
Regular expressions over an alphabet Σ:
ε is a RE denoting the set {ε}
If a ∈ Σ, then a is a RE denoting {a}
If r and s are REs denoting L(r) and L(s), then:
(r) is a RE denoting L(r)
(r) | (s) is a RE denoting L(r) ∪ L(s)
(r)(s) is a RE denoting L(r)L(s)
(r)* is a RE denoting L(r)*
Patterns are often specified as regular languages. Notations used to describe a regular language (or a regular set) include both regular expressions and regular grammars.
If we adopt a precedence for operators, the extra parentheses can go away. We assume closure, then concatenation, then alternation as the order of precedence.

Examples
identifier
letter → (a | b | c | … | z | A | B | C | … | Z)
digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
id → letter ( letter | digit )*
numbers
integer → (+ | - | ε) (0 | (1 | 2 | 3 | … | 9) digit*)
decimal → integer . ( digit )*
real → ( integer | decimal ) E (+ | -) digit*
complex → '(' real ',' real ')'
Numbers can get much more complicated. Most programming language tokens can be described with REs. We can use REs to build scanners automatically.
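
As a rough illustration, the definitions on this slide can be transcribed into Python `re` patterns. The transcription assumes the symbols lost between alternatives are alternation bars and that the sign of an integer is optional; the constant names and the helper `matches` are introduced here only for the sketch.

```python
import re

ID = r"[A-Za-z][A-Za-z0-9]*"                     # letter (letter | digit)*
INTEGER = r"[+-]?(?:0|[1-9][0-9]*)"              # optional sign, no leading zeros
DECIMAL = INTEGER + r"\.[0-9]*"                  # integer . digit*
REAL = rf"(?:{DECIMAL}|{INTEGER})E[+-][0-9]*"    # (integer | decimal) E (+ | -) digit*
COMPLEX = rf"\({REAL},{REAL}\)"                  # '(' real ',' real ')'

def matches(pattern, text):
    """True if the whole of text belongs to the language of pattern."""
    return re.fullmatch(pattern, text) is not None
```

Note that, as written on the slide, a real requires an explicit sign after `E` and allows an empty exponent, so `matches(REAL, "1.5E+10")` holds but `matches(REAL, "1.5E10")` does not. A production scanner would probably tighten both points.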

Algebraic properties of REs
r | s = s | r                          (| is commutative)
r | (s | t) = (r | s) | t              (| is associative)
r(st) = (rs)t                          (concatenation is associative)
r(s | t) = rs | rt, (s | t)r = sr | tr (concatenation distributes over |)
εr = r, rε = r                         (ε is the identity for concatenation)
r* = (r | ε)*                          (ε is contained in *)
r** = r*                               (* is idempotent)

Examples
Let Σ = {a,b}
a | b denotes {a,b}
(a | b)(a | b) denotes {aa,ab,ba,bb}
a* denotes {ε,a,aa,aaa,…}
(a | b)* denotes the set of all strings of a's and b's (including ε), i.e., (a | b)* = (a*b*)*
a | a*b denotes {a,b,ab,aab,aaab,aaaab,…}

Recognizers
From a regular expression we can construct a deterministic finite automaton (DFA).
letter → (a | b | c | … | z | A | B | C | … | Z)
digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
id → letter ( letter | digit )*

Code for the recognizer

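Tables for the recognizer
Two tables control the recognizer:
char_class maps each character to a class:
char    a-z     A-Z     0-9    other
value   letter  letter  digit  other
next_state maps a character class (letter, digit, other) and the current state to the next state (table entries lost in extraction).
To change languages, we can just change tables.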
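
A table-driven recognizer for id could look roughly like the sketch below. The state numbering (0 = start, 1 = inside an identifier, 2 = accept, 3 = error) and the contents of `NEXT_STATE` are assumptions made for this example, since the slide's own table does not survive in the transcript; `scan_identifier` is likewise a name invented here.

```python
def char_class(ch):
    """First table: map a character to its class."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"

# Second table (assumed numbering): 0 = start, 1 = in identifier,
# 2 = accept, 3 = error.
NEXT_STATE = {
    "letter": {0: 1, 1: 1},
    "digit":  {0: 3, 1: 1},
    "other":  {0: 3, 1: 2},
}

def scan_identifier(text):
    """Return the identifier at the start of text, or None if there is none."""
    state = 0
    lexeme = ""
    for ch in text + "\0":  # the sentinel falls in class "other"
        state = NEXT_STATE[char_class(ch)][state]
        if state == 1:
            lexeme += ch
        elif state == 2:
            return lexeme
        else:
            return None
    return None  # not reached: the sentinel always ends the loop
```

The driver loop never mentions letters or digits itself, which is the point of the slide: recognizing a different token class only requires different tables.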

Automatic construction
Scanner generators automatically construct code from regular expression-like descriptions:
construct a DFA
use state minimization techniques
emit code for the scanner (table driven or direct code)
A key issue in automation is an interface to the parser.
lex is a scanner generator supplied with UNIX:
emits C code for the scanner
provides macro definitions for each token (used in the parser)

Grammars for regular languages
Provable fact: for any RE r, there exists a grammar g such that L(r) = L(g).
The grammars that generate regular sets are called regular grammars.
Definition: in a regular grammar, all productions have one of two forms:
A → aA
A → a
where A is any non-terminal and a is any terminal symbol.
These are also called type 3 grammars (Chomsky).

More regular languages
Example: the set of strings containing an even number of zeros and an even number of ones.
NB: The states capture whether the number of 0s and the number of 1s seen so far are each even or odd ⇒ 4 possible states.
Note how the RE walks through the DFA. The RE is (00 | 11)*((01 | 10)(00 | 11)*(01 | 10)(00 | 11)*)*
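
The four-state DFA can be sketched by tracking the two parities directly, and the regular expression can be checked against it by brute force on short strings. The function names are assumptions for this example, and the RE is written with the alternation bars that the transcript drops.

```python
import re
from itertools import product

def accepts_even_even(s):
    """DFA whose state is (parity of 0s, parity of 1s); (0, 0) is start and accept."""
    zeros, ones = 0, 0
    for ch in s:
        if ch == "0":
            zeros ^= 1
        elif ch == "1":
            ones ^= 1
        else:
            return False  # symbol outside the alphabet {0, 1}
    return (zeros, ones) == (0, 0)

RE_EVEN = re.compile(r"(00|11)*((01|10)(00|11)*(01|10)(00|11)*)*")

def agree_up_to(n):
    """Check that the DFA and the RE agree on every binary string of length <= n."""
    for length in range(n + 1):
        for chars in product("01", repeat=length):
            s = "".join(chars)
            if accepts_even_even(s) != bool(RE_EVEN.fullmatch(s)):
                return False
    return True
```

An exhaustive check such as `agree_up_to(8)` is not a proof of equivalence, but it is a cheap sanity test when transcribing an automaton into an expression by hand.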

More regular expressions
What about the RE (a | b)*abb?
State s0 has multiple transitions on a! It is a non-deterministic finite automaton.

Finite automata
A non-deterministic finite automaton (NFA) consists of:
a set of states S = { s0 , … , sn }
a set of input symbols Σ (the alphabet)
a transition function move mapping state-symbol pairs to sets of states
a distinguished start state s0
a set of distinguished accepting or final states F
A deterministic finite automaton (DFA) is a special case of an NFA: no state has an ε-transition, and for each state s and input symbol a, there is at most one edge labelled a leaving s.
A DFA accepts x iff there exists a unique path through the transition graph from s0 to an accepting state such that the labels along the edges spell x.
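
An NFA can be simulated directly by tracking the set of states it could currently be in. The sketch below assumes the usual four-state NFA for (a | b)*abb, in which s0 loops on a and b and also moves to s1 on a; the figure is not in the transcript, so the transition table `MOVE` is an assumption, and it contains no ε-transitions.

```python
# Assumed NFA for (a | b)*abb: states 0..3, start state 0, accepting state 3.
MOVE = {
    (0, "a"): {0, 1},  # the non-deterministic choice on a
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
ACCEPTING = {3}

def nfa_accepts(s):
    """Simulate the NFA on s by following all possible paths at once."""
    current = {START}
    for ch in s:
        current = set().union(*(MOVE.get((q, ch), set()) for q in current))
    return bool(current & ACCEPTING)
```

The variable `current` is exactly the "set of simultaneous states" that the subset construction later turns into a single DFA state.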

DFAs and NFAs are equivalent
DFAs are clearly a subset of NFAs.
Any NFA can be converted into a DFA by simulating sets of simultaneous states:
each DFA state corresponds to a set of NFA states
possible exponential blowup

NFA to DFA using the subset construction: example 1

Constructing a DFA from a regular expression
RE → NFA: build an NFA for each term; connect with ε-moves
NFA → DFA: simulate the NFA using the subset construction
DFA → minimized DFA: merge equivalent states
DFA → RE: construct R^k_{ij} = R^{k-1}_{ik} (R^{k-1}_{kk})* R^{k-1}_{kj} ∪ R^{k-1}_{ij}

RE to NFA

RE to NFA example: (a | b)*abb

NFA to DFA: the subset construction
Input: NFA N
Output: DFA D with states SD and transitions TD such that L(D) = L(N)
Method: Let s be a state in N and P be a set of states. Use the following operations:
ε-closure(s): set of states of N reachable from s by ε transitions alone
ε-closure(P): set of states of N reachable from some s in P by ε transitions alone
move(P,a): set of states of N to which there is a transition on input a from some s in P
  add state P = ε-closure(s0) unmarked to SD
  while ∃ unmarked state P in SD
    mark P
    for each input symbol a
      U = ε-closure(move(P,a))
      if U ∉ SD then add U unmarked to SD
      TD[P,a] = U
    end for
  end while
ε-closure(s0) is the start state of D.
A state of D is accepting if it contains an accepting state of N.
(Some terms renamed from the Palsberg/Hosking slide.)

NFA to DFA using subset construction: example 2

Limits of regular languages
Not all languages are regular! One cannot construct DFAs to recognize these languages:
L = { p^k q^k }
L = { w c w^r | w ∈ Σ* }
Here k ranges over arbitrary positive integers and w^r denotes the reverse of w.
In general, DFAs cannot count!
However, one can construct DFAs for:
Alternating 0's and 1's: (ε | 1)(01)*(ε | 0)
Sets of pairs of 0's and 1's: (01 | 10)+
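
The contrast can be made concrete with a small sketch. Recognizing p^k q^k is easy with an unbounded counter, which is exactly what a finite automaton lacks, while the alternating language needs only a regular expression. The function name `is_pk_qk` and the constant `ALTERNATING` are assumptions for this example, which takes k ≥ 1.

```python
import re

def is_pk_qk(s):
    """Recognize p^k q^k for k >= 1 using integer counters of unbounded size."""
    i = 0
    k = 0
    while i < len(s) and s[i] == "p":
        k += 1
        i += 1
    j = 0
    while i < len(s) and s[i] == "q":
        j += 1
        i += 1
    # The whole input must be consumed and both counts must agree.
    return i == len(s) and k == j and k >= 1

# (ε | 1)(01)*(ε | 0): alternating 0s and 1s, which needs no counting.
ALTERNATING = re.compile(r"1?(01)*0?")
```

A DFA with n states can distinguish at most n different values of k, so some pair of longer prefixes must land in the same state; the counter in `is_pk_qk` has no such bound.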

So, what is hard?
Certain language features can cause problems:
Reserved words: PL/I had no reserved words
  if then then then = else; else else = then
Significant blanks: FORTRAN and Algol68 ignore blanks
  do 10 i = 1,25
  do 10 i = 1.25
String constants: special characters in strings (newline, tab, quote, comment delimiter)
Finite limits: some languages limit identifier lengths; add states to count length
  FORTRAN 66: 6 characters(!)
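
The FORTRAN pair shows why ignoring blanks is awkward: with blanks removed, the first line is a DO loop and the second assigns 1.25 to a variable named do10i, and the scanner cannot tell which until it has looked past the `=`. The sketch below is a deliberately simplified illustration; `classify_fortran_statement` is a hypothetical helper that handles only lower-case input and would be fooled by commas inside parentheses.

```python
import re

def classify_fortran_statement(line):
    """Very rough classification of a statement after removing blanks."""
    text = line.replace(" ", "")
    # A DO loop has the shape do <label> <var> = <expr> , <expr>
    m = re.fullmatch(r"do(\d+)([a-z][a-z0-9]*)=([^,]+),(.+)", text)
    if m:
        return ("do_loop", m.group(1), m.group(2))
    # Otherwise treat <name> = <expr> as an assignment
    m = re.fullmatch(r"([a-z][a-z0-9]*)=(.+)", text)
    if m:
        return ("assignment", m.group(1))
    return ("unknown",)
```

The decision hinges on a comma that may appear arbitrarily far to the right, which is the kind of lookahead a simple left-to-right scanner does not have.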

How bad can it get? Compiler needs context to distinguish variables from control constructs!

What you should know!
What are the key responsibilities of a scanner?
What is a formal language? What are operators over languages?
What is a regular language?
Why are regular languages interesting for defining scanners?
What is the difference between a deterministic and a non-deterministic finite automaton?
How can you generate a DFA recognizer from a regular expression?
Why aren't regular languages expressive enough for parsing?

Can you answer these questions?
Why do compilers separate scanning from parsing?
Why doesn't NFA → DFA translation normally result in an exponential increase in the number of states?
Why is it necessary to minimize states after translating an NFA to a DFA?
How would you program a scanner for a language like FORTRAN?

License
http://creativecommons.org/licenses/by-sa/2.5/
Attribution-ShareAlike 2.5
You are free:
to copy, distribute, display, and perform the work
to make derivative works
to make commercial use of the work
Under the following conditions:
Attribution. You must attribute the work in the manner specified by the author or licensor.
Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under a license identical to this one.
For any reuse or distribution, you must make clear to others the license terms of this work.
Any of these conditions can be waived if you get permission from the copyright holder.
Your fair use and other rights are in no way affected by the above.