Topic #3: Lexical Analysis
EE 456 – Compiling Techniques
Prof. Carl Sable
Fall 2003


Lexical Analyzer and Parser

Why Separate?
Reasons to separate lexical analysis from parsing:
– Simpler design
– Improved efficiency
– Portability
Tools exist to help implement lexical analyzers and parsers independently.

Tokens, Lexemes, and Patterns
– Tokens include keywords, operators, identifiers, constants, literal strings, and punctuation symbols
– A lexeme is a sequence of characters in the source program representing a token
– A pattern is a rule describing the set of lexemes that can represent a particular token

Attributes
Attributes provide additional information about tokens. Technically speaking, lexical analyzers usually provide a single attribute per token (it might be a pointer into the symbol table).

Buffer
Most lexical analyzers use a buffer. Often buffers are divided into two N-character halves. Two pointers are used to indicate the start and end of the current lexeme. If a pointer walks past the end of either half of the buffer, the other half is reloaded. A sentinel character can be used to decrease the number of checks necessary.
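As a minimal sketch of this two-half scheme (assuming a tiny half size and reading from a string instead of a file; the names `load`, `scanner_init`, and `next_char` are ours, not from the slides):

```c
#include <string.h>

#define HALF 16
static char buf[2 * HALF + 2];   /* two halves, each followed by a sentinel */
static const char *src;          /* stand-in for the source file */
static char *forward;

/* Reload one half from the "file" and terminate it with the sentinel. */
static void load(char *half)
{
    size_t n = strlen(src);
    if (n > HALF) n = HALF;
    memcpy(half, src, n);
    src += n;
    half[n] = '\0';              /* sentinel */
}

static void scanner_init(const char *text)
{
    src = text;
    load(buf);                   /* fill the first half */
    forward = buf;
}

/* Advance forward one character.  Hitting a sentinel either means the
   end of a half (so reload the other half) or the true end of input,
   so the common case costs only one test per character. */
static int next_char(void)
{
    int c = (unsigned char)*forward++;
    if (c == '\0') {
        if (forward == buf + HALF + 1) {            /* end of first half */
            load(buf + HALF + 1);
            forward = buf + HALF + 1;
            c = (unsigned char)*forward++;
        } else if (forward == buf + 2 * HALF + 2) { /* end of second half */
            load(buf);
            forward = buf;
            c = (unsigned char)*forward++;
        }
        if (c == '\0')
            return -1;                              /* real end of input */
    }
    return c;
}
```

With HALF = 16, scanning a 36-character input exercises two reloads without the caller ever seeing the buffer boundaries.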

Strings and Languages
– Alphabet – any finite set of symbols (e.g. ASCII, a binary alphabet, or a set of tokens)
– String – a finite sequence of symbols drawn from an alphabet
– Language – a set of strings over a fixed alphabet
Other terms relating to strings: prefix; suffix; substring; proper prefix, suffix, or substring (non-empty, not the entire string); subsequence

Operations on Languages
– Union: L ∪ M = { s | s is in L or s is in M }
– Concatenation: LM = { st | s is in L and t is in M }
– Kleene closure: L* – zero or more concatenations of L
– Positive closure: L+ – one or more concatenations of L

Regular Expressions
Defined over an alphabet Σ:
– ε is a regular expression denoting { ε }, the set containing the empty string
– If a is a symbol in Σ, then a is a regular expression denoting { a }, the set containing the string a
– If r and s are regular expressions denoting the languages L(r) and L(s), then:
 – (r)|(s) is a regular expression denoting L(r) ∪ L(s)
 – (r)(s) is a regular expression denoting L(r)L(s)
 – (r)* is a regular expression denoting (L(r))*
 – (r) is a regular expression denoting L(r)
Precedence: * binds tightest, then concatenation, then |; all are left associative.

Regular Definitions
Can give "names" to regular expressions. Convention: names in boldface (to distinguish them from symbols).
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*
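As an illustration, the id definition above can be checked directly in C with the ctype classifiers; the function name `is_id` is ours, not from the slides:

```c
#include <ctype.h>

/* Returns 1 if s matches id -> letter (letter | digit)*, else 0. */
int is_id(const char *s)
{
    if (!isalpha((unsigned char)*s))      /* must start with a letter */
        return 0;
    for (s++; *s; s++)
        if (!isalnum((unsigned char)*s))  /* then letters or digits only */
            return 0;
    return 1;
}
```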

Notational Shorthands
– One or more instances: r+ denotes rr*
– Zero or one instance: r? denotes r|ε
– Character classes: [a-z] denotes a|b|…|z
digit → [0-9]
digits → digit+
optional_fraction → (. digits)?
optional_exponent → (E(+|-)? digits)?
num → digits optional_fraction optional_exponent
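A hand-coded checker for the num definition makes the optional parts concrete (the name `is_num` is ours; this is a sketch, not the slides' code):

```c
#include <ctype.h>

/* Returns 1 iff s matches num -> digits (. digits)? (E(+|-)? digits)? */
int is_num(const char *s)
{
    if (!isdigit((unsigned char)*s)) return 0;
    while (isdigit((unsigned char)*s)) s++;   /* digits */
    if (*s == '.') {                          /* optional_fraction */
        s++;
        if (!isdigit((unsigned char)*s)) return 0;
        while (isdigit((unsigned char)*s)) s++;
    }
    if (*s == 'E') {                          /* optional_exponent */
        s++;
        if (*s == '+' || *s == '-') s++;
        if (!isdigit((unsigned char)*s)) return 0;
        while (isdigit((unsigned char)*s)) s++;
    }
    return *s == '\0';
}
```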

Limitations
Regular expressions cannot describe balanced or nested constructs
– Example: all valid strings of balanced parentheses
– This can be done with a CFG
They cannot describe repeated strings
– Example: { wcw | w is a string of a's and b's }
– This cannot be denoted with a CFG either!

Grammar Fragment (Pascal)
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num

Related Regular Definitions
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ (. digit+)? (E(+|-)? digit+)?
delim → blank | tab | newline
ws → delim+

Tokens and Attributes
Regular Expression   Token   Attribute Value
ws                   -       -
if                   if      -
then                 then    -
else                 else    -
id                   id      pointer to table entry
num                  num     pointer to table entry
<                    relop   LT
<=                   relop   LE
=                    relop   EQ
<>                   relop   NE
>                    relop   GT
>=                   relop   GE

Transition Diagrams
– A stylized flowchart
– Transition diagrams consist of states connected by edges
– Edges leaving a state s are labeled with input characters that may occur after reaching state s
– Assumed to be deterministic
– There is one start state and at least one accepting (final) state
– Some states may have associated actions
– At some final states, we need to retract a character

Transition Diagram for “relop”
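The relop diagram (an image in the original slides) can be sketched in C as a small switch over the first one or two characters; `relop` and the `*len` out-parameter are our names, with `*len` playing the role of retracting the lookahead character:

```c
typedef enum { LT, LE, EQ, NE, GT, GE, NONE } relop_t;

/* Recognize relop -> < | <= | = | <> | > | >= at the start of s.
   Sets *len to the number of characters consumed; when one character
   was read beyond the lexeme, the shorter length "retracts" it. */
relop_t relop(const char *s, int *len)
{
    switch (s[0]) {
    case '<':
        if (s[1] == '=') { *len = 2; return LE; }
        if (s[1] == '>') { *len = 2; return NE; }
        *len = 1; return LT;          /* retract the lookahead char */
    case '=':
        *len = 1; return EQ;
    case '>':
        if (s[1] == '=') { *len = 2; return GE; }
        *len = 1; return GT;          /* retract */
    default:
        *len = 0; return NONE;
    }
}
```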

Identifiers and Keywords
Identifiers and keywords share a transition diagram:
– After reaching the accepting state, code determines if the lexeme is a keyword or an identifier
– Easier than encoding exceptions in the diagram
A simple technique is to appropriately initialize the symbol table with keywords.
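A minimal sketch of the lookup that runs after the shared diagram accepts (a flat array stands in for the pre-loaded symbol table; `keywords` and `is_keyword` are our names):

```c
#include <string.h>

/* Stand-in for a symbol table pre-loaded with the language's keywords. */
static const char *keywords[] = { "if", "then", "else" };

/* Returns 1 if lexeme is a reserved keyword, 0 for an ordinary identifier. */
int is_keyword(const char *lexeme)
{
    size_t i;
    for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return 1;
    return 0;
}
```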

Numbers

Order of Transition Diagrams
Transition diagrams are tested in order: diagrams with low-numbered start states are tried before diagrams with high-numbered start states. The order influences the efficiency of the lexical analyzer.

Trying Transition Diagrams
int next_td(void)
{
    switch (start) {
    case 0:  start = 9;  break;
    case 9:  start = 12; break;
    case 12: start = 20; break;
    case 20: start = 25; break;
    case 25: recover(); break;
    default: error("invalid start state");
    }
    /* Possibly additional actions here */
    return start;
}

Finding the Next Token
token nexttoken(void)
{
    while (1) {
        switch (state) {
        case 0:
            c = nextchar();
            if (c == ' ' || c == '\t' || c == '\n') {
                state = 0;
                lexeme_beginning++;
            }
            else if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else state = next_td();
            break;
        … /* 27 other cases here */

The End of a Token
token nexttoken(void)
{
    while (1) {
        switch (state) {
        … /* First 19 cases */
        case 19:
            retract();
            install_num();
            return NUM;
        … /* Final 8 cases */

Finite Automata
Generalized transition diagrams that act as "recognizers" for a language. Can be nondeterministic (NFA) or deterministic (DFA):
– NFAs can have ε-transitions; DFAs cannot
– NFAs can have multiple edges with the same symbol leaving a state; DFAs cannot
– Both can recognize exactly what regular expressions can denote

NFAs
An NFA consists of:
– A set of states S
– A set of input symbols Σ (the input alphabet)
– A transition function move that maps state, symbol pairs to a set of states
– A single start state s0
– A set of accepting (or final) states F
An NFA accepts a string s if and only if there exists a path from the start state to an accepting state such that the edge labels spell out s.

Transition Tables
         Input Symbol
State    a        b
0        {0,1}    {0}
1        –        {2}
2        –        {3}

DFAs
– No state has an ε-transition
– For each state s and input symbol a, there is at most one edge labeled a leaving s

Thompson’s Construction
A method of converting a regular expression into an NFA. Start with two simple rules:
– For ε, construct an NFA with a single ε-transition from its start state to its accepting state
– For each a in Σ, construct an NFA with a single a-transition from its start state to its accepting state
Then inductively apply more complex rules until we obtain an NFA for the entire expression.

Complex Rule, Part 1 For the regular expression s|t, such that N(s) and N(t) are NFAs for s and t, construct the following NFA N(s|t) :

Complex Rule, Part 2
For the regular expression st, construct the composite NFA N(st) by connecting N(s) and N(t) in series.

Complex Rule, Part 3
For the regular expression s*, construct the composite NFA N(s*).

Complex Rule, Part 4 For the parenthesized regular expression (s), use N(s) itself as the NFA

Example: r = (a|b)*abb

Functions ε-closure and move
– ε-closure(s) is the set of NFA states reachable from NFA state s on ε-transitions alone
– ε-closure(T) is the set of NFA states reachable from any NFA state s in T on ε-transitions alone
– move(T,a) is the set of NFA states to which there is a transition on input a from any NFA state s in T

Computing ε-closure
push all states in T onto stack
initialize ε-closure(T) to T
while stack is not empty
    pop t from top of stack
    for each state u with an ε-transition from t
        if u is not in ε-closure(T) then
            add u to ε-closure(T)
            push u onto stack
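The stack algorithm above can be sketched in C over a small hand-coded ε-graph (the four-state graph, the adjacency matrix `eps`, and the name `eps_closure` are ours, chosen purely for illustration):

```c
#define NSTATES 4

/* eps[t][u] == 1 iff there is an ε-transition from t to u.
   Here: state 0 has ε-edges to 1 and 2, and state 2 has one to 3. */
static const int eps[NSTATES][NSTATES] = {
    {0, 1, 1, 0},
    {0, 0, 0, 0},
    {0, 0, 0, 1},
    {0, 0, 0, 0},
};

/* closure[] is an in/out bit set: on entry it holds T, on return
   ε-closure(T).  Follows the stack algorithm step for step. */
void eps_closure(int closure[NSTATES])
{
    int stack[NSTATES], top = 0, t, u;
    for (t = 0; t < NSTATES; t++)
        if (closure[t]) stack[top++] = t;   /* push all states in T */
    while (top > 0) {
        t = stack[--top];                   /* pop t */
        for (u = 0; u < NSTATES; u++)
            if (eps[t][u] && !closure[u]) { /* ε-edge to a new state */
                closure[u] = 1;             /* add u to ε-closure(T) */
                stack[top++] = u;           /* push u */
            }
    }
}
```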

Subset Construction (NFA to DFA)
initialize Dstates to unmarked ε-closure(s0)
while there is an unmarked state T in Dstates
    mark T
    for each input symbol a
        U := ε-closure(move(T,a))
        if U is not in Dstates
            add U as unmarked state to Dstates
        Dtran[T,a] := U
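A sketch of this algorithm on the 4-state NFA from the transition-table slide, with DFA states represented as bit masks of NFA states; since that NFA has no ε-transitions, ε-closure(T) = T here. All names (`nmove2`, `dstates`, `dtran2`, `subset_construct`) are ours:

```c
#define MAXD 16

/* NFA move table for (a|b)*abb as bit masks; columns: input a, input b. */
static const unsigned nmove2[4][2] = {
    {0x3, 0x1},   /* 0 --a--> {0,1},  0 --b--> {0} */
    {0x0, 0x4},   /* 1 --b--> {2} */
    {0x0, 0x8},   /* 2 --b--> {3} */
    {0x0, 0x0},
};

unsigned dstates[MAXD];   /* discovered DFA states (bit masks) */
int      dtran2[MAXD][2]; /* DFA transition table */
int      ndstates;

/* Return the index of U in dstates, adding it (unmarked) if new. */
static int find_or_add(unsigned U)
{
    int i;
    for (i = 0; i < ndstates; i++)
        if (dstates[i] == U) return i;
    dstates[ndstates] = U;
    return ndstates++;
}

void subset_construct(void)
{
    int T, a, s;
    ndstates = 0;
    find_or_add(0x1);                 /* ε-closure(s0) = {0} */
    for (T = 0; T < ndstates; T++)    /* the loop index does the marking */
        for (a = 0; a < 2; a++) {
            unsigned U = 0;
            for (s = 0; s < 4; s++)   /* U := move(T, a) */
                if (dstates[T] & (1u << s))
                    U |= nmove2[s][a];
            dtran2[T][a] = find_or_add(U);
        }
}
```

Running it yields four DFA states {0}, {0,1}, {0,2}, {0,3}; the accepting ones are those whose mask contains NFA state 3.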

Constructed DFA

Simulating a DFA
s := s0
c := nextchar
while c != eof do
    s := move(s, c)
    c := nextchar
end
if s is in F then return "yes"
else return "no"
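As a concrete sketch, the loop above can be run over the DFA for (a|b)*abb (states numbered 0–3 by how much of "abb" has been matched; state 3 accepting; `dtran` and `dfa_accepts` are our names):

```c
/* DFA transition table; columns: input a, input b. */
static const int dtran[4][2] = {
    {1, 0},   /* state 0: start */
    {1, 2},   /* state 1: saw "a"  */
    {1, 3},   /* state 2: saw "ab" */
    {1, 0},   /* state 3: saw "abb" (accepting) */
};

/* Returns 1 iff x, a string over {a,b}, is accepted. */
int dfa_accepts(const char *x)
{
    int s = 0;                      /* s := s0 */
    for (; *x; x++)                 /* while c != eof */
        s = dtran[s][*x - 'a'];     /* s := move(s, c) */
    return s == 3;                  /* is s in F? */
}
```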

Simulating an NFA
S := ε-closure({s0})
a := nextchar
while a != eof do
    S := ε-closure(move(S,a))
    a := nextchar
if S ∩ F != Ø then return "yes"
else return "no"
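The same loop can be sketched for the NFA from the transition-table slide (states 0–3, accepting state 3, language (a|b)*abb), tracking the current state set S as a bit mask; with no ε-transitions, ε-closure is the identity here. Names `nmove` and `nfa_accepts` are ours:

```c
/* NFA move table as bit masks; columns: input a, input b. */
static const unsigned nmove[4][2] = {
    {0x3, 0x1},   /* 0 --a--> {0,1},  0 --b--> {0} */
    {0x0, 0x4},   /* 1 --b--> {2} */
    {0x0, 0x8},   /* 2 --b--> {3} */
    {0x0, 0x0},   /* 3 has no outgoing edges */
};

int nfa_accepts(const char *x)
{
    unsigned S = 0x1, next;         /* S := {s0} */
    int s;
    for (; *x; x++) {               /* while a != eof */
        next = 0;
        for (s = 0; s < 4; s++)     /* S := move(S, a) */
            if (S & (1u << s))
                next |= nmove[s][*x - 'a'];
        S = next;
    }
    return (S & 0x8) != 0;          /* does S intersect F = {3}? */
}
```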

Space/Time Tradeoff (Worst Case)
        Space       Time
NFA     O(|r|)      O(|r|·|x|)
DFA     O(2^|r|)    O(|x|)

Simulating a Regular Expression
First use Thompson’s Construction to convert the RE to an NFA. Then there are two choices:
– Use subset construction to convert the NFA to a DFA, then simulate the DFA
– Simulate the NFA directly
You won’t have to worry about any of this while programming; Lex will take care of it!