Winter 2007SEG2101 Chapter 81 Chapter 8 Lexical Analysis.

Slides:



Advertisements
Similar presentations
4b Lexical analysis Finite Automata
Advertisements

COMP-421 Compiler Design Presented by Dr Ioanna Dionysiou.
Chapter 2 Lexical Analysis Nai-Wei Lin. Lexical Analysis Lexical analysis recognizes the vocabulary of the programming language and transforms a string.
LEXICAL ANALYSIS Phung Hua Nguyen University of Technology 2006.
Lexical Analyzer Second lecture. Compiler Construction Outline Informal sketch of lexical analysis Identifies tokens in input string Issues in lexical.
Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.
1 The scanning process Main goal: recognize words/tokens Snapshot: At any point in time, the scanner has read some input and is on the way to identifying.
1 Chapter 2: Scanning 朱治平. Scanner (or Lexical Analyzer) the interface between source & compiler could be a separate pass and places its output on an.
2. Lexical Analysis Prof. O. Nierstrasz
1 Foundations of Software Design Lecture 23: Finite Automata and Context-Free Grammars Marti Hearst Fall 2002.
Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.
Chapter 3 Lexical Analysis
Topic #3: Lexical Analysis
Finite-State Machines with No Output Longin Jan Latecki Temple University Based on Slides by Elsa L Gunter, NJIT, and by Costas Busch Costas Busch.
Finite-State Machines with No Output
Lexical Analysis Natawut Nupairoj, Ph.D.
1 Outline Informal sketch of lexical analysis –Identifies tokens in input string Issues in lexical analysis –Lookahead –Ambiguities Specifying lexers –Regular.
Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Machine-independent code improvement Target code generation Machine-specific.
어휘분석 (Lexical Analysis). Overview Main task: to read input characters and group them into “ tokens. ” Secondary tasks: –Skip comments and whitespace;
Lecture # 3 Chapter #3: Lexical Analysis. Role of Lexical Analyzer It is the first phase of compiler Its main task is to read the input characters and.
Other Issues - § 3.9 – Not Discussed More advanced algorithm construction – regular expression to DFA directly.
Topic #3: Lexical Analysis EE 456 – Compiling Techniques Prof. Carl Sable Fall 2003.
COMP 3438 – Part II - Lecture 2: Lexical Analysis (I) Dr. Zili Shao Department of Computing The Hong Kong Polytechnic Univ. 1.
Lexical Analyzer (Checker)
Overview of Previous Lesson(s) Over View  An NFA accepts a string if the symbols of the string specify a path from the start to an accepting state.
4b 4b Lexical analysis Finite Automata. Finite Automata (FA) FA also called Finite State Machine (FSM) –Abstract model of a computing entity. –Decides.
SCRIBE SUBMISSION GROUP 8 Date: 7/8/2013 By – IKHAR SUSHRUT MEGHSHYAM 11CS10017 Lexical Analyser Constructing Tokens State-Transition Diagram S-T Diagrams.
CS412/413 Introduction to Compilers Radu Rugina Lecture 4: Lexical Analyzers 28 Jan 02.
1 November 1, November 1, 2015November 1, 2015November 1, 2015 Azusa, CA Sheldon X. Liang Ph. D. Computer Science at Azusa Pacific University Azusa.
Compiler Construction 2 주 강의 Lexical Analysis. “get next token” is a command sent from the parser to the lexical analyzer. On receipt of the command,
1 Languages and Compilers (SProg og Oversættere) Lexical analysis.
Lexical Analysis: Finite Automata CS 471 September 5, 2007.
Review: Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Intermediate code generator Code optimizer Code generator Symbol.
Pembangunan Kompilator.  A recognizer for a language is a program that takes a string x, and answers “yes” if x is a sentence of that language, and.
By Neng-Fa Zhou Lexical Analysis 4 Why separate lexical and syntax analyses? –simpler design –efficiency –portability.
By Neng-Fa Zhou Programming language syntax 4 Three aspects of languages –Syntax How are sentences formed? –Semantics What does a sentence mean? –Pragmatics.
Joey Paquet, 2000, Lecture 2 Lexical Analysis.
Fall 2003CS416 Compiler Design1 Lexical Analyzer Lexical Analyzer reads the source program character by character to produce tokens. Normally a lexical.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
Lexical Analysis.
1st Phase Lexical Analysis
Chapter 2 Scanning. Dr.Manal AbdulazizCS463 Ch22 The Scanning Process Lexical analysis or scanning has the task of reading the source program as a file.
Overview of Previous Lesson(s) Over View  A token is a pair consisting of a token name and an optional attribute value.  A pattern is a description.
Scribing K SAMPATH KUMAR 11CS10022 scribing. Definition of a Regular Expression R is a regular expression if it is: 1.a for some a in the alphabet ,
Chapter 5 Finite Automata Finite State Automata n Capable of recognizing numerous symbol patterns, the class of regular languages n Suitable for.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
LECTURE 5 Scanning. SYNTAX ANALYSIS We know from our previous lectures that the process of verifying the syntax of the program is performed in two stages:
Deterministic Finite Automata Nondeterministic Finite Automata.
CS412/413 Introduction to Compilers Radu Rugina Lecture 3: Finite Automata 25 Jan 02.
Lecture 2 Compiler Design Lexical Analysis By lecturer Noor Dhia
Compilers Lexical Analysis 1. while (y < z) { int x = a + b; y += x; } 2.
Department of Software & Media Technology
COMP 3438 – Part II - Lecture 3 Lexical Analysis II Par III: Finite Automata Dr. Zili Shao Department of Computing The Hong Kong Polytechnic Univ. 1.
Lexical Analyzer in Perspective
Finite automate.
Chapter 3 Lexical Analysis.
Lecture 2 Lexical Analysis Joey Paquet, 2000, 2002, 2012.
Lexical analysis Finite Automata
Compilers Welcome to a journey to CS419 Lecture5: Lexical Analysis:
Two issues in lexical analysis
Recognizer for a Language
Lexical Analysis Why separate lexical and syntax analyses?
Review: Compiler Phases:
Lexical Analysis Lecture 3-4 Prof. Necula CS 164 Lecture 3.
Recognition of Tokens.
Finite Automata.
4b Lexical analysis Finite Automata
Chapter 3. Lexical Analysis (2)
Other Issues - § 3.9 – Not Discussed
4b Lexical analysis Finite Automata
Presentation transcript:

Winter 2007SEG2101 Chapter 81 Chapter 8 Lexical Analysis

Winter 2007SEG2101 Chapter 82 Contents The role of the lexical analyzer Specification of tokens Finite state machines From a regular expressions to an NFA Convert NFA to DFA Transforming grammars and regular expressions Transforming automata to grammars Language for specifying lexical analyzers

Winter 2007SEG2101 Chapter 83 The Role of Lexical Analyzer Lexical analyzer is the first phase of a compiler. Its main task is to read input characters and produce as output a sequence of tokens that parser uses for syntax analysis.

Winter 2007SEG2101 Chapter 84 Issues in Lexical Analysis There are several reasons for separating the analysis phase of compiling into lexical analysis and parsing: –Simpler design –Compiler efficiency –Compiler portability Specialized tools have been designed to help automate the construction of lexical analyzer and parser when they are separated.

Winter 2007SEG2101 Chapter 85 Tokens, Patterns, Lexemes A lexeme is a sequence of characters in the source program that is matched by the pattern for a token. A lexeme is a basic lexical unit of a language comprising one or several words, the elements of which do not separately convey the meaning of the whole. The lexemes of a programming language include its identifier, literals, operators, and special words. A token of a language is a category of its lexemes. A pattern is a rule describing the set of lexemes that can represent as particular token in source program.

Winter 2007SEG2101 Chapter 86 Examples of Tokens const pi = ; The substring pi is a lexeme for the token “identifier.”

Winter 2007SEG2101 Chapter 87 Lexeme and Token semicolon; int_literal17 plus_op+ identifierCount multi_op* int_literal2 equal_sign= IdentifierIndex TokensLexemes Index = 2 * count +17;

Winter 2007SEG2101 Chapter 88 Lexical Errors Few errors are discernible at the lexical level alone, because a lexical analyzer has a very localized view of a source program. Let some other phase of compiler handle any error. Panic mode Error recovery

Winter 2007SEG2101 Chapter 89 Specification of Tokens Regular expressions are an important notation for specifying patterns. Operation on languages Regular expressions Regular definitions Notational shorthands

Winter 2007SEG2101 Chapter 810 Operations on Languages

Winter 2007SEG2101 Chapter 811 Regular Expressions Regular expression is a compact notation for describing string. In Pascal, an identifier is a letter followed by zero or more letter or digits  letter(letter|digit)* | : or * : zero or more instance of a(a|d) *

Winter 2007SEG2101 Chapter 812 Rules  is a regular expression that denotes {  }, the set containing empty string. If a is a symbol in , then a is a regular expression that denotes {a}, the set containing the string a. Suppose r and s are regular expressions denoting the language L(r) and L(s), then –(r) |(s) is a regular expression denoting L(r)  L(s). –(r)(s) is regular expression denoting L (r) L(s). –(r) * is a regular expression denoting (L (r) )*. –(r) is a regular expression denoting L (r).

Winter 2007SEG2101 Chapter 813 Precedence Conventions The unary operator * has the highest precedence and is left associative. Concatenation has the second highest precedence and is left associative. | has the lowest precedence and is left associative. (a)|(b)*(c)  a|b*c

Winter 2007SEG2101 Chapter 814 Example of Regular Expressions

Winter 2007SEG2101 Chapter 815 Properties of Regular Expression

Winter 2007SEG2101 Chapter 816 Regular Definitions If  is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form: d 1  r 1 d 2  r 2... d n  r n where each d i is a distinct name, and each r i is a regular expression over the symbols in  {d 1,d 2,…,d i-1 }, i.e., the basic symbols and the previously defined names.

Winter 2007SEG2101 Chapter 817 Examples of Regular Definitions Example 3.5. Unsigned numbers

Winter 2007SEG2101 Chapter 818 Notational Shorthands

Winter 2007SEG2101 Chapter 819 Finite Automata A recognizer for a language is a program that takes as input a string x and answer “yes” if x is a sentence of the language and “no” otherwise. We compile a regular expression into a recognizer by constructing a generalized transition diagram called a finite automaton. A finite automaton can be deterministic or nondeterministic, where nondeterministic means that more than one transition out of a state may be possible on the same input symbol.

Winter 2007SEG2101 Chapter 820 Nondeterministic Finite Automata (NFA) A set of states S A set of input symbols  A transition function move that maps state- symbol pairs to sets of states A state s 0 that is distinguished as the start (initial) state A set of states F distinguished as accepting (final) states.

Winter 2007SEG2101 Chapter 821 NFA An NFA can be represented diagrammatically by a labeled directed graph, called a transition graph, in which the nodes are the states and the labeled edges represent the transition function. (a|b)*abb

Winter 2007SEG2101 Chapter 822 NFA Transition Table The easiest implementation is a transition table in which there is a row for each state and a column for each input symbol and , if necessary.

Winter 2007SEG2101 Chapter 823 Example of NFA

Winter 2007SEG2101 Chapter 824 Deterministic Finite Automata (DFA) A DFA is a special case of a NFA in which –no state has an  -transition –for each state s and input symbol a, there is at most one edge labeled a leaving s.

Winter 2007SEG2101 Chapter 825 Simulating a DFA

Winter 2007SEG2101 Chapter 826 Example of DFA

Winter 2007SEG2101 Chapter 827 Conversion of an NFA into DFA Subset construction algorithm is useful for simulating an NFA by a computer program. In the transition table of an NFA, each entry is a set of states; in the transition table of a DFA, each entry is just a single state. The general idea behind the NFA-to-DFA construction is that each DFA state corresponds to a set of NFA states. The DFA uses its state to keep track of all possible states the NFA can be in after reading each input symbol.

Winter 2007SEG2101 Chapter 828 Subset Construction - constructing a DFA from an NFA Input: An NFA N. Output: A DFA D accepting the same language. Method: We construct a transition table Dtran for D. Each DFA state is a set of NFA states and we construct Dtran so that D will simulate “in parallel” all possible moves N can make on a given input string.

Winter 2007SEG2101 Chapter 829 Subset Construction (II) s represents an NFA state T represents a set of NFA states.

Winter 2007SEG2101 Chapter 830 Subset Construction (III)

Winter 2007SEG2101 Chapter 831 Subset Construction (IV) (ε-closure computation)

Winter 2007SEG2101 Chapter 832 Example

Winter 2007SEG2101 Chapter 833 Example (II)

Winter 2007SEG2101 Chapter 834 Example (III)

Winter 2007SEG2101 Chapter 835 Minimizing the number of states in DFA Minimize the number of states of a DFA by finding all groups of states that can be distinguished by some input string. Each group of states that cannot be distinguished is then merged into a single state. Algorithm 3.6 Aho page 142

Winter 2007SEG2101 Chapter 836 Minimizing the number of states in DFA (II)

Winter 2007SEG2101 Chapter 837 Construct New Partition An example in class

Winter 2007SEG2101 Chapter 838 From Regular Expression to NFA Thompson’s construction - an NFA from a regular expression Input: a regular expression r over an alphabet . Output: an NFA N accepting L(r)

Winter 2007SEG2101 Chapter 839 Method First parse r into its constituent subexpressions. Construct NFA’s for each of the basic symbols in r. –for  –for a in 

Winter 2007SEG2101 Chapter 840 Method (II) For the regular expression s|t, For the regular expression st,

Winter 2007SEG2101 Chapter 841 Method (III) For the regular expression s*, For the parenthesized regular expression (s), use N(s) itself as the NFA. Every time we construct a new state, we give it a distinct name.

Winter 2007SEG2101 Chapter 842 Example - construct N(r) for r=(a|b)*abb

Winter 2007SEG2101 Chapter 843 Example (II)

Winter 2007SEG2101 Chapter 844 Example (III)

Winter 2007SEG2101 Chapter 845 Regular Expressions  Grammars More in class …

Winter 2007SEG2101 Chapter 846 Grammars  Regular Expressions More in class …

Winter 2007SEG2101 Chapter 847 Automata  Grammars More in class …

Winter 2007SEG2101 Chapter 848 A Language for Specifying Lexical Analyzer yylex()

Winter 2007SEG2101 Chapter 849 Simple Lex Example int num_lines = 0, num_chars = 0; % \n ++num_lines; ++num_chars;. ++num_chars; % main() { yylex(); printf( "# of lines = %d, # of chars = %d\n", num_lines, num_chars ); }

Winter 2007SEG2101 Chapter 850 %{ #include /* need this for the call to atof() below */ #include /* need this for printf(), fopen() and stdin below */ %} DIGIT [0-9] ID [a-z][a-z0-9]* % {DIGIT}+ { printf("An integer: %s (%d)\n", yytext, atoi(yytext)); } {DIGIT}+"."{DIGIT}* { printf("A float: %s (%g)\n", yytext, atof(yytext)); } if|then|begin|end|procedure|function { printf("A keyword: %s\n", yytext); } {ID} printf("An identifier: %s\n", yytext); "+"|"-"|"*"|"/" printf("An operator: %s\n", yytext); "{"[^}\n]*"}" /* eat up one-line comments */ [ \t\n]+ /* eat up white space */. printf("Unrecognized character: %s\n", yytext); % int main(int argc, char *argv[]){ ++argv, --argc; /* skip over program name */ if (argc > 0) yyin = fopen(argv[0], "r"); else yyin = stdin; yylex(); }