Chapter 2: Lexical Analysis

Lexical Analysis
[Figure: phases of a compiler. Source Program -> Lexical Analyser -> Syntax Analyser -> Semantic Analyser -> Intermediate Code Generator -> Code Optimizer -> Code Generator -> Target Program. The Symbol Table Manager and the Error Handler interact with the phases.]

Languages
* An alphabet (Σ) is a finite set of symbols. Example: {a, b, c}
* A symbol is an element of an alphabet. Example: a
* A word is a finite sequence of symbols drawn from the alphabet Σ. Example: abcaa
* A language (over alphabet Σ) is a set of words. Example: {abcaa, abc, b, caa}

Languages
* Σ* denotes the set of all words over the alphabet Σ.
* |s| denotes the length of string s. Example: bmz is a string of length 3.
* ε denotes the word of length 0, the empty word.
* ∅ denotes the empty set, { }. It is not the same as {ε}, the language whose only word is the empty word.
Note 1: In language theory the terms sentence and word are often used as synonyms for the term string.
Note 2: A language (over alphabet Σ) is a set of strings (over alphabet Σ). For example, with Σ = {a}, one possible language is L = {ε, a, aa, aaa}.

Terms for parts of a string
* prefix of s: a string obtained by removing zero or more trailing symbols of string s; ban is a prefix of banana.
* suffix of s: a string formed by deleting zero or more of the leading symbols of s; nana is a suffix of banana.
* substring of s: a string obtained by deleting a prefix and a suffix from s; nan is a substring of banana. Every prefix and every suffix of s is a substring of s, but not every substring of s is a prefix or a suffix of s. For every string s, both s and ε are prefixes, suffixes, and substrings of s.
* proper prefix, suffix, or substring of s: any nonempty string x that is, respectively, a prefix, suffix, or substring of s such that s ≠ x.
* subsequence of s: any string formed by deleting zero or more not necessarily contiguous symbols from s; baaa is a subsequence of banana.

Terms for parts of a string (examples)
Let us take the string banana:
* prefix: ε, b, ba, ban, ..., banana
* suffix: ε, a, na, ana, ..., banana
* substring: ε, b, a, n, ba, an, na, ..., banana
* subsequence: ε, b, a, n, ba, bn, an, aa, na, nn, ..., banana
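The definitions above can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the function names (prefixes, suffixes, substrings, is_subsequence) are chosen here for illustration, and the empty Python string "" plays the role of ε.

```python
def prefixes(s):
    """All prefixes of s, from the empty string up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]


def suffixes(s):
    """All suffixes of s, from the empty string up to s itself."""
    return [s[i:] for i in range(len(s), -1, -1)]


def substrings(s):
    """All substrings of s: delete a prefix and a suffix."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def is_subsequence(x, s):
    """True if x can be formed by deleting symbols of s (not necessarily contiguous)."""
    remaining = iter(s)
    # 'ch in remaining' advances the iterator, so the order of symbols is preserved.
    return all(ch in remaining for ch in x)


print(prefixes("banana"))
print(suffixes("banana"))
print(is_subsequence("baaa", "banana"))
```

Every prefix and every suffix returned here is also a member of substrings(s), which matches the remark on the definition slide.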

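Operations on Strings
Concatenation: concatenation of words is denoted by juxtaposition.
* If x and y are strings, then the concatenation of x and y is xy. If x = dog and y = house, then xy = doghouse.
* Concatenation is associative: x(yz) = (xy)z
* ε is the identity for concatenation: xε = εx = x
* Concatenation is not commutative: in general xy ≠ yx
Exponentiation:
* s^0 = ε, s^1 = s, s^2 = ss, and in general s^i = s^(i-1) s for i > 0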
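As an illustration only, Python strings obey the same laws if + is read as concatenation and "" as ε. The helper name power is an assumption made for this sketch.

```python
def power(s, n):
    """s^n: the string s concatenated with itself n times; s^0 is the empty string."""
    return s * n


x, y, z = "dog", "house", "s"

print(x + y)                         # concatenation by juxtaposition
print(x + (y + z) == (x + y) + z)    # associativity
print(x + "" == "" + x == x)         # the empty string is the identity
print(x + y == y + x)                # order matters
print(power("ab", 2))
```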

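Operations on Languages
* Union of L and M, written L ∪ M: L ∪ M = { s | s ∈ L or s ∈ M }
* Concatenation of L and M, written LM: LM = { st | s ∈ L and t ∈ M }
* Kleene closure of L, written L*: L* = L^0 ∪ L^1 ∪ L^2 ∪ ... (zero or more concatenations of L)
* Positive closure of L, written L+: L+ = L^1 ∪ L^2 ∪ L^3 ∪ ... (one or more concatenations of L)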

Example
L is the set {A, B, ..., Z, a, b, ..., z} and D is the set {0, 1, ..., 9}. Since a symbol can be regarded as a string of length one, the sets L and D are each finite languages. The following are some examples of new languages created from L and D:
1. L ∪ D is the set of letters and digits.
2. LD is the set of strings consisting of a letter followed by a digit.
3. L^4 is the set of all four-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
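The language operations can be tried out on small finite sets. The sketch below is an illustration under stated assumptions: languages are Python sets of strings, the function names are invented here, and because L* and L+ are infinite whenever L contains a nonempty string, the closures are truncated at a caller-supplied max_power.

```python
def union(L, M):
    """L ∪ M: strings that are in L or in M."""
    return L | M


def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}


def power(L, n):
    """L^n: n-fold concatenation of L with itself; L^0 = {""}."""
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result


def kleene_closure(L, max_power):
    """Finite approximation of L*: L^0 ∪ L^1 ∪ ... ∪ L^max_power."""
    result = set()
    for i in range(max_power + 1):
        result |= power(L, i)
    return result


def positive_closure(L, max_power):
    """Finite approximation of L+: L^1 ∪ ... ∪ L^max_power."""
    result = set()
    for i in range(1, max_power + 1):
        result |= power(L, i)
    return result


letters = {"a", "b"}   # a tiny stand-in for the letter set L
digits = {"0", "1"}    # a tiny stand-in for the digit set D

print(sorted(concat(letters, digits)))
print(sorted(kleene_closure(letters, 2)))
```

Small alphabets are used on purpose: with all 52 letters, L^4 alone already has more than seven million members.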

Operator Associativity
Grammar rules may influence operator associativity. How do we specify operator associativity for:
1. The multiplication operator (left associative) in FORTRAN: 1 * 2 * 3 means (1 * 2) * 3, and 3 * 1 * 5 means (3 * 1) * 5
2. Exponentiation (right associative) in FORTRAN: X ** Y ** Z means X ** (Y ** Z)

Example 1
9 - 5 + 2 can be read in two ways:
* right-associativity: 9 - (5 + 2)
* left-associativity: (9 - 5) + 2
The choice rests with the language designer, who must take into account intuitions and convenience. By convention, most arithmetic operations use left-associativity. (We shall use this assumption in this course.)

Example 2
Left-associativity, as in 9 - 5 - 2:
list → list + digit | list - digit | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Right-associativity, as in a = b = c:
right → letter = right | letter
letter → a | b | c | ... | z

Specifying Operator Associativity
* For a left-associative operator, rewrite the grammar rule so that the LHS appears at the beginning of its RHS; such a rule is also known as left recursive.
* For a right-associative operator, rewrite the grammar rule so that the LHS appears at the end of its RHS; such a rule is also known as right recursive.
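To make the effect of the two rule shapes concrete, here is a small Python sketch. It is not from the slides: trees are represented as nested tuples (operator, left, right), the function names are hypothetical, and the left-recursive rule is realised with a loop because a naive recursive function for a left-recursive rule would never terminate.

```python
def parse_left(tokens):
    """list -> list + digit | list - digit | digit, built as a left-nested tree."""
    tree = tokens[0]
    for i in range(1, len(tokens), 2):
        operator, operand = tokens[i], tokens[i + 1]
        tree = (operator, tree, operand)   # the tree so far becomes the LEFT child
    return tree


def parse_right(tokens):
    """right -> letter = right | letter, built as a right-nested tree."""
    if len(tokens) == 1:
        return tokens[0]
    # the rest of the input becomes the RIGHT child
    return (tokens[1], tokens[0], parse_right(tokens[2:]))


print(parse_left(list("9-5-2")))    # groups as (9 - 5) - 2
print(parse_right(list("a=b=c")))   # groups as a = (b = c)
```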

Operator Precedence
Operator precedence defines the order in which an expression is evaluated when several different operators are present. Grammar rules may influence operator precedence:
assign → ident = expr
ident → A | B | C
expr → ident + expr | ident * expr | ( expr ) | ident
Exercise: draw the parse tree for A = B * A + C.
Operators generated lower in the parse tree are evaluated first, and therefore have higher precedence than operators generated higher up in the parse tree.

Precedence Levels
From higher to lower precedence:
( )
^
* /
+ -
Grammar:
<exp> ::= <exp> + <exp> | <exp> * <exp> | <const>
<const> ::= 0..9
[Figure: parse trees for 9+5*2 built from this grammar.]

Specifying Operator Precedence
Grammar rules can be made to exhibit operator precedence by introducing additional nonterminals and new rules.
assign → ident = expr
ident → A | B | C
expr → expr + term | term
term → term * factor | factor
factor → ( expr ) | ident
Exercise: draw the parse tree for A = B * A + C.
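One way to see that this grammar gives * higher precedence than + is to run a small recursive-descent parser over it. The Python sketch below is an assumption-laden illustration, not the course's implementation: the class and method names are invented, the left-recursive rules for expr and term are replaced by loops, and the result is a simplified nested-tuple tree rather than the full parse tree with every nonterminal.

```python
def tokenize(text):
    """Split the input into single-character tokens, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, found {tok!r}")
        self.pos += 1
        return tok

    def assign(self):                      # assign -> ident = expr
        target = self.ident()
        self.take("=")
        value = self.expr()
        if self.peek() is not None:
            raise SyntaxError(f"unexpected {self.peek()!r}")
        return ("=", target, value)

    def ident(self):                       # ident -> A | B | C
        tok = self.take()
        if tok not in ("A", "B", "C"):
            raise SyntaxError(f"expected identifier, found {tok!r}")
        return tok

    def expr(self):                        # expr -> expr + term | term
        tree = self.term()
        while self.peek() == "+":
            self.take("+")
            tree = ("+", tree, self.term())
        return tree

    def term(self):                        # term -> term * factor | factor
        tree = self.factor()
        while self.peek() == "*":
            self.take("*")
            tree = ("*", tree, self.factor())
        return tree

    def factor(self):                      # factor -> ( expr ) | ident
        if self.peek() == "(":
            self.take("(")
            tree = self.expr()
            self.take(")")
            return tree
        return self.ident()


print(Parser(tokenize("A = B * A + C")).assign())
```

In the resulting tree the * node sits below the + node, so B * A is evaluated first, which is exactly the property the extra nonterminals term and factor were introduced to guarantee.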

Postfix Notation
The postfix notation for an expression E can be defined inductively as follows:
1. If E is a variable or constant, then the postfix notation for E is E itself.
2. If E is an expression of the form E1 op E2, where op is any binary operator, then the postfix notation for E is E1' E2' op, where E1' and E2' are the postfix notations for E1 and E2, respectively.
3. If E is an expression of the form ( E1 ), then the postfix notation for E1 is also the postfix notation for E.
Examples: the postfix notation for (9-5)+2 is 95-2+, and the postfix notation for 9-(5+2) is 952+-.
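The inductive definition translates almost line for line into a recursive function. In this Python sketch, which is an illustration rather than anything from the slides, an expression is assumed to be either a constant or a nested tuple (op, E1, E2); parentheses are represented by the nesting itself, so rule 3 needs no separate case.

```python
def to_postfix(expr):
    """Return the postfix notation of a nested-tuple expression as a string."""
    if isinstance(expr, tuple):
        op, left, right = expr
        # Rule 2: postfix(E1 op E2) = postfix(E1) postfix(E2) op
        return to_postfix(left) + to_postfix(right) + op
    # Rule 1: a variable or constant is its own postfix notation
    return str(expr)


print(to_postfix(("+", ("-", 9, 5), 2)))   # (9-5)+2
print(to_postfix(("-", 9, ("+", 5, 2))))   # 9-(5+2)
```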

The Role of the Lexical Analyzer
The lexical analyzer is the first phase of a compiler.
The main task: to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.
[Figure: source program -> lexical analyzer -> parser. The parser sends "get next token" to the lexical analyzer, which returns a token; both consult the symbol table.]

The Role of the Lexical Analyzer (continued)
The lexical analyzer is the part of the compiler that reads the source text.
The secondary tasks:
1. Eliminating the following from the source program:
   a. comments, for example: // global variables
   b. whitespace, in particular (1) tab and (2) newline characters, for example:
      a=1 + 4;
      write ( a);
      write (a, a*2);

The Role of the Lexical Analyzer (continued)
2. Correlating error messages from the compiler with the source program. The lexical analyzer may keep track of the number of newline characters seen, so that a line number can be associated with an error message.
3. Making a copy of the source program with the errors marked (in some compilers).
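The secondary tasks listed above (dropping comments and whitespace, counting newlines) can be sketched in a few lines of Python. This is a hedged illustration only: the function name scan, the // comment syntax, and the very coarse notion of a lexeme (a run of letters, digits and underscores, or a single other character) are assumptions made for the example.

```python
def scan(source):
    """Yield (line_number, lexeme) pairs, skipping whitespace and // comments."""
    line = 1
    i = 0
    n = len(source)
    while i < n:
        ch = source[i]
        if ch == "\n":
            line += 1              # count newlines so errors can cite a line number
            i += 1
        elif ch.isspace():
            i += 1                 # blanks and tabs are discarded
        elif source.startswith("//", i):
            while i < n and source[i] != "\n":
                i += 1             # discard the comment, keep the newline for counting
        elif ch.isalnum() or ch == "_":
            start = i
            while i < n and (source[i].isalnum() or source[i] == "_"):
                i += 1
            yield line, source[start:i]
        else:
            yield line, ch
            i += 1


program = "// global variables\na=1 + 4;\nwrite ( a);\n"
for line_number, lexeme in scan(program):
    print(line_number, lexeme)
```

Because each lexeme carries its line number, a later phase can report an error against the right source line without rereading the input.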

The Role of the Lexical Analyzer (continued)
Note: in some compilers the lexical analyzer is divided into a cascade of two phases:
* Scanning: the scanner is responsible for the simple tasks.
* Lexical analysis: the lexical analyzer proper is responsible for the more complex operations.
Example: a FORTRAN compiler uses a scanner to eliminate blanks from the input.
* Do 5 I = 1,25 is a loop header: Do is a reserved word (R.W.), 5 is a num, and I is an id.
* Do 5 I = 1.25 is an assignment: with the blanks removed, Do5I is a single id.
* Blanks inside numeric input are ignored in the same way:
  Enter a Number ==> 13 2
  The number is 132

Advantages of Separating the Analysis Phase
The advantages of separating the analysis phase of compiling into lexical analysis and parsing:
1. Simpler design: separating lexical analysis from syntax analysis often simplifies one or the other of these phases. For example, the parser no longer has to deal with comments and white space.
2. Improved efficiency: a large amount of a compiler's time is spent reading the source and partitioning it into tokens. Specialized buffering techniques for reading input characters and processing tokens can significantly speed up a compiler.
3. Enhanced portability: input alphabet peculiarities and other device-specific anomalies can be restricted to the lexical analyzer. The representation of non-standard symbols can be isolated in the lexical analyzer.

Symbol Table
* It is a data structure used to store information about various source language constructs.
  - During lexical analysis, the character string or lexeme forming an identifier is saved in a symbol table entry.
* Later phases of the compiler might add to this entry information such as the type of the identifier, its usage (variable or label) and its position in storage (address).

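Tokens, Patterns, and Lexemes
* Token: a name for a set of input strings that are treated as one unit, such as id or num.
* Pattern: a rule associated with a token that describes the set of strings the token stands for.
* Lexeme: a string in the source program that is matched by the pattern of a token.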

Attributes of Tokens
Attributes are used to distinguish the different lexemes that belong to one token.
Example: E = M * C ** 2 produces
<id, pointer to symbol-table entry for E>
<assign_op, >
<id, pointer to symbol-table entry for M>
<mult_op, >
<id, pointer to symbol-table entry for C>
<exp_op, >
<num, integer value 2>
Tokens affect syntax analysis; attributes affect semantic analysis.
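The token stream above can be reproduced with a short Python sketch built on the standard re module. Everything specific here is an assumption for illustration: the patterns in TOKEN_SPEC, the function name tokenize, and the choice to model the symbol table as a list whose indices stand in for the "pointer to symbol-table entry".

```python
import re

# (token name, pattern) pairs; exp_op is listed before mult_op so that ** wins over *.
TOKEN_SPEC = [
    ("num", r"\d+"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("exp_op", r"\*\*"),
    ("mult_op", r"\*"),
    ("assign_op", r"="),
    ("ws", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return (tokens, symbol_table); each token is a (name, attribute) pair."""
    symbol_table = []
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind == "ws":
            continue                                  # whitespace produces no token
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)           # save the lexeme in a new entry
            tokens.append((kind, symbol_table.index(lexeme)))
        elif kind == "num":
            tokens.append((kind, int(lexeme)))        # attribute is the integer value
        else:
            tokens.append((kind, None))               # operators need no attribute
    return tokens, symbol_table


tokens, table = tokenize("E = M * C ** 2")
for token in tokens:
    print(token)
print(table)
```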

Describing Tokens
* We use regular expressions to describe programming language tokens.
* A regular expression (RE) is defined inductively, where R and S are regular expressions:
  a     an ordinary character stands for itself
  ε     the empty string
  R|S   either R or S (alternation)
  RS    R followed by S (concatenation)
  R*    concatenation of R zero or more times

Language
A regular expression R describes a set of strings of characters, denoted L(R).
L(R) = the language defined by R
* L(abc) = { abc }
* L(hello|goodbye) = { hello, goodbye }
* L(1(0|1)*) = all binary numbers that start with a 1
Each token can be defined using a regular expression.
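Membership in L(R) can be tested with Python's re module, whose syntax includes the operators listed above (alternation with |, concatenation by juxtaposition, and *). The helper name in_language is an assumption made for this sketch; re.fullmatch is used so that the whole string, not just a part of it, must match.

```python
import re


def in_language(pattern, text):
    """True if the entire text belongs to the language described by pattern."""
    return re.fullmatch(pattern, text) is not None


print(in_language("abc", "abc"))
print(in_language("hello|goodbye", "goodbye"))
print(in_language("1(0|1)*", "1011"))
print(in_language("1(0|1)*", "011"))
```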

Lexical Errors
Few errors are detectable at the lexical level, because the lexical analyzer has a very localized view of a source program. For example, in fi(a==x) … the lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an identifier, because fi is a valid identifier.
Error-recovery actions:
1. Panic mode recovery: delete successive characters from the remaining input until the lexical analyzer can find a well-formed token.
2. Deleting an extraneous character
3. Inserting a missing character
4. Replacing an incorrect character by a correct character
5. Transposing two adjacent characters (o0O)
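Panic mode recovery can be sketched as follows. This Python example is illustrative only: the token pattern, the function name tokenize_with_panic_mode, and the decision to record each deleted run of characters together with its start position are assumptions, not something the slides prescribe.

```python
import re

# A deliberately small token pattern: identifiers, integers, == and a few single characters.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|==|[-+*/=();]")


def tokenize_with_panic_mode(source):
    """Return (tokens, errors); errors lists (position, deleted_text) pairs."""
    tokens, errors = [], []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        match = TOKEN.match(source, pos)
        if match:
            tokens.append(match.group())
            pos = match.end()
            continue
        # Panic mode: delete characters until a well-formed token can start again.
        start = pos
        while (pos < len(source)
               and not source[pos].isspace()
               and not TOKEN.match(source, pos)):
            pos += 1
        errors.append((start, source[start:pos]))
    return tokens, errors


print(tokenize_with_panic_mode("a = 1 @# + 2;"))
print(tokenize_with_panic_mode("fi(a==x)"))
```

The second call reports no errors at all: fi is accepted as an identifier, which is why this kind of mistake has to be caught by a later phase.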

Input Buffering
There are three general approaches to implementing a lexical analyzer:
1. Use a lexical-analyzer generator, such as the Lex compiler, to produce the lexical analyzer from a regular-expression-based specification. In this case, the generator provides routines for reading and buffering the input.
2. Write the lexical analyzer in a conventional systems-programming language, using the I/O facilities of that language to read the input.
3. Write the lexical analyzer in assembly language and explicitly manage the reading of input.

End