Lexical Analysis S. M. Farhad

Input Buffering Input buffering speeds up reading of the source program. The lexical analyzer often has to look one or more characters beyond the next lexeme: there are many situations where we need to look at least one additional character ahead.

Input Buffering For instance, we cannot be sure we’ve seen the end of an identifier until we see a character that is not a letter or digit, and therefore is not part of the lexeme for id. In C, single-character operators like -, =, or < could also be the beginning of a two-character operator like ->, ==, or <=. We therefore introduce a two-buffer scheme that handles large lookaheads safely. We then consider an improvement involving “sentinels” that saves time checking for the ends of buffers.

Buffer Pairs Processing the characters of a large source program takes a significant amount of time. Specialized buffering techniques have been developed to reduce the overhead required to process a single input character. An important scheme involves two buffers that are alternately reloaded, as suggested in the figure.

Buffer Pairs Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes. Using one system read command we can read N characters into a buffer, rather than using one system call per character.

Buffer Pairs If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file. This eof is different from any possible character of the source program.

Buffer Pairs Two pointers to the input are maintained. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine. Pointer forward scans ahead until a pattern match is found.

Buffer Pairs Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found. In the figure, we see that forward has passed the end of the next lexeme, ** (the Fortran exponentiation operator), and must be retracted one position to its left.
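The interplay of the two pointers can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the whole input sits in one string, the lexeme being scanned is an identifier, and the function name scan_identifier is hypothetical rather than taken from the slides.

```python
def scan_identifier(buf, lexeme_begin):
    """Scan an identifier that starts at index lexeme_begin.

    Returns the lexeme and the new value of lexeme_begin.
    """
    forward = lexeme_begin
    # forward scans ahead while the characters still fit the pattern.
    while forward < len(buf) and (buf[forward].isalnum() or buf[forward] == "_"):
        forward += 1
    # forward has now passed the end of the lexeme: retract one position
    # so that it rests on the character at the lexeme's right end.
    forward -= 1
    lexeme = buf[lexeme_begin:forward + 1]
    # lexemeBegin moves to the character immediately after the lexeme.
    return lexeme, forward + 1


lexeme, lexeme_begin = scan_identifier("count1 = 0", 0)
print(lexeme, lexeme_begin)  # count1 6
```

The retraction step is the one the slide describes: the scanner only learns that the lexeme has ended by reading one character too many.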

Buffer Pairs Advancing forward requires that we first test whether we have reached the end of one of the buffers. If so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer.
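A minimal sketch of this scheme, without sentinels, might look as follows. The class name BufferPair, the method next_char, and the use of None to signal the end of input are assumptions made for the illustration. The buffer size is kept tiny so the reloads are easy to follow; the slides suggest a disk block such as 4096.

```python
import io


class BufferPair:
    """Two buffers of size n that are alternately reloaded (hypothetical sketch)."""

    def __init__(self, stream, n=4096):
        self.stream = stream
        self.n = n
        self.buffers = [stream.read(n), ""]  # one read call fills a whole buffer
        self.current = 0                     # buffer that forward is in
        self.forward = 0                     # offset of forward in that buffer

    def next_char(self):
        buf = self.buffers[self.current]
        # Test 1: have we reached the end of the current buffer?
        if self.forward == len(buf):
            if len(buf) < self.n:
                return None                  # short read: the input is exhausted
            other = 1 - self.current
            self.buffers[other] = self.stream.read(self.n)  # reload the other buffer
            self.current, self.forward = other, 0
            buf = self.buffers[other]
            if not buf:
                return None
        # Test 2 (done by the caller): decide what the character is.
        ch = buf[self.forward]
        self.forward += 1
        return ch


pair = BufferPair(io.StringIO("abcdefghij"), n=4)
chars = []
while (ch := pair.next_char()) is not None:
    chars.append(ch)
print("".join(chars))  # abcdefghij
```

Every call pays for the end-of-buffer comparison, even though it succeeds only once per n characters.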

Sentinels If we use the previous scheme as described, we must check, each time we advance forward, that we have not moved off one of the buffers. If we have, then we must also reload the other buffer. Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine what character is read (the latter may be a multiway branch).

Sentinels The two tests can be combined by using sentinels. We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.

Sentinels The figure shows the same arrangement as the previous one, but with the sentinels added. Note that eof retains its use as a marker for the end of the entire input. Any eof that appears other than at the end of a buffer means that the input is at an end.

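Implementing Multiway Branches The algorithm for advancing forward with sentinels. The test is simplified: in the common case, a single comparison of the current character against eof also rules out the end of a buffer.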
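The algorithm for advancing forward with sentinels can be sketched as below. This is one possible rendering, not a copy of the figure. The layout (a single array of 2N + 2 slots with a sentinel after each half), the names SentinelBuffers and next_char, and the choice of the NUL character to stand for eof are all assumptions. The sketch also ignores lexemes that span a buffer boundary.

```python
import io

EOF = "\0"  # assumption: NUL never occurs in the source program


class SentinelBuffers:
    """Buffer pair with a sentinel after each half (hypothetical sketch)."""

    def __init__(self, stream, n=4096):
        self.stream = stream
        self.n = n
        # Slots 0..n-1: first buffer, slot n: sentinel,
        # slots n+1..2n: second buffer, slot 2n+1: sentinel.
        self.buf = [EOF] * (2 * n + 2)
        self._load(0)
        self.forward = 0

    def _load(self, start):
        data = self.stream.read(self.n)
        for i, ch in enumerate(data):
            self.buf[start + i] = ch
        if len(data) < self.n:
            # An eof inside a buffer marks the end of the entire input.
            self.buf[start + len(data)] = EOF

    def next_char(self):
        ch = self.buf[self.forward]
        self.forward += 1
        if ch != EOF:
            return ch                        # common case: a single test
        if self.forward == self.n + 1:       # sentinel at end of first buffer
            self._load(self.n + 1)           # forward already sits at the second buffer
            return self.next_char()
        if self.forward == 2 * self.n + 2:   # sentinel at end of second buffer
            self._load(0)
            self.forward = 0
            return self.next_char()
        self.forward -= 1                    # eof within a buffer: end of input
        return None


src = SentinelBuffers(io.StringIO("E = M * C ** 2"), n=4)
chars = []
while (ch := src.next_char()) is not None:
    chars.append(ch)
print("".join(chars))  # E = M * C ** 2
```

Only when the character read is eof does the code ask where forward is, so the buffer-end checks are paid once per buffer instead of once per character.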

Strings and Languages An alphabet is any finite set of symbols; for example, the set {0,1} is the binary alphabet. A string over an alphabet is a finite sequence of symbols drawn from that alphabet. The empty string, denoted ε, is the string of length zero. A language is any countable set of strings over some fixed alphabet.

Operations on Languages

Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}. L ∪ D is the set of letters and digits: strictly speaking, the language with 62 strings of length one, each of which is either one letter or one digit. LD is the set of 520 strings of length two, each consisting of one letter followed by one digit. L^4 is the set of all 4-letter strings. L* is the set of all strings of letters, including ε, the empty string. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter. D+ is the set of all strings of one or more digits.
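The counts above can be checked with ordinary Python sets. The helper names concat, power, and closure_up_to are hypothetical. Because a closure is an infinite language, the sketch only enumerates it up to a bounded length.

```python
import string

L = set(string.ascii_letters)  # the 52 upper- and lowercase letters
D = set(string.digits)         # the 10 digits


def concat(A, B):
    """Concatenation AB: every string of A followed by every string of B."""
    return {a + b for a in A for b in B}


def power(A, k):
    """A^k: k-fold concatenation of A with itself; A^0 is {empty string}."""
    result = {""}
    for _ in range(k):
        result = concat(result, A)
    return result


def closure_up_to(A, k):
    """Finite slice of the Kleene closure A*: the union of A^0 .. A^k."""
    result = set()
    for i in range(k + 1):
        result |= power(A, i)
    return result


print(len(L | D))                          # 62
print(len(concat(L, D)))                   # 520
print(len(power({"a", "b"}, 4)))           # 16
print(len(closure_up_to({"a", "b"}, 2)))   # 7
```

The last two lines use a two-letter alphabet only to keep the sets small; L^4 itself contains 52^4 strings.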

Regular Expression Specification of Tokens A regular expression is a specific pattern that provides a concise and flexible means to "match" (specify and recognize) strings of text.

C Identifiers How can the set of C identifiers be described?
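One common answer is a letter or underscore followed by any number of letters, digits, or underscores. The sketch below writes that pattern with Python's re module. The names IDENTIFIER and is_identifier are hypothetical, and the pattern deliberately ignores reserved words and any extensions a particular compiler may accept.

```python
import re

# letter_ ( letter_ | digit )*
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_identifier(s):
    """True if the whole string s matches the identifier pattern."""
    return IDENTIFIER.fullmatch(s) is not None


print(is_identifier("count1"))  # True
print(is_identifier("_tmp"))    # True
print(is_identifier("2fast"))   # False
```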

Unsigned Numbers How can the unsigned numbers of C be described? Examples: 2380, 6.34E34, 12.3E-12
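A pattern that covers examples of this shape is a run of digits, an optional fraction, and an optional exponent. The sketch below is a simplified textbook-style definition, not the full C grammar for numeric constants (which also allows forms such as a lowercase e, suffixes, and a missing integer part). The names NUMBER and is_unsigned_number are hypothetical.

```python
import re

# digits ( . digits )? ( E ( + | - )? digits )?
NUMBER = re.compile(r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?")


def is_unsigned_number(s):
    """True if the whole string s matches the simplified number pattern."""
    return NUMBER.fullmatch(s) is not None


for s in ["2380", "6.34E34", "12.3E-12", "1.", "E5"]:
    print(s, is_unsigned_number(s))
# 2380 True
# 6.34E34 True
# 12.3E-12 True
# 1. False
# E5 False
```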

Extensions of Regular Expressions One or more instances: the unary postfix operator + denotes the positive closure, which is related to the Kleene closure by r* = r+ | ε and r+ = rr* = r*r. Zero or one instance: r? is equivalent to r | ε. Character classes: [a-z] is shorthand for a|b|...|z.
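These identities can be spot-checked with Python's re module by comparing two patterns on every string up to a small length. This is a bounded check under an assumed tiny alphabet, not a proof, and the helper name same_language is hypothetical.

```python
import re
from itertools import product


def same_language(p, q, alphabet="ab", max_len=5):
    """Compare patterns p and q on all strings over alphabet up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(re.fullmatch(p, s)) != bool(re.fullmatch(q, s)):
                return False
    return True


# r* = r+ | empty, with r = ab
print(same_language(r"(?:ab)*", r"(?:(?:ab)+|)"))            # True
# r+ = rr* = r*r
print(same_language(r"(?:ab)+", r"ab(?:ab)*"))               # True
print(same_language(r"(?:ab)+", r"(?:ab)*ab"))               # True
# r? = r | empty, with r = a
print(same_language(r"a?", r"(?:a|)"))                       # True
# [a-c] = a|b|c
print(same_language(r"[a-c]", r"a|b|c", alphabet="abcd", max_len=2))  # True
```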

Using the Extension

Recognition of Tokens Our discussion will make use of the following running example.

Recognition of Tokens For relop, we use the comparison operators of languages like Pascal or SQL, where = is “equals” and <> is “not equals,” because it presents an interesting structure of lexemes. The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens as far as the lexical analyzer is concerned.

Example The patterns for these tokens are described using regular definitions.

Example To simplify matters, we make the common assumption that keywords are also reserved words. They are not identifiers, even though their lexemes match the pattern for identifiers.

Example The lexical analyzer strips out whitespace by recognizing the “token” ws defined by: ws → (blank | tab | newline)+ Here, blank, tab, and newline are abstract symbols that we use to express the ASCII characters of the same names. ws is not returned to the parser. Rather, we restart the lexical analysis from the character that follows the whitespace.
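The pieces of the running example (the tokens if, then, else, relop, id, and number, reserved keywords, and ws being skipped) can be put together in a small sketch. The regular definitions themselves are not reproduced in this transcript, so the patterns below are assumptions in the usual textbook style: id is a letter followed by letters or digits, number is digits with an optional fraction and exponent, and relop is one of <, <=, =, <>, >, >=. The names TOKEN_SPEC, KEYWORDS, and tokenize are hypothetical.

```python
import re

TOKEN_SPEC = [
    ("ws",     r"[ \t\n]+"),                              # (blank | tab | newline)+
    ("number", r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?"),
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),
    ("relop",  r"<=|<>|>=|<|=|>"),                        # two-character operators first
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))
KEYWORDS = {"if", "then", "else"}  # reserved: they match id but are not identifiers


def tokenize(text):
    """Return a list of (token name, lexeme) pairs; ws is never returned."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue              # restart from the character after the whitespace
        if kind == "id" and lexeme in KEYWORDS:
            kind = lexeme         # keyword lexemes become their own token
        tokens.append((kind, lexeme))
    return tokens


print(tokenize("if x1 <= 10 then y <> 2.5E-3"))
# [('if', 'if'), ('id', 'x1'), ('relop', '<='), ('number', '10'),
#  ('then', 'then'), ('id', 'y'), ('relop', '<>'), ('number', '2.5E-3')]
```

Checking the keyword set after matching id is one simple way to implement the reserved-word assumption; a real lexer would typically also attach attribute values such as a symbol-table entry to id and number.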

Our goal for the lexical analyzer is summarized in the figure.

Question?