Compiler Construction: Lexical Analysis



The word lexical means textual, verbal or literal. Lexical analysis, implemented in the “SCANNER” module, decomposes the source program, which is read in from a file as a string of characters, into a sequence of lexical units called “SYMBOLS”. The scanner reads this character string from left to right. If the work of the scanner, the screener and the parser is interleaved, the parser calls the scanner-screener combination to obtain the next symbol. The scanner begins its analysis at the character following the end of the last symbol found and searches for the longest string at the beginning of the remaining input that is a symbol of the language. It returns a representation of this symbol to the screener, which determines whether the symbol is relevant for the parser or should be ignored. If it is not relevant, the screener triggers the scanner again. Otherwise, it returns a (possibly altered) representation of the symbol to the parser.
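The interleaving described above can be sketched with Python generators. This is only an illustration under stated assumptions: the symbol classes (`ws`, `comment`, `id`, `intconst`, `op`) and the function names `scan` and `screen` are hypothetical, and Python's `re` module stands in for the scanner automaton (its alternation picks the first matching alternative, which happens to coincide with the longest match for these particular classes).

```python
import re
from typing import Iterator, Tuple

# Hypothetical symbol classes for this sketch (not taken from the slides).
SPEC = [
    ("ws", r"\s+"),
    ("comment", r"\{[^}]*\}"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("intconst", r"[0-9]+"),
    ("op", r"[-+*/=]"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in SPEC))

def scan(text: str) -> Iterator[Tuple[str, str]]:
    """Scanner: yield (class, word) pairs, reading the input from left to right."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)  # start right after the last symbol found
        if m is None:
            raise SyntaxError(f"no symbol starts at position {pos}")
        yield m.lastgroup, m.group()
        pos = m.end()

def screen(symbols: Iterator[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Screener: drop symbols that are irrelevant for the parser."""
    for cls, word in symbols:
        if cls not in ("ws", "comment"):
            yield cls, word

# The "parser" pulls one symbol at a time from the scanner-screener combination.
symbols = screen(scan("x1 = 42 {answer} + y"))
first = next(symbols)
rest = list(symbols)
```

Because both stages are generators, no symbol is produced until the consumer asks for it, which mirrors the parser calling the scanner-screener combination on demand.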

In general, the scanner should be able to recognize infinitely many, or at least very many, different symbols. It therefore divides this set into a finite number of classes. Symbols with a related structure (for example, the same syntactic role) fall into the same “SYMBOL CLASS”. Thus, we now distinguish between:

Symbols, or words over an alphabet of characters Σ, for example xyz12, 125, “abc”.

Symbol classes, or sets of symbols, such as the set of identifiers, the set of integer constants and the set of character strings, identified by the names id, intconst and string.

Representations of symbols. For example, the scanner might pass the word xyz12 to the screener in the representation (id, xyz12), which the latter enters in its symbol table and passes to the parser as (1, 17), where the code for the symbol class id is 1 and xyz12 is the 17th identifier found.

Theoretical Foundations: Words and languages. We briefly review a number of important basic terms relating to formal languages, where Σ denotes an arbitrary alphabet, that is, a finite non-empty set of characters.
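Returning to the symbol representations, the following is a minimal sketch of how a screener might turn (id, xyz12) into a pair of class code and symbol-table index. The class `Screener`, its method `encode`, and the codes for intconst and string are assumptions made for illustration; the text only fixes the code 1 for id.

```python
from typing import Dict, List, Tuple, Union

# Only id = 1 comes from the text; the other codes are hypothetical.
CLASS_CODES = {"id": 1, "intconst": 2, "string": 3}

class Screener:
    def __init__(self) -> None:
        self.identifiers: List[str] = []   # symbol table, in order of first occurrence
        self.index: Dict[str, int] = {}    # identifier -> 1-based position in the table

    def encode(self, symbol: Tuple[str, str]) -> Tuple[int, Union[int, str]]:
        """Map (class name, word) to (class code, relative code)."""
        cls, word = symbol
        if cls == "id":
            if word not in self.index:
                self.identifiers.append(word)
                self.index[word] = len(self.identifiers)
            return CLASS_CODES["id"], self.index[word]
        # Other classes keep their text as the relative code in this sketch.
        return CLASS_CODES[cls], word

screener = Screener()
a = screener.encode(("id", "abc"))
b = screener.encode(("id", "xyz12"))
c = screener.encode(("id", "abc"))       # seen before: same index as the first time
d = screener.encode(("intconst", "125"))
```

In the example from the text, xyz12 would receive index 17 because sixteen other identifiers had been entered before it.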

Regular Languages, Regular Expressions and Finite Automata: The lexical units recognized by a scanner form a non-empty regular language. Regular languages can be described by regular expressions, which therefore form the basis of all languages for specifying the lexical analysis. Regular languages can be recognized by finite automata. These terms will now be introduced. Note that we always assume an underlying alphabet Σ.

Regular Languages: The regular languages over Σ are defined inductively:

∅ and {ε} are regular languages over Σ.

For all a ∈ Σ, {a} is a regular language over Σ.

If R1 and R2 are regular languages over Σ, then so are R1 ∪ R2, R1R2 and R1*.

Regular Expressions: Regular expressions over Σ and the regular languages they describe are also defined inductively:

∅ is a regular expression over Σ and describes the regular language ∅.

ε is a regular expression over Σ and describes the regular language {ε}.

a (for a ∈ Σ) is a regular expression over Σ and describes the regular language {a}.

If r1 and r2 are regular expressions over Σ that describe the regular languages R1 and R2, then (r1 | r2) is a regular expression over Σ and describes the regular language R1 ∪ R2, (r1 r2) is a regular expression over Σ and describes the regular language R1R2, and (r1)* is a regular expression over Σ and describes the regular language R1*.

There are no other regular expressions.
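The inductive definition can be followed case by case in code. The sketch below encodes regular expressions as nested tuples (an encoding chosen here, not part of the slides) and computes the words of length at most n in the described language; the bound n is needed only because R1* is infinite in general.

```python
from typing import Set

# Encoding assumed for this sketch:
#   ("empty",)  ("eps",)  ("chr", a)  ("alt", r1, r2)  ("cat", r1, r2)  ("star", r1)

def language(r: tuple, n: int) -> Set[str]:
    """Return all words of length <= n in the regular language described by r."""
    tag = r[0]
    if tag == "empty":                      # describes the empty language
        return set()
    if tag == "eps":                        # describes {ε}
        return {""}
    if tag == "chr":                        # describes {a}
        return {r[1]} if n >= 1 else set()
    if tag == "alt":                        # R1 ∪ R2
        return language(r[1], n) | language(r[2], n)
    if tag == "cat":                        # R1R2
        left, right = language(r[1], n), language(r[2], n)
        return {u + v for u in left for v in right if len(u + v) <= n}
    if tag == "star":                       # R1*
        base = language(r[1], n) - {""}     # ε adds nothing new when repeated
        result, frontier = {""}, {""}
        while frontier:                     # words grow each round, so this terminates
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= n} - result
            result |= frontier
        return result
    raise ValueError(f"unknown regular expression: {r!r}")

# a (a | b)*  restricted to words of length at most 2
a_then_ab_star = ("cat", ("chr", "a"), ("star", ("alt", ("chr", "a"), ("chr", "b"))))
words = language(a_then_ab_star, 2)
```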

Non-Deterministic Finite Automata (NFA): An NFA is a tuple M = (Σ, Q, Δ, q0, F), where:

Σ is an alphabet, the input alphabet,

Q is a finite set of states,

q0 ∈ Q is the initial state,

F ⊆ Q is the set of final states,

Δ ⊆ Q × (Σ ∪ {ε}) × Q is the transition relation.

We now explain how an NFA used as a scanner works. An NFA checks whether or not input words are in a given language. It accepts a word if it lands in a final state after reading the whole word. A finite automaton used as a scanner decomposes an input word piece by piece into sub-words of the given language. Thus, each sub-word takes it from its initial state into a final state. It may have problems determining the end of the sub-word. The finite automaton is started in its initial state, with its read head at the beginning of the input tape. When a finite automaton is used as a scanner, it begins with the first character that has not yet been “consumed”. Then it takes a sequence of steps. Each step depends on the current state and possibly on the next input character.
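As a rough illustration of the tuple definition and of acceptance, the sketch below simulates an NFA by tracking the set of states it could be in. The class name `NFA`, the use of the empty string for ε, and the example automaton for (a | ε) b are all assumptions of this sketch.

```python
from typing import Hashable, Iterable, Set, Tuple

EPS = ""  # stands for ε in the transition relation

class NFA:
    def __init__(self, delta: Set[Tuple[Hashable, str, Hashable]],
                 q0: Hashable, final: Set[Hashable]) -> None:
        self.delta = delta    # Δ as a set of triples (state, character or ε, state)
        self.q0 = q0
        self.final = final

    def closure(self, states: Iterable[Hashable]) -> Set[Hashable]:
        """All states reachable from `states` using only ε-transitions."""
        seen = set(states)
        stack = list(seen)
        while stack:
            q = stack.pop()
            for p, a, r in self.delta:
                if p == q and a == EPS and r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen

    def step(self, states: Set[Hashable], ch: str) -> Set[Hashable]:
        """Read one character from every state the automaton could be in."""
        return self.closure({r for p, a, r in self.delta if p in states and a == ch})

    def accepts(self, word: str) -> bool:
        """Accept if some run ends in a final state after reading the whole word."""
        current = self.closure({self.q0})
        for ch in word:
            current = self.step(current, ch)
        return bool(current & self.final)

# Hypothetical NFA for (a | ε) b with states 0, 1, 2.
nfa = NFA({(0, "a", 1), (0, EPS, 1), (1, "b", 2)}, q0=0, final={2})
```

Tracking a set of states avoids guessing which transition to take; it is the same idea that underlies the subset construction for turning an NFA into a DFA.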

This involves entering a new state and, if an input character has been read, moving the read head to the next character. The automaton accepts the input word when the input is exhausted and the current state is a final state. The scanner reports that it has found a symbol when it is in a final state and has no transition under the next input character. If it has no transition from the current state, and the current state is not a final state, it must backtrack to the last final state it passed through. If there is no such state for the current symbol, then an error has occurred. The future behavior of an NFA is determined by the current state and the remainder of the input. Together these two form the current configuration of the automaton.

A language for specifying the lexical analysis: Regular expressions provide the main description formalism for the lexical analysis. A specification of the lexical analysis should enable us to combine sets of characters into classes if they can be exchanged in symbols without the resulting symbols being assigned to different symbol classes. For example:

le = a-z, A-Z
di = 0-9
or = |
open = / | {
close = / | }
star = *

We can now give the usual definition of the symbol class of identifiers:

id = le ( le | di )*

In the character class definitions, we manage with only three meta-characters, namely ‘=’, ‘-’ and the space character ‘ ’.

The Screener: According to the distribution of tasks, the screener knows the set of reserved names or keywords. This presupposes that the scanner has one or more symbol classes containing these symbols. This is the case when, as in Pascal, C and Ada, the keywords have the same structure as identifiers. With the tasks divided as described above, the scanner reports an identifier for every word of that structure, whether reserved or not. The screener then determines whether it is a reserved symbol. This distribution of tasks keeps the set of states and the number of transitions of the scanner automaton small. However, the screener must have an efficient means of recognizing keywords.
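The longest-match rule with backtracking to the last final state, followed by the screener's keyword check, might look roughly as follows. This is a sketch under several assumptions: the automaton is written directly as a deterministic table, the classes `realconst` and `dot` are added only so that backtracking can be observed, the keyword set is hypothetical, and whitespace is skipped inline for brevity rather than being passed to the screener.

```python
from typing import List, Optional, Tuple

KEYWORDS = {"if", "then", "else", "while"}  # hypothetical reserved names

def char_class(ch: str) -> str:
    """Map a character to its class (le, di) or to itself."""
    if ch.isascii() and ch.isalpha():
        return "le"
    if ch.isascii() and ch.isdigit():
        return "di"
    return ch

# Transition table: (state, character class) -> state.
DELTA = {
    ("start", "le"): "id", ("id", "le"): "id", ("id", "di"): "id",
    ("start", "di"): "int", ("int", "di"): "int",
    ("int", "."): "int_dot", ("int_dot", "di"): "real", ("real", "di"): "real",
    ("start", "."): "dot",
}
FINAL = {"id": "id", "int": "intconst", "real": "realconst", "dot": "dot"}

def next_symbol(text: str, pos: int) -> Tuple[str, str, int]:
    """Return (class, word, end) for the longest symbol starting at pos."""
    state = "start"
    last_final: Optional[Tuple[str, int]] = None
    i = pos
    while i < len(text) and (state, char_class(text[i])) in DELTA:
        state = DELTA[(state, char_class(text[i]))]
        i += 1
        if state in FINAL:
            last_final = (FINAL[state], i)  # remember where a symbol could end
    if last_final is None:
        raise SyntaxError(f"no symbol starts at position {pos}")
    cls, end = last_final                   # backtrack to the last final state
    return cls, text[pos:end], end

def screen(cls: str, word: str) -> Tuple[str, str]:
    """Screener: reclassify identifiers that are reserved names."""
    if cls == "id" and word in KEYWORDS:    # set lookup is O(1) on average
        return "keyword", word
    return cls, word

def symbols(text: str) -> List[Tuple[str, str]]:
    out, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        cls, word, pos = next_symbol(text, pos)
        out.append(screen(cls, word))
    return out

result = symbols("while x1 12.5 12.y")
```

On the input `12.y` the automaton reads `12.` and then gets stuck in a non-final state, so it falls back to `12` as an integer constant and resumes scanning at the dot. Keeping keywords out of the transition table is what keeps the automaton small, at the cost of one hash lookup per identifier.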

Symbol Classes: Symbol classes are sets of symbols that are equivalent for the “consumer” in the compiler, that is, the parser. Two symbols are equivalent if in every state the parser makes the same transition (takes the same decision) under each of the symbols. Typical symbol classes include the various classes of constants, the identifiers (without the reserved symbols), comments, arithmetic operators of the same precedence and relational operators. The designer of a scanner-screener combination defines such classes. A well-defined class code is assigned to each class, either explicitly by the designer or implicitly by the generator. This class code is passed to the parser when a symbol of the class is found by the generated scanner. For example:

Character classes:

le = a-z
di = 0-9

Symbol classes:

AddOp = + | -
MulOp = * | / | %
CompOp = < | > | = | <= | >= | != (enumerated classes)
Id = le (le | di)*
IntConst = di di* (defined by a regular expression with iteration, an “infinite class”)

For semantic analysis and for code generation, it is absolutely necessary to know which element of a symbol class has been found. Thus, in addition to the class code, the scanner/screener also passes on a relative code for the symbol found, which is generally not used by the parser but noted for later use. If there exist different but syntactically and semantically equivalent symbols, these may be combined within a symbol class definition. For example:

CompOp = (<,lt) | (>,gt) | (=,eq) | (!=,neq) | (>=,ge) | (<=,le)