The Role of Lexical Analyzer

Slides:



Advertisements
Similar presentations
COMP-421 Compiler Design Presented by Dr Ioanna Dionysiou.
Advertisements

From Cooper & Torczon1 The Front End The purpose of the front end is to deal with the input language Perform a membership test: code  source language?
CSc 453 Lexical Analysis (Scanning)
Chapter 3 Lexical Analysis Yu-Chen Kuo.
 Lex helps to specify lexical analyzers by specifying regular expression  i/p notation for lex tool is lex language and the tool itself is refered to.
Winter 2007SEG2101 Chapter 81 Chapter 8 Lexical Analysis.
Chapter 3 Program translation1 Chapt. 3 Language Translation Syntax and Semantics Translation phases Formal translation models.
Yu-Chen Kuo1 Chapter 2 A Simple One-Pass Compiler.
Lexical Analysis Recognize tokens and ignore white spaces, comments
BİL744 Derleyici Gerçekleştirimi (Compiler Design)1.
Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source program) – divides it into tokens.
Chapter 3 Lexical Analysis
2.2 A Simple Syntax-Directed Translator Syntax-Directed Translation 2.4 Parsing 2.5 A Translator for Simple Expressions 2.6 Lexical Analysis.
Topic #3: Lexical Analysis
1 Flex. 2 Flex A Lexical Analyzer Generator  generates a scanner procedure directly, with regular expressions and user-written procedures Steps to using.
Lexical Analysis Natawut Nupairoj, Ph.D.
1 Outline Informal sketch of lexical analysis –Identifies tokens in input string Issues in lexical analysis –Lookahead –Ambiguities Specifying lexers –Regular.
Lexical Analysis - An Introduction Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at.
Lexical Analysis Hira Waseem Lecture
Topic #3: Lexical Analysis EE 456 – Compiling Techniques Prof. Carl Sable Fall 2003.
Lexical Analyzer (Checker)
COMP313A Programming Languages Lexical Analysis. Lecture Outline Lexical Analysis The language of Lexical Analysis Regular Expressions.
COP 4620 / 5625 Programming Language Translation / Compiler Writing Fall 2003 Lecture 3, 09/11/2003 Prof. Roy Levow.
CS30003: Compilers Lexical Analysis Lecture Date: 05/08/13 Submission By: DHANJIT DAS, 11CS10012.
LEX (04CS1008) A tool widely used to specify lexical analyzers for a variety of languages We refer to the tool as Lex compiler, and to its input specification.
Compiler Construction 2 주 강의 Lexical Analysis. “get next token” is a command sent from the parser to the lexical analyzer. On receipt of the command,
Lexical Analyzer in Perspective
Lexical Analysis: Finite Automata CS 471 September 5, 2007.
1.  It is the first phase of compiler.  In computer science, lexical analysis is the process of converting a sequence of characters into a sequence.
1 Compiler Design (40-414)  Main Text Book: Compilers: Principles, Techniques & Tools, 2 nd ed., Aho, Lam, Sethi, and Ullman, 2007  Evaluation:  Midterm.
IN LINE FUNCTION AND MACRO Macro is processed at precompilation time. An Inline function is processed at compilation time. Example : let us consider this.
CSc 453 Lexical Analysis (Scanning)
Joey Paquet, 2000, Lecture 2 Lexical Analysis.
Lexical Analysis S. M. Farhad. Input Buffering Speedup the reading the source program Look one or more characters beyond the next lexeme There are many.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
Scanner Introduction to Compilers 1 Scanner.
Overview of Previous Lesson(s) Over View  Syntax-directed translation is done by attaching rules or program fragments to productions in a grammar. 
Compiler Construction By: Muhammad Nadeem Edited By: M. Bilal Qureshi.
Lexical Analysis (Scanning) Lexical Analysis (Scanning)
Lexical Analysis.
1st Phase Lexical Analysis
Chapter 2 Scanning. Dr.Manal AbdulazizCS463 Ch22 The Scanning Process Lexical analysis or scanning has the task of reading the source program as a file.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
Chapter2 : Lexical Analysis
System Software Theory (5KS03).
Lexical Analyzer in Perspective
Compiler Design (40-414) Main Text Book:
Lexical and Syntax Analysis
A Simple Syntax-Directed Translator
Constructing Precedence Table
Scanner Scanner Introduction to Compilers.
Chapter 3 Lexical Analysis.
Lecture 5 Transition Diagrams
Lecture 2 Lexical Analysis Joey Paquet, 2000, 2002, 2012.
Lexical Analysis (Sections )
CSc 453 Lexical Analysis (Scanning)
Compilers Welcome to a journey to CS419 Lecture5: Lexical Analysis:
CSc 453 Lexical Analysis (Scanning)
Compiler Construction
Lexical analysis Jakub Yaghob
Chapter 3: Lexical Analysis
CS 3304 Comparative Languages
Scanner Scanner Introduction to Compilers.
CS 3304 Comparative Languages
Scanner Scanner Introduction to Compilers.
Scanner Scanner Introduction to Compilers.
Scanner Scanner Introduction to Compilers.
Scanner Scanner Introduction to Compilers.
CSc 453 Lexical Analysis (Scanning)
Compiler Design 3. Lexical Analyzer, Flex
Presentation transcript:

The Role of Lexical Analyzer Compiler Design Lexical Analysis

Outline Lexical Analysis vs. Parsing Tokens, Patterns and Lexemes Attributes for Tokens Lexical Errors

Lexical Analysis Manual approach – by hand To identify the occurrence of each lexeme To return the information about the identified token Automatic approach - lexical-analyzer generator Compiles lexeme patterns into code that functions as a lexical analyzer e.g. Lex, Flex, … Steps Regular expressions - notation for lexeme patterns Nondeterministic automata Deterministic automata Driver - code which simulates automata

The Role of the Lexical Analyzer Read input characters To group them into lexemes Produce as output a sequence of tokens input for the syntactical analyzer Interact with the symbol table Insert identifiers

The Role of the Lexical Analyzer to strip out comments whitespaces: blank, newline, tab, … other separators to correlate error messages generated by the compiler with the source program to keep track of the number of newlines seen to associate a line number with each error message

Lexical Analyzer Processes Scanning to not require input tokenization deletion of comments compaction of consecutive white spaces into one Lexical analysis to produce sequence of tokens as output

Lexical Analysis vs. Parsing Simplicity of design Separation of lexical from syntactical analysis -> simplify at least one of the tasks e.g. parser dealing with white spaces -> complex Cleaner overall language design Improved compiler efficiency Liberty to apply specialized techniques that serves only lexical tasks, not the whole parsing Speedup reading input characters using specialized buffering techniques Enhanced compiler portability Input device peculiarities are restricted to the lexical analyzer

Tokens, Patterns, Lexemes Token - pair of: token name – abstract symbol representing a kind of lexical unit keyword, identifier, … optional attribute value Pattern description of the form that the lexeme of a token may take e.g. for a keyword the pattern is the character sequence forming that keyword for identifiers the pattern is a complex structure that is matched by many strings Lexeme a sequence of characters in the source program matching a pattern for a token

Examples of Tokens Token Informal Description Sample Lexemes if characters i, f else characters e, l, s, e comparison < or > or <= or >= or == or != <=, != id Letter followed by letters and digits pi, score, D2 number Any numeric constant 3.14159, 0, 02e23 literal Anything but “, surrounded by “ “core dumped”

Examples of Tokens One token for each keyword Tokens for operators Keyword pattern = keyword itself Tokens for operators Individually or in classes One token for all identifiers One or more tokens for constants Numbers, literal strings Tokens for each punctuation symbol ( ) , ;

Attributes for Tokens more than one lexeme can match a pattern token number matches 0, 1, 100, 77,… lexical analyzer must return Not only the token name Also an attribute value describing the lexeme represented by the token token id may have associated information like lexeme type location – in order to issue error messages token id attribute pointer to the symbol table for that identifier

Tricky Problems in Token Recognition Fortran 90 example assignment DO 5 I = 1.25 DO5I = 1.25 do loop DO 5 I = 1,25

Example of Attribute Values E = M * C ** 2 <id, pointer to symbol table entry for E> <assign_op> <id, pointer to symbol-table entry for M> <mult_op> <id, pointer to symbol-table entry for C> <exp_op> <number, integer value 2>

Lexical Errors can not be detected by the lexical analyzer alone fi (a == f(x) ) … lexical analyzer is unable to proceed none of the patterns matches any prefix of the remaining input “panic mode” recovery strategy delete one/successive characters from the remaining input insert a missing character into the remaining input replace a character transpose two adjacent characters

Outline Input Buffering Buffer Pairs Sentinels

Input Buffering How to speed the reading of source program ? to look one additional character ahead e.g. to see the end of an identifier you must see a character which is not a letter or a digit not a part of the lexeme for id in C - ,= , < ->, ==, <= two buffer scheme that handles large lookaheads safely sentinels – improvement which saves time checking buffer ends

Buffer Pairs Buffer size N N = size of a disk block (4096) read N characters into a buffer one system call not one call per character read < N characters we encounter eof two pointers to the input are maintained lexemeBegin – marks the beginning of the current lexeme forward – scans ahead until a pattern match is found

Sentinels forward pointer sentinel to test if it is at the end of the buffer to determine what character is read (multiway branch) sentinel added at each buffer end can not be part of the source program character eof is a natural choice retains the role of entire input end when appears other than at the end of a buffer it means that the input is at an end

Lookahead Code with Sentinels switch(*forward++) { case eof: if(forward is at the end of the first buffer) reload second buffer; forward = beginning of the second buffer; } elseif(forward is at the end of the second buffer) reload first buffer; forward = beginning the first buffer; else /* eof within a buffer marks the end of input */ terminate lexical analysis; break; cases for the other characters

Potential Problems running out of buffer space long lookahead N ~ 4 x 1000 long character strings > N concatenation of components (Java + operator) long lookahead languages where keywords are not reserved in PL/I: DECLARE (ARG1, ARG2,…) ambiguous identifier resolved by the parser keyword procedure name

Bibliography Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman – Compilers, Principles, Tehcniques and Tools, Second Edition, 2007