CSc 453 Lexical Analysis (Scanning)


CSc 453 Lexical Analysis (Scanning) Saumya Debray The University of Arizona Tucson

Overview
[Figure: source program → lexical analyzer (scanner) → tokens → syntax analyzer (parser), with a symbol table manager]
Main task: to read input characters and group them into “tokens.”
Secondary tasks:
  - Skip comments and whitespace;
  - Correlate error messages with source program (e.g., line number of error).
CSc 453: Lexical Analysis

Overview (cont’d)

Implementing Lexical Analyzers
Different approaches:
  - Using a scanner generator, e.g., lex or flex. This automatically generates a lexical analyzer from a high-level description of the tokens. (easiest to implement; least efficient)
  - Programming it in a language such as C, using the I/O facilities of the language. (intermediate in ease, efficiency)
  - Writing it in assembly language and explicitly managing the input. (hardest to implement, but most efficient)

Lexical Analysis: Terminology
  - token: a name for a set of input strings with related structure. Example: “identifier,” “integer constant”
  - pattern: a rule describing the set of strings associated with a token. Example: “a letter followed by zero or more letters, digits, or underscores.”
  - lexeme: the actual input string that matches a pattern. Example: count

Examples
Input: count = 123
Tokens:
  - identifier: Rule: “letter followed by …”; Lexeme: count
  - assg_op: Rule: =; Lexeme: =
  - integer_const: Rule: “digit followed by …”; Lexeme: 123

Attributes for Tokens
If more than one lexeme can match the pattern for a token, the scanner must indicate the actual lexeme that matched. This information is given using an attribute associated with the token.
Example: The program statement count = 123 yields the following token-attribute pairs:
  - identifier, pointer to the string “count”
  - assg_op, (no attribute)
  - integer_const, the integer value 123
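One possible C representation of these token-attribute pairs is sketched below. The type and function names (Token, TokenName, make_identifier, make_integer, make_assg_op) are assumptions made for illustration and do not come from the slides; only the three token names and their attributes do.

```c
#include <stdlib.h>

/* Token names taken from the example; the enum itself is an assumption. */
typedef enum { IDENTIFIER, ASSG_OP, INTEGER_CONST } TokenName;

/* A token is a name plus an optional attribute. */
typedef struct {
    TokenName name;
    union {
        const char *lexeme;  /* IDENTIFIER: pointer to the matched string */
        long value;          /* INTEGER_CONST: the integer value */
    } attr;                  /* not meaningful for ASSG_OP */
} Token;

Token make_identifier(const char *lexeme) {
    Token t;
    t.name = IDENTIFIER;
    t.attr.lexeme = lexeme;
    return t;
}

Token make_integer(const char *digits) {
    Token t;
    t.name = INTEGER_CONST;
    t.attr.value = strtol(digits, NULL, 10);  /* convert the lexeme to its value */
    return t;
}

Token make_assg_op(void) {
    Token t;
    t.name = ASSG_OP;
    t.attr.value = 0;        /* only one lexeme matches, so no attribute is needed */
    return t;
}
```

For the statement count = 123, a scanner built this way would hand the parser make_identifier("count"), make_assg_op(), and make_integer("123") in that order.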

Specifying Tokens: regular expressions
Terminology:
  - alphabet: a finite set of symbols
  - string: a finite sequence of alphabet symbols
  - language: a (finite or infinite) set of strings.
Regular Operations on languages:
  - Union: R ∪ S = { x | x ∈ R or x ∈ S }
  - Concatenation: RS = { xy | x ∈ R and y ∈ S }
  - Kleene closure: R* = R concatenated with itself 0 or more times = {ε} ∪ R ∪ RR ∪ RRR ∪ … = strings obtained by concatenating a finite number of strings from the set R.

Regular Expressions
A pattern notation for describing certain kinds of sets over strings. Given an alphabet Σ:
  - ε is a regular exp. (denotes the language {ε})
  - for each a ∈ Σ, a is a regular exp. (denotes the language {a})
  - if r and s are regular exps. denoting L(r) and L(s) respectively, then so are:
      (r) | (s)   (denotes the language L(r) ∪ L(s))
      (r)(s)      (denotes the language L(r)L(s))
      (r)*        (denotes the language L(r)*)

Common Extensions to r.e. Notation
  - One or more repetitions of r: r+
  - A range of characters: [a-zA-Z], [0-9]
  - An optional expression: r?
  - Any single character: .
  - Giving names to regular expressions, e.g.:
      letter = [a-zA-Z_]
      digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
      ident = letter ( letter | digit )*
      integer_const = digit+
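The named definitions above translate almost directly into hand-written matchers. The sketch below is illustrative only: the function names (is_letter, is_digit, match_ident, match_integer_const) are assumptions, and each matcher returns the length of the longest prefix that matches, or 0 if there is no match.

```c
#include <stddef.h>

/* letter = [a-zA-Z_] */
static int is_letter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

/* digit = 0 | 1 | ... | 9 */
static int is_digit(char c) {
    return c >= '0' && c <= '9';
}

/* ident = letter ( letter | digit )*
 * Returns the length of the longest matching prefix of s, or 0 if none. */
size_t match_ident(const char *s) {
    size_t n;
    if (!is_letter(s[0]))
        return 0;                 /* must start with a letter */
    n = 1;
    while (is_letter(s[n]) || is_digit(s[n]))
        n++;                      /* the starred part: zero or more repetitions */
    return n;
}

/* integer_const = digit+
 * Returns the length of the longest matching prefix of s, or 0 if none. */
size_t match_integer_const(const char *s) {
    size_t n = 0;
    while (is_digit(s[n]))
        n++;
    return n;
}
```

On the input count = 123, match_ident reports a 5-character match (the lexeme count), while match_integer_const reports no match at that position.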

Recognizing Tokens: Finite Automata
A finite automaton is a 5-tuple (Q, Σ, T, q0, F), where:
  - Σ is a finite alphabet;
  - Q is a finite set of states;
  - T: Q × Σ → Q is the transition function;
  - q0 ∈ Q is the initial state; and
  - F ⊆ Q is a set of final states.

Finite Automata: An Example
A (deterministic) finite automaton (DFA) to match C-style comments:
[Figure: DFA state diagram]
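A DFA for C-style comments can also be written out in code. The following is a minimal sketch, not a transcription of the slide's diagram: the state names (START, SLASH, BODY, STAR, DONE, DEAD) and the functions step and is_c_comment are assumptions, and the automaton accepts a string only if the whole string is exactly one comment.

```c
/* States of the comment DFA (names are illustrative). */
typedef enum {
    START,  /* nothing read yet */
    SLASH,  /* read the opening '/' */
    BODY,   /* inside the comment */
    STAR,   /* inside the comment, just read '*' */
    DONE,   /* read the closing '/' (final state) */
    DEAD    /* no match possible */
} State;

/* Transition function T(q, c). */
static State step(State q, char c) {
    switch (q) {
    case START: return c == '/' ? SLASH : DEAD;
    case SLASH: return c == '*' ? BODY : DEAD;
    case BODY:  return c == '*' ? STAR : BODY;
    case STAR:  return c == '/' ? DONE : (c == '*' ? STAR : BODY);
    default:    return DEAD;  /* any input after DONE, or in DEAD, is rejected */
    }
}

/* Returns 1 if s is exactly one C-style comment, 0 otherwise. */
int is_c_comment(const char *s) {
    State q = START;
    for (; *s != '\0'; s++)
        q = step(q, *s);
    return q == DONE;
}
```

The STAR state is what makes inputs such as /* a **/ work: a run of asterisks stays in STAR until either a slash closes the comment or some other character returns the automaton to BODY.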

Formalizing Automata Behavior
To formalize automata behavior, we extend the transition function to deal with strings:
  Ŧ : Q × Σ* → Q
  Ŧ(q, ε) = q
  Ŧ(q, aw) = Ŧ(r, w) where r = T(q, a)
The language accepted by an automaton M is L(M) = { w | Ŧ(q0, w) ∈ F }.
A language L is regular if it is accepted by some finite automaton.
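The recursive definition of the extended transition function carries over to code line for line. The sketch below assumes a small example automaton that is not from the slides: two states over the alphabet {a, b}, accepting exactly the strings that end in b. The names T, is_final, sym, t_hat, and accepts are all illustrative.

```c
enum { NSTATES = 2, NSYMS = 2 };

/* Transition table T[state][symbol]; symbol 0 is 'a', symbol 1 is 'b'.
 * State 1 means "the last character read was b". */
static const int T[NSTATES][NSYMS] = {
    { 0, 1 },   /* from state 0: a -> 0, b -> 1 */
    { 0, 1 },   /* from state 1: a -> 0, b -> 1 */
};

/* F = { 1 } */
static const int is_final[NSTATES] = { 0, 1 };

/* Map an input character to a column of T.
 * Assumption: the input contains only 'a' and 'b'. */
static int sym(char c) {
    return c == 'a' ? 0 : 1;
}

/* Extended transition function:
 *   t_hat(q, "")  = q
 *   t_hat(q, aw)  = t_hat(T(q, a), w) */
int t_hat(int q, const char *w) {
    if (*w == '\0')
        return q;
    return t_hat(T[q][sym(*w)], w + 1);
}

/* w is in L(M) iff t_hat(q0, w) is a final state; here q0 = 0. */
int accepts(const char *w) {
    return is_final[t_hat(0, w)];
}
```

A production scanner would use a loop rather than recursion, but the recursive form makes the correspondence with the two defining equations explicit.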

Finite Automata and Lexical Analysis
The tokens of a language are specified using regular expressions. A scanner is a big DFA, essentially the “aggregate” of the automata for the individual tokens.
Issues:
  - What does the scanner automaton look like?
  - How much should we match? (When do we stop?)
  - What do we do when a match is found?
  - Buffer management (for efficiency reasons).

Structure of a Scanner Automaton

How much should we match?
In general, find the longest match possible. E.g., on input 123.45, match this as num_const(123.45) rather than num_const(123), “.”, num_const(45).
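A common way to implement the longest-match rule is to keep running the automaton while remembering the last position at which it was in a final state, then fall back to that position when no transition applies. The sketch below assumes the pattern digit+ ( '.' digit+ )? for num_const, which the slides do not spell out; the function name longest_num_const and the state names are also assumptions.

```c
#include <stddef.h>

static int is_digit(char c) {
    return c >= '0' && c <= '9';
}

/* Returns the length of the longest prefix of s that matches
 * digit+ ( '.' digit+ )?, or 0 if no prefix matches. */
size_t longest_num_const(const char *s) {
    enum { START, INT_PART, DOT, FRAC_PART } q = START;
    size_t i = 0;
    size_t last_accept = 0;  /* length of the longest match seen so far */

    for (;;) {
        char c = s[i];
        if (q == START && is_digit(c))          q = INT_PART;
        else if (q == INT_PART && is_digit(c))  q = INT_PART;
        else if (q == INT_PART && c == '.')     q = DOT;
        else if (q == DOT && is_digit(c))       q = FRAC_PART;
        else if (q == FRAC_PART && is_digit(c)) q = FRAC_PART;
        else break;                             /* no transition enabled */
        i++;
        if (q == INT_PART || q == FRAC_PART)
            last_accept = i;                    /* final state: record the match */
    }
    return last_accept;
}
```

On 123.45 the whole input is consumed as a single match. On an input such as 123. followed by a non-digit, the scanner reads the dot speculatively, finds no continuation, and backs up to the match of length 3.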

Input Buffering
Scanner performance is crucial:
  - This is the only part of the compiler that examines the entire input program one character at a time.
  - Disk input can be slow.
  - The scanner accounts for ~25-30% of total compile time.
We need lookahead to determine when a match has been found. Scanners use double-buffering to minimize the overheads associated with this.

Buffer Pairs
Use two N-byte buffers (N = size of a disk block; typically, N = 1024 or 4096).
  - Read N bytes into one half of the buffer each time. If the input has fewer than N bytes left, put a special EOF marker in the buffer.
  - When one buffer has been processed, read N bytes into the other buffer (“circular buffers”).

Buffer pairs (cont’d)
Code:
    if (fwd at end of first half)
        reload second half;
        set fwd to point to beginning of second half;
    else if (fwd at end of second half)
        reload first half;
        set fwd to point to beginning of first half;
    else
        fwd++;
It takes two tests for each advance of the fwd pointer.

Buffer pairs: Sentinels
Objective: Optimize the common case by reducing the number of tests to one per advance of fwd.
Idea: Extend each buffer half to hold a sentinel at the end.
  - This is a special character that cannot occur in a program (e.g., EOF).
  - It signals the need for some special action (fill other buffer-half, or terminate processing).

Buffer pairs with sentinels (cont’d)
Code:
    fwd++;
    if ( *fwd == EOF ) { /* special processing needed */
        if (fwd at end of first half)
            . . .
        else if (fwd at end of second half)
            . . .
        else /* end of input */
            terminate processing.
    }
The common case now needs just a single test per character.
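The sentinel scheme can be sketched in runnable C as follows. Several things here are assumptions made so that the example is self-contained: input comes from an in-memory string rather than a disk file, the sentinel is the NUL character (assumed never to occur in the source text), N is set to 4 so that the reload paths are actually exercised, and the names Scanner, reload, scanner_init, and next_char are illustrative. The lexeme-start pointer that a real scanner also maintains is omitted.

```c
#include <stddef.h>
#include <string.h>

#define N 4              /* tiny half size for demonstration; the slides suggest 1024 or 4096 */
#define SENTINEL '\0'    /* assumption: NUL never occurs in the source text */

typedef struct {
    char buf[2 * (N + 1)];  /* two halves, each with one extra slot for the sentinel */
    char *fwd;              /* next character to be read */
    const char *src;        /* stand-in for the input file */
    size_t remaining;       /* bytes of src not yet loaded */
} Scanner;

/* Load up to N bytes into the given half and terminate it with the sentinel. */
static void reload(Scanner *sc, char *half) {
    size_t n = sc->remaining < N ? sc->remaining : N;
    memcpy(half, sc->src, n);
    half[n] = SENTINEL;     /* at half[N] after a full read, earlier at end of input */
    sc->src += n;
    sc->remaining -= n;
}

void scanner_init(Scanner *sc, const char *text) {
    sc->src = text;
    sc->remaining = strlen(text);
    reload(sc, sc->buf);
    sc->fwd = sc->buf;
}

/* Returns the next input character, or -1 at end of input. */
int next_char(Scanner *sc) {
    for (;;) {
        char c = *sc->fwd;
        if (c != SENTINEL) {                          /* common case: one test */
            sc->fwd++;
            return (unsigned char)c;
        }
        if (sc->fwd == sc->buf + N) {                 /* end of first half */
            reload(sc, sc->buf + N + 1);
            sc->fwd = sc->buf + N + 1;
        } else if (sc->fwd == sc->buf + 2 * N + 1) {  /* end of second half */
            reload(sc, sc->buf);
            sc->fwd = sc->buf;
        } else {
            return -1;                                /* sentinel inside a half: end of input */
        }
    }
}
```

Only when the sentinel is seen does the code pay for the extra position checks, so ordinary characters cost a single comparison each.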

Handling Reserved Words
  - Hard-wire them directly into the scanner automaton:
      harder to modify;
      increases the size and complexity of the automaton;
      performance benefits unclear (fewer tests, but cache effects due to larger code size).
  - Fold them into “identifier” case, then look up a keyword table:
      simpler, smaller code;
      table lookup cost can be mitigated using perfect hashing.
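The second approach, scanning every word as an identifier and then consulting a keyword table, might look like the sketch below. Note that it uses binary search over a sorted table via the standard bsearch function rather than the perfect hashing mentioned above; the keyword set, the TokenType enum, and the function classify_word are assumptions chosen for illustration.

```c
#include <stdlib.h>
#include <string.h>

typedef enum {
    TOK_IDENTIFIER, TOK_ELSE, TOK_IF, TOK_INT, TOK_RETURN, TOK_WHILE
} TokenType;

typedef struct {
    const char *word;
    TokenType type;
} Keyword;

/* Must stay sorted by word, since bsearch requires a sorted array. */
static const Keyword keywords[] = {
    { "else",   TOK_ELSE   },
    { "if",     TOK_IF     },
    { "int",    TOK_INT    },
    { "return", TOK_RETURN },
    { "while",  TOK_WHILE  },
};

/* Compare a lexeme (the key) against a table entry. */
static int cmp_keyword(const void *key, const void *elem) {
    return strcmp((const char *)key, ((const Keyword *)elem)->word);
}

/* Called after the scanner has matched the identifier pattern:
 * returns the keyword's token type, or TOK_IDENTIFIER if it is not reserved. */
TokenType classify_word(const char *lexeme) {
    const Keyword *k = bsearch(lexeme, keywords,
                               sizeof keywords / sizeof keywords[0],
                               sizeof keywords[0], cmp_keyword);
    return k ? k->type : TOK_IDENTIFIER;
}
```

Adding a reserved word then means adding one table row in sorted position, with no change to the scanner automaton itself.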

Implementing Finite Automata 1
Encoded as program code:
  - each state corresponds to a (labeled) code fragment;
  - state transitions are represented as control transfers.
E.g.:
    while ( TRUE ) {
        …
        state_k: ch = NextChar(); /* buffer mgt happens here */
                 switch (ch) {
                     case … : goto ...; /* state transition */
                     …
                 }
        …
        state_m: /* final state */
                 copy lexeme to where parser can get at it;
                 return token_type;
        …
    }

Direct-Coded Automaton: Example
    int scanner() {
        char ch;
        while (TRUE) {
            ch = NextChar( );
            state_1: switch (ch) { /* initial state */
                         case 'a' : goto state_2;
                         case 'b' : goto state_3;
                         default : Error();
                     }
            state_2: …
            state_3: switch (ch) {
                         …
                         default : return SUCCESS;
                     }
        } /* while */
    }

Implementing Finite Automata 2
Table-driven automata (e.g., lex, flex):
  - Use a table to encode transitions: next_state = T(curr_state, next_char);
  - Use one bit in state no. to indicate whether it’s a final (or error) state. If so, consult a separate table for what action to take.
[Figure: table T, indexed by current state and next input character]

Table-Driven Automaton: Example
    #define isFinal(s) ((s) < 0)

    int scanner() {
        int ch;                  /* int rather than char, so EOF can be represented */
        int currState = 1;
        while (TRUE) {
            ch = NextChar( );
            if (ch == EOF) return 0; /* fail */
            currState = T[currState][ch];
            if (isFinal(currState)) {
                return 1; /* success */
            }
        } /* while */
    }
[Table T: columns are the input characters a, b; rows are states; surviving entries: 1 2 3 -1]

What do we do on finding a match?
A match is found when:
  - The current automaton state is a final state; and
  - No transition is enabled on the next input character.
Actions on finding a match:
  - if appropriate, copy lexeme (or other token attribute) to where the parser can access it;
  - save any necessary scanner state so that scanning can subsequently resume at the right place;
  - return a value indicating the token found.