Scanner 2015.03.16.

Presentation transcript:

Scanner

Front End
The purpose of the front end is to deal with the input language.
  Perform a membership test: code ∈ source language?
  Is the program well-formed (semantically)?
  Build an IR version of the code for the rest of the compiler.
The front end deals with form (syntax) and meaning (semantics).
[Figure: Source code → Front End → IR → Back End → Machine code; Errors]

The Front End: Implementation Strategy
                        Scanning                          Parsing
Specify syntax          regular expressions               context-free grammars
Implement recognizer    deterministic finite automaton    push-down automaton
Perform work            Actions on transitions in automaton
[Figure: Source code → Scanner → tokens → Parser → IR; Errors]

The Front End: Why separate the scanner and the parser?
The scanner classifies words.
The parser constructs grammatical derivations.
Parsing is harder and slower.
Separation simplifies the implementation:
  Scanners are simple.
  A scanner leads to a faster, smaller parser.
A token is a pair (part of speech, word).
The scanner handles microsyntax; the parser handles syntax.
The scanner is the only pass that touches every character of the input.
[Figure: stream of characters → Scanner → stream of tokens → Parser → IR + annotations; Errors]

Scanner Generator
Why study automatic scanner construction?
  Avoid writing scanners by hand.
  Harness the theory from classes like COMP 481.
Goals:
  To simplify specification and implementation of scanners.
  To understand the underlying techniques and technologies.
Specifications are written as "regular expressions".
Words are represented as indices into a global table.
[Figure: at design time, specifications → Scanner Generator → tables or code → Scanner; at compile time, source code → Scanner → parts of speech & words]
Comp 412, Fall
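The idea of generating a scanner from a specification can be sketched in a few lines of Python. This is only an illustration, not the generator the slides describe: the token names in TOKEN_SPEC and the function build_scanner are hypothetical, and the sketch relies on Python's re module rather than on generated tables. It also takes the first alternative that matches instead of the longest match that a real scanner generator would normally choose.

```python
import re

# Hypothetical specification: (part of speech, regular expression) pairs.
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("IDENT", r"[A-Za-z][A-Za-z0-9]*"),
    ("OP", r"[+\-*/=]"),
    ("SKIP", r"[ \t\n]+"),
]


def build_scanner(spec):
    """Combine the specification into one pattern and return a scan function."""
    pattern = re.compile("|".join(f"(?P<{name}>{regex})" for name, regex in spec))

    def scan(text):
        tokens = []
        pos = 0
        while pos < len(text):
            match = pattern.match(text, pos)
            if match is None:
                raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
            # Whitespace is recognized but not reported as a token.
            if match.lastgroup != "SKIP":
                tokens.append((match.lastgroup, match.group()))
            pos = match.end()
        return tokens

    return scan


scan = build_scanner(TOKEN_SPEC)
```

Calling scan on a short input returns a list of (part of speech, word) pairs, which mirrors the token pairs the parser consumes.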

Strings and Languages
Alphabet
  An alphabet Σ is a finite set of symbols (characters).
String
  A string is a finite sequence of symbols from Σ.
  |s| denotes the length of string s.
  ε denotes the empty string, thus |ε| = 0.
Language
  A language is a countable set of strings over some fixed alphabet Σ.
  Abstract languages: Φ (the empty language) and {ε} (the language containing only the empty string).

String Operations
Concatenation
  The concatenation of two strings x and y is denoted by xy.
Identity
  The empty string is the identity under concatenation: εs = sε = s.
Exponentiation
  Define s^0 = ε and s^i = s^(i-1) s for i > 0.
  By this definition, s^1 = s and s^2 = ss.

Language Operations
Union: L ∪ M = { s | s ∈ L or s ∈ M }
Concatenation: LM = { xy | x ∈ L and y ∈ M }
Exponentiation: L^0 = {ε}, L^i = L^(i-1) L
Kleene closure: L* = ∪ i=0,…,∞ L^i
Positive closure: L+ = ∪ i=1,…,∞ L^i
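For finite languages, these operations can be tried out directly with Python sets of strings. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the empty Python string stands for ε, and the closures are truncated at a chosen max_power because the true L* and L+ are infinite whenever L contains a non-empty string.

```python
def concat(L, M):
    """LM = { xy | x in L and y in M }."""
    return {x + y for x in L for y in M}


def power(L, i):
    """L^0 = {ε}; L^i = L^(i-1) L."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def kleene_closure(L, max_power):
    """Union of L^0 .. L^max_power, a finite approximation of L*."""
    result = set()
    for i in range(max_power + 1):
        result |= power(L, i)
    return result


def positive_closure(L, max_power):
    """Union of L^1 .. L^max_power, a finite approximation of L+."""
    result = set()
    for i in range(1, max_power + 1):
        result |= power(L, i)
    return result


L = {"a", "b"}
M = {"c"}
union = L | M  # set union is the language union
```

The only difference between the two closures is whether L^0 = {ε} is included, which is why positive_closure starts its loop at 1.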

Regular Expressions
Regular expressions are a convenient means of specifying certain simple sets of strings.
We use regular expressions to define the structure of tokens.
Tokens are built from symbols of a finite vocabulary.
Regular Sets
Regular sets are the sets of strings defined by regular expressions.

Regular Expressions
Basis symbols:
  ε is a regular expression denoting the language L(ε) = {ε}.
  a ∈ Σ is a regular expression denoting L(a) = {a}.
If r and s are regular expressions denoting languages L(r) and L(s) respectively, then:
  r | s is a regular expression denoting L(r) ∪ L(s)
  rs is a regular expression denoting L(r)L(s)
  r* is a regular expression denoting L(r)*
  (r) is a regular expression denoting L(r)
A language defined by a regular expression is called a regular set.

Operator Precedence
Operator         Precedence    Associativity
*                highest       left
concatenation    second        left
|                lowest        left

Algebraic Laws for Regular Expressions
Law                              Description
r | s = s | r                    | is commutative
r | (s | t) = (r | s) | t        | is associative
r(st) = (rs)t                    concatenation is associative
r(s|t) = rs | rt                 concatenation distributes over |
(s|t)r = sr | tr
εr = rε = r                      ε is the identity for concatenation
r* = (r | ε)*                    ε is guaranteed in a closure
r** = r*                         * is idempotent

Examples of Regular Expressions
Identifiers:
  Letter → (a|b|c| … |z|A|B|C| … |Z)
  Digit → (0|1|2| … |9)
  Identifier → Letter ( Letter | Digit )*
Numbers:
  Integer → (+|-|ε) (0 | (1|2|3| … |9)(Digit*))
  Decimal → Integer "." Digit*
  Real → ( Integer | Decimal ) "E" (+|-|ε) Digit*
  Complex → "(" Real "," Real ")"
Numbers can get much more complicated!
Quoted symbols (underlined on the slide) indicate literal characters in the input stream.
Identifier is shorthand for (a|b|c| … |z|A|B|C| … |Z) ( (a|b|c| … |z|A|B|C| … |Z) | (0|1|2| … |9) )*
Using symbolic names does not imply recursion.
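These definitions translate almost symbol for symbol into Python's re syntax. The sketch below is one possible transcription, not part of the slides: the constant names and the helper is_match are assumptions, and the symbolic names are expanded by plain string substitution, which matches the remark that symbolic names do not imply recursion. The patterns follow the slide literally, so an exponent with no digits (for example 1E) is accepted.

```python
import re

# Each symbolic name is expanded textually into the patterns that use it.
LETTER = r"[A-Za-z]"
DIGIT = r"[0-9]"
IDENTIFIER = rf"{LETTER}(?:{LETTER}|{DIGIT})*"
INTEGER = rf"[+-]?(?:0|[1-9]{DIGIT}*)"
DECIMAL = rf"{INTEGER}\.{DIGIT}*"
REAL = rf"(?:{INTEGER}|{DECIMAL})E[+-]?{DIGIT}*"


def is_match(pattern, text):
    """Return True when the whole text is in the language of the pattern."""
    return re.fullmatch(pattern, text) is not None
```

Using fullmatch rather than match or search tests membership of the entire string, which corresponds to the membership test the front end performs.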

Finite Automata
Finite automata (FA) are recognizers: an FA simply says "Yes" or "No" about each possible input string.
An FA can be used to recognize the tokens specified by a regular expression.
FAs are used in the design of a lexical analyzer generator.
There are two kinds of finite automata:
  Nondeterministic finite automata (NFA)
  Deterministic finite automata (DFA)
Both DFA and NFA are capable of recognizing the same languages.

NFA Definitions
NFA = { S, Σ, δ, s0, F }
  A finite set of states S
  A set of input symbols Σ (the input alphabet); ε is not in Σ
  A transition function δ : S × (Σ ∪ {ε}) → 2^S, mapping a state and a symbol (or ε) to a set of next states
  A special start state s0
  A set of final (accepting) states F, F ⊆ S

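Transition Graph for FA
In a transition graph, a circle is a state, a labeled arrow is a transition, an incoming arrow marked "start" identifies the start state, and a double circle is a final state.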

Example
[Figure: transition graph with edges labeled a, b, c, c, a]
This machine accepts abccabc, but it rejects abcab.
This machine accepts (abc+)+.
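Since the diagram itself is not reproduced in the transcript, the following Python sketch encodes one automaton for (abc+)+ as a dictionary. The state numbering and the exact edges are assumptions chosen to fit the five edge labels and the stated behavior, not a copy of the slide's figure.

```python
# Hypothetical state numbering: 0 is the start state, 3 is the only final state.
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 2,
    (2, "c"): 3,
    (3, "c"): 3,  # additional c's stay in the final state (the c+ part)
    (3, "a"): 1,  # an a starts another abc+ group (the outer +)
}
START = 0
FINAL = {3}


def accepts(text):
    """Follow one transition per character; reject when no edge exists."""
    state = START
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in FINAL
```

On abcab the machine ends in state 2, which is not final, so the string is rejected even though every character had a matching edge.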

Transition Table
The mapping δ of an NFA can be represented in a transition table.
[Figure: NFA transition graph with start state 0 and edges labeled a, b, b, a, b]
STATE    a         b      ε
0        {0, 1}    {0}    -
1        -         {2}    -
2        -         {3}    -
3        -         -      -
δ(0, a) = {0, 1}
δ(0, b) = {0}
δ(1, b) = {2}
δ(2, b) = {3}
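A transition table like this one maps naturally onto a dictionary, and the NFA can then be simulated by tracking the set of states it could be in. The Python sketch below assumes state 0 is the start state and state 3 the only final state, as in the NFA vs DFA slide; the function names and the use of the empty string as the key for ε are choices made for this illustration.

```python
EPSILON = ""  # key used for ε-moves; this table happens to have none

# δ from the transition table; missing entries mean the empty set.
DELTA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
FINAL = {3}


def epsilon_closure(states):
    """All states reachable from the given states using only ε-moves."""
    closure = set(states)
    worklist = list(states)
    while worklist:
        s = worklist.pop()
        for t in DELTA.get((s, EPSILON), set()):
            if t not in closure:
                closure.add(t)
                worklist.append(t)
    return closure


def move(states, symbol):
    """All states reachable from the given states on one input symbol."""
    result = set()
    for s in states:
        result |= DELTA.get((s, symbol), set())
    return result


def nfa_accepts(text):
    """Accept when some final state is among the possible current states."""
    current = epsilon_closure({START})
    for ch in text:
        current = epsilon_closure(move(current, ch))
    return bool(current & FINAL)
```

The simulation never guesses: it carries every state the NFA could occupy, so the input is accepted exactly when at least one path ends in a final state.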

DFA
A DFA is a special case of an NFA:
  There are no moves on input ε.
  For each state s and input symbol a, there is exactly one edge out of s labeled a.
Both DFA and NFA are capable of recognizing the same languages.

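NFA vs DFA
Both automata recognize (a | b)*abb.
NFA: S = {0, 1, 2, 3}, Σ = { a, b }, s0 = 0, F = {3}
[Figure: NFA transition graph with start state 0 and edges labeled a, b, b, a, b]
[Figure: DFA transition graph for the same language]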
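One standard way to obtain a DFA from an NFA is the subset construction, in which each DFA state is a set of NFA states. The slides do not say which construction produced their DFA, so the Python sketch below is an assumption about the technique, with hypothetical function names. It also assumes the NFA has no ε-moves, which holds for the NFA for (a | b)*abb given above.

```python
from collections import deque

# NFA for (a | b)*abb: S = {0, 1, 2, 3}, s0 = 0, F = {3}.
NFA_DELTA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
ALPHABET = ("a", "b")


def subset_construction(delta, start, final, alphabet):
    """Build a DFA whose states are frozensets of NFA states (no ε-moves)."""
    start_set = frozenset({start})
    dfa_delta = {}
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            target = frozenset(
                t for s in current for t in delta.get((s, symbol), ())
            )
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # A DFA state is final when it contains at least one NFA final state.
    dfa_final = {state for state in seen if state & final}
    return dfa_delta, start_set, dfa_final, seen


DFA_DELTA, DFA_START, DFA_FINAL, DFA_STATES = subset_construction(
    NFA_DELTA, 0, {3}, ALPHABET
)


def dfa_accepts(text):
    """Run the constructed DFA: exactly one transition per input symbol."""
    state = DFA_START
    for ch in text:
        state = DFA_DELTA.get((state, ch))
        if state is None:  # symbol outside the alphabet
            return False
    return state in DFA_FINAL
```

For this NFA the construction reaches four subsets, {0}, {0, 1}, {0, 2} and {0, 3}, and every subset has exactly one outgoing edge per symbol, which is the defining property of a DFA.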

Concept