Scanner
Associate Professor 許良全, Computer Center, Chung Cheng Institute of Technology
Copyright © 1998 by LCH, Compiler Design

Overview of Scanning
- The purpose of a scanner is to group input characters into tokens.
- A scanner is sometimes called a lexical analyzer.
- A precise definition of tokens is necessary to ensure that lexical rules are properly enforced.
  - Scanners normally seek to make a token as long as possible. E.g., ABC is scanned as one identifier rather than three.
- All scanners perform much the same function.
  - Using a scanner generator limits the effort of building a scanner from scratch.
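
The "as long as possible" rule can be sketched in a few lines of Python. This is only an illustration under assumed token classes (IDENT, NUMBER, ASSIGN, and COLON are hypothetical names, not taken from the slides): at each position every pattern is tried and the longest match wins.

```python
import re

# Hypothetical token classes for illustration; not taken from the slides.
TOKEN_PATTERNS = [
    ("IDENT", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("NUMBER", re.compile(r"[0-9]+")),
    ("ASSIGN", re.compile(r":=")),
    ("COLON", re.compile(r":")),
]


def scan(text):
    """Split text into (kind, lexeme) pairs, always taking the longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # whitespace separates tokens but is not a token itself
            continue
        best_kind, best_len = None, 0
        for kind, pattern in TOKEN_PATTERNS:
            m = pattern.match(text, pos)
            if m and len(m.group()) > best_len:
                best_kind, best_len = kind, len(m.group())
        if best_kind is None:
            raise ValueError(f"illegal character {text[pos]!r} at position {pos}")
        tokens.append((best_kind, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

With these assumed patterns, scan("ABC") yields a single IDENT token, and ":=" is scanned as one ASSIGN token rather than a COLON followed by something else.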

Finite State Systems
- The finite state automaton is a mathematical model of a system with discrete inputs and outputs.

Examples of Finite State Systems
- Elevators
  - They do not remember all previous requests for service, only the current floor, the direction of motion, and the collection of not-yet-satisfied requests for service.
- Vending machines
  - Insert enough coins and you'll get a Pepsi eventually.
- Computers
  - The state of the CPU, main memory, and auxiliary storage at any time is one of a very large but finite number of states.
- Human brains
  - 2^35 cells or neurons at most

Definition of Finite Automata
- A finite automaton (FA) is an idealized computer that recognizes strings belonging to regular sets. It is defined by a 5-tuple (Q, Σ, δ, q0, F):
  - A finite set of states, Q.
  - A finite input alphabet, Σ, or vocabulary, V.
  - A special start, or initial, state q0, with q0 ∈ Q.
  - A set of final, or accepting, states F, with F ⊆ Q.
  - A transition function, δ, that maps Q × Σ to Q.
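
To make the 5-tuple concrete, here is a minimal Python sketch. The machine is a made-up example (it accepts binary strings with an even number of 1s), and the names Q, SIGMA, DELTA, Q0, F, and accepts are chosen for illustration only.

```python
# A deterministic FA written out as the 5-tuple (Q, Sigma, delta, q0, F).
# The machine is a made-up example: it accepts binary strings that
# contain an even number of 1s.
Q = {"even", "odd"}
SIGMA = {"0", "1"}
DELTA = {  # transition function: (state, input symbol) -> next state
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}
Q0 = "even"
F = {"even"}


def accepts(string):
    """Run the automaton over string and report whether it ends in a final state."""
    state = Q0
    for ch in string:
        if ch not in SIGMA:
            return False  # symbol outside the input alphabet
        state = DELTA[(state, ch)]
    return state in F
```

The dictionary DELTA is also a transition table in miniature: one row per state, one column per input symbol.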

FA and Transition Diagrams

FA and Transition Tables

Regular Expressions
- The languages accepted by finite automata are easily described by simple expressions called regular expressions.
- Strings are built from characters in V via catenation.
  - E.g., !=, for, while
- An empty or null string, denoted by λ, is allowed.
- The characters (, ), ', *, +, and | are called meta-characters. They must be quoted when used as ordinary characters in order to avoid ambiguity.
  - E.g., Delim = ('('|')'|:=|;|,|'+'|-|'*'|/|=|$$$)

Definition of Regular Expression
- A regular expression denotes a set of strings:
  - ∅ is a regular expression denoting the empty set (the set containing no strings).
  - λ is a regular expression denoting the set that contains only the empty string.
    - Note that this set contains one element.
  - A string s is a regular expression denoting a set containing only s. If s contains meta-characters, s can be quoted to avoid ambiguity.
  - If A and B are regular expressions, then A|B, AB, and A* are also regular expressions, corresponding to alternation, catenation, and Kleene closure respectively.

Properties of Regular Expressions
- Let P and Q be sets of strings.
  - The string s ∈ (P|Q) iff s ∈ P or s ∈ Q.
  - The string s ∈ P* iff s can be broken into zero or more pieces, s = s1 s2 s3 … sn, such that each si ∈ P.
  - P+ denotes all strings consisting of one or more strings in P catenated together.
  - P* = (P+ | λ) and P+ = PP* = P*P.
  - If A is a set of characters, Not(A) denotes (V − A): all characters in V not included in A.
  - If k is a constant, the set A^k represents all strings formed by catenating k strings from A, i.e., A^k = (AAA…) (k copies).
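
These identities can be checked mechanically on small examples. The sketch below models a set of strings as a Python set and, as an assumption needed to keep everything finite, truncates the closures to strings of at most max_len characters. The function names (alt, cat, power, star, plus) are illustrative only; the empty string is written "".

```python
def alt(p, q):
    """P|Q: strings that are in P or in Q."""
    return p | q


def cat(p, q):
    """PQ: every string of P followed by every string of Q."""
    return {s + t for s in p for t in q}


def power(a, k):
    """A^k: catenation of k strings from A (A^0 holds only the empty string)."""
    result = {""}
    for _ in range(k):
        result = cat(result, a)
    return result


def star(p, max_len):
    """P*, truncated to strings of length <= max_len so the set stays finite."""
    result = {""}
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more piece from P.
        frontier = {s for s in cat(frontier, p) if len(s) <= max_len} - result
        result |= frontier
    return result


def plus(p, max_len):
    """P+, truncated the same way: one or more strings from P."""
    return {s for s in cat(p, star(p, max_len)) if len(s) <= max_len}
```

For P = {"a", "bc"} and max_len = 4, star(P, 4) equals plus(P, 4) together with the empty string, and plus(P, 4) comes out the same whether it is built as PP* or as P*P.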

Examples of Regular Expressions
- Let D = (0|…|9) and L = (A|…|Z).
- A comment that begins with -- and ends with Eol
  - Comment = --Not(Eol)*Eol
- A fixed decimal literal
  - Lit = D+.D+
- An identifier, composed of letters, digits, and underscores, that begins with a letter, ends with a letter or digit, and contains no consecutive underscores
  - ID = L(L|D)*(_(L|D)+)*
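
The three definitions can be tried out with Python's re module. This is a transliteration under stated assumptions, not the slides' own notation: Eol is taken to be the newline character, L is restricted to upper-case letters as defined above, and the helper names are hypothetical.

```python
import re

# Python `re` spellings of the three definitions above.
# Assumptions: D = [0-9], L = [A-Z], and Eol is "\n".
COMMENT = re.compile(r"--[^\n]*\n")              # --Not(Eol)*Eol
LIT = re.compile(r"[0-9]+\.[0-9]+")              # D+.D+
ID = re.compile(r"[A-Z][A-Z0-9]*(_[A-Z0-9]+)*")  # L(L|D)*(_(L|D)+)*


def is_comment(s):
    return COMMENT.fullmatch(s) is not None


def is_literal(s):
    return LIT.fullmatch(s) is not None


def is_identifier(s):
    return ID.fullmatch(s) is not None
```

The ID pattern only lets an underscore appear when at least one letter or digit follows it, which is what rules out both a trailing underscore and two underscores in a row.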

Using a Scanner Generator: Lex
- Lex is a lexical analyzer generator developed by Lesk and Schmidt of AT&T Bell Labs, written in C and running under UNIX.
- Lex produces an entire scanner module that can be compiled and linked with other compiler modules.
- Lex associates regular expressions with arbitrary code fragments. When an expression is matched, the code segment is executed.
- A typical Lex program contains three sections separated by %% delimiters.

First Section of Lex
- The first section defines character classes and auxiliary regular expressions. (Fig. 3.5 on p. 67)
  - [] delimits character classes.
  - The hyphen (-) denotes ranges: [xyz] == [x-z].
  - \ denotes the escape character, as in C.
  - ^ complements a character class (Not): [^xy] denotes all characters except x and y.
  - |, *, and + (alternation, Kleene closure, and positive closure) are provided.
  - () can be used to control grouping of subexpressions.
  - (expr)? == (expr)|λ, i.e., it matches expr zero times or once.
  - {} signals the macro expansion of a symbol defined in the first section.

First Section of Lex, cont.
- Catenation is specified by the juxtaposition of two expressions; no explicit operator is used.
  - [ab][cd] will match any of ac, ad, bc, and bd.
  - begin == "begin" == [b][e][g][i][n]
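
Most of this notation carries over unchanged to Python's re module, so the claims can be checked there. This is an approximation only: Python is not Lex, and in particular it has no counterpart to the {} macro expansion or to double-quoted literals. The helper name matches is hypothetical.

```python
import re
from itertools import product


def matches(pattern, s):
    """True when the whole string s is matched by pattern."""
    return re.fullmatch(pattern, s) is not None


# Try every two-letter string over {a, b, c, d} against [ab][cd].
two_letter = ["".join(pair) for pair in product("abcd", repeat=2)]
accepted = [s for s in two_letter if matches(r"[ab][cd]", s)]

# Ranges, complemented classes, optional groups, and plain catenation.
range_same = matches(r"[xyz]", "y") and matches(r"[x-z]", "y")
complement_ok = matches(r"[^xy]", "z") and not matches(r"[^xy]", "x")
optional_ok = matches(r"(ab)?", "") and matches(r"(ab)?", "ab")
begin_same = matches(r"begin", "begin") and matches(r"[b][e][g][i][n]", "begin")
```

Exactly four of the sixteen candidate strings survive in accepted: ac, ad, bc, and bd.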

Second Section of Lex
- The second section of Lex defines a table of regular expressions and corresponding commands.
  - When an expression is matched, its associated command is executed.
    - Auxiliary functions may be defined in the third section.
  - Input that is matched is stored in the string variable yytext, whose length is yyleng.
  - Lex creates an integer function yylex() that may be called from the parser.
    - The value returned is usually the token code of the token scanned by Lex.
  - When yylex() encounters end of file, it calls a user-supplied integer function named yywrap() to wrap up input processing.
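
The rule table and the yytext/yyleng/yylex()/yywrap() protocol can be imitated in Python to show how the pieces fit together. This is a stand-in written for illustration, not code that Lex generates; the Scanner class and the token codes IDENT, NUMBER, and PLUS are assumptions. It follows the usual Lex conventions that the longest match wins, that the earlier rule wins a tie, and that yylex() returns 0 at end of input.

```python
import re

# Hypothetical token codes; a real compiler would share these with its parser.
IDENT, NUMBER, PLUS = 1, 2, 3


class Scanner:
    """A small stand-in for a Lex-generated scanner (not real Lex output)."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.yytext = ""  # text of the most recent match
        self.yyleng = 0   # length of that text
        # Rule table: (regular expression, action). An action returns a token
        # code, or None to discard the match and keep scanning.
        self.rules = [
            (re.compile(r"[ \t\n]+"), lambda: None),
            (re.compile(r"[A-Za-z][A-Za-z0-9]*"), lambda: IDENT),
            (re.compile(r"[0-9]+"), lambda: NUMBER),
            (re.compile(r"\+"), lambda: PLUS),
        ]

    def yywrap(self):
        """Called at end of input; returning 1 means there is no more input."""
        return 1

    def yylex(self):
        """Return the next token code, or 0 once the input is exhausted."""
        while True:
            if self.pos >= len(self.text):
                if self.yywrap():
                    return 0
                continue  # an overriding yywrap() is assumed to supply new text
            best, best_action = None, None
            for pattern, action in self.rules:
                m = pattern.match(self.text, self.pos)
                # Longest match wins; on a tie the earlier rule is kept.
                if m and m.end() > self.pos and (best is None or m.end() > best.end()):
                    best, best_action = m, action
            if best is None:
                raise ValueError(f"no rule matches at position {self.pos}")
            self.yytext, self.yyleng = best.group(), len(best.group())
            self.pos = best.end()
            code = best_action()
            if code is not None:
                return code
```

A parser would call yylex() repeatedly and read yytext after each call whenever it needs the lexeme itself, for example the spelling of an identifier.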

Dealing with Multiple Input Files
- yylex() uses three user-defined functions to handle character I/O:
  - input(): retrieves a single character; returns 0 on EOF.
  - output(c): writes a single character to the output.
  - unput(c): puts a single character back on the input to be re-read.
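
The reason unput(c) exists is that a scanner often has to read one character too many before it knows a token has ended. A minimal Python sketch of that pushback idea follows; CharSource and read_number are hypothetical names, and the class only imitates the behaviour described above rather than Lex's actual implementation.

```python
class CharSource:
    """Character source with pushback, modelled loosely on input()/unput()."""

    def __init__(self, text):
        self._chars = list(reversed(text))  # the next character sits at the end

    def input(self):
        """Return the next character, or 0 at end of input."""
        return self._chars.pop() if self._chars else 0

    def unput(self, c):
        """Push c back so that the next input() call returns it."""
        self._chars.append(c)


def read_number(src):
    """Read a run of digits, returning the lookahead character to the source."""
    digits = ""
    c = src.input()
    while c != 0 and c.isdigit():
        digits += c
        c = src.input()
    if c != 0:
        src.unput(c)  # one character of lookahead was consumed; give it back
    return digits
```

After read_number() returns, the character that terminated the number is still available to whatever scans the next token.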

Translating Regular Expressions into Finite Automata
- Remember the relationship between RE and FA.
- The main job of a scanner generator program is to transform a regular expression definition into an equivalent (D)FA.
- A regular expression is first translated into a nondeterministic finite automaton (NFA), and the NFA is then translated into a DFA (two steps).
- An NFA, when reading a particular input, is not required to make a unique (deterministic) choice of which state to visit.

Translating RE into NFA
- Any regular expression can be transformed into an NFA with the following properties:
  - There is a unique final state.
  - The final state has no successors.
  - Every other state has either one or two successors.
- Regular expressions are built out of the atomic regular expressions a (where a is a character in V) and λ by using the three operations AB, A|B, and A*.

NFA for a and λ

An NFA for A|B

An NFA for AB

An NFA for A*
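
The four building blocks can be written as small functions that glue NFA fragments together. The sketch below follows the construction commonly known as Thompson's construction, which satisfies the three properties listed earlier; it may differ in detail from the diagrams on the original slides, and the class and method names are illustrative assumptions. An edge label of None stands for a move on the empty string.

```python
import itertools


class NFABuilder:
    """Builds NFA fragments, each identified by a (start, final) pair of states."""

    def __init__(self):
        self.edges = {}  # state -> list of (label, target); label None = empty move
        self._ids = itertools.count()

    def _new_state(self):
        s = next(self._ids)
        self.edges[s] = []
        return s

    def _edge(self, src, label, dst):
        self.edges[src].append((label, dst))

    def symbol(self, a):
        """NFA for a single character a, or for the empty string if a is None."""
        start, final = self._new_state(), self._new_state()
        self._edge(start, a, final)
        return start, final

    def alt(self, A, B):
        """NFA for A|B: a new start branches to both, both finals join a new final."""
        start, final = self._new_state(), self._new_state()
        self._edge(start, None, A[0])
        self._edge(start, None, B[0])
        self._edge(A[1], None, final)
        self._edge(B[1], None, final)
        return start, final

    def cat(self, A, B):
        """NFA for AB: the final state of A moves on to the start of B."""
        self._edge(A[1], None, B[0])
        return A[0], B[1]

    def star(self, A):
        """NFA for A*: A may be skipped entirely or repeated any number of times."""
        start, final = self._new_state(), self._new_state()
        self._edge(start, None, A[0])
        self._edge(start, None, final)
        self._edge(A[1], None, A[0])
        self._edge(A[1], None, final)
        return start, final

    def closure(self, states):
        """All states reachable from the given states using only empty moves."""
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for label, t in self.edges[s]:
                if label is None and t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    def accepts(self, nfa, string):
        """Simulate the NFA by tracking every state it could currently be in."""
        current = self.closure({nfa[0]})
        for ch in string:
            moved = {t for s in current for label, t in self.edges[s] if label == ch}
            current = self.closure(moved)
        return nfa[1] in current
```

For example, (a|b)*c is built as b.cat(b.star(b.alt(b.symbol("a"), b.symbol("b"))), b.symbol("c")) on a builder b, and the resulting fragment can be passed to b.accepts() together with an input string.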

Translating NFA into DFA
- Each state of the DFA (M) corresponds to a set of states of the NFA (N).
  - Transforming N to M is done by subset construction.
- M will be in state {x,y,z} after reading a given input string if and only if N could be in any of the states x, y, or z, depending on the transitions it chooses.
  - M keeps track of all the possible routes N might take and runs them in parallel.
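
A compact Python sketch of the subset construction is given below. The NFA is a made-up, hand-written example for (a|b)*ab, and all names (NFA_EDGES, closure, subset_construction, dfa_accepts) are assumptions chosen for illustration. Each DFA state is a frozenset of NFA states.

```python
from collections import deque

# A made-up NFA N for (a|b)*ab, written by hand. Keys are (state, symbol);
# the symbol None would mark a move that consumes no input.
NFA_EDGES = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
NFA_START, NFA_FINAL = 0, {2}
ALPHABET = ("a", "b")


def closure(states):
    """All NFA states reachable from the given states using only empty moves."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA_EDGES.get((s, None), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)


def subset_construction():
    """Build the DFA M whose states are sets of states of the NFA N."""
    start = closure({NFA_START})
    delta = {}
    todo = deque([start])
    seen = {start}
    while todo:
        S = todo.popleft()
        for ch in ALPHABET:
            moved = set()
            for s in S:
                moved |= NFA_EDGES.get((s, ch), set())
            T = closure(moved)  # every state N could be in after reading ch
            delta[(S, ch)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    # A DFA state is final when it contains at least one final NFA state.
    finals = {S for S in seen if S & NFA_FINAL}
    return start, delta, finals


def dfa_accepts(string):
    """Run the constructed DFA; strings with symbols outside ALPHABET are rejected."""
    start, delta, finals = subset_construction()
    state = start
    for ch in string:
        if ch not in ALPHABET:
            return False
        state = delta[(state, ch)]
    return state in finals
```

For this NFA the construction reaches three DFA states: {0}, {0,1}, and {0,2}, of which only {0,2} is final.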