CS 536 Fall 2000 Scanner Construction


- Given a single string, automata and regular expressions returned a Boolean answer: the string is or is not in a language.
- In contrast, given an input (an EOF-terminated "long" string), a scanner returns a series of tokens: it finds the longest lexeme and returns the corresponding token.

Putting it all together
  Lexical specification -> Regular expressions -> NFA -> DFA -> Table-driven implementation of DFA

Let's build a scanner for a very simple language
The language of assignment statements: LHS = RHS …
- The left-hand side of an assignment is a Pascal identifier: a letter followed by zero or more letters or digits.
- The right-hand side is one of the following:
    ID + ID
    ID * ID
    ID == ID

Step 1: Define tokens
- Our language has five tokens; they can be defined by five regular expressions:
    Token    Regular Expression
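One way the five definitions might be written down, as a sketch only. The token names follow the NFA slide (ASSIGN, ID, PLUS, TIMES, EQUALS); the Python `re` syntax, the ASCII-only character classes, and the names `TOKEN_RES` and `tokens_matching` are assumptions of this example, not part of the lecture.

```python
import re

# Hypothetical table: token name -> regular expression (Python `re` syntax).
TOKEN_RES = {
    "ASSIGN": r"=",
    "ID":     r"[A-Za-z][A-Za-z0-9]*",  # a letter, then letters or digits
    "PLUS":   r"\+",
    "TIMES":  r"\*",
    "EQUALS": r"==",
}

def tokens_matching(lexeme):
    """Return the names of all tokens whose RE matches the whole lexeme."""
    return [name for name, pattern in TOKEN_RES.items()
            if re.fullmatch(pattern, lexeme)]
```

Each lexeme of the language matches exactly one of the five expressions, so `tokens_matching` returns a one-element list for a valid lexeme and an empty list otherwise.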

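Step 2: Convert REs to NFAs
  ASSIGN: one edge labeled "="
  ID:     one edge labeled letter, followed by a loop labeled letter | digit
  PLUS:   one edge labeled "+"
  TIMES:  one edge labeled "*"
  EQUALS: an edge labeled "=" followed by a second edge labeled "="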

Step 4: Combining per-token DFAs
- Goal of a scanner: find the longest prefix of the current input that corresponds to a token.
- This has two consequences:
    - Lookahead: examine whether the next input character can "extend" the current token. If it can, keep building a larger token.
    - A real scanner cannot get stuck: what if we get stuck while building the larger token? Solution: return characters back to the input.
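The longest-prefix rule can be sketched as follows. This version remembers the last accepting position and resumes from there, which is one common way to realize "return characters back to input"; it is not necessarily how the lecture's scanner does it. The function name, the dictionary encoding of the DFA, and the two-token example machine are all assumptions made for illustration.

```python
def longest_match(dfa, accepting, start, text, pos):
    """Run the DFA on text[pos:], remembering the last accepting state seen.

    dfa:       dict mapping (state, char) -> next state
    accepting: dict mapping accepting state -> token name
    Returns (token, end) for the longest prefix that is a token, or None.
    Characters read beyond `end` are implicitly put back: the caller
    resumes scanning at `end`.
    """
    state, last = start, None
    i = pos
    while i < len(text) and (state, text[i]) in dfa:
        state = dfa[(state, text[i])]
        i += 1
        if state in accepting:
            last = (accepting[state], i)
    return last

# Hypothetical two-token machine: "=" is ASSIGN, "==" is EQUALS.
DFA = {("S", "="): "A", ("A", "="): "E"}
ACCEPTING = {"A": "ASSIGN", "E": "EQUALS"}
```

With this machine, scanning `"==x"` from position 0 yields `("EQUALS", 2)` rather than stopping at the first `=`, while `"=x"` yields `("ASSIGN", 1)`.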

Furthermore …
- In general the input can correspond to a series of tokens (lexemes), not just a single token.
    Problem: it is no longer correct to run the FSM until it gets stuck or the whole string is consumed. So how do we partition the input into lexemes?
    Solution: a token must be returned when a regular expression is matched.
- Some lexemes (like whitespace and comments) do not correspond to tokens.
    Problem: how do we "discard" these lexemes?
    Solution: after finding such a lexeme, the scanner simply starts again and tries to match another regular expression.
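A minimal sketch of the "start again" idea, using Python's `re` module in place of a hand-built DFA. The `SPEC` list, the whitespace pattern, the `"EOF"` return value, and the function name are assumptions of this example; the lecture's language does not itself define whitespace.

```python
import re

# Hypothetical specification: (token name, pattern).
# A name of None marks a lexeme that is recognized but discarded.
SPEC = [
    (None,     r"[ \t\n]+"),
    ("ID",     r"[A-Za-z][A-Za-z0-9]*"),
    ("EQUALS", r"=="),
    ("ASSIGN", r"="),
    ("PLUS",   r"\+"),
    ("TIMES",  r"\*"),
]

def next_token(text, pos):
    """Return (token, lexeme, new_pos), or ("EOF", "", pos) at end of input."""
    while pos < len(text):
        # Try every pattern at pos and keep the longest lexeme.
        best_name, best_end = None, pos
        for name, pattern in SPEC:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_end == pos:
            raise ValueError(f"bad character {text[pos]!r} at position {pos}")
        lexeme = text[pos:best_end]
        pos = best_end
        if best_name is not None:
            return best_name, lexeme, pos
        # A discarded lexeme: produce no token, simply start again.
    return "EOF", "", pos
```

Each call returns one token, so a caller partitions the input by calling `next_token` repeatedly, passing the returned position back in.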

Extend the DFA
- Modify the DFA so that an edge can have an associated action:
    "put back one character", or
    "return token XXX".
  Such a DFA is called a transducer.
- We must combine the DFAs for all of the tokens into a single DFA, and

Step 4: Example of extending the DFA
The DFA that recognizes Pascal identifiers must be modified as follows:
- Recall that the scanner is called by the parser (one token is returned per call).
- Hence the return action puts the scanner back into state S.
  Edges of the modified DFA:
    S --letter--> identifier state
    identifier state --letter | digit--> identifier state
    identifier state --any char except letter or digit--> final state; action: put back 1 char, return ID
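The identifier transducer above might be hand-coded like this. It is a sketch under several assumptions: the function name `scan_id`, the state name `"ID"` for the middle state, the use of `str.isalpha` and `str.isalnum` to classify characters (which also accept non-ASCII letters and digits), and treating end of input like any other non-letter, non-digit character.

```python
def scan_id(text, pos):
    """Transducer for identifiers.

    Returns ("ID", lexeme, new_pos), or None if no identifier starts at pos.
    Every call begins in state S, mirroring "return puts the scanner into S".
    """
    state = "S"
    start = pos
    while True:
        ch = text[pos] if pos < len(text) else ""  # "" stands for end of input
        pos += 1                                    # read a character
        if state == "S":
            if ch.isalpha():
                state = "ID"
            else:
                return None                         # stuck: not an identifier
        else:  # state == "ID"
            if ch.isalnum():
                continue                            # letter | digit loop
            pos -= 1                                # action: put back 1 char
            return "ID", text[start:pos], pos       # action: return ID
```

For the input `"x1+y"`, the call `scan_id("x1+y", 0)` reads `x`, `1`, and `+`, puts the `+` back, and returns `("ID", "x1", 2)`, so the next call starts at the `+`.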

Implementing the extended DFA
- The table-driven technique works, with a few small modifications:
    - Include a column for end-of-file, e.g., to find an identifier when it is the last token in the input.
    - Besides the "next state", a table entry includes an (optional) action: put back n characters, return token.
    - Instead of repeating "read a character; update the state variable" until the machine gets stuck or the entire input is read, the scanner repeats "read a character; update the state variable; perform the action". Eventually the action will be to return a value, so the scanner code will stop.

Step 4: Example: Combined DFA for our language
  States: S (start), ID, TMP, F1, F2, F3, F4, F5
  Edges:
    S   --letter-->                          ID
    ID  --letter | digit-->                  ID
    ID  --any char except letter or digit--> F2   action: put back 1 char; return ID
    S   --"+"-->                             F3   action: return PLUS
    S   --"*"-->                             F4   action: return TIMES
    S   --"="-->                             TMP
    TMP --"="-->                             F5   action: return EQUALS
    TMP --any char except "="-->             F1   action: put back 1 char; return ASSIGN

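Transition Table (part 1)
        | +                                   | *                                   | =
  S     | F3, return PLUS                     | F4, return TIMES                    | TMP
  ID    | F2, put back 1 char; return ID      | F2, put back 1 char; return ID      | F2, put back 1 char; return ID
  TMP   | F1, put back 1 char; return ASSIGN  | F1, put back 1 char; return ASSIGN  | F5, return EQUALS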

Transition Table (part 2)
        | letter                              | digit                               | EOF
  S     | ID                                  |                                     |
  ID    | ID                                  | ID                                  | F2, put back 1 char; return ID
  TMP   | F1, put back 1 char; return ASSIGN  | F1, put back 1 char; return ASSIGN  | F1, put back 1 char; return ASSIGN
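The combined DFA and its transition table could be driven by a loop like the one below. This is a sketch, not the course's implementation: the encoding of an entry as `(next state, characters to put back, token or None)`, the names `TABLE`, `char_class`, `next_token`, and `scan`, and the choice to raise `ValueError` on an empty table cell are all assumptions. Character classes use `str.isalpha` and `str.isdigit`, which also accept non-ASCII letters and digits.

```python
EOF = "EOF"  # column label for end of input

def char_class(ch):
    """Map an input character (or EOF) to a table column."""
    if ch == EOF:
        return EOF
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return ch  # "+", "*", "=" are their own columns

# TABLE[state][column] = (next state, chars to put back, token or None)
TABLE = {
    "S": {"+": ("F3", 0, "PLUS"), "*": ("F4", 0, "TIMES"),
          "=": ("TMP", 0, None), "letter": ("ID", 0, None)},
    "ID": {"letter": ("ID", 0, None), "digit": ("ID", 0, None),
           **{c: ("F2", 1, "ID") for c in ("+", "*", "=", EOF)}},
    "TMP": {"=": ("F5", 0, "EQUALS"),
            **{c: ("F1", 1, "ASSIGN")
               for c in ("+", "*", "letter", "digit", EOF)}},
}

def next_token(text, pos):
    """Return (token, new_pos); raise ValueError on an empty table cell."""
    state = "S"  # every call starts in the start state
    while True:
        ch = text[pos] if pos < len(text) else EOF
        pos += 1                                   # read a character
        entry = TABLE[state].get(char_class(ch))
        if entry is None:
            raise ValueError(f"no transition from {state} on {ch!r}")
        state, put_back, token = entry             # update the state variable
        pos -= put_back                            # perform the action
        if token is not None:
            return token, pos

def scan(text):
    """Collect the tokens of text by calling next_token repeatedly."""
    tokens, pos = [], 0
    while pos < len(text):
        token, pos = next_token(text, pos)
        tokens.append(token)
    return tokens
```

For example, `scan("x=y1+z")` produces ID, ASSIGN, ID, PLUS, ID. Whitespace, a leading digit, and end of input in state S have no table entry here, so they raise an error.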

TEST YOURSELF #1
Augment the "combined" finite-state machine to:
- Ignore whitespace between tokens (whitespace means spaces, tabs, and newlines).
- Give an error message if a character other than +, *, =, letter, or digit occurs in the input, or if a digit is seen as the first character in the current input (in both cases, ignore the bad character).
- Return an EOF token when there are no more tokens in the input.