Lecture 5 Scanning.

Slides:



Advertisements
Similar presentations
4b Lexical analysis Finite Automata
Advertisements

Regular Expressions and DFAs COP 3402 (Summer 2014)
Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.
1 The scanning process Main goal: recognize words/tokens Snapshot: At any point in time, the scanner has read some input and is on the way to identifying.
1 The scanning process Goal: automate the process Idea: –Start with an RE –Build a DFA How? –We can build a non-deterministic finite automaton (Thompson's.
Automating Construction of Lexers. Example in javacc TOKEN: { ( | | "_")* > | ( )* > | } SKIP: { " " | "\n" | "\t" } --> get automatically generated code.
CPSC 388 – Compiler Design and Construction Scanners – Finite State Automata.
Finite-State Machines with No Output
1 Chapter 3 Scanning – Theory and Practice. 2 Overview Formal notations for specifying the precise structure of tokens are necessary –Quoted string in.
Lecture # 3 Chapter #3: Lexical Analysis. Role of Lexical Analyzer It is the first phase of compiler Its main task is to read the input characters and.
Automating Construction of Lexers. Example in javacc TOKEN: { ( | | "_")* > | ( )* > | } SKIP: { " " | "\n" | "\t" } --> get automatically generated code.
4b 4b Lexical analysis Finite Automata. Finite Automata (FA) FA also called Finite State Machine (FSM) –Abstract model of a computing entity. –Decides.
COMP3190: Principle of Programming Languages DFA and its equivalent, scanner.
TRANSITION DIAGRAM BASED LEXICAL ANALYZER and FINITE AUTOMATA Class date : 12 August, 2013 Prepared by : Karimgailiu R Panmei Roll no. : 11CS10020 GROUP.
Lexical Analysis: Finite Automata CS 471 September 5, 2007.
UNIT - I Formal Language and Regular Expressions: Languages Definition regular expressions Regular sets identity rules. Finite Automata: DFA NFA NFA with.
Chapter 2 Scanning. Dr.Manal AbdulazizCS463 Ch22 The Scanning Process Lexical analysis or scanning has the task of reading the source program as a file.
using Deterministic Finite Automata & Nondeterministic Finite Automata
Overview of Previous Lesson(s) Over View  A token is a pair consisting of a token name and an optional attribute value.  A pattern is a description.
1 Compiler Construction (CS-636) Muhammad Bilal Bashir UIIT, Rawalpindi.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
LECTURE 5 Scanning. SYNTAX ANALYSIS We know from our previous lectures that the process of verifying the syntax of the program is performed in two stages:
Deterministic Finite Automata Nondeterministic Finite Automata.
CS412/413 Introduction to Compilers Radu Rugina Lecture 3: Finite Automata 25 Jan 02.
COMP3190: Principle of Programming Languages DFA and its equivalent, scanner.
June 13, 2016 Prof. Abdelaziz Khamis 1 Chapter 2 Scanning – Part 2.
Lecture 2 Compiler Design Lexical Analysis By lecturer Noor Dhia
Department of Software & Media Technology
WELCOME TO A JOURNEY TO CS419 Dr. Hussien Sharaf Dr. Mohammad Nassef Department of Computer Science, Faculty of Computers and Information, Cairo University.
More on scanning: NFAs and Flex
Intro to compilers Based on end of Ch. 1 and start of Ch. 2 of textbook, plus a few additional references.
Kleene’s Theorem and NFA
Finite automate.
Lecture 2 Lexical Analysis
Chapter 3 Lexical Analysis.
CS 404 Introduction to Compiler Design
Chapter 2 :: Programming Language Syntax
Lexical Analysis (Sections )
Finite-State Machines (FSMs)
Lexical analysis Finite Automata
Non Deterministic Automata
What Are They? Who Needs ‘em? An Example: Scoring in Tennis
Two issues in lexical analysis
Recognizer for a Language
Chapter 2 FINITE AUTOMATA.
Department of Software & Media Technology
Some slides by Elsa L Gunter, NJIT, and by Costas Busch
Lexical Analysis — Part II: Constructing a Scanner from Regular Expressions Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.
Non-Deterministic Finite Automata
Review: Compiler Phases:
Lexical Analysis — Part II: Constructing a Scanner from Regular Expressions Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.
Non-Deterministic Finite Automata
Non Deterministic Automata
CS 3304 Comparative Languages
What Are They? Who Needs ‘em? An Example: Scoring in Tennis
Converting NFAs to DFAs How Lex is constructed
Finite Automata.
4b Lexical analysis Finite Automata
Finite Automata & Language Theory
CS 3304 Comparative Languages
Lexical Analysis — Part II: Constructing a Scanner from Regular Expressions Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.
Subject Name: FORMAL LANGUAGES AND AUTOMATA THEORY
Compiler Construction
Chapter 2 :: Programming Language Syntax
4b Lexical analysis Finite Automata
Instructor: Aaron Roth
Chapter 2 :: Programming Language Syntax
Chapter 1 Regular Language
Compiler Construction
What is it? The term "Automata" is derived from the Greek word "αὐτόματα" which means "self-acting". An automaton (Automata in plural) is an abstract self-propelled.
Presentation transcript:

Lecture 5 Scanning

Syntax analysis We know from our previous lectures that the process of verifying the syntax of the program is performed in two stages: Scanning: Identifying and verifying tokens in a program. Parsing: Identifying and verifying the patterns of tokens in a program.

Lexical analysis In the last lecture, we discussed how valid tokens may be specified by regular expressions. In this lecture, we will discuss how we can go about identifying tokens that are specified by regular expressions.

Recognizing tokens A recognizer for a language is a program that takes a string x as input and answers “yes” if x is a sentence of the language and “no” otherwise. In the context of lexical analysis, given a string and a regular expression, a recognizer of the language specified by the regular expression answers “yes” if the string is in the language. How can we recognize a regular expression (int) ? What about (int | for)? We could, for example, write an ad hoc scanner that contained simple conditions to test, the ability to peek ahead at the next token, and loops for numerous characters of the same type.

Recognizing tokens For example, to recognize (int | for): if cur_char == ‘i’ : peak at next character if it is ‘n’: peak at next character if it is ‘t’: return int if cur_char == ‘f’: peak at next character if it is ‘o’: peak at next character if it is ‘r’: return for else error

Finite automata A set of regular expressions can be compiled into a recognizer automatically by constructing a finite automaton using scanner generator tools (lex, for example). The advantage of using an automatically generated scanner over an ad-hoc implementation is ease-of-modification. If there are ever any changes to my token definitions, I only need to update my regular expressions and regenerate my scanner. Where performance is a concern, hand-written scanners or optimized scanners may be necessary but for development scanner generators are most useful.

Finite automata A finite automaton is a simple idealized machine that is used to recognize patterns within some input. A finite automaton will accept or reject an input depending on whether the pattern defined by the finite automaton occurs in the input. The elements of a finite automaton, given a set of input characters, are A finite set of states (or nodes). A specially-denoted start state. A set of final (accepting) states. A set of labeled transitions (or arcs) from one state to another.

Finite Automata Here’s an example finite automaton that accepts any sequence of 1’s and 0’s which has an odd number of 1’s. s f 1 Starting state Final state Transitions labeled with input

Finite Automata We accept a string only if its characters provide a path from the starting state to some final state. s f 1 Starting state Final state Transitions labeled with input

Finite automata Finite automata come in two flavors. Deterministic Never any ambiguity. For any given state and any given input, only one possible transition. Non-deterministic There may be more than one transition from any given state for any given character. There may be epsilon transitions – transitions labeled by the empty string. There is no obvious algorithm for converting regular expressions to DFAs.

finite automata Typically scanner generators create DFAs from regular expressions in the following way: Create NFA equivalent to regular expression. Construct DFA equivalent to NFA. Minimize the number of states in the DFA.

Constructing NFAs Let’s say we have some regular expression that specifies the tokens we allow in our programming language. How do we turn this into an NFA? Well, there are a couple of basic building blocks we can use. Let’s say our regular expression is simply ‘a’, meaning we only accept one instance of the character ‘a’ as a token. a s f Similarly, we could have a two state NFA that accepts the empty string 𝜖.

Constructing NFAs Concatenation: ab Alternation: a|b a s f b f s 𝜖 a b

Constructing NFAs Kleene Closure: a* 𝜖 a s f 𝜖 𝜖 𝜖

Constructing NFAs There are three important properties of our NFAs that we should take note of: There are no transitions back into the initial state. There is a single final state. There are no transitions out of the final state. Because of these invariant properties, we can combine smaller NFAs to create larger NFAs. When translating a regular expression into an NFA, start with small components of the regular expression and build on top of them.

Constructing NFAs s f a 𝜖 b Let’s put this all together. Let’s say I have the following regular expression: ( ( a | b ) c )* which describes the set {𝜖, ac, bc, acac, acbc, bcac, bcbc, acacac….} Let’s start with ( a | b ). We already know the corresponding NFA is: f s 𝜖 a b

Constructing NFAs s f a c 𝜖 b Let’s put this all together. Let’s say I have the following regular expression: ( ( a | b ) c )* which describes the set {𝜖, ac, bc, acac, acbc, bcac, bcbc, acacac….} Concatenating ( a | b ) with c gives us: f s 𝜖 a b c

Constructing NFAs s f a c 𝜖 b Let’s put this all together. Let’s say I have the following regular expression: ( ( a | b ) c )* which describes the set {𝜖, ac, bc, acac, acbc, bcac, bcbc, acacac….} Adding in the Kleene closure gives us the NFA for ( ( a | b ) c )*: f s 𝜖 a b c

From NFAs to DFAs If we can easily convert our regular expressions into an NFA, then why do we need to further convert it to a DFA? Well, NFAs admit a number of transitions from a single state for the same input. For example, in the last NFA, we can move to a state that admits ‘a’ or a state that admits ‘b’ on an empty string. To implement an NFA, we’d need to explore all possible transitions to find the right path. Instead, we implement a DFA which, for a given input, moves to a state that represents the set of states we could arrive at in an equivalent NFA.

From NFAs to DFAs Let’s look at our NFA again. Notice we’ve added labels to identify individual states. 𝜖 a b c 1 2 3 4 5 6 7 8 9

From NFAs to dfas Before consuming any input, we could be in the state 1. But we could also be in the states 2, 3, 5 and 9 via epsilon transitions. A [1,2,3,5,9] Lets make a starting state called A which is the set of all possible starting states.

From NFAs to dfas a A [1,2,3,5,9] B [4,7] If we receive an input ‘a’, from A we could transition to 4 or 7 (via epsilon transitions). a A [1,2,3,5,9] B [4,7]

From NFAs to dfas a A [1,2,3,5,9] B [4,7] b C [6,7] If we receive an input ‘b’, from A we could transition to 6 or 7 (via epsilon transitions). a A [1,2,3,5,9] B [4,7] b C [6,7]

From NFAs to dfas a A [1,2,3,5,9] B [4,7] b c C [6,7] D [2,3,5,8,9] c If we receive an input ‘c’, we cannot go anywhere from A, but both B and C can go to 2,3,5,8,9. a A [1,2,3,5,9] B [4,7] b c C [6,7] D [2,3,5,8,9] c

From NFAs to dfas A [1,2,3,5,9] B [4,7] a C [6,7] b D [2,3,5,8,9] c From D, we can go to B on an input of ‘a’ or go to C on an input of ‘b’. A [1,2,3,5,9] B [4,7] a C [6,7] b D [2,3,5,8,9] c

From NFAs to dfas A [1,2,3,5,9] B [4,7] a C [6,7] b D [2,3,5,8,9] c Finally, we’ve graphed all possible scenarios. Let’s circle the final states. A [1,2,3,5,9] B [4,7] a C [6,7] b D [2,3,5,8,9] c

From NFAs to dfas A B a C b D c Here’s our final DFA. For reference, our regular expression was ( ( a | b ) c )* which describes the set {𝜖, ac, bc, acac, acbc, bcac, bcbc, acacac….}. A B a C b D c

From NFAs to dfas A B a C b D c It is possible that the resulting DFA can be minimized further. Start by combining all non-final states together and all final states together. A B a C b D c

From NFAs to dfas A B C D c a b Now combine all of the non-final states together. A B C D c a b

From NFAs to dfas A B C D c a b Now combine all of the non-final states together. A B C D c a b

From NFAs to dfas A B C D c a b Note that AD is our starting state as well as an accepting state. A B C D c a b

From NFAs to DFAs A B C digit . Here’s another simple example of a DFA that can be minimized: A B C digit .

From NFAs to dfas A B,C digit . This is a simplistic example, but we can minimize this DFA in the following way: A B,C digit . Be careful that your minimization doesn’t introduce ambiguity! Split states as needed until none remain.

From DFAs to scanners Now that we have our DFA, we can actually create a scanner. In the next lecture, we’ll talk about creating a scanner that models the structure of the DFA we created.