Lexical Analysis III: Recognizing Tokens. Lecture 4, CS 4318/5331, Apan Qasem, Texas State University, Spring 2015.



Announcements: Assignment 1 is due this Friday at 11:59 PM. Test instances are on GitHub. No lecture at RRC this week.

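Lexical Analysis
    int main() { int i for (i = 0; i < MAX; i++) printf("Hello World"); }
What does the scanner do if it encounters a missing semicolon? Nothing! The scanner only groups characters into tokens; a missing semicolon is a syntax error, which the parser reports.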

Lexical Analysis
    int main() { int i; for (i = 0; i < MAX; i++) abcprintf("Hello World"); }
What does the scanner do if it encounters an undefined function name? Nothing! To the scanner, abcprintf is just another identifier.


Lexical Analysis
    intmain(){inti;for(i=0;i<MAX;i++)printf("Hello World");}
Legal C program? No. Passes the scanner? Yes.


Lexical Analysis
    int main() { int %$*&i; for (i = 0; i < MAX; i++) printf("Hello World"); }
Which C programs are illegal at the scanner phase? Very few! C/C++ has become too large!

Breaking Down Lexical Analysis Further
1. Specify patterns for tokens.
   - Look at the language description and identify the types of tokens needed for the language (usually trivial).
   - Use regular expressions to specify a pattern for each token (patterns for some tokens are trivial).
2. Recognize patterns in the input stream and generate tokens for the parser.

Recognizing Tokens: We can specify the regular expression while for the while keyword in C. How do we recognize it when we see it in the input stream? Essentially, we need a pattern-matching algorithm.

Code for Recognizing while
    if (nextchar() == 'w')
      if (nextchar() == 'h')
        if (nextchar() == 'i')
          if (nextchar() == 'l')
            if (nextchar() == 'e')
              return KEYWORD_WHILE;
            else // do something
          else // do something
        else // do something
      else // do something
    else // do something
This approach works for more complex REs as well, e.g. while (nextchar() == 'a' || …).
- Need to decide what to do for strings like when.
- Need to account for strings like whileabc.
- Need to account for strings like abcwhile.
Can we generate this code automatically?
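One common way to handle strings like when, whileabc, and abcwhile is to scan the longest run of identifier characters first and only then check whether that whole word is the keyword. The sketch below assumes this longest-match policy; the names classify_word and word_kind are hypothetical and are not part of the lecture's code.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for this sketch. */
enum word_kind { WORD_NONE, WORD_IDENTIFIER, WORD_KEYWORD_WHILE };

/*
 * Scan the longest run of letters, digits, and underscores at the start
 * of `input`, store its length in `*len`, and classify it.
 */
enum word_kind classify_word(const char *input, size_t *len)
{
    size_t n = 0;

    /* A word must start with a letter or an underscore. */
    if (!isalpha((unsigned char)input[0]) && input[0] != '_') {
        *len = 0;
        return WORD_NONE;
    }

    /* Longest match: keep consuming while the character can extend the word. */
    while (isalnum((unsigned char)input[n]) || input[n] == '_')
        n++;
    *len = n;

    /* Only an exact five-character match is the keyword. */
    if (n == 5 && strncmp(input, "while", 5) == 0)
        return WORD_KEYWORD_WHILE;
    return WORD_IDENTIFIER;
}
```

With this policy, when, whileabc, and abcwhile are all classified as identifiers, and only a standalone while is reported as the keyword.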

Code for Recognizing while (continued)
In the nested-if code for while:
- Each if clause represents a state.
- The state is determined solely by what we have seen so far in the input stream.
- There is no need to go back and rescan input.
- At each state we decide which state to move to next based on the next input symbol.
This is exactly the idea behind (deterministic) finite state machines.
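The same recognizer can be written with an explicit state variable instead of nested if statements. This is a minimal sketch, assuming the whole input string must be exactly while; the function name accepts_while is made up for illustration.

```c
#include <stdbool.h>

/* Returns true only if `s` is exactly the string "while". */
bool accepts_while(const char *s)
{
    static const char pattern[] = "while";
    int state = 0;              /* number of pattern characters matched so far */

    for (; *s != '\0'; s++) {
        if (state < 5 && *s == pattern[state])
            state++;            /* move to the next state */
        else
            return false;       /* implicit error state: stop and reject */
    }
    return state == 5;          /* state 5 is the only final state */
}
```

The state number alone records everything the recognizer needs to know about the input seen so far, so no character is ever rescanned.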

Recognizing Tokens: General idea
1. Consume a character from the input stream.
2. Based on the value of the character, move to a new state.
3. If the character just consumed
   - produces a valid token and there are no more characters to consume, then DONE;
   - leads to a valid token, move to a valid state;
   - produces an invalid token, go to the error state and finish.
4. Repeat.
The steps above recognize one token.

Recognizing Tokens: We need to construct a recognizer based on regular expressions. A recognizer for a regular expression is a machine that recognizes the language described by the RE. Given an input string constructed from the alphabet, the recognizer will
- say "yes" if the string is in the language (ACCEPT);
- say "no" if the string is not in the language (REJECT).
Implications:
- It must produce a yes or no answer on every input.
- It cannot say yes when the string is not in the language (no false positives).

RE and DFA: For every RE there is a recognizer that recognizes the corresponding regular language (RL). If you build it … it will be recognizable! The recognizers are called deterministic finite automata (DFAs). This is Kleene's Theorem (1956).

Deterministic Finite Automata: A formal mathematical construct. DFAs are abstract state machines that can recognize regular languages. A DFA is a set of states with a transition defined in every state on each input symbol. The formal definition is in the text (Section 2.2.1). It is convenient to reason about DFAs using state transition diagrams.

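DFA Diagram (recognizer for int): s0 --i--> s1 --n--> s2 --t--> s3, plus an error state E. Diagram labels: s0 is the initial state, s3 is the final state, E is the error state, and each edge is labeled with an input symbol. Error states are sometimes left implicit. A DFA has only one initial state but can have multiple final states.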

Acceptance Criteria for DFAs: A DFA accepts a string if and only if the DFA ends up in a final state after consuming all input symbols. Implications:
- A DFA built to recognize int will reject intmain.
- A DFA built to recognize intmain will reject int.
There is an easy fix if we want the machine to recognize int AND intmain.
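The acceptance rule can be sketched directly in C. The example below assumes the four-state DFA for int with an explicit error state; the identifiers S0 to S3, SE, int_delta, and dfa_accepts_int are illustrative names, not taken from the lecture.

```c
#include <stdbool.h>

enum { S0, S1, S2, S3, SE };    /* SE plays the role of the error state E */

/* Transition function: every (state, symbol) pair has exactly one target. */
static int int_delta(int state, char c)
{
    switch (state) {
    case S0: return c == 'i' ? S1 : SE;
    case S1: return c == 'n' ? S2 : SE;
    case S2: return c == 't' ? S3 : SE;
    default: return SE;         /* no transitions lead out of S3 or SE */
    }
}

/* Accept if and only if the DFA is in the final state after all input. */
bool dfa_accepts_int(const char *s)
{
    int state = S0;
    for (; *s != '\0'; s++)
        state = int_delta(state, *s);
    return state == S3;
}
```

Because acceptance is checked only after the whole input is consumed, this machine rejects intmain: the m moves it from the final state into the error state.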

DFA Example: if. s0 --i--> s1 --f--> s2

DFA Example: int | if. s0 --i--> s1; s1 --f--> s3; s1 --n--> s2 --t--> s4
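A DFA like this one maps naturally onto a transition table. The sketch below assumes the state numbering s0 --i--> s1, s1 --f--> s3, s1 --n--> s2, s2 --t--> s4, with s3 and s4 final and an explicit error state; the table layout and the names kw_table, symbol_class, and match_keyword are hypothetical.

```c
enum { Q0, Q1, Q2, Q3, Q4, QE, Q_COUNT };       /* QE is the error state */
enum { C_I, C_F, C_N, C_T, C_OTHER, C_COUNT };  /* input symbol classes */
enum { KW_NONE, KW_IF, KW_INT };                /* result of a match */

/* Map a character to its column in the transition table. */
static int symbol_class(char c)
{
    switch (c) {
    case 'i': return C_I;
    case 'f': return C_F;
    case 'n': return C_N;
    case 't': return C_T;
    default:  return C_OTHER;
    }
}

/* One row per state, one column per symbol class. */
static const int kw_table[Q_COUNT][C_COUNT] = {
    /*          i   f   n   t  other */
    /* Q0 */ { Q1, QE, QE, QE, QE },
    /* Q1 */ { QE, Q3, Q2, QE, QE },
    /* Q2 */ { QE, QE, QE, Q4, QE },
    /* Q3 */ { QE, QE, QE, QE, QE },
    /* Q4 */ { QE, QE, QE, QE, QE },
    /* QE */ { QE, QE, QE, QE, QE },
};

/* Run the DFA over the whole string and report which keyword it is. */
int match_keyword(const char *s)
{
    int state = Q0;
    for (; *s != '\0'; s++)
        state = kw_table[state][symbol_class(*s)];
    if (state == Q3) return KW_IF;
    if (state == Q4) return KW_INT;
    return KW_NONE;
}
```

The driver loop never changes; recognizing a different set of keywords only requires a different table, which is what makes automatic scanner generation practical.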

DFA for if | int? s0 --i--> s1 --f--> s2 and s0 --i--> s3 --n--> s4 --t--> s5. Two transitions on i leave s0, so this machine is not a DFA: non-determinism.

DFA Example: Integers
Σ = {0-9}
Digit : 0|1|2|3|… |9
Integer : 0 | (1|2|3|… |9)(Digit)*
[State diagram not recoverable from the transcript; surviving state labels: s0, s2, E]
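Since the diagram for this slide did not survive, the following is only a sketch of one DFA for the RE 0 | (1|2|…|9)(Digit)*. The state names START, ZERO, NONZERO, and ERROR are assumptions and do not correspond to the slide's own labels.

```c
#include <stdbool.h>

/* Accepts "0" or a nonzero digit followed by any number of digits. */
bool accepts_integer(const char *s)
{
    enum { START, ZERO, NONZERO, ERROR } state = START;

    for (; *s != '\0'; s++) {
        char c = *s;
        switch (state) {
        case START:
            if (c == '0')
                state = ZERO;           /* the single-digit integer 0 */
            else if (c >= '1' && c <= '9')
                state = NONZERO;        /* leading nonzero digit */
            else
                state = ERROR;
            break;
        case NONZERO:
            /* Any digit keeps us in the same final state. */
            state = (c >= '0' && c <= '9') ? NONZERO : ERROR;
            break;
        default:
            /* Nothing may follow a lone 0, and ERROR is a trap state. */
            state = ERROR;
            break;
        }
    }
    return state == ZERO || state == NONZERO;
}
```

This machine has two final states, and it rejects strings with leading zeros such as 007 because the RE allows nothing after a leading 0.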

REs and DFAs: Every RL has a DFA that recognizes it, and every DFA has a corresponding RL. There are algorithms that allow us to convert an RE to a DFA and vice versa, so we can automate scanning! To convert REs to DFAs we first need to look at non-deterministic finite automata (NFAs).

Non-determinism: DFAs do not allow non-determinism. A DFA
- must have a transition defined in every state on every possible input symbol;
- cannot move to a new state without consuming an input symbol;
- cannot have multiple transitions on the same input symbol.

NFA: NFAs are DFAs with ε-transitions added (and with multiple transitions on the same symbol allowed). To run an NFA, start at the initial state and guess the right transition at each step. Always guess correctly. If some sequence of correct guesses leads to a final state, then accept. Sounds dubious, but it works!

NFA for if | int: s0 --i--> s1 --f--> s2 and s0 --i--> s3 --n--> s4 --t--> s5. This is an NFA: there are multiple transitions on i in state s0.
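"Always guess correctly" can be simulated on a real machine by following every possible guess at once, that is, by tracking the set of states the NFA could be in. The sketch below assumes the NFA s0 --i--> s1 --f--> s2 and s0 --i--> s3 --n--> s4 --t--> s5 with final states s2 and s5, and stores each state set as a bitmask; the names nfa_step and nfa_accepts_if_or_int are illustrative.

```c
#include <stdbool.h>

/* Set of NFA states reachable from `state` on symbol `c`, as a bitmask. */
static unsigned nfa_step(int state, char c)
{
    switch (state) {
    case 0: return c == 'i' ? (1u << 1) | (1u << 3) : 0u;  /* two choices on i */
    case 1: return c == 'f' ? (1u << 2) : 0u;
    case 3: return c == 'n' ? (1u << 4) : 0u;
    case 4: return c == 't' ? (1u << 5) : 0u;
    default: return 0u;                                    /* s2, s5: no moves */
    }
}

/* Accept if any sequence of choices ends in a final state. */
bool nfa_accepts_if_or_int(const char *s)
{
    const unsigned finals = (1u << 2) | (1u << 5);
    unsigned current = 1u << 0;         /* start in {s0} */

    for (; *s != '\0'; s++) {
        unsigned next = 0u;
        for (int q = 0; q < 6; q++)
            if (current & (1u << q))
                next |= nfa_step(q, *s);
        current = next;                 /* an empty set means every guess failed */
    }
    return (current & finals) != 0;
}
```

After reading i the simulation is in the set {s1, s3}, which covers both guesses; the next symbol decides which of them survives.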

NFA and DFA: Although NFAs allow non-determinism, it has been shown that NFAs and DFAs are equivalent! (Rabin and Scott, 1959)
- DFAs are just specialized forms of NFAs.
- NFAs and DFAs recognize the same set of languages.
- We can simulate a DFA with an NFA.
- We can construct a corresponding DFA for any NFA.
Implications:
- For every RE there is also an NFA.
- It is relatively easy to construct an NFA from an RE.

RE to NFA: Empty String. 1. ε is a regular expression that denotes {ε}, the set that contains the empty string. NFA: s0 --ε--> s1

RE to NFA: Symbol. 2. For each a in Σ, a is a regular expression denoting {a}, the set containing the string a. NFA: s0 --a--> s1

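RE to NFA: Union. 3. r | s is an RE denoting L(r) ∪ L(s). E.g., RE = a | b, L(RE) = {a, b}. [Diagram: the NFAs s0 --a--> s1 and s0 --b--> s1 are combined. A new start state s0 has ε-transitions to s1 and s2; s1 --a--> s3 and s2 --b--> s4; s3 and s4 have ε-transitions to a new final state s5.]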

RE to NFA: Concatenation. 4. rs is an RE denoting L(r)L(s). E.g., RE = ab, L(RE) = {ab}. [Diagram: the NFAs s0 --a--> s1 and s0 --b--> s1 are joined by an ε-transition: s0 --a--> s1 --ε--> s2 --b--> s3.]

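RE to NFA: Closure. 5. r* is an RE denoting L(r)*. E.g., RE = a*, L(RE) = {ε, a, aa, aaa, aaaa, …}. [Diagram: the NFA s0 --a--> s1 is wrapped as s0 --ε--> s1 --a--> s2 --ε--> s3, with an ε-transition from s2 back to s1 and an ε-transition from s0 to s3.]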

RE to NFA: The algorithm for converting REs to NFAs is known as Thompson's construction. It is a repeated application of the five conversion rules! It is named after Ken Thompson (1968).

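Example: NFA for a(b|c)*. Work inside the parentheses first: b | c. [Diagram: the fragments s0 --b--> s1 and s0 --c--> s1, with a new start state s0 and a new final state s5 added.]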

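Example: NFA for a(b|c)*. Work inside the parentheses first: b | c. Adjust final states and rename states. [Diagram: s0 has ε-transitions to s1 and s2; s1 --b--> s3 and s2 --c--> s4; s3 and s4 have ε-transitions to s5.]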

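Example: NFA for a(b|c)*. Step 3: * (closure), (b | c)*. [Diagram: the NFA for b | c (s1 --b--> s3, s2 --c--> s4, start s0, final s5) wrapped in a new start state and a new final state, shown before renaming.]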

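Example: NFA for a(b|c)*. Step 3: * (closure), (b | c)*. [Diagram after renaming, states s0 to s7: s2 --b--> s4 and s3 --c--> s5; s1 and s6 are the start and final states of the b | c fragment; s0 is the new start state and s7 the new final state.]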

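Example: NFA for a(b|c)*. Step 4: concatenation. [Diagram, states s0 to s9: s0 --a--> s1 --ε--> s2 --ε--> s3; s3 --ε--> s4 --b--> s5 --ε--> s8; s3 --ε--> s6 --c--> s7 --ε--> s8; s8 --ε--> s3 (repeat); s8 --ε--> s9; s2 --ε--> s9 (skip); s9 is the final state.]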
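To check that the finished NFA really recognizes a(b|c)*, it can be simulated with ε-closures. The sketch below hard-codes one plausible reading of the ten-state diagram (s0 --a--> s1, s4 --b--> s5, s6 --c--> s7, final state s9, and the ε-edges listed in the table); that edge list is an assumption based on the standard Thompson construction, and all function names are hypothetical.

```c
#include <stdbool.h>

#define N_STATES 10

/* eps[q] is the bitmask of states reachable from q by one epsilon edge. */
static const unsigned eps[N_STATES] = {
    [1] = 1u << 2,
    [2] = (1u << 3) | (1u << 9),    /* enter the loop or skip it */
    [3] = (1u << 4) | (1u << 6),    /* choose the b branch or the c branch */
    [5] = 1u << 8,
    [7] = 1u << 8,
    [8] = (1u << 3) | (1u << 9),    /* repeat the loop or leave it */
};

/* Add every state reachable through epsilon edges until nothing changes. */
static unsigned eps_closure(unsigned set)
{
    unsigned prev;
    do {
        prev = set;
        for (int q = 0; q < N_STATES; q++)
            if (set & (1u << q))
                set |= eps[q];
    } while (set != prev);
    return set;
}

/* States reachable from `set` by consuming the single symbol `c`. */
static unsigned move_on(unsigned set, char c)
{
    unsigned next = 0u;
    if ((set & (1u << 0)) && c == 'a') next |= 1u << 1;
    if ((set & (1u << 4)) && c == 'b') next |= 1u << 5;
    if ((set & (1u << 6)) && c == 'c') next |= 1u << 7;
    return next;
}

bool nfa_accepts_a_bc_star(const char *s)
{
    unsigned current = eps_closure(1u << 0);
    for (; *s != '\0'; s++)
        current = eps_closure(move_on(current, *s));
    return (current & (1u << 9)) != 0;  /* s9 is the only final state */
}
```

Each state set computed here corresponds to one state of the DFA that the subset construction would produce from this NFA.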

Cycle of Construction: RE --(Thompson's construction)--> NFA --(subset construction)--> DFA --(Hopcroft's algorithm)--> minimized DFA --> code