RegExps & DFAs CS 536.

Slides:



Advertisements
Similar presentations
COS 320 Compilers David Walker. Outline Last Week –Introduction to ML Today: –Lexical Analysis –Reading: Chapter 2 of Appel.
Advertisements

Compiler construction in4020 – lecture 2 Koen Langendoen Delft University of Technology The Netherlands.
4b Lexical analysis Finite Automata
Compiler Baojian Hua Lexical Analysis (II) Compiler Baojian Hua
Nondeterministic Finite Automata CS 130: Theory of Computation HMU textbook, Chapter 2 (Sec 2.3 & 2.5)
Lex -- a Lexical Analyzer Generator (by M.E. Lesk and Eric. Schmidt) –Given tokens specified as regular expressions, Lex automatically generates a routine.
Lexical Analysis Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!
176 Formal Languages and Applications: We know that Pascal programming language is defined in terms of a CFG. All the other programming languages are context-free.
1 CMPSC 160 Translation of Programming Languages Fall 2002 slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon Lecture-Module #4 Lexical.
Tools for building compilers Clara Benac Earle. Tools to help building a compiler C –Lexical Analyzer generators: Lex, flex, –Syntax Analyzer generator:
1 Foundations of Software Design Lecture 23: Finite Automata and Context-Free Grammars Marti Hearst Fall 2002.
COS 320 Compilers David Walker. Outline Last Week –Introduction to ML Today: –Lexical Analysis –Reading: Chapter 2 of Appel.
Automata and Regular Expression Discrete Mathematics and Its Applications Baojian Hua
1 Scanning Aaron Bloomfield CS 415 Fall Parsing & Scanning In real compilers the recognizer is split into two phases –Scanner: translate input.
CPSC 388 – Compiler Design and Construction Scanners – Finite State Automata.
Nondeterministic Finite Automata CS 130: Theory of Computation HMU textbook, Chapter 2 (Sec 2.3 & 2.5)
1 Outline Informal sketch of lexical analysis –Identifies tokens in input string Issues in lexical analysis –Lookahead –Ambiguities Specifying lexers –Regular.
Lexical Analysis - An Introduction. The Front End The purpose of the front end is to deal with the input language Perform a membership test: code  source.
Lexical Analysis - An Introduction Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at.
Automating Construction of Lexers. Example in javacc TOKEN: { ( | | "_")* > | ( )* > | } SKIP: { " " | "\n" | "\t" } --> get automatically generated code.
CS412/413 Introduction to Compilers Radu Rugina Lecture 4: Lexical Analyzers 28 Jan 02.
TRANSITION DIAGRAM BASED LEXICAL ANALYZER and FINITE AUTOMATA Class date : 12 August, 2013 Prepared by : Karimgailiu R Panmei Roll no. : 11CS10020 GROUP.
CS 536 Fall Scanner Construction  Given a single string, automata and regular expressions retuned a Boolean answer: a given string is/is not in.
Lexical Analysis: Finite Automata CS 471 September 5, 2007.
1 Course Overview PART I: overview material 1Introduction 2Language processors (tombstone diagrams, bootstrapping) 3Architecture of a compiler PART II:
Introduction to Lex Ying-Hung Jiang
Compiler Construction Sohail Aslam Lecture 9. 2 DFA Minimization  The generated DFA may have a large number of states.  Hopcroft’s algorithm: minimizes.
Lexical Analysis – Part II EECS 483 – Lecture 3 University of Michigan Wednesday, September 13, 2006.
Exercise Solution for Exercise (a) {1,2} {3,4} a b {6} a {5,6,1} {6,2} {4} {3} {5,6} { } b a b a a b b a a b a,b b b a.
CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis.
Prof. Necula CS 164 Lecture 31 Lexical Analysis Lecture 3-4.
Chapter 2 Scanning. Dr.Manal AbdulazizCS463 Ch22 The Scanning Process Lexical analysis or scanning has the task of reading the source program as a file.
More yacc. What is yacc – Tool to produce a parser given a grammar – YACC (Yet Another Compiler Compiler) is a program designed to compile a LALR(1) grammar.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
Deterministic Finite Automata Nondeterministic Finite Automata.
Department of Software & Media Technology
WELCOME TO A JOURNEY TO CS419 Dr. Hussien Sharaf Dr. Mohammad Nassef Department of Computer Science, Faculty of Computers and Information, Cairo University.
CS 3304 Comparative Languages
More on scanning: NFAs and Flex
Lecture 2 Lexical Analysis
NFAs, scanners, and flex.
Tutorial On Lex & Yacc.
CS 326 Programming Languages, Concepts and Implementation
Chapter 2 :: Programming Language Syntax
Chapter 2 Scanning – Part 1 June 10, 2018 Prof. Abdelaziz Khamis.
Lexical Analysis (Sections )
Finite-State Machines (FSMs)
Lexical analysis Finite Automata
Compilers Welcome to a journey to CS419 Lecture5: Lexical Analysis:
Nondeterministic Finite Automata (NFA)
Finite-State Machines (FSMs)
Two issues in lexical analysis
Recognizer for a Language
Department of Software & Media Technology
Chapter 3: Lexical Analysis
Building a Scanner CS164, Fall 2004 Ras Bodik, CS 164, Fall 2004.
Lecture 5: Lexical Analysis III: The final bits
Lexical Analysis Lecture 3-4 Prof. Necula CS 164 Lecture 3.
CS 3304 Comparative Languages
Lecture 4: Lexical Analysis & Chomsky Hierarchy
4b Lexical analysis Finite Automata
CS 3304 Comparative Languages
Chapter 2 :: Programming Language Syntax
4b Lexical analysis Finite Automata
Yacc Yacc.
Chapter 2 :: Programming Language Syntax
Lecture 5 Scanning.
Announcements - P1 part 1 due Today - P1 part 2 due on Friday Feb 1st
NFAs, scanners, and flex.
Faculty of Computer Science and Information System
Presentation transcript:

RegExps & DFAs CS 536

Pre-class warm up Write the regexp for Fortran real literals An optional sign (‘+’ or ‘-‘) An integer or: 1 or more digits followed by a ‘.’ followed by 0 or more digits or: A ‘.’ followed by one or more digits (‘+’|’-’|ε)(digit+(‘.’|ε))| (digit*’.’digit+))

Last time Explored NFAs Introduce regular languages / expressions for every NFA there is an equivalent DFA epsilon edges add no expressive power Introduce regular languages / expressions

Today Convert regexps to DFAS From language recognizers to tokenizers

Regexp to NFAs Literals/epsilon correspond to simple DFAs Operators correspond to methods of joining DFAs x^n, where n is even or divisible by 3

Regexp to NFA rules Rules for operands

Regexp to NFA rules Rules for alternation A|B q’,ε → qA q’,ε → qB Make new start state q’ and new final state f’ Make original final states non-final Add to δ: q’,ε → qA q’,ε → qB Fa,ε→f’ Fb,ε→f’

Regexp to NFA rules Rule for catenation A.B Make new start state q’ and new final state f’ Make original final states non-final Add to δ: q’,ε → qA fA,ε → qB fb,ε→f’

Regexp to NFA rules Rule for iteration A* Make new start state q’ and new final state f’ Make original final states non-final Add to δ: q’,ε → qA q’,ε→f’ f’,ε→qA

Regexp operator precedence Analogous math operator | low addition . medium multiplication * high exponentiation

Tree representation of a regexp Operator Precedence | low . medium * high

Bottom-up conversion

Bottom-up conversion

Bottom-up conversion

Bottom-up conversion

Bottom-up conversion

Bottom-up conversion

Bottom-up conversion

Bottom-up conversion

Bottom-up conversion

But what’s so great about DFAs? Regexp to DFAs We now have an NFA We need to go to DFA But what’s so great about DFAs?

Table-driven DFAs Recall that δ can be expressed as a table This leads to a very efficient array representation s = start state while (more input){ c = read char s = table[s][c] } if s is final, accept

FSMs for tokenization FSMs only check for language membership of a string the scanner needs to recognize a stream of many different tokens using the longest match the scanner needs to know what was matched Idea: imbue states with actions that will fire when state is reached

BAD: maybe we needed that A first cut at actions Consider the language of Pascal identifiers BAD: not longest match Accounting for longest matches BAD: maybe we needed that character

A second take at actions Give our FSMs ability to put chars back Since we’re allowing our FSM to peek at characters past the end of a valid token, it’s also convenient to add an EOF symbol

Our first scanner Consider a language with two statements assignments: ID = expr increments: ID += expr where expr is of the form ID + ID ID ^ ID ID < ID ID <= ID Identifiers ID follow C conventions

Combined DFA

do{ read char perform action / update state if (action was to return a token){ start again in start state } } (while not EOF or stuck);

Lexical analyzer generators aka scanner generators The transformation from regexp to scanner is formally defined Can write tools to synthesize a lexer automatically Lex: unix scanner generator Flex: fast lex JLex: Java version of Lex

JLex Declarative specification tell it what you want scanned, it will figure out the rest Input: set of regexps + associated actions xyz.jlex file Output: Java source code for a scanner xyz.jlex.java source code of scanner

jlex format 3 sections separated by %% user code section directives regular expressions + actions

Rules section Format is <regex>{code} where regex is a regular expression for a single token can use macros from the directive sections in regex, surround with curly braces Conventions chars represent themselves (except special characters) chars inside “” represent themselves (except \) Regexp operators | * + ? () . Character class operators - range ^ not \ escape

`