12. Automata and Regular Expressions

Discrete Maths 242/240-213, Semester 2, 2017-2018

Recognizing input using:
- automata: a graph-based technique
- regular expressions: an algebraic technique equivalent to automata

Overview
1. Introduction to Automata
2. Representing Automata
3. The ‘aeiou’ Automaton
4. Generating Output
5. Deterministic and Nondeterministic Automata
6. Regular Expressions
7. UNIX Regular Expressions
8. From REs to Automata
9. More Information

1. Introduction to Automata
A finite state automaton represents a problem as a series of states and transitions between the states:
- the automaton starts in an initial state
- input causes a transition from the current state to another
- a state may be accepting: the automaton can terminate successfully when it enters an accepting state (if it wants to)

1.1. The ‘even-odd’ Automaton
[Figure: state diagram with start state evenA]
  evenA -b-> evenA
  evenA -a-> oddA
  oddA  -b-> oddA
  oddA  -a-> evenA
The states are the ovals. The transitions are the arrows, labelled with the input that ‘triggers’ them. The ‘oddA’ state is accepting.

Execution Sequence
Input: b a b a a

  Input read   Move to state
  (none)       evenA    initial state
  b            evenA
  a            oddA     the automaton could choose to terminate here
  b            oddA
  a            evenA
  a            oddA     stops since no more input

1.2. The Light Switch Automaton
[Figure: state diagram with start state off]
  off -press-> on
  on  -press-> off

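1.3. Simplified TCP Automaton
[Figure: state diagram of a simplified TCP automaton]

1.4. Game Playing Automaton
[Figure: state diagram of a game playing automaton]

1.5. Wall Bouncing Robot
[Figure: state diagram of a wall bouncing robot]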

1.6. Why are Automata Useful?
Automata are a very good way of modeling finite-state systems which change state due to input.
Examples:
- text editors, compilers, UNIX tools like grep
- communications protocols (e.g. TCP)
- game states
- digital hardware components, e.g. adders, RAM
- robots
These are very different applications.

2. Representing Automata
Automata have a mathematical basis which allows them to be analysed, e.g.:
- prove that they accept correct input
- prove that they do not accept incorrect input
Automata can be manipulated to simplify them, and they can be automatically converted into code.

2.1. A Mathematical Coding
We can represent an automaton in terms of sets and mathematical functions. The ‘even-odd’ automaton is:
  startSet = { evenA }
  acceptSet = { oddA }
  nextState(evenA, b) => evenA
  nextState(evenA, a) => oddA
  nextState(oddA, b) => oddA
  nextState(oddA, a) => evenA

Analysis of the mathematical form can show that the ‘even-odd’ automaton only accepts strings which contain an odd number of ‘a’s, e.g. babaa, abb, abaab, aabba, aaaaba, …

2.2. Automaton in Code
It is easy to (automatically) translate an automaton into code, but an automaton graph does not contain all the details needed for a program.
The main extra coding issues:
- what to do when we enter an accepting state?
- what to do when the input cannot be processed? e.g. abzz is entered

Encoding the ‘even-odd’ Automaton

    #include <stdio.h>
    #include <stdlib.h>

    enum state {evenA, oddA};           // possible states

    enum state nextState(enum state s, int ch);
    int acceptable(enum state s);

    int main(void)
    {
        enum state currState = evenA;   // start state
        int isAccepting = 0;            // false
        int ch;

        while ((ch = getchar()) != EOF) {
            currState = nextState(currState, ch);
            isAccepting = acceptable(currState);
        }
        // accepting state only used at end of input
        if (isAccepting)
            printf("accepted\n");
        else
            printf("not accepted\n");
        return 0;
    }

    enum state nextState(enum state s, int ch)
    {
        if ((s == evenA) && (ch == 'b')) return evenA;
        if ((s == evenA) && (ch == 'a')) return oddA;
        if ((s == oddA) && (ch == 'b')) return oddA;
        if ((s == oddA) && (ch == 'a')) return evenA;
        // simple handling of incorrect input: any other character
        // (including a newline) stops the program
        printf("Illegal Input\n");
        exit(1);
    }

    int acceptable(enum state s)
    {
        if (s == oddA) return 1;   // oddA is an accepting state
        return 0;
    }

3. The ‘aeiou’ Automaton
What English words contain the five vowels (a, e, i, o, u) in order?
Some words that match:
- abstemious
- facetious
- sacrilegious

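3.1. Automaton Graph
L = all letters
[Figure: state diagram with states 0 to 5, start state 0, accepting state 5]
  0 -a-> 1 -e-> 2 -i-> 3 -o-> 4 -u-> 5
  loops: 0 on L - a, 1 on L - e, 2 on L - i, 3 on L - o, 4 on L - u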

3.2. Execution Sequence (1)
Input: f a c e t i o u s

  Input read   Move to state
  (none)       0
  f            0
  a            1
  c            1
  e            2
  t            2
  i            3
  o            4
  u            5    the automaton can terminate here; no need to process more input

Execution Sequence (2)
Input: a n d r e w

  Input read   Move to state
  (none)       0
  a            1
  n            1
  d            1
  r            1
  e            2
  w            2    end of input means failure

3.3. Translation to Code

    #include <stdio.h>
    #include <stdlib.h>

    // possible states; the enumerators have the values 0 to 5
    enum state {S0, S1, S2, S3, S4, S5};

    enum state nextState(enum state s, int ch);
    int acceptable(enum state s);

    int main(void)
    {
        enum state currState = 0;   // start state
        int isAccepting = 0;        // false
        int ch;

        // stop processing when the accepting state is entered
        while (((ch = getchar()) != EOF) && !isAccepting) {
            currState = nextState(currState, ch);
            isAccepting = acceptable(currState);
        }
        if (isAccepting)
            printf("accepted\n");
        else
            printf("not accepted\n");
        return 0;
    }

    enum state nextState(enum state s, int ch)
    {
        if (s == 0) {
            if (ch == 'a') return 1;
            else return 0;   // input is L-a
        }
        if (s == 1) {
            if (ch == 'e') return 2;
            else return 1;   // input is L-e
        }
        if (s == 2) {
            if (ch == 'i') return 3;
            else return 2;   // input is L-i
        }
        if (s == 3) {
            if (ch == 'o') return 4;
            else return 3;   // input is L-o
        }
        if (s == 4) {
            if (ch == 'u') return 5;
            else return 4;   // input is L-u
        }
        // simple handling of incorrect input
        printf("Illegal Input\n");
        exit(1);
    }  // end of nextState()

    int acceptable(enum state s)
    {
        if (s == 5) return 1;   // 5 is an accepting state
        return 0;
    }

4. Generating Output
One possible extension to the basic automaton idea is to allow output: when a transition is ‘triggered’ there can be optional output as well.
Automata which generate output are sometimes called Finite State Machines (FSMs).

4.1. ‘even-odd’ with Output
[Figure: the ‘even-odd’ state diagram, with the transition from evenA to oddA labelled a/1]
When the ‘a’ transition is triggered out of the evenA state, then a ‘1’ is output.

4.2. Mathematical Coding Add an ‘output’ mathematical function to the automaton representation: output( evenA, a ) => 1

4.3. Extending the C Coding
The while loop for ‘even-odd’ will become:

    :
    while ((ch = getchar()) != EOF) {
        output(currState, ch);
        currState = nextState(currState, ch);
        isAccepting = acceptable(currState);
    }
    :

The output() C function:

    void output(enum state s, int ch)
    {
        if ((s == evenA) && (ch == 'a'))
            putchar('1');
    }

5. Deterministic and Nondeterministic Automata
We have been writing deterministic automata so far: for an input read by a state there is at most one transition that can be fired.
[Figure: state ‘s’ with one outgoing transition labelled ‘a’ and one labelled ‘w’]
State ‘s’ can process input ‘a’ and ‘w’, and fails for anything else.

Nondeterministic Automata
[Figure: state S with transitions labelled ‘a’, ‘x’ and ‘x’ leading to the states V, T and U]
A nondeterministic (ND) automaton can have 2 or more transitions with the same label leaving a state.
Problem: if state S sees input ‘x’, then which transition should it use?

5.1. The ‘man’ Automaton
Accept all strings that contain "man". This is hard to write as a deterministic automaton. The following has bugs:
[Figure, marked WRONG: states 0 to 3, start state 0; 0 -m-> 1 -a-> 2 -n-> 3; further arcs labelled L - m, L - a and L - n]

The input string "command" will get stuck at state 0.
[Figure: trace of c o m m a n d, reaching state 1, with the annotation "the problem starts here"]

5.2. An ND Automaton Solution
[Figure: states 0 to 3, start state 0; 0 -m-> 1 -a-> 2 -n-> 3; state 0 also has a transition back to itself]
It is nondeterministic because an ‘m’ input in state 0 can be dealt with by two transitions:
- a transition back to state 0, or
- a transition to state 1

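Processing the input "command":
[Figure: tree of the possible executions over c o m m a n d. One branch moves to state 1 on an ‘m’ and then fails on the next ‘m’: reject the input. Another branch goes through states 1, 2 and 3 on ‘m’, ‘a’, ‘n’; 3 is the accepting state.]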

5.3. Executing an ND Automaton
It is difficult to code ND automata in conventional languages, such as C. Two different coding approaches:
1. When an input arrives, execute all transitions in parallel. See which succeeds.
2. When an input arrives, try one transition. If it leads to failure then backtrack and try another transition.

5.4. Why use ND Automata?
With nondeterminism, some problems are easier to solve/model.
Nondeterminism is common in some application areas, such as AI, graph search, and compilers.

It is possible to translate an ND automaton into a (larger, more complex) deterministic one.
In mathematical terms, ND automata and deterministic automata are equivalent: they can be used to model all the same problems.

6. Regular Expressions (REs)
REs are an algebraic way of specifying how to recognise input: ‘algebraic’ means that the recognition pattern is defined using RE operands and operators.
REs are equivalent to automata: REs and automata can be used on all the same problems.

6.1. REs in grep
grep searches input lines, a line at a time. If the line contains a string that matches grep's RE (pattern), then the line is output.

  input lines (e.g. from a file) -> grep "RE" -> output matching lines (e.g. to a file)

Example input lines:
  hello andy
  my name is andy
  my bye byhe

Examples
Input lines for each example:
  hello andy
  my name is andy
  my bye byhe

grep "and" outputs:
  hello andy
  my name is andy

grep -E "an|my" outputs ("|" means "or"):
  hello andy
  my name is andy
  my bye byhe

grep "hel*" ("*" means "0 or more"), with the input lines hello andy, my name is andy, my bye byhe, outputs:
  hello andy
  my bye byhe

6.2. Why use REs?
They are very useful for expressing patterns that recognise textual input.
For example, REs are used in:
- editors
- compilers
- web-based search engines
- communication protocols

6.3. The RE Language
The RE language is an algebraic way of specifying how to recognise input: the recognition pattern is defined using RE operands and operators.

RE Operands
There are 4 basic kinds of operands:
- characters (e.g. ‘a’, ‘1’, ‘(’)
- the symbol ε (means an empty string ‘’)
- the symbol {} (means the empty set)
- variables, which can be assigned a RE: variable = RE

RE Operators
There are three basic operators:
- union ‘|’
- concatenation
- closure ‘*’

Concatenation
S T: this RE will use the S RE followed by the T RE to match against strings.
What string is matched by the RE "abc"? It is equivalent to: 'a' followed by 'b' followed by 'c'.

6.4. REs for C Identifiers
We define two RE variables, letter and digit:
  letter = A | B | C | D ... Z | a | b | c | d .... z
  digit = 0 | 1 | 2 | ... 9
ident is defined using letter and digit:
  ident = letter ( letter | digit )*

Strings matched by ident include: ab345, w, h5g
Strings not matched: 2, $abc, ****

7. UNIX Regular Expressions
Different UNIX tools use slightly different extensions of the basic RE language: vi, awk, sed, grep, egrep, etc.
Extra operands include:
- character classes
- line start ‘^’ and end ‘$’ symbols
- the wild card symbol ‘.’
There are also additional operators, R? and R+.

7.1. Character Classes
The character class [a1 a2 ... an] stands for a1 | a2 | ... | an
a1-an stands for the set of characters between a1 and an, e.g. [A-Z], [a-z0-9]

7.2. Line Start and End
The ‘^’ matches the beginning of the line, ‘$’ matches the end, e.g.
  grep '^andr' /usr/share/dict/words
  grep '^[washingto]*$' /usr/share/dict/words

Example as a Diagram
grep "^andr" applied to /usr/share/dict/words
  input lines: A, A's, AOL, AOL's, ...
  output lines: androgen, androgen's, androgynous, android, android's, androids

7.3. Wild Card Symbol
The ‘.’ stands for any character except the newline, e.g.
  grep '^a..b.$' chapter1.txt
  grep 't.*t.*t' manual

grep "^a..b.$" applied to /usr/share/dict/words
  input lines: A, A's, AOL, AOL's, ...
  output lines: adobe, alibi, ameba

7.4. R? and R+
R? stands for ε | R (0 or 1 R)
R+ stands for R | RR | RRR | ..., which can also be written as R R* (one or more occurrences of R)

8. From REs to Automata
The translation uses a special kind of ND automaton which uses ε-transitions. Automata of this type are sometimes called ε-NFAs.
The translation steps are:
1. RE -> ε-NFA
2. ε-NFA -> ND automaton
3. ND automaton -> deterministic automaton
4. deterministic automaton -> code

ε-NFAs
An ε-NFA allows a transition to use an ε label. A transition using an ε label can be triggered without having to match any input.

ε-NFA Example
The strings matched by a*b | b*a are accepted by the following ε-NFA:
[Figure: ε-NFA with states 1 to 6 and transitions labelled a, b and ε; one point is annotated "nondeterminism occurs here"]
Example input: "bbba"

9. More Information
- Johnsonbaugh, R. 1997. Discrete Mathematics, Prentice Hall, chapter 10.
- Rosen, Kenneth H. 2007. Discrete Mathematics and its Applications, 7th edition, McGraw Hill, chapter 13, sections 13.2-13.3.