LECTURE NOTES On FINITE AUTOMATA
Mathematical Preliminaries Symbol A character, glyph, mark. An abstract entity that has no meaning by itself, often called uninterpreted. Letters from various alphabets, digits and special characters are the most commonly used symbols. Alphabet A finite set of symbols. An alphabet is often denoted by sigma , yet can be given any name. B = {0, 1} Says B is an alphabet of two symbols, 0 and 1. C = {a, b, c} Says C is an alphabet of three symbols, a, b and c. Sometimes space and comma are in an alphabet while other times they are meta symbols used for descriptions.
String / Word A finite sequence of symbols from an alphabet. 01110 and 111 are strings from the alphabet B above. aaabccc and b are strings from the alphabet C above. A null string is a string with no symbols, usually denoted by epsilon . The null string has length zero. The null string is usually denoted epsilon . Vertical bars around a string indicate the length of a string expressed as a natural number. For example |00100| = 5, |aab| = 3, | epsilon | = 0 Power of an alphabet n Represented as 0 = { }, 2 = { 00, 01, 10, 11 } for = { 0, 1 } Concatenation of Strings denoted as x . y
Formal Language, also called a Language A set of strings from an alphabet. The set may be empty, finite or infinite. There are many ways to define a language and there are many classifications for languages. Because a language is a set of strings, the words language and set are often used interchangeably in talking about formal languages. L(M) is the notation for a language defined by a machine M. The machine M accepts a certain set of strings, thus a language. L(G) is the notation for a language defined by a grammar G. The grammar G recognizes a certain set of strings, thus a language. M(L) is the notation for a machine that accepts a language.
Formal Language, also called a Language(Contd.,) The language L is a certain set of strings. G(L) is the notation for a grammar that recognizes a language. The union of two languages is a language. L = L1 L2 The intersection of two languages is a language. L = L1 L2 The complement of a language is a language. L = * - L1 The difference of two languages is a language. L = L1 - L2 Thus a set of strings all of which are chosen from some *, where is a particular alphabet is called a language If is an alphabet and L *, then L is a language over . Example: for = { a, b } { a, aba, aa } is a finite language ------ is an infinite language
Formal Language, also called a Language(Contd.,) Typical examples: L = { x | x is a bit string with two zeros } L = { anbn | n N } L = {1n | n is prime}
For example, let A = {0,00} then A•A = { 00, 000, 0000 } with |A•A|=3, Formal Language, also called a Language(Contd.,) Do not confuse the concatenation of languages with the Cartesian product of sets. For example, let A = {0,00} then A•A = { 00, 000, 0000 } with |A•A|=3, AA = { (0,0), (0,00), (00,0), (00,00) } with |AA|=4
M Let L be a language S a machine M recognizes L if Recognizing Languages Let L be a language S a machine M recognizes L if “accept” if and only if xL xS M “reject” if and only if xL
Language Definition Problem How to precisely define language Layered structure of language definition Start with a set of letters in language Lexical structure - identifies “words” in language (each word is a sequence of letters) Syntactic structure - identifies “sentences” in language (each sentence is a sequence of words) Semantics - meaning of program (specifies what result should be for each input) lexical and syntactic structures
Specifying Formal Languages Huge Triumph of Computer Science Beautiful Theoretical Results Practical Techniques and Applications Two Dual Notions Generative approach (grammar or regular expression) Recognition approach (automaton) Lots of theorems about converting one approach automatically to another
Specifying Lexical Structure Using Regular Expressions Have some alphabet = set of letters Regular expressions are built from: - empty string Any letter from alphabet r1r2 – regular expression r1 followed by r2 (sequence) r1| r2 – either regular expression r1 or r2 (choice) r* - iterated sequence and choice | r | rr | … Parentheses to indicate grouping/precedence
Concept of Regular Expression Generating a String Rewrite regular expression until have only a sequence of letters (string) left Example (0 | 1)*.(0|1)* (0 | 1)(0 | 1)*.(0|1)* 1(0|1)*.(0|1)* 1.(0|1)* 1.(0|1)(0|1)* 1.(0|1) 1.0 General Rules 1) r1| r2 r1 2) r1| r2 r2 3) r* rr* 4) r*
Nondeterminism in Generation Rewriting is similar to equational reasoning But different rule applications may yield different final results Example 1 (0 | 1)*.(0|1)* (0 | 1)(0 | 1)*.(0|1)* 1(0|1)*.(0|1)* 1.(0|1)* 1.(0|1)(0|1)* 1.(0|1) 1.0 Example 2 (0 | 1)*.(0|1)* (0 | 1)(0 | 1)*.(0|1)* 0(0|1)*.(0|1)* 0.(0|1)* 0.(0|1)(0|1)* 0.(0|1) 0.1
Concept of Language Generated by Regular Expressions Set of all strings generated by a regular expression is language of regular expression In general, language may be (countably) infinite String in language is often called a token
Examples of Languages and Regular Expressions = { 0, 1, . } (0 | 1)*.(0|1)* - Binary floating point numbers (00)* - even-length all-zero strings 1*(01*01*)* - strings with even number of zeros = { a,b,c, 0, 1, 2 } (a|b|c)(a|b|c|0|1|2)* - alphanumeric identifiers (0|1|2)* - trinary numbers
Language Definition Problem How to precisely define language Layered structure of language definition Start with a set of letters in language Lexical structure - identifies “words” in language (each word is a sequence of letters) Syntactic structure - identifies “sentences” in language (each sentence is a sequence of words) Semantics - meaning of program (specifies what result should be for each input) lexical and syntactic structures
Specifying Formal Languages Huge Triumph of Computer Science Beautiful Theoretical Results Practical Techniques and Applications Two Dual Notions Generative approach (grammar or regular expression) Recognition approach (automaton) Lots of theorems about converting one approach automatically to another
Specifying Lexical Structure Using Regular Expressions Have some alphabet = set of letters Regular expressions are built from: - empty string Any letter from alphabet r1r2 – regular expression r1 followed by r2 (sequence) r1| r2 – either regular expression r1 or r2 (choice) r* - iterated sequence and choice | r | rr | … Parentheses to indicate grouping/precedence
Concept of Regular Expression Generating a String Rewrite regular expression until have only a sequence of letters (string) left Example (0 | 1)*.(0|1)* (0 | 1)(0 | 1)*.(0|1)* 1(0|1)*.(0|1)* 1.(0|1)* 1.(0|1)(0|1)* 1.(0|1) 1.0 General Rules 1) r1| r2 r1 2) r1| r2 r2 3) r* rr* 4) r*
Nondeterminism in Generation Rewriting is similar to equational reasoning But different rule applications may yield different final results Example 1 (0 | 1)*.(0|1)* (0 | 1)(0 | 1)*.(0|1)* 1(0|1)*.(0|1)* 1.(0|1)* 1.(0|1)(0|1)* 1.(0|1) 1.0 Example 2 (0 | 1)*.(0|1)* (0 | 1)(0 | 1)*.(0|1)* 0(0|1)*.(0|1)* 0.(0|1)* 0.(0|1)(0|1)* 0.(0|1) 0.1
Concept of Language Generated by Regular Expressions Set of all strings generated by a regular expression is language of regular expression In general, language may be (countably) infinite String in language is often called a token
Examples of Languages and Regular Expressions = { 0, 1, . } (0 | 1)*.(0|1)* - Binary floating point numbers (00)* - even-length all-zero strings 1*(01*01*)* - strings with even number of zeros = { a,b,c, 0, 1, 2 } (a|b|c)(a|b|c|0|1|2)* - alphanumeric identifiers (0|1|2)* - trinary numbers
Alternate Abstraction Finite-State Automata Alphabet Set of states with initial and accept states Transitions between states, labeled with letters (0 | 1)*.(0|1)* . Start state 1 1 Accept state
Automaton Accepting String Conceptually, run string through automaton Have current state and current letter in string Start with start state and first letter in string At each step, match current letter against a transition whose label is same as letter Continue until reach end of string or match fails If end in accept state, automaton accepts string Language of automaton is set of strings it accepts
. Example 11.0 Current state Start state 1 1 Accept state Accept state 11.0 Current letter
. Example 11.0 Current state Start state 1 1 Accept state Accept state 11.0 Current letter
. Example 11.0 Current state Start state 1 1 Accept state Accept state 11.0 Current letter
. Example 11.0 Current state Start state 1 1 Accept state Accept state 11.0 Current letter
. Example 11.0 Current state Start state 1 1 Accept state Accept state 11.0 Current letter
. Example 11.0 String is accepted! Current state Start state 1 1 Accept state 11.0 String is accepted! Current letter
Finite Automaton The most simple machine that is not just a finite list of words. “Read once”, “no write” procedure. Typical is its limited memory. Think cell-phone, elevator door, etc.
A Simple Automaton (0) transition rules states starting state q1 q2 q3 1 0,1 starting state accepting state
A Simple Automaton (1) start accept on input “0110”, the machine goes: q1 q2 q3 1 0,1 start accept on input “0110”, the machine goes: q1 q1 q2 q2 q3 = “reject”
A Simple Automaton (2) on input “101”, the machine goes: q1 q2 q3 1 0,1 on input “101”, the machine goes: q1 q2 q3 q2 = “accept”
A Simple Automaton (3) 010: reject 11: accept 010100100100100: accept q1 q2 q3 1 0,1 010: reject 11: accept 010100100100100: accept 010000010010: reject : reject
Finite Automaton (def.) A deterministic finite automaton (DFA) M is defined by a 5-tuple M=(Q,,,q0,F) Q: finite set of states : finite alphabet : transition function :QQ q0Q: start state FQ: set of accepting states
M = (Q,,,q,F) q1 q2 q3 1 0,1 states Q = {q1,q2,q3} 0,1 states Q = {q1,q2,q3} alphabet = {0,1} start state q1 accept states F={q2} transition function :
Generative Versus Recognition Regular expressions give you a way to generate all strings in language Automata give you a way to recognize if a specific string is in language Philosophically very different Theoretically equivalent (for regular expressions and automata) Standard approach Use regular expressions when define language Translated automatically into automata for implementation
From Regular Expressions to Automata Construction by structural induction Given an arbitrary regular expression r, Assume we can convert r to an automaton with One start state One accept state Show how to convert all constructors to deliver an automaton with
Basic Constructs -Thompson’s construction Start state Accept state a a
Sequence r1r2 r1 r2 Old start state Start state Old accept state
Choice r1 r1|r2 r2 Old start state Start state Old accept state Accept state r1 r1|r2 r2
Kleene Star r r* Old start state Start state Old accept state
NFA vs. DFA DFA No transitions At most one transition from each state for each letter NFA – neither restriction a a NOT OK OK b a
Conversions Our regular expression to automata conversion produces an NFA Would like to have a DFA to make recognition algorithm simpler Can convert from NFA to DFA (but DFA may be exponentially larger than NFA)
NFA to DFA Construction DFA has a state for each subset of states in NFA DFA start state corresponds to set of states reachable by following transitions from NFA start state DFA state is an accept state if an NFA accept state is in its set of NFA states To compute the transition for a given DFA state D and letter a Set S to empty set Find the set N of D’s NFA states For all NFA states n in N Compute set of states N’ that the NFA may be in after matching a Set S to S union N’ If S is nonempty, there is a transition for a from D to the DFA state that has the set S of NFA states Otherwise, there is no transition for a from D
NFA to DFA Example for (a|b)*.(a|b)* a 3 5 . 11 13 1 2 7 8 9 10 15 16 4 6 12 14 b b . a a 5,7,2,3,4,8 . 13,15,10,11,12,16 a a a a 1,2,3,4,8 9,10,11,12,16 b b . b b 6,7,2,3,4,8 14,15,10,11,12,16 b b
Regular Languages The language recognized by a finite automaton M is denoted by L(M). A regular language is a language for which there exists a recognizing finite automaton.
Two DFA Questions Given the description of a finite automaton M = (Q,,,q,F), what is the language L(M) that it recognizes? In general, what kind of languages can be recognized by finite automata? (What are the regular languages?)
Union of Two Languages Theorem 1.12: If A1 and A2 are regular languages, then so is A1 A2. (The regular languages are ‘closed’ under the union operation.) Proof idea: A1 and A2 are regular, hence there are two DFA M1 and M2, with A1=L(M1) and A2=L(M2). Out of these two DFA, we will make a third automaton M3 such that L(M3) = A1 A2.
Proof Union-Theorem (1) M1=(Q1,,1,q1,F1) and M2=(Q2,,2,q2,F2) Define M3 = (Q3,,3,q3,F3) by: Q3 = Q1Q2 = {(r1,r2) | r1Q1 and r2Q2} 3((r1,r2),a) = (1(r1,a), 2(r2,a)) q3 = (q1,q2) F3 = {(r1,r2) | r1F1 or r2F2}
Proof Union-Theorem (2) The automaton M3 = (Q3,,3,q3,F3) runs M1 and M2 in ‘parallel’ on a string w. In the end, the final state (r1,r2) ‘knows’ if wL1 (via r1F1?) and if wL2 (via r2F2?) The accepting states F3 of M3 are such that wL(M3) if and only if wL1 or wL2, for: F3 = {(r1,r2) | r1F1 or r2F2}.
Concatenation of L1 and L2 Definition: L1• L2 = { xy | xL1 and yL2 } Example: {a,b} • {0,11} = {a0,a11,b0,b11} Theorem 1.13: If L1 and L2 are regular languages, then so is L1•L2. (The regular languages are ‘closed’ under concatenation.)
Proving Concatenation Thm. Consider the concatenation: {1,01,11,001,011,…} • {0,000,00000,…} (That is: the bit strings that end with a “1”, followed by an odd number of 0’s.) Problem is: given a string w, how does the automaton know where the L1 part stops and the L2 substring starts? We need an M with ‘lucky guesses’.
Nondeterminism Nondeterministic machines are capable of being lucky, no matter how small the probability. A nondeterministic finite automaton has transition rules/possibilities like q2 1 q1 q2 q1 q3 1
A Nondeterministic Automaton 0,1 0,1 1 0, 1 q1 q2 q3 q4 This automaton accepts “0110”, because there is a possible path that leads to an accepting state, namely: q1 q1 q2 q3 q4 q4
A Nondeterministic Automaton 0,1 0,1 1 0, 1 q1 q2 q3 q4 The string 1 gets rejected: on “1” the automaton can only reach: {q1,q2,q3}.
Nondeterminism ~ Parallelism For any (sub)string w, the nondeterministic automaton can be in a set of possible states. If the final set contains an accepting state, then the automaton accepts the string. “The automaton processes the input in a parallel fashion. Its computational path is no longer a line, but a tree.”
Nondeterministic FA (def.) A nondeterministic finite automaton (NFA) M is defined by a 5-tuple M=(Q,,,q0,F), with Q: finite set of states : finite alphabet : transition function :QP (Q) q0Q: start state FQ: set of accepting states
Nondeterministic :QP (Q) The function :QP (Q) is the crucial difference. It means: “When reading symbol “a” while in state q, one can go to one of the states in (q,a)Q.” The in = {} takes care of the empty string transitions.
Recognizing Languages (def) A nondeterministic FA M = (Q,,,q,F) accepts a string w = w1…wn if and only if we can rewrite w as y1…ym with yi and there is a sequence r0…rm of states in Q such that: 1) r0=q0 2) ri+1 (ri,yi+1) for all i=0,…,m–1 3) rm F
Building a Lexical Analyzer • Token = Pattern • Pattern = Regular Expression • Regular Expression = NFA • NFA = DFA • DFA = Lexical Analyzer
Lexical Structure in Languages Each language typically has several categories of words. In a typical programming language: Keywords (if, while) Arithmetic Operations (+, -, *, /) Integer numbers (1, 2, 45, 67) Floating point numbers (1.0, .2, 3.337) Identifiers (abc, i, j, ab345) Typically have a lexical category for each keyword and/or each category Each lexical category defined by regexp
Lexical Categories Example If Keyword = if While Keyword = while Operator = +|-|*|/ Integer = [0-9] [0-9]* Float = [0-9]*. [0-9]* Identifier = [a-z]([a-z]|[0-9])* Note that [0-9] = (0|1|2|3|4|5|6|7|8|9) [a-z] = (a|b|c|…|y|z) Will use lexical categories in next level
Grammar and Automata Correspondence Regular Grammar Context-Free Grammar Context-Sensitive Grammar Automaton Finite-State Automaton Push-Down Automaton Turing Machine
Context-Free Grammars Grammar Characterization (T,NT,S,P) Finite set of Terminals T Finite set of Nonterminals NT Start Nonterminal S (goal symbol, start symbol) Finite set of Productions P: NT (T | NT)* RHS of production can have any sequence of terminals or nonterminals
Push-Down Automata DFA Plus a Stack (S,A,V, F,s0,sF) Finite set of states S Finite Input Alphabet A, Stack Alphabet V Transition relation F : S (A U{})V S V* Start state s0 Final states sF Each configuration consists of a state, a stack, and remaining input string
CFG Versus PDA CFGs and PDAs are of equivalent power Grammar Implementation Mechanism: Translate CFG to PDA, then use PDA to parse input string Foundation for bottom-up parser generators
Context-Sensitive Grammars and Turing Machines Context-Sensitive Grammars Allow Productions to Use Context P: (T.NT)+ (T.NT)* Turing Machines Have Finite State Control Two-Way Tape Instead of A Stack
Exercises (1) Let A = {x,y,z} and B = {x,y}, answer: Is A a subset of B? Is B a subset of A? What is AB? What is AB? What is AB? What is P (Q)?
Exercises (2) Give NFAs with the specified number of states that recognize the following languages over the alphabet ={0,1}: { w | w ends with 00}, three states {0}; two states { w | w contains even number of 0s, or exactly two 1s}, six states {0n | nN }, one state
Exercises (3) Proof the following result: “If L1 and L2 are regular languages, then is a regular language too.” Describe the language that is recognized by this nondeterministic automaton: 1 0,1 1 0, 1 q1 q2 q3 q4
Additional Practice Problems Write the formal descriptions of the sets containing … the numbers 1,10 and 100 … all integers greater than 5 … all natural numbers less than 5 … the string “aba” … the empty string … nothing Give a formal description of this NFA: Give DFA state diagrams for the following languages over ={0,1}: { w | w begins with 1 and ends with 0} { w | w does not contain substring 110} {} all strings except the empty string 1 0, 1 q1 q2 q3