CS 3813: Introduction to Formal Languages and Automata Chapter 3 Regular languages and expressions These class notes are based on material from our textbook, An Introduction to Formal Languages and Automata, 3rd ed., by Peter Linz, published by Jones and Bartlett Publishers, Inc., Sudbury, MA, 2001. They are intended for classroom use only and are not a substitute for reading the textbook.
Operations on formal languages Let L1 = {10} and L2 = {011, 11}. Union: L1 L2 = {10, 011, 11} Concatenation: L1 L2 = {10011, 1011} Kleene Star: L1* = {λ, 10, 1010, 101010, … } Other operations: intersection, complement, difference
Definition Of Regular Languages A regular language over an alphabet is one that contains either a single string of length 0 or 1, or strings which can be obtained by using the operations of union, concatenation, or Kleene* on strings of length 0 or 1.
Alternative definition of regular languages The simplest possible regular languages are the empty set and languages consisting of a single string that is either the empty string or has length one. For example; if = {a,b}, the simplest languages are , {λ}, {a}, and {b}. A regular language is a language that can be built from these simple languages, by using the three operations of union, concatenation, and Kleene star.
Regular Languages correspond to Regular Expressions L = Ø RE is Ø L = {} RE = L = {a} RE = a L = L1 L2 RE = (r1 + r2) L = L1 L2 RE = (r1r2) L = L1* RE = (r1*)
Regular expressions A useful shorthand for describing regular languages. Compare to arithmetic expressions, such as (x + 3)/2. An arithmetic expression is constructed using arithmetic operators, such as addition and division. A regular expression is constructed using operations on languages, such as concatenation, union, and Kleene star. The value of an arithmetic expression is a number. The value of a regular expression is a language.
Recursive definition of a regular expression is a regular expression corresponding to the language . λ is a regular expression corresponding to the language {λ}. For each symbol a , a is a regular expression corresponding to the language {a}. For any regular expressions r and s, corresponding to the regular languages L(r) and L(s), respectively, each of the following is a regular expression: (r + s) corresponds to the language L(r) L(s) (r · s) or (rs) corresponds to the language L(r)L(s) (r*) corresponds to the language (L(r))*
Examples a*+ b = {λ, a, b, aa, aaa, aaaa, aaaaa, … } a*ba* = {w * : w has exactly one b} (a + b)* = any string of a’s and b’s (a + b)*aa (a + b)* = {w * : w contains aa} (a + b)*aa (a + b)* + (a + b)*bb (a + b)* = {w * : w contains aa or bb} (a + λ)b* = {abn : n 0} + {bn : n 0} As with arithmetic expressions, there is an order of precedence for operators -- unless you change it using parentheses. The order is: star closure first, then concatenation, then union.
More examples All strings containing no more than two a’s: (b + c)*(λ + a)(b + c)*(λ + a)(b + c)* All strings containing no runs of a’s of length greater than two: (b + c)*(λ + a + aa)(b + c)*((b + c)(b + c)*(λ + a + aa)(b + c)*)* All strings in which all runs of a’s have lengths that are multiples of three: (aaa + b + c)*
Hints for writing regular expressions Assume = {a, b, c}. Zero or more a’s: a* One or more a’s: aa* Any string at all: (a + b + c)* Any nonempty string: (a + b + c)(a + b + c)* Any string that does not contain a: (b + c)* Any string containing exactly one a: (b + c)*a(b + c)*
Practice Let = {a,b,c}. Give a regular expression for the following languages: (a) all strings containing exactly two a’s (b) all strings containing no more than three a’s
Practice Let = {a,b,c}. Give a regular expression for the following languages: (a) all strings containing exactly two a’s (b + c)*a(b + c)*a(b + c)* (b) all strings containing no more than three a’s (b + c)*(λ + a)(b + c)*(λ + a)(b + c)*(λ + a)(b + c)*
Practice What languages correspond to the following regular expressions? a*b (aaa + bba) (ab)*
More practice Give regular expressions for the following languages, where the alphabet is = {a, b, c}. --all strings ending in b --all strings containing no more than two a’s -- all strings of even length
More practice Give regular expressions for the following languages, where the alphabet is = {0, 1}. --all strings of one or more 0’s followed by a 1 --all strings of two or more symbols followed by three or more 0’s -- all strings that do not end with 01
Do these strings match the regular expression? Regular expression String (01* + 1) 0101 (a + λ)b b (ab)*a* λ (a + b)(ab) bb
Accepting (review) Let M = (Q, S, q0, d, A) be an FA. A string x S * is accepted by M if d*(q0, x) A The language accepted (or recognized) by M is the set L(M) = {x S * | x is accepted by M} A language L over the alphabet S is regular iff there is a Finite Automaton that accepts L.
Kleene’s theorem 1) For any regular expression r that represents language L(r), there is a finite automaton that accepts that same language. 2) For any finite automaton M that accepts language L(M), there is a regular expression that represents the same Therefore, the class of languages that can be represented by regular expressions is equivalent to the class of languages accepted by finite automata -- the regular languages.
NFA regular expression DFA Kleene’s theorem part 1 proved Kleene’s
Theorem 3.1 1st half of Kleene’s theorem: Let r be a regular expression. Then there exists some nondeterministic regular accepter that accepts L(r). Consequently, L(r) is a regular language. Proof strategy: for any regular expression, we show how to construct an equivalent NFA. Because regular expressions are defined recursively, the proof is by induction.
Base step: Give a NFA that accepts each of the simple or “base” languages, , {λ}, and {a} for each a . a
Inductive step: For each of the operations -- union, concatenation and Kleene star -- show how to construct an accepting NFA. Closure under union: M1 λ λ λ M2 λ
Closure under concatenation: M1 λ M2
Closure under Kleene Star: λ λ M1 λ λ
Closure properties of Regular Languages Union, concatenation, and Kleene star of two regular languages will result in a regular language, since we can write a regular expression for them. Intersection and difference (complement) of two regular languages will also produce a regular language. The class of regular languages is said to be closed under these operations. (More in Ch. 4.)
Exercise Use the construction of the first half of Kleene’s theorem to construct a NFA that accepts the language L(ab*aa + bba*ab).
Exercise Use the construction of the first half of Kleene’s theorem to construct a NFA that accepts the language L(ab*aa + bba*ab). λ λ FA accepting ab*aa q0 qf λ λ FA accepting bba*ab
Exercise Construct a NFA that accepts the language corresponding to the regular expression: ((b(a+b)*a) + a)
Theorem 3.2 Kleene’s theorem part 2: Let L be a regular language. Then there exists a regular expression r such that L = L(r). Any language accepted by a finite automaton can be represented by a regular expression. The proof strategy: For any DFA, we show how create an equivalent regular expression. In other words, we describe an algorithm for converting any DFA to a regular expression.
Expression diagram A labeled directed graph (similar to a finite state diagram) in which transitions are labeled by regular expressions Has a single start state with no incoming transitions Has a single accepting state with no outgoing transitions Example: (a+b) ab a*
Algorithm for converting a DFA into an equivalent regular expression Initial step: Change every transition labeled a,b to (a+b). Add a single start state with an outgoing λ-transition to the current start state, and add a single final state with incoming λ-transitions from every previous final state. Main step: Until expression diagram has only two states (initial state and final state), repeat the following: -- pick some non-start, non-final state -- remove it from the diagram and re-label transitions with regular expressions so that the same language is accepted
The key step is removing states and re-labeling transitions with regular expressions. Here are some examples of how to do this. b a a ab*a b ab*a a b ab*b a b ab*b ab*a b a a
Exercise a,b a a (a+b) λ λ b b Continue ...
Exercise a (a+b) λ λ b (a+b) λ a*b a*b (a+b)*
Exercise Find a regular expression that corresponds to the language accepted by the following DFA. a b
Exercise a b λ λ a*bb*a a*bb* λ (a*bb*a)*a*bb*
Homework Find a regular expression that corresponds to the language accepted by the following DFA. q1 q0 1 1 q2 1
Applications of regular expressions Validation checking that an input string is in valid format example 1: checking format of email address on WWW entry form example 2: UNIX regex command Search and selection looking for strings that match a certain pattern example: UNIX grep command Tokenization converting sequence of characters (a string) into sequence of tokens (e.g., keywords, identifiers) used in lexical analysis phase of compiler
Grammar A grammar G = (V, T, S, P) consists of the following quadruple: a set V of variables (non-terminal symbols), including a starting symbol S NT a set T of terminals (same as an alphabet, ) A start symbol S V a set P of production rules Example: S aS | A A bA | λ
Derivation Strings are “derived” from a grammar Example of a derivation S aS aaS aaA aabA aab At each step, a nonterminal is replaced by the sentential form on the right-hand side of a rule (a sentential form can contain nonterminals and/or terminals) Automata recognize languages; grammars generate languages
Context-free grammar A grammar is said to be context-free if every rule has a single non-terminal on the left-hand side This means you can apply the rule in any context. More complicated languages (such as English) have context-dependent rules. A language generated from a context-free grammar is called a context-free language
Regular grammar A grammar is said to be right-linear if all productions are of the form AxB or Ax, where A and B are variables and x is a string of terminals A grammar is said to be left-linear if all productions are of the form ABx or Ax A regular grammar is either right-linear or left-linear.
Linear grammar A grammar can be linear without being right- or left-linear. A linear grammar is a grammar in which at most one variable can occur on the right side of any production rule, without any restriction on the position of the variable. Example: S aS | A A Ab | λ
Another formalism for regular languages Every regular grammar generates a regular language, and every regular language can be generated by a regular grammar. A regular grammar is a simpler, special-case of a context-free grammar The regular languages are a proper subset of the context-free languages
Exercises Find a regular grammar that generates the language on = {a,b} consisting of all strings with no more than three a’s
Exercises Find a regular grammar that generates the language on = {a,b} consisting of all strings with no more than three a’s S bS | aA | λ A bA | aB | λ B bB | aC | λ C bC | λ
Exercises Find a regular grammar that generates the language consisting of even-length strings over {a,b}
Exercises Find a regular grammar that generates the language consisting of even-length strings over {a,b} S aaS | abS | baS | bbS | λ
Exercise What language is generated by the following context-free (but not regular) grammar? S aSa | bSb | a | b | λ
Exercise What language is generated by the following context-free grammar? S aSa | bSb | a | b | λ The odd/even palindrome language: L = {w(a+b+λ)wR}
Exercise Given a grammar, you should be able to say what language it generates Use set notation to define the language generated by the following grammars 1) S aaSB | λ B bB | b 2) S aSbb | A A cA | c
Exercise S aaSB | λ B bB | b It helps to list some of the strings that can be formed: S aaSB aaB aab S aaSB aaB aabB aabb S aaSB aaB aabB aabbB aabbb S aaSB aaB aabB aabbB aabbbB aabbbb S aaSB aaaaSBB aaaaBB aaaaBb aaaabb S aaSB aaaaSBB aaaaBB aaaaBbB aaaaBbb aaaabbb What is the pattern? L = {(aa)nbnb*}
Exercise Given a language, you should be able give a grammar that generates it. For example, give a regular (right-linear) grammar for the language consisting of all strings over {a, b, c} that begin with a, contain exactly two b’s, and end with cc.
Exercise Give a regular (right-linear) grammar for the language consisting of all strings over {a, b, c} that begin with a, contain exactly two b’s, and end with cc S aA A bB | aA | cA B bC | aB | cB C aC | cC | cD D c
Theorem 3.3 Every language generated by a right-linear grammar is regular. Proof: Specify a procedure for automatically constructing an NFA that mimics the derivations of a right-linear grammar.
Theorem 3.3 Justification: The sentential forms produced by a right linear grammar have exactly one variable, which occurs as the rightmost symbol. Assume that our grammar has a production rule D dE and that, during the derivation of a string, there is a step wcD wcdE We can construct an NFA which has states D and E, and an arc labeled d from D to E. NFAs can be converted to DFAs. All languages accepted by DFAs are regular.
Theorem 3.3 Construction: For each variable Vi in the grammar there will be a state in the automaton labeled Vi. The initial state of the automaton will be labeled V0 and will correspond to the S variable in the grammar. For each production rule Vi a1a2…amVj the automaton will have transitions such that δ*(Vi a1a2…am) = Vj For each production rule Vi a1a2…am the automaton will have transitions such that δ*(Vi a1a2…am) = Vfinal
Theorem 3.3 Construct an NFA that accepts the language generated by the grammar: S aA convert to: V0 aV1 A abS | b V1 abV0 | b a b V0 V1 Vf b a
Theorem 3.4 Every regular language can be generated by a right-linear grammar. Proof: Generate a DFA for the language. Specify a procedure for automatically constructing a right-linear grammar from the DFA.
Theorem 3.4 Given a regular language L, let M = (Q, , δ, q0, F) be a DFA that accepts L. Let Q = {q0, q1, …, qn} and = {a1, a2, …, am}. Construct the grammar G = (V, T, S, P) with: V = {q0, q1, …, qn} T = {a1, a2, …, am} S = q0. P = {} initially. P, the set of production rules, is constructed as follows:
Theorem 3.4 For each transition δ(qi, aj) = qk in the transition table of M, add to P the production: qi ajqk If qk is in F, then add to P the production: qk λ
Theorem 3.4 Example: Construct a right-linear grammar for the language L = L(aab*a) First, build an NFA for L: a a a q0 q1 q2 qf b
Theorem 3.4 a a a q0 q1 q2 qf b P = {} initially. Add to P a rule for each transition in the NFA: q0 aq1 q1 aq2 q2 bq2 q2 aqf Since qf is in F, add to P the production: qf λ
Theorem 3.4 a a a q0 q1 q2 qf b Now P = You can convert to normal grammar notation: {q0 aq1 S aA q1 aq2 A aB q2 bq2 B bB q2 aqf B aC qf λ } C λ
Theorem 3.5 A language L is regular if and only if there exists a left-linear grammar G such that L = L(G). Proof: The strategy here is a little tricky. We describe an algorithm to construct a right-linear grammar that generates the reverse of all the strings generated by the left-linear grammar.
Theorem 3.5 Given any left-linear grammar we can construct from it an right-linear grammar G’ by replacing productions of the form: A Bv with A vRB and A v with A vR Since L(G’) is generated by a right-linear grammar, it is regular. It can be demonstrated that L(G) = (L(G’))R. It can be proven that the reverse of any regular language is also regular (see exercise 12, section 2.3 in the Linz text). Hence, L is regular.
Theorem 3.6 A language L is regular if and only if there exists a regular grammar G such that L = L(G). Proof: Combine our definition of regular grammars, which includes the statement, “A regular grammar is either right-linear or left-linear”, with theorems 3.4 and 3.5
3 ways of specifying regular languages Regular expressions DFA NFA Regular grammars describe accept Regular languages generate