MA/CSSE 474 Theory of Computation Kleene's Theorem Practical Regular Expressions.


1 MA/CSSE 474 Theory of Computation Kleene's Theorem Practical Regular Expressions

2 Kleene’s Theorem
Finite state machines and regular expressions define the same class of languages. To prove this, we must show:
Theorem: Any language that can be defined by a regular expression can be accepted by some FSM and so is regular.
Theorem: Every regular language (i.e., every language that can be accepted by some DFSM) can be defined with a regular expression.

3 For Every Regular Expression There is a Corresponding FSM
We’ll show this by construction. An FSM for:
$\emptyset$:
A single element of $\Sigma$:
$\varepsilon$ ($\emptyset^*$):

4 Union
If $\alpha$ is the regular expression $\beta \cup \gamma$ and if both $L(\beta)$ and $L(\gamma)$ are regular:

5 Concatenation
If $\alpha$ is the regular expression $\beta\gamma$ and if both $L(\beta)$ and $L(\gamma)$ are regular:

6 Kleene Star
If $\alpha$ is the regular expression $\beta^*$ and if $L(\beta)$ is regular:

7 An Example: $(b \cup ab)^*$
An FSM for $b$:
An FSM for $a$:
An FSM for $b$:
An FSM for $ab$:

8 An Example: $(b \cup ab)^*$
An FSM for $(b \cup ab)$:

9 An Example: $(b \cup ab)^*$
An FSM for $(b \cup ab)^*$:

10 The Algorithm regextofsm
regextofsm($\alpha$: regular expression) =
  Beginning with the primitive subexpressions of $\alpha$ and working outwards until an FSM for all of $\alpha$ has been built do:
    Construct an FSM as described above.
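A minimal Python sketch of the idea behind regextofsm, assuming an FSM is represented as a start state, a set of accepting states, and a transition dictionary in which the symbol `None` stands for an ε-transition. The names `NFA`, `symbol`, `union`, `concat`, `star` and `accepts` are illustrative choices, not names from the course, and the three combining constructions follow the usual ε-transition approach, which may differ in detail from the diagrams on the slides.

```python
from itertools import count

_ids = count()  # source of fresh state names


class NFA:
    """FSM with epsilon-transitions; edges maps (state, symbol or None) -> set of states."""

    def __init__(self, start, accepting, edges):
        self.start = start
        self.accepting = set(accepting)
        self.edges = edges


def _merge(*edge_maps):
    """Combine transition dictionaries without sharing the target sets."""
    merged = {}
    for edges in edge_maps:
        for key, targets in edges.items():
            merged.setdefault(key, set()).update(targets)
    return merged


def symbol(c):
    """Primitive case: an FSM accepting the single-character string c."""
    s, t = next(_ids), next(_ids)
    return NFA(s, {t}, {(s, c): {t}})


def union(m1, m2):
    """New start state with epsilon-transitions to both start states."""
    s = next(_ids)
    edges = _merge(m1.edges, m2.edges, {(s, None): {m1.start, m2.start}})
    return NFA(s, m1.accepting | m2.accepting, edges)


def concat(m1, m2):
    """Epsilon-transitions from every accepting state of m1 to the start of m2."""
    links = {(a, None): {m2.start} for a in m1.accepting}
    return NFA(m1.start, m2.accepting, _merge(m1.edges, m2.edges, links))


def star(m):
    """New accepting start state, plus epsilon-transitions looping back to m's start."""
    s = next(_ids)
    links = {(a, None): {m.start} for a in m.accepting}
    links[(s, None)] = {m.start}
    return NFA(s, m.accepting | {s}, _merge(m.edges, links))


def eps_closure(m, states):
    """All states reachable from `states` using only epsilon-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in m.edges.get((q, None), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def accepts(m, w):
    """Simulate the FSM on w by tracking the set of reachable states."""
    current = eps_closure(m, {m.start})
    for c in w:
        moved = set()
        for q in current:
            moved |= m.edges.get((q, c), set())
        current = eps_closure(m, moved)
    return bool(current & m.accepting)


# Work outwards from the primitive subexpressions of (b U ab)*.
machine = star(union(symbol("b"), concat(symbol("a"), symbol("b"))))
```

With this machine, `accepts(machine, "babb")` is true and `accepts(machine, "aab")` is false, since every a must be followed immediately by a b.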

11 For Every FSM There is a Corresponding Regular Expression
We’ll show this by construction. The construction is different from the textbook's.
Let $M = (\{q_1, \ldots, q_n\}, \Sigma, \delta, q_1, A)$ be a DFSM. Define $R_{ijk}$ to be the set of all strings $x \in \Sigma^*$ such that $(q_i, x) \vdash_M^* (q_j, \varepsilon)$, and if $(q_i, y) \vdash_M^* (q_\ell, \varepsilon)$ for any prefix $y$ of $x$ (except $y = \varepsilon$ and $y = x$), then $\ell \le k$.
That is, $R_{ijk}$ is the set of all strings that take us from $q_i$ to $q_j$ without passing through any intermediate states numbered higher than $k$. In this case, "passing through" means both entering and leaving. Note that either $i$ or $j$ (or both) may be greater than $k$.

12 DFA $\to$ Reg. Exp. construction
$R_{ijk}$ is the set of all strings that take $M$ from $q_i$ to $q_j$ without passing through any intermediate states numbered higher than $k$.
Note that $R_{ijn}$ is the set of all strings that take $M$ from $q_i$ to $q_j$.
Also note that $L(M)$ is the union of $R_{1jn}$ over all $q_j$ in $A$.
We will show that for all $i, j \in \{1, \ldots, n\}$ and all $k \in \{0, \ldots, n\}$, $R_{ijk}$ is defined by a regular expression.

13 DFA $\to$ Reg. Exp. continued
$R_{ijk}$ is the set of all strings that take $M$ from $q_i$ to $q_j$ without passing through any intermediate states numbered higher than $k$. It can be computed recursively:
Base cases ($k = 0$):
– If $i \ne j$, $R_{ij0} = \{a \in \Sigma : \delta(q_i, a) = q_j\}$
– If $i = j$, $R_{ii0} = \{a \in \Sigma : \delta(q_i, a) = q_i\} \cup \{\varepsilon\}$
Recursive case ($k > 0$):
$R_{ijk}$ is $R_{ij(k-1)} \cup R_{ik(k-1)} (R_{kk(k-1)})^* R_{kj(k-1)}$
We show by induction that each $R_{ijk}$ is defined by some regular expression $r_{ijk}$.
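As a worked instance of the recursive case, take the values that the k = 0 column of the three-state example on slide 17 gives for $r_{220}$, $r_{210}$, $r_{110}$ and $r_{120}$, and apply the rule with $i = j = 2$ and $k = 1$. This only illustrates one table entry; it is not part of the proof.

```latex
\begin{align*}
r_{221} &= r_{220} \cup r_{210}\,(r_{110})^{*}\,r_{120} \\
        &= \varepsilon \cup 0\,(\varepsilon)^{*}\,0 \\
        &= \varepsilon \cup 00
\end{align*}
```

The result agrees with the k = 1 entry for $r_{22k}$ in that example.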

14 DFA $\to$ Reg. Exp. Proof pt. 1
Base case definition ($k = 0$):
– If $i \ne j$, $R_{ij0} = \{a \in \Sigma : \delta(q_i, a) = q_j\}$
– If $i = j$, $R_{ii0} = \{a \in \Sigma : \delta(q_i, a) = q_i\} \cup \{\varepsilon\}$
Base case proof:
$R_{ij0}$ is a finite set of strings, each of which is either $\varepsilon$ or a single symbol from $\Sigma$. So $R_{ij0}$ can be defined by the reg. exp. $r_{ij0} = a_1 \cup a_2 \cup \ldots \cup a_p$ (or $a_1 \cup a_2 \cup \ldots \cup a_p \cup \varepsilon$ if $i = j$), where $\{a_1, a_2, \ldots, a_p\}$ is the set of all symbols $a$ such that $\delta(q_i, a) = q_j$.
Note that if $M$ has no direct transitions from $q_i$ to $q_j$, then $r_{ij0}$ is $\emptyset$ (or $\varepsilon$ if $i = j$).

15 DFA $\to$ Reg. Exp. Proof pt. 2
Recursive definition ($k > 0$):
$R_{ijk}$ is $R_{ij(k-1)} \cup R_{ik(k-1)} (R_{kk(k-1)})^* R_{kj(k-1)}$
Induction hypothesis: For each $i$ and $j$, there is a regular expression $r_{ij(k-1)}$ such that $L(r_{ij(k-1)}) = R_{ij(k-1)}$.
Induction step: By the recursive parts of the definition of regular expressions and the languages they define, and by the above recursive definition of $R_{ijk}$:
$R_{ijk} = L(r_{ij(k-1)} \cup r_{ik(k-1)} (r_{kk(k-1)})^* r_{kj(k-1)})$

16 DFA $\to$ Reg. Exp. Proof pt. 3
We showed by induction that each $R_{ijk}$ is defined by some regular expression $r_{ijk}$.
In particular, for all $q_j \in A$, there is a regular expression $r_{1jn}$ that defines $R_{1jn}$.
Then $L(M) = L(r_{1 j_1 n} \cup \ldots \cup r_{1 j_p n})$, where $A = \{q_{j_1}, \ldots, q_{j_p}\}$.

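17 An Example
[Figure: DFSM with start state $q_1$ and states $q_1$, $q_2$, $q_3$; transition labels 0, 0, 1, 1, and 0,1]

             k = 0            k = 1                    k = 2
$r_{11k}$    $\varepsilon$    $\varepsilon$            $(00)^*$
$r_{12k}$    $0$              $0$                      $0(00)^*$
$r_{13k}$    $1$              $1$                      $0^*1$
$r_{21k}$    $0$              $0$                      $0(00)^*$
$r_{22k}$    $\varepsilon$    $\varepsilon \cup 00$    $(00)^*$
$r_{23k}$    $1$              $1 \cup 01$              $0^*1$
$r_{31k}$    $\emptyset$      $\emptyset$              $(0 \cup 1)(00)^*0$
$r_{32k}$    $0 \cup 1$       $0 \cup 1$               $(0 \cup 1)(00)^*$
$r_{33k}$    $\varepsilon$    $\varepsilon$            $\varepsilon \cup (0 \cup 1)0^*1$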
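The construction can be sketched in Python by building the expressions $r_{ijk}$ as strings in the syntax of Python's `re` module. Everything here is an illustration under stated assumptions: the function names are made up, $\emptyset$ is represented by `None` and $\varepsilon$ by the empty string, the transition function is read off the k = 0 column of the example, and the accepting set {q2, q3} is an assumption because the transcript does not show which states are accepting. The expressions produced are not simplified, so they are longer than the hand-simplified entries in the k = 2 column.

```python
import re
from itertools import product

EMPTY = None  # the regular expression for the empty language
EPS = ""      # the regular expression for the empty string


def union(r, s):
    if r is EMPTY:
        return s
    if s is EMPTY:
        return r
    if r == s:
        return r
    if r == EPS:
        return f"(?:{s})?"
    if s == EPS:
        return f"(?:{r})?"
    return f"(?:{r}|{s})"


def concat(r, s):
    if r is EMPTY or s is EMPTY:
        return EMPTY
    return r + s  # safe: union() always parenthesises its result


def star(r):
    if r is EMPTY or r == EPS:
        return EPS
    return f"(?:{r})*"


def dfa_to_regex(n, delta, accepting):
    """States are 1..n with start state 1; delta maps (state, symbol) -> state."""
    states = range(1, n + 1)
    # Base case k = 0: direct transitions, plus epsilon when i = j.
    r = {}
    for i, j in product(states, repeat=2):
        expr = EPS if i == j else EMPTY
        for (p, a), q in sorted(delta.items()):
            if p == i and q == j:
                expr = union(expr, re.escape(a))
        r[i, j] = expr
    # Recursive case: additionally allow state k as an intermediate state.
    for k in states:
        r = {(i, j): union(r[i, j],
                           concat(concat(r[i, k], star(r[k, k])), r[k, j]))
             for i, j in product(states, repeat=2)}
    # L(M) is the union of r[1, j] over the accepting states q_j.
    result = EMPTY
    for j in sorted(accepting):
        result = union(result, r[1, j])
    return result


def dfa_accepts(delta, accepting, w):
    q = 1
    for c in w:
        q = delta[q, c]
    return q in accepting


# Transitions taken from the k = 0 column; the accepting set is an assumption.
delta = {(1, "0"): 2, (1, "1"): 3, (2, "0"): 1,
         (2, "1"): 3, (3, "0"): 2, (3, "1"): 2}
accepting = {2, 3}

regex = dfa_to_regex(3, delta, accepting)
pattern = re.compile(regex)

# Compare the regular expression with the DFSM on every string of length <= 6.
mismatches = [w for n in range(7)
              for w in map("".join, product("01", repeat=n))
              if (pattern.fullmatch(w) is not None) != dfa_accepts(delta, accepting, w)]
```

An empty `mismatches` list is evidence, though not a proof, that the expression and the machine define the same language.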

18 A Special Case of Pattern Matching
Suppose that we want to match a pattern that is composed of a set of keywords. Then we can write a regular expression of the form:
$(\Sigma^* (k_1 \cup k_2 \cup \ldots \cup k_n) \Sigma^*)^+$
For example, suppose we want to match:
$\Sigma^*$ (finite state machine $\cup$ FSM $\cup$ finite state automaton) $\Sigma^*$
We can use regextofsm to build an FSM. But … We can instead use buildkeywordFSM.

19 {cat, bat, cab} The single keyword cat:

20 {cat, bat, cab} Adding bat :

21 {cat, bat, cab} Adding cab :
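The transcript does not include the definition of buildkeywordFSM, so the following Python sketch swaps in a known technique with the same goal: the Aho-Corasick construction, which builds a trie of the keywords and then fills in the missing transitions breadth-first to obtain a deterministic machine. The function names and the alphabet `"abct"` are assumptions for illustration, and the machine built here may differ from the one drawn on the slides.

```python
from collections import deque


def build_keyword_dfa(keywords, alphabet):
    """Return (delta, accepting); a state is accepting when the input read so far
    ends with a keyword. Assumes every keyword character is in `alphabet`."""
    # Step 1: trie of the keywords; state 0 is the root.
    goto = [{}]
    accepting = set()
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        accepting.add(state)

    # Step 2: breadth-first pass that makes the transition function total.
    # fail[s] is the state for the longest proper suffix of s's string
    # that is also a prefix of some keyword.
    fail = [0] * len(goto)
    delta = [dict() for _ in goto]
    queue = deque()
    for ch in alphabet:
        nxt = goto[0].get(ch, 0)
        delta[0][ch] = nxt
        if nxt:
            queue.append(nxt)
    while queue:
        state = queue.popleft()
        if fail[state] in accepting:
            accepting.add(state)  # a keyword ends here as a suffix
        for ch in alphabet:
            if ch in goto[state]:
                nxt = goto[state][ch]
                fail[nxt] = delta[fail[state]][ch]
                delta[state][ch] = nxt
                queue.append(nxt)
            else:
                delta[state][ch] = delta[fail[state]][ch]
    return delta, accepting


def contains_keyword(text, delta, accepting):
    """True if some keyword occurs in text (text must use only alphabet symbols)."""
    state = 0
    for ch in text:
        state = delta[state][ch]
        if state in accepting:
            return True
    return state in accepting


delta, accepting = build_keyword_dfa(["cat", "bat", "cab"], "abct")
```

For the keyword set {cat, bat, cab} the trie has eight states, and `contains_keyword("cacat", delta, accepting)` is true because the machine falls back to the state for "c" instead of restarting from scratch.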

22 Regular Expressions in Perl

Syntax       Name               Description
abc          Concatenation      Matches a, then b, then c, where a, b, and c are any regexes
a | b | c    Union (Or)         Matches a or b or c, where a, b, and c are any regexes
a*           Kleene star        Matches 0 or more a's, where a is any regex
a+           At least one       Matches 1 or more a's, where a is any regex
a?                              Matches 0 or 1 a's, where a is any regex
a{n, m}      Replication        Matches at least n but no more than m a's, where a is any regex
a*?  a+?     Parsimonious       Turns off greedy matching so the shortest match is selected
.            Wild card          Matches any character except newline
^            Left anchor        Anchors the match to the beginning of a line or string
$            Right anchor       Anchors the match to the end of a line or string
[a-z]                           Assuming a collating sequence, matches any single character in range
[^a-z]                          Assuming a collating sequence, matches any single character not in range
\d           Digit              Matches any single digit, i.e., any character in [0-9]
\D           Nondigit           Matches any single nondigit character, i.e., [^0-9]
\w           Alphanumeric       Matches any single "word" character, i.e., [a-zA-Z0-9_]
\W           Nonalphanumeric    Matches any character in [^a-zA-Z0-9_]
\s           White space        Matches any character in [space, tab, newline, etc.]

23 Regular Expressions in Perl

Syntax       Name               Description
\S           Nonwhite space     Matches any character not matched by \s
\n           Newline            Matches newline
\r           Return             Matches return
\t           Tab                Matches tab
\f           Formfeed           Matches formfeed
\b           Backspace          Matches backspace inside []
\b           Word boundary      Matches a word boundary outside []
\B           Nonword boundary   Matches a non-word boundary
\0           Null               Matches a null character
\nnn         Octal              Matches an ASCII character with octal value nnn
\xnn         Hexadecimal        Matches an ASCII character with hexadecimal value nn
\cX          Control            Matches an ASCII control character
\char        Quote              Matches char; used to quote symbols such as . and \
(a)          Store              Matches a, where a is any regex, and stores the matched string in the next variable
\1           Variable           Matches whatever the first parenthesized expression matched
\2                              Matches whatever the second parenthesized expression matched
…                               For all remaining variables
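Python's `re` module accepts most of the Perl syntax listed in these two tables, so a few of the entries can be tried out directly. The sample string and variable names below are made up for illustration.

```python
import re

text = "aaa 2024-05-17 hello hello world"

# Greedy vs. parsimonious (non-greedy) matching.
greedy = re.match(r"a+", text).group()    # 'aaa'
lazy = re.match(r"a+?", text).group()     # 'a'

# Digit class with replication: four digits, two digits, two digits.
date = re.search(r"\d{4}-\d{2}-\d{2}", text).group()    # '2024-05-17'

# Store and variable: \1 must match whatever the first group matched.
repeated = re.search(r"\b(\w+) \1\b", text).group(1)    # 'hello'

# Left and right anchors.
starts_with_a = re.search(r"^a", text) is not None      # True
ends_with_digit = re.search(r"\d$", text) is not None   # False
```

The back-reference `\1` takes these patterns beyond the regular expressions of Kleene's Theorem: matching a repeated arbitrary word is not something a finite state machine can do.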

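24 Simplifying Regular Expressions
Regex’s describe sets:
● Union is commutative: $\alpha \cup \beta = \beta \cup \alpha$.
● Union is associative: $(\alpha \cup \beta) \cup \gamma = \alpha \cup (\beta \cup \gamma)$.
● $\emptyset$ is the identity for union: $\alpha \cup \emptyset = \emptyset \cup \alpha = \alpha$.
● Union is idempotent: $\alpha \cup \alpha = \alpha$.
Concatenation:
● Concatenation is associative: $(\alpha\beta)\gamma = \alpha(\beta\gamma)$.
● $\varepsilon$ is the identity for concatenation: $\alpha\varepsilon = \varepsilon\alpha = \alpha$.
● $\emptyset$ is a zero for concatenation: $\alpha\emptyset = \emptyset\alpha = \emptyset$.
Concatenation distributes over union:
● $(\alpha \cup \beta)\gamma = (\alpha\gamma) \cup (\beta\gamma)$.
● $\gamma(\alpha \cup \beta) = (\gamma\alpha) \cup (\gamma\beta)$.
Kleene star:
● $\emptyset^* = \varepsilon$.
● $\varepsilon^* = \varepsilon$.
● $(\alpha^*)^* = \alpha^*$.
● $\alpha^*\alpha^* = \alpha^*$.
● $(\alpha \cup \beta)^* = (\alpha^*\beta^*)^*$.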
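Some of these identities can be spot-checked in Python by comparing the sets of strings that two expressions match, up to a bounded length. This is a sanity check under assumptions, not a proof: the sample subexpressions `A`, `B`, `C`, the alphabet {a, b} and the length bound 5 are arbitrary choices, and the helper `language` is hypothetical.

```python
import re
from itertools import product


def language(regex, alphabet="ab", max_len=5):
    """All strings over `alphabet` of length <= max_len that `regex` matches in full."""
    pattern = re.compile(regex)
    words = ("".join(w) for n in range(max_len + 1)
             for w in product(alphabet, repeat=n))
    return {w for w in words if pattern.fullmatch(w)}


# Sample instances of alpha, beta and gamma, each wrapped in a non-capturing group.
A, B, C = "(?:ab)", "(?:b|aa)", "(?:a)"

identities = {
    "union is commutative": (f"{A}|{B}", f"{B}|{A}"),
    "union is idempotent": (f"{A}|{A}", A),
    "concatenation distributes over union": (f"(?:{A}|{B}){C}", f"{A}{C}|{B}{C}"),
    "star of star": (f"(?:{A}*)*", f"{A}*"),
    "star followed by star": (f"{A}*{A}*", f"{A}*"),
    "star of union": (f"(?:{A}|{B})*", f"(?:{A}*{B}*)*"),
}

# For each identity, compare the bounded languages of the two sides.
results = {name: language(lhs) == language(rhs)
           for name, (lhs, rhs) in identities.items()}
```

Every value in `results` should be true for these samples, whereas a pair that is not an identity, such as `(?:ab)*` and `(?:ba)*`, produces different sets.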

