Practical Regular Expressions

Presentation transcript:

MA/CSSE 474 Theory of Computation
Kleene's Theorem
Practical Regular Expressions

Kleene’s Theorem
Finite state machines and regular expressions define the same class of languages. To prove this, we must show:
● Theorem: Any language that can be defined by a regular expression can be accepted by some FSM and so is regular.
● Theorem: Every regular language (i.e., every language that can be accepted by some DFSM) can be defined with a regular expression.
Q1

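For Every Regular Expression There is a Corresponding FSM
We’ll show this by construction. An FSM for:
● $\emptyset$: a start state that is not accepting, with no transitions.
● A single element c of $\Sigma$: a start state with one transition on c to an accepting state.
● $\varepsilon$ ($\emptyset^*$): a single start state that is also accepting, with no transitions.
Q2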

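Union
If $\alpha$ is the regular expression $\beta \cup \gamma$ and if both $L(\beta)$ and $L(\gamma)$ are regular: one standard construction adds a new start state with $\varepsilon$-transitions to the start states of the FSMs for $\beta$ and $\gamma$, and keeps the accepting states of both machines.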

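Concatenation
If $\alpha$ is the regular expression $\beta\gamma$ and if both $L(\beta)$ and $L(\gamma)$ are regular: one standard construction adds an $\varepsilon$-transition from each accepting state of the FSM for $\beta$ to the start state of the FSM for $\gamma$. The start state is that of the first machine and the accepting states are those of the second.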

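Kleene Star
If $\alpha$ is the regular expression $\beta^*$ and if $L(\beta)$ is regular: one standard construction adds a new start state that is also accepting, with an $\varepsilon$-transition to the start state of the FSM for $\beta$, and adds an $\varepsilon$-transition from each accepting state of that FSM back to the new start state.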

An Example: $(b \cup ab)^*$
An FSM for b. An FSM for a. An FSM for b. An FSM for ab:

An Example: $(b \cup ab)^*$
An FSM for $(b \cup ab)$:

An Example: $(b \cup ab)^*$
An FSM for $(b \cup ab)^*$:

The Algorithm regextofsm
regextofsm($\alpha$: regular expression) =
Beginning with the primitive subexpressions of $\alpha$ and working outwards until an FSM for all of $\alpha$ has been built do: Construct an FSM as described above.
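The bottom-up idea behind regextofsm can be sketched in Python. This is a minimal sketch, not the slides' own code: the tuple encoding of regular expressions, the `NFA` class, and the `accepts` helper are all assumptions made for illustration, and the union, concatenation, and star cases use one standard $\varepsilon$-transition construction that may differ in detail from the diagrams on the slides.

```python
from itertools import count


class NFA:
    """Transition store for an epsilon-NFA; the label None is an epsilon move."""

    def __init__(self):
        self.edges = {}          # (state, label) -> set of successor states
        self._ids = count()

    def new_state(self):
        return next(self._ids)

    def add_edge(self, src, label, dst):
        self.edges.setdefault((src, label), set()).add(dst)


def regextofsm(node, nfa):
    """Return (start, accepting_states) of a machine for the regex tree `node`."""
    kind = node[0]
    if kind == "empty":                       # the empty language
        return nfa.new_state(), set()
    if kind == "eps":                         # the empty string
        s = nfa.new_state()
        return s, {s}
    if kind == "sym":                         # a single symbol
        s, t = nfa.new_state(), nfa.new_state()
        nfa.add_edge(s, node[1], t)
        return s, {t}
    s1, a1 = regextofsm(node[1], nfa)         # build the first operand first
    if kind == "star":
        s = nfa.new_state()                   # new start, also accepting
        nfa.add_edge(s, None, s1)
        for q in a1:
            nfa.add_edge(q, None, s)          # loop back for another round
        return s, {s} | a1
    s2, a2 = regextofsm(node[2], nfa)
    if kind == "union":
        s = nfa.new_state()                   # new start branches to both
        nfa.add_edge(s, None, s1)
        nfa.add_edge(s, None, s2)
        return s, a1 | a2
    if kind == "cat":
        for q in a1:
            nfa.add_edge(q, None, s2)         # glue first machine to second
        return s1, a2
    raise ValueError(f"unknown node kind: {kind}")


def accepts(nfa, start, accepting, text):
    """Simulate the epsilon-NFA on `text` by tracking the set of live states."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for r in nfa.edges.get((q, None), ()):
                if r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen

    current = closure({start})
    for ch in text:
        moved = set()
        for q in current:
            moved |= nfa.edges.get((q, ch), set())
        current = closure(moved)
    return bool(current & accepting)


# The running example (b U ab)*, written as a nested tuple.
B_OR_AB_STAR = ("star", ("union", ("sym", "b"),
                         ("cat", ("sym", "a"), ("sym", "b"))))
nfa = NFA()
start, accepting = regextofsm(B_OR_AB_STAR, nfa)
print(accepts(nfa, start, accepting, "bab"))
```

The recursion mirrors the algorithm statement: primitive subexpressions are handled first, and each operator combines machines that have already been built.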

For Every FSM There is a Corresponding Regular Expression
We’ll show this by construction. The construction is different from the textbook's.
Let $M = (\{q_1, \dots, q_n\}, \Sigma, \delta, q_1, A)$ be a DFSM. Define $R_{ijk}$ to be the set of all strings $x \in \Sigma^*$ such that $(q_i, x) \vdash_M^* (q_j, \varepsilon)$, and if $(q_i, y) \vdash_M^* (q_\ell, \varepsilon)$ for any prefix y of x (except $y = \varepsilon$ and $y = x$), then $\ell \le k$.
That is, $R_{ijk}$ is the set of all strings that take us from $q_i$ to $q_j$ without passing through any intermediate states numbered higher than k.
In this case, "passing through" means both entering and leaving. Note that either i or j (or both) may be greater than k.
I do this construction because I think every math or CS major should see this kind of dynamic programming algorithm, and there is no required course that does it. Another algorithm that uses a similar approach is the Floyd-Warshall "all-pairs shortest paths" algorithm, which solves in $O(n^3)$ time a problem that at first seems like it must be $O(n^4)$. Write "Floyd-Warshall" on the board.

DFA → Reg. Exp. construction
● $R_{ijk}$ is the set of all strings that take M from $q_i$ to $q_j$ without passing through any intermediate states numbered higher than k.
● Note that $R_{ijn}$ is the set of all strings that take M from $q_i$ to $q_j$, because all states are numbered $\le n$.
● Also note that L(M) is the union of $R_{1jn}$ over all $q_j$ in A.
● We will show that for all $i, j \in \{1, \dots, n\}$ and all $k \in \{0, \dots, n\}$, $R_{ijk}$ is defined by a regular expression.

DFA → Reg. Exp. continued
$R_{ijk}$ is the set of all strings that take M from $q_i$ to $q_j$ without passing through any intermediate states numbered higher than k. It can be computed recursively:
Base cases (k = 0):
● If $i \ne j$, $R_{ij0} = \{a : \delta(q_i, a) = q_j\}$
● If $i = j$, $R_{ii0} = \{a : \delta(q_i, a) = q_i\} \cup \{\varepsilon\}$
Recursive case (k > 0): $R_{ijk} = R_{ij(k-1)} \cup R_{ik(k-1)} (R_{kk(k-1)})^* R_{kj(k-1)}$
We show by induction that each $R_{ijk}$ is defined by some regular expression $r_{ijk}$.
Recursive case: if a string is in $R_{ijk}$, it either takes us from state i to state j without passing through state k, or it does pass through k. In the second case it takes us to k for the first time, possibly does some loops that return to k, then goes to j.
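The recurrence above translates directly into a dynamic-programming loop over k. The following is a rough sketch under stated assumptions: regular expressions are represented as Python `re` pattern strings (with `None` standing for $\emptyset$ and the empty string for $\varepsilon$), the helper names `union`, `concat`, `star`, and `dfa_to_regex` are hypothetical, and the two-state machine for "odd number of 1s" is an illustrative example that does not come from the slides. No simplification is attempted, so the resulting pattern is long.

```python
import re
from itertools import product


def union(a, b):
    """Regex for L(a) U L(b); None is the empty language."""
    if a is None:
        return b
    if b is None:
        return a
    if a == b:
        return a
    return f"(?:{a}|{b})"


def concat(a, b):
    """Regex for L(a)L(b); concatenating with the empty language gives it back."""
    if a is None or b is None:
        return None
    return a + b


def star(a):
    """Regex for L(a)*; both the empty language and epsilon star to epsilon."""
    if a is None or a == "":
        return ""
    return f"(?:{a})*"


def dfa_to_regex(n, delta, accepting, start=1):
    """States are 1..n; delta maps (state, symbol) to a state."""
    # k = 0: direct transitions, plus epsilon on the diagonal.
    r = [[None] * (n + 1) for _ in range(n + 1)]
    for i, j in product(range(1, n + 1), repeat=2):
        expr = "" if i == j else None
        for (q, a), target in delta.items():
            if q == i and target == j:
                expr = union(expr, re.escape(a))
        r[i][j] = expr
    # k > 0: allow state k as an intermediate state, reusing the k-1 table.
    for k in range(1, n + 1):
        prev = r
        r = [[None] * (n + 1) for _ in range(n + 1)]
        for i, j in product(range(1, n + 1), repeat=2):
            via_k = concat(concat(prev[i][k], star(prev[k][k])), prev[k][j])
            r[i][j] = union(prev[i][j], via_k)
    # L(M) is the union of r[start][j] over the accepting states j.
    result = None
    for j in sorted(accepting):
        result = union(result, r[start][j])
    return result


# Illustrative machine (an assumption): accepts strings with an odd number of 1s.
ODD_ONES = {(1, "0"): 1, (1, "1"): 2, (2, "0"): 2, (2, "1"): 1}
pattern = dfa_to_regex(2, ODD_ONES, accepting={2})
print(pattern)
```

Keeping only the table for k-1 while filling the table for k is the same space-saving idea used in Floyd-Warshall.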

DFA → Reg. Exp. Proof pt. 1
Base case definition (k = 0):
● If $i \ne j$, $R_{ij0} = \{a : \delta(q_i, a) = q_j\}$
● If $i = j$, $R_{ii0} = \{a : \delta(q_i, a) = q_i\} \cup \{\varepsilon\}$
Base case proof: $R_{ij0}$ is a finite set of strings, each of which is either $\varepsilon$ or a single symbol from $\Sigma$. So $R_{ij0}$ can be defined by the reg. exp. $r_{ij0} = a_1 \cup a_2 \cup \dots \cup a_p$ (or $a_1 \cup a_2 \cup \dots \cup a_p \cup \varepsilon$ if i = j), where $\{a_1, a_2, \dots, a_p\}$ is the set of all symbols a such that $\delta(q_i, a) = q_j$.
Note that if M has no direct transitions from $q_i$ to $q_j$, then $r_{ij0}$ is $\emptyset$ (or $\varepsilon$ if i = j).

DFA → Reg. Exp. Proof pt. 2
Recursive definition (k > 0): $R_{ijk} = R_{ij(k-1)} \cup R_{ik(k-1)} (R_{kk(k-1)})^* R_{kj(k-1)}$
Induction hypothesis: For each $\ell$ and $m$, there is a regular expression $r_{\ell m(k-1)}$ such that $L(r_{\ell m(k-1)}) = R_{\ell m(k-1)}$.
Induction step: By the recursive parts of the definition of regular expressions and the languages they define, and by the above recursive definition of $R_{ijk}$:
$R_{ijk} = L(r_{ij(k-1)} \cup r_{ik(k-1)} (r_{kk(k-1)})^* r_{kj(k-1)})$

DFA → Reg. Exp. Proof pt. 3
We showed by induction that each $R_{ijk}$ is defined by some regular expression $r_{ijk}$. In particular, for all $q_j \in A$, there is a regular expression $r_{1jn}$ that defines $R_{1jn}$.
Then $L(M) = L(r_{1 j_1 n} \cup \dots \cup r_{1 j_p n})$, where $A = \{q_{j_1}, \dots, q_{j_p}\}$.

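An Example
[Figure: DFSM over the alphabet {0, 1} with states $q_1$ (start), $q_2$, and $q_3$]

Regex        k = 0              k = 1                k = 2
$r_{11k}$    $\varepsilon$      $\varepsilon$        $(00)^*$
$r_{12k}$    $0$                $0$                  $0(00)^*$
$r_{13k}$    $1$                $1$                  $0^*1$
$r_{21k}$    $0$                $0$                  $0(00)^*$
$r_{22k}$    $\varepsilon$      $\varepsilon \cup 00$    $(00)^*$
$r_{23k}$    $1$                $1 \cup 01$          $0^*1$
$r_{31k}$    $\emptyset$        $\emptyset$          $(0 \cup 1)(00)^*0$
$r_{32k}$    $0 \cup 1$         $0 \cup 1$           $(0 \cup 1)(00)^*$
$r_{33k}$    $\varepsilon$      $\varepsilon$        $\varepsilon \cup (0 \cup 1)0^*1$

Look quickly at the k = 0 cases. Tell students that for practice they should look at some of the others that we do not do together. Look together at these examples:
$r_{221} = r_{220} \cup r_{210}(r_{110})^* r_{120} = \varepsilon \cup 0(\varepsilon)^*0 = \varepsilon \cup 00$
$r_{132} = r_{131} \cup r_{121}(r_{221})^* r_{231} = 1 \cup 0(\varepsilon \cup 00)^*(1 \cup 01) = 1 \cup 0(00)^*(\varepsilon \cup 0)1$
Note that $(00)^*(\varepsilon \cup 0)$ is equivalent to $0^*$, so we get $1 \cup 00^*1$, which is equivalent to $0^*1$.
Have students (on the quiz) do $r_{123}$ and $r_{133}$ and simplify them. Compare notes with another student.
$r_{123} = r_{122} \cup r_{132}(r_{332})^* r_{322} = 0(00)^* \cup 0^*1(\varepsilon \cup (0 \cup 1)0^*1)^*(0 \cup 1)(00)^* = 0(00)^* \cup 0^*1((0 \cup 1)0^*1)^*(0 \cup 1)(00)^*$
$r_{133} = r_{132} \cup r_{132}(r_{332})^* r_{332} = 0^*1 \cup 0^*1(\varepsilon \cup (0 \cup 1)0^*1)^*(\varepsilon \cup (0 \cup 1)0^*1) = 0^*1((0 \cup 1)0^*1)^*$
For the entire machine we get $r_{123} \cup r_{133} = 0(00)^* \cup 0^*1((0 \cup 1)0^*1)^*(\varepsilon \cup (0 \cup 1)(00)^*)$
Q3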
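A hand-derived expression like the one for the entire machine can be sanity-checked by brute force. The sketch below makes two assumptions that the transcript does not state outright: the transitions are read off the k = 0 column of the table (for example, $r_{120} = 0$ is taken to mean $\delta(q_1, 0) = q_2$), and the accepting states are taken to be $q_2$ and $q_3$ because the final answer is $r_{123} \cup r_{133}$. The names `DELTA`, `dfsm_accepts`, and `disagreements` are illustrative. Union is written `|` and $\varepsilon$ as an empty alternative, following Python's `re` syntax.

```python
import re
from itertools import product

# Transitions inferred from the k = 0 column (an assumption, see above).
DELTA = {
    (1, "0"): 2, (1, "1"): 3,
    (2, "0"): 1, (2, "1"): 3,
    (3, "0"): 2, (3, "1"): 2,
}
ACCEPTING = {2, 3}          # inferred from the final union r123 U r133

# 0(00)* U 0*1((0 U 1)0*1)*(eps U (0 U 1)(00)*) in Python re syntax.
HAND_DERIVED = re.compile(r"0(00)*|0*1((0|1)0*1)*(|(0|1)(00)*)")


def dfsm_accepts(text):
    """Run the DFSM from q1 and report whether it halts in an accepting state."""
    state = 1
    for ch in text:
        state = DELTA[(state, ch)]
    return state in ACCEPTING


def disagreements(max_len):
    """List every binary string up to max_len on which the DFSM and regex differ."""
    bad = []
    for length in range(max_len + 1):
        for chars in product("01", repeat=length):
            s = "".join(chars)
            if dfsm_accepts(s) != bool(HAND_DERIVED.fullmatch(s)):
                bad.append(s)
    return bad


print(disagreements(10))
```

An empty list is evidence, not proof, that the expression and the machine agree; it only covers strings up to the chosen length.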

A Special Case of Pattern Matching
Suppose that we want to match a pattern that is composed of a set of keywords. Then we can write a regular expression of the form:
$(\Sigma^* (k_1 \cup k_2 \cup \dots \cup k_n) \Sigma^*)^+$
For example, suppose we want to match:
$\Sigma^*$ (finite state machine $\cup$ FSM $\cup$ finite state automaton) $\Sigma^*$
We can use regextofsm to build an FSM. But … We can instead use buildkeywordFSM.

{cat, bat, cab} The single keyword cat:

{cat, bat, cab} Adding bat:

{cat, bat, cab} Adding cab:
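The transcript does not give the steps of buildkeywordFSM, so the following is not that algorithm. It is a minimal sketch of one well-known way to build a deterministic keyword matcher directly: each state remembers the longest suffix of the input read so far that is a prefix of some keyword, and a single absorbing `MATCH` state records that a keyword has been seen. The function names, the `MATCH` sentinel, and the four-letter alphabet are assumptions for illustration, and keywords are assumed to be non-empty.

```python
MATCH = ("match",)   # sentinel for the absorbing accepting state


def build_keyword_dfsm(keywords, alphabet):
    """Return (start, delta, accepting) for Sigma* (k1 U ... U kn) Sigma*."""
    keywords = set(keywords)
    # Every prefix of every keyword is a candidate state; "" is the start state.
    prefixes = {kw[:i] for kw in keywords for i in range(len(kw) + 1)}

    def step(prefix, ch):
        seen = prefix + ch
        if any(seen.endswith(kw) for kw in keywords):
            return MATCH                     # some keyword just ended here
        for cut in range(len(seen) + 1):     # try the longest suffix first
            if seen[cut:] in prefixes:
                return seen[cut:]

    delta = {(MATCH, ch): MATCH for ch in alphabet}
    for prefix in prefixes - keywords:
        for ch in alphabet:
            delta[(prefix, ch)] = step(prefix, ch)
    return "", delta, {MATCH}


def contains_keyword(text, machine):
    """Run the DFSM over text, which must use only symbols from the alphabet."""
    state, delta, accepting = machine
    for ch in text:
        state = delta[(state, ch)]
    return state in accepting


machine = build_keyword_dfsm({"cat", "bat", "cab"}, alphabet="abct")
print(contains_keyword("cacat", machine))
```

Unlike the machine produced by regextofsm, this one is deterministic from the start, so no subset construction is needed before running it.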

Regular Expressions in Perl

Syntax      Name               Description
abc         Concatenation      Matches a, then b, then c, where a, b, and c are any regexes
a | b | c   Union (Or)         Matches a or b or c, where a, b, and c are any regexes
a*          Kleene star        Matches 0 or more a’s, where a is any regex
a+          At least one       Matches 1 or more a’s, where a is any regex
a?          -                  Matches 0 or 1 a’s, where a is any regex
a{n,m}      Replication        Matches at least n but no more than m a’s, where a is any regex
a*?, a+?    Parsimonious       Turns off greedy matching so the shortest match is selected
.           Wild card          Matches any character except newline
^           Left anchor        Anchors the match to the beginning of a line or string
$           Right anchor       Anchors the match to the end of a line or string
[a-z]       -                  Assuming a collating sequence, matches any single character in range
[^a-z]      -                  Assuming a collating sequence, matches any single character not in range
\d          Digit              Matches any single digit, i.e., string in [0-9]
\D          Nondigit           Matches any single nondigit character, i.e., [^0-9]
\w          Alphanumeric       Matches any single “word” character, i.e., [a-zA-Z0-9_]
\W          Nonalphanumeric    Matches any character in [^a-zA-Z0-9_]
\s          White space        Matches any character in [space, tab, newline, etc.]

Regular Expressions in Perl

Syntax      Name               Description
\S          Nonwhite space     Matches any character not matched by \s
\n          Newline            Matches newline
\r          Return             Matches return
\t          Tab                Matches tab
\f          Formfeed           Matches formfeed
\b          Backspace          Matches backspace inside []
\b          Word boundary      Matches a word boundary outside []
\B          Nonword boundary   Matches a non-word boundary
\0          Null               Matches a null character
\nnn        Octal              Matches an ASCII character with octal value nnn
\xnn        Hexadecimal        Matches an ASCII character with hexadecimal value nn
\cX         Control            Matches an ASCII control character
\char       Quote              Matches char; used to quote symbols such as . and \
(a)         Store              Matches a, where a is any regex, and stores the matched string in the next variable
\1          Variable           Matches whatever the first parenthesized expression matched
\2          -                  Matches whatever the second parenthesized expression matched
…           -                  For all remaining variables
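A few of the table entries can be tried out directly. The sketch below uses Python's `re` module rather than Perl, on the assumption that the constructs shown (greedy and parsimonious quantifiers, replication, `\d`, `\b`, and the `\1` back-reference) behave the same way in both for these simple cases. The sample strings are made up for illustration.

```python
import re

markup = "aaa <b>bold</b> and <i>italic</i>"

# Greedy .+ runs to the last ">" on the line; the parsimonious .+? stops at the first.
greedy = re.search(r"<.+>", markup).group()
lazy = re.search(r"<.+?>", markup).group()

# Replication: a{2,3} matches two or three a's, so four a's fail a full match.
too_many = re.fullmatch(r"a{2,3}", "aaaa")

# \d matches a digit; + collects runs of them.
numbers = re.findall(r"\d+", "call 555 1234")

# (\w+) stores a word and \1 requires the same text again; \b marks word boundaries.
doubled = re.search(r"\b(\w+) \1\b", "it was was fine").group(1)

print(greedy)
print(lazy)
print(too_many)
print(numbers)
print(doubled)
```

The back-reference in the last pattern is worth noting: patterns that use stored variables can describe languages that are not regular, so this part of the practical syntax goes beyond the regular expressions covered by Kleene's Theorem.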

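Simplifying Regular Expressions
Regexes describe sets:
● Union is commutative: $\alpha \cup \beta = \beta \cup \alpha$.
● Union is associative: $(\alpha \cup \beta) \cup \gamma = \alpha \cup (\beta \cup \gamma)$.
● $\emptyset$ is the identity for union: $\alpha \cup \emptyset = \emptyset \cup \alpha = \alpha$.
● Union is idempotent: $\alpha \cup \alpha = \alpha$.
Concatenation:
● Concatenation is associative: $(\alpha\beta)\gamma = \alpha(\beta\gamma)$.
● $\varepsilon$ is the identity for concatenation: $\alpha\varepsilon = \varepsilon\alpha = \alpha$.
● $\emptyset$ is a zero for concatenation: $\alpha\emptyset = \emptyset\alpha = \emptyset$.
Concatenation distributes over union:
● $(\alpha \cup \beta)\gamma = (\alpha\gamma) \cup (\beta\gamma)$.
● $\gamma(\alpha \cup \beta) = (\gamma\alpha) \cup (\gamma\beta)$.
Kleene star:
● $\emptyset^* = \varepsilon$.
● $\varepsilon^* = \varepsilon$.
● $(\alpha^*)^* = \alpha^*$.
● $\alpha^*\alpha^* = \alpha^*$.
● $(\alpha \cup \beta)^* = (\alpha^*\beta^*)^*$.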
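Some of these identities can be spot-checked mechanically. The sketch below is a bounded test, not a proof: it instantiates $\alpha$, $\beta$, $\gamma$ with the single symbols a, b, c, writes each side in Python `re` syntax, and compares the two sides on every string up to a fixed length. The helper name `same_language` and the chosen instances are assumptions for illustration.

```python
import re
from itertools import product


def same_language(left, right, alphabet="abc", max_len=5):
    """True if both patterns accept exactly the same strings up to max_len."""
    lhs, rhs = re.compile(left), re.compile(right)
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            s = "".join(chars)
            if bool(lhs.fullmatch(s)) != bool(rhs.fullmatch(s)):
                return False
    return True


# (left side, right side) for a few of the identities, with alpha = a, etc.
IDENTITIES = [
    ("a|b", "b|a"),            # union is commutative
    ("(a|b)c", "ac|bc"),       # concatenation distributes over union
    ("(a*)*", "a*"),           # star of a star
    ("a*a*", "a*"),            # star concatenated with itself
    ("(a|b)*", "(a*b*)*"),     # star of a union
]

for left, right in IDENTITIES:
    print(left, right, same_language(left, right))

# A non-identity for contrast: (a U b)* contains "ba", but a*b* does not.
print(same_language("(a|b)*", "a*b*"))
```

A check like this is most useful for catching a mistaken simplification quickly; agreement on short strings does not establish that two expressions are equivalent in general.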