Introduction to Regular Expressions
EELS Meeting, Dec
Tom Horton, Dept. of Computer Science, Univ. of Virginia
Basics
- A regular expression defines a pattern
- Strings match that pattern. (Perhaps many!)
- Thus the regular expression is shorthand for a set of strings
- Alternatively: the regex defines a grammar and thus a set of valid strings (statements) for that grammar
Search / Matching with regexes
- The pattern is applied to one or more strings (words, lines, etc.)
- Each string matches or not, or
- Find the next (or all) matching string(s) (e.g. a line in a file)
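The two uses listed above (match or not, and find the next or all matches) can be sketched with Python's standard `re` module. The pattern and the sample lines are made up for illustration; they are not from the talk.

```python
import re

# Hypothetical pattern: "color" with an optional "u".
pattern = re.compile(r"colou?r")

# Hypothetical input lines.
lines = ["The color red", "no match here", "colour and color"]

# Matches or not: keep the lines that contain the pattern somewhere.
matching_lines = [ln for ln in lines if pattern.search(ln)]

# Find next: the first match in one string.
first = pattern.search(lines[2])

# Find all: every non-overlapping match in one string.
all_hits = pattern.findall(lines[2])

print(matching_lines)
print(first.group())
print(all_hits)
```

`search` looks for the pattern anywhere in the string, while `fullmatch` would require the whole string to fit the pattern.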
Website for this Presentation
How to Express Patterns
Done live on the board without slides, and with a demo
Theoretical Background
The following might interest those who want to see how people in math and CS think about the theoretical aspects of regular expressions.
Phrase Structured Grammars
A phrase structured grammar G is a four-tuple (V, T, S, P) where:
- V is the vocabulary
- T is the set of terminals
- S is the start symbol
- P is the set of productions
T is a subset of V. The set V - T is the set N, the non-terminals.
Productions are literally the ways in which one string can replace (or produce) another.
The language of G is the set of all strings of terminals derivable from S.
Types Of Languages
Types are distinguished by the form of the productions in the grammars that generate them. The classification was introduced by Chomsky (the Chomsky hierarchy):
- Type 0: no restrictions on the productions
- Type 1: Context-sensitive languages
- Type 2: Context-free languages. Productions: LHS is A (i.e., a single non-terminal)
- Type 3: Regular languages. Productions: LHS is A; RHS is a or aB, where A and B are non-terminals and a is a terminal
Type 3 Languages
The REGULAR LANGUAGES, i.e. the languages described by REGULAR EXPRESSIONS
Productions: LHS is A; RHS is a or aB, where A and B are non-terminals and a is a terminal. This form for the RHS defines the regular languages.
- Simplest kind of formal language structure
- Useful for defining things in CS: file name completion, search patterns
Type 3 Language Example
V = {a, b, A, B, S}
T = {a, b}
N = {A, B, S}
Start symbol: S
P:
  S → aB
  S → bA
  A → aA
  A → a
  B → bB
  B → b
So what is the language? An "a" followed by one or more "b"s, or a "b" followed by one or more "a"s.
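One way to check the answer is to apply the productions mechanically and look at what comes out. The following is a minimal sketch, not part of the original talk; the function names `derive` and `in_language` are invented for illustration, and the search is cut off at a maximum string length so it terminates.

```python
from collections import deque

# The six productions of the example; uppercase letters are non-terminals.
PRODUCTIONS = {
    "S": ["aB", "bA"],
    "A": ["aA", "a"],
    "B": ["bB", "b"],
}

def derive(max_len):
    """Return every terminal string of length <= max_len derivable from S."""
    results = set()
    seen = {"S"}
    queue = deque(["S"])
    while queue:
        form = queue.popleft()
        # Position of the leftmost non-terminal, or None if there is none.
        idx = next((i for i, ch in enumerate(form) if ch in PRODUCTIONS), None)
        if idx is None:
            results.add(form)  # only terminals left: a string of the language
            continue
        for rhs in PRODUCTIONS[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            # No production shortens a string, so longer forms can be dropped.
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sorted(results, key=lambda s: (len(s), s))

def in_language(s):
    """True for an 'a' followed by one or more 'b's, or the reverse."""
    if len(s) < 2:
        return False
    head, tail = s[0], s[1:]
    return (head == "a" and set(tail) == {"b"}) or (head == "b" and set(tail) == {"a"})

print(derive(4))
print(all(in_language(s) for s in derive(8)))
```

Every derived string has the expected shape, which supports (but does not prove) the description of the language given above.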
Quick Bits Of Notation
- x* (aka the Kleene star or closure) means the set of strings with zero or more x's, e.g. 'a'* = {ε, a, aa, aaa, aaaa, aaaaa, …}, where ε is the empty string
- x⁺ means the set of strings with one or more x's, e.g. 'a'⁺ = {a, aa, aaa, aaaa, aaaaa, …}
- xᵐ means exactly m x's
- x | y means x or y, e.g. a | b
- x can be a set, in which case the result is every concatenation of set elements, e.g. {'a', 'b'}* = {ε, a, b, aa, ab, ba, bb, aaa, aab, aba, baa, abb, bab, bba, …}
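Since the star and plus sets are infinite, a program can only list them up to some length. The sketch below does that with `itertools.product`; the helper names `kleene_star` and `kleene_plus` are invented for illustration, and the helpers assume the alphabet contains only single, non-empty symbols.

```python
from itertools import product

def kleene_star(alphabet, max_len):
    """List the strings of alphabet* with length <= max_len, shortest first."""
    out = []
    for n in range(max_len + 1):
        # product(..., repeat=n) yields every sequence of n symbols;
        # for n == 0 it yields one empty sequence, i.e. the empty string.
        for combo in product(sorted(alphabet), repeat=n):
            out.append("".join(combo))
    return out

def kleene_plus(alphabet, max_len):
    """Same as kleene_star, but without the empty string."""
    return [s for s in kleene_star(alphabet, max_len) if s]

print(kleene_star({"a"}, 3))
print(kleene_plus({"a"}, 3))
print(kleene_star({"a", "b"}, 2))
print("a" * 4)  # exactly m x's, here with m = 4
```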
Quick Bits Of Notation
These ideas are used to specify regular languages:
- (a | b | c)*  Examples include: aabbbccc, abcabc, aaccbbb
- (ab⁺ | ba⁺)  Examples include: ab, abbb, baaa, ba, baaaaaaa. This is the example we looked at earlier.
Regular languages occur all the time.
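Both expressions carry over almost unchanged to the regex syntax of Python's `re` module (written without the spaces, since a space in a Python pattern is a literal character). This is a small sketch that checks the example strings listed above; the variable names are invented for illustration.

```python
import re

any_abc = re.compile(r"(a|b|c)*")   # (a | b | c)*
ab_or_ba = re.compile(r"ab+|ba+")   # (ab+ | ba+)

abc_examples = ["aabbbccc", "abcabc", "aaccbbb"]
ab_ba_examples = ["ab", "abbb", "baaa", "ba", "baaaaaaa"]

# fullmatch succeeds only if the whole string fits the pattern.
print(all(any_abc.fullmatch(s) for s in abc_examples))
print(all(ab_or_ba.fullmatch(s) for s in ab_ba_examples))

# Strings outside each language are rejected.
print(any_abc.fullmatch("abd") is None)
print(ab_or_ba.fullmatch("abab") is None)
```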
Finite State Automata
A finite-state automaton is a five-tuple:
- I: a set of symbols, the input alphabet (literally the set of input symbols)
- S: a set of states that it can be in
- S0: a designated initial state
- A: a set of designated states called the accepting states
- N: the next-state function, N: S × I → S
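The five-tuple maps directly onto a small data structure. The sketch below is one possible encoding, not something shown in the talk: the `FSA` type, the `accepts` function, and the state names are all invented for illustration. The sample machine recognizes the earlier example language (an "a" followed by one or more "b"s, or the reverse).

```python
from typing import NamedTuple

class FSA(NamedTuple):
    alphabet: frozenset   # I, the input alphabet
    states: frozenset     # S, the set of states
    start: str            # S0, the initial state
    accepting: frozenset  # A, the accepting states
    next_state: dict      # N, stored as {(state, symbol): state}

def accepts(fsa, string):
    """Run the automaton over the string and report whether it ends in A."""
    state = fsa.start
    for sym in string:
        if sym not in fsa.alphabet:
            return False  # symbol outside the input alphabet
        state = fsa.next_state[(state, sym)]
    return state in fsa.accepting

# Hypothetical machine: "dead" is a trap state for strings that cannot match.
M = FSA(
    alphabet=frozenset("ab"),
    states=frozenset({"q0", "qa", "qb", "qab", "qba", "dead"}),
    start="q0",
    accepting=frozenset({"qab", "qba"}),
    next_state={
        ("q0", "a"): "qa",     ("q0", "b"): "qb",
        ("qa", "a"): "dead",   ("qa", "b"): "qab",
        ("qb", "a"): "qba",    ("qb", "b"): "dead",
        ("qab", "a"): "dead",  ("qab", "b"): "qab",
        ("qba", "a"): "qba",   ("qba", "b"): "dead",
        ("dead", "a"): "dead", ("dead", "b"): "dead",
    },
)

print(accepts(M, "abbb"), accepts(M, "ba"), accepts(M, "abab"))
```

Storing N as a dictionary keyed by (state, symbol) keeps the code close to the mathematical definition; the table has one entry for every pair in S × I.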
Example: Vending Machine
[State diagram: states "$0 Deposited", "$0.25 Deposited", "$0.50 Deposited", "$0.75 Deposited", and "$1.00 Deposited"; transitions labeled $0.25 and $0.50]
Based on Epp, page 746, but simpler
Example: Parity Checking
[State diagram: two states, Odd and Even; one state is marked as both the initial and the accepting state]
This is just a recognizer for strings in a language.
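A parity checker of this kind can be sketched in a few lines. The details lost from the slide are filled in here as assumptions: the input is taken to be a string of 0s and 1s, reading a 1 flips the state, and Even is assumed to be both the initial and the accepting state. The name `has_even_parity` is invented for illustration.

```python
# Assumed transition table: "1" flips the parity, "0" leaves it unchanged.
NEXT_STATE = {
    ("Even", "0"): "Even",
    ("Even", "1"): "Odd",
    ("Odd", "0"): "Odd",
    ("Odd", "1"): "Even",
}

def has_even_parity(bits):
    """Accept a bit string exactly when it contains an even number of 1s."""
    state = "Even"  # assumed initial state
    for bit in bits:
        if (state, bit) not in NEXT_STATE:
            raise ValueError(f"unexpected symbol: {bit!r}")
        state = NEXT_STATE[(state, bit)]
    return state == "Even"  # assumed accepting state

print(has_even_parity("1001"))   # two 1s
print(has_even_parity("10110"))  # three 1s
```

The machine only answers yes or no; it does not count the 1s, because two states are enough to remember whether the count so far is even or odd.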
Language Recognizers
- Kleene's theorem: the set of languages defined by type 3 (regular) grammars is identical to the set of languages accepted by finite-state automata. Thus, for any regular language there is a finite-state automaton that recognizes it.
- Another theorem: the set of languages defined by type 2 (context-free) grammars is identical to the set of languages accepted by pushdown automata. Thus, for any context-free language there is a pushdown automaton that recognizes it.
- A pushdown automaton is a finite-state automaton supplemented with a pushdown stack.
- Really cool thing: given a context-free or regular language, there are programs (parser generators) that will build the automaton for us!
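To see what the stack adds, here is a minimal sketch of a pushdown-style recognizer. The language is chosen only for illustration and does not appear in the talk: n "a"s followed by exactly n "b"s (n at least 1), a standard example of a language that is context-free but not regular. The function name `accepts_anbn` and the state names are invented.

```python
def accepts_anbn(string):
    """Recognize n a's followed by n b's (n >= 1) with two states and a stack."""
    stack = []
    state = "reading_a"
    for sym in string:
        if state == "reading_a" and sym == "a":
            stack.append("A")    # remember one 'a'
        elif sym == "b" and stack:
            state = "reading_b"  # after the first 'b', no more a's are allowed
            stack.pop()          # cancel one remembered 'a'
        else:
            return False
    # Accept only if at least one 'b' was read and every 'a' was cancelled.
    return state == "reading_b" and not stack

print(accepts_anbn("aabb"), accepts_anbn("aab"), accepts_anbn("abab"))
```

A finite-state automaton has no way to remember an unbounded count of "a"s, which is exactly the job the stack does here.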