1 Foundations of Software Design
Lecture 22: Regular Expressions and Finite Automata
Marti Hearst, Fall 2002
2 Regular Expressions
Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
Language = set of strings
A language is defined by a regular expression: the set of strings that match the expression.
Regular Exp.    Corresponding Set of Strings
ε               {""}
a               {"a"}
(ab)*           {"", "ab", "abab", "ababab", ...}
a | b | c       {"a", "b", "c"}
(a | b | c)*    {"", "a", "b", "c", "aa", "ab", ..., "bccabb", ...}
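To make the "language = set of strings" view concrete, the sketch below enumerates every short string over a small alphabet and keeps those that a pattern matches in full. It uses Python's `re` module rather than Perl. The helper name `language` and the length bound are illustrative assumptions, not part of the lecture.

```python
import re
from itertools import product

def language(pattern, alphabet, max_len):
    """Return all strings over `alphabet`, up to `max_len` long, that `pattern` matches entirely."""
    regex = re.compile(pattern)
    matches = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if regex.fullmatch(s):
                matches.append(s)
    return matches

print(language(r"(ab)*", "ab", 6))      # ['', 'ab', 'abab', 'ababab']
print(language(r"a|b|c", "abc", 2))     # ['a', 'b', 'c']
print(language(r"(a|b|c)*", "abc", 1))  # ['', 'a', 'b', 'c']
```

The starred languages are infinite, so the bound only shows a finite prefix of each set.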
3 Regex rules (Perl)
Goal: match patterns
// A string of characters matches the same string
  /woodchuck/      “how much wood does a woodchuck chuck?”
  /p/              “you are a good programmer”
  /pink elephant/  “this is not a pink elephant.”
  /!/              “Keep your hands to yourself!”
[] Disjunction
  /[wW]ood/        “how much wood does a Woodchuck chuck?”
  /[abcd]*/        “you are a good programmer”
  /[A-Za-z]*/      (any letter sequence)
Special rule: when ^ is FIRST WITHIN BRACKETS it means NOT
  /[^A-Z]*/        (anything not an upper case letter)
  /a\^b/           “look up a^b now” (outside brackets ^ is an anchor, so a literal caret is escaped)
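The literal and bracket rules can be tried out with a short sketch. It uses Python's `re` module, whose syntax agrees with Perl's for these constructs. The helper `first_match` is a hypothetical name introduced here. Note that a starred class such as `[abcd]*` can match the empty string at the very start of the text, and that an unescaped `^` outside brackets is an anchor rather than a literal caret.

```python
import re

def first_match(pattern, text):
    """Return the first substring of `text` that `pattern` matches, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

sentence = "how much wood does a Woodchuck chuck?"
print(first_match(r"woodchuck", "how much wood does a woodchuck chuck?"))  # woodchuck
print(re.findall(r"[wW]ood", sentence))                                    # ['wood', 'Wood']
print(re.findall(r"[^A-Z]+", "Hello World"))                               # ['ello ', 'orld']
print(repr(first_match(r"[abcd]*", "you are a good programmer")))          # ''
print(first_match(r"a\^b", "look up a^b now"))                             # a^b
print(first_match(r"a^b", "look up a^b now"))                              # None
```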
4 Regex Rules, cont.
? The preceding character or nothing
  /woodchucks?/  “how much wood does a woodchuck chuck?”
  /behaviou?r/   “behaviour is the British spelling”
* Kleene closure; zero or more occurrences of the preceding character or regular expression
  /baa*/         ba, baa, baaa, baaaa …
  /ba*/          b, ba, baa, baaa, baaaa …
  /[ab]*/        (empty string), a, b, ab, ba, baaa, aaabbb, …
  /[0-9][0-9]*/  any unsigned integer (one or more digits)
+ Kleene plus; one or more occurrences of the preceding character or regular expression
  /ba+/          ba, baa, baaa, baaaa …
. Wildcard; matches any character at that position
  /p.nt/         pant, pint, punt
  /cat.*cat/     a string where “cat” appears twice anywhere
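The difference between `?`, `*`, `+` and `.` is easiest to see by checking which whole strings each pattern accepts. The following is a minimal sketch in Python's `re` module. The helper `accepted` and the candidate word lists are assumptions made for illustration.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that `pattern` matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

words = ["b", "ba", "baa", "baaa"]
print(accepted(r"baa*", words))  # ['ba', 'baa', 'baaa']
print(accepted(r"ba*", words))   # ['b', 'ba', 'baa', 'baaa']
print(accepted(r"ba+", words))   # ['ba', 'baa', 'baaa']
print(accepted(r"behaviou?r", ["behavior", "behaviour", "behaviouur"]))  # ['behavior', 'behaviour']
print(accepted(r"p.nt", ["pant", "pint", "punt", "pnt", "paint"]))       # ['pant', 'pint', 'punt']
print(accepted(r"[0-9][0-9]*", ["0", "007", "42", ""]))                  # ['0', '007', '42']
print(bool(re.search(r"cat.*cat", "a cat chased another cat")))          # True
```

`baa*` and `ba+` accept the same strings, which is why `+` is often described as shorthand for "one copy followed by `*`".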
5 Regex Rules, cont.
| Disjunction
  /(cats?|dogs?)+/   “It’s raining cats and a dog.”
( ) Grouping
  /(gupp(y|ies))*/   “His guppy is the king of guppies.”
^ $ \b Anchors (start of line, end of line, word boundary)
  /^The/             “The cat in the hat.”
  /^The end\.$/      “The end.”
  /^The.* end\.$/    “The bitter end.”
  /(the)*/           “I saw him the other day.”
  /(\bthe\b)*/       “I saw him the other day.”
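The last two examples on this slide differ only in the word-boundary anchor, and a quick experiment shows why it matters. This is a sketch in Python's `re` module. The outer `*` and `+` are dropped so that `findall` reports each occurrence, and `(?:...)` is Python's non-capturing form of grouping, used here so that `findall` returns whole matches.

```python
import re

text = "I saw him the other day."
print(re.findall(r"the", text))      # ['the', 'the']  (the second is inside "other")
print(re.findall(r"\bthe\b", text))  # ['the']

print(re.findall(r"cats?|dogs?", "It's raining cats and a dog."))         # ['cats', 'dog']
print(re.findall(r"gupp(?:y|ies)", "His guppy is the king of guppies."))  # ['guppy', 'guppies']

print(bool(re.search(r"^The.* end\.$", "The bitter end.")))  # True
print(bool(re.search(r"^The end\.$", "The end is near.")))   # False
```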
6 Regexp for Dollars
(the dollar sign is escaped as \$ because an unescaped $ is the end-of-line anchor)
No commas
  /\$[0-9]+(\.[0-9][0-9])?/
With commas
  /\$[0-9][0-9]?[0-9]?(,[0-9][0-9][0-9])*(\.[0-9][0-9])?/
With or without commas
  /\$[0-9][0-9]?[0-9]?((,[0-9][0-9][0-9])*|[0-9]*)(\.[0-9][0-9])?/
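The three dollar patterns can be compared side by side on a few amounts. The sketch below writes them in Python's `re` syntax with the dollar sign escaped, since an unescaped `$` is the end-of-line anchor. The names `NO_COMMAS`, `WITH_COMMAS`, `EITHER` and `is_amount`, and the sample amounts, are assumptions for illustration.

```python
import re

NO_COMMAS = r"\$[0-9]+(\.[0-9][0-9])?"
WITH_COMMAS = r"\$[0-9][0-9]?[0-9]?(,[0-9][0-9][0-9])*(\.[0-9][0-9])?"
EITHER = r"\$[0-9][0-9]?[0-9]?((,[0-9][0-9][0-9])*|[0-9]*)(\.[0-9][0-9])?"

def is_amount(pattern, s):
    """True if the whole string `s` is a dollar amount according to `pattern`."""
    return re.fullmatch(pattern, s) is not None

# Columns: amount, no commas, with commas, either
for s in ["$5", "$1234.50", "$1,234.50", "$12,34", "1234"]:
    print(s, is_amount(NO_COMMAS, s), is_amount(WITH_COMMAS, s), is_amount(EITHER, s))
```

Only the third pattern accepts both "$1234.50" and "$1,234.50", and all three reject a malformed group such as "$12,34".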
7 Regexps and Substitutions
s/title/ / title
/the (.*)er they are, the \1er they will be/i
  The bigger they are, the bigger they will be.
/the (.*)er they (.*), the \1er they \2/i
  The bigger they were, the bigger they were
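Backreferences such as `\1` require the text captured by the first parenthesized group to appear again. The sketch below checks the two patterns from this slide with Python's `re` module. It passes `re.IGNORECASE` because the example sentences begin with a capital "The" while the patterns use a lowercase "the". The final `re.sub` call is an illustrative substitution that reuses a word from an earlier slide, not an example from this one.

```python
import re

one_group = r"the (.*)er they are, the \1er they will be"
print(bool(re.search(one_group, "The bigger they are, the bigger they will be.", re.IGNORECASE)))  # True
print(bool(re.search(one_group, "The bigger they are, the faster they will be.", re.IGNORECASE)))  # False

two_groups = r"the (.*)er they (.*), the \1er they \2"
m = re.search(two_groups, "The bigger they were, the bigger they were", re.IGNORECASE)
print(m.groups())  # ('bigg', 'were')

# Perl's s/pattern/replacement/ corresponds to re.sub(pattern, replacement, text).
print(re.sub(r"behaviour", "behavior", "behaviour is the British spelling"))
# behavior is the British spelling
```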
8 How to Simulate a Therapist
Eliza examples
  s/.* all .*/IN WHAT WAY/
  s/.* I am (sad|depressed).*/I AM SORRY TO HEAR YOU ARE \1/
  s/.* I am (happy|glad).*/I AM GLAD TO HEAR YOU ARE \1/
  s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
  s/.*/TELL ME ABOUT YOUR MOTHER/
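The Eliza rules behave like an ordered list of substitutions in which the first matching rule produces the reply. A minimal sketch of that idea in Python follows. The rule patterns are written without spaces inside the groups and with `.*` on both sides of the keyword, and the names `RULES` and `respond` are assumptions. It uses `fullmatch` plus `Match.expand` instead of `re.sub`, because substituting with the catch-all `.*` in Python can also replace an extra empty match at the end of the string.

```python
import re

# (pattern, response template) pairs, tried in order; the first match wins.
RULES = [
    (r".* all .*", "IN WHAT WAY"),
    (r".* I am (sad|depressed).*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* I am (happy|glad).*", r"I AM GLAD TO HEAR YOU ARE \1"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
    (r".*", "TELL ME ABOUT YOUR MOTHER"),
]

def respond(utterance):
    """Return the reply from the first rule whose pattern matches the whole utterance."""
    for pattern, template in RULES:
        m = re.fullmatch(pattern, utterance)
        if m:
            # expand() fills in \1 from the captured group; upper() keeps Eliza's all-caps style.
            return m.expand(template).upper()
    return ""

print(respond("They are all alike"))                # IN WHAT WAY
print(respond("Lately I am depressed about work"))  # I AM SORRY TO HEAR YOU ARE DEPRESSED
print(respond("He is always late"))                 # CAN YOU THINK OF A SPECIFIC EXAMPLE
print(respond("I am happy"))                        # TELL ME ABOUT YOUR MOTHER
```

The last call falls through to the catch-all rule because the patterns require a space before "I am", so an utterance that starts with "I am" does not match them.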
9 Practice: Regexps for email addresses
Handle cases like these:
  hearst@sims.berkeley.edu
  Wacky@yahoo.com
  Vip@ic-arda.gov
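One possible answer to the exercise is sketched below in Python. The pattern `EMAIL` is an assumption chosen to cover the three sample addresses: a name made of letters, digits, dots, underscores and hyphens, then `@`, then two or more dot-separated domain parts. It is not a full description of valid email syntax.

```python
import re

EMAIL = r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+"

def is_email(s):
    """True if the whole string `s` fits the simplified EMAIL pattern."""
    return re.fullmatch(EMAIL, s) is not None

for address in ["hearst@sims.berkeley.edu", "Wacky@yahoo.com", "Vip@ic-arda.gov",
                "not-an-address", "a@b"]:
    print(address, is_email(address))
```

The three addresses from the slide are accepted. The last two are rejected because one has no `@` and the other has no dot in its domain.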
10 Regexps and Scanners
Regular expressions are used to define the language recognized by the scanner for the parser.
We create rules in which names stand for regular expressions.
Example:
–digit: [0-9]
–letter: [A-Za-z]
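The idea of naming regular expressions and building a scanner from them can be sketched in a few lines of Python. The `digit` and `letter` rules come from the slide. The two token definitions (`IDENT`, `INT`), the `scan` function and the sample input are assumptions for illustration and do not describe any particular scanner generator.

```python
import re

# Named rules from the slide.
RULES = {"digit": r"[0-9]", "letter": r"[A-Za-z]"}

# Token definitions written in terms of the named rules (hypothetical tokens).
TOKENS = [
    ("IDENT", "{letter}(?:{letter}|{digit})*"),
    ("INT", "{digit}+"),
]

# Expand the names, then join the tokens into one pattern with a named group per token.
SCANNER = re.compile("|".join(
    "(?P<%s>%s)" % (name, body.format(**RULES)) for name, body in TOKENS
))

def scan(text):
    """Return (token name, lexeme) pairs; characters that no rule matches are skipped."""
    return [(m.lastgroup, m.group(0)) for m in SCANNER.finditer(text)]

print(scan("x1 = 42 + tmp2"))
# [('IDENT', 'x1'), ('INT', '42'), ('IDENT', 'tmp2')]
```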
11 Precedence of operators
What is the difference?
  letter letter | digit*
  letter (letter | digit)*
Regular Expression Operator Precedence
  ( )                  highest
  * + ? {}
  Sequences, anchors
  |                    lowest
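Because `|` has the lowest precedence, the first expression parses as "(letter letter) or (digit*)", while the parenthesized one is "a letter followed by any mix of letters and digits". The sketch below makes the difference visible in Python, assuming `letter` means `[A-Za-z]` and `digit` means `[0-9]`. The variable names and test strings are illustrative.

```python
import re

LETTER, DIGIT = r"[A-Za-z]", r"[0-9]"

# letter letter | digit*   parses as   (letter letter) | (digit*)
without_parens = LETTER + LETTER + "|" + DIGIT + "*"
# letter (letter | digit)*
with_parens = LETTER + "(" + LETTER + "|" + DIGIT + ")*"

# Columns: string, matches without parentheses, matches with parentheses
for s in ["ab", "123", "", "a1", "tmp2"]:
    print(repr(s), bool(re.fullmatch(without_parens, s)), bool(re.fullmatch(with_parens, s)))
```

Only "ab" is accepted by both. "123" and the empty string match just the first expression, and "a1" and "tmp2" match just the second.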
12 Practicing Regexps
Describe (in English) the language defined by each of the following regular expressions:
–letter (letter | digit*)
–digit digit* "." digit digit*
13 Three Equivalent Representations
Adapted from Jurafsky & Martin 2000
–Finite automata
–Regular expressions
–Regular languages
Each can describe the others
14 Finite Automata
An FA is similar to a compiler in that:
–A compiler recognizes legal programs in some (source) language.
–A finite-state machine recognizes legal strings in some language.
Example: Pascal Identifiers
–sequences of one or more letters or digits, starting with a letter:
[State graph: S --letter--> A; A loops on letter | digit]
15 Finite-Automata State Graphs
[Legend for the state-graph symbols]
–A state
–The start state
–An accepting state
–A transition (labeled with an input symbol, e.g. a)
16 Finite Automata
Transition: s1 --a--> s2
Is read: in state s1 on input “a” go to state s2
If end of input
–If in accepting state => accept
–Otherwise => reject
If no transition possible (got stuck) => reject
FSA = Finite State Automaton
17 Language defined by FSA
The language defined by an FSA is the set of strings accepted by the FSA.
–in the language of the FSA shown below: x, tmp2, XyZzy, position27.
–not in the language of the FSA shown below: 123, a?, 13apples.
[State graph: S --letter--> A; A loops on letter | digit]
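The identifier automaton and the accept/reject rules can be simulated directly with a transition table. The sketch below is one way to do it in Python. It assumes a start state S, an edge from S to A on a letter, and a loop on A for letters and digits, with only ASCII characters counted. The names `classify`, `TRANSITIONS` and `accepts` are illustrative.

```python
def classify(ch):
    """Map a character to the edge label used in the state graph (ASCII only, by assumption)."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"

# (state, edge label) -> next state. S is the start state; A is the accepting state.
TRANSITIONS = {
    ("S", "letter"): "A",
    ("A", "letter"): "A",
    ("A", "digit"): "A",
}
ACCEPTING = {"A"}

def accepts(s):
    """Run the automaton on `s` and report whether it ends in an accepting state."""
    state = "S"
    for ch in s:
        key = (state, classify(ch))
        if key not in TRANSITIONS:
            return False           # stuck: no transition possible => reject
        state = TRANSITIONS[key]
    return state in ACCEPTING      # end of input: accept only in an accepting state

for s in ["x", "tmp2", "XyZzy", "position27", "123", "a?", "13apples"]:
    print(s, accepts(s))
```

The first four strings are accepted and the last three are rejected, matching the lists on the slide.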
18 Example: Integer Literals
FSA that accepts integer literals with an optional + or - sign:
  /(\+|-)?[0-9]+/
Note: two different edges from S to A
[State graph: states S, A, B; edge labels +, -, digit]
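A table-driven simulation of the integer-literal automaton can be checked against the regular expression. The sketch below is in Python. The exact transitions are an assumption reconstructed from the regular expression and the note about two edges from S to A: "+" and "-" lead from S to A, a digit leads from S or A to B, and B loops on digits and is the only accepting state. The names `label` and `accepts_integer` are illustrative.

```python
import re

# (state, edge label) -> next state. S is the start state; B is accepting.
TRANSITIONS = {
    ("S", "+"): "A",      # first edge from S to A
    ("S", "-"): "A",      # second edge from S to A
    ("S", "digit"): "B",  # the sign is optional
    ("A", "digit"): "B",
    ("B", "digit"): "B",
}
ACCEPTING = {"B"}

def label(ch):
    """Collapse the ten digit characters into one edge label; other characters label themselves."""
    return "digit" if ch in "0123456789" else ch

def accepts_integer(s):
    state = "S"
    for ch in s:
        state = TRANSITIONS.get((state, label(ch)))
        if state is None:
            return False          # stuck => reject
    return state in ACCEPTING     # accept only if input ends in state B

INTEGER = re.compile(r"(\+|-)?[0-9]+")
for s in ["42", "+7", "-0", "+", "4-2", ""]:
    print(repr(s), accepts_integer(s), bool(INTEGER.fullmatch(s)))
```

The automaton and the regular expression agree on every sample: "42", "+7" and "-0" are accepted, while a lone sign, "4-2" and the empty string are rejected.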