Download presentation
Presentation is loading. Please wait.
1
CMSC 723: Intro to Computational Linguistics Lecture 2: February 4, 2004 Regular Expressions and Finite State Automata Professor Bonnie J. Dorr Dr. Nizar Habash TAs: Nitin Madnani and Nate Waisbrot
2
Regular Expressions and Finite State Automata REs: Language for specifying text strings Search for document containing a string –Searching for “woodchuck” Finite-state automata (FSA) (singular: automaton) How much wood would a woodchuck chuck if a woodchuck would chuck wood? –Searching for “woodchucks” with an optional final “s”
3
Regular Expressions Basic regular expression patterns Perl-based syntax (slightly different from other notations for regular expressions) Disjunctions /[wW]oodchuck/
4
Regular Expressions Ranges [A-Z] Negations [^Ss]
5
Regular Expressions Optional characters ?, * and + –? (0 or 1) /colou?r/ color or colour –* (0 or more) /oo*h!/ oh! or Ooh! or Ooooh! *+*+ Stephen Cole Kleene –+ (1 or more) /o+h!/ oh! or Ooh! or Ooooh! Wild cards. –/beg.n/ begin or began or begun
6
Regular Expressions Anchors ^ and $ –/^[A-Z]/ “Ramallah, Palestine” –/^[^A-Z]/ “¿verdad?” “really?” –/\.$/ “It is over.” –/.$/ ? Boundaries \b and \B –/\bon\b/ “on my way” “Monday” –/\Bon\b/ “automaton” Disjunction | –/yours|mine/ “it is either yours or mine”
7
Disjunction, Grouping, Precedence Column 1 Column 2 Column 3 … How do we express this? /Column [0-9]+ */ /(Column [0-9]+ *)*/ Precedence –Parenthesis () –Counters * + ? {} –Sequences and anchors the ^my end$ –Disjunction | REs are greedy!
8
Perl Commands While ($line= ){ if ($line =~ /the/){ print “MATCH: $line”; }
9
Writing correct expressions Exercise: write a Perl regular expression to match the English article “the”: /the/ /[tT]he/ /\b[tT]he\b/ /[^a-zA-Z][tT]he[^a-zA-Z]/ /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
10
A more complex example Exercise: Write a regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”: /$[0-9]+/ /$[0-9]+\.[0-9][0-9]/ /\b$[0-9]+(\.[0-9][0-9])?\b/ /\b$[0-9][0-9]?[0-9]?(\.[0-9][0-9])?\b/ /\b[0-9]+ *([MG]Hz|[Mm]egahertz| [Gg]igahertz)\b/ /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/ /\b[0-9](\.[0-9]+) *(Gb|[Gg]igabytes?)\b/
11
Advanced operators
12
Substitutions and Memory Substitutions Memory ( \1, \2, etc. refer back to matches) s/colour/color/ s/colour/color/g s/([0-9]+)/ / /the (.*)er they were, the \1er they will be/ /the (.*)er they (.*), the \1er they \2/ Substitute as many times as possible! Case insensitive matching s/colour/color/i
13
Eliza [Weizenbaum, 1966] User: Men are all alike ELIZA: IN WHAT WAY User: They’re always bugging us about something or other ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE? User: Well, my boyfriend made me come here ELIZA: YOUR BOYFRIEND MADE YOU COME HERE User: He says I’m depressed much of the time ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED
14
Eliza-style regular expressions s/.* YOU ARE (depressed|sad).*/I AM SORRY TO HEAR YOU ARE \1/ s/.* YOU ARE (depressed|sad).*/WHY DO YOU THINK YOU ARE \1/ s/.* all.*/IN WHAT WAY/ s/.* always.*/CAN YOU THINK OF A SPECIFIC EXAMPLE/ Step 1: replace first person references with second person references s/\bI(’m| am)\b/YOU ARE/g s/\bmy\b/YOUR/g S/\bmine\b/YOURS/g Step 2: use additional regular expressions to generate replies Step 3: use scores to rank possible transformations
15
Finite-state Automata Finite-state automata (FSA) Regular languages Regular expressions
16
Finite-state Automata (Machines) q0q0 q1q1 q2q2 q3q3 q4q4 baa! a /baa+!/ state transition final state baa! baaa! baaaa! baaaaa!...
17
Input Tape aba!b q0q0 0 1234 baa ! a REJECT
18
Input Tape baaa q0q0 q1q1 q2q2 q3q3 q3q3 q4q4 ! 0 1234 baa ! a ACCEPT
19
Finite-state Automata Q: a finite set of N states –Q = {q 0, q 1, q 2, q 3, q 4 } : a finite input alphabet of symbols – = {a, b, !} q 0 : the start state F: the set of final states –F = {q 4 } (q,i): transition function –Given state q and input symbol i, return new state q' – (q 3,!) q 4
20
State-transition Tables Input Stateba! 0100 1020 2030 3034 4:000
21
D-RECOGNIZE function D-RECOGNIZE (tape, machine) returns accept or reject index Beginning of tape current-state Initial state of machine loop if End of input has been reached then if current-state is an accept state then return accept else return reject elsif transition-table [current-state, tape[index]] is empty then return reject else current-state transition-table [current-state, tape[index]] index index + 1 end
22
Adding a failing state q0q0 q1q1 q2q2 q3q3 q4q4 baa! a qFqF a ! b ! b! b b a !
23
Adding an “all else” arc q0q0 q1q1 q2q2 q3q3 q4q4 baa! a qFqF = = = =
24
Languages and Automata Can use FSA as a generator as well as a recognizer Formal language L: defined by machine M that both generates and recognizes all and only the strings of that language. –L(M) = {baa!, baaa!, baaaa!, …} Regular languages vs. non-regular languages
25
Languages and Automata Deterministic vs. Non-deterministic FSAs Epsilon ( ) transitions
26
Using NFSAs to accept strings Backup: add markers at choice points, then possibly revisit unexplored arcs at marked choice point. Look-ahead: look ahead in input Parallelism: look at alternatives in parallel
27
Using NFSAs Input Stateba! 01000 10200 202,300 30040 40000
28
More about FSAs Transducers Equivalence of DFSAs and NFSAs Recognition as search: depth-first, breadth- search
29
Recognition using NFSAs
30
NFSA Recognition of “baaa!”
31
Breadth-first Recognition of “baaa!”
32
Regular languages Regular languages are characterized by FSAs For every NFSA, there is an equivalent DFSA. Regular languages are closed under concatenation, Kleene closure, union.
33
Concatenation
34
Kleene Closure
35
Union
36
Readings for next time J&M Chapter 3
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.