CMSC 723 / LING 645: Intro to Computational Linguistics September 8, 2004: Monz Regular Expressions and Finite State Automata (J&M 2) Prof. Bonnie J. Dorr Dr. Christof Monz TA: Adam Lee
Regular Expressions and Finite State Automata REs: Language for specifying text strings Search for document containing a string –Searching for “woodchuck” Finite-state automata (FSA) (singular: automaton) How much wood would a woodchuck chuck if a woodchuck would chuck wood? –Searching for “woodchucks” with an optional final “s”
Regular Expressions Basic regular expression patterns Perl-based syntax (slightly different from other notations for regular expressions) Disjunctions /[wW]oodchuck/
Regular Expressions Ranges [A-Z] Negations [^Ss]
Regular Expressions Optional characters ?, * and + –? (0 or 1) /colou?r/ color or colour –* (0 or more) /oo*h!/ oh! or Ooh! or Ooooh! *+*+ Stephen Cole Kleene –+ (1 or more) /o+h!/ oh! or Ooh! or Ooooh! Wild cards. - /beg.n/ begin or began or begun
Regular Expressions Anchors ^ and $ –/^[A-Z]/ “Ramallah, Palestine” –/^[^A-Z]/ “¿verdad?” “really?” –/\.$/ “It is over.” –/.$/ ? Boundaries \b and \B –/\bon\b/ “on my way” “Monday” –/\Bon\b/ “automaton” Disjunction | –/yours|mine/ “it is either yours or mine”
Disjunction, Grouping, Precedence Column 1 Column 2 Column 3 … How do we express this? /Column [0-9]+ */ /(Column [0-9]+ +)*/ Precedence –Parenthesis () –Counters * + ? {} –Sequences and anchors the ^my end$ –Disjunction | REs are greedy!
Perl Commands While ($line= ){ if ($line =~ /the/){ print “MATCH: $line”; }
Writing correct expressions Exercise: Write a Perl regular expression to match the English article “the”: /the/ /[tT]he/ /\b[tT]he\b/ /[^a-zA-Z][tT]he[^a-zA-Z]/ /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
A more complex example Exercise: Write a regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”: /$[0-9]+/ /$[0-9]+\.[0-9][0-9]/ /\b$[0-9]+(\.[0-9][0-9])?\b/ /\b$[0-9][0-9]?[0-9]?(\.[0-9][0-9])?\b/ /\b[0-9]+ *([MG]Hz|[Mm]egahertz| [Gg]igahertz)\b/ /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/ /\b[0-9](\.[0-9]+) *(Gb|[Gg]igabytes?)\b/
Advanced operators should be _
Substitutions and Memory Substitutions s/colour/color/ s/colour/color/g s/([Cc]olour)/$1olor/ /the (.*)er they were, the $1er they will be/ /the (.*)er they (.*), the $1er they $2/ Substitute as many times as possible! Case insensitive matching s/colour/color/i Memory ( $1, $2, etc. refer back to matches)
Eliza [Weizenbaum, 1966] User: Men are all alike ELIZA: IN WHAT WAY User: They’re always bugging us about something or other ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE? User: Well, my boyfriend made me come here ELIZA: YOUR BOYFRIEND MADE YOU COME HERE User: He says I’m depressed much of the time ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED
Eliza-style regular expressions s/.* YOU ARE (depressed|sad).*/I AM SORRY TO HEAR YOU ARE \1/ s/.* YOU ARE (depressed|sad).*/WHY DO YOU THINK YOU ARE \1/ s/.* all.*/IN WHAT WAY/ s/.* always.*/CAN YOU THINK OF A SPECIFIC EXAMPLE/ Step 1: replace first person with second person references s/\bI(’m| am)\b /YOU ARE/g s/\bmy\b /YOUR/g S/\bmine\b /YOURS/g Step 2: use additional regular expressions to generate replies Step 3: use scores to rank possible transformations
Finite-state Automata Finite-state automata (FSA) Regular languages Regular expressions
Finite-state Automata (Machines) /^baa+!$/ q0q0 q1q1 q2q2 q3q3 q4q4 baa! a state transition final state baa! baaa! baaaa! baaaaa!...
Input Tape aba!b q0q baa ! a REJECT
Input Tape baaa q0q0 q1q1 q2q2 q3q3 q3q3 q4q4 ! baa ! a ACCEPT
Finite-state Automata Q: a finite set of N states –Q = {q 0, q 1, q 2, q 3, q 4 } : a finite input alphabet of symbols – = {a, b, !} q 0 : the start state F: the set of final states –F = {q 4 } (q,i): transition function –Given state q and input symbol i, return new state q' – (q 3,!) q 4
State-transition Tables Input Stateba! 01ØØ 1Ø2Ø 2Ø3Ø 3Ø34 4:ØØØ
D-RECOGNIZE function D-RECOGNIZE (tape, machine) returns accept or reject index Beginning of tape current-state Initial state of machine loop if End of input has been reached then if current-state is an accept state then return accept else return reject elsif transition-table [current-state, tape[index]] is empty then return reject else current-state transition-table [current-state, tape[index]] index index + 1 end
Adding a failing state q0q0 q1q1 q2q2 q3q3 q4q4 baa! a qFqF a ! b ! b! b b a !
Adding an “all else” arc q0q0 q1q1 q2q2 q3q3 q4q4 baa! a qFqF = = = =
Languages and Automata Can use FSA as a generator as well as a recognizer Formal language L: defined by machine M that both generates and recognizes all and only the strings of that language. –L(M) = {baa!, baaa!, baaaa!, …} Regular languages vs. non-regular languages
Languages and Automata Deterministic vs. Non-deterministic FSAs Epsilon ( ) transitions
Using NFSAs to accept strings Backup: add markers at choice points, then possibly revisit unexplored arcs at marked choice point. Look-ahead: look ahead in input Parallelism: look at alternatives in parallel
Using NFSAs Input Stateba! 01ØØØ 1Ø2ØØ 2Ø2,3ØØ 3ØØ4Ø 4:ØØØØ
Readings for next time J&M Chapter 3