Presentation is loading. Please wait.

Presentation is loading. Please wait.

CMSC 723: Intro to Computational Linguistics Lecture 2: February 4, 2004 Regular Expressions and Finite State Automata Professor Bonnie J. Dorr Dr. Nizar.

Similar presentations


Presentation on theme: "CMSC 723: Intro to Computational Linguistics Lecture 2: February 4, 2004 Regular Expressions and Finite State Automata Professor Bonnie J. Dorr Dr. Nizar."— Presentation transcript:

1 CMSC 723: Intro to Computational Linguistics Lecture 2: February 4, 2004 Regular Expressions and Finite State Automata Professor Bonnie J. Dorr Dr. Nizar Habash TAs: Nitin Madnani and Nate Waisbrot

2 Regular Expressions and Finite State Automata REs: Language for specifying text strings Search for document containing a string –Searching for “woodchuck” Finite-state automata (FSA) (singular: automaton) How much wood would a woodchuck chuck if a woodchuck would chuck wood? –Searching for “woodchucks” with an optional final “s”

3 Regular Expressions Basic regular expression patterns Perl-based syntax (slightly different from other notations for regular expressions) Disjunctions /[wW]oodchuck/

4 Regular Expressions Ranges [A-Z] Negations [^Ss]

5 Regular Expressions Optional characters ?, * and + –? (0 or 1) /colou?r/  color or colour –* (0 or more) /oo*h!/  oh! or Ooh! or Ooooh! *+*+ Stephen Cole Kleene –+ (1 or more) /o+h!/  oh! or Ooh! or Ooooh! Wild cards. –/beg.n/  begin or began or begun

6 Regular Expressions Anchors ^ and $ –/^[A-Z]/  “Ramallah, Palestine” –/^[^A-Z]/  “¿verdad?” “really?” –/\.$/  “It is over.” –/.$/  ? Boundaries \b and \B –/\bon\b/  “on my way” “Monday” –/\Bon\b/  “automaton” Disjunction | –/yours|mine/  “it is either yours or mine”

7 Disjunction, Grouping, Precedence Column 1 Column 2 Column 3 … How do we express this? /Column [0-9]+ */ /(Column [0-9]+ *)*/ Precedence –Parenthesis () –Counters * + ? {} –Sequences and anchors the ^my end$ –Disjunction | REs are greedy!

8 Perl Commands While ($line= ){ if ($line =~ /the/){ print “MATCH: $line”; }

9 Writing correct expressions Exercise: write a Perl regular expression to match the English article “the”: /the/ /[tT]he/ /\b[tT]he\b/ /[^a-zA-Z][tT]he[^a-zA-Z]/ /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/

10 A more complex example Exercise: Write a regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”: /$[0-9]+/ /$[0-9]+\.[0-9][0-9]/ /\b$[0-9]+(\.[0-9][0-9])?\b/ /\b$[0-9][0-9]?[0-9]?(\.[0-9][0-9])?\b/ /\b[0-9]+ *([MG]Hz|[Mm]egahertz| [Gg]igahertz)\b/ /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/ /\b[0-9](\.[0-9]+) *(Gb|[Gg]igabytes?)\b/

11 Advanced operators

12 Substitutions and Memory Substitutions Memory ( \1, \2, etc. refer back to matches) s/colour/color/ s/colour/color/g s/([0-9]+)/ / /the (.*)er they were, the \1er they will be/ /the (.*)er they (.*), the \1er they \2/ Substitute as many times as possible! Case insensitive matching s/colour/color/i

13 Eliza [Weizenbaum, 1966] User: Men are all alike ELIZA: IN WHAT WAY User: They’re always bugging us about something or other ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE? User: Well, my boyfriend made me come here ELIZA: YOUR BOYFRIEND MADE YOU COME HERE User: He says I’m depressed much of the time ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED

14 Eliza-style regular expressions s/.* YOU ARE (depressed|sad).*/I AM SORRY TO HEAR YOU ARE \1/ s/.* YOU ARE (depressed|sad).*/WHY DO YOU THINK YOU ARE \1/ s/.* all.*/IN WHAT WAY/ s/.* always.*/CAN YOU THINK OF A SPECIFIC EXAMPLE/ Step 1: replace first person references with second person references s/\bI(’m| am)\b/YOU ARE/g s/\bmy\b/YOUR/g S/\bmine\b/YOURS/g Step 2: use additional regular expressions to generate replies Step 3: use scores to rank possible transformations

15 Finite-state Automata Finite-state automata (FSA) Regular languages Regular expressions

16 Finite-state Automata (Machines) q0q0 q1q1 q2q2 q3q3 q4q4 baa! a /baa+!/ state transition final state baa! baaa! baaaa! baaaaa!...

17 Input Tape aba!b q0q0 0 1234 baa ! a REJECT

18 Input Tape baaa q0q0 q1q1 q2q2 q3q3 q3q3 q4q4 ! 0 1234 baa ! a ACCEPT

19 Finite-state Automata Q: a finite set of N states –Q = {q 0, q 1, q 2, q 3, q 4 }  : a finite input alphabet of symbols –  = {a, b, !} q 0 : the start state F: the set of final states –F = {q 4 }  (q,i): transition function –Given state q and input symbol i, return new state q' –  (q 3,!)  q 4

20 State-transition Tables Input Stateba! 0100 1020 2030 3034 4:000

21 D-RECOGNIZE function D-RECOGNIZE (tape, machine) returns accept or reject index  Beginning of tape current-state  Initial state of machine loop if End of input has been reached then if current-state is an accept state then return accept else return reject elsif transition-table [current-state, tape[index]] is empty then return reject else current-state  transition-table [current-state, tape[index]] index  index + 1 end

22 Adding a failing state q0q0 q1q1 q2q2 q3q3 q4q4 baa! a qFqF a ! b ! b! b b a !

23 Adding an “all else” arc q0q0 q1q1 q2q2 q3q3 q4q4 baa! a qFqF = = = =

24 Languages and Automata Can use FSA as a generator as well as a recognizer Formal language L: defined by machine M that both generates and recognizes all and only the strings of that language. –L(M) = {baa!, baaa!, baaaa!, …} Regular languages vs. non-regular languages

25 Languages and Automata Deterministic vs. Non-deterministic FSAs Epsilon (  ) transitions

26 Using NFSAs to accept strings Backup: add markers at choice points, then possibly revisit unexplored arcs at marked choice point. Look-ahead: look ahead in input Parallelism: look at alternatives in parallel

27 Using NFSAs Input Stateba!  01000 10200 202,300 30040 40000

28 More about FSAs Transducers Equivalence of DFSAs and NFSAs Recognition as search: depth-first, breadth- search

29 Recognition using NFSAs

30 NFSA Recognition of “baaa!”

31 Breadth-first Recognition of “baaa!”

32 Regular languages Regular languages are characterized by FSAs For every NFSA, there is an equivalent DFSA. Regular languages are closed under concatenation, Kleene closure, union.

33 Concatenation

34 Kleene Closure

35 Union

36 Readings for next time J&M Chapter 3


Download ppt "CMSC 723: Intro to Computational Linguistics Lecture 2: February 4, 2004 Regular Expressions and Finite State Automata Professor Bonnie J. Dorr Dr. Nizar."

Similar presentations


Ads by Google