CMSC 723 / LING 645: Intro to Computational Linguistics September 8, 2004: Monz Regular Expressions and Finite State Automata (J&M 2) Prof. Bonnie J. Dorr.

Slides:



Advertisements
Similar presentations
Regular expressions Day 2
Advertisements

CS 208: Computing Theory Assoc. Prof. Dr. Brahim Hnich Faculty of Computer Sciences Izmir University of Economics.
1 Regular Expressions and Automata September Lecture #2.
CS 4705 Regular Expressions and Automata in Natural Language Analysis CS 4705 Julia Hirschberg.
Finite-state automata 2 Day 13 LING Computational Linguistics Harry Howard Tulane University.
Chapter Section Section Summary Set of Strings Finite-State Automata Language Recognition by Finite-State Machines Designing Finite-State.
Regular Expressions (RE) Used for specifying text search strings. Standarized and used widely (UNIX: vi, perl, grep. Microsoft Word and other text editors…)
1 Regular Expressions and Automata September Lecture #2-2.
Finite-State Automata Shallow Processing Techniques for NLP Ling570 October 5, 2011.
CS5371 Theory of Computation
CS 4705 Some slides adapted from Hirschberg, Dorr/Monz, Jurafsky.
Finite Automata Finite-state machine with no output. FA consists of States, Transitions between states FA is a 5-tuple Example! A string x is recognized.
Natural Language Processing (NLP) Overview and history of the field Knowledge of language The role of ambiguity Models and Algorithms Eliza, Turing, and.
1 Lecture 13 FSA’s –Defining FSA’s –Computing with FSA’s Defining L(M) –Defining language class LFSA.
Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst.
Fall 2005 Lecture Notes #4 EECS 595 / LING 541 / SI 661 Natural Language Processing.
Computational Language Finite State Machines and Regular Expressions.
1 Foundations of Software Design Lecture 23: Finite Automata and Context-Free Grammars Marti Hearst Fall 2002.
CS 4705 Regular Expressions and Automata in Natural Language Analysis CS 4705 Julia Hirschberg.
LING 388: Language and Computers Sandiway Fong Lecture 10: 9/26.
1 Foundations of Software Design Lecture 22: Regular Expressions and Finite Automata Marti Hearst Fall 2002.
CS 4705 Some slides adapted from Hirschberg, Dorr/Monz, Jurafsky.
CSC 361Finite Automata1. CSC 361Finite Automata2 Formal Specification of Languages Generators Grammars Context-free Regular Regular Expressions Recognizers.
1.Defs. a)Finite Automaton: A Finite Automaton ( FA ) has finite set of ‘states’ ( Q={q 0, q 1, q 2, ….. ) and its ‘control’ moves from state to state.
Regular Expressions and Automata Chapter 2. Regular Expressions Standard notation for characterizing text sequences Used in all kinds of text processing.
Grammars, Languages and Finite-state automata Languages are described by grammars We need an algorithm that takes as input grammar sentence And gives a.
CMSC 723: Intro to Computational Linguistics Lecture 2: February 4, 2004 Regular Expressions and Finite State Automata Professor Bonnie J. Dorr Dr. Nizar.
Topics Non-Determinism (NFSAs) Recognition of NFSAs Proof that regular expressions = FSAs Very brief sketch: Morphology, FSAs, FSTs Very brief sketch:
CMSC 723 / LING 645: Intro to Computational Linguistics September 29, 2004: Dorr Toward a Kimmo FAQ Prof. Bonnie J. Dorr Dr. Christof Monz TA: Adam Lee.
Chapter 2: Finite-State Machines Heshaam Faili University of Tehran.
CPSC 388 – Compiler Design and Construction Scanners – Finite State Automata.
Finite-State Machines with No Output Longin Jan Latecki Temple University Based on Slides by Elsa L Gunter, NJIT, and by Costas Busch Costas Busch.
Finite-State Machines with No Output
Fall 2004 Lecture Notes #2 EECS 595 / LING 541 / SI 661 Natural Language Processing.
CMSC 723 / LING 723: Computational Linguistics I September 5, 2007: Dorr Part I: MT (cont), MT Evaluation (J & M, 24) Part II: Reg. Expressions, FSAutomata.
1 i206: Lecture 18: Regular Expressions Marti Hearst Spring 2012.
Chapter 2. Regular Expressions and Automata From: Chapter 2 of An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition,
March 1, 2009 Dr. Muhammed Al-mulhem 1 ICS 482 Natural Language Processing Regular Expression and Finite Automata Muhammed Al-Mulhem March 1, 2009.
Introduction to CS Theory Lecture 3 – Regular Languages Piotr Faliszewski
LING/C SC/PSYC 438/538 Lecture 7 9/15 Sandiway Fong.
4b 4b Lexical analysis Finite Automata. Finite Automata (FA) FA also called Finite State Machine (FSM) –Abstract model of a computing entity. –Decides.
1 Regular Expressions and Automata CPE 641 Natural Language Processing from Kathy McCoy’s slides, CISC 882 Introduction to NLP
Basic Text Processing Regular Expressions. Dan Jurafsky 2 The original slides from: tml Some changes.
1 CD5560 FABER Formal Languages, Automata and Models of Computation Lecture 3 Mälardalen University 2010.
2. Regular Expressions and Automata 2007 년 3 월 31 일 인공지능 연구실 이경택 Text: Speech and Language Processing Page.33 ~ 56.
LING 388: Language and Computers Sandiway Fong 9/27 Lecture 10.
Finite-state automata Day 12 LING Computational Linguistics Harry Howard Tulane University.
1 LING 6932 Spring 2007 LING 6932 Topics in Computational Linguistics Hana Filip Lecture 2: Regular Expressions, Finite State Automata.
1 Regular Expressions and Automata August Lecture #2.
CS 4705 Some slides adapted from Hirschberg, Dorr/Monz, Jurafsky.
CMSC 330: Organization of Programming Languages Theory of Regular Expressions Finite Automata.
Natural Language Processing Lecture 4 : Regular Expressions and Automata.
CS 4705 Lecture 2 Regular Expressions and Automata.
September1999 CMSC 203 / 0201 Fall 2002 Week #15 – 2/4/6 December 2002 Prof. Marie desJardins.
Search and Decoding in Speech Recognition Regular Expressions and Automata.
UNIT - I Formal Language and Regular Expressions: Languages Definition regular expressions Regular sets identity rules. Finite Automata: DFA NFA NFA with.
Nondeterministic Finite Automata (NFAs). Reminder: Deterministic Finite Automata (DFA) q For every state q in Q and every character  in , one and only.
BİL711 Natural Language Processing1 Regular Expressions & FSAs Any regular expression can be realized as a finite state automaton (FSA) There are two kinds.
Deterministic Finite Automata Nondeterministic Finite Automata.
Theory of Computation Automata Theory Dr. Ayman Srour.
Korea Maritime and Ocean University NLP Jung Tae LEE
WELCOME TO A JOURNEY TO CS419 Dr. Hussien Sharaf Dr. Mohammad Nassef Department of Computer Science, Faculty of Computers and Information, Cairo University.
/208/.
Compilers Welcome to a journey to CS419 Lecture5: Lexical Analysis:
CSC NLP - Regex, Finite State Automata
CSE322 Definition and description of finite Automata
Non Deterministic Automata
CPSC 503 Computational Linguistics
Non Deterministic Automata
Natural Language Processing (NLP)
Presentation transcript:

CMSC 723 / LING 645: Intro to Computational Linguistics September 8, 2004: Monz Regular Expressions and Finite State Automata (J&M 2) Prof. Bonnie J. Dorr Dr. Christof Monz TA: Adam Lee

Regular Expressions and Finite State Automata  REs: Language for specifying text strings  Search for document containing a string –Searching for “woodchuck” Finite-state automata (FSA) (singular: automaton) How much wood would a woodchuck chuck if a woodchuck would chuck wood? –Searching for “woodchucks” with an optional final “s”

Regular Expressions  Basic regular expression patterns  Perl-based syntax (slightly different from other notations for regular expressions)  Disjunctions /[wW]oodchuck/

Regular Expressions  Ranges [A-Z]  Negations [^Ss]

Regular Expressions  Optional characters ?, * and + –? (0 or 1) /colou?r/  color or colour –* (0 or more) /oo*h!/  oh! or Ooh! or Ooooh! *+*+ Stephen Cole Kleene –+ (1 or more) /o+h!/  oh! or Ooh! or Ooooh!  Wild cards. - /beg.n/  begin or began or begun

Regular Expressions  Anchors ^ and $ –/^[A-Z]/  “Ramallah, Palestine” –/^[^A-Z]/  “¿verdad?” “really?” –/\.$/  “It is over.” –/.$/  ?  Boundaries \b and \B –/\bon\b/  “on my way” “Monday” –/\Bon\b/  “automaton”  Disjunction | –/yours|mine/  “it is either yours or mine”

Disjunction, Grouping, Precedence  Column 1 Column 2 Column 3 … How do we express this? /Column [0-9]+ */ /(Column [0-9]+ +)*/  Precedence –Parenthesis () –Counters * + ? {} –Sequences and anchors the ^my end$ –Disjunction |  REs are greedy!

Perl Commands While ($line= ){ if ($line =~ /the/){ print “MATCH: $line”; }

Writing correct expressions Exercise: Write a Perl regular expression to match the English article “the”: /the/ /[tT]he/ /\b[tT]he\b/ /[^a-zA-Z][tT]he[^a-zA-Z]/ /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/

A more complex example Exercise: Write a regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”: /$[0-9]+/ /$[0-9]+\.[0-9][0-9]/ /\b$[0-9]+(\.[0-9][0-9])?\b/ /\b$[0-9][0-9]?[0-9]?(\.[0-9][0-9])?\b/ /\b[0-9]+ *([MG]Hz|[Mm]egahertz| [Gg]igahertz)\b/ /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/ /\b[0-9](\.[0-9]+) *(Gb|[Gg]igabytes?)\b/

Advanced operators should be _

Substitutions and Memory  Substitutions s/colour/color/ s/colour/color/g s/([Cc]olour)/$1olor/ /the (.*)er they were, the $1er they will be/ /the (.*)er they (.*), the $1er they $2/ Substitute as many times as possible! Case insensitive matching s/colour/color/i  Memory ( $1, $2, etc. refer back to matches)

Eliza [Weizenbaum, 1966] User: Men are all alike ELIZA: IN WHAT WAY User: They’re always bugging us about something or other ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE? User: Well, my boyfriend made me come here ELIZA: YOUR BOYFRIEND MADE YOU COME HERE User: He says I’m depressed much of the time ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED

Eliza-style regular expressions s/.* YOU ARE (depressed|sad).*/I AM SORRY TO HEAR YOU ARE \1/ s/.* YOU ARE (depressed|sad).*/WHY DO YOU THINK YOU ARE \1/ s/.* all.*/IN WHAT WAY/ s/.* always.*/CAN YOU THINK OF A SPECIFIC EXAMPLE/ Step 1: replace first person with second person references s/\bI(’m| am)\b /YOU ARE/g s/\bmy\b /YOUR/g S/\bmine\b /YOURS/g Step 2: use additional regular expressions to generate replies Step 3: use scores to rank possible transformations

Finite-state Automata  Finite-state automata (FSA)  Regular languages  Regular expressions

Finite-state Automata (Machines) /^baa+!$/ q0q0 q1q1 q2q2 q3q3 q4q4 baa! a state transition final state baa! baaa! baaaa! baaaaa!...

Input Tape aba!b q0q baa ! a REJECT

Input Tape baaa q0q0 q1q1 q2q2 q3q3 q3q3 q4q4 ! baa ! a ACCEPT

Finite-state Automata  Q: a finite set of N states –Q = {q 0, q 1, q 2, q 3, q 4 }   : a finite input alphabet of symbols –  = {a, b, !}  q 0 : the start state  F: the set of final states –F = {q 4 }   (q,i): transition function –Given state q and input symbol i, return new state q' –  (q 3,!)  q 4

State-transition Tables Input Stateba! 01ØØ 1Ø2Ø 2Ø3Ø 3Ø34 4:ØØØ

D-RECOGNIZE function D-RECOGNIZE (tape, machine) returns accept or reject index  Beginning of tape current-state  Initial state of machine loop if End of input has been reached then if current-state is an accept state then return accept else return reject elsif transition-table [current-state, tape[index]] is empty then return reject else current-state  transition-table [current-state, tape[index]] index  index + 1 end

Adding a failing state q0q0 q1q1 q2q2 q3q3 q4q4 baa! a qFqF a ! b ! b! b b a !

Adding an “all else” arc q0q0 q1q1 q2q2 q3q3 q4q4 baa! a qFqF = = = =

Languages and Automata  Can use FSA as a generator as well as a recognizer  Formal language L: defined by machine M that both generates and recognizes all and only the strings of that language. –L(M) = {baa!, baaa!, baaaa!, …}  Regular languages vs. non-regular languages

Languages and Automata  Deterministic vs. Non-deterministic FSAs  Epsilon (  ) transitions

Using NFSAs to accept strings  Backup: add markers at choice points, then possibly revisit unexplored arcs at marked choice point.  Look-ahead: look ahead in input  Parallelism: look at alternatives in parallel

Using NFSAs Input Stateba!  01ØØØ 1Ø2ØØ 2Ø2,3ØØ 3ØØ4Ø 4:ØØØØ

Readings for next time  J&M Chapter 3