1 LING 6932 Spring 2007 LING 6932 Topics in Computational Linguistics Hana Filip Lecture 2: Regular Expressions, Finite State Automata.

Presentation transcript:

1 LING 6932 Spring 2007 LING 6932 Topics in Computational Linguistics Hana Filip Lecture 2: Regular Expressions, Finite State Automata

2 LING 6932 Spring 2007 Regular expressions: formulas for specifying text strings. How can we search for any of these strings? woodchuck, woodchucks, Woodchuck, Woodchucks. Figure from Dorr/Monz slides

3 LING 6932 Spring 2007 Regular Expressions Basic patterns of regular expressions, in Perl-based syntax (slightly different from other notations for regular expressions, as used in UNIX, for example). /Woodchuck/ matches any string containing the substring Woodchuck (for example, every line containing it, if your search application returns entire lines). The ‘/’ delimiters are Perl notation, NOT part of the RE. Example Google result: “Woodchuck Draft Cider: Producers of Woodchuck Draft Cider in Springfield, VT”. Slide from Dorr/Monz

4 LING 6932 Spring 2007 Regular Expressions Regular expressions are CASE SENSITIVE: the pattern /woodchuck/ will not match the string Woodchuck. Disjunction with square brackets: /[wW]oodchuck/ matches woodchuck or Woodchuck. Slide from Dorr/Monz

5 LING 6932 Spring 2007 Regular Expressions Ranges: /[A-Z]/ matches any single uppercase letter from A to Z. Slide from Dorr/Monz

6 LING 6932 Spring 2007 Regular Expressions Negation: /[^a]/ (^: caret) ‘match any single character except a’. Slide from Dorr/Monz

7 LING 6932 Spring 2007 Regular Expressions Operators ?, * and +, each related to the immediately preceding character or regular expression. ? (0 or 1): /woodchucks?/ → woodchuck or woodchucks; /colou?r/ → color or colour. * (0 or more): /oo*h!/ → oh! or ooh! or ooooh! + (1 or more): /o+h!/ → oh! or ooh! or ooooh! The operators * and + are known as the Kleene star and Kleene plus, after Stephen Cole Kleene. Wild card . : /beg.n/ → begin or began or begun; the . matches any single character between beg and n (except a newline). Slide from Dorr/Monz
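
The operators on this slide can be tried out directly. The sketch below uses Python's `re` module, whose syntax is Perl-like but not identical to Perl's, so treat it as an illustration and not as the slide's own notation. The helper name `matches` is made up for this example.

```python
import re

# The slide's patterns, compiled once. Python has no /.../ delimiters;
# a raw string (r"...") plays that role here.
optional = re.compile(r"woodchucks?")  # ? : zero or one "s"
star = re.compile(r"oo*h!")            # * : zero or more of the preceding "o"
plus = re.compile(r"o+h!")             # + : one or more "o"
wildcard = re.compile(r"beg.n")        # . : any single character except a newline


def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None


for pattern, text in [(optional, "woodchucks"), (star, "ooooh!"),
                      (plus, "h!"), (wildcard, "begun")]:
    print(pattern.pattern, repr(text), matches(pattern, text))
```

Note that /oo*h!/ and /o+h!/ accept the same strings: both need at least one o before h!.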

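8 LING 6932 Spring 2007 Regular Expressions Anchors ^ and $. Start of line: /^[A-Z]/ → “Ramallah, Palestine”; /^[^A-Z]/ → “¿verdad?”, “really?”. End of line: /\.$/ → “It is over.”; /.$/ → ? (here the unescaped . is the wild card, so the pattern matches whatever character ends the line). Boundaries \b and \B: /\bon\b/ → matches “on my way” but not “Monday” (a word boundary is required on both sides); /\Bon\b/ → “automaton” (a non-boundary is required before on). Slide from Dorr/Monz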
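
A quick way to check the anchor and boundary examples is to run them through a search. This is a minimal sketch in Python's `re` module (an assumption; the slides use Perl notation), and `found` is a hypothetical helper name.

```python
import re


def found(pattern, text):
    """Return True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None


# Anchors: ^ is the start of the line, $ the end.
print(found(r"^[A-Z]", "Ramallah, Palestine"))  # starts with a capital
print(found(r"^[^A-Z]", "really?"))             # starts with a non-capital
print(found(r"\.$", "It is over."))             # ends with a literal period

# Boundaries: \b is a word boundary, \B a non-boundary.
print(found(r"\bon\b", "on my way"))   # "on" as a separate word
print(found(r"\bon\b", "Monday"))      # "on" buried inside a word
print(found(r"\Bon\b", "automaton"))   # "on" ending a longer word
```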

9 LING 6932 Spring 2007 Disjunction, Grouping, Precedence Disjunction |: /yours|mine/ → “it is either yours or mine”; /gupp(y|ies)/ → “guppy” or “guppies”. Grouping: how do we express the repeated pattern “Column 1 Column 2 Column 3 …”? /Column [0-9] */ matches the word Column, followed by one number, followed by zero or more spaces (the slide marks each space with a visible ‘space’ symbol, which is NOT a RE character). /(Column [0-9] *)*/ matches the whole pattern repeated any number of times (zero or more times). Slide from Dorr/Monz

10 LING 6932 Spring 2007 Disjunction, Grouping, Precedence Operator Precedence Hierarchy, from highest to lowest: (1) Parentheses: (); (2) Counters: * + ?; (3) Sequences and anchors: the ^my end$; (4) Disjunction: |. REs are greedy! They always match the largest string they can. Slide from Dorr/Monz
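
Grouping, precedence and greediness are easier to see when the patterns are run. The following is a sketch in Python's `re` module (assumed here as a stand-in for Perl); the helper `whole` and the variable names are made up for illustration.

```python
import re


def whole(pattern, text):
    """Return True if pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


row = "Column 1 Column 2 Column 3"
one_column = r"Column [0-9] *"       # the * repeats only the space
many_columns = r"(Column [0-9] *)*"  # the * repeats the parenthesized group

print(whole(one_column, row))
print(whole(many_columns, row))

# Precedence: sequence binds tighter than |, so guppy|ies means
# "guppy" or "ies", while gupp(y|ies) means "guppy" or "guppies".
print(whole(r"gupp(y|ies)", "guppies"))
print(whole(r"guppy|ies", "guppies"))

# Greediness: [a-z]* could match the empty string at position 0,
# but it takes as many letters as it can.
greedy = re.search(r"[a-z]*", "once upon a time").group(0)
print(greedy)
```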

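11 LING 6932 Spring 2007 Example Find me all instances of the word “the” in a text. /the/ misses capitalized examples. /[tT]he/ also returns “other” or “theology”. /\b[tT]he\b/ matches only the separate word “the” or “The”, but because \b counts underscores and digits as word characters, it does not find “the” in “the_” or “the25”. /[^a-zA-Z][tT]he[^a-zA-Z]/ requires a non-letter on each side, so it matches “the_” or “the25”, but misses “the” at the start of a line. /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/ also allows the start of the line before “the”.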
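
The successive refinements can be compared on a small test sentence. This is a sketch using Python's `re` module (an assumption, since the slide's notation is Perl); the sample sentence and the helper `count` are invented for the example.

```python
import re

sample = "The cat saw the other cat."


def count(pattern, text):
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))


print(count(r"the", sample))          # misses "The", wrongly hits "other"
print(count(r"[tT]he", sample))       # finds "The" but still hits "other"
print(count(r"\b[tT]he\b", sample))   # only the two separate words
print(count(r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]", sample))

# \b treats "_" and digits as word characters, the letter classes do not.
word_boundary = r"\b[tT]he\b"
non_letter = r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]"
print(re.search(word_boundary, "the25") is None)
print(re.search(non_letter, "the25") is not None)
```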

12 LING 6932 Spring 2007 Errors The process we just went through was based on fixing two kinds of errors: not matching things that we should have matched (The), which are false negatives; and matching strings that we should not have matched (there, then, other), which are false positives.

13 LING 6932 Spring 2007 Errors cont. We’ll be telling the same story for many tasks Reducing the error rate for an application often involves two antagonistic efforts: Increasing accuracy (minimizing false positives) Increasing coverage (minimizing false negatives).

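14 LING 6932 Spring 2007 More complex RE example Regular expressions for prices. /$[0-9]+/ doesn’t deal with fractions of dollars. /$[0-9]+\.[0-9][0-9]/ doesn’t allow $199 and is not anchored at a word boundary. /\b$[0-9]+(\.[0-9][0-9])?\b/ makes the cents optional. Note that in most regex engines a literal dollar sign has to be written \$, since a bare $ is the end-of-line anchor.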
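
Here is one way the price pattern might look in a concrete engine. The sketch assumes Python's `re` module, where the dollar sign must be escaped as `\$`. It also drops the leading `\b`: there is no word boundary between a space and `$`, because both are non-word characters, so `\b\$` would fail on ordinary text. The names `PRICE` and `find_price` are made up for the example.

```python
import re

# Dollar sign, one or more digits, optionally a period and two digits,
# then a word boundary so the match does not stop in the middle of a number.
PRICE = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")


def find_price(text):
    """Return the first price found in text, or None if there is none."""
    match = PRICE.search(text)
    return match.group(0) if match else None


print(find_price("It costs $199 today"))
print(find_price("Only $24.50!"))
print(find_price("no price here"))

# With a leading \b the pattern misses a price that follows a space.
print(re.search(r"\b\$[0-9]+", "costs $199"))
```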

15 LING 6932 Spring 2007 Advanced operators Regular expression operators for counting (RE: Match). {n}: exactly n occurrences of the previous character or expression. {n,m}: from n to m occurrences of the previous character or expression. {n,}: at least n occurrences of the previous character or expression. Example: /a\.{24}z/ matches a followed by 24 dots followed by z.
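
The counting operators can be checked with a few whole-string matches. This is a small sketch in Python's `re` module (assumed as a stand-in for Perl); the digit patterns are extra illustrations that are not on the slide, and `whole` is a hypothetical helper.

```python
import re


def whole(pattern, text):
    """Return True if pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


dots_24 = "a" + "." * 24 + "z"
dots_23 = "a" + "." * 23 + "z"

print(whole(r"a\.{24}z", dots_24))  # exactly 24 literal dots
print(whole(r"a\.{24}z", dots_23))  # one dot short

print(whole(r"[0-9]{2,4}", "123"))    # from 2 to 4 digits
print(whole(r"[0-9]{2,4}", "12345"))  # too many digits
print(whole(r"[0-9]{2,}", "12345"))   # at least 2 digits
```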

16 LING 6932 Spring 2007 Advanced operators To refer to characters that are special themselves, precede them with a backslash (RE: Match; Example Strings Matched). \*: an asterisk “*”; “K*A*P*L*A*N”. \.: a period “.”; “Dr. Livingston, I presume.” \?: a question mark “?”; “Would you light my candle?” \n: a newline. \t: a tab.

17 LING 6932 Spring 2007 Advanced operators Slide from Dorr/Monz

18 LING 6932 Spring 2007 Substitutions and Memory Substitution operator s/regexp1/regexp2/ (UNIX, Perl) s/colour/color/ s/colour/color/g Substitute as many times as possible! Case insensitive matching s/colour/color/i Slide from Dorr/Monz
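
For readers working outside Perl, the three substitution variants map fairly directly onto Python's `re.sub`; the sketch below assumes that module. One difference to keep in mind: Perl's `s///` replaces only the first match unless `g` is given, while `re.sub` replaces every match unless `count` limits it. The sample text is invented.

```python
import re

text = "Colour, colour, colour"

# s/colour/color/   : replace the first (case-sensitive) match only
first_only = re.sub(r"colour", "color", text, count=1)

# s/colour/color/g  : replace as many times as possible
every_match = re.sub(r"colour", "color", text)

# s/colour/color/i  : case-insensitive, first match only
ignore_case = re.sub(r"colour", "color", text, count=1, flags=re.IGNORECASE)

print(first_only)
print(every_match)
print(ignore_case)
```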

19 LING 6932 Spring 2007 Substitutions and Memory Substitutions “the Xer they were, the Xer they will be”: constrain the two X’s to be the same string. /the (.*)er they were, the \1er they will be/ /the (.*)er they (.*), the \1er they \2/ Using numbered memories or registers: \1, \2, etc. refer back to earlier parenthesized matches within the pattern (Perl exposes the same matches as $1, $2, etc. once the match has succeeded). An extended feature of regular expressions. Slide from Dorr/Monz
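
The effect of a numbered memory is that the second X must repeat whatever the first group captured. A minimal sketch, assuming Python's `re` module, where a back-reference inside the pattern is written `\1`; the example sentences and variable names are invented.

```python
import re

same_x = re.compile(r"the (.*)er they were, the \1er they will be")
two_groups = re.compile(r"the (.*)er they (.*), the \1er they \2")

good = "the bigger they were, the bigger they will be"
bad = "the bigger they were, the faster they will be"

print(same_x.fullmatch(good) is not None)  # both X's are "bigg"
print(same_x.fullmatch(bad) is not None)   # "fast" does not repeat "bigg"

# With two groups, \2 must repeat whatever the second (.*) captured.
match = two_groups.fullmatch("the bigger they were, the bigger they were")
print(match.group(1), match.group(2))
```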

20 LING 6932 Spring 2007 Eliza [Weizenbaum, 1966] User: Men are all alike ELIZA: IN WHAT WAY User: They’re always bugging us about something or other ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE? User: Well, my boyfriend made me come here ELIZA: YOUR BOYFRIEND MADE YOU COME HERE User: He says I’m depressed much of the time ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED

21 LING 6932 Spring 2007 Eliza-style regular expressions Step 1: replace first person with second person references: s/\bI('m| am)\b/YOU ARE/g s/\bmy\b/YOUR/g s/\bmine\b/YOURS/g Step 2: use substitutions that look for relevant patterns in the input and create an appropriate output (reply): s/.* YOU ARE (depressed|sad).*/I AM SORRY TO HEAR YOU ARE \1/ s/.* YOU ARE (depressed|sad).*/WHY DO YOU THINK YOU ARE \1/ s/.* all.*/IN WHAT WAY/ s/.* always.*/CAN YOU THINK OF A SPECIFIC EXAMPLE/ Step 3: use scores to rank possible transformations. Slide from Dorr/Monz
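
Steps 1 and 2 can be strung together in a few lines. The following is only a sketch under several assumptions: it uses Python's `re` module instead of Perl, matches case-insensitively, takes the first rule that fits instead of ranking by score (step 3 is not implemented), and adds a fallback reply `PLEASE GO ON` that is not on the slide. The names `PERSON_SWAPS`, `RULES` and `eliza_reply` are made up.

```python
import re

# Step 1: swap first-person references for second-person ones.
PERSON_SWAPS = [
    (r"\bI('m| am)\b", "YOU ARE"),
    (r"\bmy\b", "YOUR"),
    (r"\bmine\b", "YOURS"),
]

# Step 2: (pattern, reply template) rules, tried in order.
RULES = [
    (r".* YOU ARE (depressed|sad).*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* all.*", "IN WHAT WAY"),
    (r".* always.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]


def eliza_reply(utterance):
    """Return an upper-case reply built from the first rule that matches."""
    text = utterance
    for pattern, replacement in PERSON_SWAPS:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    for pattern, template in RULES:
        match = re.fullmatch(pattern, text, flags=re.IGNORECASE)
        if match:
            # expand() fills \1 in the template with the captured group.
            return match.expand(template).upper()
    return "PLEASE GO ON"  # hypothetical fallback, not from the slides


for line in ["Men are all alike",
             "They're always bugging us about something or other",
             "He says I'm depressed much of the time"]:
    print(eliza_reply(line))
```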

22 LING 6932 Spring 2007 Summary on REs so far Regular expressions are perhaps the single most useful tool for text manipulation Dumb but ubiquitous Eliza: you can do a lot with simple regular-expression substitutions

23 LING 6932 Spring 2007 Three Views Three equivalent formal ways to look at what we’re up to (thanks to Martin Kay) Regular Expressions Regular Languages Finite State Automata

24 LING 6932 Spring 2007 Finite State Automata Terminology: Finite State Automata, Finite State Machines, FSA, Finite Automata Regular expressions are one way of specifying the structure of finite-state automata. FSAs and their close relatives are at the core of most algorithms for speech and language processing.

25 LING 6932 Spring 2007 Finite-state Automata (Machines) /^baa+!$/ States: q0, q1, q2, q3, q4. Transitions: q0 -b-> q1, q1 -a-> q2, q2 -a-> q3, q3 -a-> q3 (loop), q3 -!-> q4. q4 is the final state. Accepted strings: baa! baaa! baaaa! baaaaa! ... Slide from Dorr/Monz

26 LING 6932 Spring 2007 Sheep FSA We can say the following things about this machine. It has 5 states. At least b, a, and ! are in its alphabet. q0 is the start state. q4 is the final (= accept) state. It has 5 transitions.

27 LING 6932 Spring 2007 More Formally: Defining an FSA You can specify an FSA by enumerating the following things: a finite set of states: Q; a finite alphabet of symbols: Σ; the start state: q0; the set of accepting/final states: F, such that F ⊆ Q; a transition function δ(q,i) that maps Q × Σ to Q. Given a state q ∈ Q and an input symbol i ∈ Σ, δ(q,i) returns a new state q’ ∈ Q.
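
The five components translate almost one-to-one into a small data structure. The sketch below writes the sheep machine as such a 5-tuple in Python; the class name `FSA`, its field names and the check `is_well_formed` are illustrative choices. One assumption to flag: the transition function is stored as a partial mapping, so a missing (state, symbol) entry simply means that there is no transition.

```python
from dataclasses import dataclass


@dataclass
class FSA:
    states: frozenset    # Q
    alphabet: frozenset  # Sigma
    start: str           # q0
    finals: frozenset    # F
    delta: dict          # (state, symbol) -> state

    def is_well_formed(self):
        """Check q0 in Q, F a subset of Q, and delta staying inside Q x Sigma -> Q."""
        return (
            self.start in self.states
            and self.finals <= self.states
            and all(
                source in self.states
                and symbol in self.alphabet
                and target in self.states
                for (source, symbol), target in self.delta.items()
            )
        )


SHEEP = FSA(
    states=frozenset({"q0", "q1", "q2", "q3", "q4"}),
    alphabet=frozenset({"b", "a", "!"}),
    start="q0",
    finals=frozenset({"q4"}),
    delta={
        ("q0", "b"): "q1",
        ("q1", "a"): "q2",
        ("q2", "a"): "q3",
        ("q3", "a"): "q3",
        ("q3", "!"): "q4",
    },
)

print(SHEEP.is_well_formed())
```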

28 LING 6932 Spring 2007 Yet Another View State-transition table

29 LING 6932 Spring 2007 Recognition Recognition is the process of determining if a string should be accepted by a machine Or… it’s the process of determining if a string is in the language we’re defining with the machine Or… it’s the process of determining if a regular expression matches a string

30 LING 6932 Spring 2007 Recognition Traditionally, (Turing’s idea, 1936) this process is depicted with a tape.

31 LING 6932 Spring 2007 Recognition - Execution Start in the start state. Examine the current input in the active cell. Consult the table: a finite table of instructions (a state transition diagram) that specifies exactly what action the machine takes at each step. Go to a new state and update the tape pointer. Repeat until you run out of tape.
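
The execution steps above can be sketched as a short table-driven loop. This is a minimal Python rendering, not necessarily identical to the D-Recognize algorithm traced in the lecture; the table encodes the sheep machine for /baa+!/, and the names `SHEEP_TABLE` and `d_recognize` are chosen for the example.

```python
# Transition table for the sheep machine: (state, symbol) -> next state.
SHEEP_TABLE = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}


def d_recognize(tape, table, start="q0", finals=frozenset({"q4"})):
    """Return True if the machine described by table accepts tape."""
    state = start                             # start in the start state
    for symbol in tape:                       # examine the current cell
        state = table.get((state, symbol))    # consult the table
        if state is None:                     # no transition: reject
            return False
    return state in finals                    # out of tape: accept if final


for tape in ["baaa!", "aba!b", "baa"]:
    print(tape, d_recognize(tape, SHEEP_TABLE))
```

Because the table is passed in as data, changing the machine only requires a different table, not a different function.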

32 LING 6932 Spring 2007 Input Tape: b a a a ! State sequence: q0 q1 q2 q3 q3 q4. The tape is exhausted in the final state q4: ACCEPT Slide from Dorr/Monz

33 LING 6932 Spring 2007 Input Tape: a b a ! b The machine starts in q0, which has no transition on a: REJECT Slide from Dorr/Monz

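34 LING 6932 Spring 2007 Adding a failing state States: q0, q1, q2, q3, q4, plus a fail state qF. The original transitions are kept (q0 -b-> q1, q1 -a-> q2, q2 -a-> q3, q3 -a-> q3, q3 -!-> q4); every missing transition now leads to qF: from q0 on a, !; from q1 on b, !; from q2 on b, !; from q3 on b; from q4 on b, a, !. Slide from Dorr/Monz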

35 LING 6932 Spring 2007 Tracing D-Recognize

36 LING 6932 Spring 2007 Key Points Deterministic means that at each point in processing there is always one unique thing to do (no choices). D-recognize is a simple table-driven interpreter. The algorithm is universal for all unambiguous regular languages. To change the machine, you change the table.

37 LING 6932 Spring 2007 Key Points Deterministic Pattern Example: Consider a set of traffic lights; the sequence of lights is red - red/amber - green - amber - red. The sequence can be pictured as a state machine, where the different states of the traffic lights follow each other. Each state is dependent solely on the previous state, so if the lights are green, an amber light will always follow - that is, the system is deterministic. Deterministic systems are relatively easy to understand and analyse, once the transitions are fully known.

38 LING 6932 Spring 2007 Key Points Crudely therefore… matching strings with regular expressions (a la Perl) is a matter of translating the expression into a machine (table) and passing the table to an interpreter

39 LING 6932 Spring 2007 Recognition as Search You can view this algorithm as state-space search. States are pairings of tape positions and state numbers. Operators are compiled into the table. Goal state is a pairing with the end of tape position and a final accept state.

40 LING 6932 Spring 2007 Generative Formalisms A formal language is a set of strings, each composed of symbols from a finite set of symbols (alphabet). A model m which can both generate and recognize all and only the strings of a formal language acts as a definition of that language: L(m) is ‘the formal language L characterized by the model m’. Finite-state automata define formal languages (without having to enumerate all the strings in the language). The term Generative is based on the view that you can run the machine as a generator to get strings from the language.

41 LING 6932 Spring 2007 Generative Formalisms FSAs can be viewed from two perspectives: Acceptors that can tell you if a string is in the language (recognition) Generators to produce all and only the strings in the language (production)
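
To make the generator view concrete, the same transition table that drives recognition can be walked to produce strings. The sketch below is one possible approach in Python (a breadth-first walk, shortest strings first), not something taken from the slides; `SHEEP_TABLE` and `generate` are illustrative names. Since the sheep language is infinite, the generator never finishes on its own, so the example takes only the first few strings.

```python
from collections import deque
from itertools import islice

# Transition table for the sheep machine: (state, symbol) -> next state.
SHEEP_TABLE = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}


def generate(table, start, finals):
    """Yield strings of the machine's language, shortest first."""
    queue = deque([(start, "")])
    while queue:
        state, prefix = queue.popleft()
        if state in finals:
            yield prefix
        # Extend the prefix along every transition that leaves this state.
        for (source, symbol), target in table.items():
            if source == state:
                queue.append((target, prefix + symbol))


first_three = list(islice(generate(SHEEP_TABLE, "q0", {"q4"}), 3))
print(first_three)
```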

42 LING 6932 Spring 2007 Summary Regular expressions are just a compact textual representation of FSAs. Recognition is the process of determining if a string/input is in the language defined by some machine. Recognition is straightforward with deterministic machines.