Natural Language Processing Lecture 2—1/15/2015 Susan W. Brown.

6/4/2016 Speech and Language Processing - Jurafsky and Martin

Today
- Regular expressions
- Finite-state methods

Regular Expressions and Text Searching
Regular expressions are a compact textual representation of a set of strings that constitute a language.
- In the simplest case, regular expressions describe regular languages.
- Here, a language just means a set of strings over some alphabet.
An extremely versatile and widely used technology: Emacs, vi, Perl, grep, etc.

Example
Find all the instances of the word "the" in a text.
- /the/
- /[tT]he/
- /\b[tT]he\b/
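As a quick check in code (using Python's re module; the sentence is made up, not from the slides), the three patterns behave exactly as the refinement story suggests:

```python
import re

text = "The cat sat there, then the other cat saw the dog."

# /the/ -- misses "The", and matches inside "there", "then", "other"
print(re.findall(r"the", text))

# /[tT]he/ -- now catches "The", but still matches substrings
print(re.findall(r"[tT]he", text))

# /\b[tT]he\b/ -- word boundaries keep only the standalone word
print(re.findall(r"\b[tT]he\b", text))  # ['The', 'the', 'the']
```

Each refinement trades one kind of error for the other until both are fixed.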

Errors
The process we just went through was based on fixing two kinds of errors:
- Matching strings that we should not have matched (there, then, other): false positives (Type I)
- Not matching things that we should have matched (The): false negatives (Type II)

Errors
We'll be telling the same story with respect to evaluation for many tasks. Reducing the error rate for an application often involves two antagonistic efforts:
- Increasing accuracy, or precision (minimizing false positives)
- Increasing coverage, or recall (minimizing false negatives)
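The two quantities are simple ratios. A small sketch with hypothetical counts (the numbers are illustrative, not from the slides):

```python
# Illustrative counts from an imagined search for "the" in a corpus:
true_pos = 90   # real "the"s we matched
false_pos = 10  # spurious matches like "there", "then", "other" (Type I)
false_neg = 5   # missed instances like "The" (Type II)

# precision: of everything we matched, how much was right?
precision = true_pos / (true_pos + false_pos)
# recall: of everything we should have matched, how much did we get?
recall = true_pos / (true_pos + false_neg)

print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.90 recall=0.95
```

Tightening a pattern usually raises precision at the cost of recall, and vice versa, which is why the two efforts are antagonistic.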

3 Formalisms
Recall that I said that regular expressions describe languages (sets of strings). It turns out that there are 3 formalisms for capturing such languages, each with its own motivation and history:
- Regular expressions: compact textual strings, perfect for specifying patterns in programs or on command lines
- Finite state automata: graphs
- Regular grammars: rules

3 Formalisms
These three approaches are all equivalent in terms of their ability to capture regular languages. But, as we'll see, they inspire different algorithms and frameworks.

FSAs as Graphs
Let's start with the sheep language from Chapter 2: /baa+!/

Sheep FSA
We can say the following things about this machine:
- It has 5 states
- b, a, and ! are in its alphabet
- q0 is the start state
- q4 is an accept state
- It has 5 transitions

But Note
There are other machines that correspond to this same language. More on this one later.

More Formally
You can specify an FSA by enumerating the following things:
- The set of states: Q
- A finite alphabet: Σ
- A start state
- A set of accept states
- A transition function that maps Q × Σ to Q
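The five components can be written down directly as data. A minimal sketch of the sheep machine /baa+!/ as a plain Python dictionary (the field names are my own, not from the slides):

```python
# The sheep machine /baa+!/ spelled out as the five components
# of the formal definition.
sheep_fsa = {
    "states": {0, 1, 2, 3, 4},
    "alphabet": {"b", "a", "!"},
    "start": 0,
    "accept": {4},
    # transition function: (state, symbol) -> state
    "delta": {
        (0, "b"): 1,
        (1, "a"): 2,
        (2, "a"): 3,
        (3, "a"): 3,   # the a+ loop
        (3, "!"): 4,
    },
}

print(sheep_fsa["delta"][(0, "b")])  # 1
```

Nothing about this representation is specific to sheep talk; any FSA is just a different filling of the same five slots.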

About Alphabets
Don't take the term alphabet too narrowly; it just means we need a finite set of symbols in the input. These symbols can and will stand for bigger objects that may in turn have internal structure, such as another FSA.

Dollars and Cents

Yet Another View
The guts of an FSA can ultimately be represented as a table:

State   b   a   !
0       1
1           2
2           3
3           3   4
4 (accept)

If you're in state 1 and you're looking at an a, go to state 2.

Recognition
Recognition is the process of determining if a string should be accepted by a machine. Or... it's the process of determining if a string is in the language we're defining with the machine. Or... it's the process of determining if a regular expression matches a string. Those all amount to the same thing in the end.

Recognition
Traditionally (Turing's notion), this process is depicted with an input string written on a tape.

Recognition
Recognition is simply a process of:
- starting in the start state,
- examining the current input,
- consulting the table,
- going to a new state and updating the tape pointer,
...until you run out of tape.

D-Recognize

Key Points
Deterministic means that at each point in processing there is always one unique thing to do (no choices, no ambiguity). D-recognize is a simple table-driven interpreter. The algorithm is universal for all unambiguous regular languages: to change the machine, you simply change the table.
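The table-driven interpreter is only a few lines. A minimal sketch of D-recognize (my own rendering of the loop described above, with the machine supplied as a table):

```python
def d_recognize(tape, delta, start, accept):
    """Table-driven deterministic recognition: track one state,
    make one left-to-right pass over the tape."""
    state = start
    for symbol in tape:
        if (state, symbol) not in delta:  # empty table cell: reject
            return False
        state = delta[(state, symbol)]
    return state in accept  # out of tape: accept iff in an accept state

# The sheep machine's table for /baa+!/; swap in any other table
# to recognize a different regular language with the same interpreter.
delta = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

print(d_recognize("baa!", delta, 0, {4}))    # True
print(d_recognize("baaaa!", delta, 0, {4}))  # True
print(d_recognize("ba!", delta, 0, {4}))     # False
```

Note that the interpreter never changes; only the table does, which is the "universal" point made above.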

Key Points
Crudely, therefore, matching strings with regular expressions (à la Perl, grep, etc.) is a matter of:
- translating the regular expression into a machine (a table), and
- passing the table and the string to an interpreter that implements D-recognize (or something like it)

Recognition as Search
You can view this algorithm as a trivial kind of state-space search:
- States are pairings of tape positions and state numbers.
- Operators are compiled into the table.
- The goal state is a pairing of the end-of-tape position with a final accept state.
It is trivial because?

Non-Determinism

Table View
Allow multiple entries in the table to capture non-determinism:

State   b   a      !
0       1
1           2
2           2,3
3                  4
4 (accept)

Non-Determinism cont.
Yet another technique: epsilon transitions. Key point: these transitions do not examine or advance the tape during recognition.

Equivalence
Non-deterministic machines can be converted to deterministic ones with a fairly simple construction. That means they have the same power: non-deterministic machines are not more powerful than deterministic ones in terms of the languages they can and can't characterize.

ND Recognition
Two basic approaches (used in all major implementations of regular expressions; see Friedl 2006):
1. Take a ND machine, convert it to a D machine, and then do recognition with that.
2. Explicitly manage the process of recognition as a state-space search (leaving the machine/table as is).

Non-Deterministic Recognition: Search
In a ND FSA there exists at least one path through the machine for any string that is in the language defined by the machine. But not all paths through the machine for an accepted string lead to an accept state. No path through the machine leads to an accept state for a string not in the language.

Non-Deterministic Recognition
So success in non-deterministic recognition occurs when a path is found through the machine that ends in an accept state. Failure occurs when all of the possible paths for a given string lead to failure.
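The second approach above (recognition as explicit search) can be sketched with an agenda of (state, tape position) pairs. This is my own minimal rendering, not the book's pseudocode; here the table maps each (state, symbol) to a *set* of next states:

```python
def nd_recognize(tape, delta, start, accept):
    """Recognition as state-space search over (machine state, tape index)
    pairs, keeping unexplored choices on an agenda."""
    agenda = [(start, 0)]
    while agenda:
        state, pos = agenda.pop()  # pop from the end = depth-first search
        if pos == len(tape):
            if state in accept:
                return True        # one full accepting path is enough
            continue               # this path consumed the tape but failed
        # non-deterministic table: (state, symbol) -> set of next states
        for nxt in delta.get((state, tape[pos]), set()):
            agenda.append((nxt, pos + 1))
    return False                   # every path led to failure

# ND sheep machine: at q2, an "a" can go to q2 or q3
nd_delta = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}

print(nd_recognize("baaa!", nd_delta, 0, {4}))  # True
print(nd_recognize("baaa", nd_delta, 0, {4}))   # False
```

Using a queue instead of a stack for the agenda would give breadth-first search; the accept/reject answer is the same either way.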

Example
Input tape: b a a a !
State sequence: q0 → q1 → q2 → q2 → q3 → q4

[Slides 31-38: a step-by-step trace of the search through the machine, figures only]

Key Points
States in the search space are pairings of tape positions and states in the machine. By keeping track of as-yet-unexplored states, a recognizer can systematically explore all the paths through the machine given an input.

Why Bother?
Non-determinism doesn't get us more formal power, and it causes headaches, so why bother?
- More natural (understandable) solutions
- It's not always obvious to users whether the regex they've produced is deterministic or not; better not to make them worry about it

Admin
Questions?

Converting NFAs to DFAs
The subset construction is the means by which we can convert an NFA to a DFA automatically. The intuition is to think about being in multiple states at the same time. Let's go back to our earlier example where we're in state q2 looking at an "a".

Subset Construction
So the trick is to simulate going to both q2 and q3 at the same time. One way to do this is to imagine a new state of a new machine that represents being in states q2 and q3 at the same time:
- Call that new state {q2,q3}
- That's just the name of a new state, but it helps us remember where it came from
- It's a subset of the original set of states
The construction does this for all possible subsets of the original states (the powerset), and then we fill in the transition table for each such set.

Subset Construction
Given an NFA with the usual parts (Q, Σ, transition function δ, start state q0, and designated accept states), we'll construct a new DFA that accepts the same language, where:
- The states of the new machine are the powerset of Q (the set of all subsets of Q): call it Q_D
- The start state is {q0}
- The alphabet is the same: Σ
- The accept states are the states in Q_D that contain any accept state from Q

Subset Construction
What about the transition function? For every new state, we create a transition on each symbol α in the alphabet as follows:

δ_D({q1, …, qk}, α) = δ_N(q1, α) ∪ … ∪ δ_N(qk, α)
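The definition above translates almost line for line into code. A minimal sketch of the lazy variant of the subset construction (my own implementation; it only builds subset-states actually reachable from {q0}, anticipating the "lazy method" below):

```python
from collections import deque

def nfa_to_dfa(nd_delta, alphabet, start, accept):
    """Lazy subset construction: build only the subset-states
    reachable from {start}, rather than the full powerset."""
    start_set = frozenset([start])
    dfa_delta, seen, agenda = {}, {start_set}, deque([start_set])
    while agenda:
        subset = agenda.popleft()
        for sym in alphabet:
            # delta_D(S, sym) = union of delta_N(q, sym) over q in S
            nxt = frozenset(r for q in subset
                              for r in nd_delta.get((q, sym), set()))
            if not nxt:
                continue  # empty result: this subset rejects on sym
            dfa_delta[(subset, sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                agenda.append(nxt)
    dfa_accept = {s for s in seen if s & accept}  # contains an NFA accept state
    return dfa_delta, start_set, dfa_accept

# ND sheep machine: at q2, "a" goes to both q2 and q3
nd_delta = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}
dfa_delta, d_start, d_accept = nfa_to_dfa(nd_delta, "ba!", 0, {4})

for (s, sym), t in dfa_delta.items():
    print(sorted(s), sym, "->", sorted(t))
```

Run on the sheep NFA, this yields just five subset-states ({0}, {1}, {2}, {2,3}, {4}) rather than all 32 subsets.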

Baaa!
How does that work out for our example?
- The alphabet is still "a", "b", and "!"
- The start state is {q0}
- The rest of the states are {q1}, {q2}, ..., {q4}, {q1,q2}, {q1,q3}, ..., {q0,q1,q2,q3,q4}: all subsets of the states in Q
What's the transition table going to look like?

Lazy Method
Rather than listing all the subsets up front, fill in the table only for the subset-states we actually reach, starting from {q0}. The ND machine's table:

State   b    a      !
q0      q1
q1           q2
q2           q2,q3
q3                  q4
q4

Slides 48-52 fill in the new deterministic table one cell at a time; the completed table:

State     b      a        !
{q0}      {q1}
{q1}             {q2}
{q2}             {q2,q3}
{q2,q3}          {q2,q3}
{q4}

Couple of Notes
We didn't come close to needing 2^|Q| new states; most of those were unreachable. So in theory there is the potential for an explosion in the number of states; in practice, it may be more manageable. Draw the new deterministic machine from the table on the previous slide... it should look familiar.

Compositional Machines
Recall that formal languages are just sets of strings. Therefore, we can talk about set operations (intersection, union, concatenation, negation) on languages. This turns out to be very useful: it allows us to decompose problems into smaller problems, solve those with machines for specific languages, and then compose those solutions to solve the big problem.

Example
Create a regex to match all the ways that people write down phone numbers. For just the U.S., that needs to cover formats such as (303) … as well as internationally prefixed forms like (01) …. You could write one big hairy regex to capture all that, or you could write individual regexes for each format and then OR them together into a new regex/machine. How does that work?

Union (Or)

Negation
Construct a machine M2 that accepts all strings not accepted by machine M1 and rejects all the strings accepted by M1: invert the accept and non-accept states in M1. Does that work for non-deterministic machines?
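One subtlety the slide glosses over: flipping accept states is only safe on a deterministic machine whose table is *complete*. A minimal sketch (my own code; the "DEAD" sink-state name is an assumption) that first completes the table, then flips:

```python
def complement(delta, states, alphabet, start, accept):
    """Negate a DFA: complete it with a dead (sink) state so every
    (state, symbol) has an entry, then swap accept and non-accept states."""
    dead = "DEAD"  # fresh sink state; assumes no real state uses this name
    full = dict(delta)
    for q in list(states) + [dead]:
        for sym in alphabet:
            full.setdefault((q, sym), dead)  # missing entries fall into the sink
    new_states = set(states) | {dead}
    return full, new_states, start, new_states - set(accept)

# Complement of the sheep machine /baa+!/
delta = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}
full, states, start, accept = complement(delta, {0, 1, 2, 3, 4}, "ba!", 0, {4})

def run(tape):
    q = start
    for sym in tape:
        q = full[(q, sym)]
    return q in accept

print(run("baa!"))  # False: in the original language
print(run("ba!"))   # True: not in the original language
```

And the answer to the slide's question is no: on a non-deterministic machine a string can have both accepting and rejecting paths, so flipping accept states can accept strings it shouldn't. Determinize first, then flip.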

Intersection (AND)
Accept a string that is in both of two specified languages. An indirect construction: A ∩ B = ¬(¬A ∪ ¬B)

Problem Set 1
From the Jurafsky and Martin book, end of Chapter 2:
- Problem 2.1, subsections 2, 4, and 6
- Problem 2.4
Due Thursday, Jan. 22, in class.

Next Week
On to Chapter 3:
- Crash course in English morphology
- Finite state transducers
- Applications
  - Lexicons
  - Segmentation