Download presentation
Presentation is loading. Please wait.
1
CSC 9010- NLP - Regex, Finite State Automata
CSC 9010 Natural Language Processing Lecture 2: Regular Expressions, Finite State Automata Paula Matuszek Mary-Angela Papalaskari Presentation slides adapted from Jim Martin’s course: 12/3/2018 CSC NLP - Regex, Finite State Automata
2
Regular Expressions and Text Searching
Everybody does it Emacs, vi, perl, grep, etc.. 12/3/2018 CSC NLP - Regex, Finite State Automata
3
CSC 9010- NLP - Regex, Finite State Automata
Example Find me all instances of the word “the” in a text. /the/ /[tT]he/ /\b[tT]he\b/ 12/3/2018 CSC NLP - Regex, Finite State Automata
4
CSC 9010- NLP - Regex, Finite State Automata
Two kinds of Errors Matching strings that we should not have matched (there, then, other) False positives Not matching things that we should have matched (The) False negatives 12/3/2018 CSC NLP - Regex, Finite State Automata
5
Two Antagonistic Goals
Accuracy (minimize false positives) Coverage (minimize false negatives). 12/3/2018 CSC NLP - Regex, Finite State Automata
6
CSC 9010- NLP - Regex, Finite State Automata
Idealized machines for processing regular expressions Example: /baa+!/ 12/3/2018 CSC NLP - Regex, Finite State Automata
7
CSC 9010- NLP - Regex, Finite State Automata
Idealized machines for processing regular expressions Example: /baa+!/ 5 states 5 transitions alphabet? initial state accept state 12/3/2018 CSC NLP - Regex, Finite State Automata
8
CSC 9010- NLP - Regex, Finite State Automata
More examples: 12/3/2018 CSC NLP - Regex, Finite State Automata
9
Another FSA for the same language:
12/3/2018 CSC NLP - Regex, Finite State Automata
10
Formally Specifying a FSA
The set of states: Q A finite alphabet: Σ A start state A set of accept/final states A transition function that maps QxΣ to Q discuss alphabets = not too narrow! do example: STATE TRANSITION TABLE input State b a ! 12/3/2018 CSC NLP - Regex, Finite State Automata
11
CSC 9010- NLP - Regex, Finite State Automata
Dollars and Cents 12/3/2018 CSC NLP - Regex, Finite State Automata
12
CSC 9010- NLP - Regex, Finite State Automata
Recognition Recognition is the process of determining if a string should be accepted by a machine Or… it’s the process of determining if as string is in the language we’re defining with the machine Or… it’s the process of determining if a regular expression matches a string 12/3/2018 CSC NLP - Regex, Finite State Automata
13
Turing’s way of Visualizing Recognition
12/3/2018 CSC NLP - Regex, Finite State Automata
14
CSC 9010- NLP - Regex, Finite State Automata
Recognition Begin in the start state Examine current input Consult the table Go to a new state and update the tape pointer. When you run out of tape: if in accepting state, accept input else reject input 12/3/2018 CSC NLP - Regex, Finite State Automata
15
CSC 9010- NLP - Regex, Finite State Automata
D-Recognize 12/3/2018 CSC NLP - Regex, Finite State Automata
16
CSC 9010- NLP - Regex, Finite State Automata
Key Points Deterministic means that at each point in processing there is always one unique thing to do (no choices). D-recognize is a simple table-driven interpreter The algorithm is universal for all unambiguous languages. To change the machine, you change the table. 12/3/2018 CSC NLP - Regex, Finite State Automata
17
CSC 9010- NLP - Regex, Finite State Automata
Key Points Crudely therefore… matching strings with regular expressions is a matter of translating the expression into a machine (table) and passing the table to an interpreter 12/3/2018 CSC NLP - Regex, Finite State Automata
18
CSC 9010- NLP - Regex, Finite State Automata
Recognition as Search You can view this algorithm as a degenerate kind of state-space search. States are pairings of tape positions and state numbers. Operators are compiled into the table Goal state is a pairing with the end of tape position and a final accept state Its degenerate because? 12/3/2018 CSC NLP - Regex, Finite State Automata
19
Generative Formalisms
Formal Languages are sets of strings composed of symbols from a finite set of symbols. Finite-state automata define formal languages (without having to enumerate all the strings in the language) The term Generative is based on the view that you can run the machine as a generator to get strings from the language. 12/3/2018 CSC NLP - Regex, Finite State Automata
20
Generative Formalisms
FSAs can be viewed from two perspectives: Acceptors that can tell you if a string is in the language Generators to produce all and only the strings in the language 12/3/2018 CSC NLP - Regex, Finite State Automata
21
CSC 9010- NLP - Regex, Finite State Automata
Review Regular expressions are just a compact textual representation of FSAs Recognition is the process of determining if a string/input is in the language defined by some machine. Recognition is straightforward with deterministic machines. 12/3/2018 CSC NLP - Regex, Finite State Automata
22
CSC 9010- NLP - Regex, Finite State Automata
Three Views Three equivalent formal ways to look at what we’re up to (not including tables) Regular Expressions Mention machine (Turing) production systems (Post) Regular sets (Kleene) Finite State Automata Regular Languages 12/3/2018 CSC NLP - Regex, Finite State Automata
23
Defining Languages with Productions
S → b a a A A → a A A → ! S → NP VP NP → PrNoun NP → Det Noun Det → a | the Noun → cat | dog| book PrNoun → samantha |elmer | fido VP → IVerb | TVerb NP IVerb → ran |slept | ate TVerb → hit | kissed | ate Regular language Regular? 12/3/2018 CSC NLP - Regex, Finite State Automata
24
CSC 9010- NLP - Regex, Finite State Automata
Non-Determinism Compare: 12/3/2018 CSC NLP - Regex, Finite State Automata
25
CSC 9010- NLP - Regex, Finite State Automata
Non-Determinism cont. Epsilon transitions: Note: these transitions do not examine or advance the tape during recognition ε 12/3/2018 CSC NLP - Regex, Finite State Automata
26
Are Non-deterministic FSA more powerful?
Non-deterministic machines can be converted to deterministic ones with a fairly simple construction One way to do recognition with a non-deterministic machine is to turn it into a deterministic one. 12/3/2018 CSC NLP - Regex, Finite State Automata
27
Non-Deterministic Recognition
In a ND FSA there exists at least one path through the machine for a string that is in the language defined by the machine. But not all paths directed through the machine for an accept string lead to an accept state. No paths through the machine lead to an accept state for a string not in the language. 12/3/2018 CSC NLP - Regex, Finite State Automata
28
Non-Deterministic Recognition
So success in a non-deterministic recognition occurs when a path is found through the machine that ends in an accept. Failure occurs when none of the possible paths lead to an accept state. 12/3/2018 CSC NLP - Regex, Finite State Automata
29
CSC 9010- NLP - Regex, Finite State Automata
Example b a a a ! \ q0 q1 q2 q2 q3 q4 12/3/2018 CSC NLP - Regex, Finite State Automata
30
CSC 9010- NLP - Regex, Finite State Automata
Example 12/3/2018 CSC NLP - Regex, Finite State Automata
31
CSC 9010- NLP - Regex, Finite State Automata
Example 12/3/2018 CSC NLP - Regex, Finite State Automata
32
CSC 9010- NLP - Regex, Finite State Automata
Example 12/3/2018 CSC NLP - Regex, Finite State Automata
33
CSC 9010- NLP - Regex, Finite State Automata
Example 12/3/2018 CSC NLP - Regex, Finite State Automata
34
CSC 9010- NLP - Regex, Finite State Automata
Example 12/3/2018 CSC NLP - Regex, Finite State Automata
35
CSC 9010- NLP - Regex, Finite State Automata
Example 12/3/2018 CSC NLP - Regex, Finite State Automata
36
CSC 9010- NLP - Regex, Finite State Automata
Example 12/3/2018 CSC NLP - Regex, Finite State Automata
37
CSC 9010- NLP - Regex, Finite State Automata
Example 12/3/2018 CSC NLP - Regex, Finite State Automata
38
CSC 9010- NLP - Regex, Finite State Automata
Key Points States in the search space are pairings of tape positions and states in the machine. By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input. 12/3/2018 CSC NLP - Regex, Finite State Automata
39
CSC 9010- NLP - Regex, Finite State Automata
ND-Recognize Code 12/3/2018 CSC NLP - Regex, Finite State Automata
40
CSC 9010- NLP - Regex, Finite State Automata
Infinite Search If you’re not careful such searches can go into an infinite loop. How? 12/3/2018 CSC NLP - Regex, Finite State Automata
41
CSC 9010- NLP - Regex, Finite State Automata
Why Bother? Non-determinism doesn’t get us more formal power and it causes headaches so why bother? More natural solutions Machines based on construction are too big 12/3/2018 CSC NLP - Regex, Finite State Automata
42
Compositional Machines
Formal languages are just sets of strings Therefore, we can talk about various set operations (intersection, union, concatenation) This turns out to be a useful exercise 12/3/2018 CSC NLP - Regex, Finite State Automata
43
CSC 9010- NLP - Regex, Finite State Automata
Union Accept a string in either of two languages 12/3/2018 CSC NLP - Regex, Finite State Automata
44
CSC 9010- NLP - Regex, Finite State Automata
Concatenation Accept a string consisting of a string from language L1 followed by a string from language L2. 12/3/2018 CSC NLP - Regex, Finite State Automata
45
CSC 9010- NLP - Regex, Finite State Automata
Negation Construct a machine M2 to accept all strings not accepted by machine M1 and reject all the strings accepted by M1 Invert all the accept and not accept states in M1 Does that work for non-deterministic machines? 12/3/2018 CSC NLP - Regex, Finite State Automata
46
CSC 9010- NLP - Regex, Finite State Automata
Intersection Accept a string that is in both of two specified languages An indirect construction… A^B = ~(~A or ~B) 12/3/2018 CSC NLP - Regex, Finite State Automata
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.