CSC 9010 - NLP - Regex, Finite State Automata
CSC 9010 Natural Language Processing
Lecture 2: Regular Expressions, Finite State Automata
Paula Matuszek, Mary-Angela Papalaskari
Presentation slides adapted from Jim Martin’s course: http://www.cs.colorado.edu/~martin/csci5832.html
12/3/2018
Regular Expressions and Text Searching
- Everybody does it
- Emacs, vi, perl, grep, etc.
Example
- Find me all instances of the word “the” in a text.
- /the/
- /[tT]he/
- /\b[tT]he\b/
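The three patterns on this slide can be tried out with Python's `re` module. This is a minimal sketch: the sample sentence and the helper name `find_all` are made up for illustration and are not part of the slides.

```python
import re

# Hypothetical sample text chosen to contain "the" inside other words.
text = "The other day the cat sat there, then left the theater."

patterns = [r"the", r"[tT]he", r"\b[tT]he\b"]

def find_all(pattern, s):
    """Return (start offset, matched text) for every match of pattern in s."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, s)]

results = {p: find_all(p, text) for p in patterns}
for p, hits in results.items():
    print(f"/{p}/ -> {len(hits)} matches at offsets {[h[0] for h in hits]}")
```

On this sentence, /the/ also fires inside other, there, then and theater and misses the capitalized The, while the word-boundary version keeps only the whole-word hits.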
Two Kinds of Errors
- False positives: matching strings that we should not have matched (there, then, other)
- False negatives: not matching things that we should have matched (The)
Two Antagonistic Goals
- Accuracy (minimize false positives)
- Coverage (minimize false negatives)
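To make the two error types concrete, the following sketch counts false positives and false negatives for each of the three "the" patterns. The sample sentence, the gold standard (every alphabetic token equal to "the" once lowercased), and the function name `errors` are assumptions made for this illustration.

```python
import re

# Hypothetical sample text.
text = "The other day the cat sat there, then left the theater."

# Assumed gold standard: spans of alphabetic tokens that equal "the"
# after lowercasing.
gold = {m.span() for m in re.finditer(r"[A-Za-z]+", text)
        if m.group().lower() == "the"}

def errors(pattern):
    """Return (false positives, false negatives) for pattern on text."""
    found = {m.span() for m in re.finditer(pattern, text)}
    return len(found - gold), len(gold - found)

for pattern in (r"the", r"[tT]he", r"\b[tT]he\b"):
    fp, fn = errors(pattern)
    print(f"/{pattern}/: {fp} false positives, {fn} false negatives")
```

Widening the pattern to [tT]he removes the false negative without touching the false positives; adding the word boundaries removes the false positives.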
Idealized machines for processing regular expressions
- Example: /baa+!/
- 5 states, 5 transitions
- alphabet?
- initial state
- accept state
More examples:
Another FSA for the same language:
Formally Specifying an FSA
- The set of states: Q
- A finite alphabet: Σ
- A start state
- A set of accept/final states
- A transition function that maps Q × Σ to Q
- Discuss alphabets: not too narrow!
- Do example: state transition table

         Input
State    b    a    !
0        1    .    .
1        .    2    .
2        .    3    .
3        .    3    4
4        .    .    .
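The five components and the transition table above can be written down directly as data. This is one possible encoding, not the course's: the class name `FSA`, the field names, and the choice of a dictionary keyed by (state, symbol) are assumptions, and state 4 is taken as the accept state because it is the state reached after the final "!".

```python
from typing import NamedTuple

class FSA(NamedTuple):
    """Hypothetical container for the five components of an FSA."""
    states: frozenset
    alphabet: frozenset
    start: int
    accept: frozenset
    delta: dict  # (state, symbol) -> state; a missing key is a "." in the table

# The table above, row by row.
SHEEP = FSA(
    states=frozenset(range(5)),
    alphabet=frozenset("ba!"),
    start=0,
    accept=frozenset({4}),
    delta={(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4},
)

def is_well_formed(m):
    """Check that start, accept and delta only mention known states and symbols."""
    return (m.start in m.states
            and m.accept <= m.states
            and all(q in m.states and a in m.alphabet and r in m.states
                    for (q, a), r in m.delta.items()))

print(is_well_formed(SHEEP), len(SHEEP.delta))
```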
Dollars and Cents
Recognition
- Recognition is the process of determining if a string should be accepted by a machine
- Or… it’s the process of determining if a string is in the language we’re defining with the machine
- Or… it’s the process of determining if a regular expression matches a string
Turing’s way of Visualizing Recognition
Recognition
- Begin in the start state
- Examine the current input
- Consult the table
- Go to a new state and update the tape pointer
- When you run out of tape: if in an accepting state, accept the input; else reject the input
D-Recognize
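The recognition steps listed earlier can be sketched as a short table-driven loop. This is an illustrative Python version, not the original D-Recognize listing: the function signature and the dictionary encoding of the /baa+!/ table are assumptions, and a missing table entry is treated as an immediate reject.

```python
# Assumed encoding of the /baa+!/ transition table: (state, symbol) -> state.
SHEEP_TABLE = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

def d_recognize(tape, table, start, accept):
    """Deterministic recognition: follow the table one symbol at a time."""
    state = start
    for symbol in tape:                      # advance the tape pointer
        state = table.get((state, symbol))   # consult the table
        if state is None:                    # empty cell: no legal move
            return False
    return state in accept                   # out of tape: accept or reject

for s in ["baa!", "baaaa!", "ba!", "baa", "baa!a"]:
    print(s, d_recognize(s, SHEEP_TABLE, start=0, accept={4}))
```

Swapping in a different table changes the language recognized without touching the loop.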
Key Points
- Deterministic means that at each point in processing there is always one unique thing to do (no choices).
- D-recognize is a simple table-driven interpreter.
- The algorithm is universal: it works for any deterministic machine. To change the machine, you change the table.
Key Points
- Crudely therefore… matching strings with regular expressions is a matter of translating the expression into a machine (table) and passing the table to an interpreter
Recognition as Search
- You can view this algorithm as a degenerate kind of state-space search.
- States are pairings of tape positions and state numbers.
- Operators are compiled into the table.
- The goal state is a pairing of the end-of-tape position with a final accept state.
- It’s degenerate because?
Generative Formalisms
- Formal languages are sets of strings composed of symbols from a finite set of symbols.
- Finite-state automata define formal languages (without having to enumerate all the strings in the language).
- The term generative is based on the view that you can run the machine as a generator to get strings from the language.
Generative Formalisms
- FSAs can be viewed from two perspectives:
- Acceptors that can tell you if a string is in the language
- Generators to produce all and only the strings in the language
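The generator view can be sketched by walking the same transition table breadth-first and emitting a string whenever an accept state is reached. Since the /baa+!/ language is infinite, this sketch stops at a length bound; the function name `generate`, the `max_len` parameter, and the table encoding are assumptions for illustration.

```python
from collections import deque

# Assumed encoding of the /baa+!/ machine: (state, symbol) -> state.
SHEEP_TABLE = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

def generate(table, start, accept, max_len):
    """Yield every accepted string of length <= max_len, shortest first."""
    queue = deque([(start, "")])
    while queue:
        state, prefix = queue.popleft()
        if state in accept:
            yield prefix
        if len(prefix) < max_len:
            # Follow every outgoing arc from the current state.
            for (src, symbol), dst in sorted(table.items()):
                if src == state:
                    queue.append((dst, prefix + symbol))

print(list(generate(SHEEP_TABLE, start=0, accept={4}, max_len=6)))
```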
Review
- Regular expressions are just a compact textual representation of FSAs.
- Recognition is the process of determining if a string/input is in the language defined by some machine.
- Recognition is straightforward with deterministic machines.
Three Views
- Three equivalent formal ways to look at what we’re up to (not including tables):
- Regular Expressions
- Finite State Automata
- Regular Languages
- Mention: machine (Turing), production systems (Post), regular sets (Kleene)
Defining Languages with Productions
Regular language:
S → b a a A
A → a A
A → !
Regular?
S → NP VP
NP → PrNoun
NP → Det Noun
Det → a | the
Noun → cat | dog | book
PrNoun → samantha | elmer | fido
VP → IVerb | TVerb NP
IVerb → ran | slept | ate
TVerb → hit | kissed | ate
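One way to explore the second grammar is to run its productions as a generator. The sketch below expands every rule exhaustively, which only terminates because this particular grammar has no recursion; it therefore generates a finite set of sentences, and every finite language is regular. The dictionary encoding and the function name `expand` are assumptions for illustration.

```python
from itertools import product

# The S -> NP VP grammar from the slide; each right-hand side is a list of symbols.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["PrNoun"], ["Det", "Noun"]],
    "Det": [["a"], ["the"]],
    "Noun": [["cat"], ["dog"], ["book"]],
    "PrNoun": [["samantha"], ["elmer"], ["fido"]],
    "VP": [["IVerb"], ["TVerb", "NP"]],
    "IVerb": [["ran"], ["slept"], ["ate"]],
    "TVerb": [["hit"], ["kissed"], ["ate"]],
}

def expand(symbol):
    """Return every word sequence derivable from symbol (non-recursive grammars only)."""
    if symbol not in GRAMMAR:          # terminal: a word
        return [[symbol]]
    results = []
    for rhs in GRAMMAR[symbol]:
        # Combine every expansion of each right-hand-side symbol.
        for parts in product(*(expand(s) for s in rhs)):
            results.append([w for part in parts for w in part])
    return results

sentences = [" ".join(words) for words in expand("S")]
print(len(sentences), sentences[:3])
```

The first grammar (S → b a a A) cannot be expanded this way because A → a A is recursive; a length bound would be needed.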
Non-Determinism
Compare:
Non-Determinism cont.
- Epsilon (ε) transitions
- Note: these transitions do not examine or advance the tape during recognition
Are Non-deterministic FSA more powerful?
- Non-deterministic machines can be converted to deterministic ones with a fairly simple construction.
- One way to do recognition with a non-deterministic machine is to turn it into a deterministic one.
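The construction mentioned here is commonly done as a subset construction, where each deterministic state stands for a set of non-deterministic states. The sketch below handles machines without epsilon transitions only. The example machine (a non-deterministic variant of /baa+!/ with a self-loop on "a" at state 2) and all function names are assumptions for illustration.

```python
# Assumed NFA for /baa+!/: (state, symbol) -> set of possible next states.
SHEEP_NFA = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}

def determinize(nfa, start, accept, alphabet):
    """Subset construction (no epsilon moves): return (table, start, accept)."""
    start_set = frozenset([start])
    dfa, todo, seen = {}, [start_set], {start_set}
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            target = frozenset(r for q in current
                               for r in nfa.get((q, symbol), ()))
            if not target:
                continue  # no move on this symbol: leave the entry undefined
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # A subset accepts if it contains at least one original accept state.
    dfa_accept = {s for s in seen if s & accept}
    return dfa, start_set, dfa_accept

def d_recognize(tape, table, start, accept):
    """Deterministic table-driven recognition."""
    state = start
    for symbol in tape:
        state = table.get((state, symbol))
        if state is None:
            return False
    return state in accept

dfa, dfa_start, dfa_accept = determinize(SHEEP_NFA, 0, {4}, "ba!")
print(len(dfa), d_recognize("baaa!", dfa, dfa_start, dfa_accept))
```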
Non-Deterministic Recognition
- In an ND FSA there exists at least one path through the machine for a string that is in the language defined by the machine.
- But not all paths through the machine for an accepted string lead to an accept state.
- No paths through the machine lead to an accept state for a string not in the language.
Non-Deterministic Recognition
- So success in non-deterministic recognition occurs when a path is found through the machine that ends in an accept state.
- Failure occurs when none of the possible paths lead to an accept state.
Example
Input: b a a a !
State sequence: q0 q1 q2 q2 q3 q4
Key Points
- States in the search space are pairings of tape positions and states in the machine.
- By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input.
ND-Recognize Code
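The original listing is not reproduced here. As a stand-in, the following is a minimal agenda-based sketch of non-deterministic recognition in Python, without epsilon transitions. The agenda holds unexplored (machine state, tape position) pairs. The example machine is an assumption inferred from the state sequence q0 q1 q2 q2 q3 q4 in the earlier example (a self-loop on "a" at state 2), and the function name is hypothetical.

```python
# Assumed NFA for /baa+!/: (state, symbol) -> set of possible next states.
SHEEP_NFA = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}

def nd_recognize(tape, nfa, start, accept):
    """Depth-first search over (machine state, tape position) pairs."""
    agenda = [(start, 0)]                 # unexplored search states
    while agenda:
        state, pos = agenda.pop()
        if pos == len(tape):
            if state in accept:
                return True               # one successful path is enough
            continue                      # dead end: try another path
        for nxt in nfa.get((state, tape[pos]), ()):
            agenda.append((nxt, pos + 1))
    return False                          # every path failed

for s in ["baa!", "baaa!", "ba!", "baa"]:
    print(s, nd_recognize(s, SHEEP_NFA, start=0, accept={4}))
```

Using a list as a stack gives depth-first search; replacing it with a queue would give breadth-first search over the same space.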
Infinite Search
- If you’re not careful, such searches can go into an infinite loop.
- How?
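One way a search can loop forever is a cycle of epsilon transitions: the tape position never advances, so the same (state, position) pair is generated again and again. A common remedy, sketched below, is to remember which pairs have already been put on the agenda. The empty-string marker `EPS`, the toy machine, and the function name are assumptions for illustration.

```python
EPS = ""  # hypothetical marker for an epsilon transition

# Toy NFA with an epsilon cycle between states 0 and 1.
LOOPY = {(0, EPS): {1}, (1, EPS): {0}, (1, "a"): {2}}

def nd_recognize_safe(tape, nfa, start, accept):
    """Agenda search that never revisits a (state, position) pair."""
    agenda = [(start, 0)]
    seen = {(start, 0)}
    while agenda:
        state, pos = agenda.pop()
        if pos == len(tape) and state in accept:
            return True
        # Epsilon moves keep the tape position; symbol moves advance it.
        successors = [(nxt, pos) for nxt in nfa.get((state, EPS), ())]
        if pos < len(tape):
            successors += [(nxt, pos + 1)
                           for nxt in nfa.get((state, tape[pos]), ())]
        for node in successors:
            if node not in seen:          # without this check, "" loops forever
                seen.add(node)
                agenda.append(node)
    return False

print(nd_recognize_safe("a", LOOPY, 0, {2}), nd_recognize_safe("", LOOPY, 0, {2}))
```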
Why Bother?
- Non-determinism doesn’t get us more formal power, and it causes headaches, so why bother?
- More natural solutions
- Machines built by the deterministic construction can be too big
Compositional Machines
- Formal languages are just sets of strings.
- Therefore, we can talk about various set operations (intersection, union, concatenation).
- This turns out to be a useful exercise.
Union
- Accept a string in either of two languages
Concatenation
- Accept a string consisting of a string from language L1 followed by a string from language L2.
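Union and concatenation are often built by gluing two machines together with epsilon transitions: a new start state with epsilon arcs into both machines for union, and epsilon arcs from the accept states of the first machine to the start of the second for concatenation. The sketch below assumes that approach. The `EPS` marker, the state-renaming helper `tag`, the two toy machines, and all function names are assumptions for illustration.

```python
EPS = ""  # hypothetical marker for an epsilon transition

def accepts(tape, nfa, start, accept):
    """Agenda-based recognition for an NFA that may contain epsilon moves."""
    agenda, seen = [(start, 0)], {(start, 0)}
    while agenda:
        state, pos = agenda.pop()
        if pos == len(tape) and state in accept:
            return True
        nxt = [(r, pos) for r in nfa.get((state, EPS), ())]
        if pos < len(tape):
            nxt += [(r, pos + 1) for r in nfa.get((state, tape[pos]), ())]
        for node in nxt:
            if node not in seen:
                seen.add(node)
                agenda.append(node)
    return False

def tag(nfa, label):
    """Rename every state q to (label, q) so the two machines cannot clash."""
    return {((label, q), a): {(label, r) for r in rs}
            for (q, a), rs in nfa.items()}

def union(n1, s1, f1, n2, s2, f2):
    nfa = {**tag(n1, 1), **tag(n2, 2)}
    start = "new_start"
    nfa[(start, EPS)] = {(1, s1), (2, s2)}       # guess which machine to run
    return nfa, start, {(1, q) for q in f1} | {(2, q) for q in f2}

def concat(n1, s1, f1, n2, s2, f2):
    nfa = {**tag(n1, 1), **tag(n2, 2)}
    for q in f1:                                  # hand over to machine 2
        nfa.setdefault(((1, q), EPS), set()).add((2, s2))
    return nfa, (1, s1), {(2, q) for q in f2}

SHEEP = ({(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}, 0, {4})
AB = ({(0, "a"): {1}, (1, "b"): {2}}, 0, {2})     # toy machine for /ab/

either = union(*SHEEP, *AB)
both = concat(*SHEEP, *AB)
print(accepts("ab", *either), accepts("baa!ab", *both))
```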
Negation
- Construct a machine M2 that accepts all strings not accepted by machine M1 and rejects all the strings accepted by M1
- Invert all the accept and non-accept states in M1
- Does that work for non-deterministic machines?
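Inverting accept and non-accept states gives the complement only when the machine is deterministic and its table has no empty cells. The sketch below first fills every empty cell with a transition to an added dead state and then flips the accept set. A non-deterministic machine would need to be converted to a deterministic one first, since flipping its accept states directly does not in general give the complement. The state name `"sink"` and the function names are assumptions for illustration.

```python
# Assumed encoding of the deterministic /baa+!/ machine.
SHEEP_TABLE = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

def complement(table, states, alphabet, start, accept):
    """Return (table, states, start, accept) for the complement machine."""
    sink = "sink"  # hypothetical name for the added dead state
    all_states = set(states) | {sink}
    # Make the table total: every missing entry now leads to the sink.
    total = {(q, a): table.get((q, a), sink)
             for q in all_states for a in alphabet}
    return total, all_states, start, all_states - set(accept)

def run(tape, table, start, accept):
    """Recognition with a total table (tape symbols must be in the alphabet)."""
    state = start
    for symbol in tape:
        state = table[(state, symbol)]
    return state in accept

not_table, not_states, not_start, not_accept = complement(
    SHEEP_TABLE, range(5), "ba!", 0, {4})
for s in ["baa!", "ba!", "", "baa!b"]:
    print(repr(s), run(s, not_table, not_start, not_accept))
```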
Intersection
- Accept a string that is in both of two specified languages
- An indirect construction: A ^ B = ~(~A or ~B)
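As an alternative to the indirect construction on the slide, intersection of two deterministic machines can also be built directly by running both in lockstep: each state of the new machine is a pair of states, one from each machine. The sketch below uses this product construction, which is a different technique from the one the slide names. The second machine (strings with an even number of a's) and the function names are assumptions for illustration.

```python
# Assumed deterministic /baa+!/ machine and a toy "even number of a's" machine.
SHEEP = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}
EVEN_A = {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1,
          (0, "!"): 0, (1, "!"): 1}

def intersect(t1, s1, f1, t2, s2, f2, alphabet):
    """Product construction over the reachable pairs of states."""
    start = (s1, s2)
    table, todo, seen = {}, [start], {start}
    while todo:
        p, q = todo.pop()
        for a in alphabet:
            # A move exists only if both machines can move on this symbol.
            if (p, a) in t1 and (q, a) in t2:
                nxt = (t1[(p, a)], t2[(q, a)])
                table[((p, q), a)] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
    accept = {(p, q) for (p, q) in seen if p in f1 and q in f2}
    return table, start, accept

def d_recognize(tape, table, start, accept):
    """Deterministic table-driven recognition."""
    state = start
    for symbol in tape:
        state = table.get((state, symbol))
        if state is None:
            return False
    return state in accept

both = intersect(SHEEP, 0, {4}, EVEN_A, 0, {0}, "ba!")
for s in ["baa!", "baaa!", "baaaa!"]:
    print(s, d_recognize(s, *both))
```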