CPSC 503 Computational Linguistics RegExps and Finite State Automata Lecture 2 Giuseppe Carenini 2/28/2019 CPSC503 Spring 2004
Survey Results: by student, by topic
Knowledge-Formalisms Map (including probabilistic formalisms). Formalisms: state machines and their probabilistic versions (finite state automata, finite state transducers, Markov models); rule systems and their probabilistic versions (e.g., (probabilistic) context-free grammars); logical formalisms (first-order logics); AI planners. Linguistic knowledge: morphology, syntax, semantics, pragmatics, discourse and dialogue. Notes: this is my conceptual map and the master plan for the course. I have added probabilistic models. We will go back to this map throughout the course.
Next Two Lectures: state machines (non-probabilistic). In the next two lectures we will learn about finite state automata (and regular expressions), finite state transducers, and English morphology.
Today 1/16: regular expressions; errors; finite-state automata (generation, recognition); non-determinism. Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata. Finite-state automata can be viewed as implementations of regular expressions.
Regular Expressions
  Def.: a notation to specify a set of strings. A string is a sequence of symbols (characters). Patterns are written between slashes (Perl notation) and are case sensitive.
  Simplest case: /CPSC503/
  [] disjunction of characters (a character class; matches one character in the defined class), [^] negation (matches one character that is not in the class): /CPSC50[34]/, /CPSC50[0-9]/, /CPSC50[^34]/
  . any character (to match a period, use \.)
  | “OR”: /([Ff]rom|[Ss]ubject|[Dd]ate)/
  Anchors: ^ (start of line), $ (end of line), \b (word boundary): /^([Ff]rom\b|[Ss]ubject\b|[Dd]ate\b)/
  For \b, a word is any sequence of digits, underscores, and letters.
  Question: what if I wanted to find all English words that have a q followed by something other than a u?
  Regular expressions are for searching and acting on what you find. Everybody does it: Emacs, vi, Perl, grep, etc.
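The character classes and anchors above can be tried out with a short sketch. It assumes Python's `re` module, whose syntax for these operators matches the Perl notation on the slide; the word list and header lines are made-up test data, not from the lecture.

```python
import re

# [34] matches one character from the set; [^34] matches one character not in it.
in_class = bool(re.search(r"CPSC50[34]", "CPSC503"))
not_in_class = bool(re.search(r"CPSC50[^34]", "CPSC503"))

# The q-not-followed-by-u question: [^u] still requires SOME character after q,
# so a word ending in q ("Iraq") is not found by this pattern.
words = ["quiet", "qat", "Iraqi", "Iraq"]  # hypothetical sample words
q_not_u = [w for w in words if re.search(r"q[^u]", w)]

# Anchored header search: ^ ties the match to the start of the line,
# \b requires a word boundary right after the keyword.
header = re.compile(r"^([Ff]rom\b|[Ss]ubject\b|[Dd]ate\b)")
lines = ["From: alice", "subject: hi", "Update from Bob", "Dates: none"]
headers = [ln for ln in lines if header.search(ln)]

print(in_class, not_in_class)
print(q_not_u)
print(headers)
```

Note that `/q[^u]/` is only a first attempt at the question: it misses words that end in q, which is exactly the kind of error discussed later in the lecture.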
Regular Expressions (cont.)
  ( ) grouping: /happy|ier/ vs. /happ(y|ier)/
  Operators are applied to the preceding item (a character or a parenthesized expression):
  ? optional: the preceding expression is allowed to appear but is not required, e.g. /colou?r/, /July? (fourth|4(th)?)/
  + one or more repetitions
  * any number of repetitions, including none
  {num} exactly num repetitions, e.g. /[0-9]+(\.[0-9]+){3}/
  The real power comes from the optional and counting elements.
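A small sketch of grouping, optionality, and counting, again assuming Python's `re` module. `re.fullmatch` is used so that each pattern must account for the whole string; the test strings are illustrative.

```python
import re

# Without parentheses, | splits the whole pattern: "happy" OR "ier".
ungrouped = bool(re.fullmatch(r"happy|ier", "happier"))
# With parentheses, the alternation applies only to the suffix.
grouped = bool(re.fullmatch(r"happ(y|ier)", "happier"))

# ? makes the preceding item optional: zero or one "u".
spellings = [w for w in ["color", "colour", "colouur"]
             if re.fullmatch(r"colou?r", w)]

# {3} repeats the preceding group exactly three times:
# one run of digits followed by three ".digits" groups.
dotted = re.compile(r"[0-9]+(\.[0-9]+){3}")
four_parts = bool(dotted.fullmatch("192.168.0.1"))
three_parts = bool(dotted.fullmatch("192.168.0"))

print(ungrouped, grouped)
print(spellings)
print(four_parts, three_parts)
```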
Example of Usage: Text Searching. Find me all instances of the determiner “the” in an English text, to count them or to substitute them with something else. You try, on the sentence: The other cop went to the bank but there were no people there. Candidate patterns: /the/, /[tT]he/, /\bthe\b/, /\b[tT]he\b/
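The four candidate patterns can be compared by counting their matches on the example sentence. This is a sketch using Python's `re.findall`; the variable names are illustrative.

```python
import re

text = "The other cop went to the bank but there were no people there."
patterns = [r"the", r"[tT]he", r"\bthe\b", r"\b[tT]he\b"]

# Count non-overlapping matches of each pattern in the sentence.
counts = {p: len(re.findall(p, text)) for p in patterns}

for pattern, n in counts.items():
    print(f"/{pattern}/ matches {n} time(s)")
```

Only the last pattern finds exactly the two determiners ("The" and "the"). The first two also match inside "other" and "there", and the third misses the capitalized "The".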
Errors: The process we just went through was based on fixing two kinds of errors. False positives: matching strings that we should not have matched (there, other). False negatives: not matching things that we should have matched (The).
Errors cont.: Reducing the error rate for an application often involves two antagonistic efforts: increasing accuracy (minimizing false positives) and increasing coverage (minimizing false negatives). We’ll be telling the same story for many tasks, all semester.
Finite State Automata: FSAs implement (generate and recognize) regular expressions, the notation for specifying a set of strings. Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata. Besides implementing regular expressions, FSAs have a wide variety of uses: they describe many linguistic phenomena, and FSAs and their close relatives are at the core of what we’ll be doing all semester.
FSAs as Graphs: Let’s start with the sheep language from the text: /baa+!/. The graph has a set of states, an initial state, and some accept states. How do you construct one given a regular expression? Intuitively, when you have a sequence of characters you have a sequence of states, and when you have a character class you have as many links from one node to the next as there are characters in the class.
Verify: It can generate the same set of strings (language). To generate a string, follow a path leading to an accept state and, at each transition, output the corresponding symbol.
Sheep FSA: We can say the following things about this machine. It has 5 states. At least b, a, and ! are in its alphabet. q0 is the start state. q4 is an accept state. It has 5 transitions.
But note: There are other machines that correspond to this language. More on this one later.
More Formally
  You can specify an FSA by enumerating the following things:
  The set of states: Q
  A finite alphabet: Σ
  A start state
  A set of accept/final states
  A transition function that maps Q × Σ to Q
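As a sketch, the five components can be written down directly as a data structure. The `FSA` class, the field names, and the dictionary encoding of the transition function are illustrative choices, not from the slides. The arcs shown are the usual sheep machine for /baa+!/, assumed here to match the slide's figure (5 states, 5 transitions, q0 start, q4 accept). Missing dictionary entries stand for "no transition".

```python
from dataclasses import dataclass


@dataclass
class FSA:
    states: frozenset     # Q
    alphabet: frozenset   # Sigma
    start: str            # start state
    accept: frozenset     # accept/final states
    delta: dict           # (state, symbol) -> state; absent key = no transition


# Assumed arcs for the sheep language /baa+!/.
SHEEP = FSA(
    states=frozenset({"q0", "q1", "q2", "q3", "q4"}),
    alphabet=frozenset("ba!"),
    start="q0",
    accept=frozenset({"q4"}),
    delta={
        ("q0", "b"): "q1",
        ("q1", "a"): "q2",
        ("q2", "a"): "q3",
        ("q3", "a"): "q3",  # the loop that implements a+
        ("q3", "!"): "q4",
    },
)

print(len(SHEEP.states), "states,", len(SHEEP.delta), "transitions")
```

Laid out with one row per state and one column per symbol, the `delta` dictionary is the same information as the state-transition table.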
Represented as a Table
About Alphabets: Don’t take that word too narrowly; it just means we need a finite set of symbols in the input. These symbols can and will stand for bigger objects that can have internal structure, so you can model facts about word combinations.
Dollars and Cents
Recognition: Def.: the process of determining if a string is in the language we’re defining with the machine. Or, equivalently, the process of determining if the equivalent regular expression matches a string.
Recognition Pseudocode
  Assume the input is on a tape.
  Start in the start state, pointing at the beginning of the tape.
  Examine the current input symbol and consult the table.
  If a transition is allowed, go to the new state and update the tape pointer; else fail.
  Repeat this process until you run out of tape.
  Now, if you are in an accept state, accept the string; otherwise fail.
  The state of the algorithm is a machine state and a pointer to the input.
D-Recognize
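A minimal Python sketch of the table-driven recognition steps above. The function name `d_recognize`, its parameters, and the dictionary form of the table are illustrative, and the sheep table is the usual machine for /baa+!/, assumed rather than copied from the slide.

```python
# Assumed transition table for the sheep language /baa+!/.
SHEEP_TABLE = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}


def d_recognize(tape, table, start, accept):
    """Return True if the deterministic machine accepts the whole tape."""
    state = start
    for symbol in tape:                      # advance the tape pointer
        nxt = table.get((state, symbol))     # consult the table
        if nxt is None:                      # no transition allowed: fail
            return False
        state = nxt
    return state in accept                   # out of tape: accept state?


for s in ["baa!", "baaaa!", "ba!", "baa"]:
    print(s, d_recognize(s, SHEEP_TABLE, "q0", {"q4"}))
```

The interpreter never mentions sheep: recognizing a different regular language only requires passing a different table.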
Key Points: D-recognize is a simple table-driven interpreter. Matching strings with regular expressions (à la Perl) is a matter of translating the expression into a machine (table) and passing the table to an interpreter. The algorithm is universal for all unambiguous regular languages: to change the machine, you change the table. Deterministic means that at each point in processing there is always one unique thing to do (no choices).
FSA: Generative Formalisms. FSAs can be viewed from two perspectives: as acceptors that can tell you if a string is in the language, and as generators that produce all and only the strings in the language.
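The generator view can be sketched by walking every path from the start state and emitting the symbols along it whenever an accept state is reached. Because the sheep language is infinite, this sketch stops at a length bound. The names `generate` and `max_len` and the sheep table are assumptions for illustration.

```python
from collections import deque

# Assumed transition table for the sheep language /baa+!/.
SHEEP_TABLE = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}


def generate(table, start, accept, max_len):
    """List every string of length <= max_len that the machine generates."""
    out = []
    queue = deque([(start, "")])   # (machine state, symbols output so far)
    while queue:
        state, prefix = queue.popleft()
        if state in accept:
            out.append(prefix)
        if len(prefix) < max_len:
            # Follow every arc leaving this state, outputting its symbol.
            for (src, symbol), dst in table.items():
                if src == state:
                    queue.append((dst, prefix + symbol))
    return out


print(generate(SHEEP_TABLE, "q0", {"q4"}, max_len=6))
```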
Non-Determinism
Non-Determinism cont.: Yet another technique: epsilon (ε) transitions. Key point: these transitions do not examine or advance the tape during recognition. We might not know whether to follow the epsilon transition or the ! arc.
Non-Deterministic Recognition: Key ideas. An input can lead to multiple paths, and the algorithm may need to explore all possible paths. Whenever there is a choice, one possibility is to explore the alternatives one at a time, saving the remaining alternatives in an agenda. For a non-deterministic machine, a string is accepted if there is a path through the machine that leads to a final state.
Non-Deterministic Recognition: Success occurs when a path is found through the machine that ends in an accept state. Failure occurs when none of the possible paths leads to an accept state.
Example: input tape b a a a !. All the states the automaton can go to at any given point, given the input, are saved in an agenda.
Recognition as Search: You can think of the process I have described as a state-space search in the space of reachable recognition states. Do not confuse them with machine states: a recognition state comprises a machine state and a pointer to the input tape.
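A hedged sketch of agenda-based non-deterministic recognition as state-space search. Each search state is a (machine state, tape index) pair, and the agenda holds the alternatives not yet explored. The function name `nd_recognize`, the use of `""` as the epsilon label, and the non-deterministic sheep table are illustrative assumptions, not the textbook's exact ND-recognize.

```python
# Assumed non-deterministic sheep machine: in q2, an "a" may stay or move on.
ND_SHEEP = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q2", "q3"},
    ("q3", "!"): {"q4"},
}


def nd_recognize(tape, table, start, accept):
    """Search over (machine state, tape index) pairs; "" labels epsilon arcs."""
    agenda = [(start, 0)]
    seen = set()                       # avoids re-exploring (and epsilon loops)
    while agenda:
        state, i = agenda.pop()        # LIFO agenda: depth-first search
        if (state, i) in seen:
            continue
        seen.add((state, i))
        if i == len(tape) and state in accept:
            return True                # some path ends in an accept state
        for nxt in table.get((state, ""), ()):        # epsilon: tape not advanced
            agenda.append((nxt, i))
        if i < len(tape):
            for nxt in table.get((state, tape[i]), ()):
                agenda.append((nxt, i + 1))
    return False                       # every path has been tried and failed


for s in ["baaa!", "baa!", "ba!", "baaa"]:
    print(s, nd_recognize(s, ND_SHEEP, "q0", {"q4"}))
```

Popping from the end of the list gives depth-first search; taking from the front instead would give breadth-first search over the same space.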
Equivalence between D and ND: Non-deterministic (ND) machines can always be converted to deterministic (D) ones with a fairly simple construction. That means that they have the same power: ND machines are not more powerful than D ones. It also means that one way to do recognition with an ND machine is to turn it into a D one.
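The slide does not name the construction. One standard choice is the subset construction, sketched here under the simplifying assumption that the machine has no epsilon transitions. Each deterministic state is the set of ND states the machine could be in. The names `determinize` and `ND_SHEEP` are illustrative.

```python
# Assumed non-deterministic sheep machine (no epsilon arcs).
ND_SHEEP = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q2", "q3"},
    ("q3", "!"): {"q4"},
}


def determinize(nd_table, start, accept, alphabet):
    """Subset construction for an ND machine without epsilon transitions."""
    start_set = frozenset({start})
    d_table, d_accept = {}, set()
    todo, seen = [start_set], {start_set}
    while todo:
        current = todo.pop()
        if current & accept:                 # contains an ND accept state
            d_accept.add(current)
        for symbol in alphabet:
            # Union of everything reachable from any member on this symbol.
            target = frozenset(
                dst for s in current for dst in nd_table.get((s, symbol), ())
            )
            if not target:
                continue                     # no arc: leave the entry absent
            d_table[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return d_table, start_set, d_accept


d_table, d_start, d_accept = determinize(ND_SHEEP, "q0", {"q4"}, "ba!")
print(len(d_table), "deterministic transitions")
```

In the worst case the number of subsets grows exponentially with the number of ND states, even though this small example stays the same size.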
Why Bother? Non-determinism doesn’t get us more formal power and it causes headaches, so why bother? It allows more natural solutions, and the deterministic machines produced by the construction can be too big.
Next Time: Read Chapter 1 (on-line) and Chapter 2 of the textbook. Try to understand the ND-recognize algorithm and why it is a state-space search algorithm.