Download presentation
Presentation is loading. Please wait.
1
CPSC 503 Computational Linguistics
RegExps and Finite State Automata Lecture 2 Giuseppe Carenini 2/28/2019 CPSC503 Spring 2004
2
Survey Results By Student By topic 2/28/2019 CPSC503 Spring 2004
3
Knowledge-Formalisms Map (including probabilistic formalisms)
State Machines (and prob. versions) (Finite State Automata,Finite State Transducers, Markov Models) Morphology Syntax Rule systems (and prob. versions) (e.g., (Prob.) Context-Free Grammars) Semantics My Conceptual map This is the master plan I have added probabilistic models We will go back to this throughout the course Pragmatics Discourse and Dialogue Logical formalisms (First-Order Logics) AI planners 2/28/2019 CPSC503 Spring 2004
4
Next Two Lectures State Machines (no prob.)
Finite State Automata (and Regular Expressions) Finite State Transducers (English) Morphology Logical formalisms (First-Order Logics) Rule systems (and prob. version) (e.g., (Prob.) Context-Free Grammars) Syntax Pragmatics Discourse and Dialogue Semantics AI planners The next two lectures will learn about Finite state automata (and Regular Expressions) Finite state transducers English morphology 2/28/2019 CPSC503 Spring 2004
5
Today 1/16 Regular Expressions Errors Finite-state automata Generation
Recognition Non-determinism Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata. Finite-state automata can be viewed as implementations of regular expressions 2/28/2019 CPSC503 Spring 2004
6
Regular Expressions Def. Notation to specify a set of strings
Simplest case: /CPSC503/ [] disjunction of characters, [^] negation /CPSC50[34]/, /CPSC50[0-9]/,/CPSC50[^34]/ . Any character (to match a period \.) | “OR” /([Ff]rom|[Ss]ubject|[Dd]ate)/ Searching and acting on what you find String: sequence of symbols (characters] / (Perl notation) Case sensitive Disjunction character classes (matches a character in the defined class) not the following (set of) character/s Question: if I wanted to find all English words that have a q followed by something other that a u Word any sequence of digits underscores and letters Everybody does it Emacs, vi, perl, grep, etc.. Anchors: ^ (start of of line), $ (end of line), \b (word boundary) /^([Ff]rom\b|[Ss]ubject\b|[Dd]ate\b)/ 2/28/2019 CPSC503 Spring 2004
7
Regular Expressions (cont.)
( ) Grouping: /happy|ier/ vs. /happ(y|ier)/ Operators applied to preceding item (character or exp.) ? Optional /colou?r/,/July? (fourth|4(th)?)/ Repetitions + one or more * any number including none {num} num times Real power comes from Optional and Counting elements Optional: preceding expressions is allowed to appear but it is not required /[0-9]+(\.[0-9]+){3}/ 2/28/2019 CPSC503 Spring 2004
8
Example of Usage: Text Searching
Find me all instances of the determiner “the” in an English text. To count them To substitute them with something else You try: /the/ The other cop went to the bank but there were no people there. /[tT]he/ /\bthe\b/ /\b[tT]he\b/ 2/28/2019 CPSC503 Spring 2004
9
Errors The process we just went through was based on fixing two kinds of errors Matching strings that we should not have matched (there, other) False positives Not matching things that we should have matched (The) False negatives 2/28/2019 CPSC503 Spring 2004
10
Errors cont. Reducing the error rate for an application often involves two antagonistic efforts: Increasing accuracy (minimizing false positives) Increasing coverage (minimizing false negatives). We’ll be telling the same story for may tasks, all semester. 2/28/2019 CPSC503 Spring 2004
11
(generate and recognize)
Finite State Automata implement (generate and recognize) Regular Expressions FSA describe Many Linguistic Phenomena FSAs and their close relatives are at the core of what we’ll be doing all semester. Reg Exp notation to specify a set of strings Besides implementing resular expression FSA have a wide variety of uses…. Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata. model 2/28/2019 CPSC503 Spring 2004
12
FSAs as Graphs Let’s start with the sheep language from the text: /baa+!/ A set of states Initial state and some accept states How to construct one given a regular expression? Intuitively when you have a sequence of character you have a sequence of states when you have a character class you have as many links from one node to the next As the characters in the class… 2/28/2019 CPSC503 Spring 2004
13
Verify It can generate the same set of strings (language)
To generate a string: follow a path leading to an accept state at each transition output corresponding symbol How to construct one given a regular expression? Intuitively when you have a sequence of character you have a sequence of states when you have a character class you have many links from one node to the ne 2/28/2019 CPSC503 Spring 2004
14
Sheep FSA We can say the following things about this machine
It has 5 states b,a, and ! are in its alphabet q0 is the start state q4 is an accept state It has 5 transitions 2/28/2019 CPSC503 Spring 2004
15
Sheep FSA We can say the following things about this machine
It has 5 states At least b,a, and ! are in its alphabet q0 is the start state q4 is an accept state It has 5 transitions 2/28/2019 CPSC503 Spring 2004
16
But note There are other machines that correspond to this language
More on this one later 2/28/2019 CPSC503 Spring 2004
17
More Formally You can specify an FSA by enumerating the following things. The set of states: Q A finite alphabet: Σ A start state A set of accept/final states A transition function that maps QxΣ to Q 2/28/2019 CPSC503 Spring 2004
18
Represented as a Table 2/28/2019 CPSC503 Spring 2004
19
About Alphabets Don’t take that word to narrowly; it just means we need a finite set of symbols in the input. These symbols can and will stand for bigger objects that can have internal structure. So you can model facts about word combinations 2/28/2019 CPSC503 Spring 2004
20
Dollars and Cents 2/28/2019 CPSC503 Spring 2004
21
Recognition Def. process of determining if a string is in the language we’re defining with the machine Or… it’s the process of determining if the equivalent regular expression matches a string 2/28/2019 CPSC503 Spring 2004
22
Recognition Pseudocode (slide)
Assume input on a tape Start in the start state pointing at the beginning of the tape Examine the current input symbol Consult the table (If a transition is allowed) Go to a new state and update the tape pointer (Else Fail). Repeat this process, until you run out of tape Now, if you are in an accept state accept the string otherwise Fail If a transition is allowed … State of the algorithm is a machine state and a pointer to the input 2/28/2019 CPSC503 Spring 2004
23
D-Recognize 2/28/2019 CPSC503 Spring 2004
24
Key Points D-recognize is a simple table-driven interpreter
Matching strings with regular expressions (ala Perl) is a matter of translating the expression into a machine (table) and passing the table to an interpreter ? The algorithm is universal for all unambiguous languages. To change the machine, you change the table. Deterministic means that at each point in processing there is always one unique thing to do (no choices). 2/28/2019 CPSC503 Spring 2004
25
FSA: Generative Formalisms
FSAs can be viewed from two perspectives: Acceptors that can tell you if a string is in the language Generators to produce all and only the strings in the language 2/28/2019 CPSC503 Spring 2004
26
Non-Determinism 2/28/2019 CPSC503 Spring 2004
27
Non-Determinism cont. Yet another technique Epsilon transitions
Key point: these transitions do not examine or advance the tape during recognition ε We might not know whether to follow the epsilon transition or the ! arc 2/28/2019 CPSC503 Spring 2004
28
Non-Deterministic Recognition Key ideas
An input can lead to multiple paths The algorithm may need to explore all possible paths Whenever there is a choice (one possibility) is to explore alternatives one at the time. Save alternatives in an agenda For deterministic: if there is a path trough the machine that leads to a final state 2/28/2019 CPSC503 Spring 2004
29
Non-Deterministic Recognition
Success occurs when a path is found through the machine that ends in an accept state Failure occurs when none of the possible paths lead to an accept state 2/28/2019 CPSC503 Spring 2004
30
Example (slide) b a a a ! \ 2/28/2019 CPSC503 Spring 2004
All the states the automaton can go at any given point, given the input are saved in an agenda b a a a ! \ 2/28/2019 CPSC503 Spring 2004
31
Recognition as Search 2/28/2019 CPSC503 Spring 2004
You can think of the process I have described as a search in the space of reachable recognition states Do not confuse them with machine states They comprise a machine state and a pointer to the input tape State-Space Search 2/28/2019 CPSC503 Spring 2004
32
Equivalence between D and ND
ND machines can always be converted to D ones That means that ND machines are not more powerful than D ones It also means that one way to do recognition with a ND machine is to turn it into a D one. Non-deterministic machines can be converted to deterministic ones with a fairly simple construction That means that they have the same power; non-deterministic machines are not more powerful than deterministic ones It also means that one way to do recognition with a non-deterministic machine is to turn it into a deterministic one. 2/28/2019 CPSC503 Spring 2004
33
Why Bother? Non-determinism doesn’t get us more formal power and it causes headaches so why bother? More natural solutions Machines based on construction are too big 2/28/2019 CPSC503 Spring 2004
34
Next Time Read Chapter 1 (on-line) and Chapter 2 of textbook
Try understand: ND-recognize algorithm and why it is a state-space search algorithm 2/28/2019 CPSC503 Spring 2004
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.