Presentation is loading. Please wait.

Presentation is loading. Please wait.

CPSC 503 Computational Linguistics

Similar presentations


Presentation on theme: "CPSC 503 Computational Linguistics"— Presentation transcript:

1 CPSC 503 Computational Linguistics
RegExps and Finite State Automata Lecture 2 Giuseppe Carenini 2/28/2019 CPSC503 Spring 2004

2 Survey Results By Student By topic 2/28/2019 CPSC503 Spring 2004

3 Knowledge-Formalisms Map (including probabilistic formalisms)
State Machines (and prob. versions) (Finite State Automata,Finite State Transducers, Markov Models) Morphology Syntax Rule systems (and prob. versions) (e.g., (Prob.) Context-Free Grammars) Semantics My Conceptual map This is the master plan I have added probabilistic models We will go back to this throughout the course Pragmatics Discourse and Dialogue Logical formalisms (First-Order Logics) AI planners 2/28/2019 CPSC503 Spring 2004

4 Next Two Lectures State Machines (no prob.)
Finite State Automata (and Regular Expressions) Finite State Transducers (English) Morphology Logical formalisms (First-Order Logics) Rule systems (and prob. version) (e.g., (Prob.) Context-Free Grammars) Syntax Pragmatics Discourse and Dialogue Semantics AI planners The next two lectures will learn about Finite state automata (and Regular Expressions) Finite state transducers English morphology 2/28/2019 CPSC503 Spring 2004

5 Today 1/16 Regular Expressions Errors Finite-state automata Generation
Recognition Non-determinism Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata. Finite-state automata can be viewed as implementations of regular expressions 2/28/2019 CPSC503 Spring 2004

6 Regular Expressions Def. Notation to specify a set of strings
Simplest case: /CPSC503/ [] disjunction of characters, [^] negation /CPSC50[34]/, /CPSC50[0-9]/,/CPSC50[^34]/ . Any character (to match a period \.) | “OR” /([Ff]rom|[Ss]ubject|[Dd]ate)/ Searching and acting on what you find String: sequence of symbols (characters] / (Perl notation) Case sensitive Disjunction character classes (matches a character in the defined class) not the following (set of) character/s Question: if I wanted to find all English words that have a q followed by something other that a u Word any sequence of digits underscores and letters Everybody does it Emacs, vi, perl, grep, etc.. Anchors: ^ (start of of line), $ (end of line), \b (word boundary) /^([Ff]rom\b|[Ss]ubject\b|[Dd]ate\b)/ 2/28/2019 CPSC503 Spring 2004

7 Regular Expressions (cont.)
( ) Grouping: /happy|ier/ vs. /happ(y|ier)/ Operators applied to preceding item (character or exp.) ? Optional /colou?r/,/July? (fourth|4(th)?)/ Repetitions + one or more * any number including none {num} num times Real power comes from Optional and Counting elements Optional: preceding expressions is allowed to appear but it is not required /[0-9]+(\.[0-9]+){3}/ 2/28/2019 CPSC503 Spring 2004

8 Example of Usage: Text Searching
Find me all instances of the determiner “the” in an English text. To count them To substitute them with something else You try: /the/ The other cop went to the bank but there were no people there. /[tT]he/ /\bthe\b/ /\b[tT]he\b/ 2/28/2019 CPSC503 Spring 2004

9 Errors The process we just went through was based on fixing two kinds of errors Matching strings that we should not have matched (there, other) False positives Not matching things that we should have matched (The) False negatives 2/28/2019 CPSC503 Spring 2004

10 Errors cont. Reducing the error rate for an application often involves two antagonistic efforts: Increasing accuracy (minimizing false positives) Increasing coverage (minimizing false negatives). We’ll be telling the same story for may tasks, all semester. 2/28/2019 CPSC503 Spring 2004

11 (generate and recognize)
Finite State Automata implement (generate and recognize) Regular Expressions FSA describe Many Linguistic Phenomena FSAs and their close relatives are at the core of what we’ll be doing all semester. Reg Exp notation to specify a set of strings Besides implementing resular expression FSA have a wide variety of uses…. Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata. model 2/28/2019 CPSC503 Spring 2004

12 FSAs as Graphs Let’s start with the sheep language from the text: /baa+!/ A set of states Initial state and some accept states How to construct one given a regular expression? Intuitively when you have a sequence of character you have a sequence of states when you have a character class you have as many links from one node to the next As the characters in the class… 2/28/2019 CPSC503 Spring 2004

13 Verify It can generate the same set of strings (language)
To generate a string: follow a path leading to an accept state at each transition output corresponding symbol How to construct one given a regular expression? Intuitively when you have a sequence of character you have a sequence of states when you have a character class you have many links from one node to the ne 2/28/2019 CPSC503 Spring 2004

14 Sheep FSA We can say the following things about this machine
It has 5 states b,a, and ! are in its alphabet q0 is the start state q4 is an accept state It has 5 transitions 2/28/2019 CPSC503 Spring 2004

15 Sheep FSA We can say the following things about this machine
It has 5 states At least b,a, and ! are in its alphabet q0 is the start state q4 is an accept state It has 5 transitions 2/28/2019 CPSC503 Spring 2004

16 But note There are other machines that correspond to this language
More on this one later 2/28/2019 CPSC503 Spring 2004

17 More Formally You can specify an FSA by enumerating the following things. The set of states: Q A finite alphabet: Σ A start state A set of accept/final states A transition function that maps QxΣ to Q 2/28/2019 CPSC503 Spring 2004

18 Represented as a Table 2/28/2019 CPSC503 Spring 2004

19 About Alphabets Don’t take that word to narrowly; it just means we need a finite set of symbols in the input. These symbols can and will stand for bigger objects that can have internal structure. So you can model facts about word combinations 2/28/2019 CPSC503 Spring 2004

20 Dollars and Cents 2/28/2019 CPSC503 Spring 2004

21 Recognition Def. process of determining if a string is in the language we’re defining with the machine Or… it’s the process of determining if the equivalent regular expression matches a string 2/28/2019 CPSC503 Spring 2004

22 Recognition Pseudocode (slide)
Assume input on a tape Start in the start state pointing at the beginning of the tape Examine the current input symbol Consult the table (If a transition is allowed) Go to a new state and update the tape pointer (Else Fail). Repeat this process, until you run out of tape Now, if you are in an accept state accept the string otherwise Fail If a transition is allowed … State of the algorithm is a machine state and a pointer to the input 2/28/2019 CPSC503 Spring 2004

23 D-Recognize 2/28/2019 CPSC503 Spring 2004

24 Key Points D-recognize is a simple table-driven interpreter
Matching strings with regular expressions (ala Perl) is a matter of translating the expression into a machine (table) and passing the table to an interpreter ? The algorithm is universal for all unambiguous languages. To change the machine, you change the table. Deterministic means that at each point in processing there is always one unique thing to do (no choices). 2/28/2019 CPSC503 Spring 2004

25 FSA: Generative Formalisms
FSAs can be viewed from two perspectives: Acceptors that can tell you if a string is in the language Generators to produce all and only the strings in the language 2/28/2019 CPSC503 Spring 2004

26 Non-Determinism 2/28/2019 CPSC503 Spring 2004

27 Non-Determinism cont. Yet another technique Epsilon transitions
Key point: these transitions do not examine or advance the tape during recognition ε We might not know whether to follow the epsilon transition or the ! arc 2/28/2019 CPSC503 Spring 2004

28 Non-Deterministic Recognition Key ideas
An input can lead to multiple paths The algorithm may need to explore all possible paths Whenever there is a choice (one possibility) is to explore alternatives one at the time. Save alternatives in an agenda For deterministic: if there is a path trough the machine that leads to a final state 2/28/2019 CPSC503 Spring 2004

29 Non-Deterministic Recognition
Success occurs when a path is found through the machine that ends in an accept state Failure occurs when none of the possible paths lead to an accept state 2/28/2019 CPSC503 Spring 2004

30 Example (slide) b a a a ! \ 2/28/2019 CPSC503 Spring 2004
All the states the automaton can go at any given point, given the input are saved in an agenda b a a a ! \ 2/28/2019 CPSC503 Spring 2004

31 Recognition as Search 2/28/2019 CPSC503 Spring 2004
You can think of the process I have described as a search in the space of reachable recognition states Do not confuse them with machine states They comprise a machine state and a pointer to the input tape State-Space Search 2/28/2019 CPSC503 Spring 2004

32 Equivalence between D and ND
ND machines can always be converted to D ones That means that ND machines are not more powerful than D ones It also means that one way to do recognition with a ND machine is to turn it into a D one. Non-deterministic machines can be converted to deterministic ones with a fairly simple construction That means that they have the same power; non-deterministic machines are not more powerful than deterministic ones It also means that one way to do recognition with a non-deterministic machine is to turn it into a deterministic one. 2/28/2019 CPSC503 Spring 2004

33 Why Bother? Non-determinism doesn’t get us more formal power and it causes headaches so why bother? More natural solutions Machines based on construction are too big 2/28/2019 CPSC503 Spring 2004

34 Next Time Read Chapter 1 (on-line) and Chapter 2 of textbook
Try understand: ND-recognize algorithm and why it is a state-space search algorithm 2/28/2019 CPSC503 Spring 2004


Download ppt "CPSC 503 Computational Linguistics"

Similar presentations


Ads by Google