CPSC 503 Computational Linguistics


CPSC 503 Computational Linguistics RegExps and Finite State Automata Lecture 2 Giuseppe Carenini 2/28/2019 CPSC503 Spring 2004

Survey Results (charts: by student, by topic)

Knowledge-Formalisms Map (including probabilistic formalisms)
Formalisms: state machines and their probabilistic versions (finite state automata, finite state transducers, Markov models); rule systems and their probabilistic versions (e.g., (probabilistic) context-free grammars); logical formalisms (first-order logics); AI planners.
Linguistic knowledge: morphology; syntax; semantics; pragmatics; discourse and dialogue.
Notes: This is my conceptual map and the master plan for the course; I have added probabilistic models. We will go back to it throughout the course.

Next Two Lectures
Highlighted on the knowledge-formalisms map: state machines without probabilities (finite state automata and regular expressions, finite state transducers) and (English) morphology.
In the next two lectures we will learn about finite state automata (and regular expressions), finite state transducers, and English morphology.

Today (1/16)
Regular expressions
Errors
Finite-state automata: generation, recognition, non-determinism
Notes: Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata. Finite-state automata can be viewed as implementations of regular expressions.

Regular Expressions
Def.: a notation to specify a set of strings, where a string is a sequence of symbols (characters). Used for searching and acting on what you find.
Simplest case: /CPSC503/ (the slashes are Perl notation; matching is case sensitive)
[] disjunction of characters (a character class: matches one character in the defined class), [^] negation (not the following set of characters): /CPSC50[34]/, /CPSC50[0-9]/, /CPSC50[^34]/
. any character (to match a period, use \.)
| “OR”: /([Ff]rom|[Ss]ubject|[Dd]ate)/
Anchors: ^ (start of line), $ (end of line), \b (word boundary): /^([Ff]rom\b|[Ss]ubject\b|[Dd]ate\b)/
A word is any sequence of digits, underscores and letters.
Question: what if I wanted to find all English words that have a q followed by something other than a u?
Everybody does it: Emacs, vi, perl, grep, etc.
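The notation on this slide can be tried out directly. The following is a minimal sketch in Python rather than Perl (an assumption; the slide only names Perl, Emacs, vi and grep), and the helper `found` and the sample strings are made up for illustration.

```python
import re

def found(pattern, text):
    """True if the pattern matches anywhere in the text, like Perl's /pattern/."""
    return re.search(pattern, text) is not None

# Character classes and negation.
print(found(r"CPSC50[34]", "room for CPSC504"))   # True
print(found(r"CPSC50[^34]", "CPSC503"))           # False: 3 is excluded

# A bare period matches any character; an escaped one matches only '.'.
print(found(r"CPSC50.", "CPSC50x"))               # True
print(found(r"CPSC50\.", "CPSC50x"))              # False

# Disjunction with anchors: header lines that start with a whole word.
HEADER = r"^([Ff]rom\b|[Ss]ubject\b|[Dd]ate\b)"
print(found(HEADER, "Subject: hi"))               # True
print(found(HEADER, "subjective view"))           # False: no word boundary

# One possible answer to the question: q followed by a character other than u.
words = ["quick", "Iraqi", "qintar", "queen"]
print([w for w in words if found(r"q[^u]", w)])   # ['Iraqi', 'qintar']
```

Note that /q[^u]/ requires some character after the q, so a word ending in q would not be found by this pattern.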

Regular Expressions (cont.)
( ) Grouping: /happy|ier/ vs. /happ(y|ier)/
Operators are applied to the preceding item (a character or an expression):
? Optional: the preceding expression is allowed to appear but is not required. /colou?r/, /July? (fourth|4(th)?)/
Repetitions: + one or more, * any number including none, {num} num times
The real power comes from the optional and counting elements: /[0-9]+(\.[0-9]+){3}/
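To see what grouping, optionality and counting change, here is a small sketch in Python (an assumed language; the helper `full` and the test strings are hypothetical). It uses whole-string matching so that the effect of each operator is visible.

```python
import re

def full(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# Without grouping, | splits the whole expression: "happy" OR "ier".
print(full(r"happy|ier", "happier"))      # False
print(full(r"happy|ier", "ier"))          # True
# With grouping, the alternatives are only the endings.
print(full(r"happ(y|ier)", "happier"))    # True

# ? makes the preceding item optional.
print(full(r"colou?r", "color"), full(r"colou?r", "colour"))   # True True
print(full(r"July? (fourth|4(th)?)", "Jul 4"))                 # True
print(full(r"July? (fourth|4(th)?)", "July 4th"))              # True

# + and {num}: one or more digits, then exactly three ".digits" groups.
DOTTED = r"[0-9]+(\.[0-9]+){3}"
print(full(DOTTED, "192.168.0.1"))        # True
print(full(DOTTED, "192.168.0"))          # False: only two groups
```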

Example of Usage: Text Searching
Find me all instances of the determiner “the” in an English text, to count them or to substitute them with something else.
You try, on: The other cop went to the bank but there were no people there.
/the/
/[tT]he/
/\bthe\b/
/\b[tT]he\b/
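The four attempts can be compared by counting what each one matches in the example sentence. This is a sketch in Python (an assumed language; `count_matches` is a hypothetical helper). The sentence contains the determiner twice, as "The" and "the".

```python
import re

TEXT = "The other cop went to the bank but there were no people there."

def count_matches(pattern, text):
    """Number of non-overlapping matches of the pattern in the text."""
    return len(re.findall(pattern, text))

for pattern in [r"the", r"[tT]he", r"\bthe\b", r"\b[tT]he\b"]:
    print(pattern, count_matches(pattern, TEXT))
# the 4          (other, the, there, there; misses The)
# [tT]he 5       (adds The, still matches inside other words)
# \bthe\b 1      (whole word only, misses The)
# \b[tT]he\b 2   (The and the)

# Substituting instead of counting:
print(re.sub(r"\b[tT]he\b", "DET", TEXT))
```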

Errors
The process we just went through was based on fixing two kinds of errors:
False positives: matching strings that we should not have matched (there, other)
False negatives: not matching things that we should have matched (The)

Errors cont.
Reducing the error rate for an application often involves two antagonistic efforts:
Increasing accuracy (minimizing false positives)
Increasing coverage (minimizing false negatives)
We’ll be telling the same story for many tasks, all semester.

Finite State Automata
FSAs implement (generate and recognize) regular expressions, which are a notation to specify a set of strings.
FSAs describe (model) many linguistic phenomena.
FSAs and their close relatives are at the core of what we’ll be doing all semester.
Notes: Besides implementing regular expressions, FSAs have a wide variety of uses. Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata.

FSAs as Graphs
Let’s start with the sheep language from the text: /baa+!/
An FSA has a set of states, an initial state and some accept states.
How do you construct one given a regular expression? Intuitively, when you have a sequence of characters you have a sequence of states, and when you have a character class you have as many links from one node to the next as there are characters in the class.

Verify
It can generate the same set of strings (language).
To generate a string: follow a path leading to an accept state and, at each transition, output the corresponding symbol.

Sheep FSA
We can say the following things about this machine:
It has 5 states
At least b, a, and ! are in its alphabet
q0 is the start state
q4 is an accept state
It has 5 transitions

But note
There are other machines that correspond to this language.
More on this one later.

More Formally
You can specify an FSA by enumerating the following things:
The set of states: Q
A finite alphabet: Σ
A start state
A set of accept/final states
A transition function that maps Q×Σ to Q

Represented as a Table
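The table on this slide did not survive in the transcript. The following sketch in Python (an assumed language) rebuilds one plausible table for /baa+!/ from what the Sheep FSA slide states: 5 states, q0 as start state, q4 as accept state, 5 transitions. The names `SHEEP_TABLE` and `render_table` are made up for illustration, and "-" marks a missing transition.

```python
# Transition table: state -> {input symbol -> next state}.
# Reconstructed from the slide text, not copied from the original figure.
SHEEP_TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},   # loop on a, leave on !
    4: {},                  # accept state, no outgoing arcs
}
START = 0
ACCEPT = {4}

def render_table(table, symbols, accept):
    """Format the table as text, one row per state; ':' marks accept states."""
    lines = ["state " + " ".join(s.rjust(3) for s in symbols)]
    for state in sorted(table):
        mark = ":" if state in accept else " "
        cells = " ".join(str(table[state].get(s, "-")).rjust(3) for s in symbols)
        lines.append(f"q{state}{mark}   " + cells)
    return "\n".join(lines)

print(render_table(SHEEP_TABLE, "ba!", ACCEPT))
```

Keeping the machine as plain data is the point of the table view: the same interpreter can run any machine once it is given a different table.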

About Alphabets
Don’t take that word too narrowly; it just means we need a finite set of symbols in the input.
These symbols can and will stand for bigger objects that can have internal structure.
So you can model facts about word combinations.

Dollars and Cents

Recognition
Def.: the process of determining if a string is in the language we’re defining with the machine.
Or: it’s the process of determining if the equivalent regular expression matches a string.

Recognition Pseudocode (slide)
Assume the input is on a tape.
Start in the start state, pointing at the beginning of the tape.
Examine the current input symbol.
Consult the table: if a transition is allowed, go to a new state and update the tape pointer; else fail.
Repeat this process until you run out of tape.
Now, if you are in an accept state, accept the string; otherwise fail.
Note: the state of the algorithm is a machine state and a pointer to the input.

D-Recognize
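The algorithm figure for this slide is missing from the transcript. The tape-and-table procedure described in the recognition pseudocode can be sketched as follows in Python (an assumed language). The function name `d_recognize` follows the slide title; its signature and the sheep table are assumptions built from the 5 states and 5 transitions listed on the Sheep FSA slide.

```python
# Assumed table for /baa+!/: state -> {input symbol -> next state}.
SHEEP_TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}

def d_recognize(tape, table, start, accept):
    """Deterministic recognition: walk the tape, consulting the table."""
    state = start
    for symbol in tape:                       # the loop index is the tape pointer
        next_state = table.get(state, {}).get(symbol)
        if next_state is None:                # no transition allowed: fail
            return False
        state = next_state
    return state in accept                    # out of tape: accept or fail

for s in ["baa!", "baaaa!", "ba!", "baa", "baa!a"]:
    print(s, d_recognize(s, SHEEP_TABLE, 0, {4}))
# baa! True, baaaa! True, ba! False, baa False, baa!a False
```

The function never mentions sheep: changing the machine means passing a different table, start state and accept set.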

Key Points
D-recognize is a simple table-driven interpreter.
Matching strings with regular expressions (à la Perl) is a matter of translating the expression into a machine (table) and passing the table to an interpreter.
The algorithm is universal for all unambiguous regular languages: to change the machine, you change the table.
Deterministic means that at each point in processing there is always one unique thing to do (no choices).

FSA: Generative Formalisms
FSAs can be viewed from two perspectives:
Acceptors that can tell you if a string is in the language
Generators that produce all and only the strings in the language
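The generator view can be sketched by following every path from the start state and emitting the symbols along the way. The sketch below is in Python (an assumed language). The sheep language is infinite, so the hypothetical function `generate` takes a length bound; the table is assumed from the description of the sheep machine (5 states, 5 transitions, q4 accepting).

```python
from collections import deque

# Assumed table for /baa+!/: state -> {input symbol -> next state}.
SHEEP_TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}

def generate(table, start, accept, max_len):
    """Yield every accepted string of length <= max_len, shortest first."""
    queue = deque([(start, "")])              # (machine state, output so far)
    while queue:
        state, output = queue.popleft()
        if state in accept:                   # the path ends in an accept state
            yield output
        if len(output) < max_len:
            for symbol, next_state in table.get(state, {}).items():
                queue.append((next_state, output + symbol))

print(list(generate(SHEEP_TABLE, 0, {4}, 6)))
# ['baa!', 'baaa!', 'baaaa!']
```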

Non-Determinism

Non-Determinism cont.
Yet another technique: epsilon (ε) transitions.
Key point: these transitions do not examine or advance the tape during recognition.
We might not know whether to follow the epsilon transition or the ! arc.

Non-Deterministic Recognition
Key ideas:
An input can lead to multiple paths.
The algorithm may need to explore all possible paths.
Whenever there is a choice, one possibility is to explore the alternatives one at a time, saving the other alternatives in an agenda.
The string is accepted if there is a path through the machine that leads to a final state.

Non-Deterministic Recognition
Success occurs when a path is found through the machine that ends in an accept state.
Failure occurs when none of the possible paths lead to an accept state.

Example (slide)
Input tape: b a a a ! \
All the states the automaton can go to at any given point, given the input, are saved in an agenda.

Recognition as Search
You can think of the process I have described as a search in the space of reachable recognition states (state-space search).
Do not confuse recognition states with machine states: a recognition state comprises a machine state and a pointer to the input tape.
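The agenda-based search can be sketched as follows in Python (an assumed language). Both machines are assumptions, since the non-determinism figures are missing from the transcript: one has two a-arcs leaving the same state, the other uses an epsilon arc. The name `nd_recognize` follows the algorithm named on the last slide; the empty string standing for epsilon is a choice made here.

```python
EPSILON = ""   # key used for arcs that do not consume input

# Assumed non-deterministic machines for /baa+!/:
# state -> {symbol -> list of possible next states}.
ND_SHEEP = {0: {"b": [1]}, 1: {"a": [2]}, 2: {"a": [2, 3]}, 3: {"!": [4]}, 4: {}}
EPS_SHEEP = {0: {"b": [1]}, 1: {"a": [2]}, 2: {"a": [3]},
             3: {"!": [4], EPSILON: [2]}, 4: {}}

def nd_recognize(tape, table, start, accept):
    """Search the space of (machine state, tape position) pairs."""
    agenda = [(start, 0)]                 # search states still to explore
    seen = set(agenda)                    # avoid exploring a search state twice
    while agenda:
        state, pos = agenda.pop()         # LIFO gives depth-first search
        if pos == len(tape) and state in accept:
            return True                   # some path ends in an accept state
        arcs = table.get(state, {})
        # Epsilon arcs move to a new machine state without advancing the tape.
        successors = [(nxt, pos) for nxt in arcs.get(EPSILON, [])]
        if pos < len(tape):
            successors += [(nxt, pos + 1) for nxt in arcs.get(tape[pos], [])]
        for item in successors:
            if item not in seen:
                seen.add(item)
                agenda.append(item)
    return False                          # agenda exhausted: no path works

print(nd_recognize("baaa!", ND_SHEEP, 0, {4}))    # True
print(nd_recognize("baaa!", EPS_SHEEP, 0, {4}))   # True
print(nd_recognize("ba!", ND_SHEEP, 0, {4}))      # False
```

Popping from the end of the list explores alternatives depth-first; taking items from the front instead would give breadth-first search over the same space.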

Equivalence between D and ND
Non-deterministic (ND) machines can always be converted to deterministic (D) ones with a fairly simple construction.
That means that they have the same power: ND machines are not more powerful than D ones.
It also means that one way to do recognition with an ND machine is to turn it into a D one.
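The slide does not say which construction it has in mind. A standard one is the subset construction, where each state of the deterministic machine is a set of states of the non-deterministic one. A minimal sketch in Python (an assumed language), with a hypothetical ND sheep machine and hypothetical function names:

```python
EPSILON = ""   # key used for arcs that do not consume input

# Assumed ND machine for /baa+!/: state -> {symbol -> list of next states}.
ND_SHEEP = {0: {"b": [1]}, 1: {"a": [2]}, 2: {"a": [2, 3]}, 3: {"!": [4]}, 4: {}}

def epsilon_closure(states, table):
    """All states reachable from the given ones through epsilon arcs only."""
    stack, closure = list(states), set(states)
    while stack:
        for nxt in table.get(stack.pop(), {}).get(EPSILON, []):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return frozenset(closure)

def determinize(table, start, accept):
    """Subset construction: returns (d_table, d_start, d_accept)."""
    d_start = epsilon_closure({start}, table)
    d_table, d_accept, todo = {}, set(), [d_start]
    while todo:
        subset = todo.pop()
        if subset in d_table:
            continue
        d_table[subset] = {}
        if subset & set(accept):          # contains an ND accept state
            d_accept.add(subset)
        symbols = {s for q in subset for s in table.get(q, {}) if s != EPSILON}
        for symbol in symbols:
            targets = {t for q in subset for t in table.get(q, {}).get(symbol, [])}
            d_table[subset][symbol] = epsilon_closure(targets, table)
            todo.append(d_table[subset][symbol])
    return d_table, d_start, d_accept

def accepts(tape, d_table, d_start, d_accept):
    """Run the resulting deterministic machine on the tape."""
    state = d_start
    for symbol in tape:
        if symbol not in d_table[state]:
            return False
        state = d_table[state][symbol]
    return state in d_accept

D_TABLE, D_START, D_ACCEPT = determinize(ND_SHEEP, 0, {4})
print(len(D_TABLE))                                   # 5 deterministic states
print(accepts("baaa!", D_TABLE, D_START, D_ACCEPT))   # True
```

For this small machine the result has 5 states, but in general a machine with n states can produce up to 2^n subsets.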

Why Bother?
Non-determinism doesn’t get us more formal power and it causes headaches, so why bother?
More natural solutions
Deterministic machines built by the construction can be too big

Next Time
Read Chapter 1 (on-line) and Chapter 2 of the textbook.
Try to understand the ND-recognize algorithm and why it is a state-space search algorithm.