CSC 9010 NLP - Regex, Finite State Automata

Presentation transcript:

CSC 9010 Natural Language Processing, Lecture 2: Regular Expressions, Finite State Automata. Paula Matuszek, Mary-Angela Papalaskari. Presentation slides adapted from Jim Martin’s course: http://www.cs.colorado.edu/~martin/csci5832.html (12/3/2018, CSC 9010- NLP - Regex, Finite State Automata)

Regular Expressions and Text Searching: everybody does it (Emacs, vi, perl, grep, etc.).

Example: find me all instances of the word “the” in a text. Three successive patterns: /the/, /[tT]he/, /\b[tT]he\b/
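The three patterns above can be tried out directly. The following is a minimal sketch using Python's standard `re` module; the sample sentence and the dictionary keys are made up for illustration and are not from the slides.

```python
import re

# Hypothetical sample text, chosen to contain "The", "the",
# and words that merely contain "the" (other, there, then).
text = "The other one is there, then the end."

patterns = {
    "plain": r"the",               # misses "The", hits other/there/then
    "either_case": r"[tT]he",      # now finds "The", still hits other/there/then
    "word_bounded": r"\b[tT]he\b", # only the standalone word
}

counts = {name: len(re.findall(p, text)) for name, p in patterns.items()}

for name, n in counts.items():
    print(name, n)
# plain 4
# either_case 5
# word_bounded 2
```

Each refinement trades one kind of error for another: the second pattern removes a miss, and the third removes the spurious hits inside longer words.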

Two Kinds of Errors. False positives: matching strings that we should not have matched (there, then, other). False negatives: not matching things that we should have matched (The).

Two Antagonistic Goals: accuracy (minimize false positives) and coverage (minimize false negatives).

Idealized machines for processing regular expressions. Example: /baa+!/

Idealized machines for processing regular expressions. Example: /baa+!/ (5 states, 5 transitions, alphabet?, initial state, accept state)

More examples:

Another FSA for the same language:

Formally Specifying an FSA
The set of states: Q
A finite alphabet: Σ
A start state
A set of accept/final states
A transition function that maps Q × Σ to Q
(Discuss alphabets: not too narrow! Do example.)

State transition table (rows are states, columns are input symbols, "." means no transition):

State   b   a   !
0       1   .   .
1       .   2   .
2       .   3   .
3       .   3   4
4       .   .   .
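The five components and the table can be written down as a small data structure. This is a sketch only: the class name `FSA`, its field names, and the helper `table_rows` are illustrative choices, and state 4 is taken as the single accept state of the /baa+!/ machine.

```python
from dataclasses import dataclass

@dataclass
class FSA:
    states: frozenset    # Q
    alphabet: frozenset  # Sigma
    start: int           # start state
    accept: frozenset    # accept/final states
    delta: dict          # transition function: (state, symbol) -> state

# The /baa+!/ machine from the transition table.
SHEEP = FSA(
    states=frozenset({0, 1, 2, 3, 4}),
    alphabet=frozenset("ba!"),
    start=0,
    accept=frozenset({4}),
    delta={(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4},
)

def table_rows(fsa, symbols=("b", "a", "!")):
    """Render one text row per state, with '.' for a missing transition."""
    rows = []
    for q in sorted(fsa.states):
        cells = [str(fsa.delta.get((q, s), ".")) for s in symbols]
        rows.append(f"{q} " + " ".join(cells))
    return rows

for row in table_rows(SHEEP):
    print(row)
```

Storing only the defined transitions in `delta` keeps the "." entries implicit, which mirrors how the table is read: a missing entry means the machine has nowhere to go.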

Dollars and Cents

Recognition. Recognition is the process of determining if a string should be accepted by a machine. Or… it’s the process of determining if a string is in the language we’re defining with the machine. Or… it’s the process of determining if a regular expression matches a string.

Turing’s way of Visualizing Recognition

Recognition. Begin in the start state. Examine the current input. Consult the table. Go to a new state and update the tape pointer. When you run out of tape: if in an accepting state, accept the input; else reject the input.

D-Recognize
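The recognition steps can be sketched as a short table-driven loop. The function name `d_recognize` and the dictionary encoding of the table are assumptions made for this example; the transitions are those of the /baa+!/ machine, and a missing table entry is treated as an immediate reject.

```python
# Transition table for /baa+!/: (state, symbol) -> next state.
SHEEP_TABLE = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

def d_recognize(tape, table, start=0, accept=frozenset({4})):
    """Deterministic recognition: one table lookup per input symbol."""
    state = start                      # begin in the start state
    for symbol in tape:                # examine the current input
        nxt = table.get((state, symbol))   # consult the table
        if nxt is None:                # no entry: nothing to do, so reject
            return False
        state = nxt                    # go to the new state, advance the tape
    return state in accept             # out of tape: accept iff in accept state

print(d_recognize("baaa!", SHEEP_TABLE))  # True
print(d_recognize("ba!", SHEEP_TABLE))    # False
```

Only `SHEEP_TABLE` is specific to the language; the loop itself does not change when the machine does.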

Key Points. Deterministic means that at each point in processing there is always one unique thing to do (no choices). D-recognize is a simple table-driven interpreter. The algorithm is universal for all unambiguous regular languages. To change the machine, you change the table.

Key Points. Crudely therefore… matching strings with regular expressions is a matter of translating the expression into a machine (table) and passing the table to an interpreter.

Recognition as Search. You can view this algorithm as a degenerate kind of state-space search. States are pairings of tape positions and state numbers. Operators are compiled into the table. The goal state is a pairing of the end-of-tape position with a final accept state. It’s degenerate because?

Generative Formalisms. Formal languages are sets of strings composed of symbols from a finite set of symbols. Finite-state automata define formal languages (without having to enumerate all the strings in the language). The term generative is based on the view that you can run the machine as a generator to get strings from the language.

Generative Formalisms. FSAs can be viewed from two perspectives: as acceptors that can tell you if a string is in the language, and as generators that produce all and only the strings in the language.
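To illustrate the generator view, the same table used for acceptance can be walked forwards to enumerate strings. This is a sketch under assumptions: the function name `generate` and the length bound `max_len` are introduced here (the language is infinite, so some cutoff is needed), and the machine is again /baa+!/.

```python
from collections import deque

SHEEP_TABLE = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

def generate(table, start, accept, max_len):
    """Breadth-first walk of the machine, emitting every accepted string
    of length <= max_len, shortest first."""
    results = []
    queue = deque([(start, "")])
    while queue:
        state, prefix = queue.popleft()
        if state in accept:
            results.append(prefix)
        if len(prefix) == max_len:
            continue                   # do not extend past the bound
        for (src, symbol), dst in sorted(table.items()):
            if src == state:
                queue.append((dst, prefix + symbol))
    return results

print(generate(SHEEP_TABLE, 0, {4}, 6))
# ['baa!', 'baaa!', 'baaaa!']
```

Every string produced this way is accepted by the recognizer, and every accepted string of bounded length is produced, which is the "all and only" property in miniature.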

Review. Regular expressions are just a compact textual representation of FSAs. Recognition is the process of determining if a string/input is in the language defined by some machine. Recognition is straightforward with deterministic machines.

Three Views. Three equivalent formal ways to look at what we’re up to (not including tables): regular expressions, finite state automata, regular languages. Mention: machine (Turing), production systems (Post), regular sets (Kleene).

Defining Languages with Productions

Regular language:
S → b a a A
A → a A
A → !

Regular?
S → NP VP
NP → PrNoun
NP → Det Noun
Det → a | the
Noun → cat | dog | book
PrNoun → samantha | elmer | fido
VP → IVerb | TVerb NP
IVerb → ran | slept | ate
TVerb → hit | kissed | ate
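The two grammars can be encoded as dictionaries and explored in code. This sketch introduces its own names (`SHEEP_RULES`, `ENGLISH_RULES`, `derive`, `is_right_linear`); it derives strings from the first grammar up to a length bound and checks whether each grammar is written in right-linear form, where a nonterminal may appear only as the last symbol of a right-hand side.

```python
SHEEP_RULES = {
    "S": [["b", "a", "a", "A"]],
    "A": [["a", "A"], ["!"]],
}

ENGLISH_RULES = {
    "S": [["NP", "VP"]],
    "NP": [["PrNoun"], ["Det", "Noun"]],
    "Det": [["a"], ["the"]],
    "Noun": [["cat"], ["dog"], ["book"]],
    "PrNoun": [["samantha"], ["elmer"], ["fido"]],
    "VP": [["IVerb"], ["TVerb", "NP"]],
    "IVerb": [["ran"], ["slept"], ["ate"]],
    "TVerb": [["hit"], ["kissed"], ["ate"]],
}

def derive(rules, start="S", max_len=6):
    """Expand the leftmost nonterminal breadth-first; keep forms whose
    terminal count stays within max_len."""
    results, agenda = [], [[start]]
    while agenda:
        form = agenda.pop(0)
        idx = next((i for i, sym in enumerate(form) if sym in rules), None)
        if idx is None:                      # all terminals: a finished string
            results.append("".join(form))
            continue
        for rhs in rules[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            if sum(1 for s in new_form if s not in rules) <= max_len:
                agenda.append(new_form)
    return results

def is_right_linear(rules):
    """True if nonterminals occur only in the final position of each rule."""
    return all(
        not any(sym in rules for sym in rhs[:-1])
        for rhs_list in rules.values()
        for rhs in rhs_list
    )

print(derive(SHEEP_RULES))             # ['baa!', 'baaa!', 'baaaa!']
print(is_right_linear(SHEEP_RULES))    # True
print(is_right_linear(ENGLISH_RULES))  # False
```

Note that `is_right_linear` tests the form of the grammar, not the language. The second grammar is not right-linear as written, yet it has no recursion, so it generates only finitely many sentences, and every finite language is regular.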

Non-Determinism. Compare:

Non-Determinism, cont. Epsilon (ε) transitions. Note: these transitions do not examine or advance the tape during recognition.

Are Non-deterministic FSAs more powerful? Non-deterministic machines can be converted to deterministic ones with a fairly simple construction. One way to do recognition with a non-deterministic machine is to turn it into a deterministic one.
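One standard conversion is the subset construction, in which each deterministic state stands for a set of non-deterministic states. The slides do not say which construction they have in mind, so the sketch below is an assumption, as are the function names and the example machine: a /baa+!/ NFA with one ε-arc from state 3 back to state 2, with ε encoded as the empty string.

```python
# Hypothetical NFA for /baa+!/: (state, symbol) -> set of next states.
# The key (3, "") is an epsilon arc from state 3 back to state 2.
NFA = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {3}, (3, ""): {2}, (3, "!"): {4}}

def epsilon_closure(states, nfa):
    """All states reachable from `states` using only epsilon arcs."""
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in nfa.get((q, ""), ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return frozenset(closure)

def determinize(nfa, start, accept, alphabet):
    """Subset construction: returns (dfa_table, dfa_start, dfa_accept)."""
    start_set = epsilon_closure({start}, nfa)
    dfa, seen, todo = {}, {start_set}, [start_set]
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            moved = set()
            for q in current:
                moved |= set(nfa.get((q, symbol), ()))
            if not moved:
                continue               # no transition on this symbol
            target = epsilon_closure(moved, nfa)
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    dfa_accept = {s for s in seen if s & set(accept)}
    return dfa, start_set, dfa_accept

def accepts(dfa, start, accept, tape):
    state = start
    for symbol in tape:
        state = dfa.get((state, symbol))
        if state is None:
            return False
    return state in accept

DFA, DFA_START, DFA_ACCEPT = determinize(NFA, 0, {4}, "ba!")
print(accepts(DFA, DFA_START, DFA_ACCEPT, "baaa!"))  # True
print(accepts(DFA, DFA_START, DFA_ACCEPT, "ba!"))    # False
```

For this small machine the result has five states, no more than the NFA. In the worst case the construction can produce exponentially many subset states.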

Non-Deterministic Recognition. In an ND FSA there exists at least one path through the machine for a string that is in the language defined by the machine. But not all paths directed through the machine for an accepted string lead to an accept state. No paths through the machine lead to an accept state for a string not in the language.

Non-Deterministic Recognition. So success in non-deterministic recognition occurs when a path is found through the machine that ends in an accept state. Failure occurs when none of the possible paths lead to an accept state.

Example
Tape:   b  a  a  a  !
States: q0 q1 q2 q2 q3 q4

Example (continued over eight figure-only slides)

Key Points. States in the search space are pairings of tape positions and states in the machine. By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input.

ND-Recognize Code
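The slide's code is not in the transcript, so what follows is a reconstruction in spirit only: an agenda-based search over (machine state, tape position) pairs. The function name, the use of the empty string for ε, and the `visited` set are choices made for this sketch. The example NFA has a self-loop on state 2, consistent with the path q0 q1 q2 q2 q3 q4 for the tape b a a a !.

```python
# Hypothetical NFA for /baa+!/ with a non-deterministic choice at state 2.
NFA = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}

def nd_recognize(tape, nfa, start=0, accept=frozenset({4})):
    """Search the space of (machine state, tape position) pairs."""
    agenda = [(start, 0)]              # as yet unexplored search states
    visited = set()                    # guards against revisiting a pair
    while agenda:
        state, pos = agenda.pop()      # LIFO agenda gives depth-first search
        if (state, pos) in visited:
            continue
        visited.add((state, pos))
        if pos == len(tape) and state in accept:
            return True                # some path ends in an accept state
        for nxt in nfa.get((state, ""), ()):      # epsilon: tape not advanced
            agenda.append((nxt, pos))
        if pos < len(tape):
            for nxt in nfa.get((state, tape[pos]), ()):
                agenda.append((nxt, pos + 1))
    return False                       # every path has been exhausted

print(nd_recognize("baaa!", NFA))  # True
print(nd_recognize("ba!", NFA))    # False
```

Swapping `agenda.pop()` for `agenda.pop(0)` would turn the search breadth-first. The `visited` set is what keeps a cycle of ε-arcs from being followed forever, since such arcs change the state without consuming tape.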

Infinite Search. If you’re not careful, such searches can go into an infinite loop. How?

Why Bother? Non-determinism doesn’t get us more formal power and it causes headaches, so why bother? More natural solutions. Machines based on construction are too big.

Compositional Machines. Formal languages are just sets of strings. Therefore, we can talk about various set operations (intersection, union, concatenation). This turns out to be a useful exercise.

Union: accept a string in either of two languages.

Concatenation: accept a string consisting of a string from language L1 followed by a string from language L2.

Negation. Construct a machine M2 to accept all strings not accepted by machine M1 and reject all the strings accepted by M1. Invert all the accept and not-accept states in M1. Does that work for non-deterministic machines?
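A sketch of the negation construction for a deterministic machine follows. One detail has to be made explicit: the table must be total before the accept states are flipped, so missing entries are first routed to a non-accepting sink state. The names `complete`, `complement`, and `"sink"` are assumptions of this example. On the slide's closing question: flipping accept states is not sound for non-deterministic machines in general, because an NFA accepts when any one path accepts; determinizing first avoids the problem.

```python
SHEEP_TABLE = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

def complete(table, states, alphabet, sink="sink"):
    """Make the table total by sending every missing entry to a sink state."""
    all_states = set(states) | {sink}
    full = {(q, s): table.get((q, s), sink) for q in all_states for s in alphabet}
    return full, all_states

def complement(table, states, alphabet, accept):
    """Return (table, states, accept) for the machine accepting the opposite set."""
    full, all_states = complete(table, states, alphabet)
    return full, all_states, all_states - set(accept)

def accepts(table, start, accept, tape):
    state = start
    for symbol in tape:
        state = table[(state, symbol)]   # total table: every lookup succeeds
    return state in accept

NOT_TABLE, NOT_STATES, NOT_ACCEPT = complement(SHEEP_TABLE, {0, 1, 2, 3, 4}, "ba!", {4})
print(accepts(NOT_TABLE, 0, NOT_ACCEPT, "baa!"))  # False
print(accepts(NOT_TABLE, 0, NOT_ACCEPT, "ba!"))   # True
```

The complement is taken relative to strings over the given alphabet; a symbol outside "ba!" raises `KeyError` in this sketch rather than being classified.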

Intersection: accept a string that is in both of two specified languages. An indirect construction… A ^ B = ~(~A or ~B)
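The indirect construction can be sketched with two building blocks: complement (flip accept states) and union. Union is implemented here with a product machine that runs both inputs in lockstep, which is one possible choice and not necessarily the one the slides use. Both machines are assumed to be deterministic with total tables over the shared alphabet "ab". The two example machines (even number of a's; ends in b) are hypothetical.

```python
from itertools import product

ALPHABET = "ab"

# Each machine is a tuple (table, states, start, accept) with a total table.
EVEN_A = ({(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}, {0, 1}, 0, {0})
ENDS_B = ({(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}, {0, 1}, 0, {1})

def complement(dfa):
    """~A: same machine with accepting and non-accepting states swapped."""
    table, states, start, accept = dfa
    return table, states, start, states - accept

def union(dfa1, dfa2):
    """A or B: product machine that accepts if either component accepts."""
    t1, s1, q1, a1 = dfa1
    t2, s2, q2, a2 = dfa2
    states = set(product(s1, s2))
    table = {((p, q), s): (t1[(p, s)], t2[(q, s)])
             for (p, q) in states for s in ALPHABET}
    accept = {(p, q) for (p, q) in states if p in a1 or q in a2}
    return table, states, (q1, q2), accept

def intersection(dfa1, dfa2):
    """A ^ B = ~(~A or ~B)"""
    return complement(union(complement(dfa1), complement(dfa2)))

def accepts(dfa, tape):
    table, _, state, accept = dfa
    for symbol in tape:
        state = table[(state, symbol)]
    return state in accept

BOTH = intersection(EVEN_A, ENDS_B)
print(accepts(BOTH, "aab"))  # True: two a's, ends in b
print(accepts(BOTH, "ab"))   # False: odd number of a's
```

In the product machine the only accepting pair left after the final complement is the one where both components accept, which is exactly what a direct intersection construction would mark as accepting.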