1 Regular Expressions and Automata September 10 2009 Lecture #2-2.

Presentation transcript:

1 Regular Expressions and Automata September Lecture #2-2

2 Finite State Automata Regular Expressions (REs) can be viewed as a way to describe machines called Finite State Automata (FSA, also known as automata, finite automata). FSAs and their close variants are a theoretical foundation of much of the field of NLP.

3 Finite State Automata FSAs recognize the regular languages represented by regular expressions –SheepTalk: /baa+!/ Directed graph with labeled nodes and arc transitions: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, plus a self-loop on q3 labeled a. Five states: q0 the start state, q4 the final state, 5 transitions

4 Formally FSA is a 5-tuple consisting of –Q: set of states {q0,q1,q2,q3,q4} –Σ: a finite alphabet of symbols {a,b,!} –q0: a start state –F: a set of accept/final states in Q {q4} –δ(q,i): a transition function mapping Q × Σ to Q
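As a minimal sketch, the 5-tuple for SheepTalk could be written down in Python roughly as follows. The `FSA` class and its field names are hypothetical, chosen only for illustration; the integers 0 to 4 stand for q0 to q4, and a missing key in `delta` stands for "no transition".

```python
from dataclasses import dataclass


@dataclass
class FSA:
    """Hypothetical container for the 5-tuple (Q, Sigma, q0, F, delta)."""
    states: frozenset    # Q
    alphabet: frozenset  # Sigma
    start: int           # q0
    finals: frozenset    # F, a subset of Q
    delta: dict          # (state, symbol) -> next state; missing key = no transition


# SheepTalk /baa+!/ with states 0..4 standing for q0..q4.
SHEEPTALK = FSA(
    states=frozenset({0, 1, 2, 3, 4}),
    alphabet=frozenset({"a", "b", "!"}),
    start=0,
    finals=frozenset({4}),
    delta={(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4},
)

print(len(SHEEPTALK.delta))  # 5 transitions
```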

5 State Transition Table for SheepTalk
        Input
State   b   a   !
0       1   Ø   Ø
1       Ø   2   Ø
2       Ø   3   Ø
3       Ø   3   4
4       Ø   Ø   Ø

6 Recognition Recognition (or acceptance) is the process of determining whether or not a given input should be accepted by a given machine. Or… it’s the process of determining if a string is in the language we’re defining with the machine. In terms of REs, it’s the process of determining whether or not a given input matches a particular regular expression. Traditionally, recognition is viewed as processing an input written on a tape consisting of cells containing elements from the alphabet.

7 FSA recognizes (accepts) strings of a regular language –baa! –baaa! –… Tape metaphor: a rejected input, shown as the tape b!aba with the machine starting in q0

8 Recognition Simply a process of: starting in the start state; examining the current input; consulting the table; going to a new state and updating the tape pointer; and repeating until you run out of tape. The input is accepted if the machine is then in a final state.

9 D-Recognize
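The recognition loop can be sketched in Python as below. This is an illustrative reconstruction and not the code from the original slide: the names `TABLE` and `d_recognize` are assumptions, and the table is stored as nested dictionaries with a missing entry standing in for Ø.

```python
# Transition table for SheepTalk /baa+!/: TABLE[state][symbol] -> next state.
# A missing entry plays the role of an empty cell (Ø) in the table.
TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
START = 0
FINALS = {4}


def d_recognize(tape, table=TABLE, start=START, finals=FINALS):
    """Return True if the deterministic machine accepts the whole tape."""
    state = start
    for symbol in tape:                 # advance the tape pointer one cell at a time
        nxt = table[state].get(symbol)  # consult the table
        if nxt is None:                 # empty cell: no legal transition, so reject
            return False
        state = nxt
    return state in finals              # out of tape: accept only in a final state


print(d_recognize("baaaa!"))  # True
print(d_recognize("ba!"))     # False
```

Only `TABLE`, `START` and `FINALS` describe SheepTalk; the function body does not change when a different deterministic machine is plugged in.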

10 Example: tracing the input baaaa! through the SheepTalk transition table. State sequence: q0 -b-> q1 -a-> q2 -a-> q3 -a-> q3 -a-> q3 -!-> q4 (accept)

11 Key Points Deterministic means that at each point in processing there is always one unique thing to do (no choices). D-recognize is a simple table-driven interpreter. The algorithm is universal for all regular languages given as a deterministic table. –To change the machine, you change the table.

12 Key Points Crudely therefore… matching strings with regular expressions (à la Perl) is a matter of –translating the expression into a machine (table) and –passing the table to an interpreter
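From the user's side, the same check looks like this with Python's standard `re` module. This is only an illustration of the matching step: CPython's `re` engine is a backtracking matcher and does not literally build the transition table described here.

```python
import re

SHEEP = re.compile(r"baa+!")  # the SheepTalk pattern /baa+!/

for s in ["baa!", "baaaa!", "ba!", "baa!x"]:
    # fullmatch requires the whole string to match, like recognizing a full tape
    print(s, bool(SHEEP.fullmatch(s)))
# baa! True
# baaaa! True
# ba! False
# baa!x False
```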

13 Recognition as Search You can view this algorithm as a degenerate kind of state-space search. States are pairings of tape positions and state numbers. Operators are compiled into the table. Goal state is a pairing with the end of tape position and a final accept state. It’s degenerate because?

14 Formal Languages Formal Languages are sets of strings composed of symbols from a finite set of symbols. Finite-state automata define formal languages (without having to enumerate all the strings in the language). Given a machine m (such as a particular FSA), L(m) means the formal language characterized by m. –L(Sheeptalk FSA) = {baa!, baaa!, baaaa!, …} (an infinite set)

15 Generative Formalisms The term Generative is based on the view that you can run the machine as a generator to get strings from the language. FSAs can be viewed from two perspectives: –Acceptors that can tell you if a string is in the language –Generators to produce all and only the strings in the language
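A small sketch of the generator view, assuming the same nested-dictionary table used for recognition (the names `TABLE` and `generate` are hypothetical). It walks the machine breadth-first and yields a string each time it reaches a final state, so shorter strings come out first. Because L(SheepTalk) is infinite, the caller has to cut the stream off.

```python
from collections import deque
from itertools import islice

TABLE = {0: {"b": 1}, 1: {"a": 2}, 2: {"a": 3}, 3: {"a": 3, "!": 4}, 4: {}}
START, FINALS = 0, {4}


def generate(table=TABLE, start=START, finals=FINALS):
    """Yield strings of the language, shortest first (breadth-first over paths)."""
    queue = deque([(start, "")])
    while queue:
        state, prefix = queue.popleft()
        if state in finals:
            yield prefix
        for symbol, nxt in sorted(table[state].items()):
            queue.append((nxt, prefix + symbol))


print(list(islice(generate(), 3)))  # ['baa!', 'baaa!', 'baaaa!']
```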

16 Three Views Three equivalent formal ways to look at what we’re up to (not including tables – and we’ll find more…) Regular Expressions Regular Languages Finite State Automata

17 Determinism Let’s take another look at what is going on with D-recognize. In particular, let’s look at what it means to be deterministic here and see if we can relax that notion. How would our recognition algorithm change? What would it mean for the accepted language?

18 Determinism and Non-Determinism Deterministic: There is at most one transition that can be taken given a current state and input symbol. Non-deterministic: There is a choice of several transitions that can be taken given a current state and input symbol. (The machine doesn’t specify how to make the choice.)

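19 Non-Deterministic FSAs for SheepTalk Two machines over states q0 to q4. First machine: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, with a self-loop on q2 labeled a, so q2 has two arcs for a. Second machine: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, with an ε-arc from q3 back to q2 (an arc taken without consuming input).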

20 FSAs as Grammars for Natural Language [Figure: FSA with states q0 to q6 and arcs labeled the, rev, mr, dr, hon, ms, mrs, pat, l., robinson and ε] Can you use a regular expression to capture this too?

21 Equivalence Non-deterministic machines can be converted to deterministic ones with a fairly simple construction (essentially building “set states” that are reached by following all possible states in parallel). That means that they have the same power; non-deterministic machines are not more powerful than deterministic ones. It also means that one way to do recognition with a non-deterministic machine is to turn it into a deterministic one. Problems: translating gives us a not very intuitive machine, and this machine can have LOTS of states (in the worst case one per subset of the original states, so up to 2^n for an n-state machine).
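The "set states" idea can be sketched as follows. This is a minimal version of the subset construction under the simplifying assumption that the machine has no ε-arcs; the names `NFA`, `determinize` and `run` are hypothetical, and the example machine is the non-deterministic SheepTalk variant in which q2 has two arcs for a.

```python
# NFA[state][symbol] -> set of possible next states.
# State 2 has two arcs on "a": stay in 2, or move on to 3.
NFA = {
    0: {"b": {1}},
    1: {"a": {2}},
    2: {"a": {2, 3}},
    3: {"!": {4}},
    4: {},
}
NFA_START, NFA_FINALS = 0, {4}


def determinize(nfa, start, finals):
    """Subset construction for an NFA without epsilon arcs (an assumption here)."""
    start_set = frozenset({start})
    dfa = {}                  # frozenset of NFA states -> {symbol: frozenset}
    todo = [start_set]
    while todo:
        current = todo.pop()
        if current in dfa:
            continue
        dfa[current] = {}
        symbols = {sym for q in current for sym in nfa[q]}
        for sym in symbols:
            # Follow every member state in parallel and pool the results.
            target = frozenset(t for q in current for t in nfa[q].get(sym, ()))
            dfa[current][sym] = target
            todo.append(target)
    # A set state accepts if it contains at least one accepting NFA state.
    dfa_finals = {s for s in dfa if s & finals}
    return dfa, start_set, dfa_finals


def run(dfa, start, finals, tape):
    """Deterministic recognition over the constructed set-state table."""
    state = start
    for sym in tape:
        if sym not in dfa[state]:
            return False
        state = dfa[state][sym]
    return state in finals


DFA, DFA_START, DFA_FINALS = determinize(NFA, NFA_START, NFA_FINALS)
print(len(DFA))                                   # 5 reachable set states
print(run(DFA, DFA_START, DFA_FINALS, "baaa!"))   # True
```

For this small machine the result is no bigger than the original; the blow-up only shows on machines where many different subsets of states are reachable.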

22 Non-Deterministic Recognition In a ND FSA, every string in the language defined by the machine directs at least one path through the machine that leads to an accept condition. But not all paths directed through the machine by an accepted string lead to an accept state; it is OK for some paths to lead to a reject condition. In a ND FSA no path directed through the machine by a string outside the language leads to an accept condition.

23 Non-Deterministic Recognition So success in a non-deterministic recognition occurs when a path is found through the machine that ends in an accept. However, being driven to a reject condition by an input does not imply it should be rejected. Failure occurs only when none of the possible paths lead to an accept state. This means that the problem of non-deterministic recognition can be thought of as a standard search problem.

24 The Problem of Choice Choice in non-deterministic models comes up again and again in NLP. Several Standard Solutions Backup (search, this chapter) –Save input/state of machine at choice points –If wrong choice, use this saved state to back up and try another choice Lookahead –Look ahead in the input to help make a choice Parallelism –Look at all choices in parallel

25 Backup After a wrong choice leads to a dead-end (either no input left in a non-accept state, or no legal transitions), return to a previous choice point to pursue another unexplored choice. Thus, at each choice point, the search process needs to remember the (unexplored) choices. Standard State Space Search. State = (FSA node or machine state, tape-position)

26 Example Input tape: b a a a ! State sequence along the accepting path: q0 q1 q2 q2 q3 q4

27 ND-Recognize Code
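A hedged sketch of agenda-based non-deterministic recognition follows. It is not the code from the original slide: the names `NFA`, `EPS` and `nd_recognize` are assumptions, ε-arcs are stored under the empty-string key, and the `seen` set is an addition that keeps the search from revisiting the same (state, tape position) pair.

```python
# NFA[state][symbol] -> set of possible next states.
NFA = {
    0: {"b": {1}},
    1: {"a": {2}},
    2: {"a": {2, 3}},   # the choice point: stay in 2 or move to 3
    3: {"!": {4}},
    4: {},
}
EPS = ""  # hypothetical label for an epsilon arc (consumes no input)


def nd_recognize(tape, nfa=NFA, start=0, finals=frozenset({4})):
    """Search over (machine state, tape position) pairs using an agenda."""
    agenda = [(start, 0)]
    seen = set()                   # search states already explored
    while agenda:
        state, pos = agenda.pop()  # pop from the end: depth-first, i.e. backup
        if (state, pos) in seen:
            continue
        seen.add((state, pos))
        if pos == len(tape) and state in finals:
            return True            # end of tape in an accept state
        arcs = nfa[state]
        for nxt in arcs.get(EPS, ()):          # epsilon moves keep the position
            agenda.append((nxt, pos))
        if pos < len(tape):
            for nxt in arcs.get(tape[pos], ()):
                agenda.append((nxt, pos + 1))  # ordinary moves advance the tape
    return False                   # agenda exhausted: every path failed


print(nd_recognize("baaa!"))  # True
print(nd_recognize("ba!"))    # False
```

Popping from the front of the agenda instead of the end would turn the same loop into a breadth-first search.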

28-41 Example: step-by-step trace of ND-Recognize, showing the agenda at each step

42 Key Points States in the search space are pairings of tape positions and states in the machine. By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input.

43 Infinite Search If you’re not careful such searches can go into an infinite loop. How?

44 Why Bother? Non-determinism doesn’t get us more formal power and it causes headaches, so why bother? –More natural solutions –Deterministic machines built by the conversion construction are too big

45 Compositional Machines Formal languages are just sets of strings Therefore, we can talk about various set operations (intersection, union, concatenation) This turns out to be a useful exercise

46 Union Accept a string in either of two languages

47 Concatenation Accept a string consisting of a string from language L1 followed by a string from language L2.

48 Negation Construct a machine M2 to accept all strings not accepted by machine M1 and reject all the strings accepted by M1 –Swap the accept and non-accept states of M1 Does that work for non-deterministic machines?
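A sketch of the state-swapping idea for a deterministic machine. One detail has to be made explicit first: the empty cells of the table must lead somewhere, so the sketch adds a dead "sink" state before swapping. The names `SINK`, `complement` and `accepts` are hypothetical, and the tape is assumed to use only symbols from the alphabet.

```python
TABLE = {0: {"b": 1}, 1: {"a": 2}, 2: {"a": 3}, 3: {"a": 3, "!": 4}, 4: {}}
ALPHABET = {"a", "b", "!"}
START, FINALS = 0, {4}
SINK = "sink"  # hypothetical name for the added dead state


def complement(table, alphabet, finals):
    """Complete the DFA with a sink state, then swap accept and non-accept states."""
    full = {q: dict(arcs) for q, arcs in table.items()}
    full[SINK] = {}
    for q in full:
        for sym in alphabet:
            full[q].setdefault(sym, SINK)  # every empty cell now leads to the sink
    return full, set(full) - set(finals)


def accepts(table, start, finals, tape):
    """Run a complete (total) deterministic table over the tape."""
    state = start
    for sym in tape:
        state = table[state][sym]          # total table: every lookup succeeds
    return state in finals


NOT_TABLE, NOT_FINALS = complement(TABLE, ALPHABET, FINALS)
print(accepts(NOT_TABLE, START, NOT_FINALS, "baa!"))  # False
print(accepts(NOT_TABLE, START, NOT_FINALS, "ba!"))   # True
```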

49 Intersection Accept a string that is in both of two specified languages An indirect construction… –A ∩ B = ¬(¬A ∪ ¬B)
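For comparison with the indirect construction, here is a sketch of a direct alternative, the product construction, which runs two deterministic machines in lockstep. This is a swapped-in technique and not the construction named on the slide. The second machine `EVEN_A` (strings with an even number of a's) and all function names are hypothetical.

```python
SHEEP = {0: {"b": 1}, 1: {"a": 2}, 2: {"a": 3}, 3: {"a": 3, "!": 4}, 4: {}}
# Hypothetical second language: strings over {a, b, !} with an even number of a's.
EVEN_A = {
    "even": {"a": "odd", "b": "even", "!": "even"},
    "odd": {"a": "even", "b": "odd", "!": "odd"},
}


def intersect(t1, s1, f1, t2, s2, f2):
    """Product construction: each state is a pair (state of M1, state of M2)."""
    start = (s1, s2)
    table, todo = {}, [start]
    while todo:
        pair = todo.pop()
        if pair in table:
            continue
        p, q = pair
        table[pair] = {}
        for sym in t1[p].keys() & t2[q].keys():  # both machines must be able to move
            nxt = (t1[p][sym], t2[q][sym])
            table[pair][sym] = nxt
            todo.append(nxt)
    # Accept only where both component machines accept.
    finals = {(p, q) for (p, q) in table if p in f1 and q in f2}
    return table, start, finals


def accepts(table, start, finals, tape):
    state = start
    for sym in tape:
        if sym not in table[state]:
            return False
        state = table[state][sym]
    return state in finals


BOTH, BOTH_START, BOTH_FINALS = intersect(SHEEP, 0, {4}, EVEN_A, "even", {"even"})
print(accepts(BOTH, BOTH_START, BOTH_FINALS, "baa!"))   # True  (two a's)
print(accepts(BOTH, BOTH_START, BOTH_FINALS, "baaa!"))  # False (three a's)
```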

50 Why Bother? FSAs can be useful tools for recognizing – and generating – subsets of natural language –But they cannot represent all NL phenomena (Center Embedding: The mouse the cat... chased died.)

51 Summing Up Regular expressions and FSAs can represent subsets of natural language as well as regular languages –Both representations may be impossible for humans to understand for any real subset of a language –But they are very easy to use for smaller subsets Next time: Read Ch 3 For fun: –Think of ways you might characterize features of your using only regular expressions