Chapter 2: Finite-State Machines Heshaam Faili University of Tehran.



2 Overview Regular Expressions FSAs Properties of Regular Languages

3 Regular Expressions
A regular expression (RE) is a formula in a specialized language, used to characterize a set of strings.
A string is a sequence of characters.
REs allow us to search for patterns.
A finite-state machine is a device for recognizing/generating the strings that a regular expression describes.
We’ll use a “Perlish” notation for writing regular expressions, based on regular expressions in the Perl programming language.
  The concepts are the important thing.
  NB: Perlish isn’t exactly the same as Perl.
We will write REs between slashes: /…/

Regular expression inventory (1)
Character literals and classes
  Characters: /abcd/
  Set: /p[aeiou]p/
  Range: /ab[a-z]d/
Operators (disjunction, negation)
  Disjunction
    Set elements: /[Aa]ardvark/
    Sequences of characters: /ant(eater|farm)/
  Negation
    Single item: /[^a]/ (any character but a)
    Range: /[^a-z]/ (any character that is not a lowercase letter)

Regular expression inventory (2)
Counters
  ?: optionality (0 or 1 occurrence): /colou?r/
  * (Kleene star): any number of occurrences, including zero: /[0-9]*/
  +: at least one occurrence: /[0-9]+/
  {n}: exactly n occurrences: /[0-9]{4}/
Wildcard
  .: matches any single character: /beg.n/

6 Regular expression inventory (3)
Parentheses: used to group items together
  /ant(farm)?/: all of farm is optional
Escaped characters: needed to match characters that have a special meaning: *, +, ?, (, ), |, [, ]
  Use a backslash: /why\?/
  Period expressed as: /\./

7 Regular expression inventory (4)
Anchors: tie an expression to a particular position in the string
  ^: start of line. Do not confuse this with [^…], where ^ as the first character inside brackets expresses negation; anywhere else it marks the start of a line.
  $: end of line
  \b: word boundary, i.e., the position between a word character and a non-word character
    word characters are digits, underscores, or letters, i.e., [0-9A-Za-z_]

8 Examples of Regular Expressions
/fire/: the character f followed immediately by i, then immediately by r, then immediately by e
/fires?/: matches fire or fires
/fires\?/: matches fires?
/[abcd]/: matches a, b, c, or d
/[0-9]/: matches any character in the range 0 to 9 (inclusive)
/[^0-9]/: matches any non-digit character, i.e., any character except those in the set 0 through 9
/[0-9]+/: matches 0, 1, 11, 12, 367, …
/[0-9]*/: matches 0, 1, 11, 12, 367, … and also matches the empty string
/fir./: matches fire, fir9, firm, firp, …
/fir.*/: matches fir, fire, fir987, firppery, …
/[fFHhs]ire/: matches fire, Fire, Hire, hire, sire
/f|Fire/: matches f or Fire
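The examples above can be checked mechanically. The following is a minimal sketch using Python's `re` module rather than the slides' "Perlish" notation; the helper name `matches` is an assumption introduced here, and `re.fullmatch` is used so that each pattern must cover the whole string instead of a substring.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# (pattern, candidate string, expected result), taken from the slide's examples.
examples = [
    (r"fires?", "fire", True),
    (r"fires?", "fires", True),
    (r"fires\?", "fires?", True),    # escaped ? is a literal question mark
    (r"fires\?", "fires", False),
    (r"[0-9]+", "367", True),
    (r"[0-9]+", "", False),          # + needs at least one digit
    (r"[0-9]*", "", True),           # * also accepts the empty string
    (r"fir.", "firm", True),
    (r"fir.*", "fir", True),
    (r"[fFHhs]ire", "Hire", True),
    (r"f|Fire", "f", True),
    (r"f|Fire", "fire", False),      # the alternatives are f and Fire, not fire
]

for pattern, text, expected in examples:
    assert matches(pattern, text) == expected, (pattern, text)
```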

9 Precedence
/fire|ings?/: the sequence fire or the sequence ing (the latter optionally followed by s)
Why? Because sequences have precedence over disjunction, and counters have precedence over sequences.
To override precedence, use parentheses:
/fir(e|ings)/: the sequence fir followed by either the sequence e or the sequence ings

10 Precedence Rules
1) Parentheses have the highest precedence.
2) Then come counters: *, +, ?, {}
3) Then come sequences and anchors
  so /good.*/ matches goodies, etc., and not (just) goodgood
  /echo{3}/: the sequence ech followed by ooo
  /(echo){3}/: the sequence echoechoecho
4) Then comes disjunction
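The precedence rules can be observed directly. This is a small sketch in Python's `re` syntax, which follows the same ordering for these operators; the helper name `full` is an assumption made for this illustration.

```python
import re

def full(pattern: str, text: str) -> bool:
    """True if `pattern` matches all of `text`."""
    return re.fullmatch(pattern, text) is not None

# Counters bind tighter than sequences: {3} applies only to the final o.
assert full(r"echo{3}", "echooo")
assert not full(r"echo{3}", "echoechoecho")

# Parentheses override that: now {3} applies to the whole group.
assert full(r"(echo){3}", "echoechoecho")

# Sequences bind tighter than disjunction: the alternatives are fire and ings?.
assert full(r"fire|ings?", "fire")
assert full(r"fire|ings?", "ings")
assert not full(r"fire|ings?", "firings")

# With parentheses, the disjunction is limited to the suffix after fir.
assert full(r"fir(e|ings)", "fire")
assert full(r"fir(e|ings)", "firings")
```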

11 Aliases
Use aliases to designate particular recurrent sets of characters
  \d = [0-9]: digit
  \D = [^\d]: non-digit
  \w = [a-zA-Z0-9_]: alphanumeric character or underscore
  \W = [^\w]: non-alphanumeric
  \s = [ \r\t\n\f]: whitespace character (space, \r: carriage return, \t: tab, \n: newline, \f: formfeed)
  \S = [^\s]: non-whitespace

12 Example 1 /\$[0-9]+(\.[0-9][0-9])?/
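Read piece by piece, this pattern describes a dollar sign, one or more digits, and an optional group of a period followed by exactly two digits. A brief sketch in Python's `re` syntax follows; the names `PRICE`, `is_price`, and `find_prices` are assumptions introduced for illustration.

```python
import re

# \$ is a literal dollar sign; (\.[0-9][0-9])? is an optional cents part.
PRICE = re.compile(r"\$[0-9]+(\.[0-9][0-9])?")

def is_price(text: str) -> bool:
    """True if the whole string is a dollar amount of this shape."""
    return PRICE.fullmatch(text) is not None

def find_prices(text: str) -> list[str]:
    """Return every substring of `text` that the pattern matches."""
    return [m.group(0) for m in PRICE.finditer(text)]

assert is_price("$5")
assert is_price("$12.99")
assert not is_price("$12.9")    # cents need exactly two digits
assert not is_price("12.99")    # the dollar sign is required
assert find_prices("It costs $12.99 or $5 today") == ["$12.99", "$5"]
```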

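13 Example 2
Times on a digital watch (hours and minutes): /([1-9]|1[012]):[0-5][0-9]/
The parentheses must enclose the whole hour alternation. Because disjunction has the lowest precedence, /[1-9]|(1[012]):[0-5][0-9]/ would instead mean "a single digit 1-9, or a time with hour 10, 11, or 12".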

14 Overgeneration
/\d\d:\d\d/ recognizes watch times, but also other sequences such as 99:99. In other words, the pattern overgenerates, covering expressions which aren’t in the target language.

15 Undergeneration
/1[012]:[0-5][0-9]/ undergenerates, i.e., does not cover all watch times: it misses every time from 1:00 to 9:59.
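Over- and undergeneration can be made concrete by comparing the set of strings each pattern accepts with the target set. The sketch below assumes the target is 12-hour times from 1:00 to 12:59 with no leading zero on the hour, and tests every string of the shape D:DD or DD:DD; the names `valid`, `candidates`, and `language` are hypothetical helpers, not part of the slides.

```python
import re

# Target language (an assumption): 1:00 through 12:59, no leading zero.
valid = {f"{h}:{m:02d}" for h in range(1, 13) for m in range(60)}

# Finite candidate pool: one- or two-digit "hours", two-digit "minutes".
hours = [str(d) for d in range(10)] + [f"{d:02d}" for d in range(100)]
candidates = {f"{h}:{m:02d}" for h in hours for m in range(100)}

def language(pattern: str) -> set[str]:
    """All candidates that the pattern matches in full."""
    return {s for s in candidates if re.fullmatch(pattern, s)}

over = language(r"\d\d:\d\d")
under = language(r"1[012]:[0-5][0-9]")
exact = language(r"([1-9]|1[012]):[0-5][0-9]")

assert "99:99" in over - valid      # overgeneration: accepted but not a time
assert under < valid                # undergeneration: a proper subset
assert "9:59" in valid - under      # a real time that is missed
assert exact == valid               # the parenthesized pattern fits the target
```

Under this assumed target, /\d\d:\d\d/ also misses single-digit hours such as 1:05, so a pattern can overgenerate and undergenerate at the same time.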

16 Representing sentences
‘Handling’ agreement: /the (student solves|students solve) the problem/
An optional adjective: /the clever? (student solves|students solve) the problem/
Generating an infinite number of sentences: /the clever? (student solves|students solve) the problem (and the clever? (student solves|students solve) the problem)*/
NOTE: here the symbols are words, not characters! Be sure to define the symbol type.

17 Overview Regular Expressions FSAs Properties of Regular Languages

18 A Simple Finite-State Automaton (FSA)
Example: FSA to recognize strings of the form /[ab]+/, i.e., L = {a, b, ab, ba, aab, bab, aba, bba, …}
Transition table: initial = 0; final = {1}
  0 -> a -> 1
  0 -> b -> 1
  1 -> a -> 1
  1 -> b -> 1

19 How an FSA accepts or rejects a string
The behavior of an FSA is completely determined by its transition table.
The assumption is that there is a tape, and the input symbols are read off consecutive cells of the tape.
The machine starts in the start (initial) state, about to read the contents of the first cell on the input ‘tape’.
The FSA uses the transition table to decide where to go at each step.
A string is rejected in exactly two cases:
  1. a transition on an input symbol takes you nowhere
  2. the state you’re in after processing the entire input is not an accept (final) state
Otherwise, the string is accepted.
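The procedure just described can be sketched as a table-driven recognizer. This is a minimal illustration, not a reference implementation: the transition table is stored as a Python dict keyed by (state, symbol), and the names `accepts` and `AB_PLUS` are assumptions introduced here. The table is the one for /[ab]+/ with initial state 0 and final state 1.

```python
def accepts(transitions, start, finals, symbols):
    """Run a deterministic FSA over `symbols` and report acceptance."""
    state = start
    for symbol in symbols:
        key = (state, symbol)
        if key not in transitions:
            return False          # case 1: the transition takes you nowhere
        state = transitions[key]
    return state in finals        # case 2: must end in an accept state

# Transition table for /[ab]+/: initial = 0, final = {1}.
AB_PLUS = {
    (0, "a"): 1,
    (0, "b"): 1,
    (1, "a"): 1,
    (1, "b"): 1,
}

assert accepts(AB_PLUS, 0, {1}, "abba")
assert not accepts(AB_PLUS, 0, {1}, "")      # ends in state 0, not final
assert not accepts(AB_PLUS, 0, {1}, "abc")   # no transition on c
```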

20 FSA formally
A finite-state automaton is defined by the following parameters:
  Q: finite set of N states: q0, q1, …, qN-1
  Σ: finite input alphabet
  q0: designated start state
  F: set of final states (subset of Q)
  δ(q, i): transition function, giving the next state for a state q ∈ Q and an input symbol i ∈ Σ

21 More Examples of FSA’s
Let’s design FSA’s to recognize:
  the set of zero or more a’s
  the set of all lowercase alphabetic strings ending in a b
  the set of all strings in [ab]* with exactly two a’s
  simple NPs, PPs, Ss, etc.

22 The set of zero or more a’s
L = {ε, a, aa, aaa, aaaa, …}
Transition table: initial = 0; final = {0}
  0 -> a -> 0

23 FSA for the set of all lowercase alphabetic strings ending in b
/[a-z]*b/
initial = 0; final = {1}
  0 -> [a, c-z] -> 0
  0 -> b -> 1
  1 -> b -> 1
  1 -> [a, c-z] -> 0
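As a sanity check, the table above can be written out and compared with the regular expression it is meant to implement. This sketch assumes Python; `transitions` and `ends_in_b` are names chosen for this illustration, and the comparison only covers strings of up to four characters drawn from the small sample alphabet {a, b, z}.

```python
import re
import string
from itertools import product

# Expand the [a, c-z] shorthand: on b go to state 1, on anything else go to 0.
transitions = {}
for ch in string.ascii_lowercase:
    target = 1 if ch == "b" else 0
    transitions[(0, ch)] = target
    transitions[(1, ch)] = target

def ends_in_b(text: str) -> bool:
    """Simulate the FSA: initial state 0, final state 1."""
    state = 0
    for ch in text:
        if (state, ch) not in transitions:
            return False              # symbol outside a-z
        state = transitions[(state, ch)]
    return state == 1

# The FSA and the regular expression agree on every sampled string.
for n in range(5):
    for chars in product("abz", repeat=n):
        s = "".join(chars)
        assert ends_in_b(s) == (re.fullmatch(r"[a-z]*b", s) is not None)
```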

24 The set of all strings in [ab]* with exactly 2 a’s
Do this yourself.
It might help to first write a more precise regular expression for this.

25 FSA for simple NPs, PPs, S, …
initial = 0; final = {2}
  0 -> D -> 1
  0 -> ε -> 1
  1 -> N -> 2
Another FSA for NPs: initial = 0; final = {2}
  0 -> N -> 2
  0 -> D -> 1
  1 -> N -> 2
  2 -> N -> 2
D is an alias for [the, a, an, all, …], N for [dog, cat, robin, …]
What if we wanted to add adjectives? Or recognize PPs?
What about one for simple sentences?
  /(Prep D? A* N+)* (D? N) (Prep D? A* N+)* (V_tns|Aux V_ing) (Prep D? A* N+)*/
Note: FSA1 concat FSA2 recognizes L(FSA1) concat L(FSA2)

26 Deterministic and Non-Deterministic FSA’s
An FSA is non-deterministic (NFSA) when, for some state and input, there is more than one state it can go to.
  This occurs when the transition table allows a transition to two or more states from one state on a given input symbol, e.g., 1 -> a -> 2, 1 -> a -> 4.
  Epsilon-transitions can be taken without consuming input. So, whenever a state has an epsilon-transition, the machine could either take it or consume an input symbol, which introduces non-determinism.
Any NFSA can be converted to an equivalent DFSA (deterministic FSA), at the expense of possibly more states.
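One common way to run an NFSA is to track the set of all states it could be in at once, which is also the idea behind converting it to a DFSA. The following is a rough sketch under stated assumptions: transitions map (state, symbol) to a set of states, `None` is used as the label for epsilon-transitions, and the names `EPS`, `eps_closure`, `nfa_accepts`, and `NP` are introduced here. The example machine is a small NP recognizer with an optional determiner (0 -> D -> 1, 0 -> ε -> 1, 1 -> N -> 2).

```python
EPS = None  # label chosen in this sketch for epsilon-transitions

def eps_closure(states, delta):
    """All states reachable from `states` using only epsilon-transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, EPS), ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return closure

def nfa_accepts(delta, start, finals, symbols):
    """Simulate an NFSA by following every possible path in parallel."""
    current = eps_closure({start}, delta)
    for symbol in symbols:
        moved = set()
        for q in current:
            moved |= set(delta.get((q, symbol), ()))
        current = eps_closure(moved, delta)
        if not current:
            return False          # every path is stuck
    return bool(current & finals)

# NP with an optional determiner: initial = 0, final = {2}.
NP = {(0, "D"): {1}, (0, EPS): {1}, (1, "N"): {2}}

assert nfa_accepts(NP, 0, {2}, ["D", "N"])
assert nfa_accepts(NP, 0, {2}, ["N"])         # uses the epsilon-transition
assert not nfa_accepts(NP, 0, {2}, ["D"])
assert not nfa_accepts(NP, 0, {2}, ["D", "D", "N"])
```

Each distinct set of states seen during such a run corresponds to one state of an equivalent deterministic machine, which is why the deterministic version may need more states.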

27 FAQ: Why Are These Machines Finite-State?
Finite number of states
  The number of states is bounded in advance, determined by the transition table.
  Therefore, the machine has a limit to the amount of memory it uses.
Its behavior at each stage is based on the transition table, and depends just on the state it’s in and the input.
  So, the current state reflects the history of the processing so far.
Certain classes of formal languages (and linguistic phenomena) which are not regular require additional memory to keep track of previous information (beyond current state and input), e.g., center-embedding constructions (discussed later).

28 Overview Regular Expressions FSAs Properties of Regular Languages

29 Formal Languages Revisited
We will view any formal language as a set of expressions.
The language will use a finite vocabulary Σ (called an alphabet), and a set of expression-combining operations.
Regular languages are the simplest class of formal languages.
Note: Kleene closure of a set
  Let L = {a, b}. Then L* = the set of a’s and b’s concatenated zero or more times = {ε, a, b, ab, aab, aaab, aaaab, ba, baa, …}.
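Since L* is infinite, it can only be listed up to some length bound. The sketch below enumerates the members of L* no longer than a given length; the function name `kleene_up_to` and the bound parameter are assumptions made for this illustration, and the empty Python string stands in for ε.

```python
def kleene_up_to(language, max_len):
    """Members of language* that have length <= max_len."""
    result = {""}              # zero concatenations gives the empty string
    frontier = {""}
    while frontier:
        # Append one more member of the language to each newest string.
        frontier = {
            x + y
            for x in frontier
            for y in language
            if len(x + y) <= max_len
        } - result
        result |= frontier
    return result

assert kleene_up_to({"a", "b"}, 2) == {"", "a", "b", "aa", "ab", "ba", "bb"}
assert len(kleene_up_to({"a", "b"}, 3)) == 15   # 1 + 2 + 4 + 8 strings
```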

30 Properties of Regular Languages
The class of regular languages over Σ is defined as follows:
1. ∅ (the empty set) is a regular language.
2. ∀a ∈ Σ ∪ {ε}, {a} is a regular language. (Σ = alphabet of symbols)
3. If L1 and L2 are regular languages, so are:
  a. L1 ∪ L2, the union (or disjunction) of L1 and L2
  b. L1.L2 = {xy | x ∈ L1, y ∈ L2}, the concatenation of L1 and L2
  c. L1*, the Kleene closure of L1 (the set formed by concatenating members of L1 zero or more times)
So, if L is a regular language, L can be built from these base sets using only the three operations of concatenation, disjunction, and Kleene closure.

31 General Closure Properties of Regular Languages
Concatenation, Union, Kleene Closure
Intersection: If L1 and L2 are regular languages, so is L1 ∩ L2.
Set Difference: If L1 and L2 are regular languages, so is L1 - L2.
Reversal: If L1 is a regular language, so is L1^R, the language formed by reversing all the strings in L1.
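Closure under intersection can be illustrated with the standard product construction, which runs two deterministic FSAs in lockstep and accepts only when both do. The slides do not give this construction; the sketch below is one way to do it, with each machine represented as a (transitions, start, finals) tuple and with `accepts`, `intersection`, `EVEN_A`, and `ENDS_B` as names invented for the example.

```python
from itertools import product

def accepts(dfa, text):
    """Run a deterministic FSA given as (transitions, start, finals)."""
    delta, state, finals = dfa
    for ch in text:
        if (state, ch) not in delta:
            return False
        state = delta[(state, ch)]
    return state in finals

def intersection(dfa1, dfa2):
    """Product machine: its states are pairs (state of dfa1, state of dfa2)."""
    d1, s1, f1 = dfa1
    d2, s2, f2 = dfa2
    delta = {}
    for (p, a), p_next in d1.items():
        for (q, b), q_next in d2.items():
            if a == b:            # both machines read the same symbol
                delta[((p, q), a)] = (p_next, q_next)
    finals = {(p, q) for p in f1 for q in f2}
    return delta, (s1, s2), finals

# Strings over {a, b} with an even number of a's.
EVEN_A = ({(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
# Strings over {a, b} ending in b.
ENDS_B = ({(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}, 0, {1})
BOTH = intersection(EVEN_A, ENDS_B)

# The product accepts exactly the strings that both machines accept.
for n in range(6):
    for chars in product("ab", repeat=n):
        s = "".join(chars)
        assert accepts(BOTH, s) == (accepts(EVEN_A, s) and accepts(ENDS_B, s))
```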

32 What sorts of expressions aren’t regular
In natural language, examples include center-embedding constructions.
  The cat loves Mozart.
  The cat the dog chased loves Mozart.
  The cat the dog the rat bit chased loves Mozart.
  The cat the dog the rat the elephant admired bit chased loves Mozart.
  (the noun)^n (transitive-verb)^(n-1) loves Mozart
These aren’t regular, though /A*B*loves Mozart/ is regular.
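The contrast between the two patterns can be shown with a short sketch. It assumes the sentence has already been reduced to tokens, with "N" standing for one "the noun" phrase and "V" for one transitive verb; the function names are hypothetical. The first recognizer needs counters whose values are unbounded, which is exactly the memory a finite-state machine lacks; the second is the regular pattern, which cannot enforce the matching counts.

```python
import re

def center_embedded(tokens):
    """Accept N^n V^(n-1) loves Mozart for n >= 1, using explicit counters."""
    i = 0
    nouns = 0
    while i < len(tokens) and tokens[i] == "N":
        nouns += 1
        i += 1
    verbs = 0
    while i < len(tokens) and tokens[i] == "V":
        verbs += 1
        i += 1
    return nouns >= 1 and verbs == nouns - 1 and tokens[i:] == ["loves", "Mozart"]

def regular_approximation(tokens):
    """The regular pattern N* V* loves Mozart over space-separated tokens."""
    return re.fullmatch(r"(N )*(V )*loves Mozart", " ".join(tokens)) is not None

good = ["N", "N", "N", "V", "V", "loves", "Mozart"]   # 3 nouns, 2 verbs
bad = ["N", "V", "V", "V", "loves", "Mozart"]         # counts do not match

assert center_embedded(good) and not center_embedded(bad)
# The regular pattern accepts both, so it overgenerates.
assert regular_approximation(good) and regular_approximation(bad)
```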

33 Regular Expressions and FSAs
Regular expressions are equivalent to FSA’s: both describe exactly the regular languages.
So, an FSA for any regular language can be built up from simpler FSA’s using just concatenation, union, and Kleene *.
Question: how would you (graphically) combine FSA’s using:
  Concatenation
  Union
  Kleene *

34 Exercises 2.1,2.4,2.8,2.10,2.11