@#? Text Search g ~ A R B n f u j u q e ! 4 k ] { u "!"

Slides:



Advertisements
Similar presentations
CSC 361NFA vs. DFA1. CSC 361NFA vs. DFA2 NFAs vs. DFAs NFAs can be constructed from DFAs using transitions: Called NFA- Suppose M 1 accepts L 1, M 2 accepts.
Advertisements

Lecture 24 MAS 714 Hartmut Klauck
Longest Common Subsequence
Properties of Regular Languages
Nondeterministic Finite Automata CS 130: Theory of Computation HMU textbook, Chapter 2 (Sec 2.3 & 2.5)
Deterministic Finite Automata (DFA)
1 Languages. 2 A language is a set of strings String: A sequence of letters Examples: “cat”, “dog”, “house”, … Defined over an alphabet: Languages.
1 Introduction to Computability Theory Lecture3: Regular Expressions Prof. Amos Israeli.
Finite Automata Great Theoretical Ideas In Computer Science Anupam Gupta Danny Sleator CS Fall 2010 Lecture 20Oct 28, 2010Carnegie Mellon University.
1 Introduction to Computability Theory Lecture4: Regular Expressions Prof. Amos Israeli.
1 Introduction to Computability Theory Lecture3: Regular Expressions Prof. Amos Israeli.
Lecture 3UofH - COSC Dr. Verma 1 COSC 3340: Introduction to Theory of Computation University of Houston Dr. Verma Lecture 3.
COMMONWEALTH OF AUSTRALIA Copyright Regulations 1969 WARNING This material has been reproduced and communicated to you by or on behalf of Monash University.
1 Languages and Finite Automata or how to talk to machines...
CS5371 Theory of Computation Lecture 1: Mathematics Review I (Basic Terminology)
Regular Languages A language is regular over  if it can be built from ;, {  }, and { a } for every a 2 , using operators union ( [ ), concatenation.
Finite-State Machines with No Output
Regular Expressions. Notation to specify a language –Declarative –Sort of like a programming language. Fundamental in some languages like perl and applications.
Lecture 23: Finite State Machines with no Outputs Acceptors & Recognizers.
Theory of Computation, Feodor F. Dragan, Kent State University 1 Regular expressions: definition An algebraic equivalent to finite automata. We can build.
Introduction to CS Theory Lecture 3 – Regular Languages Piotr Faliszewski
COMP3190: Principle of Programming Languages DFA and its equivalent, scanner.
Regular Expressions CIS 361. Need finite descriptions of infinite sets of strings. Discover and specify “regularity”. The set of languages over a finite.
Regular Expressions and Languages A regular expression is a notation to represent languages, i.e. a set of strings, where the set is either finite or contains.
CHAPTER 1 Regular Languages
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
Ravello, Settembre 2003Indexing Structures for Approximate String Matching Alessandra Gabriele Filippo Mignosi Antonio Restivo Marinella Sciortino.
CS 203: Introduction to Formal Languages and Automata
Regular Languages ภาษาปกติ. Jaruloj Chongstitvatana Outline Regular expressions Regular languages Equivalence between languages accepted by.
UNIT - I Formal Language and Regular Expressions: Languages Definition regular expressions Regular sets identity rules. Finite Automata: DFA NFA NFA with.
Great Theoretical Ideas in Computer Science for Some.
Overview of Previous Lesson(s) Over View  A token is a pair consisting of a token name and an optional attribute value.  A pattern is a description.
Finite Automata Great Theoretical Ideas In Computer Science Victor Adamchik Danny Sleator CS Spring 2010 Lecture 20Mar 30, 2010Carnegie Mellon.
CSCI 4325 / 6339 Theory of Computation Zhixiang Chen.
Conversions Regular Expression to FA FA to Regular Expression.
Finite Automata A simple model of computation. 2 Finite Automata2 Outline Deterministic finite automata (DFA) –How a DFA works.
Set, Alphabets, Strings, and Languages. The regular languages. Clouser properties of regular sets. Finite State Automata. Types of Finite State Automata.
1 Closure E.g., we understand number systems partly by understanding closure properties: Naturals are closed under +, , but not -, . Integers are closed.
WELCOME TO A JOURNEY TO CS419 Dr. Hussien Sharaf Dr. Mohammad Nassef Department of Computer Science, Faculty of Computers and Information, Cairo University.
CS314 – Section 5 Recitation 2
Text Search ~ k A R B n f u j ! k e
Languages, grammars, automata
Finite automate.
Languages.
PROPERTIES OF REGULAR LANGUAGES
Copyright © Cengage Learning. All rights reserved.
Regular Expressions.
@#? Text Search g ~ A R B n f u j u q e ! 4 k ] { u "!"
Chapter 2 Finite Automata
Theory of Computation Lecture # 9-10.
CSE 3813 Introduction to Formal Languages and Automata
Deterministic Finite Automata
Jaya Krishna, M.Tech, Assistant Professor
Chapter 2 FINITE AUTOMATA.
Jaya Krishna, M.Tech, Assistant Professor
Deterministic Finite Automata
REGULAR LANGUAGES AND REGULAR GRAMMARS
Decision Properties of Regular Languages
Some slides by Elsa L Gunter, NJIT, and by Costas Busch
Principles of Computing – UFCFA3-30-1
COSC 3340: Introduction to Theory of Computation
Introduction to Finite Automata
CS21 Decidability and Tractability
Closure Properties of Regular Languages
Deterministic Finite Automaton (DFA)
Compiler Construction
CS21 Decidability and Tractability
Chapter 1 Introduction to the Theory of Computation
Finite-State Machines with No Output
Text Search ~ k A R B n f u j ! k e
Presentation transcript:

@#? Text Search g ~ A R B n f u j u q e ! 4 k N @# ] { u "!" Marko Berezovský Radek Mařík PAL 2012 g ~ Automata Examples A R Operations on Regular Languages B n f Automata Reperesenting Operations on Regular Languages u Hamming and Levenshtein Distance j u Approximate Text Search Automaton Bit Arrays Simulation q e ! 4 k N @# ] { u wtf? "!"

Finite Automata Automaton C1 accepts union of sets Easy Examples 1 Automaton C1 accepts union of sets L1 = {00, 0011, 001100, 00110011, 0011001100, ...}  {11, 1100, 110011, 11001100, 1100110011, ...}. 1 3 2 4 C1 Automaton C2 accepts language L2 over  = {0, 1}, in each word of L2 : -- there is at least one symbol 1, -- each symbol 1 is followed by exactly two or three symbols 0. 1 2 3 4 C2

Finite Automata Easy Examples 2 Automaton C3 accepts all binary nonnegative integers divisible by 3, any number of leading zeros may be included. C3 1 1 1 2 1 Automaton C4 accepts all binary positive integers divisible by 3, no leading zeros are allowed. 2 1 3 C4

Language operations Let L1 and L2 be any languages. Then Overview 3 Operations on regular languages revisited Let L1 and L2 be any languages. Then L1  L2 is union of L1 and L2. It is a set of all words which are in L1 or L2. L1  L2 is intersection of L1 and L2. It is a set of all words which are simultaneously in L1 and L2. L1.L2 is concatenation of L1 and L2. It is a set of all words w for which holds w = w1w2 (concatenation of words w1 and w2), where w1 L1 and w2  L2. L1* is Kleene star or Kleene closure or iteration of language L1. It is a set of all words which are concatenations of any number (incl. zero) of any words of L1 in any order. Closure Whenever L1 and L2 are regular languages then L1  L2, L1  L2, L1.L2 , L1* are regular languages too. Automata support When L1 is regular language accepted by automaton A1 and L2 is regular language accepted by automaton A2 then there also are automata A3, A4, A5, A6, which accept L1  L2, L1  L2, L1.L2 , L1*, respectively.

Language operations Union 4 Automaton A3 accepting union of two regular languages L1 , L2 accpted by automata A1, A2 respectively. Automaton A3 is constructed using A1 and A2: Do not change A1 and A2. Create new aditional start state S0, add  - transitions from S0 to start states S1 and S2 of A1 and A2 respectively. Define set of final states of A3 as union of final states of A1 and A2. Scheme .  A3  S0 A2 S1 A1 S2

Union automaton Automaton B3 accepts any word from sets Example 5 C1 C3 1 1 1 1 1 2 3 4 1 2 1 1 Automaton B3 accepts any word from sets {00, 0011, 001100, 00110011, 0011001100, ...} {11, 1100, 110011, 11001100, 1100110011, ...} and also any binary nonnegative integer divisible by 3 with any number of leading zeros  B3 1 1  1 1 1 2 3 4 1 2 1 1

Language operations Concatenation 6 Automaton A5 accepting concatenation of two regular languages L1 , L2 accepted by automata A1, A2 respectively. Automaton A5 is constructed using A1 and A2: Do not change A1 and A2. Add  - transitions from each final state Fk of A1 to the start state S2 of A2. Define start state of A5 to be equal to the start state of A1. Define set of final states of A5 to be equal to the set of final states of A2. Scheme . A5  F1 A1 F2 S2 A2 S1 

Concatenation automaton Example 7 2 1 3 C4 1 2 3 4 C2 Automaton B5 accepts any word over {0, 1} which can be split into two consecutive words w1 and w2, where word w1 is described by regular expression 0*(100+1000)(100+1000)* , word w2 represents binary positive integer divisible by 3 w/o leading 0's. 1 1  B5 1 1 1 1 1 2 3 4 1 2 3  1

Language operations Automaton A6 accepting iteration of language L1 8 Automaton A6 accepting iteration of language L1 accepted by automaton A1. Automaton A6 is constructed using A1: Do not change A1. Create new aditional start state S0 and add  - transition from S0 to start state S1 of A1 Add  - transitions from all final states Fk of A1 to state S1. Define start state of A6 to be S0. Define set of final states of A6 as union of final states Fk and S0. Scheme .  F1 A6  A1 S0 S1 F2 

Iteration automaton Example 9 1 2 3 4 C2 Automaton B6 accepts any word created by concatenation and repetition of any words accepted by C2 including empty word. B6 1 1  1 s0 1 2 3 4   Maybe you can find some more telling informal description of the corresponding language?

Language operations Intersection 10 Automaton A4 accepting intersection of two regular languages L1 , L2 accepted by automata A1, A2 respectively. Automaton A4 is constructed using A1 and A2: Create Cartesian product Q1  Q2 , where Q1, Q2 are sets of states of A1, A2. Each state of A4 will be an ordered pair of states of A1, A2. State (S1, S2) will be start state of A4, where S1,S2 are start states of A1, A2. Final states of A4 will be just those pairs (F, G), where F is a final state of A1 and G is a final state of A2. Create transition from state (p1, p2) to (q1, q2) in A4 labeled by symbol x if and only if there is a transition p1  q1 labeled by x in A1 and also there is a transition p2  q2 labeled by x in A2.

Intersection automaton Language operations Intersection automaton 11 Scheme of an automaton A4 accepting the intersection of two regular languages L1, L2 accepted by automata A1, A2 respectively. S1 x y A4 S1 x y w S1 w x w w y w w a a z S1 z x z z y z z S2 S1 S2 x S2 S2 y S2 S2 A1 a S1 x y A2

intersection automaton Language operations intersection automaton 12 Automaton A4 accepting binary integers divisible by 3 (C4) in which each symbol 1 is followed by exactly two or three symbols 0 (C2). 1 1 1 1 3 03 13 23 33 43 A4 2 02 12 22 32 42 1 1 1 1 1 1 1 1 1 01 11 21 31 41 1 1 1 1 00 10 20 30 40 A2 A1 1 1 1 1 2 3 4

Hamming distance Hamming distance Definition 13 Hamming distance Hamming distance of two strings is equal to k (k  0), whenever k is the minimal number of rewrite operations which when applied on one of the strings produce the other string. Rewrite operation rewrites one symbol of the alphabet by some other symbol of the alphabet. Symbols cannot be deleted or inserted. Hamming distance is defined only for pairs of strings of equal length. Informally: Align the strings and count the number of mismatches of corresponding symbols. Learn some Czech l o k o m o t i v a v y k o l e j i l a distance = 6 m a l é _ p i v o v e l k ý _ v ů z distance = 8 Pokročilá Algoritmizace, A4M33PAL, ZS 2009/2010, FEL ČVUT, 7/12

Hamming distance search automaton 14 Automaton A1 for aproximate pattern matching. It detects all occurences of substrings which Hamming distance form the pattern p1p2p3p4 is less or equal to 3.  p1 1 p2 2 p3 3 p4 4 5 6 7 8 9 10 12 11 13 A1

Hamming distance search automaton 15 Automaton A2 for aproximate pattern matching. It detects all occurences of substrings which Hamming distance form the pattern 'rose' is less or equal to 3. Automaton A2 detects among others also the words: rose (distance = 0) dose (distance = 1) rest (distance = 2) list (distance = 3) and more...  r 1 o 2 s 3 e 4 5 6 7 8 9 10 12 11 13          A2

Power of indeterminsm Example Examples 16 Example NFA accepting any word with subsequence p1p2p3p4 anywhere in it.      p1 p2 p3 p4 1 2 3 4 Example NFA accepting any word with subsequence p1p2p3p4 anywhere in it, one symbol in the sequence may be altered.     p1 p2 p3 p4 1 2 3 4         p2 p3 p4 5 6 7 8 Alternatively: NFA accepting any word containing a subsequence Q which Hamming distance from p1p2p3p4 is at most 1.

Hamming distance r o s e o s e s e r o s e r o s e o s e o s e s e Clever labeling 17 Hamming distance of the found pattern Q from pattern P = "rose" cannot be deduced from the particular end state. E.g.: "rope": r - 1 - o - 2 - p - 7 - e - 8. r - 5 - o - 6 - p - 10 - e - 11.  r o s e 1 2 3 4     o s e 5 6 7 8    s e 9 10 11 Improvement  Notation: x =  ─ {x} means: Complement of x in . r o s e 1 2 3 4 r o s e Hamming distance from the pattern P = "rose" to the found pattern Q corresponds exactly to the end state. o s e 5 6 7 8 o s e s e 9 10 11

Levenshtein distance Levenshtein distance Definition 18 Levenshtein distance Levenshtein distance of two strings A and B is such minimal k (k ≥ 0 ), that we can change A to o B or B to A by applying exactly k edit operations on one of A or B. The edit operations are Remove, Insert or Rewrite any symbol of the alphabet anywhere in the string. (Rewrite is also called Substitution.) Levenshtein distance is defined for any two strings over a given alphabet. B R U X E L L E S Delete X. B E T E L G E U S E Rewrite R->E, U->T, L->G. Insert U, E. Distance = 6 Note Although the distance is defined unambiguously (prove!), the particular edit operations transforming one string to another may vary (find an example).

Levenshtein distance Calculating Levenshtein distance Calculation 19 Calculating Levenshtein distance Apply a simple Dynamic Programming approach. Let A = a[1].a[2]. ... .a[n] = A[1..n], B = b[1].b[2]. ... .b[m] = b[1..m], n, m ≥ 0. Dist(A, B) = |m ─ n| if n = 0 or m = 0 Dist(A, B) = 1+ min ( Dist(A[1..n ─ 1], B[1..m]), if n > 0 and m > 0 Dist(A[1..n], B[1..m ─ 1]), and A[n] ≠ B[m] Dist(A[1..n ─ 1], B[1..m ─ 1]) ) Dist(A, B) = Dist(A[1..n ─ 1], B[1..m ─ 1]) if n > 0 and m > 0 and A[n] = B[m] Calculation corresponds to ... Operation 1+ Dist(A[1..n ─1], B[1..m]), ... Insert(A, n ─1, B[m]) or Delete(B, m) 1+ Dist(A[1..n], B[1..m ─1]), ... Insert(B, m ─1, A[n]) or Delete(A, n) 1+ Dist(A[1..n ─1], B[1..m ─1]) ... Rewrite(A, n, B[m]) or Rewrite(B, m, A[n])

Levenshtein distance B E T E L G E U S E 0 1 2 3 4 5 6 7 8 9 10 Example 20 Dist("BETELGEUSE","BRUXELLES") = 6 B E T E L G E U S E 0 1 2 3 4 5 6 7 8 9 10 B 1 0 1 2 3 4 5 6 7 8 9 R 2 1 1 2 3 4 5 6 7 8 9 U 3 2 2 2 3 4 5 6 6 7 8 X 4 3 3 3 3 4 5 6 7 7 8 E 5 4 3 4 3 4 5 5 6 7 7 L 6 5 4 4 4 3 4 5 6 7 8 L 7 6 5 5 5 4 4 5 6 7 8 E 8 7 6 6 5 5 5 4 5 6 7 S 9 8 7 7 6 6 6 5 5 5 6 D[0][j] = j; D[i][0] = i; // i = 0..n, j = 0..m // supposing A.length = n+1, B.length = m+1 for( i = 1; i <= n; i++ ) for( j = 1; j <= m; j++ ) if( A[i] == B[j] ) D[i][j] = D[i-1][j-1]; else D[i][j] = 1+ min(D[i-1][j-1], D[i-1][j], D[i][j-1]);

Levenshtein distance r o s e r, o, s, e, o s e o s e o, s, e, s Search automaton 21 NFA searches in a text for a string within Levenshtein distance 3 from the pattern "rose". Note the transitions.  r o s e 1 2 3 4 r, o, s, e,  o s e More transitions than in Hamming distance NFA vertical ... Insert operation epsilon ... Delete operation o s e 5 6 7 8 o, s, e,  s e s e 9 10 11 s, e,  Self-check question e Label vertical transitions by  (whole alphabet). How will it change the functionality of this NFA? e 12 13

Text Search Bit representation of NFA Bits 22 Bit representation of NFA Size of transition table T is |Q|  | | and each its element T[i,k] corresponds to state qi  Q and symbol ak  . T[i,k] is vector of length |Q| and it holds: T[i,k][j] == 1  qj  (qi, ak). For bit vector F of final states holds F[j] == 1  qj  FA Example T a b z 1  a 1 b 2 3 Bit representation of automaton A. i=0 i=1 i=2 i=3 a b z 0,1 1 2 3 F A F 1 z    {a, b} Automaton A detects pattern aba in a text.

Text Search A 23 starting configuration time text symbols: Bit Representation Text Search 23 starting configuration time text symbols: 1 2 3 - a c b 1 {0} {0,1} {0,2} {0,1,3} sets of states represented by bit arrays during computation A sets of states T a b z 1 i=0 i=1 i=2 i=3 example Automaton is in states {0,1}, it reads symbol b 1 1 1 OR = F 1

while( (i <= t.length) && (S[i-1]!=[000...0]) ) { Bit Representation Simulation 24 Simulation of work of a NFA without transitions Basic method, implemented with bit vectors. Input: Bit table T of transitions, bit vector F of final states, number of states Q.size, text in array t (indexed from 1). Output: Simulated run and output of the automaton. (notation in format [0101...00] denotes characteristic vector of set of states) S[0] = [100..0]; i = 1; // init while( (i <= t.length) && (S[i-1]!=[000...0]) ) { for( j=0; j < Q.size; j++ ) if( (S[i][j] == 1) && (F[j] == 1) ) print( q[j].final_state_info ); S[i] = [000...0]; if( S[i-1][j]==1 ) S[i] = S[i] | T[j][t[i]]; // "|" i++; }