Dynamic Text and Static Pattern Matching Amihood Amir Gad M. Landau Moshe Lewenstein Dina Sokol Bar-Ilan University.

Presentation transcript:

Dynamic Text and Static Pattern Matching
Amihood Amir, Gad M. Landau, Moshe Lewenstein, Dina Sokol
Bar-Ilan University

Classical Pattern Matching
Input:
- Pattern P = p_1 p_2 … p_m
- Text T = t_1 t_2 t_3 … t_n over alphabet Σ.
m is the PATTERN size. n is the TEXT size.
Output: locations of T where P appears.

Pattern Matching (e.g.)
Input: P = agca, Σ = {a,g,c,t}
T = aaagcattagctagcagcat

Pattern Matching (e.g.)
Input: P = agca, Σ = {a,g,c,t}
T = aaagcattagctagcagcat
Output: …, 13, 16, …
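The classical problem can be tried out on the slide's input with a short script. This is a minimal sketch of KMP-style matching, not code from the presentation; the function name `kmp_occurrences` and the 1-based reporting convention are assumptions made for illustration, and the pattern is assumed to be non-empty.

```python
def kmp_occurrences(pattern, text):
    """Return the 1-based start positions of pattern in text (overlaps allowed)."""
    m = len(pattern)
    # fail[i] = length of the longest proper border of pattern[:i + 1]
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == m:
            hits.append(i - m + 2)  # 0-based end index -> 1-based start
            k = fail[k - 1]
    return hits


print(kmp_occurrences("agca", "aaagcattagctagcagcat"))
```

On this input the script reports positions 3, 13 and 16, which agrees with the 13 and 16 visible on the slide.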

“Dynamic” Pattern Matching
• A. Static Text and Dynamic Pattern.
• B. Dynamic Text and Dynamic Pattern.
• C. Dynamic Text and Static Pattern.

“Dynamic” Pattern Matching
• A. Static Text and Dynamic Pattern, a.k.a. the indexing problem.
Solution: preprocess the text and answer pattern queries.
Preprocessing data structure: suffix trees [Wei73, McC75, Ukk95, Far97].
Time: O(n) preprocessing, O(m) query time.

“Dynamic” Pattern Matching
• A. Static Text and Dynamic Pattern.
Time: O(n) preprocessing, O(m) query time.
• B. Dynamic Text and Dynamic Pattern, a.k.a. the dynamic indexing problem.
Solution: sophisticated data structures [SV96, ABR00].
Time: query O(m + log^2 n), change O(log^2 n).

“Dynamic” Pattern Matching
• A. Static Text and Dynamic Pattern.
Time: O(n) preprocessing, O(m) query time.
• B. Dynamic Text and Dynamic Pattern.
Time: query O(m + log^2 n), change O(log^2 n).
• C. Dynamic Text and Static Pattern?

Dynamic Text and Static Pattern Matching
• The pattern does not change.
• The text changes over time.
• Goal: report new occurrences of the pattern without performing a new search.

Motivation
1. Intrusion detection systems
2. Info alerts
3. Two-dimensional run-length compressed matching problem [ALS03]
[Figure: FAX example with runs a 14, a 4, b 2, c 3, d 5, c 8, a 6]

Problem Definition
• Input: T and P over Σ = {1, …, m}.
• Output:
1. At start: all occurrences of P in T.
2. After a change operation:
a. report all new occurrences of P in T.
b. discard all old occurrences of P in T that are no longer occurrences.
Change operation: change one character in the text, e.g. location 5 from a to b.

Example
• Input: P = agagagc = (ag)^3 c, Σ = {a,g,c,t}
T = g a g a g c t a g c g a g c a t

Example
• Input: P = agagagc = (ag)^3 c, Σ = {a,g,c,t}
T = g a g a g c t a g a g a g c a t (position 10 changed from c to a)

Example
• Input: P = agagagc = (ag)^3 c, Σ = {a,g,c,t}
T = g a g a g c t a g a g a g c a t (position 10 changed from c to a; a new occurrence starts at position 8)
• Output: {8}
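The input/output contract of the example can be checked with a naive baseline. The sketch below is not the algorithm of the talk: it simply re-tests every window that covers the changed position, which costs O(m^2) character comparisons per replacement rather than O(log log m). The function name `replace_and_report` and its argument layout are hypothetical.

```python
def replace_and_report(text, pattern, matches, pos, ch):
    """Apply one replacement at 1-based position pos and update matches in place.

    text    -- list of characters (mutated)
    matches -- set of 1-based start positions of current occurrences (mutated)
    Returns the pair (discarded, new) of position sets.
    """
    m, n = len(pattern), len(text)
    text[pos - 1] = ch
    # Only occurrences whose window covers pos can appear or disappear,
    # so the candidate starts are pos - m + 1 .. pos, clipped to the text.
    lo = max(1, pos - m + 1)
    hi = min(pos, n - m + 1)
    discarded, new = set(), set()
    for s in range(lo, hi + 1):
        is_match = text[s - 1:s - 1 + m] == list(pattern)
        if s in matches and not is_match:
            discarded.add(s)
        elif s not in matches and is_match:
            new.add(s)
    matches -= discarded
    matches |= new
    return discarded, new


text = list("gagagctagcgagcat")
matches = set()  # the pattern does not occur in the initial text
print(replace_and_report(text, "agagagc", matches, 10, "a"))
```

Changing position 10 from c to a yields the new occurrence at position 8, and changing it back discards that occurrence again.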

Results
After O(n log log m + …) preprocessing time, each replacement takes O(log log m) time.

“Dynamic” Pattern Matching
• A. Static Text and Dynamic Pattern.
Time: O(n) preprocessing, O(m) query time.
• B. Dynamic Text and Dynamic Pattern.
Time: query O(m + log^2 n), change O(log^2 n).
• C. Dynamic Text and Static Pattern.
Time: change and announce O(log log m).

Static Stage
• To initially find all occurrences of P in T, use KMP [Knuth-Morris-Pratt ’77].
• All pattern occurrences in a text of length 2m can be stored in O(1) space.

Succinct Output
Assumption: the text is of size 2m. (Break the text T into overlapping strings of length 2m-1.)
[Figure: text T marked at positions 1, m, 2m, 3m, 4m, with the pattern P]

Succinct Output (cont.)
• P is periodic: a string P is periodic if it matches itself before position |P|/2, e.g. P = abcabcabca matches itself when shifted by 3:
abcabcabca
   abcabcabca
Store the output as a ‘chain’ of pattern occurrences.
• P is non-periodic: by definition, no more than two occurrences.
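One way to picture the ‘chain’ is as an arithmetic progression of start positions whose step is the period of P. The sketch below is an illustration under that assumption; the triple layout (first, step, count) and the helper names are hypothetical and not taken from the slides. The occurrences are found by brute force, since only the storage format is of interest here.

```python
def smallest_period(p):
    """Smallest period of p, computed from the KMP failure function."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return len(p) - fail[-1]


def occurrences(p, t):
    """1-based start positions of p in t, found by brute force."""
    return [s + 1 for s in range(len(t) - len(p) + 1) if t.startswith(p, s)]


def as_chains(positions, step):
    """Compress sorted positions into (first, step, count) chains."""
    chains = []
    for s in positions:
        if chains and chains[-1][0] + step * chains[-1][2] == s:
            first, _, count = chains[-1]
            chains[-1] = (first, step, count + 1)  # extend the current chain
        else:
            chains.append((s, step, 1))            # start a new chain
    return chains


p = "abcabcabca"            # the periodic pattern from the slide, m = 10
t = "abc" * 6 + "a"         # a text of length 2m - 1 = 19
print(smallest_period(p), occurrences(p, t), as_chains(occurrences(p, t), smallest_period(p)))
```

Here the four occurrences 1, 4, 7, 10 collapse into the single triple (1, 3, 4), which is constant space regardless of how many occurrences the chain holds.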

On-line Algorithm
Following each replacement:
• Delete old matches that are no longer pattern occurrences.
• Find new matches.

Delete Old Matches
Deleting is trivial since we store the matches in constant space:
• P is periodic: truncate the chain of pattern occurrences.
• P is non-periodic: discard all matches that are within distance m of the replacement.

Find New Matches
• Challenge: how can we locate occurrences of P, following each replacement, without actually searching for P?

Main Idea - Text Covers
We ‘cover’ the text with substrings of the pattern, i.e. store the text in terms of P.
Pattern = a g a g a g c
Text = g a g a g c t a g c g a g c a t
Cover: g a g a g c [2,7], a g c [5,7], g a g c [4,7], a [1,1]

Text Cover (cont.)
The text cover must satisfy two properties:
• Substring Property: each element of the cover is a substring of P, or a character not included in P.
• Maximality Property: no two adjacent elements can concatenate to form a substring of P.
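A cover with both properties can be built greedily by always taking the longest piece that is still a substring of P; a longer merge of two adjacent pieces would contradict that choice, so maximality follows. The sketch below is a quadratic-time illustration, not the construction used in the talk, and the name `text_cover` and the use of `None` for characters outside P are assumptions.

```python
def text_cover(text, pattern):
    """Greedy text cover.

    Each piece is (i, j), meaning pattern[i..j] with 1-based inclusive
    indices, or None for a single character that does not occur in pattern.
    """
    cover = []
    pos = 0
    while pos < len(text):
        length = 0
        # Extend the piece while it is still a substring of the pattern.
        while pos + length < len(text) and text[pos:pos + length + 1] in pattern:
            length += 1
        if length == 0:
            cover.append(None)          # character not included in P
            pos += 1
        else:
            i = pattern.find(text[pos:pos + length])
            cover.append((i + 1, i + length))
            pos += length
    return cover


print(text_cover("gagagctagcgagcat", "agagagc"))
```

For the example text this produces (2,7), a non-pattern character, (5,7), (4,7), (1,1), and a final non-pattern character for the trailing t.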

Text Cover (cont.)
How does a replacement in the text affect the text cover?
Initially, in the static stage, we construct a text cover for T. We ensure that the cover satisfies both the substring and maximality properties.

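Text Cover following replacement
Pattern = a g a g a g c
Text = g a g a g c t a g c g a g c a t
Pieces: g a g a g c, a g c, g a g c, a
Cover before the replacement: (2,7) - (5,7) (4,7) (1,1)
After replacing a character of the piece (5,7) with a: (2,7) - (5,6) (1,1) (4,7) (1,1)
Concatenating adjacent pieces: (5,6) (1,1) becomes (1,3), then (1,3) (4,7) becomes (1,7)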

Updating the Text Cover At most 5 pieces can violate the maximality property.

Substring Concatenation Query
• Query: given two substrings of P, P[i,j] and P[k,l], is their concatenation also a substring of P?
• Query time: O(log log m).
• Preprocessing time: (also uses [BG00])
Hence, in O(log log m) time we can update the cover so that it satisfies both properties.
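The role of the concatenation query in repairing the cover can be sketched with a naive stand-in. The version below answers the query with a plain substring search instead of the O(log log m) structure from the talk; `concat_is_substring` and `restore_maximality` are hypothetical names, and pieces use the same (i, j) convention as the slides, with `None` for a character outside P.

```python
def concat_is_substring(pattern, a, b):
    """Naive concatenation query for pieces a = (i, j) and b = (k, l).

    Returns the 1-based range of pattern[i..j] + pattern[k..l] inside
    pattern, or None if the concatenation is not a substring.
    """
    (i, j), (k, l) = a, b
    joined = pattern[i - 1:j] + pattern[k - 1:l]
    at = pattern.find(joined)
    return None if at < 0 else (at + 1, at + len(joined))


def restore_maximality(pattern, cover):
    """Merge adjacent pieces left to right until no pair can be joined."""
    out = []
    for piece in cover:
        merged = None
        if out and out[-1] is not None and piece is not None:
            merged = concat_is_substring(pattern, out[-1], piece)
        if merged:
            out[-1] = merged    # the two pieces form one substring of P
        else:
            out.append(piece)
    return out


# Cover of the example text right after the piece (5,7) has been split.
split_cover = [(2, 7), None, (5, 6), (1, 1), (4, 7), (1, 1)]
print(restore_maximality("agagagc", split_cover))
```

On the example, (5,6) and (1,1) merge into (1,3), which then merges with (4,7) into (1,7), a full pattern occurrence. A single left-to-right pass suffices because a piece that could not be joined to its left neighbour still cannot be joined after it grows to the right.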

Find New Matches
• Given: a text cover which satisfies both the substring and maximality properties.
• Find: all new locations of the pattern in the text.

Key Observations
• A new match must begin within distance m of the change.
• A new match can include at most one entire piece of the cover.
• It can span at most three pieces of the cover.

Furthermore
A new match can begin in one of at most three pieces of the cover:
– the piece with the change
– the previous piece
– the one previous to that

Simplified Problem
• The search starts within a piece X of the cover.
• Simple O(m) time algorithm:
– Check each location in X for a pattern start.
– Use suffix trees and LCA queries to compare substrings in constant time.

Improved Algorithm
• Really, we only have to check each suffix of X that is a pattern prefix, e.g. X = a g a g a.
• The KMP automaton can give the necessary information.
However, the time is still O(m)!
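The information the KMP automaton provides can be illustrated as follows: run the automaton of P over X, then follow failure links from the final state to list every suffix of X that is also a prefix of P. This is a minimal sketch that assumes a non-empty pattern; the function names are hypothetical.

```python
def failure(p):
    """KMP failure function: fail[i] = longest proper border of p[:i + 1]."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def suffix_prefix_lengths(pattern, x):
    """Lengths of all suffixes of x that are prefixes of pattern, longest first."""
    fail = failure(pattern)
    k = 0
    for ch in x:  # run the KMP automaton of pattern over x
        while k and (k == len(pattern) or ch != pattern[k]):
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
    lengths = []
    while k:      # follow failure links from the final state
        lengths.append(k)
        k = fail[k - 1]
    return lengths


print(suffix_prefix_lengths("agagagc", "agaga"))
```

For X = a g a g a the candidates have lengths 5, 3 and 1, i.e. the suffixes agaga, aga and a. In the worst case the list has O(m) entries, which is why this step alone does not beat O(m).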

Improved Algorithm
• We can group the prefixes of P by their periods.
• Each group of prefixes can be checked in constant time!
• There are at most O(log m) groups.

Groups (e.g.)
Pattern = a g a g a g c
X = a g a g a
There are three suffixes of X that are also pattern prefixes: { agaga, aga } { a }
Prefixes with the same period fall into a single group.
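The grouping in this example can be reproduced by computing the smallest period of each candidate prefix, which is its length minus the length of its longest border. The sketch below assumes the candidate lengths are already known; `border_groups` is a hypothetical name and the dictionary layout is chosen only for illustration.

```python
from collections import defaultdict


def failure(p):
    """KMP failure function: fail[i] = longest proper border of p[:i + 1]."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def border_groups(pattern, lengths):
    """Group prefix lengths by the smallest period of the corresponding prefix."""
    fail = failure(pattern)
    groups = defaultdict(list)
    for k in lengths:
        period = k - fail[k - 1]   # smallest period of pattern[:k]
        groups[period].append(k)
    return dict(groups)


# Lengths of the suffixes of X = agaga that are prefixes of P = agagagc.
print(border_groups("agagagc", [5, 3, 1]))
```

The prefixes agaga and aga share period 2 (the block ag), while a has period 1, giving the two groups shown on the slide.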

Checking a group in Constant Time
Pattern = a g a g a g c
X = a g a g a
Idea: match the period ‘ag’ as far as possible. As soon as (ag)* doesn’t match, check for a ‘c’.
[Figure: X followed by further text, aligned against the pattern a g a g a g c]

Groups
• A string cannot have more than O(log m) border groups.
• Hence, the time of the algorithm is O(log m).
[Intuition: each new group has a new period which has to be at least double the size of the old period, e.g. aagaagaa]

Even Better...
• We check only a constant number of groups.
• Choosing these O(1) groups takes O(log log m) time.
• Hence, our algorithm takes O(log log m) time per replacement.

Open Problems
• Allowing insertions and deletions to the text.
• Searching for a set of multiple static patterns.