Improved string matching with k mismatches (The Kangaroo Method). Z. Galil, R. Giancarlo, SIGACT News, Vol. 17, No. 4, 1986, pp. 52–54. Original: Moshe Lewenstein.

Presentation transcript:

Improved string matching with k mismatches (The Kangaroo Method)
Z. Galil, R. Giancarlo, SIGACT News, Vol. 17, No. 4, 1986, pp. 52–54
Original: Moshe Lewenstein
Modified by: Hsing-Yen Ann
Date: Nov. 26, 2004

Exact String Matching
Input: T = t_1 … t_n, P = p_1 … p_m
Output: All locations i of T where P appears
Example:
P = A B C A A B
T = A B A B C A A B C A A B C A A B A A …
Answer: {3, 7, 11, …}
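The example above can be checked with a few lines of Python. This is only a minimal sketch: the function name find_all is made up for illustration, it relies on Python's built-in str.find, and it reports 1-indexed locations to match the slide.

```python
def find_all(text, pattern):
    """Return all 1-indexed locations where pattern occurs in text (overlaps allowed)."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start + 1)               # convert 0-indexed to 1-indexed
        start = text.find(pattern, start + 1)     # continue one character further on
    return positions


print(find_all("ABABCAABCAABCAABAA", "ABCAAB"))   # finite prefix of the slide's T
```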

Approximate String Matching
Idea: Find all text locations where the distance from the pattern is sufficiently small.
Distance metric: Hamming distance
Let S = s_1 s_2 … s_m and R = r_1 r_2 … r_m
Ham(S, R) = the number of locations j where s_j ≠ r_j
Example: S = ABCABC, R = ABBAAC, Ham(S, R) = 2
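The definition of Ham(S, R) translates almost directly into code. A minimal sketch follows; the function name hamming is an illustrative choice, not something taken from the slides.

```python
def hamming(s, r):
    """Number of positions j where s[j] != r[j]; both strings must have equal length."""
    if len(s) != len(r):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for a, b in zip(s, r) if a != b)


print(hamming("ABCABC", "ABBAAC"))   # the slide's example strings
```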

String Matching with Mismatches
Input: T = t_1 … t_n, P = p_1 … p_m
Output: For each i in T, Ham(P, t_i t_{i+1} … t_{i+m-1})
Example:
P = A B B A A C
T = A B C A A B C A C …
Ham(P, T_1) = 2, Ham(P, T_2) = 4, Ham(P, T_3) = 6, Ham(P, T_4) = 2
(T_i denotes the length-m substring t_i … t_{i+m-1}.)
Output: 2, 4, 6, 2, …

String Matching with k Mismatches
Input: T = t_1 … t_n, P = p_1 … p_m
Output: Every i in T s.t. Ham(P, t_i t_{i+1} … t_{i+m-1}) ≤ k
Example: k = 2
P = A B B A A C
T = A B C A A B C A C …
Distances: 2, 4, 6, 2, …
Within k: Y, N, N, Y, …

Naïve Algorithm (for counting mismatches or the k-mismatches problem)
- Go to each location i of the text and compute the Hamming distance of P and T_i
Running time: O(nm), where n = |T|, m = |P|
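The naïve algorithm can be sketched as follows. The function names mismatch_counts and k_mismatch_naive are assumptions made for this illustration; locations are reported 1-indexed to match the slides, and only the finite prefix of T shown in the example is used.

```python
def mismatch_counts(text, pattern):
    """Hamming distance of pattern against every length-m window of text: O(nm) time."""
    n, m = len(text), len(pattern)
    return [
        sum(1 for j in range(m) if text[i + j] != pattern[j])
        for i in range(n - m + 1)
    ]


def k_mismatch_naive(text, pattern, k):
    """1-indexed locations whose window has at most k mismatches."""
    return [i + 1 for i, d in enumerate(mismatch_counts(text, pattern)) if d <= k]


print(mismatch_counts("ABCAABCAC", "ABBAAC"))
print(k_mismatch_naive("ABCAABCAC", "ABBAAC", 2))
```

Every window is compared character by character, which is where the O(nm) bound comes from.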

The Kangaroo Method (for k-mismatches) Landau – Vishkin 1986 Galil – Giancarlo 1986

Trie
A tree representing a set of strings.
Example set: { aeef, ad, bbfe, bbfg, c }
(Figure: trie of the example set, one character per edge.)

Trie (Cont)
Assume no string is a prefix of another.
Each string corresponds to a leaf.
(Figure: the same trie.)

Compressed Trie
Compress unary nodes, label edges by strings.
(Figure: the trie before compression, and the compressed trie with edge labels a, bbf, c, eef, d, e, g.)
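The trie and its compressed form can be sketched with nested dictionaries. This is an illustrative representation chosen for brevity (a node is a dict from edge label to child, a leaf is an empty dict); the names build_trie and compress are assumptions, and the sketch relies on the slide's assumption that no string is a prefix of another.

```python
def build_trie(words):
    """Plain trie: every edge is labeled by one character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Merge chains of unary (single-child) nodes into one edge labeled by a string."""
    out = {}
    for label, child in node.items():
        while len(child) == 1:                   # unary node: absorb its only edge
            (ch, grandchild), = child.items()
            label += ch
            child = grandchild
        out[label] = compress(child)
    return out


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(compress(trie))
```

On the example set the compressed edges come out as a, bbf, c at the root, then eef and d below a, and e and g below bbf, matching the labels listed on the slide.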

Suffix tree
Suffix tree of string s: a compressed trie of all suffixes of s
Prefix-free: add a special character, say $, at the end of s

Suffix tree (Example)
Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s = abab$:
{ $, b$, ab$, bab$, abab$ }
(Figure: suffix tree of abab$.)

Suffix Tree properties
- Succinct in space: O(n).
- Can be built in O(n) time (McCreight, Weiner, Ukkonen, Farach-Colton).
(Figure: suffix tree of abab$ with leaves numbered 1 to 5.)

Exact string matching
s = abab$
Given a pattern P = ab, we traverse the tree according to the pattern.
Leaves below the node reached correspond to locations of appearance: here 1 and 3.
Prepare tree: O(n) time
Find matches: O(m + occ) time, where occ = number of matches
(Figure: suffix tree of abab$ with leaves numbered 1 to 5.)
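The traversal can be illustrated with an uncompressed suffix trie. Note the substitution: this sketch builds the trie in O(n^2) time and space and is not one of the linear-time suffix tree constructions named in the slides. The names build_suffix_trie and occurrences, and the "leaf" key used to store a suffix's 1-indexed start, are assumptions of the sketch.

```python
def build_suffix_trie(s):
    """Uncompressed suffix trie of s (s should end with a unique character such as $)."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1     # multi-character key, cannot clash with an edge
    return root


def occurrences(root, pattern):
    """Walk down along pattern, then collect the leaf labels below the node reached."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []                # pattern does not occur
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, value in current.items():
            if key == "leaf":
                found.append(value)
            else:
                stack.append(value)
    return sorted(found)


print(occurrences(build_suffix_trie("abab$"), "ab"))
```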

Lowest common ancestors
A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it.

Why?
The LCA of two leaves represents the longest common prefix (LCP) of the two corresponding suffixes.
Example: s = abbaab$, suffixes aab$ and abbaab$; their LCP is a.
(Figure: suffix tree of abbaab$ with leaves numbered 1 to 7.)

LCA/LCP properties
Preprocessing time: O(n)
Query time: O(1)
Harel & Tarjan 1984, Schieber & Vishkin 1988, Berkman & Vishkin 1993
(Figure: suffix tree of abbaab$ with leaves numbered 1 to 7.)
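Constant-time LCP queries can also be sketched without an explicit suffix tree. The code below swaps in a different, commonly used technique: a suffix array, the Kasai LCP array, and a sparse table for range minima. Queries take O(1), but the preprocessing here is not O(n) (the suffix array is built by plain sorting). The name build_lcp_oracle is an assumption, and suffix positions are 0-indexed, unlike the leaf labels on the slides.

```python
def build_lcp_oracle(s):
    """Return query(i, j) giving the LCP length of suffixes s[i:] and s[j:]."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])    # naive suffix array construction
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r

    # Kasai: lcp[r] = LCP length of the suffixes at sorted positions r-1 and r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0

    # Sparse table: table[t][x] = min(lcp[x : x + 2**t])
    table = [lcp]
    span = 1
    while 2 * span <= n:
        prev = table[-1]
        table.append([min(prev[x], prev[x + span]) for x in range(n - 2 * span + 1)])
        span *= 2

    def query(i, j):
        if i == j:
            return n - i
        lo, hi = sorted((rank[i], rank[j]))
        lo += 1                                   # minimum over lcp[lo .. hi]
        level = (hi - lo + 1).bit_length() - 1
        return min(table[level][lo], table[level][hi - (1 << level) + 1])

    return query


query = build_lcp_oracle("abbaab$")
print(query(3, 0))   # suffixes aab$ and abbaab$
```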

The Kangaroo Method (for k-mismatches)
- Create suffix tree for: s = P#T$
- Check P at each location i of T by kangarooing
Example:
P = A B A B A A B A C A B
T = A B B A C A B A B A B C A B B C A B C A …
At location i:
- Finding LCP(s, P_0, T_i): length = 4, so kangarooing distance = 4 + 1 = 5
- Finding LCP(s, P_5, T_{i+5}): length = 2, so kangarooing distance = 2 + 1 = 3
- Finding LCP(s, P_8, T_{i+8}): length = 3, which reaches the end of P
Next iteration: i = i + 1
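The jumping loop can be sketched as below. To keep the example self-contained, the LCP query is answered by direct character comparison (lce_naive), which is only a stand-in for the O(1) suffix tree plus LCA query, so this sketch does not achieve the O(k) per-location bound by itself. The function names, the 0-indexed locations, and the assumption that # and $ occur in neither string are all choices made for this illustration.

```python
def lce_naive(s, a, b):
    """LCP length of suffixes s[a:] and s[b:] by direct comparison.

    Stand-in for the O(1) suffix-tree + LCA query (an assumption of this sketch).
    """
    h = 0
    while a + h < len(s) and b + h < len(s) and s[a + h] == s[b + h]:
        h += 1
    return h


def k_mismatch_kangaroo(pattern, text, k, lce=lce_naive):
    """Map each 0-indexed location i with at most k mismatches to its mismatch offsets."""
    m, n = len(pattern), len(text)
    s = pattern + "#" + text + "$"   # '#' and '$' are assumed absent from both strings
    text_start = m + 1               # index in s where the text begins
    result = {}
    for i in range(n - m + 1):
        mismatches = []
        j = 0                        # current offset into the pattern
        while j < m:
            j += lce(s, j, text_start + i + j)   # jump over the matching run
            if j >= m:
                break                # reached the end of the pattern
            mismatches.append(j)     # landed on a mismatch
            if len(mismatches) > k:
                break                # too many mismatches, give up on this i
            j += 1                   # step past the mismatch
        if len(mismatches) <= k:
            result[i] = mismatches
    return result


P = "ABABAABACAB"
T = "ABBACABABABCABBCABCA"
print(k_mismatch_kangaroo(P, T, 2))
```

On the finite prefix of T shown on the slide, the jumps of lengths 4, 2 and 3 traced above appear to correspond to the 0-indexed location i = 7, where the mismatches fall at pattern offsets 4 and 7. Passing a constant-time oracle as lce in place of lce_naive would give the per-location cost discussed in the slides.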

The Kangaroo Method (for k-mismatches)
Preprocess:
- Build suffix tree of both P and T: O(n+m) time
- LCA preprocessing: O(n+m) time
Check P at a given text location:
- Kangaroo jump from mismatch to mismatch, at most k+1 jumps of O(1) each: O(k) time
Overall time for this basic kangaroo approach: O(nk)
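The overall bound follows by adding the preprocessing cost to the per-location cost. A short derivation, assuming k ≥ 1, m ≤ n, and O(1) time per LCP query:

```latex
\underbrace{O(n+m)}_{\text{suffix tree + LCA preprocessing}}
\;+\; \underbrace{(n-m+1)}_{\text{text locations}} \cdot \underbrace{(k+1)\cdot O(1)}_{\text{jumps per location}}
\;=\; O(n + m + nk) \;=\; O(nk)
```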

2004/11/22 Hsing-Yen Ann
Faster Algorithms for Four Different Cases
- Large alphabet: at least 2k different symbols in pattern P. O(n)
- Small alphabet: at most … different symbols in pattern P.
- General alphabets, many frequent symbols: at least … frequent symbols
- General alphabets, few frequent symbols: fewer than … frequent symbols