String Matching with k Mismatches


String Matching with k Mismatches
Moshe Lewenstein, Bar Ilan University
Modified by Ariel Rosenfeld

String Matching with k Mismatches
- Landau – Vishkin 1986
- Galil – Giancarlo 1986
- Abrahamson 1987
- Amir – Lewenstein – Porat 2000
- Clifford – Fontaine – Porat – Sach – Starikovskaya 2016

Exact String Matching
Input: T = t1 … tn, P = p1 … pm
Output: All locations i of T where P appears
Example:
P = A B C A A B
T = A B A B C A A B C A A B C A A B A A…
Answer: {3, 7, 11, …}

Exact String Matching
Problem: Matching is not exact in applications of:
- Computational Biology
- Musicology
- Text Editing
- Meteorology
- etc.
Need other definitions of string matching!

Approximate String Matching
Idea: Find all text locations where the distance from the pattern is sufficiently small.
Distance metric: HAMMING DISTANCE
Let S = s1 s2 … sm and R = r1 r2 … rm.
Ham(S, R) = the number of locations j where sj ≠ rj
Example: S = ABCABC, R = ABBAAC, Ham(S, R) = 2
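The Hamming distance definition can be written down directly. This is a minimal sketch; the function name `hamming` is an assumption made for illustration and is not taken from the slides.

```python
def hamming(s: str, r: str) -> int:
    """Count the positions j where s[j] != r[j] (strings must have equal length)."""
    if len(s) != len(r):
        raise ValueError("Hamming distance is defined for equal-length strings")
    return sum(a != b for a, b in zip(s, r))


# The example from the slide: S = ABCABC, R = ABBAAC differ at positions 3 and 5.
print(hamming("ABCABC", "ABBAAC"))  # 2
```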

String Matching with Mismatches
Input: T = t1 … tn, P = p1 … pm
Output: For each location i of T, Ham(P, Ti), where Ti = ti ti+1 … ti+m-1
Example:
P = A B B A A C
T = A B C A A B C A C…
Ham(P, T1) = 2, Ham(P, T2) = 4, Ham(P, T3) = 6, Ham(P, T4) = 2
Answer: 2, 4, 6, 2, …

String Matching with k Mismatches
Input: T = t1 … tn, P = p1 … pm
Output: Every location i of T s.t. Ham(P, ti ti+1 … ti+m-1) ≤ k
Example: k = 2
P = A B B A A C
T = A B C A A B C A C…
Hamming distances: 2, 4, 6, 2, …
Answer: Y, N, N, Y, …

Naïve Algorithm (for the counting-mismatches or k-mismatches problem)
- Go to each location i of the text and compute the Hamming distance of P and Ti.
Running time: O(nm), where n = |T|, m = |P|
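A minimal sketch of the naïve algorithm, run on the example from the earlier slides. The function names `mismatch_counts` and `k_mismatch_naive` are hypothetical, and locations are reported 1-based to agree with the slides.

```python
from typing import List


def mismatch_counts(text: str, pattern: str) -> List[int]:
    """Hamming distance between the pattern and every length-m window of the text."""
    n, m = len(text), len(pattern)
    counts = []
    for i in range(n - m + 1):  # every alignment of P against T
        counts.append(sum(text[i + j] != pattern[j] for j in range(m)))  # O(m) each
    return counts


def k_mismatch_naive(text: str, pattern: str, k: int) -> List[int]:
    """1-based text locations where the pattern matches with at most k mismatches."""
    return [i + 1 for i, d in enumerate(mismatch_counts(text, pattern)) if d <= k]


print(mismatch_counts("ABCAABCAC", "ABBAAC"))      # [2, 4, 6, 2]
print(k_mismatch_naive("ABCAABCAC", "ABBAAC", 2))  # [1, 4]
```

The two nested loops over n - m + 1 alignments and m pattern positions give the O(nm) bound stated above.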

The Kangaroo Method (for k-mismatches)
- Landau – Vishkin 1986
- Galil – Giancarlo 1986

Trie
A tree representing a set of strings.
Example set: { aeef, ad, bbfe, bbfg, c }
[Figure: trie of the example set]

Trie (cont.)
Assume no string is a prefix of another. Then each string corresponds to a leaf.
[Figure: the same trie, one leaf per string]

Compressed Trie
Compress unary nodes; label edges by strings.
[Figure: the trie and its compressed form, with edges a, eef, d, bbf, e, g, c]
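The trie and its compressed form can be sketched with nested dictionaries. This is an illustration only: the names `build_trie` and `compress` are assumptions, and the code relies on the slide's assumption that no string in the set is a prefix of another, so no end-of-string marker is needed.

```python
from typing import Dict, Iterable


def build_trie(words: Iterable[str]) -> Dict:
    """Plain trie: one dictionary per node, one single-character key per edge."""
    root: Dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})  # follow the edge, creating it if missing
    return root


def compress(node: Dict) -> Dict:
    """Merge every chain of unary nodes into one edge labelled by a string."""
    out: Dict = {}
    for label, child in node.items():
        while len(child) == 1:  # unary node: absorb its only outgoing edge
            (next_label, grandchild), = child.items()
            label += next_label
            child = grandchild
        out[label] = compress(child)
    return out


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(compress(trie))
# {'a': {'eef': {}, 'd': {}}, 'bbf': {'e': {}, 'g': {}}, 'c': {}}
```

Empty dictionaries are the leaves; the example set has five strings and the compressed trie has five leaves.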

Suffix tree
Suffix tree of string s: a compressed trie of all suffixes of s.
Prefix-free: add a special character, say $, at the end of s.

Suffix tree (Example)
Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s$ = abab$:
{ $, b$, ab$, bab$, abab$ }
[Figure: suffix tree of abab$]

Suffix Tree properties
- Succinct in space: O(n).
- Can be built in O(n) time.
McCreight, Weiner, Ukkonen, Farach-Colton
[Figure: suffix tree of abab$ with leaves numbered 1 to 5]

Exact string matching
s = abab$
Given a pattern P = ab we traverse the tree according to the pattern.
Leaves correspond to locations of appearance! Here: 1 and 3.
Prepare tree: O(n) time
Find matches: O(m + occ) time, where occ = number of matches
[Figure: suffix tree of abab$ with the path ab highlighted and leaves 1 and 3 below it]
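The traversal can be sketched on an uncompressed suffix trie. This is a stand-in for the suffix tree, not a linear-time construction: building it this way takes O(n^2) time and space. The names `build_suffix_trie`, `occurrences` and the `"leaf"` key are hypothetical.

```python
from typing import Dict, List


def build_suffix_trie(s: str) -> Dict:
    """Uncompressed trie of all suffixes of s + '$'; each leaf stores its 1-based start."""
    s += "$"  # the terminator makes the suffix set prefix-free
    root: Dict = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # multi-character key, cannot clash with an edge label
    return root


def occurrences(root: Dict, pattern: str) -> List[int]:
    """Walk down along the pattern, then collect every leaf below the node reached."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # the pattern does not occur
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


print(occurrences(build_suffix_trie("abab"), "ab"))  # [1, 3]
```

In a true suffix tree the subtree below the reached node has O(occ) nodes, which is where the O(m + occ) query time comes from; the uncompressed trie used here does not give that guarantee.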

Lowest common ancestors
A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it.

Why?
The LCA of two leaves represents the longest common prefix (LCP) of the two corresponding suffixes.
Example: s = abbaab$, suffixes aab$ and abbaab$; their LCP is a.
[Figure: suffix tree of abbaab$ with leaves numbered 1 to 7]
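The value that an LCA query returns (the string depth of the lowest common ancestor) can be checked against a direct character-by-character comparison. The sketch below is that direct comparison only, not the constant-time structure; the name `lcp_of_suffixes` is an assumption, and suffix positions are 1-based as in the figure.

```python
def lcp_of_suffixes(s: str, i: int, j: int) -> int:
    """Length of the longest common prefix of the suffixes of s starting at i and j."""
    a, b = i - 1, j - 1  # convert to 0-based indices
    length = 0
    while a + length < len(s) and b + length < len(s) and s[a + length] == s[b + length]:
        length += 1
    return length


s = "abbaab$"
print(lcp_of_suffixes(s, 4, 1))  # 1: aab$ and abbaab$ share only "a"
print(lcp_of_suffixes(s, 1, 5))  # 2: abbaab$ and ab$ share "ab"
```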

LCA/LCP properties
- Preprocessing time: O(n)
- Query time: O(1)
Harel & Tarjan 1984, Schieber & Vishkin 1988, Berkman & Vishkin 1993

The Kangaroo Method (for k-mismatches)
Create suffix tree for: s = P#T
Check P at each location i of T by kangarooing: each LCP query jumps over a run of matching characters to the next mismatch, so O(k) LCP queries suffice for every text location.
Example:
P = A B A B A A B A C A B
T = A B B A C A B A B A B C A B B C A B C A …

The Kangaroo Method (for k-mismatches)
Preprocess:
- Build suffix tree of both P and T: O(n+m) time
- LCA preprocessing: O(n+m) time
Check P at a given text location:
- Kangaroo jump till the next mismatch, at most k+1 jumps of O(1) each: O(k) time
Overall time: O(nk)
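The jumping logic can be sketched as follows. One loud assumption: the LCP oracle here is a plain character comparison, a stand-in for the O(1) suffix-tree/LCA query, so this sketch shows the control flow but does not achieve O(nk). The names `make_lcp` and `k_mismatch_kangaroo` are hypothetical, and the separator `#` is assumed not to occur in P or T.

```python
from typing import Callable, List


def make_lcp(s: str) -> Callable[[int, int], int]:
    """Stand-in LCP oracle on s (0-based). A suffix tree with LCA answers this in O(1)."""
    def lcp(a: int, b: int) -> int:
        length = 0
        while a + length < len(s) and b + length < len(s) and s[a + length] == s[b + length]:
            length += 1
        return length
    return lcp


def k_mismatch_kangaroo(text: str, pattern: str, k: int, sep: str = "#") -> List[int]:
    """1-based text locations where the pattern matches with at most k mismatches."""
    s = pattern + sep + text  # s = P#T, as on the slide
    n, m = len(text), len(pattern)
    lcp = make_lcp(s)
    offset = m + 1  # index in s of the first text character
    hits = []
    for i in range(n - m + 1):
        j = 0  # number of pattern characters already accounted for
        mismatches = 0
        while j < m:
            j += lcp(j, offset + i + j)  # kangaroo jump over the matching run
            if j >= m:
                break  # reached the end of the pattern
            mismatches += 1  # position j is a mismatch
            if mismatches > k:
                break  # too many mismatches, give up on this location
            j += 1  # step over the mismatch and jump again
        if mismatches <= k:
            hits.append(i + 1)
    return hits


print(k_mismatch_kangaroo("ABCAABCAC", "ABBAAC", 2))  # [1, 4]
```

Because the separator matches nothing in the text, a jump never runs past the end of the pattern, and each location costs at most k+1 LCP queries.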