Download presentation
Presentation is loading. Please wait.
1
String Matching with Mismatches Some slides are stolen from Moshe Lewenstein (Bar Ilan University)
2
String Matching with Mismatches Landau – Vishkin 1986 Galil – Giancarlo 1986 Abrahamson 1987 Amir - Lewenstein - Porat 2000
3
Approximate String Matching problem: Find all text locations where distance from pattern is sufficiently small. distance metric:HAMMING DISTANCE Let S = s 1 s 2 … s m R = r 1 r 2 … r m Ham(S,R) = The number of locations j where s j r j Example: S = ABCABC R = ABBAAC Ham(S,R) = 2
4
Example: P = A B B A A C T = A B C A A B C A C … Input: T = t 1... t n P = p 1 … p m Output: For each i in T Ham(P, t i t i+1 … t i+m-1 ) Problem 1: Counting mismatches
5
Example: P = A B B A A C T = A B C A A B C A C … 2 Ham(P,T 1 ) = 2 Input: T = t 1... t n P = p 1 … p m Output: For each i in T Ham(P, t i t i+1 … t i+m-1 ) Counting mismatches
6
Example: P = A B B A A C T = A B C A A B C A C … 2, 4 Ham(P,T 2 ) = 4 Input: T = t 1... t n P = p 1 … p m Output: For each i in T Ham(P, t i t i+1 … t i+m-1 ) Counting mismatches
7
Input: T = t 1... t n P = p 1 … p m Output: For each i in T Ham(P, t i t i+1 … t i+m-1 ) Example: P = A B B A A C T = A B C A A B C A C … 2, 4, 6 Ham(P,T 3 ) = 6 Counting mismatches
8
Example: P = A B B A A C T = A B C A A B C A C … 2, 4, 6, 2 Ham(P,T 4 ) = 2 Input: T = t 1... t n P = p 1 … p m Output: For each i in T Ham(P, t i t i+1 … t i+m-1 ) Counting mismatches
9
Example: P = A B B A A C T = A B C A A B C A C … 2, 4, 6, 2, … Input: T = t 1... t n P = p 1 … p m Output: For each i in T Ham(P, t i t i+1 … t i+m-1 ) Counting mismatches
10
Example: k = 2 P = A B B A A C T = A B C A A B C A C … 2, 4, 6, 2, … Problem 2: k-mismatches Input: T = t 1... t n P = p 1 … p m Output: Every i where Ham(P, t i t i+1 … t i+m-1 ) ≤ k
11
Example: k = 2 P = A B B A A C T = A B C A A B C A C … 2, 4, 6, 2, … 1, 0, 0, 1, Problem 2: k-mismatches Input: T = t 1... t n P = p 1 … p m Output: Every i where Ham(P, t i t i+1 … t i+m-1 ) ≤ kh
12
Naïve Algorithm (for counting mismatches or k-mismatches problem) Running Time: O(nm) n = |T|, m = |P| - Goto each location of text and compute hamming distance of P and T i
13
The Kangaroo Method (for k-mismatches) Landau – Vishkin 1986 Galil – Giancarlo 1986
14
The Kangaroo Method (for k-mismatches) -Create suffix tree (+ lca) for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i
15
The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i
16
The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i
17
The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i
18
The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i
19
The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i
20
The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i
21
The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T - Do up to k LCP queries for every text location Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i
22
The Kangaroo Method (for k-mismatches) Preprocess: Build suffix tree of both P and T - O(n+m) time LCA preprocessing - O(n+m) time Check P at given text location Kangroo jump till next mismatch - O(k) time Overall time: O(nk)
23
How do we do counting in less than O(nm) ?
24
Lets start with binary strings 0 1 0 1 1 0 1 1 P = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1... T = If we pad the pattern and the text with zeros this is like a convolution of two vectors of length m+n We can do this using FFT in O(nlog(n)) time
25
a b a c c a c b a c a b a c c P = a b b b c c c a a a a b a c b... T = And if the strings are not binary ?
26
a b a c c a c b a c a b a c c P = a b a c c a c b a c a b a c c a-mask a b b b c c c a a a a b a c b... T =
27
a b a c c a c b a c a b a c c P = a b a c c a c b a c a b a c c a-mask PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T =
28
a b a c c a c b a c a b a c c P = PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T =
29
a b a c c a c b a c a b a c c P = PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T = a b b b c c c a a a a b a c b... not-a mask
30
a b a c c a c b a c a b a c c P = PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T = a b b b c c c a a a a b a c b... not-a mask 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1T not a
31
a b a c c a c b a c a b a c c P = PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1T not a
32
a b a c c a c b a c a b a c c P = PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1T not a Multiply P a and T (not a) to count mismatches using “a” PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1T not a...
33
a b a c c a c b a c a b a c c P = PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1T not a Multiply P a and T not a to count mismatches using a PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1T not a...
34
Boolean Convolutions (FFT) Method
35
Running Time: One boolean convolution - O(n log m) time # of matches of all symbols - O(n| | log m) time Boolean Convolutions (FFT) Method
36
How do we do counting in less than O(nm) ? Lets count matches rather than mismatches
37
a b c d e b b d b g d e f h d c c a b g h h... P = T = counter increment For each character you have a list of offsets where it occurs in the pattern, When you see the char in the text, you increment the appropriate counters.
38
a b c d e b b d b g d e f h d c c a b g h h... P = T = counter increment For each character you have a list of offsets where it occurs in the pattern, When you see the char in the text, you increment the appropriate counters.
39
a b c d e b b d b g d e f h d c c a b g h h... P = T = counter increment This is fast if all characters are “rare”
40
Partition the characters into rare and frequent Rare: occurs ≤ c times in the pattern For rare characters run this scheme with the counters Takes O(nc) time
41
Frequest chars You have at most m/c of them Do a convolution for each Total cost O(m/c n log(n)).
42
Fix c Complexity:
43
Frequent Symbol: a symbol that appears at least times in P. Back to the k-mismatch problem Want to beat the O(nk) kangaroo bound
44
Few (≤√k) frequent symbols Do the counters scheme for non-frequent Convolve for each frequent O(n log n) O(n )
45
(≥√k) frequent symbols Intuition: There cannot be too many places where we match
46
(≥√k) frequent symbols - Consider frequent symbols. - For each of them consider the first appearances. Do the counters scheme just for these symbols and occurrences
47
k = 4, = 4 a b a c c a c b a c a b a c c P = a b a c c a c b a c a b a c c a-mask c-mask a b b b c c c a a a a b a c b... T = use a-mask
48
Example of Masked Counting k = 4, = 4 a b a c c a c b a c a b a c c P = a b a c c a c b a c a b a c c a-mask c-mask a b b b c c c a a a a b a c b... T = a b a c c a c b a c a b a c c d 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0... 1 counter
49
Example of Masked Counting k = 4, = 4 a b a c c a c b a c a b a c c P = a b a c c a c b a c a b a c c a-mask c-mask a b b b c c c a a a a b a c b... T = a b a c c a c b a c a b a c c d 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0... 1 counter
50
Counting Stage: Run through text and count occurrences of all marks. Time: O(n ). For location i of T, if counter i < k then no match at location i. Why? The total # of elements in all masks is 2 = 2k. Important Observations: 1) Sum of all counters 2 n 2) Every counter whose value is less than k already has more than k errors.
51
How many locations remain? Sum of all counters: 2n Value of potential matches > k The Kangaroo Method. How do we check these locations? Use Kangaroo Method Time: O(k) per location Overall Time: O( ) = O( ) # of potential matches:
52
Additional Points Can reduce to O( n )
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.