Presentation is loading. Please wait.

Presentation is loading. Please wait.

String Matching with k Mismatches

Similar presentations


Presentation on theme: "String Matching with k Mismatches"— Presentation transcript:

1 String Matching with k Mismatches
Moshe Lewenstein Bar Ilan University Modified by Ariel Rosenfeld

2 String Matching with k Mismatches
Landau – Vishkin Galil – Giancarlo Abrahamson Amir - Lewenstein - Porat Clifford – Fontaine – Porat – Sach – Starikovskaya

3 Exact String Matching Input: T = t1 . . . tn P = p1 … pm
Output: All locations i of T where P appears Example: P = A B C A A B T = A B A B C A A B C A A B C A A B A A…

4 Exact String Matching Input: T = t1 . . . tn P = p1 … pm
Output: All locations i of T where P appears Example: P = A B C A A B T = A B A B C A A B C A A B C A A B A A… 3

5 Exact String Matching Input: T = t1 . . . tn P = p1 … pm
Output: All locations i of T where P appears Example: P = A B C A A B T = A B A B C A A B C A A B C A A B A A…

6 Exact String Matching Input: T = t1 . . . tn P = p1 … pm
Output: All locations i of T where P appears Example: P = A B C A A B T = A B A B C A A B C A A B C A A B A A…

7 Exact String Matching Input: T = t1 . . . tn P = p1 … pm
Output: All locations i of T where P appears Example: P = A B C A A B T = A B A B C A A B C A A B C A A B A A… Answer: {3,7,11,..}

8 Exact String Matching Problem: Matching not exact in applications of:
Computational Biology Musicology Text Editing Meteorology etc. Need other definitions of string matching!

9 Approximate String Matching
Idea: Find all text locations where distance from pattern is sufficiently small. distance metric: HAMMING DISTANCE Let S = s1s2…sm R = r1r2…rm Ham(S,R) = The number of locations j where sj rj Example: S = ABCABC R = ABBAAC Ham(S,R) = 2

10 String Matching with Mismatches
Input: T = t tn P = p1 … pm Output: For each i in T Ham(P, titi+1…ti+m-1) Example: P = A B B A A C T = A B C A A B C A C…

11 String Matching with Mismatches
Input: T = t tn P = p1 … pm Output: For each i in T Ham(P, titi+1…ti+m-1) Example: P = A B B A A C T = A B C A A B C A C… 2 Ham(P,T1) = 2

12 String Matching with Mismatches
Input: T = t tn P = p1 … pm Output: For each i in T Ham(P, titi+1…ti+m-1) Example: P = A B B A A C T = A B C A A B C A C… 2, 4 Ham(P,T2) = 4

13 String Matching with Mismatches
Input: T = t tn P = p1 … pm Output: For each i in T Ham(P, titi+1…ti+m-1) Example: P = A B B A A C T = A B C A A B C A C… 2, 4, 6 Ham(P,T3) = 6

14 String Matching with Mismatches
Input: T = t tn P = p1 … pm Output: For each i in T Ham(P, titi+1…ti+m-1) Example: P = A B B A A C T = A B C A A B C A C… 2, 4, 6, 2 Ham(P,T4) = 2

15 String Matching with Mismatches
Input: T = t tn P = p1 … pm Output: For each i in T Ham(P, titi+1…ti+m-1) Example: P = A B B A A C T = A B C A A B C A C… 2, 4, 6, 2, …

16 String Matching with k Mismatches
Input: T = t tn, P = p1 … pm Output: Every i in T s.t. Ham(P, titi+1…ti+m-1) k Example: k = 2 P = A B B A A C T = A B C A A B C A C… 2, 4, 6, 2, …

17 String Matching with k Mismatches
Input: T = t tn, P = p1 … pm Output: Every i in T s.t. Ham(P, titi+1…ti+m-1) k Example: k = 2 P = A B B A A C T = A B C A A B C A C… 2, 4, 6, 2, …

18 String Matching with k Mismatches
Input: T = t tn, P = p1 … pm Output: Every i in T s.t. Ham(P, titi+1…ti+m-1) k Example: k = 2 P = A B B A A C T = A B C A A B C A C… 2, 4, 6, 2, … Y,N,N,Y,…

19 Naïve Algorithm (for counting mismatches or k-mismatches problem)
- Goto each location of text and compute hamming distance of P and Ti Running Time: O(nm) n = |T|, m = |P|

20 The Kangaroo Method (for k-mismatches) Landau – Vishkin 1986
Galil – Giancarlo

21 Trie A tree representing a set of strings. c { a aeef b ad bbfe bbfg e

22 Trie (Cont) Assume no string is a prefix of another c a b e b d e f f
Each string corresponds to a leaf. a b e b d e f f e g

23 Compressed Trie Compress unary nodes, label edges by strings c c a a b
c a a b e b d bbf d eef e f f e g e g

24 Suffix tree Suffix tree of string s:
a compressed trie of all suffixes of s Prefix-free: add a special character, say $, at the end of s

25 Suffix tree (Example) Let s = abab, a suffix tree of s is a compressed trie of all suffixes of s=abab$ $ { $ b$ ab$ bab$ abab$ } a b b $ a a $ b b $ $

26 Suffix Tree properties
1 2 a b $ 3 4 5 b Succint in space - O(n). - Can be built in O(n) time. McCreight, Weiner, Ukkonen, Farach-Colton

27 Exact string matching s=abab$ $ a b b $ a a $ b b $ $
5 $ a a $ b 4 b $ $ 3 2 1 Given a pattern P = ab we traverse the tree according to the pattern.

28 Exact string matching s=abab$ 1 3 $ a b b $ a a $ b b $ $
1 3 a b b 5 $ a a $ b 4 b $ $ 3 2 1 Leaves correspond to locations of appearance!

29 Exact string matching s=abab$ 1 3 $ a b b $ a a $ b b $ $
1 3 a b b 5 $ a a $ b 4 b $ $ 3 2 1 Prepare Tree: O(n) time Find matches: O(m + occ) time occ = # of matches

30 Lowest common ancestors
A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it

31 Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes $ s = abbaab$ a b 7 $ a a b b b $ a a 6 b a b $ 4 b a $ $ a 3 b 5 $ 2 1

32 Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes $ s = abbaab$ aab$ a b 7 $ a a b b b $ a a 6 b a b $ 4 b a $ $ a 3 b 5 $ 2 1

33 Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes $ s = abbaab$ aab$ abbaab$ a b 7 $ a a b b b $ a a 6 b a b $ 4 b a $ $ a 3 b 5 $ 2 1

34 Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes $ s = abbaab$ aab$ abbaab$ a b 7 $ a a b b b $ a a 6 b a b $ 4 b a $ $ a 3 b 5 $ 2 1

35 LCA/LCP properties Preprocesssing time : O(n) Query Time: O(1)
3 b a $ 5 2 4 6 7 LCA/LCP properties a Preprocesssing time : O(n) Query Time: O(1) Harel & Tarjan 1984, Schieber & Vishkin 1988, Berkman & Vishkin 1993

36 The Kangaroo Method (for k-mismatches) Create suffix tree for: s = P#T
Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

37 The Kangaroo Method (for k-mismatches) Create suffix tree for: s = P#T
Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

38 The Kangaroo Method (for k-mismatches) Create suffix tree for: s = P#T
Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

39 The Kangaroo Method (for k-mismatches) Create suffix tree for: s = P#T
Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

40 The Kangaroo Method (for k-mismatches) Create suffix tree for: s = P#T
Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

41 The Kangaroo Method (for k-mismatches) Create suffix tree for: s = P#T
Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

42 The Kangaroo Method (for k-mismatches) Create suffix tree for: s = P#T
Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

43 The Kangaroo Method (for k-mismatches) Create suffix tree for: s = P#T
Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

44 The Kangaroo Method (for k-mismatches) Create suffix tree for: s = P#T
Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

45 The Kangaroo Method (for k-mismatches) Create suffix tree for: s = P#T
Do up to k LCP queries for every text location Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

46 The Kangaroo Method (for k-mismatches) Preprocess:
Build suffix tree of both P and T - O(n+m) time LCA preprocessing O(n+m) time Check P at given text location Kangroo jump till next mismatch - O(k) time Overall time: O(nk)


Download ppt "String Matching with k Mismatches"

Similar presentations


Ads by Google