Published by Wesley Cox. Modified over 9 years ago.
1
Semi-Numerical String Matching
2
Semi-numerical String Matching
All the methods we have seen so far are based on character comparisons. We now turn to methods built on other kinds of computation: arithmetic, bit operations, and the fast Fourier transform.
3
Semi-numerical String Matching
We will survey three examples of such methods:
1. The random fingerprint method due to Karp and Rabin.
2. The Shift-And method due to Baeza-Yates and Gonnet, and its extension to agrep due to Wu and Manber.
3. A solution to the match-count problem using the fast Fourier transform, due to Fischer and Paterson, and an improvement due to Abrahamson.
4
Karp-Rabin fingerprint - exact match
Exact match problem: find all occurrences of the pattern P in the text T. The pattern P has length n and the text T has length m.
5
Karp-Rabin fingerprint - exact match
Arithmetic replaces comparisons. This gives an efficient randomized algorithm that errs with small probability, and a randomized variant that never errs and whose expected running time is efficient. We will consider a binary alphabet: {0,1}.
6
Arithmetic replaces comparisons.
Strings are also numbers, H: strings → numbers. Let s be a binary string of length n and define $H(s) = \sum_{i=1}^{n} 2^{n-i} s(i)$, that is, s read as a binary number. Definition: let $T_r$ denote the length-n substring of T starting at position r.
7
Arithmetic replaces comparisons.
Strings are also numbers, H: strings → numbers. Example: T = 1 0 1 1 0 1 0 1 and P = 0 1 0 1, so $H(P) = 5$.
$T_5$ = 0 1 0 1, so $H(T_5) = 5 = H(P)$.
$T_2$ = 0 1 1 0, so $H(T_2) = 6 \ne H(P) = 5$.
8
Arithmetic replaces comparisons.
Theorem: There is an occurrence of P starting at position r of T if and only if $H(P) = H(T_r)$.
Proof: Follows immediately from the uniqueness of the base-2 representation of a number.
9
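Arithmetic replaces comparisons.
We can compute $H(T_r)$ from $H(T_{r-1})$: $H(T_r) = 2H(T_{r-1}) - 2^n T(r-1) + T(r+n-1)$.
Example: T = 1 0 1 1 0 1 0 1 with n = 4. $T_1$ = 1 0 1 1 gives $H(T_1) = 11$, and $T_2$ = 0 1 1 0 gives $H(T_2) = 2 \cdot 11 - 2^4 \cdot 1 + 0 = 6$.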
10
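Arithmetic replaces comparisons.
A simple, efficient algorithm:
1. Compute $H(P)$ and $H(T_1)$.
2. Run over T: compute $H(T_r)$ from $H(T_{r-1})$ in constant time and compare it with $H(P)$.
Total running time O(m)? Only if arithmetic on n-bit numbers takes constant time, which is unrealistic for large n.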
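The update rule can be sketched in Python as follows. This is a minimal illustration rather than the slides' own code: the names `H` and `exact_matches` are chosen here, strings are lists of 0/1 values, and Python's arbitrary-precision integers hide the cost of working with n-bit numbers.

```python
def H(bits):
    """Read a list of 0/1 values as a binary number, most significant bit first."""
    value = 0
    for b in bits:
        value = 2 * value + b
    return value


def exact_matches(P, T):
    """Return the 1-based positions r with H(T_r) == H(P)."""
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []
    target = H(P)
    h = H(T[:n])                      # H(T_1)
    top = 2 ** n
    hits = [1] if h == target else []
    for r in range(2, m - n + 2):     # r = 2 .. m-n+1 (1-based)
        # H(T_r) = 2*H(T_{r-1}) - 2^n * T(r-1) + T(r+n-1)
        h = 2 * h - top * T[r - 2] + T[r + n - 2]
        if h == target:
            hits.append(r)
    return hits


T = [1, 0, 1, 1, 0, 1, 0, 1]
P = [0, 1, 0, 1]
print(exact_matches(P, T))  # [5]
```

On the example from the earlier slide, the only position reported is r = 5.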
12
Karp-Rabin
Let’s use modular arithmetic; this keeps the numbers small. For a positive integer p, the fingerprint of P is defined by $H_p(P) = H(P) \bmod p$.
13
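Karp-Rabin
Lemma: $H_p(P) = \{[\ldots(\{[P(1) \cdot 2 \bmod p + P(2)] \cdot 2 \bmod p + P(3)\} \cdot 2 \bmod p + P(4)) \ldots] \cdot 2 \bmod p + P(n)\} \bmod p$, and during this computation no number ever exceeds 2p.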
14
An example
P = 1 0 1 1 1 1, so $H(P) = 47$. With p = 7, $H_p(P) = 47 \bmod 7 = 5$.
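The fingerprint can be evaluated left to right, reducing modulo p after every doubling, so that no intermediate value exceeds 2p. The sketch below is illustrative only; the function name `fingerprint` and the `peak` bookkeeping are assumptions added here to make the bound visible.

```python
def fingerprint(bits, p):
    """Return (H_p(bits), largest intermediate value seen)."""
    h = 0
    peak = 0
    for b in bits:
        doubled = 2 * h               # at most 2p, because h <= p
        h = doubled % p + b           # at most (p - 1) + 1 = p
        peak = max(peak, doubled, h)
    return h % p, peak


value, peak = fingerprint([1, 0, 1, 1, 1, 1], 7)
print(value, peak)  # 5 10
```

For P = 1 0 1 1 1 1 and p = 7 the successive values of `h` are 1, 2, 5, 4, 2, 5, and the largest intermediate number is 10, which is below 2p = 14.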
15
Karp-Rabin
Intermediate numbers are also kept small, and we can still compute $H_p(T_r)$ from $H_p(T_{r-1})$.
Arithmetic: $H(T_r) = 2H(T_{r-1}) - 2^n T(r-1) + T(r+n-1)$.
Modular arithmetic: $H_p(T_r) = \big(2H_p(T_{r-1}) - (2^n \bmod p)\, T(r-1) + T(r+n-1)\big) \bmod p$, where $2^n \bmod p = 2\,(2^{n-1} \bmod p) \bmod p$.
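A short sketch of the modular update, assuming T is a list of 0/1 values with at least n entries. The name `rolling_fingerprints` is chosen here for illustration.

```python
def rolling_fingerprints(T, n, p):
    """Return [H_p(T_1), ..., H_p(T_{m-n+1})] for a list T of 0/1 values."""
    m = len(T)
    pow_n = pow(2, n, p)              # 2^n mod p
    h = 0
    for b in T[:n]:
        h = (2 * h + b) % p           # H_p(T_1)
    out = [h]
    for r in range(1, m - n + 1):     # r is the 0-based start of the new window
        # Python's % always returns a value in 0..p-1, even if the left side is negative.
        h = (2 * h - pow_n * T[r - 1] + T[r + n - 1]) % p
        out.append(h)
    return out


print(rolling_fingerprints([1, 0, 1, 1, 0, 1, 0, 1], 4, 7))  # [4, 6, 6, 3, 5]
```

With p = 7 the substrings $T_2$ = 0 1 1 0 and $T_3$ = 1 1 0 1 receive the same fingerprint 6 even though they differ, which is exactly the kind of collision the following slides deal with.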
17
Karp-Rabin
How about the comparisons?
Arithmetic: there is an occurrence of P starting at position r of T if and only if $H(P) = H(T_r)$.
Modular arithmetic: if there is an occurrence of P starting at position r of T, then $H_p(P) = H_p(T_r)$. There are values of p for which the converse is not true!
18
Karp-Rabin
Definition: If $H_p(P) = H_p(T_r)$ but P does not occur in T starting at position r, we say there is a false match between P and T at position r. If there is a false match at some position r, we say there is a false match between P and T.
19
Karp-Rabin
Our goal is to choose a modulus p that is small enough to keep the computations efficient, yet large enough to keep the probability of a false match small.
20
Prime moduli limit false matches
Definition: For a positive integer u, $\pi(u)$ is the number of primes less than or equal to u.
Prime number theorem (without proof): $\frac{u}{\ln u} \le \pi(u) \le 1.26\,\frac{u}{\ln u}$.
21
Prime moduli limit false matches
Lemma (without proof): if u ≥ 29, then the product of all the primes less than or equal to u is greater than $2^u$.
Example: for u = 29, the primes less than or equal to 29 are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29. Their product is 6,469,693,230 > 536,870,912 = $2^{29}$.
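The lemma can be checked numerically for small u. The sketch below uses a plain sieve of Eratosthenes; the function names are chosen here for illustration, and checking a finite range is of course no substitute for a proof.

```python
def primes_up_to(u):
    """Sieve of Eratosthenes: all primes <= u."""
    if u < 2:
        return []
    sieve = [True] * (u + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(u ** 0.5) + 1):
        if sieve[i]:
            for multiple in range(i * i, u + 1, i):
                sieve[multiple] = False
    return [i for i, flag in enumerate(sieve) if flag]


def prime_product_exceeds_power_of_two(u):
    """True if the product of all primes <= u is greater than 2^u."""
    product = 1
    for q in primes_up_to(u):
        product *= q
    return product > 2 ** u


print(prime_product_exceeds_power_of_two(29))  # True
print(prime_product_exceeds_power_of_two(28))  # False
```

The threshold is tight: for u = 28 the product of the primes is 223,092,870, which is smaller than $2^{28}$ = 268,435,456.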
22
Prime moduli limit false matches
Corollary: If u ≥ 29 and x is any number less than or equal to $2^u$, then x has fewer than $\pi(u)$ distinct prime divisors.
Proof: Assume x has $k \ge \pi(u)$ distinct prime divisors $q_1, \ldots, q_k$. Then $2^u \ge x \ge q_1 \cdots q_k$. But $q_1 \cdots q_k$ is at least as large as the product of the first $\pi(u)$ primes, which by the lemma is greater than $2^u$, a contradiction.
23
Prime moduli limit false matches
Theorem: Let I be a positive integer and p a randomly chosen prime less than or equal to I. If nm ≥ 29, then the probability of a false match between P and T is at most $\pi(nm)/\pi(I)$.
24
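Prime moduli limit false matches
Proof: Let R be the set of positions in T where P does not begin. We have $\prod_{r \in R} |H(P) - H(T_r)| \le 2^{nm}$, since each factor is at most $2^n$ and there are at most m factors. By the corollary, the product has at most $\pi(nm)$ distinct prime divisors. If there is a false match at position r, then p divides $|H(P) - H(T_r)|$ and thus also divides the product. So p must lie in a set of size at most $\pi(nm)$, but p was chosen randomly out of a set of size $\pi(I)$.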
25
Random fingerprint algorithm
1. Choose a positive integer I.
2. Pick a random prime p less than or equal to I, and compute P’s fingerprint $H_p(P)$.
3. For each position r in T, compute $H_p(T_r)$ and test whether it equals $H_p(P)$. If the numbers are equal, either declare a probable match or check explicitly and declare a definite match.
Running time, excluding verification: O(m).
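The steps above can be sketched as follows. Several choices here are assumptions made for illustration and are not part of the slides: I is set to $nm^2$, the random prime is found by rejection sampling with trial division (`random_prime` and `is_prime` are helper names introduced here), and verification is the naive character-by-character check rather than a linear-time method.

```python
import random


def is_prime(x):
    """Trial division; adequate for the small moduli used in this sketch."""
    if x < 2:
        return False
    i = 2
    while i * i <= x:
        if x % i == 0:
            return False
        i += 1
    return True


def random_prime(limit, rng):
    """Uniformly random prime <= limit (limit >= 2), by rejection sampling."""
    while True:
        candidate = rng.randint(2, limit)
        if is_prime(candidate):
            return candidate


def karp_rabin(P, T, rng=None, verify=True):
    """Return 1-based positions where P is declared to occur in T (lists of 0/1)."""
    rng = rng or random.Random()
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []
    p = random_prime(n * m * m, rng)          # I = n * m^2
    pow_n = pow(2, n, p)
    hp = ht = 0
    for i in range(n):
        hp = (2 * hp + P[i]) % p              # H_p(P)
        ht = (2 * ht + T[i]) % p              # H_p(T_1)
    hits = []
    for r in range(m - n + 1):                # r is the 0-based start position
        if ht == hp and (not verify or T[r:r + n] == P):
            hits.append(r + 1)
        if r + n < m:
            ht = (2 * ht - pow_n * T[r] + T[r + n]) % p
    return hits


print(karp_rabin([0, 1, 0, 1], [1, 0, 1, 1, 0, 1, 0, 1]))  # [5]
```

With `verify=True` the result never contains a false match, whatever prime is drawn; with `verify=False` the function reports probable matches only.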
26
How to choose I
The smaller I is, the more efficient the computations. The larger I is, the smaller the probability of a false match.
Proposition: When $I = nm^2$:
1. The largest number used in the algorithm requires at most 4(log(n)+log(m)) bits.
2. The probability of a false match is at most 2.53/m.
27
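How to choose I
Proof:
1. No number used exceeds $2p \le 2nm^2$, which takes about $\log(2nm^2) = 1 + \log n + 2\log m \le 4(\log n + \log m)$ bits.
2. By the theorem and the bounds $u/\ln u \le \pi(u) \le 1.26\,u/\ln u$, the probability of a false match is at most $\frac{\pi(nm)}{\pi(nm^2)} \le 1.26 \cdot \frac{nm}{\ln(nm)} \cdot \frac{\ln(nm^2)}{nm^2} = \frac{1.26}{m} \cdot \frac{\ln n + 2\ln m}{\ln n + \ln m} \le \frac{2.53}{m}$.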
28
Extensions
An idea: why not choose k primes?
Proposition: When k primes are chosen randomly and independently between 1 and I, and a match is declared only if all k fingerprints agree, the probability of a false match is at most $[\pi(nm)/\pi(I)]^k$.
Proof: We saw that if p allows an error, it lies in a set of at most $\pi(nm)$ primes. A false match can occur only if each of the k independently chosen primes lies in that set.
29
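An illustration
k = 4, n = 250, m = 4000, $I = nm^2 = 250 \cdot 4000^2 = 4 \times 10^9 < 2^{32}$. Since $\pi(nm)/\pi(I) \le 2.53/m$ for this choice of I, the probability of a false match is at most $(2.53/4000)^4 \approx 1.6 \times 10^{-13}$.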
30
Even lower limits on the error
When k primes are used, the probability of a false match is at most $m[\pi(n)/\pi(I)]^k$.
Proof: Suppose a false match occurs at position r. Then each of the primes must divide $|H(P) - H(T_r)| \le 2^n$, which has at most $\pi(n)$ distinct prime divisors. Each prime is chosen from a set of size $\pi(I)$ and by chance falls in a set of size at most $\pi(n)$, so a false match at position r has probability at most $[\pi(n)/\pi(I)]^k$. Summing over the at most m positions gives the bound.
31
Checking for error in linear time
Consider the list L of locations in T where the Karp-Rabin algorithm declares P to be found. A run is a maximal interval of consecutive starting locations $l_1, l_2, \ldots, l_r$ in L such that every two successive locations differ by at most n/2. Let’s verify a run.
32
Checking for error in linear time
Check the first two declared occurrences, at $l_1$ and $l_2$, explicitly.
P = abbabbabbabbab
T = abbabbabbabbabbabbabbabbabbax…
If there is a false match, stop. Otherwise P is semi-periodic with period $d = l_2 - l_1$.
33
Checking for error in linear time
d is the minimal period of P.
P = abbabbabbabbab
T = abbabbabbabbabbabbabbabbabbax…
34
Checking for error in linear time
P = abbabbabbabbab
T = abbabbabbabbabbabbabbabbabbax…
For each i, check that $l_{i+1} - l_i = d$. Then, for each i, check the last d characters of the declared occurrence at $l_i$.
35
Checking for error in linear time
P = abbabbabbabbab
T = abbabbabbabbabbabbabbabbabbax…
Check $l_1$.
36
Checking for error in linear time
P = abbabbabbabbab
T = abbabbabbabbabbabbabbabbabbax…
Check $l_2$. P is semi-periodic with period 3.
37
Checking for error in linear time
T = abbabbabbabbabbabbabbabbabbax…
Check that $l_{i+1} - l_i = 3$.
38
Checking for error in linear time
Last 3 characters of P = bab
T = abbabbabbabbabbabbabbabbabbax…
For each i, check the last 3 characters of the declared occurrence at $l_i$. Report a false match or approve the run.
41
Time analysis
No character of T is examined more than twice during a single run. Two runs are separated by at least n/2 positions, and each run is at least n positions long. Thus no character of T is examined in more than two consecutive runs. Total verification time: O(m).
42
Time analysis
When we detect a false match, we start again with a different prime. The probability of a false match is O(1/m). We have thus converted the algorithm into one that never errs and has expected linear running time.
43
Why use Karp-Rabin?
It is efficient and simple. It is space efficient. It can be generalized to solve harder problems such as 2-dimensional string matching. Its performance is backed by a concrete theoretical analysis.
44
The Shift-And Method
45
The Shift-And Method
We start with the exact match problem. Define M to be a binary n by m matrix such that M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j. That is, M(i,j) = 1 iff P[1..i] ≡ T[j-i+1..j].
46
The Shift-And Method
Let T = california and P = for, so m = 10 and n = 3. M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j.
j:     1 2 3 4 5 6 7 8 9 10
i = 1: 0 0 0 0 1 0 0 0 0 0
i = 2: 0 0 0 0 0 1 0 0 0 0
i = 3: 0 0 0 0 0 0 1 0 0 0
How does M solve the exact match problem? P occurs in T ending at position j iff M(n,j) = 1.
47
How to construct M
We will construct M column by column. Two definitions are in order. Bit-Shift(j-1) is the vector derived by shifting the vector for column j-1 down by one position and setting the first bit to 1. Example: if column j-1 is (1, 0, 0, 1, 0), then Bit-Shift(j-1) = (1, 1, 0, 0, 1).
48
How to construct M
We define the n-length binary vector U(x) for each character x in the alphabet: U(x) has a 1 exactly in the positions of P where character x appears. Example: for P = abaac, U(a) = (1, 0, 1, 1, 0), U(b) = (0, 1, 0, 0, 0) and U(c) = (0, 0, 0, 0, 1).
49
How to construct M
Initialize column 0 of M to all zeros. For j ≥ 1, column j is obtained by M(j) = Bit-Shift(j-1) AND U(T(j)).
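A minimal Python sketch of this construction, keeping each column of M in a single integer: bit i-1 of the integer stands for row i. The function names and the integer encoding are choices made here for illustration, and the pattern is assumed to be non-empty.

```python
def shift_and_columns(P, T):
    """Yield column j of M for j = 1..m, encoded as an integer (bit i-1 = row i)."""
    U = {}
    for i, ch in enumerate(P):
        U[ch] = U.get(ch, 0) | (1 << i)      # U(x): positions of x in P
    column = 0                                # column 0 is all zeros
    for ch in T:
        # Bit-Shift: shift down by one and set the first bit, then AND with U(T(j)).
        column = ((column << 1) | 1) & U.get(ch, 0)
        yield column


def shift_and(P, T):
    """Return the 1-based start positions of the exact occurrences of P in T."""
    n = len(P)
    last_row = 1 << (n - 1)                   # bit for row n
    return [j - n + 1
            for j, column in enumerate(shift_and_columns(P, T), start=1)
            if column & last_row]


print(shift_and("for", "california"))         # [5]
print(shift_and("aba", "xabxabaaxa"))         # [5]
```

Only the current column is stored, so the memory used is one n-bit integer plus the table U.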
50
An example, j = 1
T = x a b x a b a a x a (positions 1 to 10)
P = a b a a c (positions 1 to 5)
j:     1
i = 1: 0
i = 2: 0
i = 3: 0
i = 4: 0
i = 5: 0
51
An example, j = 2
T = x a b x a b a a x a
P = a b a a c
j:     1 2
i = 1: 0 1
i = 2: 0 0
i = 3: 0 0
i = 4: 0 0
i = 5: 0 0
52
An example, j = 3
T = x a b x a b a a x a
P = a b a a c
j:     1 2 3
i = 1: 0 1 0
i = 2: 0 0 1
i = 3: 0 0 0
i = 4: 0 0 0
i = 5: 0 0 0
54
An example, j = 8
T = x a b x a b a a x a
P = a b a a c
j:     1 2 3 4 5 6 7 8
i = 1: 0 1 0 0 1 0 1 1
i = 2: 0 0 1 0 0 1 0 0
i = 3: 0 0 0 0 0 0 1 0
i = 4: 0 0 0 0 0 0 0 1
i = 5: 0 0 0 0 0 0 0 0
55
Correctness
For i > 1, entry M(i,j) = 1 iff:
1) The first i-1 characters of P match the i-1 characters of T ending at character j-1.
2) Character P(i) ≡ T(j).
Condition 1) is true when M(i-1,j-1) = 1. Condition 2) is true when the i-th bit of U(T(j)) is 1. The algorithm computes the AND of these two bits.
56
Correctness
T = x a b x a b a a x a
P = a b a a c
j:     1 2 3 4 5 6 7 8 9 10
i = 1: 0 1 0 0 1 0 1 1 0 1
i = 2: 0 0 1 0 0 1 0 0 0 0
i = 3: 0 0 0 0 0 0 1 0 0 0
i = 4: 0 0 0 0 0 0 0 1 0 0
i = 5: 0 0 0 0 0 0 0 0 0 0
M(4,8) = 1 because a b a a, the prefix of P of length 4, ends at position 8 in T.
Condition 1): a b a, the prefix of length 3, ends at position 7 in T ↔ M(3,7) = 1.
Condition 2): the fourth character of P equals the eighth character of T ↔ the fourth bit of U(T(8)) is 1.
57
How much did we pay?
Formally the running time is Θ(mn). However, the method is very efficient when n is no larger than one or a few machine words, because each column update then takes a constant number of word operations. Furthermore, only two columns of M are needed at any given time. Hence the space used by the algorithm is O(n).
58
agrep: The Shift-And Method with errors
We extend the Shift-And method to find inexact occurrences of a pattern in a text. Reminder example: T = aatatccacaa, P = atcgaa. P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.
T = a a t a t c c a c a a
P =       a t c g a a          (position 4, 2 mismatches)
T = a a t a t c c a c a a
P =   a t c g a a              (position 2, 4 mismatches)
59
agrep
Our current goal: given k, find all the occurrences of P in T with up to k mismatches. We define $M^k$ to be an n by m binary matrix such that $M^k(i,j) = 1$ iff at least i-k of the first i characters of P match the i characters of T ending at character j. What is $M^0$? It is the matrix M of the exact method. How does $M^k$ solve the k-mismatch problem? P occurs with at most k mismatches ending at position j iff $M^k(n,j) = 1$.
60
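Computing $M^k$
We compute $M^l$ for all l = 0, …, k, where $M^0 = M$ is computed as in the exact method. For each j, compute the columns $M(j), M^1(j), \ldots, M^k(j)$. For all l, initialize $M^l(0)$ to the zero vector. For l ≥ 1, the j-th column of $M^l$ is given by $M^l(j) = [\text{Bit-Shift}(M^l(j-1)) \text{ AND } U(T(j))] \text{ OR } \text{Bit-Shift}(M^{l-1}(j-1))$, where Bit-Shift is applied to the indicated column.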
61
Computing $M^k$
First case: the first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal. This corresponds to the term Bit-Shift($M^l(j-1)$) AND U(T(j)).
62
Computing $M^k$
Second case: the first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches, so the next pair of characters may mismatch. This corresponds to the term Bit-Shift($M^{l-1}(j-1)$).
63
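Computing $M^k$
Putting the two cases together: we compute $M^l$ for all l = 1, …, k. For each j, compute $M(j), M^1(j), \ldots, M^k(j)$. For all l, initialize $M^l(0)$ to the zero vector. The j-th column of $M^l$ is given by $M^l(j) = [\text{Bit-Shift}(M^l(j-1)) \text{ AND } U(T(j))] \text{ OR } \text{Bit-Shift}(M^{l-1}(j-1))$.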
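The two cases can be combined into a short Python sketch. As in the exact method, each column is stored in one integer (bit i-1 stands for row i). The function name `shift_and_mismatches` and the masks `full` and `last_row` are names introduced here for illustration; the pattern is assumed to be non-empty.

```python
def shift_and_mismatches(P, T, k):
    """Return 1-based start positions where P occurs in T with at most k mismatches."""
    n = len(P)
    U = {}
    for i, ch in enumerate(P):
        U[ch] = U.get(ch, 0) | (1 << i)
    full = (1 << n) - 1                        # keeps every column n bits wide
    last_row = 1 << (n - 1)
    columns = [0] * (k + 1)                    # columns[l] = current column of M^l
    hits = []
    for j, ch in enumerate(T, start=1):
        u = U.get(ch, 0)
        previous = columns[:]                  # the columns for position j-1
        columns[0] = ((previous[0] << 1) | 1) & u
        for l in range(1, k + 1):
            extend_match = ((previous[l] << 1) | 1) & u      # next pair is equal
            spend_mismatch = (previous[l - 1] << 1) | 1      # next pair may differ
            columns[l] = (extend_match | spend_mismatch) & full
        if columns[k] & last_row:
            hits.append(j - n + 1)
    return hits


print(shift_and_mismatches("atcgaa", "aatatccacaa", 2))  # [4]
print(shift_and_mismatches("atcgaa", "aatatccacaa", 4))  # [2, 4, 5, 6]
```

With k = 0 the inner loop does not run and the function reduces to the exact Shift-And method.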
64
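Example: $M^1$
T = x a b x a b a a x a
P = a b a a c
$M^0$ =
j:     1 2 3 4 5 6 7 8 9 10
i = 1: 0 1 0 0 1 0 1 1 0 1
i = 2: 0 0 1 0 0 1 0 0 0 0
i = 3: 0 0 0 0 0 0 1 0 0 0
i = 4: 0 0 0 0 0 0 0 1 0 0
i = 5: 0 0 0 0 0 0 0 0 0 0
$M^1$ =
j:     1 2 3 4 5 6 7 8 9 10
i = 1: 1 1 1 1 1 1 1 1 1 1
i = 2: 0 0 1 0 0 1 0 1 1 0
i = 3: 0 0 0 1 0 0 1 0 0 1
i = 4: 0 0 0 0 1 0 0 1 0 0
i = 5: 0 0 0 0 0 0 0 0 1 0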
65
Example: $M^1$
T = x a b x a b a a x a
P = a b a a
$M^1$ =
j:     1 2 3 4 5 6 7 8 9 10
i = 1: 1 1 1 1 1 1 1 1 1 1
i = 2: 0 0 1 0 0 1 0 1 1 0
i = 3: 0 0 0 1 0 0 1 0 0 1
i = 4: 0 0 0 0 1 0 0 1 0 0
i = 5: 0 0 0 0 0 0 0 0 1 0
66
How much did we pay?
Formally the running time is Θ(kmn). Again, the method is efficient in practice for small n. Only two columns of each of the k+1 matrices are needed at any given time, so the space used by the algorithm is O(kn), which is O(n) for constant k.
67
The match count problem
68
The match-count problem
We want to count the exact number of characters that match in each of the different alignments of P with T.
T = a a t a t c c a c a a
P =       a t c g a a          (position 4: 4 matches)
T = a a t a t c c a c a a
P =   a t c g a a              (position 2: 2 matches)
69
The match-count problem
We will first look at a simple algorithm that extends the techniques we’ve seen so far. Next, we introduce a more efficient algorithm that exploits existing efficient methods for computing the Fourier transform. We conclude with a variation that performs well for unbounded alphabets.
70
Match-count Algorithm 1
We define MC to be an n by m integer-valued matrix such that MC(i,j) = the number of characters of P[1..i] that match T[j-i+1..j]. How does MC solve the match-count problem? Row n holds the answer: MC(n,j) is the match count for the alignment of P that ends at position j of T.
71
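Computing MC
Initialize column 0 of MC to all zeros, and treat row 0 as all zeros. For j ≥ 1, column j is obtained by $MC(i,j) = MC(i-1,j-1) + 1$ if P(i) = T(j), and $MC(i,j) = MC(i-1,j-1)$ otherwise. Total of Θ(nm) comparisons and (simple) additions.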
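A minimal sketch of this column-by-column computation, keeping only the previous column. The function name `match_counts` and the output convention (entry r-1 of the result is the count for the alignment starting at position r) are choices made here for illustration.

```python
def match_counts(P, T):
    """Return [MC(n, n), MC(n, n+1), ..., MC(n, m)], one count per alignment."""
    n, m = len(P), len(T)
    previous = [0] * (n + 1)                  # column j-1; index 0 is the zero row
    counts = []
    for j in range(1, m + 1):
        current = [0] * (n + 1)
        for i in range(1, min(n, j) + 1):
            # MC(i,j) = MC(i-1,j-1) + 1 if P(i) = T(j), else MC(i-1,j-1)
            current[i] = previous[i - 1] + (1 if P[i - 1] == T[j - 1] else 0)
        if j >= n:
            counts.append(current[n])         # alignment starting at j-n+1
        previous = current
    return counts


print(match_counts("abca", "ababcaaa"))       # [2, 0, 4, 1, 1]
```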
72
Match-count algorithm 2
Define a vector W that counts the matching symbols; its indices are the possible alignments. Example: T = a b a b c a a a, P = a b c a.
W(1) = 2 (a b a b against a b c a)
W(2) = 0 (b a b c against a b c a)
W(3) = 4 (a b c a against a b c a)
W(4) = 1 (b c a a against a b c a)
W(5) = 1 (c a a a against a b c a)
73
Match-count algorithm 2
Let’s handle one symbol at a time: T = a b a b c a a a, P = a b c a.
$T_a$ = 1 0 1 0 0 1 1 1
$P_a$ = 1 0 0 1
$W_a(1)$ = < 1 0 1 0, 1 0 0 1 > = 1
$W_a(3)$ = < 1 0 0 1, 1 0 0 1 > = 2
74
Match-count algorithm 2
We have $W = W_a + W_b + W_c$, or in the general case $W = \sum_{\alpha \in \Sigma} W_\alpha$.
75
Match-count algorithm 2
We can calculate $W_\alpha$ using a convolution. Let’s rephrase the problem. X = $T_\alpha$ padded with n zeros on the right. Y = $P_\alpha$ padded with m zeros on the right. We have two vectors X, Y of length m+n.
76
Match-count algorithm 2
$T_a$ = 1 0 1 0 0 1 1 1
$P_a$ = 1 0 0 1
X = 1 0 1 0 0 1 1 1 0 0 0 0
Y = 1 0 0 1 0 0 0 0 0 0 0 0
77
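Match-count algorithm 2
In our modified representation: $W_\alpha(i) = \sum_{j=1}^{n+m} X(j) \cdot Y(j-i+1)$, where the indices are taken modulo n+m. That is, $W_\alpha(i)$ is the inner product of X with Y shifted cyclically to the right by i-1 positions.
$W_a(1)$ = < 1 0 1 0 0 1 1 1 0 0 0 0, 1 0 0 1 0 0 0 0 0 0 0 0 > = 1
$W_a(2)$ = < 1 0 1 0 0 1 1 1 0 0 0 0, 0 1 0 0 1 0 0 0 0 0 0 0 > = 0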
78
Match-count algorithm 2
In our modified representation: $W_\alpha(i) = \sum_{j=1}^{n+m} X(j) \cdot Y(j-i+1)$, where the indices are taken modulo n+m. This is the convolution of X and the reverse of Y. Using the FFT, computing the convolution takes O(m log m) time.
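The following sketch computes $W_\alpha$ for every symbol with a hand-written recursive radix-2 FFT, so that it needs only the standard library. Two things here are assumptions made for the example rather than taken from the slides: the vectors are padded to the next power of two at least n+m (instead of exactly n+m) so that the simple FFT applies, and the correlation is obtained by multiplying the transform of X with the complex conjugate of the transform of Y.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two. No scaling."""
    size = len(values)
    if size == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0] * size
    for k in range(size // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / size) * odd[k]
        out[k] = even[k] + twiddle
        out[k + size // 2] = even[k] - twiddle
    return out


def symbol_counts(P, T, symbol):
    """Return W_symbol for every alignment of P with T (requires len(P) <= len(T))."""
    n, m = len(P), len(T)
    size = 1
    while size < n + m:
        size *= 2
    X = [1 if ch == symbol else 0 for ch in T] + [0] * (size - m)
    Y = [1 if ch == symbol else 0 for ch in P] + [0] * (size - n)
    FX, FY = fft(X), fft(Y)
    # Cross-correlation: corr[r] = sum_j X[j + r] * Y[j]
    product = [a * b.conjugate() for a, b in zip(FX, FY)]
    corr = fft(product, inverse=True)
    return [round(corr[r].real / size) for r in range(m - n + 1)]


def match_counts_fft(P, T):
    """Sum the per-symbol counts over the symbols that occur in P."""
    W = [0] * (len(T) - len(P) + 1)
    for symbol in set(P):
        for r, count in enumerate(symbol_counts(P, T, symbol)):
            W[r] += count
    return W


print(symbol_counts("abca", "ababcaaa", "a"))  # [1, 0, 2, 1, 1]
print(match_counts_fft("abca", "ababcaaa"))    # [2, 0, 4, 1, 1]
```

Entry r of each result (0-based) is the count for the alignment that starts at position r+1 of T. The zero padding guarantees that the cyclic wrap-around never contributes to these entries.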
79
Match-count algorithm 2
The total running time is O(|Σ| m log m). What happens if |Σ| is large? For example, when |Σ| = n we get O(nm log m), which is actually worse than the naive O(nm) algorithm.
80
Match-count algorithm 3
An idea: some symbols might appear more often than others. Use convolutions for the frequent symbols. Use a simpler counting method for the rest.
81
Rare symbols
Say α appears fewer than c times in P. Record the locations of α in P: $l_1, \ldots, l_r$ with r ≤ c. Go over the text; when we see α at location j, we increment $W(j - l_1 + 1), \ldots, W(j - l_r + 1)$.
82
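Rare symbols
T = a b a b c a a a…, P = a b c a c. The symbol c occurs in P at $l_1 = 3$ and $l_2 = 5$. At j = 5, where T(5) = c, we increment W(5-3+1) = W(3) and W(5-5+1) = W(1).
T = a b a b c a a a…
P =     a b c a c              (alignment 3: P(3) meets T(5))
T = a b a b c a a a…
P = a b c a c                  (alignment 1: P(5) meets T(5))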
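The sweep over T can be sketched as below. The function name `rare_symbol_counts` and the explicit `rare` set are assumptions for illustration; the sketch also skips updates that fall outside the valid alignments 1..m-n+1, a boundary check the slide leaves implicit.

```python
def rare_symbol_counts(P, T, rare):
    """Return the contribution of the symbols in `rare` to W(1..m-n+1)."""
    n, m = len(P), len(T)
    locations = {}
    for i, ch in enumerate(P, start=1):
        if ch in rare:
            locations.setdefault(ch, []).append(i)   # l_1, ..., l_r for each symbol
    W = [0] * (m - n + 2)                             # W[1..m-n+1]; W[0] is unused
    for j, ch in enumerate(T, start=1):
        for l in locations.get(ch, ()):
            r = j - l + 1
            if 1 <= r <= m - n + 1:
                W[r] += 1
    return W[1:]


print(rare_symbol_counts("abcac", "ababcaaa", {"c"}))  # [1, 0, 1, 0]
```

On the slide's example only T(5) = c triggers updates, and they land on W(3) and W(1).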
83
How much did we pay?
We can do this for all the rare symbols in one sweep of T; for each position in T we make up to c updates in W. Thus handling the rare symbols costs O(cm). Each frequent symbol appears at least c times in P, so there are at most n/c of them. We pay one convolution per frequent symbol, so at most O((n/c) m log m).
84
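Determining c
We choose the c that gives the best balance between the two costs: $cm = \frac{n}{c}\, m \log m$, so $c = \sqrt{n \log m}$. The total running time is $O(m\sqrt{n \log m})$.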
85
References
Dan Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, 1997.
86
The end