Algorithms in Action: Fast Fourier Transform
Haim Kaplan, Uri Zwick
Tel Aviv University
March 2016. Last updated: March 28, 2017
String Matching
Text: abraabracadabracadabraabara
Pattern: abracadabra (occurs at offsets 4 and 11, counting from 0)
Given a text of length $n$ and a pattern of length $m$, find all occurrences of the pattern in the text.
The naïve algorithm runs in $O(mn)$ time.
Several classical algorithms run in $O(m+n)$ time. [Knuth-Morris-Pratt (1977)] [Boyer-Moore (1977)]
More String Matching Problems
Text: abraabracadabracadabraabara
Pattern: abracadabra
Count the number of matches/mismatches in each alignment of the pattern with the text. (Find all alignments with at most $k$ mismatches.)
Allow a wildcard ("don't care") symbol (∗) that matches any single symbol, in the pattern and/or the text.
"Traditional" string matching techniques are not so efficient for these extensions.
(Cross-)Correlation
$\mathbf{x} = (x_0, x_1, x_2, x_3)$, $\mathbf{y} = (y_0, y_1, y_2, y_3)$
$z_{-3} = x_0 y_3$
$z_{-2} = x_0 y_2 + x_1 y_3$
$z_{-1} = x_0 y_1 + x_1 y_2 + x_2 y_3$
$z_0 = x_0 y_0 + x_1 y_1 + x_2 y_2 + x_3 y_3$
$z_1 = x_1 y_0 + x_2 y_1 + x_3 y_2$
$z_2 = x_2 y_0 + x_3 y_1$
$z_3 = x_3 y_0$
(Cross-)Correlation
$z_k = \sum_i x_i y_{i-k} = \sum_j x_{j+k} y_j = (\mathbf{x} * \mathbf{y}^R)_{k+n-1}$, for $k = -(n-1), \ldots, n-1$.
A convolution without the initial reversal, with a shift of indices.
The correlation of two vectors of length $n$ can be computed in $O(n \log n)$ time.
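The reduction to convolution can be sketched in code. The following is a minimal pure-Python illustration, not a tuned implementation: the names `fft`, `convolve`, `correlate` and `correlate_direct` are hypothetical, inputs are assumed to be integers (so results can be rounded), and the reversed vector is assumed to be $y^R_j = y_{m-1-j}$, where $m$ is the length of $\mathbf{y}$.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalised); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def convolve(x, y):
    """Linear convolution of two integer sequences via the FFT."""
    length = len(x) + len(y) - 1
    size = 1
    while size < length:
        size *= 2
    fx = fft([complex(v) for v in x] + [0j] * (size - len(x)))
    fy = fft([complex(v) for v in y] + [0j] * (size - len(y)))
    prod = fft([a * b for a, b in zip(fx, fy)], invert=True)
    # The inverse transform is unnormalised, so divide by size;
    # rounding is safe only because the inputs are integers.
    return [round(v.real / size) for v in prod[:length]]

def correlate(x, y):
    """Return {k: z_k} with z_k = sum_j x[j+k] * y[j], k = -(m-1), ..., n-1."""
    m = len(y)
    conv = convolve(x, y[::-1])  # x * y^R
    return {k: conv[k + m - 1] for k in range(-(m - 1), len(x))}

def correlate_direct(x, y):
    """The same values computed straight from the definition, for checking."""
    n, m = len(x), len(y)
    return {k: sum(x[j + k] * y[j] for j in range(m) if 0 <= j + k < n)
            for k in range(-(m - 1), n)}
```

For example, `correlate([1, 2, 3, 4], [5, 6, 7, 8])[0]` is the inner product $1\cdot5 + 2\cdot6 + 3\cdot7 + 4\cdot8 = 70$, and the entry for $k = -3$ is $x_0 y_3 = 8$.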
(Cross-)Correlation (unequal lengths)
$\mathbf{x} = (x_0, x_1, x_2, x_3, x_4, x_5)$, $\mathbf{y} = (y_0, y_1, y_2, y_3)$
$z_{-3} = x_0 y_3$
$z_{-2} = x_0 y_2 + x_1 y_3$
$z_{-1} = x_0 y_1 + x_1 y_2 + x_2 y_3$
$z_0 = x_0 y_0 + x_1 y_1 + x_2 y_2 + x_3 y_3$
$z_1 = x_1 y_0 + x_2 y_1 + x_3 y_2 + x_4 y_3$
$z_2 = x_2 y_0 + x_3 y_1 + x_4 y_2 + x_5 y_3$
$z_3 = x_3 y_0 + x_4 y_1 + x_5 y_2$
$z_4 = x_4 y_0 + x_5 y_1$
$z_5 = x_5 y_0$
(Cross-)Correlation
$z_k = \sum_i x_i y_{i-k} = \sum_j x_{j+k} y_j = (\mathbf{x} * \mathbf{y}^R)_{k+m-1}$
If $\mathbf{x}$ is of length $n$ and $\mathbf{y}$ of length $m$, where $m \le n$, then $k = -(m-1), \ldots, n-1$.
Sometimes only the values $k = 0, \ldots, n-m$, corresponding to a full overlap of $\mathbf{x}$ with a shift of $\mathbf{y}$, are of interest.
Exercise: The correlation of two vectors of lengths $n$ and $m$, where $m \le n$, can be computed in $O(n \log m)$ time.
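The index shift $k+m-1$ can be checked directly from the definition of convolution. The short derivation below assumes the reversal is defined by $y^R_j = y_{m-1-j}$ and that out-of-range entries are zero; this notation is an assumption, since the slides do not spell it out.

```latex
\begin{align*}
(\mathbf{x} * \mathbf{y}^R)_{k+m-1}
  &= \sum_i x_i \, y^R_{(k+m-1)-i} \\
  &= \sum_i x_i \, y_{(m-1)-(k+m-1-i)} \\
  &= \sum_i x_i \, y_{i-k} \;=\; z_k .
\end{align*}
```

Since the convolution has entries $0, \ldots, n+m-2$, the index $k+m-1$ covers exactly $k = -(m-1), \ldots, n-1$.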
Counting mismatches [Fischer-Paterson (1974)]
Let $\Sigma$ be the alphabet of the pattern and text. We may assume that $|\Sigma| \le m+1$. (Why?)
For every $a \in \Sigma$ create two Boolean strings:
$P_a[j] = 1$ iff $P[j] = a$
$T_a[i] = 1$ iff $T[i] \ne a$
The correlation of $P_a$ and $T_a$ counts the mismatches involving $a$.
Counting mismatches
Example for $a = \texttt{a}$:
Text:    abraabracadabracadabraabara
$T_a$:   011001101010110101011001010
Pattern: abracadabra
$P_a$:   10010101001
Counting mismatches
Summing over all $a \in \Sigma$ we get the total number of mismatches.
Complexity: $O(|\Sigma|\, n \log m)$ word operations. (Each word is assumed to hold $\Theta(\log n)$ bits.)
Fast only if $|\Sigma|$ is small.
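The per-character reduction can be sketched as follows. This is an illustration under stated assumptions: `correlate_full_overlap` and `count_mismatches` are hypothetical names, and the correlation is computed directly in $O(nm)$ time as a stand-in for an FFT-based correlation, which is what would give the stated bound.

```python
def correlate_full_overlap(x, y):
    """z_k = sum_j x[k+j] * y[j] for k = 0, ..., n-m.

    Direct O(nm) stand-in; an FFT-based correlation would replace this.
    """
    n, m = len(x), len(y)
    return [sum(x[k + j] * y[j] for j in range(m)) for k in range(n - m + 1)]

def count_mismatches(text, pattern):
    """Number of mismatches for every full alignment k = 0, ..., n-m."""
    n, m = len(text), len(pattern)
    total = [0] * (n - m + 1)
    # Characters that do not occur in the pattern have an all-zero P_a,
    # so it is enough to loop over the pattern's characters.
    for a in set(pattern):
        p_a = [1 if c == a else 0 for c in pattern]   # P_a[j] = 1 iff P[j] = a
        t_a = [1 if c != a else 0 for c in text]      # T_a[i] = 1 iff T[i] != a
        for k, v in enumerate(correlate_full_overlap(t_a, p_a)):
            total[k] += v
    return total
```

On the running example, `count_mismatches("abraabracadabracadabraabara", "abracadabra")` has zeros exactly at the alignments where the pattern occurs.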
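Counting mismatches with wildcards [Fischer-Paterson (1974)]
For every $a \in \Sigma$ create two Boolean strings:
$P_a[j] = 1$ iff $P[j] = a$
$T_a[i] = 1$ iff $T[i] \ne a$ and $T[i] \ne \ast$
A wildcard in the pattern has $P_a[j] = 0$ for every $a$, so it never contributes a mismatch.
Complexity: $O(|\Sigma|\, n \log m)$ word operations.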
Counting mismatches with wildcards
Examples for $a = \texttt{a}$:
Text:    abraabraca*abracadabraabara
$T_a$:   011001101000110101011001010
Pattern: abracada*ra
$P_a$:   10010101001

Text:    abraabra*adabracadabraabara
$T_a$:   011001100010110101011001010
Pattern: abracada*ra
$P_a$:   10010101001
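The wildcard variant changes only how the Boolean strings are built. A minimal sketch, assuming `*` is the wildcard symbol; the function names are hypothetical and the direct $O(nm)$ correlation again stands in for an FFT-based one.

```python
WILDCARD = "*"

def correlate_full_overlap(x, y):
    """z_k = sum_j x[k+j] * y[j] for k = 0, ..., n-m (direct stand-in)."""
    n, m = len(x), len(y)
    return [sum(x[k + j] * y[j] for j in range(m)) for k in range(n - m + 1)]

def count_mismatches_wild(text, pattern):
    """Mismatches per alignment, where a wildcard on either side matches anything."""
    n, m = len(text), len(pattern)
    total = [0] * (n - m + 1)
    for a in set(pattern) - {WILDCARD}:
        # P_a[j] = 1 iff P[j] = a (so pattern wildcards are 0 for every a).
        p_a = [1 if c == a else 0 for c in pattern]
        # T_a[i] = 1 iff T[i] != a and T[i] is not a wildcard.
        t_a = [1 if c != a and c != WILDCARD else 0 for c in text]
        for k, v in enumerate(correlate_full_overlap(t_a, p_a)):
            total[k] += v
    return total
```

With the first example above, alignment 0 has 4 mismatches, while alignments 4 and 11 have none.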
Counting mismatches with wildcards
If we only want to find exact matches, replace each character $a \in \Sigma$ by a specific $\lceil \log_2 |\Sigma| \rceil$-bit string, and each wildcard by a block of wildcards of the same length:
b r ∗ c
001 010 011 ∗∗∗ 100
Count mismatches of the binary strings as before (2 convolutions).
A result of 0 corresponds to a match.
Complexity drops to $O(\log|\Sigma| \cdot n \log m)$.
Can we get rid of the dependence on $|\Sigma|$?
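The binary encoding can be sketched as below. The details are assumptions made for illustration: codes are assigned in sorted character order starting from 0 (the slide's example uses different codes, and any distinct codes work), `*` is the wildcard, only alignments at multiples of the code length are read off, and the direct correlation is a stand-in for an FFT-based one.

```python
def correlate_full_overlap(x, y):
    """z_k = sum_j x[k+j] * y[j] for k = 0, ..., n-m (direct stand-in)."""
    n, m = len(x), len(y)
    return [sum(x[k + j] * y[j] for j in range(m)) for k in range(n - m + 1)]

def exact_matches_binary(text, pattern, wildcard="*"):
    """Exact matches with wildcards via fixed-width binary codes."""
    alphabet = sorted((set(text) | set(pattern)) - {wildcard})
    bits = max(1, (len(alphabet) - 1).bit_length())   # ceil(log2 |alphabet|), at least 1
    code = {c: format(i, "0{}b".format(bits)) for i, c in enumerate(alphabet)}
    code[wildcard] = wildcard * bits                  # a wildcard becomes a block of wildcards
    t = "".join(code[c] for c in text)
    p = "".join(code[c] for c in pattern)
    mismatches = [0] * (len(t) - len(p) + 1)
    for a in "01":                                    # two correlations, one per bit value
        p_a = [1 if c == a else 0 for c in p]
        t_a = [1 if c != a and c != wildcard else 0 for c in t]
        for k, v in enumerate(correlate_full_overlap(t_a, p_a)):
            mismatches[k] += v
    # Only alignments at character boundaries correspond to real alignments.
    return [k for k in range(len(text) - len(pattern) + 1)
            if mismatches[k * bits] == 0]
```

A character-level alignment matches exactly when its bit-level mismatch count is 0, because distinct characters differ in at least one bit.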
$L_2$-matching [Lipsky-Porat (2011)]
Standard string matching uses the Hamming distance. Two characters either match or they do not: $a$ is not closer to $b$ than to $z$.
Suppose that each "character" is a real number. We want to find approximate matches.
For each $k = 0, 1, \ldots, n-m$ we want to compute
$d_k = \sum_{j=0}^{m-1} (p_j - t_{k+j})^2$
$L_2$-distance: $\|\mathbf{x} - \mathbf{y}\|^2 = \sum_{j=0}^{m-1} (x_j - y_j)^2$
$L_2$-matching [Lipsky-Porat (2011)]
$\sum_{j=0}^{m-1} (p_j - t_{k+j})^2 = \sum_{j=0}^{m-1} p_j^2 - 2 \sum_{j=0}^{m-1} p_j t_{k+j} + \sum_{j=0}^{m-1} t_{k+j}^2$
First term: constant. $O(m)$ time.
Second term: correlation. $O(n \log m)$ time.
Third term: easy in $O(n)$ time.
$L_2$-matching can be computed in $O(n \log m)$ time.
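The three terms can be combined as in the following sketch. The names are hypothetical; prefix sums are one possible way to get the third term in $O(n)$ total time, and the direct correlation stands in for an FFT-based one.

```python
from itertools import accumulate

def correlate_full_overlap(x, y):
    """z_k = sum_j x[k+j] * y[j] for k = 0, ..., n-m (direct stand-in)."""
    n, m = len(x), len(y)
    return [sum(x[k + j] * y[j] for j in range(m)) for k in range(n - m + 1)]

def l2_distances(text, pattern):
    """d_k = sum_j (p_j - t_{k+j})^2 for k = 0, ..., n-m."""
    n, m = len(text), len(pattern)
    p_sq = sum(p * p for p in pattern)                     # constant term, O(m)
    cross = correlate_full_overlap(text, pattern)          # sum_j p_j * t_{k+j}
    prefix = [0] + list(accumulate(t * t for t in text))   # prefix sums of t_i^2, O(n)
    return [p_sq - 2 * cross[k] + (prefix[k + m] - prefix[k])
            for k in range(n - m + 1)]
```

For instance, `l2_distances([1, 2, 3, 4, 5], [2, 3])` gives `[2, 0, 2, 8]`: the pattern fits exactly at offset 1 and is off by one in each coordinate at offsets 0 and 2.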
Exact matches with wildcards [Clifford-Clifford (2007)]
Replace each character by a positive integer. Replace the wildcard by 0.
For each $k = 0, 1, \ldots, n-m$ compute
$d_k = \sum_{j=0}^{m-1} p_j \, t_{k+j} \, (p_j - t_{k+j})^2$
There is an exact match at position $k$ iff $d_k = 0$.
Exact matches with wildcards [Clifford-Clifford (2007)]
$d_k = \sum_{j=0}^{m-1} p_j \, t_{k+j} \, (p_j - t_{k+j})^2 = \sum_{j=0}^{m-1} p_j^3 \, t_{k+j} - 2 \sum_{j=0}^{m-1} p_j^2 \, t_{k+j}^2 + \sum_{j=0}^{m-1} p_j \, t_{k+j}^3$
Compute three correlations of appropriate sequences in $O(n \log m)$ time.
Running time is independent of $|\Sigma|$!
This assumes that each character fits in a $\Theta(\log n)$-bit word and that operations on such words take constant time.
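The three correlations can be sketched as follows. The character-to-integer mapping (first-seen order, starting at 1) and the function names are assumptions made for illustration, `*` is taken as the wildcard, and the direct correlation is a stand-in for an FFT-based one.

```python
def correlate_full_overlap(x, y):
    """z_k = sum_j x[k+j] * y[j] for k = 0, ..., n-m (direct stand-in)."""
    n, m = len(x), len(y)
    return [sum(x[k + j] * y[j] for j in range(m)) for k in range(n - m + 1)]

def wildcard_matches(text, pattern, wildcard="*"):
    """Positions k with d_k = sum_j p_j t_{k+j} (p_j - t_{k+j})^2 = 0."""
    codes = {}

    def encode(s):
        # Wildcard -> 0; every other character -> a distinct positive integer.
        return [0 if c == wildcard else codes.setdefault(c, len(codes) + 1)
                for c in s]

    p, t = encode(pattern), encode(text)
    term1 = correlate_full_overlap(t, [v ** 3 for v in p])                    # sum p^3 t
    term2 = correlate_full_overlap([v ** 2 for v in t], [v ** 2 for v in p])  # sum p^2 t^2
    term3 = correlate_full_overlap([v ** 3 for v in t], p)                    # sum p t^3
    return [k for k in range(len(t) - len(p) + 1)
            if term1[k] - 2 * term2[k] + term3[k] == 0]
```

Every summand $p_j t_{k+j} (p_j - t_{k+j})^2$ is non-negative and vanishes exactly when one side is a wildcard or the two characters agree, which is why $d_k = 0$ characterizes a match.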