Slide 1: Efficient Algorithms for Substring Near Neighbor Problem
Alexandr Andoni, Piotr Indyk (MIT)
Slide 2: What's SNN?
- SNN ≈ text indexing with mismatches.
- Text indexing: construct a data structure on a text T[1..n] such that, given a query P[1..m], it finds the occurrences of P in T.
- Text indexing with mismatches: given P, find the substrings of T that are equal to P except in at most R characters.
- Motivation: e.g., computational biology (BLAST).
- Example: T = GAGTAACTCAATA, P = AGTA.
Slide 3: Outline
- General approach
  - View: Near Neighbor in Hamming space
  - Focus: reducing space
- Background
  - Locality-Sensitive Hashing (LSH)
- Solution
  - Reducing query & preprocessing
  - Redesign LSH
- Concluding remarks
Slide 4: Approach (or, why SNN?)
- SNN = a near neighbor (NN) problem in the Hamming metric with m dimensions:
  - Construct a data structure on D = {all substrings of T of length m} such that, given P, it finds a point in D at distance ≤ R from P.
- Use an NN data structure for Hamming space.
- Example: T = GAGTAACTCAATA, P = AGTA, D = {GAGT, AGTA, GTAA, ..., AATA}.
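To make the reduction concrete, here is a minimal Python sketch (mine, not from the paper): it materializes D as the length-m substrings of T and answers an SNN query by a brute-force Hamming scan, i.e., the exact baseline that the LSH machinery on the following slides accelerates.

```python
from typing import List

def hamming(p: str, q: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(p, q))

def snn_naive(T: str, P: str, R: int) -> List[int]:
    """Brute-force SNN: D = all length-m substrings of T, viewed as
    points in m-dimensional Hamming space; report the start positions
    of substrings within distance R of P."""
    m = len(P)
    D = [T[i:i + m] for i in range(len(T) - m + 1)]
    return [i for i, s in enumerate(D) if hamming(s, P) <= R]

# The slides' running example: T = GAGTAACTCAATA, P = AGTA, R = 1
# matches AGTA (exactly) and AATA (one mismatch).
print(snn_naive("GAGTAACTCAATA", "AGTA", 1))  # [1, 9]
```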
Slide 5: Approximate NN
- The exact NN problem seems hard (i.e., hard without exponential space or O(n) query time).
- Approximate NN is easier. Defined for approximation c = 1+ε: it is OK to report a point at distance ≤ cR (when there is a point at distance ≤ R).

                 Query             Space
  [KOR98, IM98]  poly(log n, m)    n^{O(1/ε^2)}
  LSH [IM98]     n^{1/c} + m       n^{1+1/c}

- (Figure: a query point q, with the near radius R and the approximation radius cR.)
Slide 6: Our contribution
- Problem: NN needs m in advance, so one would have to construct a data structure for each m ≤ M.
- Here: an approximate SNN data structure for unknown m, without degradation in space or query time.
- Our algorithm for SNN, based on LSH:
  - Supports patterns of length m ≤ M.
  - Optimal* space: n^{1+1/c}.
  - Optimal* query time: n^{1/c}.
  - Slightly worse preprocessing time if c > 3.
- (*Optimal w.r.t. LSH, modulo subpolynomial factors.)
- Also extends to ℓ_1.
Slide 7: Outline
- General approach
  - View: Near Neighbor in Hamming space
  - Focus: reducing space
- Background
  - Locality-Sensitive Hashing (LSH)
- Solution
  - Reducing query & preprocessing
  - Redesign LSH
- Concluding remarks
Slide 8: Locality-Sensitive Hashing
- Based on a family of hash functions {g}. For points P[1..m], Q[1..m]:
  - If dist(P,Q) ≤ R, then Pr_g[g(P) = g(Q)] is "medium".
  - If dist(P,Q) > cR, then Pr_g[g(P) = g(Q)] is "low".
- Idea: construct L hash tables with random g_1, g_2, ..., g_L. For a query P, look at the buckets g_1(P), g_2(P), ..., g_L(P).
- Space: L·n. Query time: L.
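A minimal sketch of the generic scheme above, assuming an abstract hash family; `sample_g` (draws one random g) and `is_close` (verifies a candidate, e.g., dist ≤ cR) are placeholder names of mine, not from the slides.

```python
from collections import defaultdict

def build_lsh_index(points, L, sample_g):
    """Construct L hash tables, each keyed by an independently
    sampled function g_1, ..., g_L from the LSH family."""
    gs = [sample_g() for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for p in points:
        for g, table in zip(gs, tables):
            table[g(p)].append(p)
    return gs, tables

def lsh_query(P, gs, tables, is_close):
    """Look only at the L buckets g_1(P), ..., g_L(P)."""
    for g, table in zip(gs, tables):
        for q in table[g(P)]:
            if is_close(P, q):   # e.g., Hamming distance <= cR
                return q
    return None
```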
Slide 9: LSH for Hamming
- Hash function g: projection on k random coordinates. E.g., g_1("AGTA") = "AA" (k = 2).
- L = #hash tables = n^{1/c}.
- k = |log n / log(1 − cR/m)| < m · log n.
- Example (R = 1): T = GAGTAACTCAATA, D = {GAGT, AGTA, GTAA, ..., AATA}, P = AGTA.
  HT_1: GT → GAGT; AA → AGTA, AATA; GA → GTAA; ...
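The Hamming instantiation as a sketch, with the parameter choices taken from this slide; the function names are mine.

```python
import math
import random

def sample_projection(m: int, k: int):
    """One LSH function for Hamming: projection on k random
    coordinates of a length-m string."""
    coords = sorted(random.sample(range(m), k))
    return lambda s: "".join(s[j] for j in coords)

def lsh_parameters(n: int, m: int, R: int, c: float):
    """L = n^(1/c) tables and k = |log n / log(1 - cR/m)|
    coordinates, as on the slide (requires cR < m)."""
    L = math.ceil(n ** (1.0 / c))
    k = math.ceil(abs(math.log(n) / math.log(1.0 - c * R / m)))
    return L, k

# Slide's example with k = 2: projecting on coordinates {0, 3} maps
# both AGTA and AATA to bucket "AA" (they differ only at index 1).
project_03 = lambda s: s[0] + s[3]
assert project_03("AGTA") == project_03("AATA") == "AA"
```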
Slide 10: Outline
- General approach
  - View: Near Neighbor in Hamming space
  - Focus: reducing space
- Background
  - Locality-Sensitive Hashing (LSH)
- Solution
  - Reducing query & preprocessing
  - Redesign LSH
- Concluding remarks
Slide 11: Unknown m
- Bad news: k depends on m! Distinct m ⇒ distinct hash tables.
- Example (R = 1): T = GAGTAACTCAATA, D = {GAG, AGT, ..., ACT, ...}, P = AGT, g_1("AGT") = "AT".
  HT_1: GG → GAG; AT → AGT, ACT, ...; ...
Slide 12: Solution
- Let's just reuse the same data structure for all m.
- g("AGTA") = "AA". On "AGT" we have to guess the last character: g("AGT?") = "A?".
- Like in [exact] text indexing...
- Example (R = 1): T = GAGTAACTCAATA, D = {GAGT, AGTA, ..., ACTC, ...}, P = AGT.
  HT_1: GT → GAGT; AA → AGTA, AATA; GA → GTAA; AC → ACTC; ...
Slide 13: Tries*!
- Replace HT_1 with a trie on g_1(suffixes).
- Stop the search when outside P. Same analysis!
- Example (R = 1): T = GAGTAACTCAATA, D = {GAGT, AGTA, ..., ACTC, ...}, P = AGT.
  (Figure: a trie over the projected strings, with leaves such as AGTA, AATA, ACTC, AACT; searching with P = AGT descends as far as P determines and returns the subtree.)
- *Tries have been used with LSH before in [MS02], but in a different context.
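A sketch of the trie replacement, under an assumed (my own) data layout: each trie stores g_i(suffix) for every suffix of T, every node keeps the start positions passing through it, and a query stops at the first projected coordinate that falls beyond |P|, exactly the "guess the remaining characters" idea of slide 12.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # projected character -> child node
        self.positions = []  # substring start positions below this node

def build_projected_trie(T: str, coords):
    """One trie, replacing one hash table HT_i: insert the projection
    of every suffix of T onto the sorted coordinate set `coords`."""
    root = TrieNode()
    for start in range(len(T)):
        node = root
        node.positions.append(start)
        for j in coords:
            if start + j >= len(T):
                break                    # suffix too short for coordinate j
            node = node.children.setdefault(T[start + j], TrieNode())
            node.positions.append(start)
    return root

def trie_candidates(root: TrieNode, P: str, coords):
    """Walk down using P's projected characters; stop as soon as the
    next coordinate lies outside P and return the whole subtree."""
    node = root
    for j in coords:
        if j >= len(P):                  # outside the pattern: stop here
            break
        node = node.children.get(P[j])
        if node is None:
            return []
    return node.positions
```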
Slide 14: Resulting performance
- Space: n^{1+1/c} (using compressed tries, one trie takes n space). Optimal!
- Query time: n^{1/c} · m (m = length of P).
  - Not [yet] really optimal: originally, one could do dimension reduction.
  - Can improve to n^{1/c} + m·n^{o(1)}.
- Preprocessing time: n^{1+1/c} · M (M = max m).
  - Not optimal (optimal = n^{1+1/c}).
  - Can improve to n^{1+1/c} + M^{1/3}·n^{1+o(1)}. Optimal for c < 3.
Slide 15: Outline
- General approach
  - View: Near Neighbor in Hamming space
  - Focus: reducing space
- Background
  - Locality-Sensitive Hashing (LSH)
- Solution
  - Reducing query & preprocessing
  - Redesign LSH
- Concluding remarks
Slide 16: Better query & preprocessing
- Redesign LSH to improve query and preprocessing:
  - Query: n^{1/c} · m → n^{1/c} + m·n^{o(1)}.
  - Preprocessing: n^{1+1/c} · M → n^{1+1/c} + n^{1+o(1)} · M.
- Idea for the new LSH:
  - Use the same number of hash tables/tries (L = n^{1/c}).
  - But use "less randomness" in choosing the hash functions g_1, g_2, ..., g_L, such that each g_i looks random, but the g's are not independent.
Slide 17: New LSH scheme
- Old scheme: choose L hash functions g_i; each g_i = projection on k random coordinates.
- New scheme: construct the L functions g_i from a smaller number of "base" hash functions.
  - A "base" hash function = projection on k/2 random coordinates.
  - {g_i, i = 1..L} = all pairs of "base" hash functions.
  - Need only ~L^{1/2} "base" hash functions!
Slide 18: Example
- k = 4, w = #base functions = 4, L = (w choose 2) = (4 choose 2) = 6.
- Base functions u_1, u_2, u_3, u_4: each a projection on k/2 = 2 random coordinates.
- g_1 = <u_1, u_2>, g_2 = <u_1, u_3>, g_3 = <u_1, u_4>, ... (all pairs).
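A sketch of the pairing construction from slides 17-18, assuming (as the example suggests) that the g's enumerate all pairs of base functions; the choice of w is just the smallest value with C(w,2) ≥ L.

```python
import math
import random
from itertools import combinations

def make_paired_lsh(m: int, k: int, L: int):
    """Draw only w ~ sqrt(2L) base functions u_1..u_w, each a projection
    on k/2 random coordinates, and form the L functions g as pairs."""
    w = math.ceil((1 + math.sqrt(1 + 8 * L)) / 2)   # smallest w with C(w, 2) >= L
    base = [sorted(random.sample(range(m), k // 2)) for _ in range(w)]
    pairs = list(combinations(range(w), 2))[:L]

    def g(i: int, s: str):
        a, b = pairs[i]
        return ("".join(s[j] for j in base[a]),
                "".join(s[j] for j in base[b]))
    return g

# The slide's numbers: k = 4 and w = 4 base functions yield
# L = C(4, 2) = 6 functions g_1, ..., g_6.
g = make_paired_lsh(m=4, k=4, L=6)
print(g(0, "AGTA"))   # bucket key of g_1: a pair of 2-character projections
```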
Slide 19: Saving time
- Can save time since there are fewer "base" hash functions. E.g., computing fingerprints:
  - Want to compute FP(g_i(P)) for i = 1..L, where FP(g_i(P)) = (Σ_j P[j] · χ_j^i · 2^j) mod prime (χ_j^i indicates whether coordinate j is sampled by g_i).
- Old way: would take L · m time for the L functions g.
- New way: takes L^{1/2} · m time for the L^{1/2} functions u_i, then only L time to combine the FP(u(P)) into the FP(g(P)):
  - If g = <u_1, u_2>, then FP(g(P)) = (FP(u_1(P)) + FP(u_2(P))) mod prime.
- Total: L + L^{1/2} · m.
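A sketch of the fingerprint trick; the prime and the character encoding are my choices, and the O(1) combination step relies on each coordinate j keeping its fixed weight 2^j, so the two base sums simply add (assuming the two base coordinate sets are disjoint).

```python
PRIME = (1 << 61) - 1   # any large prime works for this sketch

def fingerprint(P: str, coords) -> int:
    """FP of one base projection u: sum of P[j] * 2^j over the sampled
    coordinates j, mod a prime -- the chi mask from the slide.
    Costs O(m) per base function."""
    return sum(ord(P[j]) * pow(2, j, PRIME) for j in coords) % PRIME

def all_fingerprints(P: str, base_coords, pairs):
    """FP(g_i(P)) for all L pairs g_i = <u_a, u_b>: L^(1/2) * m work
    for the base fingerprints, then O(1) per pair -- total L + L^(1/2) * m."""
    fp_u = [fingerprint(P, coords) for coords in base_coords]
    return [(fp_u[a] + fp_u[b]) % PRIME for a, b in pairs]
```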
Slide 20: Better query & preprocessing (2)
- E.g., for the query: use fingerprints to leap faster in the trie.
- Yields time n^{1/c} + n^{1/(2c)} · m (since L = n^{1/c}).
- To get n^{1/c} + n^{o(1)} · m, generalize (see the sketch below):
  - g = a tuple of t base functions; a base function = projection on k/t random coordinates.
- Other details are similar to the fingerprints.
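The same construction generalized to t-tuples, as this slide proposes; a speculative sketch, since the slide leaves the details to the fingerprint analogy. With base functions on k/t coordinates each, roughly w ~ t · L^(1/t) of them suffice.

```python
import math
import random
from itertools import combinations

def make_tuple_lsh(m: int, k: int, L: int, t: int):
    """Each g is a t-tuple of base functions, each base function a
    projection on k/t random coordinates; pick the smallest w with
    C(w, t) >= L so that the t-subsets give enough functions."""
    w = t
    while math.comb(w, t) < L:
        w += 1
    base = [sorted(random.sample(range(m), k // t)) for _ in range(w)]
    tuples = list(combinations(range(w), t))[:L]
    return base, tuples
```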
Slide 21: Better preprocessing (3)
- For preprocessing, the redesigned LSH gives n^{1+1/c} + n^{1+o(1)} · M; we can get n^{1+1/c} + n^{1+o(1)} · M^{1/3}.
- Can construct a trie in n · M^{1/3} time (instead of n · M), using FFT, etc.
Slide 22: Outline
- General approach
  - View: Near Neighbor problem in the Hamming metric
  - Focus: reducing space
- Background
  - Locality-Sensitive Hashing (LSH)
- Solution = LSH + tries
  - Reducing query & preprocessing
  - Redesign LSH
- Concluding remarks
Slide 23: Conclusions
- Problem: Substring Near Neighbor (a.k.a. text indexing with mismatches).
- Approach: view it as NN in m-dimensional Hamming space; use LSH.
- Challenge: variable-length patterns without degradation in performance.
- Solution: space/query optimal (w.r.t. LSH); preprocessing optimal (w.r.t. LSH) for c < 3.
Slide 24: Extensions
- Extends to ℓ_1.
  - Nontrivial, since it needs quite different LSH functions.
  - Preprocessing slightly worse: n^{1+1/c} + n^{1+o(1)} · M^{2/3}, using the "less-than matching" problem [Amir-Farach'95].
Slide 25: Remarks
- Other approaches? Or, why LSH for SNN?
- Because a better SNN implies a better NN...
- ...and LSH is the "best" known algorithm for high-dimensional NN (using reasonable space).
Slide 26: Thanks!