1 Efficient Algorithms for Substring Near Neighbor Problem Alexandr Andoni Piotr Indyk MIT

2 What's SNN?
SNN ≈ text indexing with mismatches.
Text indexing: construct a data structure on a text T[1..n] s.t., given a query P[1..m], it finds the occurrences of P in T.
Text indexing with mismatches: given P, find the substrings of T that are equal to P except in at most R characters.
Motivation: e.g., computational biology (BLAST).
Example: T = GAGTAACTCAATA, P = AGTA.

3 Outline
General approach
  View: Near Neighbor in Hamming
  Focus: reducing space
Background
  Locality-Sensitive Hashing (LSH)
Solution
Reducing query & preprocessing
  Redesign LSH
Concluding remarks

4 Approach (or, why SNN?)
SNN = a near neighbor problem in the Hamming metric with m dimensions:
Construct a data structure on D = {all substrings of T of length m} s.t., given P, it finds a point in D at distance ≤ R from P.
⇒ Use an NN data structure for Hamming space.
Example: T = GAGTAACTCAATA, D = {GAGT, AGTA, GTAA, …, AATA}, P = AGTA.
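To make the reduction concrete, here is a minimal brute-force sketch (helper names are mine, not from the paper): it builds the point set D from T and answers a query by a linear Hamming-distance scan; the data structures in the rest of the deck replace exactly this scan.

```python
def hamming(p, q):
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(p, q))

def build_point_set(text, m):
    """D = all length-m substrings of the text, viewed as points in {A,C,G,T}^m."""
    return [text[i:i + m] for i in range(len(text) - m + 1)]

def snn_brute_force(text, pattern, r):
    """Report positions i where T[i..i+m-1] differs from the pattern in <= r chars."""
    m = len(pattern)
    return [i for i, s in enumerate(build_point_set(text, m))
            if hamming(s, pattern) <= r]

# Example from the slides: T = GAGTAACTCAATA, P = AGTA, R = 1
print(snn_brute_force("GAGTAACTCAATA", "AGTA", 1))   # -> [1, 9]  (AGTA, AATA)
```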

5 Approximate NN
The exact NN problem seems hard (i.e., hard without exponential space or O(n) query time).
Approximate NN is easier. Defined for approximation c = 1+ε: it is OK to report a point at distance ≤ cR (when there is a point at distance ≤ R).

                   Query            Space
  [KOR98, IM98]    poly(log n, m)   n^{O(1/ε²)}
  LSH [IM98]       n^{1/c} + m      n^{1+1/c}

(Figure: a query point q with balls of radii R and cR around it.)

6 Our contribution
Problem: NN needs m in advance; one would have to construct a data structure for each m ≤ M.
Here: an approximate SNN data structure for unknown m, without degradation in space or query time.
Our algorithm for SNN is based on LSH:
  Supports patterns of length m ≤ M
  Optimal* space: n^{1+1/c}
  Optimal* query time: n^{1/c}
  Slightly worse preprocessing time if c > 3
(* Optimal w.r.t. LSH, modulo subpolynomial factors.)
Also extends to ℓ₁.

7 Outline
General approach
  View: Near Neighbor in Hamming
  Focus: reducing space
Background
  Locality-Sensitive Hashing (LSH)
Solution
Reducing query & preprocessing
  Redesign LSH
Concluding remarks

8 Locality-Sensitive Hashing
Based on a family of hash functions {g}. For points P[1..m], Q[1..m]:
  If dist(P,Q) ≤ R, then Pr_g[g(P)=g(Q)] = "medium"
  If dist(P,Q) > cR, then Pr_g[g(P)=g(Q)] = "low"
Idea: construct L hash tables with random g₁, g₂, …, g_L.
For a query P, look at buckets g₁(P), g₂(P), …, g_L(P).
Space: L·n. Query time: L.

9 LSH for Hamming
Hash function g: projection on k random coordinates. E.g., g₁("AGTA") = "AA" (k=2).
L = #hash tables = n^{1/c}
k = |log n / log(1 − cR/m)| < m · log n
Example: T = GAGTAACTCAATA, D = {GAGT, AGTA, GTAA, …, AATA}, P = AGTA, R = 1.
HT₁: GT → GAGT; AA → AGTA, AATA; GA → GTAA; …
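A minimal sketch of this scheme (function names are assumptions, not from the paper): each table keys every point by its own projection onto k random coordinates, and a query probes one bucket per table and verifies candidates by actual Hamming distance.

```python
import random
from collections import defaultdict

def build_lsh_tables(points, m, k, L, seed=0):
    """L hash tables; table i keys each point by its projection onto
    k random coordinates (the LSH family for the Hamming metric)."""
    rng = random.Random(seed)
    projections = [sorted(rng.sample(range(m), k)) for _ in range(L)]
    tables = []
    for coords in projections:
        table = defaultdict(list)
        for p in points:
            table["".join(p[j] for j in coords)].append(p)
        tables.append(table)
    return projections, tables

def lsh_query(pattern, projections, tables, r):
    """Probe one bucket per table; verify candidates by Hamming distance."""
    for coords, table in zip(projections, tables):
        key = "".join(pattern[j] for j in coords)
        for cand in table[key]:
            if sum(a != b for a, b in zip(cand, pattern)) <= r:
                return cand
    return None

D = ["GAGT", "AGTA", "GTAA", "TAAC", "AACT",
     "ACTC", "CTCA", "TCAA", "CAAT", "AATA"]
proj, tabs = build_lsh_tables(D, m=4, k=2, L=4)
print(lsh_query("AGTA", proj, tabs, r=1))
```

As on the slide, a near point collides with the query in some table with good probability, while far points rarely do; the verification step discards the false positives.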

10 Outline
General approach
  View: Near Neighbor in Hamming
  Focus: reducing space
Background
  Locality-Sensitive Hashing (LSH)
Solution
Reducing query & preprocessing
  Redesign LSH
Concluding remarks

11 Unknown m
Bad news: k depends on m! Distinct m ⇒ distinct hash tables.
Example for m = 3: T = GAGTAACTCAATA, D = {GAG, AGT, …, ACT, …}, P = AGT, R = 1.
g₁("AGT") = "AT". HT₁: GG → GAG; AT → AGT, ACT, …; …

12 Solution
Let's just reuse the same data structure for all m.
For m = 4: g("AGTA") = "AA". On "AGT" we have to guess the last character: g("AGT?") = "A?".
Like in [exact] text indexing…
Example: T = GAGTAACTCAATA, D = {GAGT, AGTA, …, ACTC, …}, P = AGT, R = 1.
HT₁: GT → GAGT; AA → AGTA, AATA; GA → GTAA; AC → ACTC; …

13 Tries*!
Replace HT₁ with a trie on g₁(suffixes). Stop the search when we run out of P.
Same analysis!
Example: T = GAGTAACTCAATA, P = AGT, R = 1. The query g₁("AGT?") = "A?" descends the trie as far as P determines the projected characters, and every projected suffix below that node (AGTA, AATA, ACTC, …) is a candidate.
* Tries have been used with LSH before in [MS02], but in a different context.
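A sketch of one such trie (a plain dict-of-dicts trie for brevity; helper names are mine): the table's projection is fixed over the maximum length M, every suffix of T is inserted along its projected characters, and a query of any length m ≤ M descends only through coordinates that fall inside the pattern, taking the whole subtree below as candidates.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.points = []   # start positions of suffixes passing through here

def build_trie(text, coords):
    """Trie over g(suffix) for every suffix, where g projects onto the
    sorted coordinate set `coords` (chosen once out of 0..M-1)."""
    root = TrieNode()
    for i in range(len(text)):
        node = root
        node.points.append(i)
        for j in coords:
            if i + j >= len(text):
                break                         # suffix too short: stop early
            node = node.children.setdefault(text[i + j], TrieNode())
            node.points.append(i)
    return root

def trie_query(root, pattern, coords):
    """Descend using only coordinates inside the pattern; everything
    below the reached node is a candidate (the last chars are 'guessed')."""
    node = root
    for j in coords:
        if j >= len(pattern):
            break                             # ran out of P: keep the subtree
        if pattern[j] not in node.children:
            return []
        node = node.children[pattern[j]]
    return node.points

T = "GAGTAACTCAATA"
root = build_trie(T, coords=[0, 2])           # one table; g = coordinates {0, 2}
print(trie_query(root, "AGT", coords=[0, 2])) # -> [1, 5, 9]: AGTA, ACTC, AATA
```

Candidates still get verified against P by Hamming distance, exactly as in the hash-table version; the trie only replaces the bucket lookup.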

14 Resulting performance
Space: n^{1+1/c} (using compressed tries, one trie takes n space). Optimal!
Query time: n^{1/c} · m (m = length of P). Not [yet] really optimal: originally one could do dimensionality reduction. Can improve to n^{1/c} + m·n^{o(1)}.
Preprocessing time: n^{1+1/c} · M (M = max m). Not optimal (optimal = n^{1+1/c}). Can improve to n^{1+1/c} + M^{1/3} · n^{1+o(1)}. Optimal for c < 3.

15 Outline
General approach
  View: Near Neighbor in Hamming
  Focus: reducing space
Background
  Locality-Sensitive Hashing (LSH)
Solution
Reducing query & preprocessing
  Redesign LSH
Concluding remarks

16 Better query & preprocessing
Redesign LSH to improve query and preprocessing:
  Query: n^{1/c} · m → n^{1/c} + m·n^{o(1)}
  Preprocessing: n^{1+1/c} · M → n^{1+1/c} + n^{1+o(1)} · M
Idea for the new LSH: use the same number of hash tables/tries (L = n^{1/c}), but use "less randomness" in choosing the hash functions g₁, g₂, …, g_L, s.t. each g_i looks random, but the g's are not independent.

17 New LSH scheme
Old scheme: choose L hash functions g_i; each g_i = projection on k random coordinates.
New scheme: construct the L functions g_i from a smaller number of "base" hash functions.
A "base" hash function = projection on k/2 random coordinates.
{g_i, i = 1..L} = all pairs of "base" hash functions.
Need only ~L^{1/2} "base" hash functions!

18 Example
k = 4, w = #base functions = 4, L = (w choose 2) = (4 choose 2) = 6.
Base functions u₁, u₂, u₃, u₄: each a projection on k/2 = 2 random coordinates.
g₁ = ⟨u₁, u₂⟩, g₂ = ⟨u₁, u₃⟩, g₃ = ⟨u₁, u₄⟩, …
(Figure: the coordinate masks of u₁..u₄ and how the pairs concatenate into g₁..g₆.)
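A sketch of this pairing (names and the exact sampling are my assumptions; the paper's construction may sample the base functions differently): draw w ≈ √(2L) base projections of k/2 coordinates each, and form each g as the concatenation of one pair, so only w random draws back all L tables.

```python
import math
import random
from itertools import combinations

def make_paired_lsh(m, k, L, seed=0):
    """Build L projection functions as all pairs of ~sqrt(2L) 'base'
    projections on k/2 coordinates each (the slide-17 scheme)."""
    rng = random.Random(seed)
    # smallest w with C(w, 2) >= L
    w = math.ceil((1 + math.sqrt(1 + 8 * L)) / 2)
    bases = [sorted(rng.sample(range(m), k // 2)) for _ in range(w)]
    gs = [b1 + b2 for b1, b2 in combinations(bases, 2)][:L]
    return bases, gs

def apply_g(point, coords):
    return "".join(point[j] for j in coords)

bases, gs = make_paired_lsh(m=8, k=4, L=6)
print(len(bases), "base functions back", len(gs), "tables")
print(apply_g("AGTAACTC", gs[0]))
```

Each g_i is still a projection on k coordinates, so the collision analysis per table is as before; only the independence across tables is reduced, which is what the paper's analysis has to handle.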

19 Saving time
We can save time since there are fewer "base" hash functions.
E.g., computing fingerprints: want FP(g_i(P)) for i = 1..L, where
  FP(g_i(P)) = (Σ_j P[j] · χ_j^i · 2^j) mod prime
(χ_j^i = indicator that coordinate j is selected by g_i).
Old way: would take L · m time for the L functions g.
New way: takes L^{1/2} · m time for the L^{1/2} functions u_i; then only L time to combine the FP(u(P)) into the FP(g(P)):
  if g = ⟨u₁, u₂⟩, then FP(g(P)) = (FP(u₁(P)) + FP(u₂(P))) mod prime.
Total: L + L^{1/2} · m.
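A sketch of the additive combination (the modulus and the character encoding are illustrative assumptions): because each base function selects its own coordinate set, the fingerprint of a concatenated g = ⟨u₁, u₂⟩ is just the sum of the two base fingerprints mod p.

```python
PRIME = (1 << 61) - 1  # an illustrative Mersenne prime modulus

def base_fp(pattern, coords):
    """FP(u(P)) = sum over selected coordinates j of P[j] * 2^j (mod PRIME)."""
    return sum(ord(pattern[j]) * pow(2, j, PRIME) for j in coords) % PRIME

def combined_fp(pattern, base_coord_sets):
    """FP(g(P)) for g = <u1, u2>: add the base fingerprints
    (disjoint coordinate sets, so the two sums simply merge)."""
    return sum(base_fp(pattern, c) for c in base_coord_sets) % PRIME

P = "AGTAACTC"
u1, u2 = [0, 3], [5, 6]                     # two disjoint base coordinate sets
direct = base_fp(P, u1 + u2)                # fingerprint of g = <u1,u2> directly
print(direct == combined_fp(P, [u1, u2]))   # True: FP(g) = FP(u1)+FP(u2) mod p
```

So the m-dependent work (one pass over P per base function) is paid only L^{1/2} times, and each of the L combinations costs O(1).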

20 Better query & preproc (2)
E.g., for the query: use fingerprints to leap faster in the trie. Yields time n^{1/c} + n^{1/(2c)} · m (since L = n^{1/c}).
To get n^{1/c} + n^{o(1)} · m, generalize:
  g = tuple of t base functions
  a base function = projection on k/t random coordinates
Other details are similar to the fingerprints.

21 Better preprocessing (3)
For preprocessing, so far we can get n^{1+1/c} + n^{1+o(1)} · M.
Can get n^{1+1/c} + n^{1+o(1)} · M^{1/3}: can construct a trie in n · M^{1/3} time (instead of n · M), using FFT, etc.

22 Outline
General approach
  View: Near Neighbor problem in the Hamming metric
  Focus: reducing space
Background
  Locality-Sensitive Hashing (LSH)
Solution = LSH + tries
Reducing query & preprocessing
  Redesign LSH
Concluding remarks

23 Conclusions
Problem: Substring Near Neighbor (a.k.a. text indexing with mismatches).
Approach: view as NN in m-dimensional Hamming space; use LSH.
Challenge: variable-length patterns without degradation in performance.
Solution: space/query optimal (w.r.t. LSH); preprocessing optimal (w.r.t. LSH) for c < 3.

24 Extensions
Extends to ℓ₁. Nontrivial, since it needs quite different LSH functions.
Preprocessing is slightly worse: n^{1+1/c} + n^{1+o(1)} · M^{2/3}, using the "less-than matching" problem [Amir-Farach'95].

25 Remarks
Other approaches? Or, why LSH for SNN?
Because a better SNN ⇒ a better NN…
…and LSH is the "best" known algorithm for high-dimensional NN (using reasonable space).

26 Thanks!