Lecture 10: Sketching S3: Nearest Neighbor Search

Plan
- PS2 due yesterday, 7pm
- Sketching
- Nearest Neighbor Search
- Scribe?

Sketching
- S: ℝ^d → short bit-strings
- Given S(x) and S(y), should be able to estimate some function of x and y, e.g. estimate ‖x − y‖₂, with 90% success probability (δ = 0.1); see the sketch below
- For the ℓ₂ and ℓ₁ norms: O(ε⁻²) words
- Decision version: given r in advance, distinguish between ‖x − y‖ ≤ r and ‖x − y‖ > (1 + ε)r
- Lemma: for the ℓ₂ and ℓ₁ norms, O(ε⁻²) bits suffice
[Figure: x and y each mapped by S to a short bit-string, e.g. 010110 and 010101]
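To make the estimation version concrete, here is a minimal numpy rendering of the ℓ₂ case (not the lecture's code; the constant in k and all names are illustrative): S is a random Gaussian linear map into k = O(ε⁻²) dimensions, i.e. O(ε⁻²) words, and ‖S(x) − S(y)‖₂ estimates ‖x − y‖₂.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_l2_sketch(d, eps, delta=0.1):
    """Random Gaussian linear map S: R^d -> R^k with k = O(1/eps^2) rows.
    For a fixed pair x, y, ||S(x) - S(y)||_2 is within a (1 +/- eps) factor
    of ||x - y||_2 with probability >= 1 - delta."""
    k = int(np.ceil(8 * np.log(2 / delta) / eps ** 2))  # illustrative constant
    G = rng.normal(size=(k, d)) / np.sqrt(k)
    return lambda x: G @ x  # linear, so S(x) - S(y) = S(x - y)

d, eps = 10_000, 0.1
S = make_l2_sketch(d, eps)
x, y = rng.normal(size=d), rng.normal(size=d)
print(np.linalg.norm(x - y), np.linalg.norm(S(x) - S(y)))  # close w.h.p.
```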

Sketching: decision version
- Consider Hamming space: x, y ∈ {0,1}^d
- Lemma: for any r > 0, can achieve an O(1/ε²)-bit sketch. [on blackboard; one standard instantiation is sketched below]
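The proof is on the blackboard; the code below shows one standard way such a sketch can be instantiated (my assumption, not necessarily the lecture's exact construction): each bit is a parity of the input over a random subset sampled at rate 1/(2r), and the fraction of disagreeing bits is thresholded. All constants and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def hamming_sketch(r, d, eps, delta=0.1):
    """O(1/eps^2)-bit decision sketch for Hamming space: distinguish
    Hamming(x, y) <= r from Hamming(x, y) > (1 + eps) * r.
    Each bit is a parity of x over a random subset in which every
    coordinate is kept independently with probability 1/(2r)."""
    k = int(np.ceil(32 * np.log(2 / delta) / eps ** 2))  # illustrative constant
    masks = (rng.random((k, d)) < 1.0 / (2 * r)).astype(int)

    def S(x):
        return (masks @ x) % 2  # k parity bits

    # Corresponding bits of S(x), S(y) disagree with probability
    # q(h) = (1 - (1 - 1/r)^h) / 2, strictly increasing in h = Hamming(x, y),
    # so threshold halfway between q(r) and q((1 + eps) * r).
    def q(h):
        return (1 - (1 - 1 / r) ** h) / 2
    thresh = (q(r) + q((1 + eps) * r)) / 2

    def near(sx, sy):
        return np.mean(sx != sy) < thresh  # True => the "distance <= r" side

    return S, near

d, r, eps = 1_000, 50, 0.5
S, near = hamming_sketch(r, d, eps)
x = rng.integers(0, 2, d)
y = x.copy(); y[rng.choice(d, r, replace=False)] ^= 1      # distance exactly r
z = x.copy(); z[rng.choice(d, 2 * r, replace=False)] ^= 1  # distance 2r > (1+eps)r
print(near(S(x), S(y)), near(S(x), S(z)))  # True, False w.h.p.
```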

Conclusion
- Dimension reduction in ℓ₂:
  - [Johnson–Lindenstrauss '84]: a random linear projection into k dimensions preserves ‖x − y‖₂ up to a 1 + ε approximation with probability ≥ 1 − e^{−Ω(ε²k)}
  - The random linear projection can be Gx where G is Gaussian or ±1 entry-wise
  - Hence: preserves the distances among n points as long as k = Θ(ε⁻² log n)
  - Can do faster than O(dk) time, using the Fast Fourier Transform
- In ℓ₁: no dimension reduction, but can still do sketching, using p-stable distributions (Cauchy for p = 1); see the sketch below
- Sketching, decision version, constant δ = 0.1: for ℓ₁ and ℓ₂, can do with O(1/ε²) bits!
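For the ℓ₁ bullet, a minimal rendering of the p-stable idea the slide alludes to (Indyk's Cauchy sketch), with illustrative constants: if C has i.i.d. standard Cauchy entries, each coordinate of C(x − y) is distributed as ‖x − y‖₁ times a standard Cauchy, and the median of |standard Cauchy| is 1, so the median of |Cx − Cy| estimates ‖x − y‖₁.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_l1_sketch(d, eps, delta=0.1):
    """l1 sketch via 1-stable (Cauchy) projections: no dimension reduction
    in the JL sense, but still O(1/eps^2) words to estimate ||x - y||_1."""
    k = int(np.ceil(8 * np.log(2 / delta) / eps ** 2))  # illustrative constant
    C = rng.standard_cauchy(size=(k, d))
    return lambda x: C @ x

def estimate_l1(sx, sy):
    return np.median(np.abs(sx - sy))  # median of |Cauchy| is 1

d, eps = 5_000, 0.2
S = make_l1_sketch(d, eps)
x, y = rng.normal(size=d), rng.normal(size=d)
print(np.sum(np.abs(x - y)), estimate_l1(S(x), S(y)))  # close w.h.p.
```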

Section 3: Nearest Neighbor Search

Approximate NNS
- c-approximate r-near neighbor: given a query point q, and assuming there is a point p* ∈ D within distance r of q, report a point p′ ∈ D s.t. ‖p′ − q‖ ≤ cr
- A brute-force baseline is sketched below
[Figure: query q, with p* inside the radius-r ball around q and the reported p′ inside the radius-cr ball]
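For reference, a short sketch of the trivial exact algorithm that the next two slides improve on (names illustrative): a linear scan takes O(nd) time per query and returns a point within distance r whenever one exists, hence a fortiori a valid c-approximate answer.

```python
import numpy as np

def near_neighbor_scan(D, q, r):
    """Exact O(nd) baseline. D is an n x d array of points; return some
    point within distance r of q (it then satisfies ||p' - q|| <= c*r
    for every c >= 1), or None if no such point exists."""
    dists = np.linalg.norm(D - q, axis=1)
    i = int(np.argmin(dists))
    return D[i] if dists[i] <= r else None
```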

NNS: approach 1
- Boosted sketch: let S = the sketch for the decision version (90% success probability)
- New sketch W: keep k = O(log n) copies of S; the estimator is the majority answer of the k estimators (see the sketch below)
  - Sketch size: O(ε⁻² log n) bits
  - Success probability: 1 − n⁻² (Chernoff)
- Preprocess: compute the sketches W(p) for all points p ∈ D
- Query: compute the sketch W(q), and estimate the distance to all points using the sketches
- Time: improved from O(nd) to O(n ε⁻² log n)
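A minimal rendering of approach 1, assuming some base decision sketch exposed as a pair (S, near), e.g. the hypothetical hamming_sketch above (all names illustrative): majority-vote over k = O(log n) independent copies, then scan stored sketches instead of raw points.

```python
def boosted_sketch(copies):
    """Majority-boosting: `copies` is a list of k = O(log n) independent
    (S, near) pairs. By a Chernoff bound, the majority answer is wrong
    with probability <= 1/n^2 for a suitable k."""
    def W(x):
        return [S(x) for S, _ in copies]

    def near(wx, wy):
        votes = sum(nr(sx, sy) for (_, nr), sx, sy in zip(copies, wx, wy))
        return votes > len(copies) / 2

    return W, near

def preprocess(points, W):
    return [W(p) for p in points]  # store W(p) for every p in D

def query(points, index, q, W, near):
    wq = W(q)  # O(eps^-2 log n) bits
    for p, wp in zip(points, index):  # O(n eps^-2 log n), instead of O(nd)
        if near(wp, wq):
            return p
    return None
```

With the earlier hamming_sketch, copies = [hamming_sketch(r, d, eps) for _ in range(k)] with k a suitable constant times log n would give the stated 1 − n⁻² success probability, up to constants.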

NNS: approach 2
- Can the query time go below n?
- Theorem [KOR98]: O(d ε⁻² log n) query time and n^{O(1/ε²)} space, for 1 + ε approximation
- Proof (rendered in code below):
  - W(q) has only w = O(ε⁻² log n) bits
  - So there are only 2^w possible sketches!
  - Store an answer for each of the 2^w = n^{O(ε⁻²)} possible inputs
- In general: if a distance admits a constant-size sketch, it admits a poly-space NNS data structure!
- Space closer to linear? Approach 3 will require more specialized sketches…
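The proof really is just a lookup table, and can be rendered directly (feasible only for toy w, which is exactly the theorem's point: 2^w = n^{O(ε⁻²)} space buys sketch-time queries). This assumes a flat w-bit sketch W with decision test near, as in the hypothetical hamming_sketch above; all names are illustrative.

```python
from itertools import product

import numpy as np

def build_table(points, W, near, w):
    """KOR98-style structure: W(q) has only w bits, so precompute an answer
    for each of the 2^w possible query sketches. Space 2^w = n^{O(1/eps^2)}."""
    index = [W(p) for p in points]
    table = {}
    for bits in product((0, 1), repeat=w):
        wq = np.array(bits)
        table[bits] = next(
            (p for p, wp in zip(points, index) if near(wp, wq)), None)
    return table

def fast_query(table, W, q):
    return table[tuple(int(b) for b in W(q))]  # O(w) work after sketching q
```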