Lecture 10: Sketching S3: Nearest Neighbor Search

Slides:

Advertisements

Similar presentations

Optimal Space Lower Bounds for All Frequency Moments David Woodruff MIT

Advertisements

Numerical Linear Algebra in the Streaming Model Ken Clarkson - IBM David Woodruff - IBM.

Optimal Bounds for Johnson- Lindenstrauss Transforms and Streaming Problems with Sub- Constant Error T.S. Jayram David Woodruff IBM Almaden.

Efficient Private Approximation Protocols Piotr Indyk David Woodruff Work in progress.

Sublinear-time Algorithms for Machine Learning Ken Clarkson Elad Hazan David Woodruff IBM Almaden Technion IBM Almaden.

Subspace Embeddings for the L1 norm with Applications Christian Sohler David Woodruff TU Dortmund IBM Almaden.

A Nonlinear Approach to Dimension Reduction Robert Krauthgamer Weizmann Institute of Science Joint work with Lee-Ad Gottlieb TexPoint fonts used in EMF.

Approximate List- Decoding and Hardness Amplification Valentine Kabanets (SFU) joint work with Russell Impagliazzo and Ragesh Jaiswal (UCSD)

1 Approximating Edit Distance in Near-Linear Time Alexandr Andoni (MIT) Joint work with Krzysztof Onak (MIT)

Nearest Neighbor Search in High Dimensions Seminar in Algorithms and Geometry Mica Arie-Nachimson and Daniel Glasner April 2009.

Overcoming the L 1 Non- Embeddability Barrier Robert Krauthgamer (Weizmann Institute) Joint work with Alexandr Andoni and Piotr Indyk (MIT)

Fast Algorithms For Hierarchical Range Histogram Constructions

Multimedia DBs. Multimedia dbs A multimedia database stores text, strings and images Similarity queries (content based retrieval) Given an image find.

Dimensionality Reduction

Data-Powered Algorithms Bernard Chazelle Princeton University Bernard Chazelle Princeton University.

Approximate Nearest Neighbors and the Fast Johnson-Lindenstrauss Transform Nir Ailon, Bernard Chazelle (Princeton University)

Oded Regev Tel-Aviv University On Lattices, Learning with Errors, Learning with Errors, Random Linear Codes, Random Linear Codes, and Cryptography and.

Computing Sketches of Matrices Efficiently & (Privacy Preserving) Data Mining Petros Drineas Rensselaer Polytechnic Institute (joint.

Sketching and Embedding are Equivalent for Norms Alexandr Andoni (Simons Inst. / Columbia) Robert Krauthgamer (Weizmann Inst.) Ilya Razenshteyn (MIT, now.

Approximate Nearest Subspace Search with applications to pattern recognition Ronen Basri Tal Hassner Lihi Zelnik-Manor Weizmann Institute Caltech.

Embedding and Sketching Alexandr Andoni (MSR). Definition by example  Problem: Compute the diameter of a set S, of size n, living in d-dimensional ℓ.

How Robust are Linear Sketches to Adaptive Inputs? Moritz Hardt, David P. Woodruff IBM Research Almaden.

Geometric Problems in High Dimensions: Sketching Piotr Indyk.

Streaming Algorithms Piotr Indyk MIT. Data Streams A data stream is a sequence of data that is too large to be stored in available memory Examples: –Network.

Approximate schemas Michel de Rougemont, LRI, University Paris II.

NEAREST NEIGHBORS ALGORITHM Lecturer: Yishay Mansour Presentation: Adi Haviv and Guy Lev 1.

Conjunctive Filter: Breaking the Entropy Barrier Daisuke Okanohara *1, *2 Yuichi Yoshida *1*3 *1 Preferred Infrastructure Inc. *2 Dept. of Computer Science,

Geometric Problems in High Dimensions: Sketching Piotr Indyk.

Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality Piotr Indyk, Rajeev Motwani The 30 th annual ACM symposium on theory of computing.

Data Stream Algorithms Lower Bounds Graham Cormode

Summer School on Hashing’14 Dimension Reduction Alex Andoni (Microsoft Research)

Big Data Lecture 5: Estimating the second moment, dimension reduction, applications.

Computational Geometry

An Optimal Algorithm for Finding Heavy Hitters

New Characterizations in Turnstile Streams with Applications

Lecture 22: Linearity Testing Sparse Fourier Transform

Fast Dimension Reduction MMDS 2008

Lecture 18: Uniformity Testing Monotonicity Testing

Sublinear Algorithmic Tools 3

Orthogonal Range Searching and Kd-Trees

Lecture 11: Nearest Neighbor Search

Sublinear Algorithmic Tools 2

LSI, SVD and Data Management

COMS E F15 Lecture 2: Median trick + Chernoff, Distinct Count, Impossibility Results Left to the title, a presenter can insert his/her own image.

K Nearest Neighbor Classification

Sketching and Embedding are Equivalent for Norms

Lecture 4: CountSketch High Frequencies

Randomized Algorithms CS648

Lecture 7: Dynamic sampling Dimension Reduction

Lecture 16: Earth-Mover Distance

CIS 700: “algorithms for Big Data”

Linear sketching with parities

CSCI B609: “Foundations of Data Science”

Y. Kotidis, S. Muthukrishnan,

Near-Optimal (Euclidean) Metric Compression

Yair Bartal Lee-Ad Gottlieb Hebrew U. Ariel University

Locality Sensitive Hashing

Linear sketching over

CSCI B609: “Foundations of Data Science”

cse 521: design and analysis of algorithms

Linear sketching with parities

Indistinguishability by adaptive procedures with advice, and lower bounds on hardness amplification proofs Aryeh Grinberg, U. Haifa Ronen.

Dimension versus Distortion a.k.a. Euclidean Dimension Reduction

CS5112: Algorithms and Data Structures for Applications

Lecture 6: Counting triangles Dynamic graphs & sampling

Lecture 15: Least Square Regression Metric Embeddings

CMPS 3130/6130 Computational Geometry Spring 2017

President’s Day Lecture: Advanced Nearest Neighbor Search

Ronen Basri Tal Hassner Lihi Zelnik-Manor Weizmann Institute Caltech

Sublinear Algorihms for Big Data

Presentation transcript:

Lecture 10: Sketching S3: Nearest Neighbor Search Left to the title, a presenter can insert his/her own image pertinent to the presentation.

Plan PS2 due yesterday, 7pm Sketching Nearest Neighbor Search Scriber?

Sketching 𝑆: ℜ 𝑑 → short bit-strings ℓ 2 , ℓ 1 norm: 𝑂( 𝜖 −2 ) words given 𝑆(𝑥) and 𝑆(𝑦), should be able to estimate some function of 𝑥 and 𝑦 With 90% success probability (𝛿=0.1) ℓ 2 , ℓ 1 norm: 𝑂( 𝜖 −2 ) words Decision version: given 𝑟 in advance… Lemma: ℓ 2 , ℓ 1 norm: 𝑂( 𝜖 −2 ) bits 𝑦 𝑥 𝑆 𝑆 010110 010101 Distinguish between ||𝑥−𝑦||≤𝑟 ||𝑥−𝑦||> 1+𝜖 𝑟 Estimate ||𝑥−𝑦| | 2

Sketching: decision version Consider Hamming space: 𝑥,𝑦∈ 0,1 𝑑 Lemma: for any 𝑟>0, can achieve 𝑂(1/ 𝜖 2 ) -bit sketch. [on blackboard]

Conclusion Dimension reduction: In ℓ 1 : no dimension reduction [Johnson-Lindenstrauss’84]: a random linear projection into 𝑘 dimensions preserves ||𝑥−𝑦| | 2 , up to 1+𝜖 approximation with probability ≥1− 𝑒 −Ω 𝜖 2 𝑘 Random linear projection: Can be 𝐺𝑥 where 𝐺 is Gaussian, or ±1 entry-wise Hence: preserves distance between 𝑛 points as long as 𝑘=Θ 1 𝜖 2 log 𝑛 Can do faster than 𝑂(𝑑𝑘) time Using Fast Fourier Transform In ℓ 1 : no dimension reduction But can do sketching Using 𝑝-stable distributions (Cauchy for 𝑝=1) Sketching: decision version, constant 𝛿=0.1: For ℓ 1 , ℓ 2 , can do with 𝑂 1 𝜖 2 bits!

Section 3: Nearest Neighbor Search

Approximate NNS 𝑐-approximate 𝑟-near neighbor: given a query point 𝑞 assuming there is a point 𝑝 ∗ within distance 𝑟, report a point 𝑝 ′ ∈𝐷 s.t. 𝑝′−𝑞 𝑝′−𝑞 ≤𝑐𝑟 𝑟 𝑝 ∗ 𝑐𝑟 𝑞 𝑝 ′

NNS: approach 1 Boosted sketch: Let 𝑆 = sketch for the decision version (90% success probability) new sketch 𝑊 : keep 𝑘=𝑂( log 𝑛 ) copies of 𝑆 estimator is the majority answer of the 𝑘 estimators Sketch size: 𝑂( 𝜖 −2 log 𝑛 ) bits Success probability: 1− 𝑛 −2 (Chernoff) Preprocess: compute sketches 𝑊(𝑝) for all the points 𝑝∈𝐷 Query: compute sketch 𝑊(𝑞), and compute distance to all points using sketch Time: improved from 𝑂(𝑛𝑑) to 𝑂(𝑛 𝜖 −2 log 𝑛 )

NNS: approach 2 Query time below 𝑛 ? Theorem [KOR98]: 𝑂(𝑑 𝜖 −2 log 𝑛) query time and 𝑛 𝑂 1/ 𝜖 2 space for 1+𝜖 approximation. Proof: Note that 𝑊(𝑞) has 𝑤=𝑂 𝜖 −2 log 𝑛 bits Only 2 𝑤 possible sketches! Store an answer for each of 2 𝑤 = 𝑛 𝑂 𝜖 −2 possible inputs In general: if a distance has constant-size sketch, admits a poly-space NNS data structure! Space closer to linear? approach 3 will require more specialized sketches…